K-means clustering in Neo4j

The last few days I’ve been playing with Neo4j. I’ve always been intrigued by graphs and have wanted to learn more about graph databases for a while, and finally I found the time for it.

One challenge I usually have when testing out a new technology is finding a suitable use case: simple enough, but one that still illustrates a fair share of the technology’s features and functionality. In this case I ended up with a simple toy example showing how the k-means clustering algorithm, commonly used in machine learning, can partition a data set into k disjoint clusters.

The k-means clustering algorithm

The algorithm itself is fairly simple. Assume that we have a data set of observations, each with data for d features, so each sample can be considered a real d-dimensional vector [x_1, \ldots, x_d]. This means that we can calculate the squared Euclidean distance of two vectors \textbf{x} = [x_1, \ldots, x_d] and \textbf{y} = [y_1, \ldots, y_d] as

    \[ \|\textbf{x} - \textbf{y}\|^2 = (x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_d - y_d)^2. \]
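
For example, with d = 3, \textbf{x} = [1, 2, 3] and \textbf{y} = [2, 4, 6], this gives

    \[ \|\textbf{x} - \textbf{y}\|^2 = (1 - 2)^2 + (2 - 4)^2 + (3 - 6)^2 = 1 + 4 + 9 = 14. \]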

If we have m vectors \textbf{x}^1, \textbf{x}^2, \ldots, \textbf{x}^m, we can calculate their mean or average as

    \[ \frac{1}{m}\sum_{j = 1}^{m}{\textbf{x}^j} = \frac{1}{m}[\sum_{j = 1}^{m}{x_1^j}, \ldots, \sum_{j = 1}^{m}{x_d^j}]. \]
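
For instance, for the two vectors [1, 2] and [3, 4] this gives

    \[ \frac{1}{2}([1, 2] + [3, 4]) = [2, 3]. \]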

To use the algorithm, we first have to decide on the number of clusters k, and then initialize what are called centroids, one for each cluster, \{ c_1, \ldots, c_k\}. The centroids will be the centres of the clusters. One way of initializing the centroids is to randomly pick k samples from the data set. Once we have the initial k centroids, the following two steps are repeated.

Cluster assignment step

In this step each sample s is assigned to the cluster whose centroid has the smallest squared Euclidean distance to the sample, i.e. the centroid closest to s. Hence, s is assigned to one of the centroids in the following set. (A sample might be exactly equidistant from two or more centroids, but usually this set consists of only one centroid.)

    \[ \{c_i:  \|c_i - s\|^2 =  \min_{j}{\|c_j - s\|^2}\} \]

Centroid update step

When all the samples have been assigned to a centroid, this step calculates a new centroid for each of the k clusters by taking the mean of all its assigned samples. So for assigned samples s^1, \ldots, s^r, the vector for the new centroid \textbf{c}_i will be

    \[ \textbf{c}_i = \frac{1}{r}\sum_{j = 1}^{r}{\textbf{s}^j} \]

These two steps are repeated until the algorithm converges, i.e. until the assignment step no longer changes any assignments. The algorithm may, however, converge to a local optimum; there is no guarantee that the global optimum is found. A common approach is therefore to run the algorithm several times with different centroid initializations and choose the best result.
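
To make “optimum” a bit more precise: k-means can be viewed as a local search for a clustering that minimizes the within-cluster sum of squared distances

    \[ J = \sum_{i = 1}^{k}{\sum_{\textbf{s} \in C_i}{\|\textbf{s} - \textbf{c}_i\|^2}}, \]

where C_i is the set of samples assigned to centroid c_i. Neither of the two steps can increase J, which is why the algorithm converges, but only to a local minimum of J.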

The data set

In order to do the clustering I needed a data set, and I ended up with the seeds data set from the UCI Machine Learning Repository. The set consists of 210 samples of kernels from three types of wheat, 70 samples of each type. Each sample has seven attributes: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove, and hence each sample can be considered a 7-dimensional vector. The set is labelled, so we know which samples belong to the same type of wheat. The k-means algorithm is often used with unlabelled sets and doesn’t use the labels, but I’ve included them in the example so we can easily see how the clustering compares to the original labelling.
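
If I remember the file format correctly, each row of the raw data set is just the seven attribute values followed by the label, whitespace-separated; the first row corresponds to the first seed created below:

    15.26  14.84  0.871  5.763  3.312  2.221  5.22  1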

Implementation in Neo4j

I have two kinds of nodes in my model, seeds and centroids. Both contain the seven attributes from the data set, and seeds also carry their original label. Further, centroids have an index giving the number of the corresponding cluster, and an attribute “iteration”, holding the iteration the centroid is used in. (Another approach would be to remove the old centroids after calculating the new ones, in which case there would be no need to keep track of iterations, but since we want to see the change, we keep all the centroids.) The initial centroids are chosen very non-randomly: I’ve simply picked the data from one seed of each label, so the clustering will be good already after the first assignment step.

The creation of the nodes is pretty straightforward. The attributes of a node are specified in a JSON-like format, so I started off by reading the data set file in Java, creating Seed objects, and then used Gson to convert the objects to JSON strings. Before the colon and the type name one can add an identifier if one wants to refer to the same node later in the script, but we don’t need that in our case.

CREATE (:Seed {area:15.26,perimeter:14.84,compactness:0.871,kernelLength:5.763,kernelWidth:3.312,asymmetryCoefficient:2.221,kernelGrooveLength:5.22,label:1})
CREATE (:Centroid {area:13.84,perimeter:13.94,compactness:0.8955,kernelLength:5.324,kernelWidth:3.379,asymmetryCoefficient:2.259,kernelGrooveLength:4.805, index:1, iteration:1})

The complete script for creating all the nodes is in this file.
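
As a quick sanity check (a small query of my own, not part of the script), one can verify the counts after running it; we expect 210 seeds and 3 initial centroids:

// Count the created nodes
MATCH (s:Seed) WITH count(s) AS seeds
MATCH (c:Centroid) RETURN seeds, count(c) AS centroids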

Cluster assignment

The code below shows how I assign the seeds to the right centroid. It feels a bit clumsy, but it was the best I could come up with in pure Neo4j. If this were part of an application, one would probably let the application code calculate the distances and find the minimum. And as far as I can tell, Neo4j doesn’t support user-defined functions, so I cannot create something like a distance function instead of duplicating the distance calculation three times.
The MATCH statement does pattern matching; one can specify paths, nodes and relationships, with or without types (which I think are called labels in the graph database world) on the nodes or relationships, and with identifiers to refer to the elements later in the statement. In our case we need all the seeds, and for each seed we calculate the distances to each of the three centroids and then find the closest one. The MATCH clause finds 210 rows, one for each seed, containing the seed and the three centroids.
The three SET statements add new attributes distC1, distC2 and distC3 to each seed, containing the squared distance from the seed to each centroid. The following WITH clause binds variables from the matching for use later in the statement: we keep each seed s, and for each seed the closest centroid, kept as minC, and finally we create an IN_CLUSTER relationship from the seed s to the centroid minC. After the CREATE one could have tidied up the seeds and deleted the three distance attributes.

// Squared distance from each seed to each of the three centroids
MATCH (s:Seed), (c1:Centroid {index: 1, iteration: 1}), (c2:Centroid {index: 2, iteration: 1}), (c3:Centroid {index: 3, iteration: 1})
SET s.distC1 = (s.area - c1.area)^2 + (s.perimeter - c1.perimeter)^2 + (s.compactness - c1.compactness)^2 + (s.kernelLength - c1.kernelLength)^2 + (s.kernelWidth - c1.kernelWidth)^2 + (s.asymmetryCoefficient - c1.asymmetryCoefficient)^2 + (s.kernelGrooveLength - c1.kernelGrooveLength)^2
SET s.distC2 = (s.area - c2.area)^2 + (s.perimeter - c2.perimeter)^2 + (s.compactness - c2.compactness)^2 + (s.kernelLength - c2.kernelLength)^2 + (s.kernelWidth - c2.kernelWidth)^2 + (s.asymmetryCoefficient - c2.asymmetryCoefficient)^2 + (s.kernelGrooveLength - c2.kernelGrooveLength)^2
SET s.distC3 = (s.area - c3.area)^2 + (s.perimeter - c3.perimeter)^2 + (s.compactness - c3.compactness)^2 + (s.kernelLength - c3.kernelLength)^2 + (s.kernelWidth - c3.kernelWidth)^2 + (s.asymmetryCoefficient - c3.asymmetryCoefficient)^2 + (s.kernelGrooveLength - c3.kernelGrooveLength)^2
// Keep each seed together with its closest centroid, and link them
WITH s,
CASE
  WHEN s.distC1 <= s.distC2 AND s.distC1 <= s.distC3 THEN c1
  WHEN s.distC2 <= s.distC1 AND s.distC2 <= s.distC3 THEN c2
  ELSE c3
END AS minC
CREATE (s)-[:IN_CLUSTER]->(minC)
RETURN *
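
For the tidy-up mentioned above, a statement along these lines (my addition, not part of the original script) would delete the temporary distance attributes again:

// Remove the temporary distance attributes from all seeds
MATCH (s:Seed)
REMOVE s.distC1, s.distC2, s.distC3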

The picture shows the three clusters after the seeds have been assigned, and the table shows the statistics. Each cluster should have 70 seeds if the algorithm gets everything right. The first cluster has fewer seeds than that, but a high rate of correctly assigned seeds, compared to the third cluster where only 80% of the seeds are correct.
[Figure: the three clusters after the first assignment step]

Cluster no   Total assigned   Correctly assigned   Percentage correct
1            50               44                   88%
2            80               69                   86.25%
3            80               64                   80%
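
The statistics can be computed with a small aggregate query (a sketch of my own; it relies on the initial centroids being picked from one seed of each label, so that cluster i should contain the seeds with label i):

// Per-cluster totals and number of correctly assigned seeds
MATCH (s:Seed)-[:IN_CLUSTER]->(c:Centroid {iteration: 1})
RETURN c.index AS cluster, count(s) AS total, sum(CASE WHEN s.label = c.index THEN 1 ELSE 0 END) AS correct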

Centroid update

The next step is to find the new centroids for the next assignment step by calculating the average values of the seven attributes over the seeds assigned to each cluster. I found the useful avg function, which calculates the average of an attribute value over all nodes bound to the same identifier. But here it is important to think through what you include in a single match. If the match statement were MATCH (s1:Seed)-[:IN_CLUSTER]->(c1:Centroid {index: 1, iteration: 1}), (s2:Seed)-[:IN_CLUSTER]->(c2:Centroid {index: 2, iteration: 1}), (s3:Seed)-[:IN_CLUSTER]->(c3:Centroid {index: 3, iteration: 1}), we would get the Cartesian product over s1, s2 and s3, and the number of rows returned would be 50 * 80 * 80 = 320 000. This would still give the right numbers with the average function, since taking the average over multiple copies of a value gives back the original value, but for other aggregate functions, like sum, one would of course get wrong values.
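
To see the blow-up concretely, one can count the rows of that combined match (a quick check of my own, not part of the clustering itself):

MATCH (s1:Seed)-[:IN_CLUSTER]->(c1:Centroid {index: 1, iteration: 1}),
      (s2:Seed)-[:IN_CLUSTER]->(c2:Centroid {index: 2, iteration: 1}),
      (s3:Seed)-[:IN_CLUSTER]->(c3:Centroid {index: 3, iteration: 1})
RETURN count(*) // 50 * 80 * 80 = 320 000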

// Average the seven attributes per cluster, one MATCH/WITH pair at a time
// to avoid the Cartesian product discussed above
MATCH (s1:Seed)-[:IN_CLUSTER]->(c1:Centroid {index: 1, iteration: 1})
WITH avg(s1.area) AS s1Area, avg(s1.perimeter) AS s1Perimeter, avg(s1.compactness) AS s1Compactness, avg(s1.kernelLength) AS s1KernelLength, avg(s1.kernelWidth) AS s1KernelWidth, avg(s1.asymmetryCoefficient) AS s1AsymmetryCoefficient, avg(s1.kernelGrooveLength) AS s1KernelGrooveLength
MATCH (s2:Seed)-[:IN_CLUSTER]->(c2:Centroid {index: 2, iteration: 1})
WITH s1Area, s1Perimeter, s1Compactness, s1KernelLength, s1KernelWidth, s1AsymmetryCoefficient, s1KernelGrooveLength, avg(s2.area) AS s2Area, avg(s2.perimeter) AS s2Perimeter, avg(s2.compactness) AS s2Compactness, avg(s2.kernelLength) AS s2KernelLength, avg(s2.kernelWidth) AS s2KernelWidth, avg(s2.asymmetryCoefficient) AS s2AsymmetryCoefficient, avg(s2.kernelGrooveLength) AS s2KernelGrooveLength
MATCH (s3:Seed)-[:IN_CLUSTER]->(c3:Centroid {index: 3, iteration: 1})
WITH s1Area, s1Perimeter, s1Compactness, s1KernelLength, s1KernelWidth, s1AsymmetryCoefficient, s1KernelGrooveLength,
s2Area, s2Perimeter, s2Compactness, s2KernelLength, s2KernelWidth, s2AsymmetryCoefficient, s2KernelGrooveLength,
avg(s3.area) AS s3Area, avg(s3.perimeter) AS s3Perimeter, avg(s3.compactness) AS s3Compactness, avg(s3.kernelLength) AS s3KernelLength, avg(s3.kernelWidth) AS s3KernelWidth, avg(s3.asymmetryCoefficient) AS s3AsymmetryCoefficient, avg(s3.kernelGrooveLength) AS s3KernelGrooveLength
// Create the iteration-2 centroids from the averages
CREATE (:Centroid {area: s1Area, perimeter: s1Perimeter, compactness: s1Compactness, kernelLength: s1KernelLength, kernelWidth: s1KernelWidth, asymmetryCoefficient: s1AsymmetryCoefficient, kernelGrooveLength: s1KernelGrooveLength, index: 1, iteration: 2})
CREATE (:Centroid {area: s2Area, perimeter: s2Perimeter, compactness: s2Compactness, kernelLength: s2KernelLength, kernelWidth: s2KernelWidth, asymmetryCoefficient: s2AsymmetryCoefficient, kernelGrooveLength: s2KernelGrooveLength, index: 2, iteration: 2})
CREATE (:Centroid {area: s3Area, perimeter: s3Perimeter, compactness: s3Compactness, kernelLength: s3KernelLength, kernelWidth: s3KernelWidth, asymmetryCoefficient: s3AsymmetryCoefficient, kernelGrooveLength: s3KernelGrooveLength, index: 3, iteration: 2})
RETURN *

Next cluster assignment

The next cluster assignment is done with the same statement as before, only with the iteration number of the centroids increased to two. And now we can easily see which seeds have moved to a new cluster. The statistics show that clusters two and three have improved, whereas the first cluster now has more seeds, but its correctness has decreased.
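
Besides inspecting the graph visually, the moved seeds can be listed with a query like the following (my own sketch; it uses the fact that each seed now has one IN_CLUSTER relationship per iteration):

// Seeds whose closest centroid changed between iterations 1 and 2
MATCH (s:Seed)-[:IN_CLUSTER]->(c1:Centroid {iteration: 1}),
      (s)-[:IN_CLUSTER]->(c2:Centroid {iteration: 2})
WHERE c1.index <> c2.index
RETURN s, c1.index AS oldCluster, c2.index AS newCluster
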
[Figure: the three clusters after the second assignment step]

Cluster no   Total assigned   Correctly assigned   Percentage correct
1            66               55                   83.3%
2            73               66                   90.4%
3            71               63                   88.7%

Now it is up to you to continue the clustering if you’d like; all the code is in this gist. If you find errors in this post, have suggestions for improvements, or have other comments or questions, please let me know! 🙂

References

  1. http://neo4j.com/
  2. http://archive.ics.uci.edu/ml/datasets/seeds
  3. https://en.wikipedia.org/wiki/K-means_clustering