Data analysis and machine learning for 11-year-olds

My 11-year-old son recently started watching science and fact videos on YouTube, in addition to the usual gamer shows. (I hope it is not just because he has run out of gaming videos.)
A couple of weeks ago he was super excited when I came home from work: he had watched a video about the Titanic and could not stop talking about it. He had learned all about why it sank, all the different circumstances, and how the outcome could have been different if even some of the smaller things had been different.
Then it occurred to me that one of the standard data sets for getting started with data analysis and machine learning is the passenger list from the Titanic, with information about fare, passenger class, age, gender, whether the person had a spouse or siblings on board, and whether they had children or parents on board.
Since my son recently learned some elementary statistics in school, I thought it would be fun to show him how we can work with the Titanic data set. I have given some workshops on Apache Spark in Java before, so that is what we used. Spark has great documentation of its machine learning library here.
The Titanic data set is available, with some variations, from many places. I downloaded titanic_original.csv from this GitHub repository. (I tried the cleaned data set first, but persons with missing age are given the average age, and that would mess up what I wanted to do with the data, so I continued with the original data set.)

Load data from csv

The first thing we have to do is to start Spark and load the data. It is quite easy to read data with Spark. If you are lucky, the option inferSchema will figure out the correct types.

SparkConf conf = new SparkConf().setAppName("Titanic").setMaster("local[*]");
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

Dataset<Row> passengers = spark.read()
        .option("inferSchema", "true")
        .option("delimiter", ",")
        .option("header", true)
        .csv("titanic_original.csv");

When the data has been loaded into a Dataset, it is a good idea to check that it actually contains what it should, and that the columns have the right data types. We can do that with the methods passengers.show() and passengers.printSchema(). The first one prints the first twenty rows of the data set, and the latter prints the type of each column.

|pclass|survived|                name|   sex|   age|sibsp|parch|  ticket|    fare|  cabin|embarked|boat|body|           home.dest|
|     1|       1|Allen, Miss. Elis...|female|  29.0|    0|    0|   24160|211.3375|     B5|       S|   2|null|        St Louis, MO|
|     1|       1|Allison, Master. ...|  male|0.9167|    1|    2|  113781|  151.55|C22 C26|       S|  11|null|Montreal, PQ / Ch...|
|     1|       0|Allison, Miss. He...|female|   2.0|    1|    2|  113781|  151.55|C22 C26|       S|null|null|Montreal, PQ / Ch...|
|     1|       0|Allison, Mr. Huds...|  male|  30.0|    1|    2|  113781|  151.55|C22 C26|       S|null| 135|Montreal, PQ / Ch...|
|     1|       0|Allison, Mrs. Hud...|female|  25.0|    1|    2|  113781|  151.55|C22 C26|       S|null|null|Montreal, PQ / Ch...|
|     1|       1| Anderson, Mr. Harry|  male|  48.0|    0|    0|   19952|   26.55|    E12|       S|   3|null|        New York, NY|
|     1|       1|Andrews, Miss. Ko...|female|  63.0|    1|    0|   13502| 77.9583|     D7|       S|  10|null|          Hudson, NY|
|     1|       0|Andrews, Mr. Thom...|  male|  39.0|    0|    0|  112050|     0.0|    A36|       S|null|null|         Belfast, NI|
|     1|       1|Appleton, Mrs. Ed...|female|  53.0|    2|    0|   11769| 51.4792|   C101|       S|   D|null| Bayside, Queens, NY|
|     1|       0|Artagaveytia, Mr....|  male|  71.0|    0|    0|PC 17609| 49.5042|   null|       C|null|  22| Montevideo, Uruguay|
|     1|       0|Astor, Col. John ...|  male|  47.0|    1|    0|PC 17757| 227.525|C62 C64|       C|null| 124|        New York, NY|
|     1|       1|Astor, Mrs. John ...|female|  18.0|    1|    0|PC 17757| 227.525|C62 C64|       C|   4|null|        New York, NY|
|     1|       1|Aubart, Mme. Leon...|female|  24.0|    0|    0|PC 17477|    69.3|    B35|       C|   9|null|       Paris, France|
|     1|       1|"Barber, Miss. El...|female|  26.0|    0|    0|   19877|   78.85|   null|       S|   6|null|                null|
|     1|       1|Barkworth, Mr. Al...|  male|  80.0|    0|    0|   27042|    30.0|    A23|       S|   B|null|       Hessle, Yorks|
|     1|       0| Baumann, Mr. John D|  male|  null|    0|    0|PC 17318|  25.925|   null|       S|null|null|        New York, NY|
|     1|       0|Baxter, Mr. Quigg...|  male|  24.0|    0|    1|PC 17558|247.5208|B58 B60|       C|null|null|        Montreal, PQ|
|     1|       1|Baxter, Mrs. Jame...|female|  50.0|    0|    1|PC 17558|247.5208|B58 B60|       C|   6|null|        Montreal, PQ|
|     1|       1|Bazzani, Miss. Al...|female|  32.0|    0|    0|   11813| 76.2917|    D15|       C|   8|null|                null|
|     1|       0|Beattie, Mr. Thomson|  male|  36.0|    0|    0|   13050| 75.2417|     C6|       C|   A|null|        Winnipeg, MN|
 |-- pclass: integer (nullable = true)
 |-- survived: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: double (nullable = true)
 |-- sibsp: integer (nullable = true)
 |-- parch: integer (nullable = true)
 |-- ticket: string (nullable = true)
 |-- fare: double (nullable = true)
 |-- cabin: string (nullable = true)
 |-- embarked: string (nullable = true)
 |-- boat: string (nullable = true)
 |-- body: integer (nullable = true)
 |-- home.dest: string (nullable = true)

As mentioned, age is missing for some of the passengers, and the same holds for fare, so for our purpose we will simply remove the rows where age or fare is null, with a syntax quite similar to SQL.

passengers = passengers.where(col("age").isNotNull().and(col("fare").isNotNull()));

Most of the columns are self-explanatory, but pclass is the passenger class, sibsp is the number of siblings and spouses the passenger had on board, and similarly, parch is the number of parents and children.

Average, median and mode

Now that we’ve got the data we can finally start doing some statistics. The statistics taught in 5th grade basically cover average, median and mode of a set of numbers, and how to draw different types of charts. I asked my son what he believed the average age of the passengers would be, and he guessed 30 years. Let’s see how we can get the average, together with median and mode, from Spark.

There is a useful method describe on a dataset, which gives us various information about the columns we ask for. It turns out that 30 was a good guess for the average 🙂 .

|summary|               age|
|  count|              1045|
|   mean|29.851834162679427|
| stddev|14.389200976290613|
|    min|            0.1667|
|    max|              80.0|

To find the mode, the value that occurs the most times, we draw a plot with ages along the x-axis and the count of passengers with that age along the y-axis. The plot is made with the Java library XChart.

By looking at the chart it seems that 24 years is the mode; we can verify that by querying the data set and looking at the first row that is printed.


The median is the middle data point when the data is sorted. We can sort the data set on age, but a dataset does not have an index we can query for. The data set has 1045 entries, so the easiest thing would be to do .show(523) and look at the last row that is printed, or we can add an id column like this (according to the docs the id is not guaranteed to be consecutive, but it will be if the data is not partitioned, as in our case).

passengers = passengers.orderBy("age").withColumn("id", monotonically_increasing_id());

A more proper, but much less intuitive, way of finding the median is to use the approximate quantile function like this

passengers.stat().approxQuantile("age", new double[]{0.5}, 0);

No matter how we find the median, the result is 28.
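The underlying ideas are simple enough to sketch in plain Java on a small array. The snippet below is a hypothetical illustration of median and mode (not the Spark implementation, and the BasicStats class name and the sample ages are made up for the example):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BasicStats {

    // Median: the middle value of the sorted data (lower middle for even counts).
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted[(sorted.length - 1) / 2];
    }

    // Mode: the value that occurs the most times.
    static double mode(double[] values) {
        Map<Double, Integer> counts = new HashMap<>();
        double best = values[0];
        for (double v : values) {
            int c = counts.merge(v, 1, Integer::sum);
            if (c > counts.get(best)) best = v;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] ages = {22, 24, 24, 28, 30, 24, 35, 28, 40};
        System.out.println("median = " + median(ages)); // 28.0
        System.out.println("mode = " + mode(ages));     // 24.0
    }
}
```

This is of course exactly what sorting the dataset and looking at row 523 does, just without Spark in between.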

That was the basic statistics, let us move on to analyze who the survivors were.

Survivors by gender

Let us start by finding the number and fraction of survivors in total. It can be done with the following line

passengers.groupBy("survived").count().withColumn("fraction", col("count").divide(sum("count").over()));

The table below shows the depressing result that only 41% of the 1045 passengers in our data set survived.

|survived|count|fraction|
|       1|  427|    0.41|
|       0|  618|    0.59|

So what is the survival rate by gender?
I told my son that it was common to save children and women before men, and he was shocked: “What, is that true? That’s totally unfair!”
Well, I actually found the paper Gender, social norms, and survival in maritime disasters, where the authors have studied maritime disasters and to which extent the “social norm of women and children first” is followed. They conclude that “Women have a distinct survival disadvantage compared with men. Captains and crew survive at a significantly higher rate than passengers.”

If survival were independent of gender, we would expect around 41% of the women to survive, and likewise, 41% of the men. That is logical, even for a 5th grader. But what are the actual rates by gender?

|Gender|Survived|Total|Percentage survivors|
|Female|     292|  388|               75.3%|
|  Male|     135|  657|               20.5%|

We clearly see that survival depends on gender, so we can assume that gender will be an important feature for predicting whether a person survived. This is the intuition behind the Chi-squared test, which is also implemented in Spark as the ChiSqSelector, and which can be used to find the most important features of a data set.
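To make the intuition concrete, here is a back-of-the-envelope chi-squared statistic computed in plain Java from the gender table above (a sketch for illustration only; Spark's ChiSqSelector does this, and more, for you, and the ChiSquared class name is made up for the example). The "died" counts are derived from the table: 388 - 292 = 96 women and 657 - 135 = 522 men.

```java
public class ChiSquared {

    // Chi-squared statistic for a 2x2 contingency table:
    // sum over cells of (observed - expected)^2 / expected,
    // where expected = rowTotal * colTotal / grandTotal.
    static double chiSquared(double[][] observed) {
        double total = 0;
        double[] rowSum = new double[2];
        double[] colSum = new double[2];
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                rowSum[i] += observed[i][j];
                colSum[j] += observed[i][j];
                total += observed[i][j];
            }
        }
        double chi2 = 0;
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                double expected = rowSum[i] * colSum[j] / total;
                chi2 += Math.pow(observed[i][j] - expected, 2) / expected;
            }
        }
        return chi2;
    }

    public static void main(String[] args) {
        // Rows: female, male; columns: survived, died (counts from the table above).
        double[][] titanic = {{292, 96}, {135, 522}};
        System.out.println(chiSquared(titanic)); // around 302 -- far beyond chance
    }
}
```

A statistic this large, with one degree of freedom, means the dependence between gender and survival is about as clear-cut as it gets.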

Prediction with decision trees

Decision trees are one of the simplest types of algorithms in machine learning, and it is easy to understand the result of the algorithm; I think of it as if-else statements written by the program instead of the programmer. Decision trees are (usually) calculated top down, by selecting the feature that best separates the data points, in terms of grouping together data with the same label (the value we try to predict).
A decision tree classifier can have a tendency to over-fit, which means that the model fits the training set very well but generalizes poorly to other data sets. A common way to avoid this is to build several decision trees, a random forest, where some randomness is added to the generation of the trees to make them different. The prediction for an item is the label it gets most often when scoring it in each of the separate trees.
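The majority vote at the end can be sketched in a few lines of plain Java (a hypothetical illustration; Spark's random forest model does the voting internally, and the MajorityVote class name is made up for the example):

```java
public class MajorityVote {

    // The forest's prediction is the label predicted by the most trees.
    // With a binary label (survived = 1, died = 0) we just count the 1-votes.
    static int predict(int[] treeVotes) {
        int ones = 0;
        for (int vote : treeVotes) {
            if (vote == 1) ones++;
        }
        return ones * 2 > treeVotes.length ? 1 : 0;
    }

    public static void main(String[] args) {
        // Say five trees scored a passenger as survived(1)/died(0):
        System.out.println(predict(new int[]{1, 0, 1, 1, 0})); // 1
    }
}
```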

In Spark we start by splitting the data set randomly into two sets: a training set for fitting the model, containing 70% of the data, and a test set used to evaluate the model.

Dataset<Row>[] splits = passengers.randomSplit(new double[]{0.7, 0.3});
Dataset<Row> training = splits[0];
Dataset<Row> test = splits[1];

Before we can feed data into a random forest classifier we need to transform the data. Spark’s machine learning algorithms want data with a column “label” that contains what we are predicting, and a column “features” that contains a vector of the attributes we want to include. I find the RFormula in Spark very useful for producing label and features. The syntax is a bit strange, but the value to the left of “~” is the label, and on the right side one adds the attributes one wants, either by naming each of them with + between them, or by starting with “.” which gives all of them, and then “subtracting” the attributes you do not want. Survived is our label, and as features we select pclass, sex, age, fare, sibsp and parch.

RFormula formula = new RFormula().setFormula("survived ~ pclass + sex + age + fare + sibsp + parch");

We then use the RandomForestClassifier with its default configuration. Spark has a nice Pipeline abstraction which makes it easy to work with multiple steps in the machine learning process: we only have to fit and transform data once on the pipeline, instead of for each individual step, and it is easy to experiment with and replace parts of the chain. The following lines of code create a pipeline and fit it on the training data, which gives back a PipelineModel.

RandomForestClassifier forestClassifier = new RandomForestClassifier();
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{formula, forestClassifier});
PipelineModel model = pipeline.fit(training);

To find out how the model is doing on the test set, we can use the binary classification evaluator, since we are doing classification with a binary label (survived/died). The evaluator will score each entry in the data set, compare with the actual label, and by default report the area under the ROC curve, a number between 0 and 1 where higher is better.

Dataset<Row> predictions = model.transform(test);
BinaryClassificationEvaluator eval = new BinaryClassificationEvaluator();
double res = eval.evaluate(predictions);

The result varies a bit each time the program is run, since there is randomness both in the splitting of the data set and in the algorithm, but an example score we got was 86.5%.

The random forest model has a string representation of the trees that one can use to inspect the actual trees.

RandomForestClassificationModel treeModel = (RandomForestClassificationModel) model.stages()[1];

I visualized one of these trees, and my son wanted to see if he would have survived or not. Luckily he just turned eleven, and would according to this tree survive if we were travelling by first or second class. I would also have survived, but it certainly doesn’t look good for dad/husband.
(Figure: one of the decision trees from the random forest)

Wavelets and the Abel Prize

The Abel Prize is a Norwegian mathematics prize, awarded yearly to international mathematicians for extraordinary contributions to mathematics. Each year I try to read a bit about the winner and their work, and at least get an idea of what the winner has contributed to the field.

This year’s winner is Yves Meyer, who is awarded the prize for his role in developing the mathematical theory of wavelets, a theory he accidentally came across by the photocopier at the École Polytechnique in Paris [1]. He is in Oslo now, and will receive his prize in an award ceremony tomorrow, May 23.

So what are wavelets and what can they be used for?
I’m certainly not the right person to answer, with my few hours of knowledge on the topic, but I will share what I have discovered so far.
A wavelet is, as the word indicates, a wave-like function, or rather a family of functions, with some important properties that make them useful for signal processing. They are for instance better at detecting sudden changes in the signal than Fourier transforms.

Meyer has his own wavelet, shown in the image below [2].
The Meyer wavelet looks a bit complicated, doesn’t it? So I will stick to a much simpler wavelet, actually the very first example of a wavelet, the Haar wavelet. Haar constructed this function system in 1909 [3], long before the term wavelet was introduced, but the system later appears as a special case of the Daubechies wavelets [4].
The wavelet is defined by the wavelet function \psi(t), called the mother wavelet, and a scaling function \phi(t), sometimes referred to as the father wavelet. The mother wavelet is defined as

    \[    \psi(t) = \begin{cases}     1 & 0 \leq t < \frac{1}{2} \\     -1 & \frac{1}{2} \leq t < 1 \\     0 & \text{otherwise}   \end{cases} \]

with the scaling function

    \[  \phi(t) = \begin{cases}     1 & 0 \leq t < 1 \\     0 & \text{otherwise}  \end{cases} \]

If we continue to split the unit interval [0, 1] into 2^n pieces, we get a system of functions \psi_i^n(t) = 2^{\frac{n}{2}}\psi(2^n t - i) and \phi_i^n(t) = 2^{\frac{n}{2}}\phi(2^n t - i), where i \in \{ 0, \ldots, 2^n - 1\}. The functions \psi_i^n form an orthonormal basis for square integrable functions on the unit interval. That means that functions can be approximated by linear combinations of the \psi_i^n; the same holds for the set of \phi_i^n.
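As a quick sanity check of the normalization factor 2^{\frac{n}{2}} (my own verification, not taken from the references): each \psi_i^n is supported on an interval of length 2^{-n}, where \psi(2^n t - i)^2 = 1, so

    \[ \|\psi_i^n\|^2 = \int_0^1 \left(2^{\frac{n}{2}}\psi(2^n t - i)\right)^2 \, dt = 2^n \int_0^1 \psi(2^n t - i)^2 \, dt = 2^n \cdot 2^{-n} = 1. \]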

The YouTube video below shows what the \psi_i^n looks like and how they are scaled and translated from the mother wavelet.

We will start by looking at the following function, the four purple line segments, which one can think of as four one-dimensional pixels: [9, 5, 2, 4].

If we look at the father wavelets \phi_0^2, \phi_1^2, \phi_2^2 and \phi_3^2, we see that f can be written as

    \[ f = 9\phi_0^2 + 5\phi_1^2 + 2\phi_2^2 + 4\phi_3^2 \]

From the mother and the father wavelets one can construct subspaces V_n = \text{span}(\phi_i^n) and W_n = \text{span}(\psi_i^n), and we have that W_n is the orthogonal complement of V_n in V_{n+1} [2]:

    \[ V_n \oplus W_n = V_{n+1} \]

Let us see how we can represent f by a point in V_n and a point in W_n. The function is in V_2 and we want to write it decomposed over V_1 and W_1. In V_1 and W_1 we have the functions \phi_0^1 = \sqrt{2}\phi(2t), \phi_1^1 = \sqrt{2}\phi(2t-1), \psi_0^1 = \sqrt{2}\psi(2t) and \psi_1^1 = \sqrt{2}\psi(2t-1). I skip the square root of two to get simpler calculations; it only contributes a constant factor anyway. We then have the following functions:

    \[    \psi_0^1(t) = \begin{cases}     1 & 0 \leq t < \frac{1}{4} \\     -1 & \frac{1}{4} \leq t < \frac{1}{2} \\     0 & \text{otherwise}   \end{cases} \]

    \[    \psi_1^1(t) = \begin{cases}     1 & \frac{1}{2} \leq t < \frac{3}{4} \\     -1 & \frac{3}{4} \leq t < 1 \\     0 & \text{otherwise}   \end{cases} \]

    \[  \phi_0^1(t) = \begin{cases}     1 & 0 \leq t < \frac{1}{2} \\     0 & \text{otherwise} \end{cases}  \]

    \[  \phi_1^1(t) = \begin{cases}     1 & \frac{1}{2} \leq t < 1 \\     0 & \text{otherwise} \end{cases}  \]

We see that \psi_0^1 and \phi_0^1 are non-zero in [0, \frac{1}{2}) and \psi_1^1 and \phi_1^1 are non-zero in [\frac{1}{2}, 1), so we want to find a and b such that

    \[ a\phi_0^1(t) + b\psi_0^1(t) = \begin{cases} a + b & 0 \leq t < \frac{1}{4} \\ a - b & \frac{1}{4} \leq t < \frac{1}{2} \end{cases} = \begin{cases} 9 &  0 \leq t < \frac{1}{4} \\ 5 & \frac{1}{4} \leq t < \frac{1}{2} \end{cases} \]

If we solve these equations we find that a = \frac{9+5}{2} and b = \frac{9-5}{2}, and we get the equivalent for the pair 2 and 4; hence f decomposes into the point (\frac{9+5}{2}, \frac{2+4}{2}) in V_1 and the point (\frac{9-5}{2}, \frac{2-4}{2}) in W_1. We see that the point in V_1 consists of the averages of pairs of coefficients of f, and the point in W_1 of the differences of the pairs, so we can represent f by [7, 3, 2, -1], where the first two numbers are the approximation coefficients of the original function (if you decide to zoom out and only keep half of the points in the unit interval, it seems quite natural to average pairs of points), and the last two are called the detail coefficients.
The calculations in this example hold in general for 2^k points: we can approximate the original with half as many points by averaging pairs of points, and the detail coefficients give information about the difference within each pair. The detail coefficients can be used to go the other way and reconstruct the original points.
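The one-dimensional version of this average-and-difference step, and its inverse, is only a few lines of Java, and running it on our example [9, 5, 2, 4] gives exactly the coefficients we calculated by hand (a sketch with the scaling factor \sqrt{2} skipped as above; the HaarStep class name is made up for the example):

```java
import java.util.Arrays;

public class HaarStep {

    // One level of the (unnormalized) Haar transform: the first half of the
    // result holds pairwise averages, the second half pairwise differences.
    static double[] transform(double[] input) {
        int half = input.length / 2;
        double[] result = new double[input.length];
        for (int j = 0; j < half; j++) {
            result[j] = (input[2 * j] + input[2 * j + 1]) / 2.0;
            result[half + j] = (input[2 * j] - input[2 * j + 1]) / 2.0;
        }
        return result;
    }

    // The inverse: reconstruct each pair from its average and difference.
    static double[] inverse(double[] coeffs) {
        int half = coeffs.length / 2;
        double[] result = new double[coeffs.length];
        for (int j = 0; j < half; j++) {
            result[2 * j] = coeffs[j] + coeffs[half + j];
            result[2 * j + 1] = coeffs[j] - coeffs[half + j];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] f = {9, 5, 2, 4};
        double[] t = transform(f);
        System.out.println(Arrays.toString(t));          // [7.0, 3.0, 2.0, -1.0]
        System.out.println(Arrays.toString(inverse(t))); // [9.0, 5.0, 2.0, 4.0]
    }
}
```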

The same calculations can be applied to a 2-D image by first calculating the average and difference for each row, and then for each column. I have written a simple Java program which does these calculations for an image, and thus transforms the image into a compressed image together with three images of detail coefficients. The interesting part of the program is where the averages and differences are computed; the rest is just converting Java’s BufferedImage from color to gray scale, from an Image to a matrix, and back again. The code snippet below shows how I computed the averages and differences for the rows; the rest of the code can be found in the gist here.

private double[][] transformRows(double[][] input) {
    int size = input.length;
    double[][] result = new double[size][size];
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size / 2; j++) {
            result[i][j] = (input[i][2*j] + input[i][2*j + 1]) / 2.0;
            result[i][size/2 + j] = (input[i][2*j] - input[i][2*j + 1]) / 2.0;
        }
    }
    return result;
}

Many of the pixels in an image will be very similar to their neighbouring pixels, hence many of the differences will be close to zero. The differences are largest where there are sudden changes in the image, for instance between the hair and the face of Lenna. Another interesting observation is that in the upper right image the vertical details are most prominent, like the nose, whereas the lower left has horizontal details, like the mouth. This is because of the way we calculate the averages and differences: the upper right contains averaged differences of pixels in the same row, and the lower left averaged differences of pixels in the same column.

So what can this wavelet transform be used for? One obvious application is image compression. We have got a compressed version of the original, and the detail coefficients are often zero (or a very small number). One can repeat the process, and transform the transformed image several times, and then use the detail coefficients to reconstruct the original, with or without loss [5]. Another interesting application is to use the detail coefficients as features in image recognition. In [6] it was used with logistic regression to detect facial expression, and in [7] the Haar transform was used to classify leaves.

[1] “If it were true, it would be known” – The Abel Committee explains the choice of Yves Meyer as this year’s Abel Prize Laureate
[2] Wavelet. (2017, April 24). In Wikipedia, The Free Encyclopedia. Retrieved 16:17, May 21, 2017
[3] A. Haar, Zur Theorie der orthogonalen Funktionensysteme
[4] Haar wavelet. (2017, March 4). In Wikipedia, The Free Encyclopedia. Retrieved 21:24, May 21, 2017
[5] Colm Mulcahy, Image compression using the Haar wavelet transform
[6] Mahesh Goyani and Narendra Patel, Multi-Level Haar Wavelet based Facial Expression Recognition using Logistic Regression
[7] Haiyan Zhang and Xingke Tao, Leaf Image Recognition Based on Wavelet and Fractal Dimension

What do you feel?

This weekend I spent some time making a simple mobile app for emotion recognition, using Microsoft Cognitive Services. There are several cool APIs there, for instance the Computer Vision API, which can recognize content in an image, recognize celebrities, and extract text from images. I guess I just ended up with the Emotion API because it also might be useful for work.
The APIs are well documented, there are client libraries and sample applications on GitHub, and the client libraries are available at Maven Central for Java and NuGet for C#. They are free to use for registered developers, within some limits (30,000 transactions per month for the Emotion API).
I first decided to try to make the app in Xamarin, since it’s now free with Visual Studio, but I didn’t get the client library to work there; it seemed to be something wrong with serialization in Mono, so the library might not be entirely cross-platform (or I messed up somewhere). So I switched to Android Studio and made the very simple demo app there instead.

The app allows you to take a photo, and the photo is then posted to Microsoft’s service for emotion recognition. The service returns a bounding box for each recognized face, together with confidence scores across eight emotions: happiness, sadness, surprise, anger, fear, contempt, disgust and neutral.
The initial testing showed that the old doll is neutral, but it also scores some on surprise. The clown was happy with very high confidence, 0.99997, but scored similarly on happiness without the nose. Further, the faces of teddy bears and guinea pigs are not recognized, and the same was occasionally true for my husband; too much beard, perhaps?

K-means clustering in Neo4j

The last few days I’ve been playing with Neo4j. I’ve always been intrigued by graphs, and I’ve wanted to learn more about graph databases for some time now, and finally, I was able to find some time for it.

One challenge I usually have when I want to test out new technology is to find a suitable use case: simple enough, but still one that illustrates a fair amount of the features and functionality of the technology. In this case I ended up with a simple toy example to illustrate how the k-means clustering algorithm, used in machine learning, can partition a data set into k disjoint clusters.

The k-means clustering algorithm

The algorithm itself is fairly simple. Assume that we have a data set of observations, where each sample has data for d features; hence each sample can be considered as a real d-dimensional vector [x_1, \ldots, x_d]. This means that we can calculate the squared Euclidean distance of two vectors \textbf{x} = [x_1, \ldots, x_d] and \textbf{y} = [y_1, \ldots, y_d] as

    \[ \|\textbf{x} - \textbf{y}\|^2 = (x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_d - y_d)^2, \]

and if we have m vectors \textbf{x}^1,\textbf{x}^2, \ldots, \textbf{x}^m we can calculate the mean or average as

    \[ \frac{1}{m}\sum_{j = 1}^{m}{\textbf{x}^j} = \frac{1}{m}[\sum_{j = 1}^{m}{x_1^j}, \ldots, \sum_{j = 1}^{m}{x_d^j}]. \]
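These two formulas translate directly into Java. The sketch below is just for illustration (the VectorMath class name and method names are made up):

```java
public class VectorMath {

    // Squared Euclidean distance: the sum of squared coordinate differences.
    static double squaredDistance(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Coordinate-wise mean of m vectors.
    static double[] mean(double[][] vectors) {
        double[] result = new double[vectors[0].length];
        for (double[] v : vectors) {
            for (int i = 0; i < result.length; i++) {
                result[i] += v[i] / vectors.length;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(squaredDistance(new double[]{0, 0}, new double[]{3, 4})); // 25.0
        double[] m = mean(new double[][]{{1, 2}, {3, 4}});
        System.out.println(m[0] + ", " + m[1]); // 2.0, 3.0
    }
}
```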

To use the algorithm, we first have to decide on the number of clusters k, and then initialize what are called centroids, one for each cluster: \{ c_1, \ldots, c_k\}. The centroids will be the centres of the clusters. One way of initializing the centroids is to randomly pick k samples from the data set. Once we have the initial k centroids, the following two steps are repeated.

Cluster assignment step

In this step each sample s is assigned to the cluster corresponding to the centroid with the smallest Euclidean distance to the sample, i.e. the centroid closest to s. Hence, s is assigned to one of the centroids in the following set. (It might happen that the sample is exactly in the middle of two or more centroids, but usually this set consists of only one centroid.)

    \[ \{c_i:  \|c_i - s\|^2 =  \min_{j}{\|c_j - s\|^2}\} \]

Centroid update step

When all the samples are assigned to a centroid, this step will calculate new centroids for each of the k clusters by taking the mean of all the assigned samples. So for assigned samples s^1, \ldots, s^r, the vector for the new centroid c_i will be

    \[ \textbf{c}_i = \frac{1}{r}\sum_{j = 1}^{r}{\textbf{s}^j} \]

These two steps are repeated until the algorithm converges, i.e., the assignment step no longer changes any assignments. The algorithm might converge to a local optimum, however, and does not guarantee that the global optimum is found. So a common approach is to run the algorithm several times with different initializations of the centroids and choose the best result.
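Before moving the algorithm into Neo4j, the two steps can be sketched end to end in plain Java on a tiny 2-D data set (a toy illustration, not the Neo4j implementation below; the KMeans class name and the sample points are made up):

```java
import java.util.Arrays;

public class KMeans {

    // Assignment step: the index of the centroid closest to the sample
    // (smallest squared Euclidean distance).
    static int closest(double[] sample, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int i = 0; i < sample.length; i++) {
                double diff = sample[i] - centroids[c][i];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    // Repeat assignment and centroid update until the assignments stop changing.
    static int[] cluster(double[][] samples, double[][] centroids) {
        int[] assignment = new int[samples.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int s = 0; s < samples.length; s++) {
                int c = closest(samples[s], centroids);
                if (c != assignment[s]) { assignment[s] = c; changed = true; }
            }
            // Update step: each centroid becomes the mean of its assigned samples.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int s = 0; s < samples.length; s++) {
                    if (assignment[s] == c) {
                        for (int i = 0; i < sum.length; i++) sum[i] += samples[s][i];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int i = 0; i < sum.length; i++) centroids[c][i] = sum[i] / count;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] samples = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}};
        double[][] centroids = {{1, 1}, {8, 8}}; // initialized from two of the samples
        System.out.println(Arrays.toString(cluster(samples, centroids))); // [0, 0, 1, 1]
    }
}
```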

The data set

In order to do the clustering I needed a data set, and I ended up with the seeds data set from the UCI Machine Learning Repository. The set consists of 210 samples of kernels of three types of wheat, 70 samples of each type. The samples each have seven attributes: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove, and hence each sample can be considered as a 7-dimensional vector. The set is labelled, so we know which samples belong to the same type of wheat. The k-means algorithm is often used with unlabelled sets and doesn’t use the labels, but I’ve included them in the example so we can easily see how the clustering compares to the original labelling.

Implementation in Neo4j

I have two kinds of nodes in my model: seeds and centroids. They both contain the seven attributes from the data set, and seeds also have their original labelling. Further, centroids have an index which gives the number of the corresponding cluster, and an attribute “iteration”, which contains the iteration the centroid is used in. (Another approach would be to remove the old centroids after calculating the new ones, and thus there would be no need to keep track of the iterations. But we want to see the change, so we are keeping all centroids.) The centroids are chosen very non-randomly: I’ve just picked the data from one seed of each label, hence the clustering will be good already after the first iteration.

The creation of the nodes is pretty straightforward. The attributes of a node in Neo4j are in JSON format, so I started off by reading the data set file in Java, creating Seed objects, and then using Gson to convert the objects to JSON strings. Before the colon and the type name one can add an identifier to refer to the same node later in the same script, but we don’t need that in our case.

CREATE (:Seed {area:15.26,perimeter:14.84,compactness:0.871,kernelLength:5.763,kernelWidth:3.312,asymmetryCoefficient:2.221,kernelGrooveLength:5.22,label:1})
CREATE (:Centroid {area:13.84,perimeter:13.94,compactness:0.8955,kernelLength:5.324,kernelWidth:3.379,asymmetryCoefficient:2.259,kernelGrooveLength:4.805, index:1, iteration:1})

The complete script file for creating all the nodes is in this file.

Cluster assignment

The code below shows how I assign the seeds to the right centroid. It feels a bit clumsy, but it was the best I could come up with in pure Neo4j. If this were part of an application, one would probably let the application code calculate the distances and find the minimum. And as far as I can tell, Neo4j doesn’t support user defined functions, so I cannot create something like a distance function instead of duplicating the distance calculation three times.
The MATCH statement does pattern matching, and one can specify paths, nodes and relationships, with or without types (which I think are called labels in the graph database world) on nodes or relationships, and with identifiers to refer to the elements later in the statement. In our case we need all the seeds, and for each seed we will calculate the distances to each of the three centroids and then find the closest one. The MATCH clause finds 210 rows, one for each seed, containing the seed and the three centroids.
The three SET statements add new attributes distC1, distC2 and distC3 to each seed, containing the squared distance from the seed to each centroid. The following WITH clause binds variables from the matching to be used later: we keep each seed s, and for each seed the closest centroid, kept as minC, and finally we create an IN_CLUSTER relationship from the seed s to the centroid minC. After the CREATE one could have tidied up the seeds and deleted the three distance attributes.

MATCH (s:Seed), (c1:Centroid{index: 1, iteration: 1}), (c2:Centroid{index: 2, iteration: 1}), (c3:Centroid{index: 3, iteration: 1}) 
SET s.distC1 = (s.area - c1.area)^2 + (s.perimeter - c1.perimeter)^2 + (s.compactness - c1.compactness)^2 + (s.kernelLength - c1.kernelLength)^2 + (s.kernelWidth - c1.kernelWidth)^2 + (s.asymmetryCoefficient - c1.asymmetryCoefficient)^2 + (s.kernelGrooveLength - c1.kernelGrooveLength)^2
SET s.distC2 = (s.area - c2.area)^2 + (s.perimeter - c2.perimeter)^2 + (s.compactness - c2.compactness)^2 + (s.kernelLength - c2.kernelLength)^2 + (s.kernelWidth - c2.kernelWidth)^2 + (s.asymmetryCoefficient - c2.asymmetryCoefficient)^2 + (s.kernelGrooveLength - c2.kernelGrooveLength)^2
SET s.distC3 = (s.area - c3.area)^2 + (s.perimeter - c3.perimeter)^2 + (s.compactness - c3.compactness)^2 + (s.kernelLength - c3.kernelLength)^2 + (s.kernelWidth - c3.kernelWidth)^2 + (s.asymmetryCoefficient - c3.asymmetryCoefficient)^2 + (s.kernelGrooveLength - c3.kernelGrooveLength)^2
WITH s, 
case when s.distC1 <= s.distC2 and s.distC1 <= s.distC3 then c1 
when s.distC2 <= s.distC1 and s.distC2 <= s.distC3 then c2 else c3 end as minC 
CREATE (s)-[:IN_CLUSTER]->(minC)

The picture shows the three clusters after the seeds have been assigned, and the table shows the statistics. Each cluster should have 70 seeds if the algorithm got everything right; the first cluster has fewer seeds than that, but a high rate of correctly assigned seeds, compared to the third cluster where only 80% of the seeds are correct.

Cluster no   Total assigned   Correct assigned   Percentage correct
1            50               44                 88%
2            80               69                 86.25%
3            80               64                 80%

Centroid update

The next step is to find the new centroids for the next assignment step, by calculating the average values of the seven attributes over the seeds assigned to each cluster. I found the useful avg function, which calculates the average of an attribute value over all the matched nodes bound to the same identifier. But in this case it is important to think through what you include in a single match. If the match statement was like MATCH (s1:Seed)-[:IN_CLUSTER]->(c1:Centroid {index: 1, iteration: 1}), (s2:Seed)-[:IN_CLUSTER]->(c2:Centroid {index: 2, iteration: 1}), (s3:Seed)-[:IN_CLUSTER]->(c3:Centroid {index: 3, iteration: 1}), we would get the Cartesian product over s1, s2 and s3, and the number of rows returned would be 50 * 80 * 80 = 320 000. This would still give the right numbers when using the average function, since taking the average over multiple copies of a value gives back the original value, but for other aggregate functions, like sum, one would of course get wrong values.
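The claim that avg survives row duplication (while sum does not) can be checked with a tiny Python sketch; the 6400 factor below stands in for the 80 * 80 duplication the Cartesian product would cause for cluster one's rows:

```python
# Simulate what the Cartesian product does to cluster one's rows:
# every seed value is repeated once per combination of rows from the
# other two clusters (80 * 80 = 6400 copies of each).
rows = [2.0, 4.0, 6.0]      # attribute values for three seeds
dup = rows * 6400           # the duplicated rows

avg_rows = sum(rows) / len(rows)
avg_dup = sum(dup) / len(dup)
assert avg_rows == avg_dup == 4.0    # the average is unchanged
assert sum(dup) == 6400 * sum(rows)  # the sum is inflated 6400-fold
```

So the combined MATCH would happen to give correct averages, but splitting the matches avoids the blow-up entirely.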

MATCH (s1:Seed)-[:IN_CLUSTER]->(c1:Centroid {index: 1, iteration: 1})
WITH avg(s1.area) as s1Area, avg(s1.perimeter) as s1Perimeter, avg(s1.compactness) as s1Compactness, avg(s1.kernelLength) as s1KernelLength, avg(s1.kernelWidth) as s1KernelWidth, avg(s1.asymmetryCoefficient) as s1AsymmetryCoefficient, avg(s1.kernelGrooveLength) as s1KernelGrooveLength
MATCH (s2:Seed)-[:IN_CLUSTER]->(c2:Centroid {index: 2, iteration: 1})
WITH s1Area, s1Perimeter, s1Compactness, s1KernelLength, s1KernelWidth, s1AsymmetryCoefficient, s1KernelGrooveLength, avg(s2.area) as s2Area, avg(s2.perimeter) as s2Perimeter, avg(s2.compactness) as s2Compactness, avg(s2.kernelLength) as s2KernelLength, avg(s2.kernelWidth) as s2KernelWidth, avg(s2.asymmetryCoefficient) as s2AsymmetryCoefficient, avg(s2.kernelGrooveLength) as s2KernelGrooveLength
MATCH (s3:Seed)-[:IN_CLUSTER]->(c3:Centroid {index: 3, iteration: 1})
WITH s1Area, s1Perimeter, s1Compactness, s1KernelLength, s1KernelWidth, s1AsymmetryCoefficient, s1KernelGrooveLength, 
s2Area, s2Perimeter, s2Compactness, s2KernelLength, s2KernelWidth, s2AsymmetryCoefficient, s2KernelGrooveLength, 
avg(s3.area) as s3Area, avg(s3.perimeter) as s3Perimeter, avg(s3.compactness) as s3Compactness, avg(s3.kernelLength) as s3KernelLength, avg(s3.kernelWidth) as s3KernelWidth, avg(s3.asymmetryCoefficient) as s3AsymmetryCoefficient, avg(s3.kernelGrooveLength) as s3KernelGrooveLength
CREATE (:Centroid {area: s1Area, perimeter: s1Perimeter, compactness: s1Compactness, kernelLength: s1KernelLength, kernelWidth: s1KernelWidth, asymmetryCoefficient: s1AsymmetryCoefficient, kernelGrooveLength: s1KernelGrooveLength, index: 1, iteration: 2})
CREATE (:Centroid {area: s2Area, perimeter: s2Perimeter, compactness: s2Compactness, kernelLength: s2KernelLength, kernelWidth: s2KernelWidth, asymmetryCoefficient: s2AsymmetryCoefficient, kernelGrooveLength: s2KernelGrooveLength, index: 2, iteration: 2})
CREATE (:Centroid {area: s3Area, perimeter: s3Perimeter, compactness: s3Compactness, kernelLength: s3KernelLength, kernelWidth: s3KernelWidth, asymmetryCoefficient: s3AsymmetryCoefficient, kernelGrooveLength: s3KernelGrooveLength, index: 3, iteration: 2})

Next cluster assignment

The next cluster assignment is done by the same statement as the last, only with the iteration number of the centroids increased to two. And now we can easily see which seeds have moved to a new cluster. The statistics show that clusters two and three have improved, whereas the first cluster now has more seeds, but its correctness has decreased.

Cluster no   Total assigned   Correct assigned   Percentage correct
1            66               55                 83.3%
2            73               66                 90.4%
3            71               63                 88.7%
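For comparison, the two steps above, assignment and centroid update, can be sketched end to end in a few lines of Python. This is a generic k-means, not a translation of the Cypher in this post, and it assumes the seeds are given as plain lists of feature values:

```python
import random

def kmeans(points, k, iterations):
    """Plain k-means on a list of equal-length feature vectors."""
    def dist2(p, q):
        # squared Euclidean distance, as in the SET statements above
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = random.sample(points, k)  # seed centroids from the data
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return centroids, clusters
```

Each iteration does exactly what the two Cypher statements do: assign every seed to its nearest centroid, then move each centroid to the average of its cluster.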

Now it is up to you to continue the clustering if you'd like; all the code is in this gist. If you find errors in this post, have suggestions for improvements, or have other comments or questions, please let me know! 🙂



A tiny theorem prover

At work we have many great (and some quite funny and not so great) old books, and when I came across this LISP book, I had to borrow it. (At some point there must have been a real library; the books are numbered and have an old-fashioned book card at the back, with the dates and names of previous borrowers. So maybe I didn't borrow the book properly, I just took it off the shelf.) The book has many great chapters, for instance “Symbolic pattern matching and simple theorem proving”, “Program writing programs and natural language interfaces” and “LISP in LISP”. Here is a little theorem prover from the chapter “Symbolic pattern matching and simple theorem proving”, based on the resolution rule and rewritten in Racket.

The resolution rule produces a new clause whenever two clauses contain complementary literals, i.e., one clause contains c and the other contains \neg c.

    \[ \frac{a_1\lor \ldots \lor a_i \lor c \lor a_{i+1}\lor \ldots \lor a_n, b_1\lor \ldots \lor b_j \lor \neg c \lor b_{j+1}\lor \ldots \lor b_m}{a_1\lor \ldots \lor a_i \lor a_{i+1}\lor \ldots \lor a_n, b_1\lor \ldots \lor b_j \lor b_{j+1}\lor \ldots \lor b_m} \]

The simplest case is when we have the clauses a \lor c and b \lor \neg c. Since c and \neg c cannot both be true at the same time we know that either a or b has to be true, hence \frac{a \lor c, b \lor \neg c}{a \lor b}.

(If you’re not familiar with logical operators and terms, here is the shortest intro, from a non-expert: the sign \neg is negation; a letter is a variable that can be either true or false; and a literal is either a letter or the negation of a letter. The operator \lor is “or” (disjunction) and \land is “and” (conjunction), and a clause is a finite disjunction of literals, i.e., letters or negations of letters connected with “or”. In the fraction above, the clauses above the line are the premises, and the clause below the line is the conclusion, which can be deduced from them. Hence, the conclusion is true if all the premises are true.)
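The simplest resolution step can even be verified mechanically with a truth table; here is a small Python check over all eight assignments of a, b and c:

```python
from itertools import product

# Whenever both premises (a or c) and (b or not c) are true,
# the resolvent (a or b) must also be true.
for a, b, c in product([False, True], repeat=3):
    if (a or c) and (b or not c):
        assert a or b
print("resolution rule holds for all 8 assignments")
```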

The logical operators and the literals are symbols/quotes in the code (since we’re not in the “LISP in LISP” chapter of the book), and the following resolve function is my Racket version of the equivalent Common Lisp function from the book. The code in the book uses SETQ to update variables and GO to jump around in the code, a lot, but I’m using let-expressions and recursion instead.
The resolve function uses two helper functions: invert puts the symbol ‘not in front of a literal, or removes the ‘not if the literal is already a negation, and combine is a cleanup function that adds a literal to a list of literals unless the literal or its negation is already in the list, and removes duplicates.

(define (resolve x y)
  (letrec ([res (λ (rest-x rest-y)
                  (cond
                    [(null? rest-x) 'no-resolvent]
                    [(member? (invert (car rest-x)) rest-y)
                     (cons 'or (combine (car rest-x) (append (cdr x) (cdr y))))]
                    [else (res (cdr rest-x) rest-y)]))])
    (res (cdr x) (cdr y))))

And some tests to show how the function works:

(check-equal? (resolve '(or P) '(or Q)) 'no-resolvent)
(check-equal? (resolve '(or P) '(or (not P))) '(or))
(check-equal? (resolve '(or P (not Q)) '(or Q (not R))) '(or P (not R)))

So how can we use the resolution rule in a proof?

A conjecture is proved if the premise implies the conclusion; hence a conjecture is false if there is a combination of literals that makes the premise true and the conclusion false at the same time. This is equivalent to saying that the conjecture is false if there is a combination of literals that makes the premise and the negation of the conclusion true at the same time.

We require both the premise and the negation of the conclusion to be in conjunctive normal form, which means that both consist of clauses connected with “and”s. So for the premise and the negation of the conclusion to be true at the same time, all the clauses have to be true.

If the combined clauses contain both a literal and its negation, then it is impossible for all the clauses to be true at the same time; hence the conjecture is not false, it is true. This is the key idea of the proof algorithm: loop through all the clauses, and use the resolution rule on each pair of clauses. If the resolvent is empty, we know that the theorem is true. If we get a non-empty resolvent back, it is added to the list of clauses, and the resolvent is again used in the resolution rule with each of the other clauses in the list. If we have looped through all the clauses without encountering an empty resolvent, the conjecture is false.

(define (prove premise negation)
  (letrec ([try-next-remainder
            (λ (remainder clauses)
              (cond [(null? remainder)
                     (displayln '(theorem not proved))
                     #f]
                    [else (try-next-resolvent (car remainder) (cdr remainder) remainder clauses)]))]
           [try-next-resolvent
            (λ (first rest remainder clauses)
              (if (null? rest)
                  (try-next-remainder (cdr remainder) clauses)
                  (let ([resolvent (resolve first (car rest))])
                    (cond [(or (equal? resolvent 'no-resolvent)
                               (member? resolvent clauses))
                           (try-next-resolvent first (cdr rest) remainder clauses)]
                          [(null? (cdr resolvent))
                           (display-clauses first (car rest))
                           (displayln '(produce the empty resolvent))
                           (displayln '(theorem proved))
                           #t]
                          [else
                           (display-clauses first (car rest))
                           (displayln (append '(produce a resolvent:) (list resolvent)))
                           (try-next-remainder (cons resolvent clauses) (cons resolvent clauses))]))))]
           [clauses (append (cdr premise) (cdr negation))])
    (try-next-remainder clauses clauses)))
(test-case "prove true"
  (let ([premise '(and (or Q (not P))
                       (or R (not Q))
                       (or S (not R))
                       (or (not U) (not S)))]
        [negation '(and (or P) (or U))])
    (check-true (prove premise negation))))

The complete code, with tests, can be found here.


  Lisp by Patrick Henry Winston and B.K.P. Horn

Week summary

One of the reasons for getting this blog running again was to share more of what I do at work. I have been working as a software developer for some years, but a couple of months ago I started as a research scientist at a research institute, in the ICT department. I’m still sort of figuring out what I’m doing and how things work, but I also think that others are curious about what researchers do for a living.

At work

Currently I’m working on three projects, but this week I only managed to work on two of them: a pre-project for writing a project proposal to the research council, and a development project building a web site in PHP.

The income of the institute consists to a great extent of project funding from the research council here in Norway, or from the EU, and there is continuous work on writing proposals and staying up to date with upcoming calls. The proposal I’m working on is for a large project with many partners, and I’m learning a lot about how these types of projects work. I’ve met many great people from industry and other research institutions over the last few weeks at various workshops and meetings, and I have met users that will benefit from our solutions if we get the funding. This week I’ve worked on the budget for the project (really not that fun, but it has to be done 🙂 ), and I’ve been writing and searching for references for the proposal and a state-of-the-art report we are working on, so I’m learning a lot about the actual research area as well.

When it comes to the web site, developing in PHP is also quite new to me. I’m not very fond of the dynamic typing, or of the tedious syntax with $this->that. Not to mention the extensive use of arrays, where new keys and values are added ad hoc as needed. The latter is probably not just the language’s fault.
I’ve suggested that we should do some tidying in the code, and next week we will discuss what the domain model should look like. Hopefully we will be able to replace some arrays with real objects, do some general tidying and write some tests, and that will make the code easier to work with.
Other than that, I also spent some time trying to get a constructor to work for a new class I wrote. The reason it didn’t work was the most newbie reason of all: I hadn’t noticed that the constructor name should start with a double underscore, and had only a single one in my code 🙂

After work

  • Often I read books on the subway to and from work, but this week I started to listen to the Type Theory Podcast. I listened to the first two episodes, which were great, and the webpage has good posts on the episodes, including many relevant references. I hope they will make more episodes.
  • I also watched Jenny Tong’s talk from Google’s Ubiquity, and now I want to make all sorts of stuff with Arduino. Great video!
  • I ran into one of the managers of the last customer I worked with when I was still a consultant, and I was invited to a cake celebration the next day, celebrating that the application I worked on had been put into production. It was nice to meet all my great friends and former colleagues there, both from my old company and from the customer.
  • It is the final week of the Machine Learning course I’m following at Coursera, taught by Andrew Ng. It’s a great course, and I would like to do more machine learning in the future. I also believe that all programmers should learn a bit about machine learning and data mining; nowadays everyone wants applications that make suggestions or decisions for the user based on earlier behaviour.
  • We had a great meeting in the Lambda club, a group that originated at my former company, of enthusiastic programmers with an interest in functional programming, languages in general, and all the weird and fun stuff one can do with code. At the meeting we had great fun solving the following puzzle: given squares divided into eight regions, where each region can be coloured white or black, how many different squares are there, when squares obtained by rotating or flipping (reflecting) another square are considered the same? It’s a great puzzle which can be solved in various ways, and would fit into a code kata or similar.
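Here is one brute-force way to attack the puzzle, as a Python sketch. It assumes the eight regions are the triangular slices cut by the square’s two diagonals and two midlines, numbered 0 to 7 clockwise, so that a 90-degree rotation shifts every slice by two positions and every reflection axis runs along slice boundaries; with a different layout of the eight regions the symmetry permutations, and the answer, would differ:

```python
from itertools import product

def rotate(col, steps):
    # cyclic shift of the eight slices
    return tuple(col[(i - steps) % 8] for i in range(8))

def reflect(col, k):
    # reflection mapping slice i to slice k - i (mod 8)
    return tuple(col[(k - i) % 8] for i in range(8))

def symmetries(col):
    # the eight symmetries of the square acting on the slices:
    # rotations by 0, 90, 180, 270 degrees, and four reflections
    for s in (0, 2, 4, 6):
        yield rotate(col, s)
    for k in (1, 3, 5, 7):
        yield reflect(col, k)

# Count orbits by keeping the lexicographically smallest member of
# each orbit as its canonical representative.
canonical = set()
for col in product((0, 1), repeat=8):  # 0 = white, 1 = black
    canonical.add(min(symmetries(col)))
print(len(canonical))  # 43 under these assumptions
```

The same number falls out of Burnside’s lemma for this group: (2^8 + 2·2^2 + 2^4 + 4·2^4) / 8 = 344 / 8 = 43.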

The winner of this year’s birdhouse is…

… the nuthatch!

A slightly grainy picture of the nuthatch, taken with my phone

One of the first things we did after buying the cabin in May last year was to hang up a birdhouse. We thought it might be too late in the season for anyone to move in, but a pair of great tits did, and we followed them from the time they moved in until the chicks flew out later in the summer.

During the autumn someone adjusted the diameter of the hole; the prime suspect is the great spotted woodpecker that lives in the area. We had plans to fix the box so that the hole would again be the right size for great tits or others with similar needs. At the very least we didn’t want to risk the great tits moving in and larger birds coming along and taking over the box.

But since we were last at the cabin, a pair of nuthatches has moved in, and thereby solved the problem for us. The nuthatch has some distinctive skills. In addition to being the only bird that climbs head first down tree trunks, it can also do masonry. It generally wants a larger hole than the great tit does, and it simply adjusts the hole to the size it wants by plastering it up to a suitable diameter. The pair doesn’t seem to like draughts either, as they have already plastered shut the gap at the top of the box where the lid opens.
It will be exciting to follow the nuthatches through the spring!

Today’s outing – Vestli

It is the big honey-harvesting weekend for Eivind, so Lars and I have been on an outing to give him better working conditions at home.
I asked Lars what he wanted to do, and what he wanted most of all was to ride metro line 5 all the way to Vestli, because he had never been there before.


On the way down to Tøyen the line map was studied closely. We changed to the much-talked-about line 5, and eventually arrived at Vestli.


I thought we had then reached the goal of our journey. We took the metro back down, but got off at Stovner to look around Stovner senter.




There were both cars and trains to sit in, and a small screen showing Brio films in the toy store. Apparently there was a Burger King here too…
After a while I thought it was time to head home, but then Lars remembered that we hadn’t seen what it looked like above ground at Vestli. (The station is underground.) So of course we had to take the metro back.


Outside the metro station is Vestli torg, surrounded by lots of nice terraced apartment buildings. And after walking a little while we came to a fantastic playground, on a par with the one in Frognerparken, but without all the people.



So after a four-and-a-half-hour outing we have finally come home, tired and happy 🙂