INF442 - Agglomerative hierarchical clustering
Context
- We will apply a hierarchical clustering algorithm to data and interpret the results
- We will practice structuring a project into separate compilation units
- We will implement the single-linkage hierarchical clustering algorithm, using the union–find data structure, which you have seen in INF321 and INF411
Structure
This TD is structured as follows:
- Section 1 describes how to run hierarchical clustering algorithms using scipy, to prepare you for working with data for the quiz
- Section 2 presents some good practices for structuring a large C++ project, such as division into compilation units and using getters/setters (there is a practical exercise for the latter)
- Section 3 guides you through implementing a single-linkage clustering algorithm
- Section 4 describes how to retrieve useful information from the cluster data computed in Section 3
- Section 5 describes how you can collect the code you wrote into a C++ clustering tool
Before you begin
Download and unzip the archive INF442-td4-1-handin.zip. It contains:
- source files, in particular the files point.cpp, graph.cpp, and dendrogram.cpp, which you will be asked to modify,
- a Makefile that you can use to compile the various tests (see the following sections),
- a csv folder with the data used in the TD, and
- a quiz folder with a starter script in Python, for the quiz.
1. (Quiz) Data analysis
In the quiz this week, you are asked to perform hierarchical clustering on several datasets, some synthetic, others from the real world. To this end, we use the libraries pandas (which you have already seen) and scipy. They can be installed using the following command in a terminal:
$ pip3 install --user --only-binary=:all: pandas scipy PyQt5
To perform the clustering, we provide you with a Python script named hierarchical_clustering.py, located in the quiz folder. It uses the following imports:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
We use pandas for reading and manipulating data, matplotlib for generating plots, and scipy for performing the clustering.
The script currently reads in the file ../csv/bluered.csv, which contains 22 points with coordinates x and y, labeled by integers from 1 to 22, as shown below:
The script then performs hierarchical clustering using the linkage function from scipy. Afterward, it generates and plots the resulting dendrogram, using the dendrogram function:
tree = linkage(data[['x', 'y']])
D = dendrogram(tree, labels = data['name'].to_numpy(), orientation = 'left')
plt.show()
When running the script (from the folder quiz), you should see the following picture:
You are encouraged to look at the official documentation for linkage and dendrogram. We would like to highlight the following aspects:
- Important: The x-coordinate corresponds to the height of the cluster multiplied by two. So, if you are interested, for example, in the clusters at height 1.0, you should look at the part of the dendrogram with the x-coordinate being at most 2.0.
- The linkage function is called by default with method='single', which tells it to use single linkage, but other methods are implemented as well.
- By default, the linkage function equips the data with the Euclidean metric, but other metrics are also available. You can even provide a custom distance matrix (see below).
- When calling dendrogram, we provide the labels to be displayed at the leaf nodes.
- One can also use the color_threshold parameter of dendrogram to find clusters of a given height. If one sets color_threshold=h, then only clusters of heights at most h/2 are colored. For example, after setting color_threshold=2.0, you should see the following:
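Beyond reading clusters off the colored dendrogram, you can also extract them programmatically. Here is a small self-contained sketch (not part of the provided script; it assumes scipy is installed and uses the fcluster function, which is not mentioned above) showing how to cut a single-linkage dendrogram at a given distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# four 1-D points forming two well-separated pairs
data = np.array([[0.0], [1.0], [10.0], [11.0]])

# single-linkage clustering; Z has one row per merge (n - 1 rows in total)
Z = linkage(data, method='single')

# cut the dendrogram at distance 2.0: points at cophenetic distance
# at most 2.0 end up in the same cluster
labels = fcluster(Z, t=2.0, criterion='distance')
print(Z.shape)   # (3, 4)
```

Here fcluster returns one cluster label per point; the two tight pairs receive two different labels.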
Use this script to answer the questions in the Quiz. Some hints:
- One of the questions is about the iris dataset you have seen before. The dataset is relatively large, so you may want to increase the size of the figure using plt.figure(figsize=(x, y)), where x and y are the desired width and height in inches (this should be done before the dendrogram is created).
- The last question is about the file languages.csv, which encodes a graph with 19 nodes (each being a language, listed on lines 2–20 of the file) and the pairwise distances, given as a matrix (on the subsequent lines). Note that linkage accepts a distance matrix in a condensed 1-D format as input (see the documentation). Use the following function to transform the file languages.csv into such a condensed format:
def load_distance_matrix(fname):
    """
    Takes as input a name of a file containing the information about a graph:
    - first line: number of vertices
    - then labels, one per line
    - then the distance matrix, one row per line, the entries separated by commas
    Returns a tuple containing a distance matrix in the condensed 1-D format and a list of labels
    """
    with open(fname, 'r') as f:
        n = int(f.readline())
        labels = [f.readline().strip() for _ in range(n)]
        dist_matrix = [[float(x) for x in f.readline().split(',')] for _ in range(n)]
        return (squareform(dist_matrix), labels)
Note: In order to make this function work, you need to add the following additional import for the squareform function:
from scipy.spatial.distance import squareform
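To see what the condensed format actually looks like, here is a small self-contained sketch (for illustration only; the matrix below is made up, not taken from languages.csv):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# a made-up symmetric 3x3 distance matrix with a zero diagonal
dist_matrix = np.array([[0.0, 2.0, 4.0],
                        [2.0, 0.0, 6.0],
                        [4.0, 6.0, 0.0]])

# condensed 1-D format: the upper triangle, read row by row
condensed = squareform(dist_matrix)
print(condensed)  # [2. 4. 6.]

# given a condensed matrix, linkage treats the input as pairwise
# distances rather than point coordinates
Z = linkage(condensed, method='single')
print(Z.shape)    # (2, 4)
```

For n points, the condensed vector has n(n-1)/2 entries, here 3 for n = 3.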
2. Some good practices
2.1 Source and header files
It is good practice to split your project into several
compilation units. This improves modularity and clarity, and it
simplifies code reuse. Having a class declared and implemented in
separate files has several further advantages. In particular, it allows
separate compilation: Consider a program of 1 000 000 lines of C++ code. Written as a single file, say a.cpp, the whole program has to be recompiled every time a single line is changed:
$ g++ -o a a.cpp
(here and below, the dollar sign $ represents the Linux terminal prompt; it is not part of the command).
Suppose now that this program is split into three files: a.cpp (containing the main function), b.cpp, and c.cpp (the latter two containing definitions of classes used in a.cpp). We can then precompile the dependencies:
$ g++ -c b.cpp
$ g++ -c c.cpp
generating the corresponding object files b.o and c.o. When one line of code is changed in a.cpp, it is the only file that needs to be recompiled:
$ g++ -o a a.cpp b.o c.o
In a large project, this can save many hours on a single compilation, translating into many days or weeks over the project's lifetime.
Notice that it is not sufficient to simply break the large program into several source files! In addition, the compilation process has to be adapted accordingly, so as not to recompile the whole program as here:
$ g++ -o a a.cpp b.cpp c.cpp # This is bad!
Notice the use of source files instead of object files. To this end, we use the program make with the dependency structure defined in the Makefile. Integrated development environments often use other, similar approaches.
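For illustration, a minimal makefile for the a.cpp/b.cpp/c.cpp example above could look as follows (a generic sketch with hypothetical header names, not the Makefile shipped with the TD; recipe lines must start with a tab character):

```make
# Hypothetical makefile for the a.cpp/b.cpp/c.cpp example.
# Each .o is rebuilt only when its source (or a header it depends on) changes.
a: a.cpp b.o c.o
	g++ -o a a.cpp b.o c.o

b.o: b.cpp b.hpp
	g++ -c b.cpp

c.o: c.cpp c.hpp
	g++ -c c.cpp

clean:
	rm -f a b.o c.o
```

Running make after editing only a.cpp then reuses b.o and c.o instead of recompiling everything.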
Note 1: When your program is structured into multiple compilation units, the make file describing the compilation procedure becomes more complex. Check out this very clear tutorial on writing makefiles (or this one).
Note 2: Deciding which headers to include where and in what order is not trivial. Here are some good rules to follow:
- A header file must not make assumptions about other header files included alongside it. For example, you can test cloud.hpp and point.hpp by issuing the following two commands:
$ g++ -c point.hpp
$ g++ -c cloud.hpp
These should compile correctly, generating files point.hpp.gch and cloud.hpp.gch, respectively.
- In a .cpp file, always include the corresponding header first (e.g., #include "cloud.hpp" is the first line in cloud.cpp).
- In a .cpp file, always explicitly include all header files on which you know it depends, even if you know that some of these are indirectly included through others. For example, if a.cpp includes b.hpp, which in turn includes c.hpp, you might be tempted not to include c.hpp explicitly. However, there is no guarantee that a future change to b.hpp does not remove the include of c.hpp, breaking your assumptions.
- In almost all situations (except very few that you are not likely to encounter in the near future), it is a good idea to include #pragma once as the first line of all header files.
2.2 Getters and setters
The most common use case for access modifiers is to protect the object data from external modifications. For instance, consider the point class from TD3. Objects of this class have three fields: d, coords, and label. Although a change to any of them could negatively affect the correctness of the program, a change of the dimension could be particularly disastrous, potentially leading to segmentation faults.
A good practice consists in keeping the member fields private as much as possible and accessing them using the so-called getter and setter functions. These grant limited access to data from outside the object, and they provide control over how the values are updated.
Suppose that we make the static variable int d private and that we add the functions static int get_dim() and static bool set_dim(int _d) with the following intention:
- the getter function static int get_dim() returns the value of d;
- the setter function static bool set_dim(int _d) updates the value of d only if it has never been updated before, and it returns true if the update has been performed and false otherwise.
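The set-once behavior of the setter can be modeled in a few lines of Python (an illustration of the logic only; the exercise itself asks for C++ code in point.cpp, and using 0 as the "not yet set" sentinel is an assumption of this sketch):

```python
class Point:
    _d = 0  # models the private static variable d; 0 means "not set yet"

    @classmethod
    def get_dim(cls):
        # getter: expose the current value of the dimension
        return cls._d

    @classmethod
    def set_dim(cls, d):
        # setter: update d only if it has never been set before
        if cls._d != 0:
            return False
        cls._d = d
        return True

print(Point.set_dim(3))  # True: the first update succeeds
print(Point.set_dim(5))  # False: the dimension is already fixed
print(Point.get_dim())   # 3
```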
Exercise 1
Complete these two functions in point.cpp. You can test your code by first running the command make grader and then the command ./grader 1. Then upload your file point.cpp here:
3. Single-linkage clustering
Now we move to implementing the clustering algorithm and related functionality.
3.1 Creating a graph
In the file graph.hpp you will find the declaration of the class graph. It contains the following attributes:
- long n: the number of vertices.
- long size: the number of edges.
- edge *edges: an array of the edges of the graph. An edge consists of its two endpoints as well as the distance between these two points (see also edge.hpp). The array is assumed to be sorted by non-decreasing edge lengths.
- std::string *node_names: an array of the names of the nodes in the graph.
- long iterator_pos: the position of the iterator, which we will use later.
Exercise 2
We start with implementing two constructors and one destructor. Before you implement them though, please read until the end of this subsection.
- The constructor graph(const cloud &_c) takes a cloud of points as input. It should initialize n and node_names from the information in the cloud (see cloud.hpp for the class description). The value of size should be computed based on n. After this, the constructor should build the array edges and sort it. The array should consist of all the possible pairs of the points (with their respective distances). In order to sort the edges by length, use the following version of the std::sort function:
std::sort(edges, edges + size, edge::compare); // requires #include <algorithm>
Its third parameter is the static method edge::compare, which is used for comparison while sorting.
- Similarly, the constructor graph(long n, std::string node_names[], double **dist_matrix) should construct a graph with n vertices from a list of n names and from an \(n\times n\) matrix dist_matrix of the distances of all pairs of points. Do not forget to sort the edges at the end.
- The destructor ~graph() should free the memory occupied by the arrays for the edges and the names.
Before you implement the methods above, you are encouraged to look at edge.cpp, especially at the edge::compare method and the constructor of the edge class.
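If it helps, here is the combinatorial content of the first constructor in miniature, as a Python sketch (illustration only, not the required C++): build all n(n-1)/2 pairwise edges of a small 1-D cloud and sort them by length.

```python
# five 1-D points, as in the figure of Section 3.3 further below
coords = [0.0, 1.0, 3.0, 4.0, 9.0]
n = len(coords)

# one edge per unordered pair of points: size = n * (n - 1) / 2
edges = [(abs(coords[p] - coords[q]), p, q)
         for p in range(n) for q in range(p + 1, n)]

# sort by non-decreasing length, mirroring std::sort with edge::compare
edges.sort()
print(len(edges))  # 10
print(edges[0])    # (1.0, 0, 1): a shortest edge comes first
```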
You are also asked to implement the getter long get_num_edges(), providing access to the field size.
Similar to before, you can test your code by first running make grader and then ./grader 2. Upload your file graph.cpp here:
3.2 Iterating over the edges
For the subsequent exercises, we will need to iterate over the array of all edges of the graph. To this end, we implement a simple iterator: a private field iterator_pos of the class graph indicates the next edge to process.
Exercise 3
Implement the following methods:
- The method void start_iteration() is intended to initialize the iterator by setting the field iterator_pos to 0.
- The method edge *get_next() returns the edge currently referred to by iterator_pos and advances this field to the next index in the array of edges. It returns NULL if the end of the array has been reached.
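The intended behavior of the two methods can be modeled in Python as follows (illustration only; the C++ version advances the field iterator_pos over the array edges in the same way, returning NULL instead of None):

```python
class EdgeIterator:
    def __init__(self, edges):
        self.edges = edges
        self.pos = 0          # models the field iterator_pos

    def start_iteration(self):
        self.pos = 0

    def get_next(self):
        # return None (NULL in C++) once the end of the array is reached
        if self.pos >= len(self.edges):
            return None
        e = self.edges[self.pos]
        self.pos += 1
        return e

it = EdgeIterator(['e0', 'e1'])
it.start_iteration()
print(it.get_next())  # e0
print(it.get_next())  # e1
print(it.get_next())  # None
```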
The command ./grader 3 will run the tests for this exercise (after make grader).
Upload your file graph.cpp here:
3.3 Hierarchical clustering with union–find: overview
Recall from the lecture (p. 65 in the booklet) that the clustering algorithm iterates over the edges provided by the class graph above in the following way:
- If the next edge connects 2 points from different clusters \((C,C')\), it triggers a merge (\(C\) and \(C'\) become the single cluster \(C \cup C'\)).
- Otherwise, the edge is ignored.
The data structure managing the clusters is provided by the class dendrogram declared in dendrogram.hpp. It follows the union–find paradigm, with an additional tree structure for top-down exploration.
Each cluster is represented by one of the points it contains. Whenever, as above, two clusters \((C, C')\) are joined, one of the clusters, say \(C'\), is merged into the other, here \(C\). The representative of \(C\) becomes that of \(C \cup C'\). A link is established from the representative of \(C'\) to that of \(C\) to keep track of the merge. In order to not chain too many links, we rely on the ranks of the clusters, which are the ranks of their representatives. The rank of a node representing a point in the resulting data structure is the depth of the sub-tree rooted at it, formed by the links.
As mentioned above, in addition to the union–find structure, our dendrogram data structure contains a tree that is used for top-down exploration. To this end, in each node, we record a down pointer to its rightmost child (i.e., the one that was merged last) and a left pointer to its next sibling.
The figure below illustrates the steps of the clustering algorithm for a cloud of five 1-dimensional points (0, 1, 3, 4 and 9) as well as the resulting dendrogram. The ranks of all points at every step are shown in red. A blue number, when present, indicates the height in the dendrogram at which the node/cluster is merged into its parent, i.e., the length of the edge that triggered the merge divided by 2. The black edges show the parent relation in the union–find structure, whereas the orange edges show the down and left relations in the tree for the top-down exploration.
The class dendrogram is declared in dendrogram.hpp. It contains the elements described above, defined as arrays:
int *parent;
int *rank;
int *left;
int *down;
double *height;
For instance, the variables of the structure in the last line of the figure above look as follows (recall that the array indices start at 0):
i == { 0, 1, 2, 3, 4} // Not a field -- only given here for the sake of clarity
----------------------------------
parent == { 1, 3, 3, -1, 3}
rank == { 0, 1, 0, 2, 0}
left == { -1, 2, -1, -1, 1}
down == { -1, 0, -1, 4, -1}
height == {0.5, 1, 0.5, -1, 2.5}
3.4 Finding the representative
For each node i, one can find the node representing the cluster containing i by following the edge to the parent (that is, from i to parent[i], etc.) until the parent is \(-1\). (We do not use any path compression here.)
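Using the example arrays from the figure above, the parent-chasing strategy can be checked in a few lines of Python (illustration only; the exercise asks for the C++ method):

```python
# the parent array from the last line of the figure above (-1 marks a root)
parent = [1, 3, 3, -1, 3]

def find(i):
    # follow parent links until the parent is -1; no path compression
    while parent[i] != -1:
        i = parent[i]
    return i

print(find(0))  # 3: node 0 -> node 1 -> node 3
print(find(3))  # 3: node 3 is its own representative
```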
Exercise 4
Use the strategy just explained to implement (in dendrogram.cpp) the method int find(int i), which should return the representative of the cluster containing i.
You can test your code by running ./grader 4 (after running make grader).
Upload your file dendrogram.cpp here:
3.5 Merging clusters
Now we come to the core part of the algorithm: merging clusters. To this end, we use the method void merge(edge *e), which should merge the clusters containing the nodes of the edge e. Keep in mind that:
- the two nodes of the edge need not be the representatives of their clusters, so you have to find those first (using the method find you have implemented earlier).
- the two nodes of the edge can already be in the same cluster. If this is the case, then no action is required.
- when necessary, merge must update all five fields above (parent, rank, left, down, and height).
- in order to avoid deep trees, take as the new representative node the one of the highest rank. If the ranks of the two representatives are equal, take the one with the larger index (the built-in tests of the function merge() will not pass if this choice is made differently, even though it is not important for the algorithm).
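To make these rules concrete, here is a Python sketch (illustration only, not the required C++) that replays the four merges from the figure above and reproduces exactly the arrays shown in Section 3.3. The loop at the end plays the role of the method build() from Section 3.6: iterate over the sorted edges and merge. For brevity, only the edges that actually trigger merges are listed; in build(), merge() simply ignores edges within a cluster.

```python
n = 5
parent = [-1] * n
rank   = [0] * n
left   = [-1] * n
down   = [-1] * n
height = [-1.0] * n

def find(i):
    # follow parent links to the representative (no path compression)
    while parent[i] != -1:
        i = parent[i]
    return i

def merge(p, q, length):
    rp, rq = find(p), find(q)
    if rp == rq:
        return  # already in the same cluster: nothing to do
    # make rq the new representative: highest rank, ties broken by larger index
    if rank[rp] > rank[rq] or (rank[rp] == rank[rq] and rp > rq):
        rp, rq = rq, rp
    if rank[rp] == rank[rq]:
        rank[rq] += 1
    parent[rp] = rq
    height[rp] = length / 2   # height at which rp joins its parent
    left[rp] = down[rq]       # previous rightmost child becomes the next sibling
    down[rq] = rp             # rp is now the rightmost (last merged) child

# the edges that trigger merges for the cloud {0, 1, 3, 4, 9}, in sorted order
for p, q, length in [(0, 1, 1.0), (2, 3, 1.0), (1, 2, 2.0), (3, 4, 5.0)]:
    merge(p, q, length)

print(parent)  # [1, 3, 3, -1, 3]
print(rank)    # [0, 1, 0, 2, 0]
print(down)    # [-1, 0, -1, 4, -1]
```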
Exercise 5
Implement the method merge in dendrogram.cpp. Run the command ./grader 5 to test your solution (after running make grader).
Upload your file dendrogram.cpp here:
3.6 Combining everything
Now we implement the clustering algorithm, following the approach outlined above: iterate over the edges and merge existing clusters.
Exercise 6
In dendrogram.cpp, implement the method void build(), which should construct a dendrogram structure as sketched in the figures above and explained in the accompanying text.
After running make grader, you can test your code via ./grader 6.
Upload your file dendrogram.cpp here:
4. Clusters
Now that the dendrogram is built, we want to get some information about the clusters that it contains.
4.1 Identifying the clusters below a given height
Our next goal is to allow splitting the points into clusters at a fixed height h (called \(C_h(p)\) in the notation of slide 78). The corresponding method void dendrogram::set_clusters(double h) should eventually mark, for each point, to which cluster it belongs. To this end, we make use of the int array clusters inside of the dendrogram class. The algorithm should do the following:
- For each point (with index i), set its label (that is, clusters[i]) to be the index of the node representing the cluster that the point with index i belongs to.
- Set the attribute int total_clusters to be equal to the number of clusters obtained this way.
This method is already implemented, but it relies on the auxiliary method int dendrogram::set_clusters(int i, double h), which you need to implement. For a given point with index i, the method should compute clusters[i], assuming that the parent of i already knows its cluster. Afterward, it should run recursively on the left and the down nodes of i. It should also increment total_clusters if necessary (which is the case if node i is the root of a cluster).
An example
For the figure above, at height 0.5, we have two non-singleton clusters—nodes 0 (point 0) and 1 (point 1) as well as nodes 2 (point 3) and 3 (point 4)—represented by nodes 1 and 3, respectively. Thus:
clusters == {1, 1, 3, 3, 4}
Similarly, at height 1, we have one non-singleton cluster, represented by node 3 (point 4), which is the result of a merge of the two clusters above. Hence,
clusters == {3, 3, 3, 3, 4}
For the discussion below, let us take h = 0.5 for the dendrogram in the above figure.
- If a node has a parent and joins it at a height no greater than h, it belongs to the same cluster. For example, node 2 (point 3) has the height 0.5, so it belongs to the same cluster as node 3 (point 4).
- Otherwise, if a node does not have a parent (e.g., node 3 (point 4)) or if it joins its parent at a height greater than h (e.g., node 4 (point 9)), then it does not belong to the same cluster. Since we are exploring the tree top-down, this node is necessarily the representative of a new cluster (in the case of node 4, this turns out to be a singleton cluster, which we will ignore subsequently).
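The top-down labeling can be traced on this example in Python (illustration only, using the final arrays from the figure of Section 3.3):

```python
# dendrogram arrays from the last line of the figure (Section 3.3)
parent = [1, 3, 3, -1, 3]
left   = [-1, 2, -1, -1, 1]
down   = [-1, 0, -1, 4, -1]
height = [0.5, 1.0, 0.5, -1.0, 2.5]

clusters = [-1] * 5
total_clusters = 0

def set_clusters(i, h):
    global total_clusters
    if i == -1:
        return
    if parent[i] == -1 or height[i] > h:
        clusters[i] = i                    # root of a new cluster
        total_clusters += 1
    else:
        clusters[i] = clusters[parent[i]]  # the parent already knows its cluster
    set_clusters(down[i], h)               # explore the children...
    set_clusters(left[i], h)               # ...and then the siblings

set_clusters(3, 0.5)   # start at the root, with h = 0.5
print(clusters)        # [1, 1, 3, 3, 4]
print(total_clusters)  # 3
```

This reproduces the clusters {1, 1, 3, 3, 4} listed above: two non-singleton clusters plus the singleton of node 4.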
Exercise 7
Implement int dendrogram::set_clusters(int i, double h) as explained above. You can test your code by running ./grader 7 (after running make grader).
Upload your file dendrogram.cpp here:
4.2 Counting non-singleton clusters
We continue using the terminology defined in Section 4.1 (the attribute clusters with the cluster labels).
Exercise 8
Implement (in the file dendrogram.cpp) the method int dendrogram::_count_ns_clusters(), which should count the number of non-singleton clusters.
Test your code via ./grader 8 (after running make grader).
Upload your file dendrogram.cpp here:
4.3 Height of a cluster
We now determine the height of a cluster.
Exercise 9
In the file dendrogram.cpp, implement the method double dendrogram::get_cluster_height(int cluster), which should determine the height in the dendrogram of the cluster, assuming that the array clusters is already initialized via set_clusters(double h) (from Exercise 7) and that the dendrogram is built. For cluster indices of nodes that are only representatives of a singleton cluster, this function should return 0. Note that such nodes are the ones that do not have children.
As an example, in the figure above, point 0 (node 0) is only ever the representative of its own cluster, hence, its height is 0. In contrast, point 1 (node 1) is, at the end, the representative of the cluster containing points 0 and 1, hence, its height is 0.5 (although it is also the representative of the singleton cluster that only contains point 1, which we ignore here). Similarly, the height of point 4 (node 3) is 1, although it is also the representative of two lower clusters (namely, the one containing points 3 and 4 (with height 0.5) as well as the one containing only point 4).
Recall that, by design, nodes at the same level of the data structure that represents all merged clusters have a non-decreasing height from left to right (and each node stores a pointer to its left node, if it exists). Further, a node stores the height of the cluster that it got merged into. That is, a node does not store the height of the cluster it represents itself. Instead, this height is stored in one of its children, the rightmost of which is accessed via down.
You can test your code via ./grader 9 (after running make grader).
Upload your file dendrogram.cpp here:
5. Running the clustering algorithm
Once you have done all the exercises above, you can compile a C++ clustering tool with make test-dendrogram. This command produces a binary test-dendrogram. You can use it with:
- csv files containing clouds of points, for example: ./test-dendrogram ./csv/iris.csv 4 150
- csv files containing graphs encoded by a distance matrix, for example: ./test-dendrogram ./csv/languages.csv
The tool prints lists of clusters at various depths.