
INF442 - Agglomerative hierarchical clustering

Context

Structure

This TD is structured as follows:

Before you begin

Download and unzip the archive INF442-td4-1-handin.zip.

It contains:

1. (Quiz) Data analysis

In the quiz this week, you are asked to perform hierarchical clustering on several datasets—some synthetic, others from the real world. To this end, we use the libraries pandas (which you have already seen) and scipy. They can be installed using the following command in a terminal:

$ pip3 install --user --only-binary=:all: pandas scipy PyQt5

To perform the clustering, we provide you with a Python script named hierarchical_clustering.py, located in the quiz folder. It uses the following imports:

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

We use pandas for reading and manipulating data, matplotlib for generating plots, and scipy for performing the clustering.

The script currently reads in the file ../csv/bluered.csv, which contains 22 points with coordinates x and y labeled by integers from 1 to 22, as shown below:

The script then performs hierarchical clustering using the linkage function from scipy. Afterward, it generates and plots the resulting dendrogram, using the dendrogram function:

tree = linkage(data[['x', 'y']])
D = dendrogram(tree, labels = data['name'].to_numpy(), orientation = 'left')
plt.show()

When running the script (from the folder quiz), you should see the following picture:

You are encouraged to look at the official documentation for linkage and dendrogram. We would like to highlight the following aspects:

Use this script to answer the questions in the Quiz. Some hints:

def load_distance_matrix(fname):
    """
    Takes as input the name of a file containing information about a graph:
      - first line: number of vertices
      - then labels, one per line
      - then the distance matrix, one row per line, the entries separated by commas
    Returns a tuple containing a distance matrix in the condensed 1-D format and a list of labels
    """
    with open(fname, 'r') as f:
        n = int(f.readline())
        labels = [f.readline().strip() for _ in range(n)]
        dist_matrix = [[float(x) for x in f.readline().split(',')] for _ in range(n)]
        return (squareform(dist_matrix), labels)

Note: In order to make this function work, you need to add the following additional import for the squareform function:

from scipy.spatial.distance import squareform

2. Some good practices

2.1 Source and header files

It is good practice to split your project into several compilation units. This improves modularity and clarity, and it simplifies code reuse. Having a class declared and implemented in separate files has several further advantages. In particular, it allows separate compilation: Consider a program of 1 000 000 lines of C++ code. Written as a single file, say a.cpp, the whole program has to be recompiled every time a single line is changed:

$ g++ -o a a.cpp

(here and below, the dollar sign $ represents the Linux terminal prompt—it is not part of the command).

Suppose now that this program is split into three files: a.cpp (containing the main function), b.cpp, and c.cpp (the latter two containing definitions of classes used in a.cpp). We can then precompile the dependencies:

$ g++ -c b.cpp
$ g++ -c c.cpp

generating the corresponding object files b.o and c.o. When one line of code is changed in a.cpp, it is the only file that needs to be recompiled:

$ g++ -o a a.cpp b.o c.o

In a large project, this can save many hours on a single compilation, which adds up to days or weeks over the project's lifetime.

Notice that it is not sufficient to simply break the large program into several source files! In addition, the compilation process has to be adapted accordingly, so as not to recompile the whole program as here:

$ g++ -o a a.cpp b.cpp c.cpp    # This is bad!

Notice the use of source files instead of the object files in this bad example. To manage the dependencies and recompile only what is needed, we use the program make with the dependency structure defined in the Makefile. Integrated development environments often use similar mechanisms.

Note 1: When your program is structured into multiple compilation units, the Makefile describing the compilation procedure becomes more complex. Check out this very clear tutorial on writing makefiles (or this one).

Note 2: Deciding which headers to include where and in what order is not trivial. Here are some good rules to follow:

  • A header file must not make assumptions about the other header files included alongside it. For example, you can check that cloud.hpp and point.hpp each compile on their own by issuing the following two commands:
$ g++ -c point.hpp
$ g++ -c cloud.hpp

These should compile correctly, generating files point.hpp.gch and cloud.hpp.gch, respectively.

  • In a .cpp file, always include the corresponding header first (e.g., #include "cloud.hpp" is the first line in cloud.cpp).

  • In a .cpp file, always explicitly include all header files that it depends on, even if you know that some of them are indirectly included through others. For example, if a.cpp includes b.hpp, which in turn includes c.hpp, you might be tempted not to include c.hpp explicitly. However, there is no guarantee that a future change to b.hpp will not remove the include of c.hpp, breaking your assumption.

  • In almost all situations (except a few that you are unlikely to encounter in the near future), it is a good idea to include

#pragma once

as the first line of all header files (see the sketch below).
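For illustration, here is a minimal sketch of how two header files could be laid out according to these rules. It is not the content of the provided point.hpp and cloud.hpp, only the general shape:

// point.hpp -- illustrative sketch, not the provided file
#pragma once

class point {
    // ... fields and methods ...
};

// cloud.hpp -- illustrative sketch, not the provided file
#pragma once

#include "point.hpp"   // include what you use, even if another header already pulls it in

class cloud {
    // ... uses the class point ...
};
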
2.2 Getters and setters

The most common use case for access modifiers is to protect the object data from external modifications. For instance, consider the point class from TD3. Objects of this class have three fields: d, coords, and label. Although a change to any of them could negatively affect the correctness of the program, a change of the dimension could be particularly disastrous, potentially leading to segmentation faults.

A good practice is to keep member fields private as much as possible and to access them through so-called getter and setter functions. These grant limited access to the data from outside the object, and they provide control over how the values are updated.

Suppose that we make the static variable int d private and that we add the functions static int get_dim() and static bool set_dim(int _d) with the following intention:
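As an illustration of the pattern, here is a minimal sketch of such a pair of functions. It assumes, for concreteness, that the dimension must be positive and may be set only once; the exact intended behaviour is described in the handin and may differ:

// Illustrative sketch of the getter/setter pattern for a static field.
// The success condition of set_dim below is an assumption.
class point {
    static int d;                         // the dimension, now private

public:
    static int get_dim() { return d; }    // read-only access to the dimension

    static bool set_dim(int _d) {
        if (_d <= 0 || d != 0)
            return false;                 // reject invalid or repeated initialisation (assumed policy)
        d = _d;
        return true;                      // report that the update was applied
    }
};

int point::d = 0;                         // a static member is defined once, outside the class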

Exercise 1

Complete these two functions in point.cpp. You can test your code by first running the command make grader and then the command ./grader 1. Then upload your file point.cpp here:


3. Single-linkage clustering

Now we move to implementing the clustering algorithm and related functionality.

3.1 Creating a graph

In the file graph.hpp you will find the declaration of the class graph. It contains the following attributes:

Exercise 2

We start by implementing two constructors and one destructor. Before you implement them, please read until the end of this subsection.

In particular, you are encouraged to look at edge.cpp, especially at the edge::compare method and the constructor of the edge class.

You are also asked to implement the getter long get_num_edges(), providing access to the field size.
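For instance, a minimal version of this getter (with size being the field mentioned above) could simply be:

// Sketch of the getter: expose the number of edges stored in the field size.
long graph::get_num_edges() {
    return size;
}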

As before, you can test your code by first running make grader and then ./grader 2.

Upload your file graph.cpp here:


3.2 Iterating over the edges

For the subsequent exercises, we will need to iterate over the array of all edges of the graph. To this end, we implement a simple iterator: a private field iterator_pos of the class graph indicates the next edge to process.
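The exact methods to implement are declared in graph.hpp. As a rough sketch of the usual pattern, with hypothetical method names and assuming the edges are stored in an array edges of size pointers (both assumptions, not the declared interface), iteration could look like this:

// Hypothetical sketch of an edge iterator; only iterator_pos is taken from the text.
void graph::start_iteration() {
    iterator_pos = 0;                // the next edge to process is the first one
}

edge *graph::get_next_edge() {
    if (iterator_pos >= size)
        return nullptr;              // all edges have been processed
    return edges[iterator_pos++];    // hand out the current edge and advance the cursor
}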

Exercise 3

Implement the following methods:

The command ./grader 3 runs the tests for this exercise (after make grader).

Upload your file graph.cpp here:


3.3 Hierarchical clustering with union–find: overview

Recall from the lecture (p. 65 in the booklet) that the clustering algorithm iterates over the edges provided by the class graph above in the following way:

The data structure managing the clusters is provided by the class dendrogram declared in dendrogram.hpp. It follows the union–find paradigm, with an additional tree structure for top-down exploration.

Each cluster is represented by one of the points it contains. Whenever, as above, two clusters \((C, C')\) are joined, one of the clusters, say \(C'\), is merged into the other, here \(C\). The representative of \(C\) becomes that of \(C \cup C'\), and a link is established from the representative of \(C'\) to that of \(C\) to keep track of the merge. To avoid chaining too many links, we rely on the ranks of the clusters, which are the ranks of their representatives: the rank of a node in the resulting data structure is the depth of the subtree rooted at it, formed by these links.

As mentioned above, in addition to the union–find structure, our dendrogram data structure contains a tree that is used for top-down exploration. To this end, in each node, we record a down pointer to its rightmost child (i.e., the one that was merged last) and a left pointer to its next sibling.

The figure below illustrates the steps of the clustering algorithm for a cloud of five 1-dimensional points (0, 1, 3, 4 and 9) as well as the resulting dendrogram. The ranks of all points at every step are shown in red. A blue number, when present, indicates the height in the dendrogram at which the node/cluster is merged into its parent, i.e., the length of the edge that triggered the merge divided by 2. The black edges show the parent relation in the union–find structure, whereas the orange edges show the down and left relations in the tree for the top-down exploration.

The class dendrogram is declared in dendrogram.hpp. It contains the elements described above defined as arrays:

int *parent;
int *rank;
int *left;
int *down;
double *height;

For instance, the variables of the structure in the last line of the figure above look as follows (recall that the array indices start at 0):

i      == {  0,  1,   2,   3,   4} // Not a field -- only given here for the sake of clarity
----------------------------------
parent == {  1,  3,   3,  -1,   3}
rank   == {  0,  1,   0,   2,   0}
left   == { -1,  2,  -1,  -1,   1}
down   == { -1,  0,  -1,   4,  -1}
height == {0.5,  1, 0.5,  -1, 2.5}

3.4 Finding the representative

For each node i, one can find the node representing the cluster containing i by following the edge to the parent (that is, from i to parent[i] etc.) until the parent is \(-1\). (We do not use any path compression here.)

Exercise 4

Use the strategy just explained to implement (in dendrogram.cpp) the method int find(int i), which should return the representative of the cluster containing i.
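A minimal sketch that follows this description, climbing the parent links until reaching \(-1\) and performing no path compression, could look as follows (the field name parent is the one declared in dendrogram.hpp):

// Follow the parent links up to the root of the union-find tree.
int dendrogram::find(int i) {
    while (parent[i] != -1)
        i = parent[i];
    return i;                        // the representative of the cluster containing i
}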

You can test your code by running ./grader 4 (after running make grader).

Upload your file dendrogram.cpp here:


3.5 Merging clusters

Now we come to the core part of the algorithm: merging clusters. To this end, we use the method void merge(edge *e), which should merge the clusters containing the nodes of the edge e. Keep in mind that:

Exercise 5

Implement the method merge in dendrogram.cpp. Run the command ./grader 5 to test your solution (after running make grader).
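For orientation, here is a sketch of the bookkeeping described in Section 3.3. It is only an outline under assumptions: the edge accessors get_p1(), get_p2() and get_length() are hypothetical names (check edge.hpp for the actual interface), and the tie-breaking convention for equal ranks is a guess:

// Sketch only -- the edge accessor names and the tie-breaking rule are assumptions.
void dendrogram::merge(edge *e) {
    int r1 = find(e->get_p1());          // representative of the first endpoint
    int r2 = find(e->get_p2());          // representative of the second endpoint
    if (r1 == r2)
        return;                          // both endpoints already lie in the same cluster

    // Union by rank: the lower-rank representative is merged into the other one.
    if (rank[r1] > rank[r2]) {
        int tmp = r1;
        r1 = r2;
        r2 = tmp;                        // from here on, r2 is the surviving representative
    }
    if (rank[r1] == rank[r2])
        rank[r2]++;                      // equal ranks: the surviving rank grows by one

    parent[r1] = r2;                     // union-find link
    height[r1] = e->get_length() / 2.0;  // merge height = length of the edge divided by 2
    left[r1] = down[r2];                 // the previous rightmost child becomes the left sibling
    down[r2] = r1;                       // r1 is now the rightmost (most recently merged) child
}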

Upload your file dendrogram.cpp here:


3.6 Combining everything

Now we implement the clustering algorithm, following the approach outlined above: iterate over the edges and merge existing clusters.

Exercise 6

In dendrogram.cpp, implement the method void build(), which should construct a dendrogram structure as sketched in the figures above and explained in the accompanying text.
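As a rough outline only: assuming the graph hands out its edges in order of increasing length and that the dendrogram holds a pointer g to its graph (the member and iterator names below are the hypothetical ones from Section 3.2, not necessarily those in dendrogram.hpp and graph.hpp), build boils down to:

// Sketch: iterate over the edges and merge the clusters of their endpoints.
void dendrogram::build() {
    g->start_iteration();
    edge *e;
    while ((e = g->get_next_edge()) != nullptr)
        merge(e);
}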

After running make grader, you can test your code via ./grader 6.

Upload your file dendrogram.cpp here:


4. Clusters

Now that the dendrogram is built, we want to get some information about the clusters that it contains.

4.1 Identifying the clusters below a given height

Our next goal is to allow splitting the points into clusters at a fixed height h (called \(C_h(p)\) in the notation of slide 78). The corresponding method void dendrogram::set_clusters(double h) should eventually record, for each point, the cluster to which it belongs. To this end, we make use of the int array clusters inside the dendrogram class. The algorithm should do the following:

This method is already implemented, but it relies on the auxiliary method int dendrogram::set_clusters(int i, double h), which you need to implement. For a given point with index i, the method should compute clusters[i], assuming that the parent of i already knows its cluster. Afterward, it should run recursively on the left and the down node of i. It should also increment total_clusters if necessary (which is the case if node i is the root of a cluster).

An example

For the figure above, at height 0.5, we have two non-singleton clusters—nodes 0 (point 0) and 1 (point 1) as well as nodes 2 (point 3) and 3 (point 4)—represented by nodes 1 and 3, respectively. Thus:

clusters == {1, 1, 3, 3, 4}

Similarly, at height 1, we have one non-singleton cluster, represented by node 3 (point 4), which is the result of a merge of the two clusters above. Hence,

clusters == {3, 3, 3, 3, 4}
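Putting the rule illustrated by these examples into code, a sketch of the recursive helper could look as follows. It assumes that a point belongs to its parent's cluster exactly when it was merged at height at most h, and that a cluster is labelled by the index of its root; the intended meaning of the return value is not specified in this text, so the sketch simply returns the running total:

// Sketch under the assumptions stated above (field names as in dendrogram.hpp).
int dendrogram::set_clusters(int i, double h) {
    if (i == -1)
        return total_clusters;                 // nothing to do for a missing child

    if (parent[i] != -1 && height[i] <= h) {
        clusters[i] = clusters[parent[i]];     // merged below the cut: inherit the parent's cluster
    } else {
        clusters[i] = i;                       // merged above the cut (or root): i starts a new cluster
        total_clusters++;
    }

    set_clusters(left[i], h);                  // continue with the next sibling ...
    set_clusters(down[i], h);                  // ... and with the rightmost child
    return total_clusters;
}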

For the discussion below, let us take h = 0.5 for the dendrogram in the above figure.

Exercise 7

Implement int dendrogram::set_clusters(int i, double h) as explained above. You can test your code by running ./grader 7 (after running make grader).

Upload your file dendrogram.cpp here:


4.2 Counting non-singleton clusters

We continue using the terminology defined in Section 4.1 (the attribute clusters with the cluster labels).

Exercise 8

Implement (in the file dendrogram.cpp) the method int dendrogram::_count_ns_clusters(), which should count the number of non-singleton clusters.
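One simple (if quadratic) way to do this, sketched below, is to count the cluster representatives whose label appears at least twice in the array clusters; for the example of Section 4.1 at h = 0.5, this gives 2. The name n for the field holding the number of points is an assumption:

// Sketch: count representatives whose cluster contains at least one other point.
int dendrogram::_count_ns_clusters() {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (clusters[i] != i)
            continue;                    // only cluster representatives are of interest
        for (int j = 0; j < n; j++) {
            if (j != i && clusters[j] == i) {
                count++;                 // some other point shares i's cluster: non-singleton
                break;
            }
        }
    }
    return count;
}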

Test your code via ./grader 8 (after running make grader).

Upload your file dendrogram.cpp here:


4.3 Height of a cluster

We now determine the height of a cluster.

Exercise 9

In the file dendrogram.cpp, implement the method double dendrogram::get_cluster_height(int cluster), which should determine the height of the given cluster in the dendrogram, assuming that the array clusters has already been initialized via set_clusters(double h) (from Exercise 7) and that the dendrogram has been built. For the cluster index of a node that only ever represents a singleton cluster, this function should return 0; note that such nodes are the ones that do not have children.

As an example, in the figure above, point 0 (node 0) is only ever the representative of its own cluster, hence, its height is 0. In contrast, point 1 (node 1) is, at the end, the representative of the cluster containing points 0 and 1, hence, its height is 0.5 (although it is also the representative of the singleton cluster that only contains point 1, which we ignore here). Similarly, the height of point 4 (node 3) is 1, although it is also the representative of two lower clusters (namely, the one containing points 3 and 4 (with height 0.5) as well as the one containing only point 4).

Recall that, by design, nodes at the same level of the data structure that represents all merged clusters have a non-decreasing height from left to right (and each node stores a pointer to its left node, if it exists). Further, a node stores the height of the cluster that it got merged into. That is, a node does not store the height of the cluster it represents itself. Instead, this height is stored in one of its children, the rightmost of which is accessed via down.
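Following these hints, one plausible sketch is to start at down[cluster], walk the left links until reaching a child that still belongs to this cluster after the cut, and return that child's stored height (or 0 if there is no such child, i.e., for singleton clusters). This reading of the hints is an assumption, not necessarily the expected implementation:

// Sketch under the assumptions stated above (field names as in dendrogram.hpp).
double dendrogram::get_cluster_height(int cluster) {
    int child = down[cluster];
    while (child != -1 && clusters[child] != cluster)
        child = left[child];             // skip children that were merged above the cut
    if (child == -1)
        return 0;                        // no child below the cut: a singleton cluster
    return height[child];                // height of the last merge inside this cluster
}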

You can test your code via ./grader 9 (after running make grader).

Upload your file dendrogram.cpp here:


5. Running the clustering algorithm

Once you have completed all of the exercises above, you can compile a C++ clustering tool with make test-dendrogram. This command produces a binary test-dendrogram, which you can use as follows:

./test-dendrogram ./csv/iris.csv 4 150
./test-dendrogram ./csv/languages.csv

The tool prints lists of clusters at various depths.