INF442 - Agglomerative hierarchical clustering
Context
- We will apply a hierarchical clustering algorithm to data and interpret the results
- We will practice structuring a project into separate compilation units
- We will implement the single-linkage hierarchical clustering algorithm, using the union–find data structure, which you have seen in INF321 and INF411
Structure
This TD is structured as follows:
- Section 1 describes how to run hierarchical clustering algorithms using scipy, to prepare you for working with data for the quiz
- Section 2 presents some good practices for structuring a large C++ project, such as division into compilation units and using getters/setters (there is a practical exercise for the latter)
- Section 3 guides you through implementing a single-linkage clustering algorithm
- Section 4 describes how to retrieve useful information from the cluster data computed in Section 3
- Section 5 describes how you can collect the code you wrote into a C++ clustering tool
Before you begin
Download and unzip the archive INF442-td4-1-handin.zip. It contains:
- source files, in particular the files point.cpp, graph.cpp, and dendrogram.cpp, which you will be asked to modify,
- a Makefile that you can use to compile the various tests (see the following sections),
- a csv folder with the data used in the TD, and
- a quiz folder with a starter script in Python, for the quiz.
1. (Quiz) Data analysis
In the quiz this week, you are asked to perform hierarchical clustering on several datasets, some synthetic, others from the real world. To this end, we use the libraries pandas (which you have already seen) and scipy. They can be installed using the following command in a terminal:
$ pip3 install --user --only-binary=:all: pandas scipy PyQt5
To perform the clustering, we provide you with a Python script named hierarchical_clustering.py, located in the quiz folder. It uses the following imports:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
We use pandas for reading and manipulating data, matplotlib for generating plots, and scipy for performing the clustering.
The script currently reads in the file ../csv/bluered.csv, which contains 22 points with coordinates x and y, labeled by integers from 1 to 22, as shown below:
The script then performs hierarchical clustering using the linkage function from scipy. Afterward, it generates and plots the resulting dendrogram, using the dendrogram function:
tree = linkage(data[['x', 'y']])
D = dendrogram(tree, labels = data['name'].to_numpy(), orientation = 'left')
plt.show()
When running the script (from the folder quiz), you should see the following picture:
You are encouraged to look at the official documentation for linkage and dendrogram. We would like to highlight the following aspects:
- Important: The x-coordinate corresponds to the height of the cluster multiplied by two. So, if you are interested, for example, in the clusters at height 1.0, you should look at the part of the dendrogram with the x-coordinate being at most 2.0.
- The linkage function is called by default with method='single', which tells it to use single linkage, but other methods are implemented as well.
- By default, the linkage function equips the data with the Euclidean metric, but other metrics are also available. You can even provide a custom distance matrix (see below).
- When calling dendrogram, we provide the labels to be displayed at the leaf nodes.
- One can also use the color_threshold parameter of dendrogram to find clusters of a given height. If one sets color_threshold=h, then only clusters of heights at most h/2 are colored. For example, after setting color_threshold=2.0, you should see the following:
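Beyond reading clusters off the colored dendrogram, you can also extract them programmatically. Here is a small self-contained sketch (not part of the provided script; it assumes scipy is installed and uses the fcluster function, which is not mentioned above) showing how to cut a single-linkage dendrogram at a given distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# four 1-D points forming two well-separated pairs
data = np.array([[0.0], [1.0], [10.0], [11.0]])

# single-linkage clustering; Z has one row per merge (n - 1 rows in total)
Z = linkage(data, method='single')

# cut the dendrogram at distance 2.0: points at cophenetic distance
# at most 2.0 end up in the same cluster
labels = fcluster(Z, t=2.0, criterion='distance')
print(Z.shape)   # (3, 4)
```

Here fcluster returns one cluster label per point; the two tight pairs receive two different labels.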
Use this script to answer the questions in the Quiz. Some hints:
- One of the questions is about the iris dataset you have seen before. The dataset is relatively large, so you may want to increase the size of the figure using plt.figure(figsize=(x, y)), where x and y are the desired width and height in inches (this should be done before the dendrogram is created).
- The last question is about the file languages.csv, which encodes a graph with 19 nodes (each being a language, listed on lines 2–20 of the file) and the pairwise distances, given as a matrix (on the subsequent lines). Note that linkage accepts a distance matrix in a condensed 1-D format as input (see the documentation). Use the following function to transform the file languages.csv into such a condensed format:
def load_distance_matrix(fname):
    """
    Takes as input a name of a file containing the information about a graph:
    - first line: number of vertices
    - then labels, one per line
    - then the distance matrix, one row per line, the entries separated by commas
    Returns a tuple containing a distance matrix in the condensed 1-D format and a list of labels
    """
    with open(fname, 'r') as f:
        n = int(f.readline())
        labels = [f.readline().strip() for _ in range(n)]
        dist_matrix = [[float(x) for x in f.readline().split(',')] for _ in range(n)]
        return (squareform(dist_matrix), labels)
Note: In order to make this function work, you need to add the following additional import for the squareform function:
from scipy.spatial.distance import squareform
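To see what the condensed format actually looks like, here is a small self-contained sketch (for illustration only; the matrix below is made up, not taken from languages.csv):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# a made-up symmetric 3x3 distance matrix with a zero diagonal
dist_matrix = np.array([[0.0, 2.0, 4.0],
                        [2.0, 0.0, 6.0],
                        [4.0, 6.0, 0.0]])

# condensed 1-D format: the upper triangle, read row by row
condensed = squareform(dist_matrix)
print(condensed)  # [2. 4. 6.]

# given a condensed matrix, linkage treats the input as pairwise
# distances rather than point coordinates
Z = linkage(condensed, method='single')
print(Z.shape)    # (2, 4)
```

For n points, the condensed vector has n(n-1)/2 entries, here 3 for n = 3.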
2. Some good practices
2.1 Source and header files
It is good practice to split your project into several
compilation units. This improves modularity and clarity, and it
simplifies code reuse. Having a class declared and implemented in
separate files has several further advantages. In particular, it allows
separate compilation: Consider a program of 1 000 000 lines of C++ code. Written as a single file, say a.cpp, the whole program has to be recompiled every time a single line is changed:
$ g++ -o a a.cpp
(here and below, the dollar sign $ represents the Linux terminal prompt; it is not part of the command).
Suppose now that this program is split into three files: a.cpp (containing the main function), b.cpp, and c.cpp (the latter two containing definitions of classes used in a.cpp). We can then precompile the dependencies:
$ g++ -c b.cpp
$ g++ -c c.cpp
generating the corresponding object files b.o and c.o. When one line of code is changed in a.cpp, it is the only file that needs to be recompiled:
$ g++ -o a a.cpp b.o c.o
In a large project, this can save many hours on a single compilation, translating into many days or weeks over the project's lifetime.
Notice that it is not sufficient to simply break the large program into several source files! In addition, the compilation process has to be adapted accordingly, so as not to recompile the whole program as here:
$ g++ -o a a.cpp b.cpp c.cpp # This is bad!
Notice the use of source files instead of object files. To this end, we use the program make with the dependency structure defined in the Makefile. Integrated development environments often use other, similar approaches.
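For illustration, a minimal makefile for the a.cpp/b.cpp/c.cpp example above could look as follows (a generic sketch with hypothetical header names, not the Makefile shipped with the TD; recipe lines must start with a tab character):

```make
# Hypothetical makefile for the a.cpp/b.cpp/c.cpp example.
# Each .o is rebuilt only when its source (or a header it depends on) changes.
a: a.cpp b.o c.o
	g++ -o a a.cpp b.o c.o

b.o: b.cpp b.hpp
	g++ -c b.cpp

c.o: c.cpp c.hpp
	g++ -c c.cpp

clean:
	rm -f a b.o c.o
```

Running make after editing only a.cpp then reuses b.o and c.o instead of recompiling everything.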
Note 1: When your program is structured into multiple compilation units, the make file describing the compilation procedure becomes more complex. Check out this very clear tutorial on writing makefiles (or this one).
Note 2: Deciding which headers to include where and in what order is not trivial. Here are some good rules to follow:
- A header file must not make assumptions about other header files included alongside it. For example, you can test cloud.hpp and point.hpp by issuing the following two commands:
$ g++ -c point.hpp
$ g++ -c cloud.hpp
These should compile correctly, generating files point.hpp.gch and cloud.hpp.gch, respectively.
- In a .cpp file, always include the corresponding header first (e.g., #include "cloud.hpp" is the first line in cloud.cpp).
- In a .cpp file, always explicitly include all header files on which you know it depends, even if you know that some of these are indirectly included through others. For example, if a.cpp includes b.hpp, which in turn includes c.hpp, you might be tempted not to include c.hpp explicitly. However, there is no guarantee that a future change to b.hpp does not remove the include of c.hpp, breaking your assumptions.
- In almost all situations (except very few that you are not likely to encounter in the near future), it is a good idea to include #pragma once as the first line of all header files.
2.2 Getters and setters
The most common use case for access modifiers is to protect the object data from external modifications. For instance, consider the point class from TD3. Objects of this class have three fields: d, coords, and label. Although a change to any of them could negatively affect the correctness of the program, a change of the dimension could be particularly disastrous, potentially leading to segmentation faults.
A good practice consists in keeping the member fields private as much as possible and accessing them using the so-called getter and setter functions. These grant limited access to data from outside the object, and they provide control over how the values are updated.
Suppose that we make the static variable int d private and that we add the functions static int get_dim() and static bool set_dim(int _d) with the following intention:
- the getter function static int get_dim() returns the value of d;
- the setter function static bool set_dim(int _d) updates the value of d only if it has never been updated before, and it returns true if the update has been performed and false otherwise.
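The set-once behavior of the setter can be modeled in a few lines of Python (an illustration of the logic only; the exercise itself asks for C++ code in point.cpp, and using 0 as the "not yet set" sentinel is an assumption of this sketch):

```python
class Point:
    _d = 0  # models the private static variable d; 0 means "not set yet"

    @classmethod
    def get_dim(cls):
        # getter: expose the current value of the dimension
        return cls._d

    @classmethod
    def set_dim(cls, d):
        # setter: update d only if it has never been set before
        if cls._d != 0:
            return False
        cls._d = d
        return True

print(Point.set_dim(3))  # True: the first update succeeds
print(Point.set_dim(5))  # False: the dimension is already fixed
print(Point.get_dim())   # 3
```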
Exercise 1
Complete these two functions in point.cpp. You can test your code by first running the command make grader and then the command ./grader 1. Then upload your file point.cpp here:
3. Single-linkage clustering
Now we move to implementing the clustering algorithm and related functionality.
3.1 Creating a graph
In the file graph.hpp you will find the declaration of the class graph. It contains the following attributes:
- long n: the number of vertices.
- long size: the number of edges.
- edge *edges: an array of the edges of the graph. An edge consists of its two endpoints as well as the distance between these two points (see also edge.hpp). The array is assumed to be sorted by non-decreasing edge lengths.
- std::string *node_names: an array of the names of the nodes in the graph.
- long iterator_pos: the position of the iterator, which we will use later.
Exercise 2
We start with implementing two constructors and one destructor. Before you implement them though, please read until the end of this subsection.
- The constructor graph(const cloud &_c) takes a cloud of points as input. It should initialize n and node_names from the information in the cloud (see cloud.hpp for the class description). The value of size should be computed based on n. After this, the constructor should build the array edges and sort it. The array should consist of all the possible pairs of the points (with their respective distances). In order to sort the edges by length, use the following version of the std::sort function:
std::sort(edges, edges + size, edge::compare); // requires #include <algorithm>
Its third parameter is the static method edge::compare, which is used for comparison while sorting.
- Similarly, the constructor graph(long n, std::string node_names[], double **dist_matrix) should construct a graph with n vertices from a list of n names and from an \(n\times n\) matrix dist_matrix of the distances of all pairs of points. Do not forget to sort the edges at the end.
- The destructor ~graph() should free the memory occupied by the arrays for the edges and the names.
Before you implement the methods above, you are encouraged to look at edge.cpp, especially at the edge::compare method and the constructor of the edge class.
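If it helps, here is the combinatorial content of the first constructor in miniature, as a Python sketch (illustration only, not the required C++): build all n(n-1)/2 pairwise edges of a small 1-D cloud and sort them by length.

```python
# five 1-D points, as in the figure of Section 3.3 further below
coords = [0.0, 1.0, 3.0, 4.0, 9.0]
n = len(coords)

# one edge per unordered pair of points: size = n * (n - 1) / 2
edges = [(abs(coords[p] - coords[q]), p, q)
         for p in range(n) for q in range(p + 1, n)]

# sort by non-decreasing length, mirroring std::sort with edge::compare
edges.sort()
print(len(edges))  # 10
print(edges[0])    # (1.0, 0, 1): a shortest edge comes first
```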
You are also asked to implement the getter long get_num_edges(), providing access to the field size.
Similar to before, you can test your code by first running make grader and then ./grader 2. Upload your file graph.cpp here:
3.2 Iterating over the edges
For the subsequent exercises, we will need to iterate over the array of all edges of the graph. To this end, we implement a simple iterator: a private field iterator_pos of the class graph indicates the next edge to process.
Exercise 3
Implement the following methods:
- The method void start_iteration() is intended to initialize the iterator by setting the field iterator_pos to 0.
- The method edge *get_next() returns the edge currently referred to by iterator_pos and advances this field to the next index in the array of edges. It returns NULL if the end of the array has been reached.
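The intended behavior of the two methods can be modeled in Python as follows (illustration only; the C++ version advances the field iterator_pos over the array edges in the same way, returning NULL instead of None):

```python
class EdgeIterator:
    def __init__(self, edges):
        self.edges = edges
        self.pos = 0          # models the field iterator_pos

    def start_iteration(self):
        self.pos = 0

    def get_next(self):
        # return None (NULL in C++) once the end of the array is reached
        if self.pos >= len(self.edges):
            return None
        e = self.edges[self.pos]
        self.pos += 1
        return e

it = EdgeIterator(['e0', 'e1'])
it.start_iteration()
print(it.get_next())  # e0
print(it.get_next())  # e1
print(it.get_next())  # None
```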
The command ./grader 3 will run the tests for this exercise (after make grader).
Upload your file graph.cpp here:
3.3 Hierarchical clustering with union–find: overview
Recall from the lecture (p. 65 in the booklet) that the clustering algorithm iterates over the edges provided by the class graph above in the following way:
- If the next edge connects 2 points from different clusters \((C,C')\), it triggers a merge (\(C\) and \(C'\) become the single cluster \(C \cup C'\)).
- Otherwise, the edge is ignored.
The data structure managing the clusters is provided by the class dendrogram declared in dendrogram.hpp. It follows the union–find paradigm, with an additional tree structure for top-down exploration.
Each cluster is represented by one of the points it contains. Whenever, as above, two clusters \((C, C')\) are joined, one of the clusters, say \(C'\), is merged into the other, here \(C\). The representative of \(C\) becomes that of \(C \cup C'\). A link is established from the representative of \(C'\) to that of \(C\) to keep track of the merge. In order to not chain too many links, we rely on the ranks of the clusters, which are the ranks of their representatives. The rank of a node representing a point in the resulting data structure is the depth of the sub-tree rooted at it, formed by the links.
As mentioned above, in addition to the union–find structure, our dendrogram data structure contains a tree that is used for top-down exploration. To this end, in each node, we record a down pointer to its rightmost child (i.e., the one that was merged last) and a left pointer to its next sibling.
The figure below illustrates the steps of the clustering algorithm for a cloud of five 1-dimensional points (0, 1, 3, 4 and 9) as well as the resulting dendrogram. The ranks of all points at every step are shown in red. A blue number, when present, indicates the height in the dendrogram at which the node/cluster is merged into its parent, i.e., the length of the edge that triggered the merge divided by 2. The black edges show the parent relation in the union–find structure, whereas the orange edges show the down and left relations in the tree for the top-down exploration.
The class dendrogram is declared in dendrogram.hpp. It contains the elements described above, defined as arrays:
int *parent;
int *rank;
int *left;
int *down;
double *height;
For instance, the variables of the structure in the last line of the figure above look as follows (recall that the array indices start at 0):
i == { 0, 1, 2, 3, 4} // Not a field -- only given here for the sake of clarity
----------------------------------
parent == { 1, 3, 3, -1, 3}
rank == { 0, 1, 0, 2, 0}
left == { -1, 2, -1, -1, 1}
down == { -1, 0, -1, 4, -1}
height == {0.5, 1, 0.5, -1, 2.5}
3.4 Finding the representative
For each node i, one can find the node representing the cluster containing i by following the edge to the parent (that is, from i to parent[i], etc.) until the parent is \(-1\). (We do not use any path compression here.)
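Using the example arrays from the figure above, the parent-chasing strategy can be checked in a few lines of Python (illustration only; the exercise asks for the C++ method):

```python
# the parent array from the last line of the figure above (-1 marks a root)
parent = [1, 3, 3, -1, 3]

def find(i):
    # follow parent links until the parent is -1; no path compression
    while parent[i] != -1:
        i = parent[i]
    return i

print(find(0))  # 3: node 0 -> node 1 -> node 3
print(find(3))  # 3: node 3 is its own representative
```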
Exercise 4
Use the strategy just explained to implement (in dendrogram.cpp) the method int find(int i), which should return the representative of the cluster containing i.
You can test your code by running ./grader 4 (after running make grader).
Upload your file dendrogram.cpp here:
3.5 Merging clusters
Now we come to the core part of the algorithm: merging clusters. To this end, we use the method void merge(edge *e), which should merge the clusters containing the nodes of the edge e. Keep in mind that:
- the two nodes of the edge need not be the representatives of their clusters, so you have to find those first (using the method find you have implemented earlier).
- the two nodes of the edge can already be in the same cluster. If this is the case, then no action is required.
- when necessary, merge must update all five fields above (parent, rank, left, down, and height).
- in order to avoid deep trees, take as the new representative node the one of the highest rank. If the ranks of the two representatives are equal, take the one with the larger index (the built-in tests of the function merge() will not pass if this choice is made differently, even though it is not important for the algorithm).
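To make these rules concrete, here is a Python sketch (illustration only, not the required C++) that replays the four merges from the figure above and reproduces exactly the arrays shown in Section 3.3. The loop at the end plays the role of the method build() from Section 3.6: iterate over the sorted edges and merge. For brevity, only the edges that actually trigger merges are listed; in build(), merge() simply ignores edges within a cluster.

```python
n = 5
parent = [-1] * n
rank   = [0] * n
left   = [-1] * n
down   = [-1] * n
height = [-1.0] * n

def find(i):
    # follow parent links to the representative (no path compression)
    while parent[i] != -1:
        i = parent[i]
    return i

def merge(p, q, length):
    rp, rq = find(p), find(q)
    if rp == rq:
        return  # already in the same cluster: nothing to do
    # make rq the new representative: highest rank, ties broken by larger index
    if rank[rp] > rank[rq] or (rank[rp] == rank[rq] and rp > rq):
        rp, rq = rq, rp
    if rank[rp] == rank[rq]:
        rank[rq] += 1
    parent[rp] = rq
    height[rp] = length / 2   # height at which rp joins its parent
    left[rp] = down[rq]       # previous rightmost child becomes the next sibling
    down[rq] = rp             # rp is now the rightmost (last merged) child

# the edges that trigger merges for the cloud {0, 1, 3, 4, 9}, in sorted order
for p, q, length in [(0, 1, 1.0), (2, 3, 1.0), (1, 2, 2.0), (3, 4, 5.0)]:
    merge(p, q, length)

print(parent)  # [1, 3, 3, -1, 3]
print(rank)    # [0, 1, 0, 2, 0]
print(down)    # [-1, 0, -1, 4, -1]
```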
Exercise 5
Implement the method merge in dendrogram.cpp. Run the command ./grader 5 to test your solution (after running make grader).
Upload your file dendrogram.cpp here:
3.6 Combining everything
Now we implement the clustering algorithm, following the approach outlined above: iterate over the edges and merge existing clusters.
Exercise 6
In dendrogram.cpp, implement the method void build(), which should construct a dendrogram structure as sketched in the figures above and explained in the accompanying text.
After running make grader, you can test your code via ./grader 6.
Upload your file dendrogram.cpp here:
4. Clusters
Now that the dendrogram is built, we want to get some information about the clusters that it contains.
4.1 Identifying the clusters below a given height
Our next goal is to allow splitting the points into clusters at a fixed height h (called \(C_h(p)\) in the notation of slide 78). The corresponding method void dendrogram::set_clusters(double h) should eventually mark, for each point, to which cluster it belongs. To this end, we make use of the int array clusters inside of the dendrogram class. The algorithm should do the following:
- For each point (with index i), set its label (that is, clusters[i]) to be the index of the node representing the cluster that the point with index i belongs to.
- Set the attribute int total_clusters to be equal to the number of clusters obtained this way.
This method is already implemented, but it relies on the auxiliary method int dendrogram::set_clusters(int i, double h), which you need to implement. For a given point with index i, the method should compute clusters[i], assuming that the parent of i already knows its cluster. Afterward, it should run recursively on the left and the down nodes of i. It should also increment total_clusters if necessary (which is the case if node i is the root of a cluster).
An example
For the figure above, at height 0.5, we have two non-singleton clusters—nodes 0 (point 0) and 1 (point 1) as well as nodes 2 (point 3) and 3 (point 4)—represented by nodes 1 and 3, respectively. Thus:
clusters == {1, 1, 3, 3, 4}
Similarly, at height 1, we have one non-singleton cluster, represented by node 3 (point 4), which is the result of a merge of the two clusters above. Hence,
clusters == {3, 3, 3, 3, 4}
For the discussion below, let us take h = 0.5 for the dendrogram in the above figure.
- If a node has a parent and joins it at a height no greater than h, it belongs to the same cluster. For example, node 2 (point 3) has the height 0.5, so it belongs to the same cluster as node 3 (point 4).
- Otherwise, if a node does not have a parent (e.g., node 3 (point 4)) or if it joins its parent at a height greater than h (e.g., node 4 (point 9)), then it does not belong to the same cluster. Since we are exploring the tree top-down, this node is necessarily the representative of a new cluster (in the case of node 4, this turns out to be a singleton cluster, which we will ignore subsequently).
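The top-down labeling can be traced on this example in Python (illustration only, using the final arrays from the figure of Section 3.3):

```python
# dendrogram arrays from the last line of the figure (Section 3.3)
parent = [1, 3, 3, -1, 3]
left   = [-1, 2, -1, -1, 1]
down   = [-1, 0, -1, 4, -1]
height = [0.5, 1.0, 0.5, -1.0, 2.5]

clusters = [-1] * 5
total_clusters = 0

def set_clusters(i, h):
    global total_clusters
    if i == -1:
        return
    if parent[i] == -1 or height[i] > h:
        clusters[i] = i                    # root of a new cluster
        total_clusters += 1
    else:
        clusters[i] = clusters[parent[i]]  # the parent already knows its cluster
    set_clusters(down[i], h)               # explore the children...
    set_clusters(left[i], h)               # ...and then the siblings

set_clusters(3, 0.5)   # start at the root, with h = 0.5
print(clusters)        # [1, 1, 3, 3, 4]
print(total_clusters)  # 3
```

This reproduces the clusters {1, 1, 3, 3, 4} listed above: two non-singleton clusters plus the singleton of node 4.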
Exercise 7
Implement int dendrogram::set_clusters(int i, double h) as explained above. You can test your code by running ./grader 7 (after running make grader).
Upload your file dendrogram.cpp here:
4.2 Counting non-singleton clusters
We continue using the terminology defined in Section 4.1 (the attribute clusters with the cluster labels).
Exercise 8
Implement (in the file dendrogram.cpp) the method int dendrogram::_count_ns_clusters(), which should count the number of non-singleton clusters.
Test your code via ./grader 8 (after running make grader).
Upload your file dendrogram.cpp here:
4.3 Height of a cluster
We now determine the height of a cluster.
Exercise 9
In the file dendrogram.cpp, implement the method double dendrogram::get_cluster_height(int cluster), which should determine the height in the dendrogram of the cluster, assuming that the array clusters is already initialized via set_clusters(double h) (from Exercise 7) and that the dendrogram is built. For cluster indices of nodes that are only representatives of a singleton cluster, this function should return 0. Note that such nodes are the ones that do not have children.
As an example, in the figure above, point 0 (node 0) is only ever the representative of its own cluster, hence, its height is 0. In contrast, point 1 (node 1) is, at the end, the representative of the cluster containing points 0 and 1, hence, its height is 0.5 (although it is also the representative of the singleton cluster that only contains point 1, which we ignore here). Similarly, the height of point 4 (node 3) is 1, although it is also the representative of two lower clusters (namely, the one containing points 3 and 4 (with height 0.5) as well as the one containing only point 4).
Recall that, by design, nodes at the same level of the data structure that represents all merged clusters have a non-decreasing height from left to right (and each node stores a pointer to its left node, if it exists). Further, a node stores the height of the cluster that it got merged into. That is, a node does not store the height of the cluster it represents itself. Instead, this height is stored in one of its children, the rightmost of which is accessed via down.
You can test your code via ./grader 9 (after running make grader).
Upload your file dendrogram.cpp here:
5. Running the clustering algorithm
Once you have done all the exercises above, you can compile a C++ clustering tool with make test-dendrogram. This command produces a binary test-dendrogram. You can use it with:
- csv files containing clouds of points, for example: ./test-dendrogram ./csv/iris.csv 4 150
- csv files containing graphs encoded by a distance matrix, for example: ./test-dendrogram ./csv/languages.csv
The tool prints lists of clusters at various depths.