CSC_43042_EP - Agglomerative hierarchical clustering
Context
In this tutorial, we will apply a hierarchical clustering algorithm to data, and interpret the results. Specifically, we will implement the single-linkage hierarchical clustering algorithm using the union-find data structure, which you have seen in INF321 and CSC_41011_EP.
Structure
This TD is structured as follows:
- Part 1 involves running hierarchical clustering
algorithms using
scikit-learn
to prepare you to working with data for the quiz, which you should complete some time before next week’s course (deadline: midnight on Tuesday, April 2). - Part 2 looks at the data structures for points and clouds.
- Part 3 looks at the weighted graph structures that we will build dendrograms from.
- Part 4 is an interlude on union-find clustering.
- Part 5 guides you through implementing a single-linkage clustering algorithm to build dendrograms.
- Part 6 describes how to retrieve useful information from the cluster data computed in Part 5.
Before you begin
Download and unzip the archive CSC43042EP-td3-1-handin.zip
.
It contains:
- source files, in particular the files
cloud.py
,graph.py
, anddendrogram.py
, which you will be asked to modify, - a
csv
folder with the data used in the TD, and - a
quiz
folder with a starter script in Python, for the quiz.
Part 1: Data analysis
In the quiz
this week, you are asked to perform hierarchical clustering on several
datasets—some synthetic, others from the real world. To this end, we use
the libraries pandas
(which you have already seen) and
scipy
. If you are using pip
to manage your
Python installation, then you can get the required libraries using the
following command in a terminal:
$ pip3 install --user --only-binary=:all: pandas scipy PyQt5 matplotlib
Note: you will have already installed pandas, scipy, and matplotlib for the previous tutorials. If you need to install PyQt5, install the most recent version available - but any 5.15.x version should be OK.
To perform the clustering, we provide you with a Python script named
hierarchical_clustering.py
, located in the
quiz
folder. It uses the following imports:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
We use pandas
for reading and manipulating data,
matplotlib
for generating plots, and scipy
for
performing the clustering.
The script currently reads in the file
../csv/bluered.csv
, which contains 22 points with
coordinates x
and y
labeled by integers from 1
to 22, as shown below:

The script then performs hierarchical clustering using the
linkage
function from scipy
. Afterward, it
generates and plots the resulting dendrogram, using the
dendrogram
function:
= linkage(data[['x', 'y']])
tree = dendrogram(tree, labels = data['name'].to_numpy(), orientation = 'left')
D plt.show()
When running the script (from the folder quiz
), you
should see the following picture:

You are encouraged to look at the official documentation for linkage and dendrogram. We would like to highlight the following aspects:
- Important: The x-coordinate corresponds to the
height of the cluster multiplied by two. So, if you are
interested, for example, in the clusters at height
1.0
, you should look at the part of the dendrogram with the x-coordinate being at most2.0
. - The
linkage
function is called by default withmethod='single'
, which tells it to use single-linkage, but other methods are implemented as well. - By default, the
linkage
function equips the data with the Euclidean metric, but other metrics are also available. You can even provide a custom distance matrix (see below). - When calling
dendrogram
, we provide the labels to be displayed at the leaf nodes. - One can also use the
color_threshold
parameter ofdendrogram
to find clusters of a given height. If one setscolor_threshold=h
, then only clusters beneath the cut heighth
are colored. For example, after settingcolor_threshold=2.0
, you should see the following:

The Quiz
Use this script to answer the questions in the Quiz. Some hints:
- One of the questions is about the iris dataset you have seen before.
The dataset is relatively large, so you may want to increase the size of
the figure using
plt.figure(figsize=(x, y))
, wherex
andy
are the desired width and height in inches (this should be done before the dendrogram has been created). - The last question is about the file
languages.csv
, which encodes a graph with 19 nodes (each being a language, listed on lines 2–20 of the file) and the pairwise distances, given as a matrix (on the subsequent lines). Note thatlinkage
accepts a distance matrix in a condensed 1-D format as input (see the documentation). Use the following function to transform the filelanguages.csv
into such a condensed format:
from scipy.spatial.distance import squareform
def load_distance_matrix(fname):
"""
Takes as input a name of a file containing the information about a graph:
- first line: number of vertices
- then labels, one per line
- then the distance matrix, one row per line, the entries separated by commas
Returns a tuple containing a distance matrix in the condensed 1-D format and a list of labels
"""
with open(fname, 'r') as infile:
= int(infile.readline())
n = [infile.readline().strip() for _ in range(n)]
labels = [[float(x) for x in line.split(',')] for line in infile]
dist_matrix return (squareform(dist_matrix), labels)
Note: The import
line at the top is
essential to define the squareform
function.
Part 2: Clouds of points
Now we move to the main task for this week: implementing the clustering algorithm. First, we define our most basic data structures: points and clouds of points..
Open the source file cloud.py
. The
cloud
module defines two classes:
Point
, which wraps a numpyndarray
of coordinates and has an (optional) label stringCloud
, which is little more than a list ofPoint
objects.
2. Clouds
Take a moment to inspect some of the special “dunder” methods of
Cloud
:
- the
__str__
method produces a printable string representation of aCloud
object, with each of its points on a separate line. Observe how we iterate over the_points
list usingenumerate
. In general,
for (i, x) in enumerate(seq):
# do something with i and x ...
is equivalent to
for i in range(len(seq)):
= seq[i]
x # do something with i and x ...
- the
__len__
method returns the number ofPoint
s in theCloud
. You should never call this method directly: instead, this method allows you to apply the builtinlen
function toCloud
objects. - the
__iter__
method returns an iterator (created by the built-initer
function) over the internal list of points. This allows us to iterate directly over the points in the cloud. Try it:
from TD.cloud import Point, Cloud
= Cloud()
c 1.0, 0.0, 0.0], 'x'))
c.add_point(Point([0.0, 1.0, 0.0], 'y'))
c.add_point(Point([0.0, 0.0, 1.0], 'z'))
c.add_point(Point([for p in c:
print(p)
Or, combining with the enumerate
trick above:
for (i, p_i) in enumerate(c):
print(f'The {i}-th point is {p_i}')
- the
__getitem__
method allows us to directly access the i-th point in the cloud using the subscript operator. Continuing with the example above:
= c[2]
p_2 print(p_2)
These methods allow us to work with Cloud
objects as if
they were basic Python lists. But they also leave us free to change the
internal representation—changing _points
from a
list
to an ndarray
, for example—without
changing the rest of the program.
Exercise 0
You skipped running the examples above, didn’t you? Go back and try them!
Part 3: Graphs
Now open the file graph.py
. You will
find a module defining
- A class
Graph
, whose nodes are names of points in a cloud - A class
Edge
representing the edges in aGraph
, weighted by Euclidean distances between corresponding points.
Edges
Let’s start with Edge
objects. These are very simple:
they contain indices (into the node list)
p1
and p2
of the two Point
objects joined by the edge, and also a floating-point
length
(for us, this is the Euclidean distance between
those points).
Exercise 1
Add a new method __lt__(self, other)
the Edge
class. This method should
- return
True
if the current/activeEdge
object’slength
is (strictly) less than thelength
ofother
, and - return
False
otherwise.
Again, since __lt__
is a “dunder” (double-underscore)
method, you should never call it directly: it exists so that Python can
apply the <
comparison operator to Edge
objects, and therefore sort collections of Edges.
Testing and grading
Test your code. You should be able to replicate the following behaviour:
>>> from TD.graph import Edge
>>> e_0 = Edge(4, 5, 1.7)
>>> e_1 = Edge(0, 1, 10.0)
>>> e_2 = Edge(0, 1, 2.3)
>>> e_3 = Edge(2, 3, 10.0)
>>> e_0 < e_2 < e_1 # Should evaluate to True
True
>>> e_1 < e_3 # Should evaluate to False
False
>>> e_2 < e_0 # Should evaluate to False
False
These <
comparisons are translated directly into
implicit calls to __lt__
(for example,
e_1 < e_3
is translated to
e_1.__lt__(e_3)
).
Now check that this enables proper
sorting for Edge
objects:
>>> edgelist = [e_0, e_1, e_2, e_3]
>>> print(edgelist)
4, 5, 1.7), Edge(0, 1, 10.0), Edge(0, 1, 2.3), Edge(2, 3, 10.0)]
[Edge(>>> edgelist.sort() # Sort by increasing length
>>> print(edgelist)
4, 5, 1.7), Edge(0, 1, 2.3), Edge(0, 1, 10.0), Edge(2, 3, 10.0)] [Edge(
Once this is working, you can try the automated
tests: python grader.py 1
will check your
__lt__
method.
Upload your graph.py
file here:
Graphs
Graph
objects have the following attributes:
node_names
: a list of the names of the nodes in the graph.edges
: a list of theEdge
s in the graph. This list is assumed to be sorted by non-decreasing edge length: any time you add more edges to the graph, you will have to.sort()
the list afterwards to maintain this invariant.
Every Graph
is created empty: we need to be able to add
the nodes and edges.
Exercises 2 and 3
Complete the methods add_nodes
and
add_edges
.
def add_nodes(self, ns : [str]) -> None:
"""Add a list of (names of) nodes to the graph."""
...
def add_edges(self, es : [Edge]) -> None:
"""Add a list of edges to the graph,
maintaining the length-sorted invariant.
"""
...
Remember to make sure the edges are stored in non-decreasing length order!
Testing and grading
Test your code. You should be able to replicate the following behaviour:
>>> from TD.graph import Graph, Edge
>>> g = Graph()
>>> g.add_nodes([])
>>> print(g)
Graph:0 Nodes
0 Edges
>>> g.add_edges([])
>>> print(g)
Graph:0 Nodes
0 Edges
>>> g.add_nodes(['A', 'B', 'C'])
>>> print(g)
Graph:3 Nodes: 0: A 1: B 2: C
0 Edges
>>> g.add_edges([Edge(0, 1, 1.2), Edge(0, 2, -1)])
>>> print(g)
Graph:3 Nodes: 0: A 1: B 2: C
2 Edges: [Edge(0, 2, -1), Edge(0, 1, 1.2)]
Once this is working, you can try the automated tests:
python grader.py 2
testsadd_nodes
python grader.py 3
testsadd_edges
Upload your graph.py
file here:
Building a graph from a cloud
Suppose we are given a cloud of data points; we want to form the complete graph on these points with edge lengths corresponding to Euclidean distances.
Exercise 4
Complete the function graph_from_cloud
(in graph.py
, but outside the
Graph
class: this is a function, not a method!).
def graph_from_cloud(c : Cloud):
"""Construct the complete graph whose nodes are names of points in c
and where the length of the edge between two points is the Euclidean
distance between them.
"""
...
Testing and grading
Test your code. You should be able to replicate the following behaviour:
>>> from TD.cloud import Cloud
>>> from TD.graph import *
>>> c = Cloud()
>>> c.add_point(Point([1.0, 0.0, 0.0], 'x'))
>>> c.add_point(Point([0.0, 1.0, 0.0], 'y'))
>>> c.add_point(Point([0.0, 0.0, 1.0], 'z'))
>>> print(c)
Cloud:0 [1. 0. 0.] (x)
1 [0. 1. 0.] (y)
2 [0. 0. 1.] (z)
>>> g = graph_from_cloud(c)
>>> print(g)
Graph:3 Nodes: 0: x 1: y 2: z
3 Edges: [Edge(1, 0, 1.4142135623730951), Edge(2, 0, 1.4142135623730951), Edge(2, 1, 1.4142135623730951)]
Once this is working, you can run some of our automated
tests: python grader.py 4
Upload your graph.py
file here:
Building a graph from a pre-computed matrix
Building a graph from a cloud using graph_from_cloud
required computing the distances between each pair of points. Sometimes
we have this data in advance: we would like to be able to build a
Graph
directly from a data file containing the matrix of
edge lengths.
Exercise 5
Complete the functions
graph_from_matrix
and graph_from_matrix_file
(in the graph.py
file, but outside the
Graph
class: these are functions, not methods!).
def graph_from_matrix(node_names : [str], dist_matrix: [[float]]) :
"""Construct the complete graph on the given list of node names
with the length of the edge between nodes i and j given by the
(i,j)-th entry of the matrix.
"""
...
Testing and grading
Test your code. For graph_from_matrix
,
you should be able to replicate this behaviour:
>>> from TD.graph import *
>>> nodes = ['A', 'B', 'C']
>>> lengths = [[0,1,2],[1,0,3],[2,3,0]]
>>> g = graph_from_matrix(nodes, lengths)
>>> print(g)
Graph:3 Nodes: 0: A 1: B 2: C
3 Edges: [Edge(1, 0, 1), Edge(2, 0, 2), Edge(2, 1, 3)]
When this is working, you can try some of our automated
tests: python grader.py 5
checks
graph_from_matrix
.
Building a graph from a data file
For large data sets, we don’t want to build the distance matrix by hand every time: we want to just load it from a data file.
Look at the data file
csv/languages.csv
:
19
English
...
Sranan
0,13,29,30,30,28,26,26,32,40,33,34,39,37,38,42,38,36,37
13, 0,33,31,34,31,26,30,33,40,36,39,42,37,42,44,41,42,36
...
36,42,37,33,32,38,41,36,34,45,37,36,18,19,20,20,15, 0,38
37,36,41,40,40,44,40,43,39,51,45,42,42,39,44,43,41,38, 0
Exercise 6
Complete the function
graph_from_matrix_file
(again, in the graph.py
file but outside the Graph
class).
def graph_from_matrix_file(filename):
"""Construct the graph specified in the named file. The first line
in the file is the number n of nodes; the next n lines give the node
names; and the following n lines are the rows of the distance matrix
(n entries per line, comma-separated).
"""
...
Your graph_from_matrix_file
should read the data
from the given file, and pass the results into
graph_from_matrix
to build the required Graph
object. Don’t reinvent the wheel!
Testing and grading
First, make sure that you can reproduce this behaviour:
>>> from TD.graph import *
>>> nodes = ['A', 'B', 'C']
>>> lengths = [[0,1,2],[1,0,3],[2,3,0]]
>>> g = graph_from_matrix(nodes, lengths)
>>> print(g)
Graph:3 Nodes: 0: A 1: B 2: C
3 Edges: [Edge(1, 0, 1), Edge(2, 0, 2), Edge(2, 1, 3)]
For graph_from_matrix_file
, you should be able to
reproduce this:
>>> g = graph_from_matrix_file('csv/languages.csv')
>>> print(g)
Graph:19 Nodes: 0: English, 1: Scotts, 2: Dutch, 3: Afrikaans, 4: Low_Saxon, 5: Limburgs, 6: West_Frisian, 7: Saterland_Frisian, 8: North_Frisian, 9: Luxembourgish, 10: German, 11: Yiddish, 12: Danish, 13: Swedish, 14: Faroese, 15: Icelandic, 16: Norwegian(bokmål), 17: Norwegian(nynorsk), 18: Sranan
171 Edges: [Edge(16, 12, 4.0), Edge(3, 2, 7.0), Edge(15, 14, 10.0), Edge(1, 0, 13.0), Edge(4, 2, 15.0), Edge(5, 2, 15.0), Edge(17, 16, 15.0), Edge(4, 3, 17.0), Edge(17, 12, 18.0), Edge(5, 3, 19.0), Edge(13, 12, 19.0), Edge(17, 13, 19.0), Edge(16, 13, 20.0), Edge(17, 14, 20.0), Edge(17, 15, 20.0), Edge(8, 4, 21.0), Edge(16, 14, 21.0), Edge(5, 4, 22.0), Edge(8, 3, 23.0), Edge(7, 4, 24.0), Edge(10, 4, 24.0), Edge(14, 12, 24.0), Edge(7, 3, 25.0), Edge(8, 2, 25.0), Edge(16, 15, 25.0), Edge(6, 0, 26.0), Edge(6, 1, 26.0), Edge(7, 0, 26.0), Edge(7, 2, 26.0), Edge(7, 6, 26.0), Edge(8, 6, 27.0), Edge(10, 9, 27.0), Edge(15, 12, 27.0), Edge(5, 0, 28.0), Edge(6, 2, 28.0), Edge(6, 5, 28.0), Edge(10, 5, 28.0), Edge(14, 13, 28.0), Edge(15, 13, 28.0), Edge(2, 0, 29.0), Edge(8, 7, 29.0), Edge(10, 2, 29.0), Edge(3, 0, 30.0), Edge(4, 0, 30.0), Edge(6, 3, 30.0), Edge(6, 4, 30.0), Edge(7, 1, 30.0), Edge(8, 5, 30.0), Edge(11, 4, 30.0), Edge(11, 10, 30.0), Edge(3, 1, 31.0), Edge(5, 1, 31.0), Edge(7, 5, 31.0), Edge(10, 3, 31.0), Edge(13, 3, 31.0), Edge(8, 0, 32.0), Edge(10, 7, 32.0), Edge(11, 7, 32.0), Edge(17, 4, 32.0), Edge(2, 1, 33.0), Edge(8, 1, 33.0), Edge(9, 5, 33.0), Edge(10, 0, 33.0), Edge(10, 6, 33.0), Edge(11, 2, 33.0), Edge(13, 2, 33.0), Edge(17, 3, 33.0), Edge(4, 1, 34.0), Edge(9, 3, 34.0), Edge(11, 0, 34.0), Edge(11, 3, 34.0), Edge(11, 5, 34.0), Edge(12, 3, 34.0), Edge(13, 4, 34.0), Edge(13, 5, 34.0), Edge(13, 10, 34.0), Edge(13, 11, 34.0), Edge(17, 8, 34.0), Edge(9, 2, 35.0), Edge(16, 3, 35.0), Edge(9, 4, 36.0), Edge(10, 1, 36.0), Edge(10, 8, 36.0), Edge(13, 8, 36.0), Edge(14, 3, 36.0), Edge(14, 11, 36.0), Edge(15, 11, 36.0), Edge(16, 11, 36.0), Edge(17, 0, 36.0), Edge(17, 7, 36.0), Edge(17, 11, 36.0), Edge(18, 1, 36.0), Edge(11, 8, 37.0), Edge(12, 5, 37.0), Edge(12, 8, 37.0), Edge(13, 0, 37.0), Edge(13, 1, 37.0), Edge(13, 6, 37.0), Edge(13, 7, 37.0), Edge(14, 2, 37.0), Edge(14, 4, 37.0), Edge(15, 2, 37.0), Edge(15, 3, 37.0), Edge(15, 4, 37.0), Edge(15, 8, 37.0), Edge(16, 4, 37.0), Edge(16, 8, 37.0), Edge(17, 2, 37.0), Edge(17, 10, 37.0), Edge(18, 0, 37.0), Edge(11, 6, 38.0), Edge(12, 2, 38.0), Edge(12, 4, 38.0), Edge(12, 11, 38.0), Edge(14, 0, 38.0), Edge(14, 8, 38.0), Edge(16, 0, 38.0), Edge(16, 5, 38.0), Edge(16, 10, 38.0), Edge(17, 5, 38.0), Edge(18, 17, 38.0), Edge(9, 6, 39.0), Edge(11, 1, 39.0), Edge(12, 0, 39.0), Edge(12, 10, 39.0), Edge(14, 6, 39.0), Edge(15, 5, 39.0), Edge(16, 2, 39.0), Edge(16, 7, 39.0), Edge(18, 8, 39.0), Edge(18, 13, 39.0), Edge(9, 0, 40.0), Edge(9, 1, 40.0), Edge(11, 9, 40.0), Edge(12, 7, 40.0), Edge(14, 5, 40.0), Edge(15, 10, 40.0), Edge(18, 3, 40.0), Edge(18, 4, 40.0), Edge(18, 6, 40.0), Edge(9, 7, 41.0), Edge(9, 8, 41.0), Edge(16, 1, 41.0), Edge(17, 6, 41.0), Edge(18, 2, 41.0), Edge(18, 16, 41.0), Edge(12, 1, 42.0), Edge(12, 6, 42.0), Edge(13, 9, 42.0), Edge(14, 1, 42.0), Edge(14, 10, 42.0), Edge(15, 0, 42.0), Edge(16, 6, 42.0), Edge(17, 1, 42.0), Edge(18, 11, 42.0), Edge(18, 12, 42.0), Edge(14, 7, 43.0), Edge(15, 6, 43.0), Edge(18, 7, 43.0), Edge(18, 15, 43.0), Edge(15, 1, 44.0), Edge(18, 5, 44.0), Edge(18, 14, 44.0), Edge(12, 9, 45.0), Edge(15, 7, 45.0), Edge(16, 9, 45.0), Edge(17, 9, 45.0), Edge(18, 10, 45.0), Edge(14, 9, 50.0), Edge(15, 9, 51.0), Edge(18, 9, 51.0)]
When your functions are working properly, you can run our
automated tests: python grader.py 6
Upload your graph.py
file here:
Part 4: Hierarchical clustering with union-find: overview
Recall from the lecture (p. 65 in the booklet)
that the clustering algorithm iterates over the edges provided by the
class graph
above in the following way:
- If the next edge connects 2 points from different clusters \((C,C')\), it triggers a merge (\(C\) and \(C'\) become the single cluster \(C \cup C'\)).
- Otherwise, the edge is ignored.
The data structure managing the clusters is provided by the class
Dendrogram
defined in dendrogram.py
. It
follows the union-find paradigm, with an additional tree structure for
top-down exploration.
Each cluster is represented by one of the points it contains. Whenever, as above, two clusters \((C, C')\) are joined, one of the clusters, say \(C'\), is merged into the other, here \(C\). The representative of \(C\) becomes that of \(C \cup C'\). A link is established from the representative of \(C'\) to that of \(C\) to keep track of the merge. In order to not chain too many links, we rely on the ranks of the clusters, which are the ranks of their representatives. The rank of a node representing a point in the resulting data structure is the depth of the sub-tree (of links) rooted at that node.
As mentioned above, in addition to the union-find structure, our
dendrogram data structure contains a tree that is used for top-down
exploration. To this end, for each node, we record a down
pointer to its rightmost child (i.e., the one that was merged
last) and a left
pointer to its next sibling.
The figure below illustrates the steps of the clustering algorithm for a cloud of five 1-dimensional points (0, 1, 3, 4 and 9) as well as the resulting dendrogram. The ranks of all points at every step are shown in red. A blue number, when present, indicates the height in the dendrogram at which the node/cluster is merged into its parent, i.e., the length of the edge that triggered the merge divided by 2. The black edges show the parent relation in the union–find structure, whereas the orange edges show the down and left relations in the tree for the top-down exploration.

Part 5: Building dendrograms
Now we switch focus to the Dendrogram
class defined in
dendrogram.py
. Its attributes include five lists:
parent
: a list indicating the parent of each node (for union-find)rank
: a list of integers ranks giving the rank of each node (for union-find)left
: a list indicating the sibling on the left of each nodedown
: a list indicating the rightmost child of each nodeheight
: a list of floats, each representing the height at which the represented cluster is merged into its parent
If we consider the clustering example above, after the last step, the
five lists in the corresponding Dendrogram
would be in the
following state:
1D point 0.0, 1.0, 3.0, 4.0, 9.0
node index 0, 1, 2, 3, 4
----------------------------------
parent == [ 1, 3, 3, -1, 3]
rank == [ 0, 1, 0, 2, 0]
left == [ -1, 2, -1, -1, 1]
down == [ -1, 0, -1, 4, -1]
height == [0.5, 1, 0.5, -1, 2.5]
Here, -1
means that there is no corresponding
node.
Finding representatives
For each node i
, one can find the node representing the
cluster containing i
by following the edge to the parent
(that is, from i
to parent[i]
etc.) until the
parent is \(-1\). (We do not use any
path compression here.)
Exercise 7
Complete the method find_rep
(in the
Dendrogram
class, in dendrogram.py
) following
the strategy above:
def find_rep(self, i : int) -> int:
"""Find the representative in the union-find structure
for the cluster containing node i.
"""
...
Testing and grading
Test your code. First, see if you can replicate this behaviour:
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> g = Graph()
>>> g.add_nodes(['A', 'B', 'C', 'D', 'E'])
>>> d = Dendrogram(g)
>>> print(d)
node parent rank left down height cluster0 - 0 - - - -
1 - 0 - - - -
2 - 0 - - - -
3 - 0 - - - -
4 - 0 - - - -
>>> [d.find_rep(i) for i in range(5)]
0, 1, 2, 3, 4]
[>>> d.parent = [1, 2, -1, 4, -1]
>>> [d.find_rep(i) for i in range(5)]
2, 2, 2, 4, 4]
[>>> print(d)
node parent rank left down height cluster0 1 0 - - - -
1 2 0 - - - -
2 - 0 - - - -
3 4 0 - - - -
4 - 0 - - - -
Once this is working, you can try some automated
tests: python grader.py 7
Upload your dendrogram.py
file
here:
Merging clusters
Now we come to the core part of the algorithm: merging clusters. To
this end, we use the method merge(e: Edge)
, which should
merge the clusters containing the nodes of the edge e
.
Keep in mind that:
the two nodes of the edge need not be the representatives of their clusters, so you have to find those first (using the method
find_rep
you have implemented earlier).the two nodes of the edge may already be in the same cluster. If this is the case, then no action is required.
when necessary,
merge
must update all five lists above (parent
,rank
,left
,down
, andheight
).to avoid deep trees, we take the highest-ranked of the two representatives to be the representative of the new cluster. If there is a tie (i.e., the ranks are equal), then we take the representative with the larger index. (In theory the tie-breaking rule doesn’t matter, but in practice our automated tests for
merge
will fail if you do otherwise.)
Exercise 8
Complete the method merge
in the
Dendrogram
class:
def merge(self, e : Edge):
"""Merge clusters connected by the edge."""
...
Here’s a plan of attack:
- Find the representatives,
- choose the higher of the two,
- adjust
parent
,left
, anddown
, - update
rank
, and finally - update
height
.
Testing and grading
Test your code. You should be able to reproduce the following:
>>> from TD.cloud import *
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> p1 = Point([0.0, 0.0], name='O')
>>> p2 = Point([1.0, 0.0], name='A')
>>> p3 = Point([1.0, 0.5], name='B')
>>> p4 = Point([1.0, 2.0], name='C')
>>> c = Cloud()
>>> c.add_point(p1)
>>> c.add_point(p2)
>>> c.add_point(p3)
>>> c.add_point(p4)
>>> g = graph_from_cloud(c)
>>> d = Dendrogram(g)
>>> print(d)
node parent rank left down height cluster0 - 0 - - - -
1 - 0 - - - -
2 - 0 - - - -
3 - 0 - - - -
>>> print(g)
Graph:4 Nodes: 0: O, 1: A, 2: B, 3: C
6 Edges: [Edge(2, 1, 0.5), Edge(1, 0, 1.0), Edge(2, 0, 1.118033988749895), Edge(3, 2, 1.5), Edge(3, 1, 2.0), Edge(3, 0, 2.23606797749979)]
>>> d.merge(g.edges[0])
>>> print(d)
node parent rank left down height cluster0 - 0 - - - -
1 2 0 - - 0.25 -
2 - 1 - 1 - -
3 - 0 - - - -
>>> d.merge(g.edges[1])
>>> print(d)
node parent rank left down height cluster0 2 0 1 - 0.5 -
1 2 0 - - 0.25 -
2 - 1 - 0 - -
3 - 0 - - - -
Now you can try some of our automated tests:
python grader.py 8
Upload your dendrogram.py
file
here:
Combining everything: building the dendrogram
Now we implement the clustering algorithm, following the approach outlined above: iterate over the edges and merge existing clusters.
Exercise 9
Complete the build
method in the
Dendrogram
class:
def build(self):
"""Merge along each edge in non-decreasing length order
to build the dendrogram.
"""
...
The method’s docstring explains what to do!
Testing and grading
Test your code. First, let’s try the 1-dimensional example above:
>>> from TD.cloud import *
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> c = Cloud()
>>> for x in [0.0, 1.0, 3.0, 4.0, 9.0]:
... c.add_point(Point([x]))
...>>> g = graph_from_cloud(c)
>>>
>>> print(g)
Graph:5 Nodes: 0: , 1: , 2: , 3: , 4:
10 Edges: [Edge(1, 0, 1.0), Edge(3, 2, 1.0), Edge(2, 1, 2.0), Edge(2, 0, 3.0), Edge(3, 1, 3.0), Edge(3, 0, 4.0), Edge(4, 3, 5.0), Edge(4, 2, 6.0), Edge(4, 1, 8.0), Edge(4, 0, 9.0)]
>>> d = Dendrogram(g)
>>> print(d)
node parent rank left down height cluster0 - 0 - - - -
1 - 0 - - - -
2 - 0 - - - -
3 - 0 - - - -
4 - 0 - - - -
>>> d.build()
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 -
1 3 1 2 0 1.0 -
2 3 0 - - 0.5 -
3 - 2 - 4 - -
4 3 0 1 - 2.5 -
After this sanity check, you can try some automated
tests: python grader.py 9
Upload your dendrogram.py
file
here:
Part 6: Using dendrograms
Identifying the clusters below a given height

Our next goal is to allow to split the points into clusters at fixed
height h
(called \(C_h(p)\) in the notation of slide
78). The method set_clusters(self, h : float)
in the
Dendrogram
class is supposed to compute the
height-h
cluster \(C_h(p)\) for every node \(p\), recording the results in the
clusters
list: that is, if We follow this algorithm:
- For each point (with index
i
), find the indexc_h_i
of the representative of the cluster containing this point, and setclusters[i]
toc_h_i
. - Ensure the attribute
total_clusters
is equal to the number of clusters obtained this way.
Exercise 10
The method set_clusters
is already implemented in
Dendrogram
, but it has a recursive subroutine
_set_clusters(self, i : int, h : float)
which you need to
implement.
- For a point with index
i
,_set_clusters
should computecluster[i]
, assuming the parent ofi
already knows its cluster. - Afterwards, it should run recursively on the the
left
and thedown
node ofi
. - If the node
i
is the root of a cluster, then_set_clusters
should also incrementtotal_clusters
.
An example
For the figure above, at height 0.5, we have two non-singleton clusters—nodes 0 (point 0) and 1 (point 1) as well as nodes 2 (point 3) and 3 (point 4)—represented by nodes 1 and 3, respectively. Thus:
clusters == {1, 1, 3, 3, 4}
Similarly, at height 1, we have one non-singleton cluster, represented by node 3 (point 4), which is the result of a merge of the two clusters above. Hence,
clusters = {3, 3, 3, 3, 4}
For the discussion below, let us take h = 0.5
for the
dendrogram in the above figure.
If a node has a parent and joins it at a height no greater than
h
, it belongs to the same cluster. For example, node 2 (point 3) has the height 0.5, so it belongs to the same cluster as node 3 (point 4).Otherwise, if a node does not have a parent (e.g., node 3 (point 4)) or if it joins its parent at a height greater than
h
(e.g., node 4 (point 9)), then it does not belong to the same cluster. Since we are exploring the tree top-down, this node is necessarily the representative of a new cluster (in the case of the node 4, this turns out to be a singleton cluster, which we will ignore subsequently).
Now implement _set_clusters
, as
described above.
Testing and grading
Test your code on the examples above. First, let’s set up the cloud, graph, and dendrogram objects:
>>> from TD.cloud import *
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> c = Cloud()
>>> for x in [0.0, 1.0, 3.0, 4.0, 9.0]:
... c.add_point(Point([x]))
...>>> g = graph_from_cloud(c)
>>> d = Dendrogram(g)
>>> d.build()
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 -
1 3 1 2 0 1.0 -
2 3 0 - - 0.5 -
3 - 2 - 4 - -
4 3 0 1 - 2.5 -
Now, let’s cut at height 0.5. We expect cluster indices 1, 1, 3, 3, 4:
>>> d.set_clusters(0.5)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 1
1 3 1 2 0 1.0 1
2 3 0 - - 0.5 3
3 - 2 - 4 - 3
4 3 0 1 - 2.5 4
Now let’s clear that clustering and cut at height 1. This time, we expect cluster indices 3, 3, 3, 3, 4:
>>> d.clear_clusters()
>>> d.set_clusters(1)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 3
1 3 1 2 0 1.0 3
2 3 0 - - 0.5 3
3 - 2 - 4 - 3
4 3 0 1 - 2.5 4
>>> c = Cloud()
>>> for x in [0.0, 1.0, 3.0, 4.0, 9.0]:
... c.add_point(Point([x]))
...>>> g = graph_from_cloud(c)
>>> d = Dendrogram(g)
>>> d.build()
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 -
1 3 1 2 0 1.0 -
2 3 0 - - 0.5 -
3 - 2 - 4 - -
4 3 0 1 - 2.5 -
>>> d.set_clusters(0.5)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 1
1 3 1 2 0 1.0 1
2 3 0 - - 0.5 3
3 - 2 - 4 - 3
4 3 0 1 - 2.5 4
>>> d.clear_clusters()
>>> d.set_clusters(1)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 3
1 3 1 2 0 1.0 3
2 3 0 - - 0.5 3
3 - 2 - 4 - 3
4 3 0 1 - 2.5 4
Now you can try some automated tests:
python grader.py 10
Upload your dendrogram.py
file
here:
Counting non-singleton clusters
Now we want to count the number of non-singleton clusters:
that is, the number of clusters containing more than one node. The
method count_ns_clusters
does this, caching the result (to
avoid re-computing it if the value has not changed). However, the
subroutine _count_ns_clusters
which does the real work
(when the result is not yet known) is not yet complete!
Exercise 11
Complete the method
def _count_ns_clusters(self):
"""Count non-singleton clusters from scratch"""
...
Testing and grading
Test your code. First, let’s try some simple sanity checks based on the example above. Setting up as before:
>>> from TD.cloud import *
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> c = Cloud()
>>> for x in [0.0, 1.0, 3.0, 4.0, 9.0]:
... c.add_point(Point([x]))
...>>> g = graph_from_cloud(c)
>>> d = Dendrogram(g)
>>> d.build()
Now let’s try clustering below height 0.1: every cluster is a singleton.
>>> d.set_clusters(0.1)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 0
1 3 1 2 0 1.0 1
2 3 0 - - 0.5 2
3 - 2 - 4 - 3
4 3 0 1 - 2.5 4
>>> d.count_ns_clusters()
0
Now let’s try cutting at height 0.5: this time, there are two clusters of two nodes each, and one singleton:
>>> d.clear_clusters()
>>> d.set_clusters(0.5)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 1
1 3 1 2 0 1.0 1
2 3 0 - - 0.5 3
3 - 2 - 4 - 3
4 3 0 1 - 2.5 4
>>> d.count_ns_clusters()
2
Finally, let’s try cutting at 1. This time we get one cluster of four nodes, and one singleton:
>>> d.clear_clusters()
>>> d.set_clusters(1)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 3
1 3 1 2 0 1.0 3
2 3 0 - - 0.5 3
3 - 2 - 4 - 3
4 3 0 1 - 2.5 4
>>> d.count_ns_clusters()
1
Now you can try some automated tests:
python grader.py 11
Upload your dendrogram.py
file
here:
Computing cluster heights
Suppose the list of clusters of height bounded by h
has
been computed using set_clusters(h)
. Now we want to
determine the height of a given cluster. For example, in the figure
above:
- Point 0 (node 0) is only ever the representative of its own cluster, hence, its height is 0.
- In contrast, point 1 (node 1) is, at the end, the representative of the cluster containing points 0 and 1; hence, its height is 0.5 (although it is also the representative of the singleton cluster that only contains point 1, which we ignore here).
- Similarly, the height of point 4 (node 3) is 1, though it is also the representative of two lower clusters (namely, the one containing points 3 and 4 (with height 0.5) as well as the one containing only point 4).
Exercise 12
Complete the method
def get_cluster_height(self, c : int) -> float:
"""Compute the height of the cluster c.
Assumes build() has been called and the clusters of height <= h
have been computed using set_clusters(h).
"""
...
Note: get_cluster_height(c)
should return
0
if c
is the index of a node representing a
singleton cluster. These nodes are the ones that do not have
children.
Recall that
- By design, nodes at the same level of the data structure that represents all merged clusters have non-decreasing height from left to right.
height[node]
does not store the height of the cluster it represents: it stores the height of the cluster that it got merged into!- Instead, this height is stored in one of its children, the rightmost
of which is accessed via
down
.
Testing and grading
Test your code. Setting up as before:
>>> from TD.cloud import *
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> c = Cloud()
>>> for x in [0.0, 1.0, 3.0, 4.0, 9.0]:
... c.add_point(Point([x]))
...>>> g = graph_from_cloud(c)
>>> d = Dendrogram(g)
>>> d.build()
First, clustering with height 0.1:
>>> d.set_clusters(0.1)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 0
1 3 1 2 0 1.0 1
2 3 0 - - 0.5 2
3 - 2 - 4 - 3
4 3 0 1 - 2.5 4
>>> [ d.get_cluster_height(i) for i in range(5) ]
0, 0, 0, 0, 0] [
Now, at height 0.5:
>>> d.clear_clusters()
>>> d.set_clusters(0.5)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 1
1 3 1 2 0 1.0 1
2 3 0 - - 0.5 3
3 - 2 - 4 - 3
4 3 0 1 - 2.5 4
>>> [ d.get_cluster_height(i) for i in range(5) ]
0, 0.5, 0, 0.5, 0] [
Next, height 2.0:
>>> d.clear_clusters()
>>> d.set_clusters(2)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 3
1 3 1 2 0 1.0 3
2 3 0 - - 0.5 3
3 - 2 - 4 - 3
4 3 0 1 - 2.5 4
>>> [ d.get_cluster_height(i) for i in range(5) ]
0, 0, 0, 1.0, 0]
[>>> d.clear_clusters()
And finally, height 2.5:
>>> d.set_clusters(2.5)
>>> print(d)
node parent rank left down height cluster0 1 0 - - 0.5 3
1 3 1 2 0 1.0 3
2 3 0 - - 0.5 3
3 - 2 - 4 - 3
4 3 0 1 - 2.5 3
>>> [ d.get_cluster_height(i) for i in range(5) ]
0, 0, 0, 2.5, 0] [
Now you can try some automated tests:
python grader.py 12
Upload your dendrogram.py
file
here:
Running the clustering algorithm
Once you have done all the exercises above, you can run the Python
script test_dendrogram.py
to print lists of clusters at
various depths. You can use it with
- csv files containing clouds of points: for example,
python ./test_dendrogram.py ./csv/iris.csv 4 150
- csv files containing graphs encoded by a distance matrix, for example,
python ./test_dendrogram.py ./csv/languages.csv