This assignment has been closed on April 02, 2025.
You must be authenticated to submit your files

CSC_43042_EP - Agglomerative hierarchical clustering

Context

In this tutorial, we will apply a hierarchical clustering algorithm to data, and interpret the results. Specifically, we will implement the single-linkage hierarchical clustering algorithm using the union-find data structure, which you have seen in INF321 and CSC_41011_EP.

Structure

This TD is structured as follows:

Before you begin

Download and unzip the archive CSC43042EP-td3-1-handin.zip.

It contains:

Part 1: Data analysis

In the quiz this week, you are asked to perform hierarchical clustering on several datasets—some synthetic, others from the real world. To this end, we use the libraries pandas (which you have already seen) and scipy. If you are using pip to manage your Python installation, then you can get the required libraries using the following command in a terminal:

$ pip3 install --user --only-binary=:all: pandas scipy PyQt5 matplotlib

Note: you will have already installed pandas, scipy, and matplotlib for the previous tutorials. If you need to install PyQt5, install the most recent version available - but any 5.15.x version should be OK.

To perform the clustering, we provide you with a Python script named hierarchical_clustering.py, located in the quiz folder. It uses the following imports:

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

We use pandas for reading and manipulating data, matplotlib for generating plots, and scipy for performing the clustering.

The script currently reads in the file ../csv/bluered.csv, which contains 22 points with coordinates x and y labeled by integers from 1 to 22, as shown below:

The script then performs hierarchical clustering using the linkage function from scipy. Afterward, it generates and plots the resulting dendrogram, using the dendrogram function:

tree = linkage(data[['x', 'y']])
D = dendrogram(tree, labels = data['name'].to_numpy(), orientation = 'left')
plt.show()

When running the script (from the folder quiz), you should see the following picture:

You are encouraged to look at the official documentation for linkage and dendrogram. We would like to highlight the following aspects:

The Quiz

Use this script to answer the questions in the Quiz. Some hints:

from scipy.spatial.distance import squareform

def load_distance_matrix(fname):
    """
    Takes as input a name of a file containing the information about a graph:
      - first line: number of vertices
      - then labels, one per line
      - then the distance matrix, one row per line, the entries separated by commas
    Returns a tuple containing a distance matrix in the condensed 1-D format and a list of labels
    """
    with open(fname, 'r') as infile:
        n = int(infile.readline())
        labels = [infile.readline().strip() for _ in range(n)]
        dist_matrix = [[float(x) for x in line.split(',')] for line in infile]
        return (squareform(dist_matrix), labels)

Note: The import line at the top is essential to define the squareform function.

Part 2: Clouds of points

Now we move to the main task for this week: implementing the clustering algorithm. First, we define our most basic data structures: points and clouds of points..

Open the source file cloud.py. The cloud module defines two classes:

2. Clouds

Take a moment to inspect some of the special “dunder” methods of Cloud:

  1. the __str__ method produces a printable string representation of a Cloud object, with each of its points on a separate line. Observe how we iterate over the _points list using enumerate. In general,
for (i, x) in enumerate(seq):
    ...  # do something with i and x

is equivalent to

for i in range(len(seq)):
    x = seq[i]
    ...  # do something with i and x
  1. the __len__ method returns the number of Points in the Cloud. You should never call this method directly: instead, this method allows you to apply the builtin len function to Cloud objects.
  2. the __iter__ method returns an iterator (created by the built-in iter function) over the internal list of points. This allows us to iterate directly over the points in the cloud. Try it:
from TD.cloud import Point, Cloud
c = Cloud()
c.add_point(Point([1.0, 0.0, 0.0], 'x'))
c.add_point(Point([0.0, 1.0, 0.0], 'y'))
c.add_point(Point([0.0, 0.0, 1.0], 'z'))
for p in c:
    print(p)

Or, combining with the enumerate trick above:

for (i, p_i) in enumerate(c):
    print(f'The {i}-th point is {p_i}')
  1. the __getitem__ method allows us to directly access the i-th point in the cloud using the subscript operator. Continuing with the example above:
p_2 = c[2]
print(p_2)

These methods allow us to work with Cloud objects as if they were basic Python lists. But they also leave us free to change the internal representation—changing _points from a list to an ndarray, for example—without changing the rest of the program.

Exercise 0

You skipped running the examples above, didn’t you? Go back and try them!

Part 3: Graphs

Now open the file graph.py. You will find a module defining

Edges

Let’s start with Edge objects. These are very simple: they contain indices (into the node list) p1 and p2 of the two Point objects joined by the edge, and also a floating-point length (for us, this is the Euclidean distance between those points).

Exercise 1

Add a new method __lt__(self, other) the Edge class. This method should

Again, since __lt__ is a “dunder” (double-underscore) method, you should never call it directly: it exists so that Python can apply the < comparison operator to Edge objects, and therefore sort collections of Edges.

Testing and grading

Test your code. You should be able to replicate the following behaviour:

>>> from TD.graph import Edge
>>> e_0 = Edge(4, 5, 1.7)
>>> e_1 = Edge(0, 1, 10.0)
>>> e_2 = Edge(0, 1, 2.3)
>>> e_3 = Edge(2, 3, 10.0)
>>> e_0 < e_2 < e_1  # Should evaluate to True
True
>>> e_1 < e_3        # Should evaluate to False
False
>>> e_2 < e_0        # Should evaluate to False
False

These < comparisons are translated directly into implicit calls to __lt__ (for example, e_1 < e_3 is translated to e_1.__lt__(e_3)).

Now check that this enables proper sorting for Edge objects:

>>> edgelist = [e_0, e_1, e_2, e_3]
>>> print(edgelist)
[Edge(4, 5, 1.7), Edge(0, 1, 10.0), Edge(0, 1, 2.3), Edge(2, 3, 10.0)]
>>> edgelist.sort()  # Sort by increasing length
>>> print(edgelist)
[Edge(4, 5, 1.7), Edge(0, 1, 2.3), Edge(0, 1, 10.0), Edge(2, 3, 10.0)]

Once this is working, you can try the automated tests: python grader.py 1 will check your __lt__ method.

Upload your graph.py file here:

Upload form is only available when connected

Graphs

Graph objects have the following attributes:

Every Graph is created empty: we need to be able to add the nodes and edges.

Exercises 2 and 3

Complete the methods add_nodes and add_edges.

    def add_nodes(self, ns : [str]) -> None:
        """Add a list of (names of) nodes to the graph."""
        ...

    def add_edges(self, es : [Edge]) -> None:
        """Add a list of edges to the graph,
        maintaining the length-sorted invariant.
        """
        ...

Remember to make sure the edges are stored in non-decreasing length order!

Testing and grading

Test your code. You should be able to replicate the following behaviour:

>>> from TD.graph import Graph, Edge
>>> g = Graph()
>>> g.add_nodes([])
>>> print(g)
Graph:
0 Nodes
0 Edges
>>> g.add_edges([])
>>> print(g)
Graph:
0 Nodes
0 Edges
>>> g.add_nodes(['A', 'B', 'C'])
>>> print(g)
Graph:
3 Nodes: 0: A 1: B 2: C
0 Edges
>>> g.add_edges([Edge(0, 1, 1.2), Edge(0, 2, -1)])
>>> print(g)
Graph:
3 Nodes: 0: A 1: B 2: C
2 Edges: [Edge(0, 2, -1), Edge(0, 1, 1.2)]

Once this is working, you can try the automated tests:

Upload your graph.py file here:

Upload form is only available when connected

Building a graph from a cloud

Suppose we are given a cloud of data points; we want to form the complete graph on these points with edge lengths corresponding to Euclidean distances.

Exercise 4

Complete the function graph_from_cloud (in graph.py, but outside the Graph class: this is a function, not a method!).

def graph_from_cloud(c : Cloud):
    """Construct the complete graph whose nodes are names of points in c
    and where the length of the edge between two points is the Euclidean
    distance between them.
    """
    ...
Testing and grading

Test your code. You should be able to replicate the following behaviour:

>>> from TD.cloud import Cloud
>>> from TD.graph import *
>>> c = Cloud()
>>> c.add_point(Point([1.0, 0.0, 0.0], 'x'))
>>> c.add_point(Point([0.0, 1.0, 0.0], 'y'))
>>> c.add_point(Point([0.0, 0.0, 1.0], 'z'))
>>> print(c)
Cloud:
 0 [1. 0. 0.] (x)
 1 [0. 1. 0.] (y)
 2 [0. 0. 1.] (z)

>>> g = graph_from_cloud(c)
>>> print(g)
Graph:
3 Nodes: 0: x 1: y 2: z
3 Edges: [Edge(1, 0, 1.4142135623730951), Edge(2, 0, 1.4142135623730951), Edge(2, 1, 1.4142135623730951)]

Once this is working, you can run some of our automated tests: python grader.py 4

Upload your graph.py file here:

Upload form is only available when connected

Building a graph from a pre-computed matrix

Building a graph from a cloud using graph_from_cloud required computing the distances between each pair of points. Sometimes we have this data in advance: we would like to be able to build a Graph directly from a data file containing the matrix of edge lengths.

Exercise 5

Complete the functions graph_from_matrix and graph_from_matrix_file (in the graph.py file, but outside the Graph class: these are functions, not methods!).

def graph_from_matrix(node_names : [str], dist_matrix: [[float]]) :
    """Construct the complete graph on the given list of node names
    with the length of the edge between nodes i and j given by the
    (i,j)-th entry of the matrix.
    """
    ...
Testing and grading

Test your code. For graph_from_matrix, you should be able to replicate this behaviour:

>>> from TD.graph import *
>>> nodes = ['A', 'B', 'C']
>>> lengths = [[0,1,2],[1,0,3],[2,3,0]]
>>> g = graph_from_matrix(nodes, lengths)
>>> print(g)
Graph:
3 Nodes: 0: A 1: B 2: C
3 Edges: [Edge(1, 0, 1), Edge(2, 0, 2), Edge(2, 1, 3)]

When this is working, you can try some of our automated tests: python grader.py 5 checks graph_from_matrix.

Building a graph from a data file

For large data sets, we don’t want to build the distance matrix by hand every time: we want to just load it from a data file.

Look at the data file csv/languages.csv:

19
English
...
Sranan
0,13,29,30,30,28,26,26,32,40,33,34,39,37,38,42,38,36,37
13, 0,33,31,34,31,26,30,33,40,36,39,42,37,42,44,41,42,36 
...
36,42,37,33,32,38,41,36,34,45,37,36,18,19,20,20,15, 0,38
37,36,41,40,40,44,40,43,39,51,45,42,42,39,44,43,41,38, 0
Exercise 6

Complete the function graph_from_matrix_file (again, in the graph.py file but outside the Graph class).

def graph_from_matrix_file(filename):
    """Construct the graph specified in the named file.  The first line
    in the file is the number n of nodes; the next n lines give the node
    names; and the following n lines are the rows of the distance matrix
    (n entries per line, comma-separated).
    """
    ...

Your graph_from_matrix_file should read the data from the given file, and pass the results into graph_from_matrix to build the required Graph object. Don’t reinvent the wheel!

Testing and grading

First, make sure that you can reproduce this behaviour:

>>> from TD.graph import *
>>> nodes = ['A', 'B', 'C']
>>> lengths = [[0,1,2],[1,0,3],[2,3,0]]
>>> g = graph_from_matrix(nodes, lengths)
>>> print(g)
Graph:
3 Nodes: 0: A 1: B 2: C
3 Edges: [Edge(1, 0, 1), Edge(2, 0, 2), Edge(2, 1, 3)]

For graph_from_matrix_file, you should be able to reproduce this:

>>> g = graph_from_matrix_file('csv/languages.csv')
>>> print(g)
Graph:
19 Nodes: 0: English, 1: Scotts, 2: Dutch, 3: Afrikaans, 4: Low_Saxon, 5: Limburgs, 6: West_Frisian, 7: Saterland_Frisian, 8: North_Frisian, 9: Luxembourgish, 10: German, 11: Yiddish, 12: Danish, 13: Swedish, 14: Faroese, 15: Icelandic, 16: Norwegian(bokmål), 17: Norwegian(nynorsk), 18: Sranan
171 Edges: [Edge(16, 12, 4.0), Edge(3, 2, 7.0), Edge(15, 14, 10.0), Edge(1, 0, 13.0), Edge(4, 2, 15.0), Edge(5, 2, 15.0), Edge(17, 16, 15.0), Edge(4, 3, 17.0), Edge(17, 12, 18.0), Edge(5, 3, 19.0), Edge(13, 12, 19.0), Edge(17, 13, 19.0), Edge(16, 13, 20.0), Edge(17, 14, 20.0), Edge(17, 15, 20.0), Edge(8, 4, 21.0), Edge(16, 14, 21.0), Edge(5, 4, 22.0), Edge(8, 3, 23.0), Edge(7, 4, 24.0), Edge(10, 4, 24.0), Edge(14, 12, 24.0), Edge(7, 3, 25.0), Edge(8, 2, 25.0), Edge(16, 15, 25.0), Edge(6, 0, 26.0), Edge(6, 1, 26.0), Edge(7, 0, 26.0), Edge(7, 2, 26.0), Edge(7, 6, 26.0), Edge(8, 6, 27.0), Edge(10, 9, 27.0), Edge(15, 12, 27.0), Edge(5, 0, 28.0), Edge(6, 2, 28.0), Edge(6, 5, 28.0), Edge(10, 5, 28.0), Edge(14, 13, 28.0), Edge(15, 13, 28.0), Edge(2, 0, 29.0), Edge(8, 7, 29.0), Edge(10, 2, 29.0), Edge(3, 0, 30.0), Edge(4, 0, 30.0), Edge(6, 3, 30.0), Edge(6, 4, 30.0), Edge(7, 1, 30.0), Edge(8, 5, 30.0), Edge(11, 4, 30.0), Edge(11, 10, 30.0), Edge(3, 1, 31.0), Edge(5, 1, 31.0), Edge(7, 5, 31.0), Edge(10, 3, 31.0), Edge(13, 3, 31.0), Edge(8, 0, 32.0), Edge(10, 7, 32.0), Edge(11, 7, 32.0), Edge(17, 4, 32.0), Edge(2, 1, 33.0), Edge(8, 1, 33.0), Edge(9, 5, 33.0), Edge(10, 0, 33.0), Edge(10, 6, 33.0), Edge(11, 2, 33.0), Edge(13, 2, 33.0), Edge(17, 3, 33.0), Edge(4, 1, 34.0), Edge(9, 3, 34.0), Edge(11, 0, 34.0), Edge(11, 3, 34.0), Edge(11, 5, 34.0), Edge(12, 3, 34.0), Edge(13, 4, 34.0), Edge(13, 5, 34.0), Edge(13, 10, 34.0), Edge(13, 11, 34.0), Edge(17, 8, 34.0), Edge(9, 2, 35.0), Edge(16, 3, 35.0), Edge(9, 4, 36.0), Edge(10, 1, 36.0), Edge(10, 8, 36.0), Edge(13, 8, 36.0), Edge(14, 3, 36.0), Edge(14, 11, 36.0), Edge(15, 11, 36.0), Edge(16, 11, 36.0), Edge(17, 0, 36.0), Edge(17, 7, 36.0), Edge(17, 11, 36.0), Edge(18, 1, 36.0), Edge(11, 8, 37.0), Edge(12, 5, 37.0), Edge(12, 8, 37.0), Edge(13, 0, 37.0), Edge(13, 1, 37.0), Edge(13, 6, 37.0), Edge(13, 7, 37.0), Edge(14, 2, 37.0), Edge(14, 4, 37.0), Edge(15, 2, 37.0), Edge(15, 3, 37.0), Edge(15, 4, 37.0), Edge(15, 8, 37.0), Edge(16, 4, 37.0), Edge(16, 8, 37.0), Edge(17, 2, 37.0), Edge(17, 10, 37.0), Edge(18, 0, 37.0), Edge(11, 6, 38.0), Edge(12, 2, 38.0), Edge(12, 4, 38.0), Edge(12, 11, 38.0), Edge(14, 0, 38.0), Edge(14, 8, 38.0), Edge(16, 0, 38.0), Edge(16, 5, 38.0), Edge(16, 10, 38.0), Edge(17, 5, 38.0), Edge(18, 17, 38.0), Edge(9, 6, 39.0), Edge(11, 1, 39.0), Edge(12, 0, 39.0), Edge(12, 10, 39.0), Edge(14, 6, 39.0), Edge(15, 5, 39.0), Edge(16, 2, 39.0), Edge(16, 7, 39.0), Edge(18, 8, 39.0), Edge(18, 13, 39.0), Edge(9, 0, 40.0), Edge(9, 1, 40.0), Edge(11, 9, 40.0), Edge(12, 7, 40.0), Edge(14, 5, 40.0), Edge(15, 10, 40.0), Edge(18, 3, 40.0), Edge(18, 4, 40.0), Edge(18, 6, 40.0), Edge(9, 7, 41.0), Edge(9, 8, 41.0), Edge(16, 1, 41.0), Edge(17, 6, 41.0), Edge(18, 2, 41.0), Edge(18, 16, 41.0), Edge(12, 1, 42.0), Edge(12, 6, 42.0), Edge(13, 9, 42.0), Edge(14, 1, 42.0), Edge(14, 10, 42.0), Edge(15, 0, 42.0), Edge(16, 6, 42.0), Edge(17, 1, 42.0), Edge(18, 11, 42.0), Edge(18, 12, 42.0), Edge(14, 7, 43.0), Edge(15, 6, 43.0), Edge(18, 7, 43.0), Edge(18, 15, 43.0), Edge(15, 1, 44.0), Edge(18, 5, 44.0), Edge(18, 14, 44.0), Edge(12, 9, 45.0), Edge(15, 7, 45.0), Edge(16, 9, 45.0), Edge(17, 9, 45.0), Edge(18, 10, 45.0), Edge(14, 9, 50.0), Edge(15, 9, 51.0), Edge(18, 9, 51.0)]

When your functions are working properly, you can run our automated tests: python grader.py 6

Upload your graph.py file here:

Upload form is only available when connected

Part 4: Hierarchical clustering with union-find: overview

Recall from the lecture (p. 65 in the booklet) that the clustering algorithm iterates over the edges provided by the class graph above in the following way:

The data structure managing the clusters is provided by the class Dendrogram defined in dendrogram.py. It follows the union-find paradigm, with an additional tree structure for top-down exploration.

Each cluster is represented by one of the points it contains. Whenever, as above, two clusters \((C, C')\) are joined, one of the clusters, say \(C'\), is merged into the other, here \(C\). The representative of \(C\) becomes that of \(C \cup C'\). A link is established from the representative of \(C'\) to that of \(C\) to keep track of the merge. In order to not chain too many links, we rely on the ranks of the clusters, which are the ranks of their representatives. The rank of a node representing a point in the resulting data structure is the depth of the sub-tree (of links) rooted at that node.

As mentioned above, in addition to the union-find structure, our dendrogram data structure contains a tree that is used for top-down exploration. To this end, for each node, we record a down pointer to its rightmost child (i.e., the one that was merged last) and a left pointer to its next sibling.

The figure below illustrates the steps of the clustering algorithm for a cloud of five 1-dimensional points (0, 1, 3, 4 and 9) as well as the resulting dendrogram. The ranks of all points at every step are shown in red. A blue number, when present, indicates the height in the dendrogram at which the node/cluster is merged into its parent, i.e., the length of the edge that triggered the merge divided by 2. The black edges show the parent relation in the union–find structure, whereas the orange edges show the down and left relations in the tree for the top-down exploration.

Part 5: Building dendrograms

Now we switch focus to the Dendrogram class defined in dendrogram.py. Its attributes include five lists:

If we consider the clustering example above, after the last step, the five lists in the corresponding Dendrogram would be in the following state:

1D point   0.0, 1.0, 3.0, 4.0, 9.0
node index   0,   1,   2,   3,   4
----------------------------------
parent == [  1,   3,   3,  -1,   3]
rank   == [  0,   1,   0,   2,   0]
left   == [ -1,   2,  -1,  -1,   1]
down   == [ -1,   0,  -1,   4,  -1]
height == [0.5,   1, 0.5,  -1, 2.5]

Here, -1 means that there is no corresponding node.

Finding representatives

For each node i, one can find the node representing the cluster containing i by following the edge to the parent (that is, from i to parent[i] etc.) until the parent is \(-1\). (We do not use any path compression here.)

Exercise 7

Complete the method find_rep (in the Dendrogram class, in dendrogram.py) following the strategy above:

    def find_rep(self, i : int) -> int:
        """Find the representative in the union-find structure
        for the cluster containing node i.
        """
        ...
Testing and grading

Test your code. First, see if you can replicate this behaviour:

>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> g = Graph()
>>> g.add_nodes(['A', 'B', 'C', 'D', 'E'])
>>> d = Dendrogram(g)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	-	0	-	-	-	-
1	-	0	-	-	-	-
2	-	0	-	-	-	-
3	-	0	-	-	-	-
4	-	0	-	-	-	-
>>> [d.find_rep(i) for i in range(5)]
[0, 1, 2, 3, 4]
>>> d.parent = [1, 2, -1, 4, -1]
>>> [d.find_rep(i) for i in range(5)]
[2, 2, 2, 4, 4]
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	-	-
1	2	0	-	-	-	-
2	-	0	-	-	-	-
3	4	0	-	-	-	-
4	-	0	-	-	-	-

Once this is working, you can try some automated tests: python grader.py 7

Upload your dendrogram.py file here:

Upload form is only available when connected

Merging clusters

Now we come to the core part of the algorithm: merging clusters. To this end, we use the method merge(e: Edge), which should merge the clusters containing the nodes of the edge e.

Keep in mind that:

Exercise 8

Complete the method merge in the Dendrogram class:

    def merge(self, e : Edge):
        """Merge clusters connected by the edge."""
        ...

Here’s a plan of attack:

  1. Find the representatives,
  2. choose the higher of the two,
  3. adjust parent, left, and down,
  4. update rank, and finally
  5. update height.
Testing and grading

Test your code. You should be able to reproduce the following:

>>> from TD.cloud import *
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> p1 = Point([0.0, 0.0], name='O')
>>> p2 = Point([1.0, 0.0], name='A')
>>> p3 = Point([1.0, 0.5], name='B')
>>> p4 = Point([1.0, 2.0], name='C')
>>> c = Cloud()
>>> c.add_point(p1)
>>> c.add_point(p2)
>>> c.add_point(p3)
>>> c.add_point(p4)
>>> g = graph_from_cloud(c)
>>> d = Dendrogram(g)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	-	0	-	-	-	-
1	-	0	-	-	-	-
2	-	0	-	-	-	-
3	-	0	-	-	-	-
>>> print(g)
Graph:
4 Nodes: 0: O, 1: A, 2: B, 3: C
6 Edges: [Edge(2, 1, 0.5), Edge(1, 0, 1.0), Edge(2, 0, 1.118033988749895), Edge(3, 2, 1.5), Edge(3, 1, 2.0), Edge(3, 0, 2.23606797749979)]
>>> d.merge(g.edges[0])
>>> print(d)
node	parent	rank	left	down	height	cluster
0	-	0	-	-	-	-
1	2	0	-	-	0.25	-
2	-	1	-	1	-	-
3	-	0	-	-	-	-
>>> d.merge(g.edges[1])
>>> print(d)
node	parent	rank	left	down	height	cluster
0	2	0	1	-	0.5	-
1	2	0	-	-	0.25	-
2	-	1	-	0	-	-
3	-	0	-	-	-	-

Now you can try some of our automated tests: python grader.py 8

Upload your dendrogram.py file here:

Upload form is only available when connected

Combining everything: building the dendrogram

Now we implement the clustering algorithm, following the approach outlined above: iterate over the edges and merge existing clusters.

Exercise 9

Complete the build method in the Dendrogram class:

    def build(self):
        """Merge along each edge in non-decreasing length order
        to build the dendrogram.
        """
        ...

The method’s docstring explains what to do!

Testing and grading

Test your code. First, let’s try the 1-dimensional example above:

>>> from TD.cloud import *
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> c = Cloud()
>>> for x in [0.0, 1.0, 3.0, 4.0, 9.0]:
...     c.add_point(Point([x]))
...
>>> g = graph_from_cloud(c)
>>>
>>> print(g)
Graph:
5 Nodes: 0: , 1: , 2: , 3: , 4:
10 Edges: [Edge(1, 0, 1.0), Edge(3, 2, 1.0), Edge(2, 1, 2.0), Edge(2, 0, 3.0), Edge(3, 1, 3.0), Edge(3, 0, 4.0), Edge(4, 3, 5.0), Edge(4, 2, 6.0), Edge(4, 1, 8.0), Edge(4, 0, 9.0)]
>>> d = Dendrogram(g)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	-	0	-	-	-	-
1	-	0	-	-	-	-
2	-	0	-	-	-	-
3	-	0	-	-	-	-
4	-	0	-	-	-	-
>>> d.build()
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	-
1	3	1	2	0	1.0	-
2	3	0	-	-	0.5	-
3	-	2	-	4	-	-
4	3	0	1	-	2.5	-

After this sanity check, you can try some automated tests: python grader.py 9

Upload your dendrogram.py file here:

Upload form is only available when connected

Part 6: Using dendrograms

Identifying the clusters below a given height

Our next goal is to allow to split the points into clusters at fixed height h (called \(C_h(p)\) in the notation of slide 78). The method set_clusters(self, h : float) in the Dendrogram class is supposed to compute the height-h cluster \(C_h(p)\) for every node \(p\), recording the results in the clusters list: that is, if We follow this algorithm:

  1. For each point (with index i), find the index c_h_i of the representative of the cluster containing this point, and set clusters[i] to c_h_i.
  2. Ensure the attribute total_clusters is equal to the number of clusters obtained this way.
Exercise 10

The method set_clusters is already implemented in Dendrogram, but it has a recursive subroutine _set_clusters(self, i : int, h : float) which you need to implement.

An example

For the figure above, at height 0.5, we have two non-singleton clusters—nodes 0 (point 0) and 1 (point 1) as well as nodes 2 (point 3) and 3 (point 4)—represented by nodes 1 and 3, respectively. Thus:

clusters == {1, 1, 3, 3, 4}

Similarly, at height 1, we have one non-singleton cluster, represented by node 3 (point 4), which is the result of a merge of the two clusters above. Hence,

clusters = {3, 3, 3, 3, 4}

For the discussion below, let us take h = 0.5 for the dendrogram in the above figure.

Now implement _set_clusters, as described above.

Testing and grading

Test your code on the examples above. First, let’s set up the cloud, graph, and dendrogram objects:

>>> from TD.cloud import *
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> c = Cloud()
>>> for x in [0.0, 1.0, 3.0, 4.0, 9.0]:
...     c.add_point(Point([x]))
...
>>> g = graph_from_cloud(c)
>>> d = Dendrogram(g)
>>> d.build()
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	-
1	3	1	2	0	1.0	-
2	3	0	-	-	0.5	-
3	-	2	-	4	-	-
4	3	0	1	-	2.5	-

Now, let’s cut at height 0.5. We expect cluster indices 1, 1, 3, 3, 4:

>>> d.set_clusters(0.5)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	1
1	3	1	2	0	1.0	1
2	3	0	-	-	0.5	3
3	-	2	-	4	-	3
4	3	0	1	-	2.5	4

Now let’s clear that clustering and cut at height 1. This time, we expect cluster indices 3, 3, 3, 3, 4:

>>> d.clear_clusters()
>>> d.set_clusters(1)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	3
1	3	1	2	0	1.0	3
2	3	0	-	-	0.5	3
3	-	2	-	4	-	3
4	3	0	1	-	2.5	4
>>> c = Cloud()
>>> for x in [0.0, 1.0, 3.0, 4.0, 9.0]:
...     c.add_point(Point([x]))
...
>>> g = graph_from_cloud(c)
>>> d = Dendrogram(g)
>>> d.build()
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	-
1	3	1	2	0	1.0	-
2	3	0	-	-	0.5	-
3	-	2	-	4	-	-
4	3	0	1	-	2.5	-
>>> d.set_clusters(0.5)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	1
1	3	1	2	0	1.0	1
2	3	0	-	-	0.5	3
3	-	2	-	4	-	3
4	3	0	1	-	2.5	4
>>> d.clear_clusters()
>>> d.set_clusters(1)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	3
1	3	1	2	0	1.0	3
2	3	0	-	-	0.5	3
3	-	2	-	4	-	3
4	3	0	1	-	2.5	4

Now you can try some automated tests: python grader.py 10

Upload your dendrogram.py file here:

Upload form is only available when connected

Counting non-singleton clusters

Now we want to count the number of non-singleton clusters: that is, the number of clusters containing more than one node. The method count_ns_clusters does this, caching the result (to avoid re-computing it if the value has not changed). However, the subroutine _count_ns_clusters which does the real work (when the result is not yet known) is not yet complete!

Exercise 11

Complete the method

    def _count_ns_clusters(self):
        """Count non-singleton clusters from scratch"""
        ...
Testing and grading

Test your code. First, let’s try some simple sanity checks based on the example above. Setting up as before:

>>> from TD.cloud import *
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> c = Cloud()
>>> for x in [0.0, 1.0, 3.0, 4.0, 9.0]:
...     c.add_point(Point([x]))
...
>>> g = graph_from_cloud(c)
>>> d = Dendrogram(g)
>>> d.build()

Now let’s try clustering below height 0.1: every cluster is a singleton.

>>> d.set_clusters(0.1)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	0
1	3	1	2	0	1.0	1
2	3	0	-	-	0.5	2
3	-	2	-	4	-	3
4	3	0	1	-	2.5	4
>>> d.count_ns_clusters()
0

Now let’s try cutting at height 0.5: this time, there are two clusters of two nodes each, and one singleton:

>>> d.clear_clusters()
>>> d.set_clusters(0.5)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	1
1	3	1	2	0	1.0	1
2	3	0	-	-	0.5	3
3	-	2	-	4	-	3
4	3	0	1	-	2.5	4
>>> d.count_ns_clusters()
2

Finally, let’s try cutting at 1. This time we get one cluster of four nodes, and one singleton:

>>> d.clear_clusters()
>>> d.set_clusters(1)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	3
1	3	1	2	0	1.0	3
2	3	0	-	-	0.5	3
3	-	2	-	4	-	3
4	3	0	1	-	2.5	4
>>> d.count_ns_clusters()
1

Now you can try some automated tests: python grader.py 11

Upload your dendrogram.py file here:

Upload form is only available when connected

Computing cluster heights

Suppose the list of clusters of height bounded by h has been computed using set_clusters(h). Now we want to determine the height of a given cluster. For example, in the figure above:

Exercise 12

Complete the method

    def get_cluster_height(self, c : int) -> float:
        """Compute the height of the cluster c.
        Assumes build() has been called and the clusters of height <= h
        have been computed using set_clusters(h).
        """
        ...

Note: get_cluster_height(c) should return 0 if c is the index of a node representing a singleton cluster. These nodes are the ones that do not have children.

Recall that

  1. By design, nodes at the same level of the data structure that represents all merged clusters have non-decreasing height from left to right.
  2. height[node] does not store the height of the cluster it represents: it stores the height of the cluster that it got merged into!
  3. Instead, this height is stored in one of its children, the rightmost of which is accessed via down.
Testing and grading

Test your code. Setting up as before:

>>> from TD.cloud import *
>>> from TD.graph import *
>>> from TD.dendrogram import *
>>> c = Cloud()
>>> for x in [0.0, 1.0, 3.0, 4.0, 9.0]:
...     c.add_point(Point([x]))
...
>>> g = graph_from_cloud(c)
>>> d = Dendrogram(g)
>>> d.build()

First, clustering with height 0.1:

>>> d.set_clusters(0.1)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	0
1	3	1	2	0	1.0	1
2	3	0	-	-	0.5	2
3	-	2	-	4	-	3
4	3	0	1	-	2.5	4
>>> [ d.get_cluster_height(i) for i in range(5) ]
[0, 0, 0, 0, 0]

Now, at height 0.5:

>>> d.clear_clusters()
>>> d.set_clusters(0.5)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	1
1	3	1	2	0	1.0	1
2	3	0	-	-	0.5	3
3	-	2	-	4	-	3
4	3	0	1	-	2.5	4
>>> [ d.get_cluster_height(i) for i in range(5) ]
[0, 0.5, 0, 0.5, 0]

Next, height 2.0:

>>> d.clear_clusters()
>>> d.set_clusters(2)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	3
1	3	1	2	0	1.0	3
2	3	0	-	-	0.5	3
3	-	2	-	4	-	3
4	3	0	1	-	2.5	4
>>> [ d.get_cluster_height(i) for i in range(5) ]
[0, 0, 0, 1.0, 0]
>>> d.clear_clusters()

And finally, height 2.5:

>>> d.set_clusters(2.5)
>>> print(d)
node	parent	rank	left	down	height	cluster
0	1	0	-	-	0.5	3
1	3	1	2	0	1.0	3
2	3	0	-	-	0.5	3
3	-	2	-	4	-	3
4	3	0	1	-	2.5	3
>>> [ d.get_cluster_height(i) for i in range(5) ]
[0, 0, 0, 2.5, 0]

Now you can try some automated tests: python grader.py 12

Upload your dendrogram.py file here:

Upload form is only available when connected

Running the clustering algorithm

Once you have done all the exercises above, you can run the Python script test_dendrogram.py to print lists of clusters at various depths. You can use it with

python ./test_dendrogram.py ./csv/iris.csv 4 150
python ./test_dendrogram.py ./csv/languages.csv