CSC_43042_EP – \(k\)-Means
Before you begin
Download and unzip the archive CSC43042EP-td2-1-handin.zip.
It contains in particular:
- the file grader.py, which is used for the various test scripts,
- the file TD/kmeans_application.py, which is used for the very final exercise, and
- a dataset (in the folder csv/) as well as the file TD/kmeans_performance_check.py, used for some additional tests.
In the following, we assume that the directory TD/ is the root directory from which our Python files are run. When asked to create a file, please create it in the directory TD/ if not mentioned otherwise.
In order to get full credit, you need to solve and submit all exercises except for tasks 7 and 8, which are optional.
Log in to be able to upload your files.
Upfront: Installing important libraries
We are going to use the library numpy, which is a very prominent and important Python library for numeric computations. Our main reason for using this library here is that it defines the class ndarray, which is a more efficient implementation of Python's native lists, and it is also more compact when storing numbers.
numpy may already be installed on your system. You can test this by running the following in Python.
>>> import numpy
If there are no errors, then numpy is already installed. If numpy is not already installed, then you can install it locally using the following command (recalling that we recommend version 1.26.4):
$ pip install --user --only-binary=:all: numpy==1.26.4
1 Modeling data points
We use a class in order to represent the geometric points in our datasets. To this end, create a new file kmeans.py, which will contain all the code to be graded in this TD and which is expanded with each exercise below. Copy the code below and paste it into kmeans.py. Recall that you need to have numpy installed, as explained in the previous section.
import numpy as np

class Point:
    """A point in a dataset.

    Attributes:
        d: int             -- dimension of the ambient space
        coords: np.ndarray -- coordinates of the point
        label: int = 0     -- label of the cluster the point is assigned to
    """
Importing numpy as np allows us to save characters when referring to identifiers defined within the numpy module. Note that when we want to create an ndarray of numpy, we can now write np.array. (Warning: do not call np.ndarray() directly to create arrays! Use np.array, np.zeros, or np.full instead.)
For more information on numpy, refer to the documentation (and use the search bar on the top right).
Add an initializer method (__init__) to your new class Point:
    def __init__(self, d: int):
        assert d > 0, "The dimension needs to be positive."
        ...
Set the dimension d of the point to the value of the parameter d. Initialize coords as an array of size d with only 0.0s, and set label to 0.
In order to change the values of coords, add a method update_coords that copies the values from a given array to coords.
    def update_coords(self, new_coords: np.ndarray) -> None:
        """Copy the values of new_coords to coords."""
        ...
For this and the following method, try to use functions from np that operate directly on arrays!
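One way to copy values without rebinding the attribute is an in-place slice assignment (np.copyto would work too). A sketch on a bare array, with a free function standing in for the method:

```python
import numpy as np

def update_coords(coords: np.ndarray, new_coords: np.ndarray) -> None:
    """Copy the values of new_coords into coords, element by element."""
    coords[:] = new_coords  # in-place copy; the array object stays the same
```

Note that `coords = new_coords` would merely rebind the local name, while `coords[:] = new_coords` writes into the existing array.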
Now add a method squared_dist(self, other: Point) -> float to your Point class that computes the squared Euclidean distance between this point and other.
    def squared_dist(self, other) -> float:
        """The square of the Euclidean distance to other."""
        ...
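On bare coordinate arrays, the computation might look as follows (a sketch; np.dot of the difference with itself sums the squared differences in one call):

```python
import numpy as np

def squared_dist(u: np.ndarray, v: np.ndarray) -> float:
    """Squared Euclidean distance between two coordinate arrays."""
    diff = u - v
    return float(np.dot(diff, diff))  # sum over the squared differences
```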
Warning: ndarray elements are typed! Although their type is flexible, numpy selects the most efficient representation. This can have unexpected side effects: notably, if you create an ndarray from integer literals, its dtype is integer, and any floating-point value you then store into an entry is silently truncated to an integer (in-place division of such an array even raises an error). Since we are interested in floating-point coordinates, make sure that you use floating-point literals (such as 0.0) and not integer literals (such as 0) when working with ndarrays here. This is not always necessary, but it is the safest option if you do not know precisely what you are doing. You have been warned.
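A short illustration of the dtype pitfall (hypothetical snippet, not part of the TD code):

```python
import numpy as np

a = np.array([0, 0, 0])        # integer literals -> integer dtype
a[0] = 1.5                     # silently truncated on assignment!
assert a[0] == 1               # the fractional part is gone

b = np.array([0.0, 0.0, 0.0])  # float literals -> dtype float64
b[0] = 1.5
assert b[0] == 1.5             # stored exactly as intended
```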
Test your implementation by running
$ python ./grader.py 1
Once your program passes all the tests, upload your file kmeans.py:
The \(k\)-means algorithm
Recall that the \(k\)-means algorithm aims to partition a set of data points into clusters such that the total intracluster variance is minimal. We propose to translate the involved notions into Python in the following way. We represent point clouds using the following class Cloud, which you should copy into kmeans.py:
class Cloud:
    """A cloud of points for the k-means algorithm.

    Data attributes:
    - d: int              -- dimension of the ambient space
    - points: list[Point] -- list of points
    - k: int              -- number of centers
    - centers: np.ndarray -- array of centers
    """

    def __init__(self, d: int, k: int):
        self.d = d
        self.k = k
        self.points = []
        self.centers = np.array([Point(d) for _ in range(self.k)])

    def add_point(self, p: Point, label: int) -> None:
        """Copy p to the cloud, in cluster label."""
        new_point = Point(self.d)
        new_point.update_coords(p.coords)
        self.points.append(new_point)
        self.points[-1].label = label
A partition is a mapping \(\sigma\) from data points to cluster numbers, which we also call labels. We will use the data member label in the class Point to represent this mapping.
2 Intracluster variance
We want to calculate the total intracluster variance, which is the quality measure for our partitions in the \(k\)-means algorithm. Recall that the intracluster variance is defined to be
\[ \frac{1}{n} \sum_{p\in P} \lVert p - c_{\sigma(p)}\rVert_2^2, \]
where \(P\) is the set of data points, \(n\) is its size, \(c_j\) is the center point of the \(j\)-th cluster, and \(\sigma\) is the partition function.
Add a new method intracluster_variance(self) -> float to your Cloud class, to calculate the intracluster variance.
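On bare arrays, the formula above can be computed without an explicit loop (a sketch, with hypothetical argument names: points is an (n, d) array, labels holds the n cluster indices, and centers is a (k, d) array):

```python
import numpy as np

def intracluster_variance(points: np.ndarray,
                          labels: np.ndarray,
                          centers: np.ndarray) -> float:
    """Average squared distance of each point to the center of its cluster."""
    diffs = points - centers[labels]  # p - c_{sigma(p)} for every point p
    return float(np.sum(diffs ** 2) / len(points))
```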
Test your implementation by running
$ python ./grader.py 2
Upload your file kmeans.py here:
3 Voronoi partition
Each iteration of the \(k\)-means algorithm comprises two steps. The first updates the labels of the data points, that is, the partition into clusters, according to the nearest center point. More precisely, we change the label of a data point to the number of the cluster whose center point is at minimal distance from the data point. (In case of a tie, the cluster with the smallest index is selected.)
Add a new method set_voronoi_labels(self) -> int to your Cloud class, implementing this Voronoi partitioning algorithm and returning the number of points that changed their label.
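On bare arrays this assignment step might look as follows (a sketch; np.argmin already breaks ties toward the smallest index, matching the rule above):

```python
import numpy as np

def voronoi_labels(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Index of the nearest center for each point (ties -> smallest index)."""
    # (n, k) matrix of squared point-to-center distances, via broadcasting
    sq_dists = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.argmin(sq_dists, axis=1)
```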
Test your implementation by running
$ python ./grader.py 3
Upload your file kmeans.py here:
4 Centroids
The second step in each iteration of the \(k\)-means algorithm is to update the center point to the centroid (that is, the unweighted average) of the clusters’ data points. (If a cluster is empty, its center point is not changed.)
Add a new method set_centroid_centers(self) -> None to your Cloud class to perform this update. The time complexity of your method should be in \(O(d\cdot(n+k))\), where \(d\) is the dimension of the points, \(n\) is their number, and \(k\) is the number of clusters.
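A sketch on bare arrays that stays within the requested bound: one pass over the \(n\) points to accumulate per-cluster sums and counts, then one pass over the \(k\) clusters to divide; empty clusters keep their old center. (The function name and signature are just for illustration.)

```python
import numpy as np

def centroid_centers(points: np.ndarray, labels: np.ndarray,
                     centers: np.ndarray, k: int) -> np.ndarray:
    """Centroid of each non-empty cluster; empty clusters keep their center."""
    d = points.shape[1]
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for p, lab in zip(points, labels):  # one pass over the n points: O(d*n)
        sums[lab] += p
        counts[lab] += 1.0
    new_centers = centers.copy()
    for j in range(k):                  # one pass over the k clusters: O(d*k)
        if counts[j] > 0.0:
            new_centers[j] = sums[j] / counts[j]
    return new_centers
```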
Test your implementation by running
$ python ./grader.py 4
Upload your file kmeans.py here:
5 Random-partition initialization
In the random-partition initialization method, we assign, uniformly at random, every data point to a cluster.
Add a new method init_random_partition(self) -> None to your Cloud class, implementing random-partition initialization, in other words:
- assign each Point to a uniformly random cluster,
- then call self.set_centroid_centers() to initialize the centers.
Note: you can generate a uniform random integer between 0 and k - 1 using random.randrange(k), which requires you to import random (at the top of your kmeans.py file).
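The labeling step by itself is a one-liner; a sketch on bare values (the helper name is hypothetical):

```python
import random

def random_partition_labels(n: int, k: int) -> list[int]:
    """A uniformly random cluster label in {0, ..., k - 1} for each of n points."""
    return [random.randrange(k) for _ in range(n)]
```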
Test your implementation by running
$ python ./grader.py 5
Upload your file kmeans.py here:
6 Putting it together
We now complete the \(k\)-means algorithm, using the set_voronoi_labels and set_centroid_centers methods you defined above.
Add a new method lloyd() to your Cloud class:
    def lloyd(self) -> None:
        """Lloyd's algorithm.

        Assumes the clusters have already been initialized somehow.
        """
        ...
As noted in the docstring, you can assume the initialization has already been done before lloyd is called. Your method simply iterates the main loop of Lloyd's algorithm until the clusters stop changing.
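The loop itself is short. A self-contained sketch on bare arrays, with hypothetical helper functions standing in for the two Cloud methods:

```python
import numpy as np

def nearest_center_labels(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Voronoi step: index of the nearest center for each point."""
    sq = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.argmin(sq, axis=1)

def centroids(points: np.ndarray, labels: np.ndarray,
              centers: np.ndarray) -> np.ndarray:
    """Centroid step: average of each non-empty cluster."""
    new = centers.copy()
    for j in range(len(centers)):
        members = points[labels == j]
        if len(members) > 0:
            new[j] = members.mean(axis=0)
    return new

def lloyd(points: np.ndarray, centers: np.ndarray):
    """Alternate the two steps until no point changes its label."""
    labels = nearest_center_labels(points, centers)
    while True:
        centers = centroids(points, labels, centers)
        new_labels = nearest_center_labels(points, centers)
        if np.array_equal(new_labels, labels):  # clusters stopped changing
            return labels, centers
        labels = new_labels
```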
Test your implementation by running
$ python ./grader.py 6
Upload your file kmeans.py here:
Alternative initializations (optional)
In the remaining two exercises of this section, we will treat several alternative initialization methods for the \(k\)-means algorithm.
7 Forgy initialization
The Forgy initialization method chooses the center point of each cluster uniformly at random from the set of points. However, we must ensure that each cluster has a different center point.
Add a new method init_forgy() to your Cloud class:
    def init_forgy(self) -> None:
        """Forgy's initialization: distinct centers are sampled
        uniformly at random from the points of the cloud.
        """
        ...
Remember to use .update_coords() when copying points. If p is a Point, then c = p just binds the new name c to the same point as p. The points p and c are then not just identical: they are the same object, and any later modification to p or c will create unwanted side effects.
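Distinct centers can be obtained by sampling indices without replacement, which is exactly what random.sample does. A sketch on bare arrays, copying with .copy() for the aliasing reason above:

```python
import random
import numpy as np

def forgy_centers(points: np.ndarray, k: int) -> np.ndarray:
    """k distinct points of the cloud, sampled uniformly at random, as centers."""
    indices = random.sample(range(len(points)), k)      # k distinct indices
    return np.array([points[i].copy() for i in indices])  # copies, not aliases
```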
Test your implementation by running
$ python ./grader.py 7
Upload your file kmeans.py here:
8 \(k\)-means++
Add a new method init_plusplus(self) -> None implementing the \(k\)-means++ initialization method (see the \(k\)-means slides).
To generate a random number between 0.0 and 1.0, you can use random.uniform(0.0, 1.0) (as before, you need to import random at the top of your kmeans.py file). Then, partition the interval \([0, 1]\) into sub-intervals whose lengths are proportional to the points' squared distances to their nearest already-chosen center.
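A sketch of this D²-sampling loop on bare arrays (hypothetical helper, standing in for the Cloud method): the first center is chosen uniformly, and each further one is drawn with probability proportional to the squared distance to the nearest center chosen so far.

```python
import random
import numpy as np

def plusplus_centers(points: np.ndarray, k: int) -> np.ndarray:
    """k-means++ initialization by D^2 sampling."""
    n = len(points)
    centers = [points[random.randrange(n)].copy()]  # first center: uniform
    while len(centers) < k:
        # squared distance of every point to its nearest chosen center
        sq = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        # draw r uniformly in [0, total) and find the sub-interval it falls into
        r = random.uniform(0.0, 1.0) * np.sum(sq)
        cumulative, chosen = 0.0, n - 1
        for i in range(n):
            cumulative += sq[i]
            if r < cumulative:
                chosen = i
                break
        centers.append(points[chosen].copy())
    return np.array(centers)
```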
Test your implementation by running
$ python ./grader.py 8
Upload your file kmeans.py here:
Putting it to the test
Finally, we want to test how well our algorithm performs on data that is unknown to us. The file csv/samples_excerpt.csv contains a part of a larger dataset (not revealed here!) that will be used to grade the performance of your clustering algorithm. Since the algorithm is deterministic up to its initialization, the task in this final exercise is to choose a good initialization.
This is a hard real-world task: there is no optimal way of doing this. You have to play around and test different settings.
You will be graded based on how well you classify the real dataset.
The file kmeans_performance_check.py gives you a glimpse of the test file that is run on our server. It relies on the file kmeans_application.py.
As you can see in kmeans_application.py, you need to choose the number of clusters, the initial labels, and the coordinates of the initial centers. You are graded on how many points you ultimately label correctly. In the final dataset, there are 1200 points.
You are graded as follows:
- If you have at most 120 correct points, you get a D.
- If you have between 121 and 500 correct points, you get a C.
- If you have between 501 and 800 correct points, you get a B.
- If you have between 801 and 1199 points, you get an A.
- If you get the labels of all 1200 points correct, you get an A+.
Good luck!
Once you are happy with your parameterization, upload your file kmeans_application.py: