
CSC_43042_EP – \(k\)-Means

Before you begin

Download and unzip the archive CSC43042EP-td2-1-handin.zip.

It contains, in particular:

  • the grader script grader.py,
  • the data excerpt csv/samples_excerpt.csv,
  • the files kmeans_performance_check.py and kmeans_application.py used in the final exercise.

In the following, we assume that the directory TD/ is the root directory from which our Python files are run. When asked to create a file, please create it in the directory TD/ unless mentioned otherwise.

In order to get full credit, you need to solve and submit all exercises except for tasks 7 and 8, which are optional.

Upfront: Installing important libraries

We are going to use the library numpy, a very prominent Python library for numeric computations. Our main reason for using it here is its class ndarray, which is a more efficient and more compact alternative to Python’s native lists for storing numbers.

numpy may already be installed on your system. You can test this by running the following in Python.

>>> import numpy

If there are no errors, then numpy is already installed. If numpy is not already installed, then you can install it locally using the following command (recalling that we recommend version 1.26.4):

$ pip install --user --only-binary=:all: numpy==1.26.4

1 Modeling data points

We use a class to represent the geometric points in our datasets. To this end, create a new file kmeans.py, which will contain all the code to be graded in this TD and which you will extend with each exercise below. Copy the code below and paste it into kmeans.py. Recall that you need to have numpy installed, as explained in the previous section.

import numpy as np

class Point:
    """A point in a dataset.

    Attributes:
        d: int             -- dimension of the ambient space
        coords: np.ndarray -- coordinates of the point
        label: int = 0     -- label of the cluster the point is assigned to
    """

Importing numpy as np lets us abbreviate identifiers defined within the numpy module; for example, to create a numpy ndarray, we can now write np.array. (Warning: do not call np.ndarray() directly to create arrays! Use np.array, np.zeros, or np.full instead.)

For more information on numpy, refer to the documentation (and use the search bar on the top right).

Add an initializer method (__init__) to your new class Point:

    def __init__(self, d: int):
        assert d > 0, "The dimension needs to be positive."
        ...

Set the dimension d of the point to the value of the parameter d. Initialize coords as an array of size d with only 0.0s, and set label to 0.
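
For reference, one possible completion of the stub (a minimal sketch; np.zeros produces an array of floating-point zeros by default):

    def __init__(self, d: int):
        assert d > 0, "The dimension needs to be positive."
        self.d = d
        self.coords = np.zeros(d)  # d floating-point zeros
        self.label = 0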

In order to change the values of coords, add a method update_coords that copies the values from a given array to coords.

    def update_coords(self, new_coords: np.ndarray) -> None:
        """Copy the values of new_coords to coords."""
        ...

For this and the following method, try to use functions from np that operate directly on arrays!
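
For example, update_coords can delegate the copy to numpy (a sketch; np.copyto overwrites the existing array in place instead of rebinding the attribute):

    def update_coords(self, new_coords: np.ndarray) -> None:
        """Copy the values of new_coords to coords."""
        np.copyto(self.coords, new_coords)  # in-place elementwise copy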

Now add a method squared_dist(self, other: Point) -> float to your Point class that computes the squared Euclidean distance between this point and other.

    def squared_dist(self, other) -> float:
        """The square of the Euclidean distance to other."""
        ...

Warning: ndarray elements are typed! Although their type is flexible, numpy selects the most efficient representation. This can have unexpected side effects: notably, if you divide an entry of an ndarray of integers by an entry of another ndarray of integers, then the result of the division is also going to be an integer, not a floating-point number! Since we are interested in floating-point coordinates, make sure that you use floating-point literals (such as 0.0) and not integer literals (such as 0) when working with ndarrays here. This is not always necessary, but it is the safest option if you do not know precisely what you are doing. You have been warned.
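
With these caveats in mind, here is one possible sketch of squared_dist built on vectorized operations:

    def squared_dist(self, other) -> float:
        """The square of the Euclidean distance to other."""
        diff = self.coords - other.coords  # elementwise difference
        return float(np.dot(diff, diff))   # sum of squared differences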

Test your implementation by running

$ python ./grader.py 1

Once your program passes all the tests, upload your file kmeans.py.

The \(k\)-means algorithm

Recall that the \(k\)-means algorithm aims to partition a set of data points into clusters such that the total intracluster variance is minimal. We translate the notions involved into Python as follows. We represent point clouds using the class Cloud below, which you should copy into kmeans.py:

class Cloud:
    """A cloud of points for the k-means algorithm.

    Data attributes:
    - d: int              -- dimension of the ambient space
    - points: list[Point] -- list of points
    - k: int              -- number of centers
    - centers: np.ndarray -- array of centers
    """

    def __init__(self, d: int, k: int):
        self.d = d
        self.k = k
        self.points = []
        self.centers = np.array([Point(d) for _ in range(self.k)])

    def add_point(self, p: Point, label: int) -> None:
        """Copy p to the cloud, in cluster label."""
        new_point = Point(self.d)
        new_point.update_coords(p.coords)
        new_point.label = label
        self.points.append(new_point)

A partition is a mapping \(\sigma\) from data points to cluster numbers, which we also call labels. We will use the data member label in the class Point to represent this mapping.

2 Intracluster variance

We want to calculate the total intracluster variance, which is the quality measure for our partitions in the \(k\)-means algorithm. Recall that the intracluster variance is defined to be

\[ \frac{1}{n} \sum_{p\in P} \lVert p - c_{\sigma(p)}\rVert_2^2, \]

where \(P\) is the set of data points, \(n\) is its size, \(c_j\) is the center point of the \(j\)-th cluster, and \(\sigma\) is the partition function.

Add a new method intracluster_variance(self) -> float to your Cloud class, to calculate the intracluster variance.
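
A direct translation of the formula is possible (a sketch assuming, as in the initializer above, that centers holds Point objects and that the cloud is non-empty):

    def intracluster_variance(self) -> float:
        """Average squared distance of each point to its cluster center."""
        total = sum(p.squared_dist(self.centers[p.label]) for p in self.points)
        return total / len(self.points)  # assumes at least one point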

Test your implementation by running

$ python ./grader.py 2

Upload your file kmeans.py.

3 Voronoi partition

Each iteration of the \(k\)-means algorithm comprises two steps. The first updates the labels of the data points, that is, the partition into clusters, according to the nearest center point. More precisely, we change the label of a data point to the number of the cluster whose center point is at minimal distance from the data point. (In case of a tie, the cluster with the smallest index is selected.)

Add a new method set_voronoi_labels(self) -> int to your Cloud class, implementing this Voronoi partitioning algorithm, and returning the number of points that changed their label.
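
One possible sketch scans all centers for each point, using strict inequality so that ties go to the smallest index:

    def set_voronoi_labels(self) -> int:
        """Relabel each point by its nearest center; return the number of changes."""
        changes = 0
        for p in self.points:
            best = 0
            best_dist = p.squared_dist(self.centers[0])
            for j in range(1, self.k):
                dist = p.squared_dist(self.centers[j])
                if dist < best_dist:  # strict: ties keep the smaller index
                    best, best_dist = j, dist
            if p.label != best:
                p.label = best
                changes += 1
        return changes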

Test your implementation by running

$ python ./grader.py 3

Upload your file kmeans.py.

4 Centroids

The second step in each iteration of the \(k\)-means algorithm is to update the center point to the centroid (that is, the unweighted average) of the clusters’ data points. (If a cluster is empty, its center point is not changed.)

Add a new method set_centroid_centers(self) -> None to your Cloud class to perform this update. The time complexity of your method should be in \(O(d\cdot(n+k))\), where \(d\) is the dimension of the points, \(n\) is their number, and \(k\) is the number of clusters.
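
A sketch meeting this bound accumulates per-cluster sums and counts in one pass over the points, followed by one pass over the centers:

    def set_centroid_centers(self) -> None:
        """Move each center to the centroid of its cluster."""
        sums = np.zeros((self.k, self.d))  # coordinate sums per cluster
        counts = np.zeros(self.k)          # point counts per cluster
        for p in self.points:              # O(d * n)
            sums[p.label] += p.coords
            counts[p.label] += 1.0
        for j in range(self.k):            # O(d * k)
            if counts[j] > 0.0:            # empty clusters keep their center
                self.centers[j].update_coords(sums[j] / counts[j])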

Test your implementation by running

$ python ./grader.py 4

Upload your file kmeans.py.

5 Random-partition initialization

In the random-partition initialization method, we assign, uniformly at random, every data point to a cluster.

Add a new method init_random_partition(self) -> None to your Cloud class, implementing random-partition initialization, in other words:

  1. Assign each Point to a uniformly random cluster,
  2. Then call self.set_centroid_centers() to initialize the centers.

Note: you can generate a uniform random integer between 0 and k - 1 using random.randrange(k), which requires you to import random (at the top of your kmeans.py file).
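
Put together, a minimal sketch (assuming random has been imported as described):

    def init_random_partition(self) -> None:
        """Random-partition initialization."""
        for p in self.points:
            p.label = random.randrange(self.k)  # uniform in {0, ..., k - 1}
        self.set_centroid_centers()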

Test your implementation by running

$ python ./grader.py 5

Upload your file kmeans.py.

6 Putting it together

We now complete the \(k\)-means algorithm, using the set_voronoi_labels and set_centroid_centers methods you defined above.

Add a new method lloyd() to your Cloud class:

    def lloyd(self) -> None:
        """Lloyd’s algorithm.
        Assumes the clusters have already been initialized somehow.
        """
        ...

As noted in the docstring, you can assume the initialization has already been done before lloyd is called. Your method simply iterates the main loop of Lloyd’s algorithm until the clusters stop changing.
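
In its most compact form, this loop can be sketched as follows:

    def lloyd(self) -> None:
        """Lloyd’s algorithm.
        Assumes the clusters have already been initialized somehow.
        """
        # Alternate both steps until no point changes its label.
        while self.set_voronoi_labels() > 0:
            self.set_centroid_centers()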

Test your implementation by running

$ python ./grader.py 6

Upload your file kmeans.py.

Alternative initializations (optional)

In the remaining two exercises of this section, we treat two alternative initialization methods for the \(k\)-means algorithm.

7 Forgy initialization

The Forgy initialization method chooses the center point of each cluster uniformly at random from the set of points. However, we must ensure that each cluster has a different center point.

Add a new method init_forgy() to your Cloud class:

    def init_forgy(self) -> None:
        """Forgy's initialization: distinct centers are sampled
        uniformly at random from the points of the cloud.
        """
        ...

Remember to use .update_coords() when copying points. If p is a Point, then c = p merely binds the new name c to the same object as p: the points p and c are not just equal, they are the same point, and any later modification through either name will create unwanted side effects.
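
A minimal sketch using random.sample, which draws \(k\) pairwise-distinct indices, and update_coords to copy rather than alias (it assumes the cloud has at least \(k\) points and that its points are pairwise distinct, so distinct indices give distinct centers):

    def init_forgy(self) -> None:
        """Forgy's initialization: distinct centers are sampled
        uniformly at random from the points of the cloud.
        """
        # k distinct point indices; copy their coordinates into the centers.
        for j, idx in enumerate(random.sample(range(len(self.points)), self.k)):
            self.centers[j].update_coords(self.points[idx].coords)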

Test your implementation by running

$ python ./grader.py 7

Upload your file kmeans.py.

8 \(k\)-means++

Add a new method init_plusplus(self) -> None implementing the \(k\)-means++ initialization method (see the \(k\)-means slides).

To generate a random number between 0.0 and 1.0, you can use random.uniform(0.0, 1.0) (as before, you need to import random at the top of your kmeans.py file). Then, partition the interval \([0, 1]\) into sub-intervals whose lengths are proportional to the squared distances to the nearest center chosen so far.
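
Following this recipe, one possible sketch implements the interval trick by accumulating squared distances until a random threshold is passed:

    def init_plusplus(self) -> None:
        """k-means++ initialization (sketch)."""
        # First center: one data point chosen uniformly at random.
        self.centers[0].update_coords(random.choice(self.points).coords)
        for j in range(1, self.k):
            # Squared distance from each point to its nearest chosen center.
            dists = [min(p.squared_dist(self.centers[i]) for i in range(j))
                     for p in self.points]
            threshold = random.uniform(0.0, 1.0) * sum(dists)
            acc = 0.0
            for p, dist in zip(self.points, dists):
                acc += dist
                if acc >= threshold:  # p falls in the chosen sub-interval
                    self.centers[j].update_coords(p.coords)
                    break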

Test your implementation by running

$ python ./grader.py 8

Upload your file kmeans.py.

Putting it to the test

Finally, we want to test how well our algorithm performs on data that is unknown to us. The file csv/samples_excerpt.csv contains a part of a larger dataset (not revealed here!) that will be used to grade the performance of your clustering algorithm. Since the algorithm is deterministic up to its initialization, the task in this final exercise is to choose a good initialization.

This is a hard real-world task: there is no single optimal way to do it. You have to play around and test different settings.

You will be graded based on how well you classify the real dataset. The file kmeans_performance_check.py gives you a glimpse of the test file that is run on our server. It relies on the file kmeans_application.py.

As you can see in kmeans_application.py, you need to choose the number of clusters, the initial labels, and the coordinates of the initial centers. You are graded on how many points you ultimately label correctly. In the final dataset, there are 1200 points.

You are graded as follows:

  • If you have at most 120 correct points, you get a D.
  • If you have between 121 and 500 correct points, you get a C.
  • If you have between 501 and 800 correct points, you get a B.
  • If you have between 801 and 1199 points, you get an A.
  • If you get the labels of all 1200 points correct, you get an A+.

Good luck!

Once you are happy with your parameterization, upload your file kmeans_application.py.