INF442 - \(k\)-Means
Before you begin
Download and unzip the archive `INF442-td3-1-handin.zip`. It contains in particular:
- the file `kmeans.cpp`, which you will have to complete and/or modify,
- the file `main.cpp`, which is used for the various test scripts, and
- a `Makefile` that you can use to compile the various tests.
In order to get full credit, you need to solve and submit all exercises up to Section 3 (Section 4 is optional).
1. (Quiz) Real-world dataset
The file `cereals.csv` in the `csv/` directory contains various data about a selection of commercially available cereals. Among the data is nutritional information, but also information about the shelf on which the products are placed, the package size, and the average customer rating.
First, install the necessary libraries:
$ pip3 install --user --only-binary=:all: pandas scikit-learn matplotlib
Then use the Python script `kmeans.py` in the `quiz/` directory to analyze the data set and answer the questions in the quiz for TD3 on Moodle. The script takes four arguments: `csv_file`, `lc`, `nb_clusters`, and `ld`. The first argument determines which `.csv` file to open, in our case `cereals.csv` (keep in mind that the file is in a different directory than the Python script). In the second argument, `lc`, you can give a comma-separated list of variables to use during the clustering, or `:` if all variables should be used. The third argument, `nb_clusters`, defines the number of clusters that should be used. The fourth argument, `ld`, is only used for the visualization, not for the computation of the results. It is a comma-separated list of 2 or 3 variables to be used for the plot (which is then 2- or 3-dimensional). Running the script without arguments prints a message detailing the program’s usage. Example usage:
$ python3 kmeans.py ../csv/cereals.csv : 2 protein,weight,cups
After execution, an image file `result.png` should appear in the current directory. This file contains the generated plots.
Let us be more specific about what the program does. In the file `kmeans.py`, after reading in the `.csv` file, we normalize each variable so that it takes values between 0 and 1:

```python
scaler = MinMaxScaler()
ndata = scaler.fit_transform(data)
```
We then call Lloyd’s algorithm for \(k\)-means clustering. First, a `KMeans` object is constructed, specifying the number of clusters and the initial state. Then, the clustering is done in the method called `fit`, which takes the actual data set as its parameter. Here is the corresponding code:

```python
cl = KMeans(n_clusters=nb_clusters, random_state=0, init='k-means++', n_init='auto').fit(ndata)
```
We then calculate the data needed for the elbow heuristic. To this end, we should construct multiple `KMeans` objects, with different values of the `n_clusters` parameter. Question 4 in the quiz asks you to implement this step. Afterward, we call the `fit` method with the same data on the various `KMeans` objects. Finally, we access the `inertia_` attribute of the result of the `fit` method to get the (non-normalized) intracluster variance of the clustering for each choice of the number of clusters. Here is the corresponding code:

```python
max_nb_clusters = 10
km = []  # Add the appropriate KMeans objects
score = [km[i].fit(ndata).inertia_ for i in range(max_nb_clusters-1)]
```
2. Classes in C++
2.1 Using the String class
To get started with classes in C++, we look at the example of the string class. This class encapsulates raw C-style strings, i.e., `char *`, and provides a range of functionality via its member functions. Check out its documentation to learn about the member functions used in the following program. In addition to the member functions, the bracket operator `[]` makes it possible to access or modify the character at position `i` in a string `str`, by writing `str[i]`.
```cpp
#include <string>
#include <iostream>

int main()
{
    std::string str1 = "To be or not to be, that is the question";
    std::string str2 = "only ";
    std::string str3 = str1.substr(6, 12);

    str1.insert(32, str2);
    str1.replace(str1.find("to be", 0), 5, "to jump");
    str1.erase(9, 4);
    str1[16] = ':';

    std::cout << str1 << std::endl;
    for (int i = 0; i < str3.length(); i++)
        std::cout << str3[i] << std::endl;

    return 0;
}
```
What is the output of the above program? Compile and run it to check whether your prediction was correct.
2.2 Rolling your own class
For your first class in C++, let us implement a class representing geometric points.
Missing from the file `kmeans.cpp` is the class `point`, which you will implement now. Copy and paste the following piece of code into `kmeans.cpp`, just below the include directives.
```cpp
class point
{
public:
    static int d;
    double *coords;
    int label;
};

int point::d;
```
Note the static data member `d`. In C++, contrary to Java, you have to declare and define static data members separately. The declaration of `d` is done via the line `static int d;` inside the class. The definition is done outside of the class, in the last line of the above code snippet.
Add the following member functions to the `point` class:
- an empty constructor (i.e., without any arguments) that initializes `coords` to an array of size `d` and `label` to `0`; furthermore, it initializes all elements of the array to `0.0`.
- a destructor that deletes the array.
- a member function `void print() const` that prints the point’s coordinates to `std::cout`, all on one line, with a tab character (`\t`) between entries and no tab at the end of the line before the end-of-line character (`\n`).
- a member function `double squared_dist(const point &q) const` that takes a reference to a second point and returns the squared Euclidean distance from the current point to the point `q`.
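Putting these requirements together, a minimal sketch of such a class could look as follows. The member bodies below are one possible implementation, not necessarily the exact one expected by the grader, and the dimension `2` at the bottom is chosen purely for illustration:

```cpp
#include <iostream>

class point
{
public:
    static int d;       // dimension, shared by all points
    double *coords;
    int label;

    // Empty constructor: d coordinates, all set to 0.0, label set to 0
    point() : coords(new double[d]), label(0)
    {
        for (int i = 0; i < d; i++)
            coords[i] = 0.0;
    }

    // Destructor: release the coordinate array
    ~point() { delete[] coords; }

    // Print coordinates separated by tabs, with no tab before the newline
    void print() const
    {
        for (int i = 0; i < d; i++)
            std::cout << coords[i] << (i + 1 < d ? '\t' : '\n');
    }

    // Squared Euclidean distance between this point and q
    double squared_dist(const point &q) const
    {
        double sum = 0.0;
        for (int i = 0; i < d; i++) {
            double diff = coords[i] - q.coords[i];
            sum += diff * diff;
        }
        return sum;
    }
};

int point::d = 2; // example dimension, for illustration only
```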
Watch out! The lifetime of instances in C++ follows a paradigm known as RAII, which means that the lifetime of an instance is bound to its scope. Most importantly, if you declare a variable inside a scope (i.e., a pair of curly brackets), it is removed once this scope is exited. For objects such as `point`, this means that the destructor is called. If one is not aware of this, it can lead to drastic unwanted side effects, especially when pointers are involved (like `coords`).
Imagine you have already declared an instance of `point`, called `p`, and that in another scope you copy this value to a new point `q`. Then `p.coords` and `q.coords` point to the identical address (since only the address is copied, not the data; this is known as a shallow copy). Once `q` leaves its scope, its destructor is called, so the memory behind `q.coords` is freed. Since `p.coords` still points to this address, bad things are bound to happen: if we dereference `p.coords`, we access memory that is no longer maintained by us and can thus contain arbitrary data. This results in undefined behavior, that is, we no longer have any guarantees about our program. Even worse, once `p` itself goes out of scope, its destructor is called, which tries to free the memory behind `p.coords` a second time. Since this memory was already freed, we receive an error (mentioning `double free`).
In order to avoid this, one usually uses a copy constructor, which copies all of the data so that if the copy goes out of scope, the original instance is left intact. In cases where this is too expensive, one should use references or pointers instead. Be careful: you have been warned!
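As an illustration (not required for the grader), a deep-copying copy constructor for a class like `point` could be sketched as follows; the class is repeated in minimal form so the example is self-contained:

```cpp
// Minimal self-contained sketch of a deep copy, mirroring the point
// class above, reduced to what the copy constructor needs.
class point
{
public:
    static int d;
    double *coords;
    int label;

    point() : coords(new double[d]), label(0)
    {
        for (int i = 0; i < d; i++)
            coords[i] = 0.0;
    }

    // Deep copy: allocate a fresh array and copy the values, so the
    // copy and the original do not share memory.
    point(const point &other) : coords(new double[d]), label(other.label)
    {
        for (int i = 0; i < d; i++)
            coords[i] = other.coords[i];
    }

    ~point() { delete[] coords; }
};

int point::d = 2; // example dimension, for illustration only
```

With this constructor, destroying the copy no longer frees memory that the original still uses.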
You can test your implementation by typing the commands `make grader` then `./grader 1` in the terminal. The command `make grader` launches the compilation, using the makefile that we have provided. The command `./grader 1` runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
3. The \(k\)-means algorithm
Recall that the \(k\)-means algorithm seeks to partition a set of data points into clusters such that the total intracluster variance is minimal. We propose to translate the involved notions into C++ in the following way, implemented in the class `cloud`, which represents a point cloud:
- The data points will be instances of your class `point` from the previous exercise.
- The dataset is stored as an array of data points, to which the class provides a pointer `point *points`. The number of data points is stored in the member variable `n`.
- A partition is stored as a mapping \(\sigma\) from data points to cluster numbers, which we also call labels. In practice, we will use the data member `label` in the class `point` to represent this mapping.
- For every cluster, we store its center point. We thus store a second array of (center) points, to which the class provides a pointer `point *centers`. The number of clusters (and hence also of cluster centers) is stored in the member variable `k`.
We have already set up the basic data structures described above in the provided code. Your task in the remaining exercises is to implement the \(k\)-means algorithm and some extensions in `kmeans.cpp`.
3.1 Intracluster variance
We want to calculate the total intracluster variance, which is the quality measure for our partitions in the \(k\)-means algorithm. Recall that the intracluster variance is defined as
\[ \frac{1}{n} \sum_{p\in P} \lVert p - c_{\sigma(p)}\rVert_2^2, \]
where \(P\) is the set of data points, \(n\) is its size, \(c_j\) is the center point of the \(j\)-th cluster, and \(\sigma\) is the partition function.
Complete the member function `double intracluster_variance() const`, which calculates the intracluster variance.
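As a hint on the structure of the computation, the definition above can be accumulated term by term. The sketch below uses plain `std::vector`s instead of the provided `cloud` class so that it is self-contained; the function name and parameters are illustrative only:

```cpp
#include <vector>
#include <cstddef>

// Sketch: total intracluster variance of a labeled point set.
// points[i]  : coordinates of the i-th data point
// centers[j] : coordinates of the j-th cluster center
// labels[i]  : cluster number sigma(i) of the i-th point
double intracluster_variance(const std::vector<std::vector<double>> &points,
                             const std::vector<std::vector<double>> &centers,
                             const std::vector<int> &labels)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < points.size(); i++) {
        const std::vector<double> &c = centers[labels[i]];
        for (std::size_t m = 0; m < points[i].size(); m++) {
            double diff = points[i][m] - c[m];
            sum += diff * diff; // squared Euclidean distance, coordinate by coordinate
        }
    }
    return sum / points.size(); // the 1/n factor from the definition
}
```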
You can test your implementation by typing the commands `make grader` then `./grader 2` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
3.2 Voronoi partition
The first of the two steps in the iterations of the \(k\)-means algorithm is to update the labels of the data points, i.e., the partition into clusters, according to the nearest center point. More precisely, we change the label of a data point to the cluster number that has the center point at minimal distance to the data point. In case of a tie, the cluster with the smallest index is selected.
Implement this Voronoi partitioning method in the member function `int set_voronoi_labels()`, returning the number of points that have had their label changed.
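The logic of this step can be sketched as follows, again on plain vectors rather than the `cloud` class (names hypothetical). Note how the strict comparison `<` makes ties go to the smaller cluster index:

```cpp
#include <vector>
#include <cstddef>

// Squared Euclidean distance between two coordinate vectors
static double sqdist(const std::vector<double> &a, const std::vector<double> &b)
{
    double s = 0.0;
    for (std::size_t m = 0; m < a.size(); m++)
        s += (a[m] - b[m]) * (a[m] - b[m]);
    return s;
}

// Sketch: relabel every point with the index of its nearest center and
// return the number of labels that changed.
int set_voronoi_labels(const std::vector<std::vector<double>> &points,
                       const std::vector<std::vector<double>> &centers,
                       std::vector<int> &labels)
{
    int changed = 0;
    for (std::size_t i = 0; i < points.size(); i++) {
        int best = 0;
        double best_d = sqdist(points[i], centers[0]);
        for (std::size_t j = 1; j < centers.size(); j++) {
            double dj = sqdist(points[i], centers[j]);
            if (dj < best_d) { // strict: ties keep the smaller index
                best_d = dj;
                best = (int)j;
            }
        }
        if (labels[i] != best) {
            labels[i] = best;
            changed++;
        }
    }
    return changed;
}
```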
You can test your implementation by typing the commands `make grader` then `./grader 3` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
3.3 Centroids
The second part of each iteration of the \(k\)-means algorithm is to update each cluster's center point to the centroid (i.e., the unweighted average) of the cluster's data points.
Complete the member function `void set_centroid_centers()` that does this update. If a cluster is empty, its center point is kept at its previous position. The time complexity of your function should be in \(O(d\cdot(n+k))\), where \(d\) is the dimension of the points and \(n\) is their number.
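One way to meet the \(O(d\cdot(n+k))\) bound is a single accumulation pass over the \(n\) points followed by one pass over the \(k\) centers, sketched below on plain vectors (names hypothetical):

```cpp
#include <vector>
#include <cstddef>

// Sketch: move each non-empty cluster's center to the centroid of its
// points. One pass over the n points, one pass over the k centers:
// O(d*(n+k)) in total.
void set_centroid_centers(const std::vector<std::vector<double>> &points,
                          const std::vector<int> &labels,
                          std::vector<std::vector<double>> &centers)
{
    std::size_t k = centers.size();
    std::size_t d = centers[0].size();
    std::vector<std::vector<double>> sum(k, std::vector<double>(d, 0.0));
    std::vector<int> count(k, 0);

    // Accumulate per-cluster coordinate sums and point counts
    for (std::size_t i = 0; i < points.size(); i++) {
        count[labels[i]]++;
        for (std::size_t m = 0; m < d; m++)
            sum[labels[i]][m] += points[i][m];
    }
    // Replace each non-empty cluster's center by the average
    for (std::size_t j = 0; j < k; j++) {
        if (count[j] == 0)
            continue; // empty cluster: keep the previous center
        for (std::size_t m = 0; m < d; m++)
            centers[j][m] = sum[j][m] / count[j];
    }
}
```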
You can test your implementation by typing the commands `make grader` then `./grader 4` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
3.4 Random-partition initialization
In the random-partition initialization method, we assign, uniformly at random, every data point to a cluster.
Complete the member function `void init_random_partition()` that implements the random-partition initialization method. That is, each point should be assigned to a uniformly random cluster, followed by a call to `set_centroid_centers()`. To generate a random integer between `0` and `k-1`, you can use `rand() % k`.
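The label-assignment part of this method can be sketched as follows (the subsequent `set_centroid_centers()` call is omitted here; the function name and parameters are hypothetical):

```cpp
#include <vector>
#include <cstdlib>
#include <cstddef>

// Sketch: give every point a label drawn uniformly from {0, ..., k-1}.
// In the cloud class this would be followed by set_centroid_centers().
void init_random_partition(std::vector<int> &labels, int k)
{
    for (std::size_t i = 0; i < labels.size(); i++)
        labels[i] = std::rand() % k;
}
```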
You can test your implementation by typing the commands `make grader` then `./grader 5` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
3.5 Putting it together
We can now combine the previous two functions to finish the implementation of the \(k\)-means algorithm. Starting from an initialization, which you can assume is already done and which you do not need to call inside the function, keep iterating the main loop of Lloyd’s algorithm until the clusters stop changing. Implement this procedure in the member function `void lloyd()`.
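The overall control flow is short once the two steps exist. The sketch below bundles minimal one-dimensional versions of both steps so that it runs standalone; all names are hypothetical and the real implementation would of course reuse your `cloud` member functions:

```cpp
#include <vector>
#include <cstddef>

// Minimal 1-D version of the Voronoi step: returns number of label changes
static int voronoi_step(const std::vector<double> &pts,
                        const std::vector<double> &ctr,
                        std::vector<int> &lab)
{
    int changed = 0;
    for (std::size_t i = 0; i < pts.size(); i++) {
        int best = 0;
        double bd = (pts[i] - ctr[0]) * (pts[i] - ctr[0]);
        for (std::size_t j = 1; j < ctr.size(); j++) {
            double dj = (pts[i] - ctr[j]) * (pts[i] - ctr[j]);
            if (dj < bd) { bd = dj; best = (int)j; }
        }
        if (lab[i] != best) { lab[i] = best; changed++; }
    }
    return changed;
}

// Minimal 1-D version of the centroid step
static void centroid_step(const std::vector<double> &pts,
                          const std::vector<int> &lab,
                          std::vector<double> &ctr)
{
    std::vector<double> sum(ctr.size(), 0.0);
    std::vector<int> count(ctr.size(), 0);
    for (std::size_t i = 0; i < pts.size(); i++) {
        sum[lab[i]] += pts[i];
        count[lab[i]]++;
    }
    for (std::size_t j = 0; j < ctr.size(); j++)
        if (count[j] > 0)
            ctr[j] = sum[j] / count[j];
}

// Lloyd's algorithm: alternate the two steps until no label changes.
// Initialization is assumed to have been done by the caller.
void lloyd(const std::vector<double> &pts,
           std::vector<int> &lab,
           std::vector<double> &ctr)
{
    while (voronoi_step(pts, ctr, lab) > 0)
        centroid_step(pts, lab, ctr);
}
```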
You can test your implementation by typing the commands `make grader` then `./grader 6` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
4. Alternative initializations (optional)
In the remaining exercises, we will treat several alternative initialization methods for the \(k\)-means algorithm.
4.1 Forgy initialization
In the Forgy initialization method, we choose the center point of each cluster uniformly at random from the set of points. However, we ensure that each cluster has a different center point.
Write a member function `void init_forgy()` that implements the Forgy initialization method.
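The distinctness requirement can be handled, for example, by rejection sampling: draw a random point index for each center and redraw whenever that index was already taken. A self-contained sketch on plain data (the function name is hypothetical, and it returns the chosen indices rather than setting centers):

```cpp
#include <vector>
#include <cstdlib>
#include <cstddef>

// Sketch: pick k distinct data-point indices, uniformly at random, to
// serve as the initial centers. Redraw whenever an index repeats.
std::vector<int> init_forgy_indices(int n, int k)
{
    std::vector<bool> taken(n, false);
    std::vector<int> chosen;
    while ((int)chosen.size() < k) {
        int i = std::rand() % n;
        if (!taken[i]) {       // reject duplicates
            taken[i] = true;
            chosen.push_back(i);
        }
    }
    return chosen;
}
```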
You can test your implementation by typing the commands `make grader` then `./grader 7` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
4.2 K-means++
Write a member function `void init_plusplus()` that implements the \(k\)-means++ initialization method (see slide 12, p. 53 of the Data Science booklet). To generate a random number between `0` and `1`, you can use `(double)rand() / RAND_MAX`. Then, partition the interval \([0,1]\) into sub-intervals whose lengths are the relative squared distances to the nearest cluster center.
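The interval-partition step can be sketched as follows: given the squared distance of every point to its nearest already-chosen center, draw a uniform \(u \in [0,1]\) and walk through the points until the accumulated relative weight exceeds \(u\). The function name and parameters below are hypothetical:

```cpp
#include <vector>
#include <cstddef>

// Sketch of one k-means++ selection step: choose the index of the next
// center, where dist2[i] is the squared distance of point i to its
// nearest already-chosen center and u is a uniform draw from [0,1].
// Point i is picked with probability dist2[i] / sum(dist2).
int pick_next_center(const std::vector<double> &dist2, double u)
{
    double total = 0.0;
    for (std::size_t i = 0; i < dist2.size(); i++)
        total += dist2[i];

    // Walk through the sub-intervals of [0,1] of lengths dist2[i]/total
    double acc = 0.0;
    for (std::size_t i = 0; i < dist2.size(); i++) {
        acc += dist2[i] / total;
        if (u < acc)
            return (int)i;
    }
    return (int)dist2.size() - 1; // guard against rounding when u is near 1
}
```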
You can test your implementation by typing the commands `make grader` then `./grader 8` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.