INF442 - \(k\)-Means
Before you begin
Download and unzip the archive `INF442-td3-1-handin.zip`. It contains in particular:
- the file `kmeans.cpp`, which you will have to complete and/or modify,
- the file `main.cpp`, which is used for the various test scripts, and
- a `Makefile` that you can use to compile the various tests.
In order to get full credit, you need to solve and submit all exercises up to Section 3 (Section 4 is optional).
1. (Quiz) Real-world dataset
The file `cereals.csv` in the `csv/` directory contains various data about a selection of commercially available cereals. Among the data is nutritional information, but also information about the shelf on which the products are placed, the package size, and the average customer rating.
First, install the necessary libraries:
$ pip3 install --user --only-binary=:all: pandas scikit-learn matplotlib
Then use the Python script `kmeans.py` in the `quiz/` directory to analyze the data set and answer the questions in the quiz for TD3 on Moodle. The script takes four arguments: `csv_file`, `lc`, `nb_clusters`, and `ld`. The first argument determines which `.csv` file to open, in our case `cereals.csv` (keep in mind that the file is in a different directory than the Python script). In the second argument, `lc`, you can give a comma-separated list of variables to use during the clustering, or `:` if all variables should be used. The third argument, `nb_clusters`, defines the number of clusters that should be used. The fourth argument, `ld`, is only used for the visualization, not for the computation of the results. It is a comma-separated list of 2 or 3 variables to be used for the plot (which is then 2- or 3-dimensional). Running the script without arguments prints a message detailing the program’s usage. Example usage:
$ python3 kmeans.py ../csv/cereals.csv : 2 protein,weight,cups
After execution, an image file `result.png` should appear in the current directory. This file contains the generated plots.
Let us be more specific about what the program does. In the file `kmeans.py`, after reading in the `.csv` file, we normalize each variable so that it takes values between 0 and 1:

```python
scaler = MinMaxScaler()
ndata = scaler.fit_transform(data)
```
We then call Lloyd’s algorithm for \(k\)-means clustering. First, a `KMeans` object is constructed, specifying the number of clusters and the initial state. Then, the clustering is done in the method called `fit`, which takes the actual data set as its parameter. Here is the corresponding code:

```python
cl = KMeans(n_clusters=nb_clusters, random_state=0, init='k-means++', n_init='auto').fit(ndata)
```
We then calculate the data needed for the elbow heuristic. To this end, we should construct multiple `KMeans` objects, with different values of the `n_clusters` parameter. Question 4 in the quiz asks you to implement this step. Afterward, we call the `fit` method with the same data on the various `KMeans` objects. Finally, we access the `inertia_` attribute of the result of the `fit` method to get the (non-normalized) intracluster variance of the clustering for each choice of the number of clusters. Here is the corresponding code:

```python
max_nb_clusters = 10
km = []  # Add the appropriate KMeans objects
score = [km[i].fit(ndata).inertia_ for i in range(max_nb_clusters-1)]
```
2. Classes in C++
2.1 Using the String class
To get started with classes in C++, we look at the example of the string class. This class encapsulates raw C-style strings, i.e., `char *`, and provides a range of functionality via its member functions. Check out its documentation to learn about the member functions used in the following program. In addition to the member functions, the bracket operator `[]` makes it possible to access or modify the character at position `i` in a string `str`, by writing `str[i]`.
```cpp
#include <string>
#include <iostream>

int main()
{
    std::string str1 = "To be or not to be, that is the question";
    std::string str2 = "only ";
    std::string str3 = str1.substr(6, 12);

    str1.insert(32, str2);
    str1.replace(str1.find("to be", 0), 5, "to jump");
    str1.erase(9, 4);
    str1[16] = ':';

    std::cout << str1 << std::endl;
    for (int i = 0; i < str3.length(); i++)
        std::cout << str3[i] << std::endl;

    return 0;
}
```
What is the output of the above program? Compile and run it to check whether your prediction was correct.
2.2 Rolling your own class
For your first class in C++, let us implement a class representing geometric points.
Missing from the file `kmeans.cpp` is the class `point`, which you will implement now. Copy and paste the following piece of code into `kmeans.cpp`, just below the include directives.
```cpp
class point
{
public:
    static int d;
    double *coords;
    int label;
};

int point::d;
```
Note the static data member `d`. In C++, contrary to Java, you have to declare and define static data members separately. The declaration of `d` is done via the line `static int d;` inside the class. The definition is done outside of the class, in the last line of the above code snippet.
Add the following member functions to the `point` class:
- an empty constructor (i.e., without any arguments) that initializes `coords` to an array of size `d` and `label` to `0`; furthermore, it initializes all elements of the array to `0.0`.
- a destructor that deletes the array.
- a member function `void print() const` that prints the point’s coordinates to `std::cout`, all on one line, with a tab character (`\t`) between entries and no tab at the end of the line before the end-of-line character (`\n`).
- a member function `double squared_dist(const point &q) const` that takes a reference to a second point and returns the squared Euclidean distance from the current point to the point `q`.
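Putting these requirements together, a minimal sketch of such a class could look as follows. The member bodies below are one possible implementation, not necessarily the exact one expected by the grader, and the dimension `2` at the bottom is chosen purely for illustration:

```cpp
#include <iostream>

class point
{
public:
    static int d;       // dimension, shared by all points
    double *coords;
    int label;

    // Empty constructor: d coordinates, all set to 0.0, label set to 0
    point() : coords(new double[d]), label(0)
    {
        for (int i = 0; i < d; i++)
            coords[i] = 0.0;
    }

    // Destructor: release the coordinate array
    ~point() { delete[] coords; }

    // Print coordinates separated by tabs, with no tab before the newline
    void print() const
    {
        for (int i = 0; i < d; i++)
            std::cout << coords[i] << (i + 1 < d ? '\t' : '\n');
    }

    // Squared Euclidean distance between this point and q
    double squared_dist(const point &q) const
    {
        double sum = 0.0;
        for (int i = 0; i < d; i++) {
            double diff = coords[i] - q.coords[i];
            sum += diff * diff;
        }
        return sum;
    }
};

int point::d = 2; // example dimension, for illustration only
```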
Watch out! The lifetime of instances in C++ follows a paradigm known as RAII, which means that the lifetime of an instance is bound to its scope. Most importantly, if you declare a variable inside a scope (i.e., a pair of curly brackets), it is removed once this scope is exited. For objects such as `point`, this means that the destructor is called. If one is not aware of this, it can lead to drastic unwanted side effects, especially when pointers are involved (like `coords`).
Imagine you have already declared an instance of `point`, called `p`, and that in another scope you copy this value to a new point `q`. Then `p.coords` and `q.coords` point to the identical address (since only the address is copied, not the data; this is known as a shallow copy). Once `q` leaves its scope, its destructor is called, so the memory behind `q.coords` is freed. Since `p.coords` still points to this address, bad things are bound to happen: if we dereference `p.coords`, we access memory that is no longer maintained by us and can thus contain arbitrary data. This results in undefined behavior, that is, we no longer have any guarantees about our program. Even worse, once `p` itself goes out of scope, its destructor is called, which tries to free the memory behind `p.coords` a second time. Since this memory was already freed, we receive an error (mentioning `double free`).
In order to avoid this, one usually uses a copy constructor, which copies all of the data so that if the copy goes out of scope, the original instance is left intact. In cases where this is too expensive, one should use references or pointers instead. Be careful: you have been warned!
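As an illustration (not required for the grader), a deep-copying copy constructor for a class like `point` could be sketched as follows; the class is repeated in minimal form so the example is self-contained:

```cpp
// Minimal self-contained sketch of a deep copy, mirroring the point
// class above, reduced to what the copy constructor needs.
class point
{
public:
    static int d;
    double *coords;
    int label;

    point() : coords(new double[d]), label(0)
    {
        for (int i = 0; i < d; i++)
            coords[i] = 0.0;
    }

    // Deep copy: allocate a fresh array and copy the values, so the
    // copy and the original do not share memory.
    point(const point &other) : coords(new double[d]), label(other.label)
    {
        for (int i = 0; i < d; i++)
            coords[i] = other.coords[i];
    }

    ~point() { delete[] coords; }
};

int point::d = 2; // example dimension, for illustration only
```

With this constructor, destroying the copy no longer frees memory that the original still uses.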
You can test your implementation by typing the commands `make grader` then `./grader 1` in the terminal. The command `make grader` launches the compilation, using the makefile that we have provided. The command `./grader 1` runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
3. The \(k\)-means algorithm
Recall that the \(k\)-means algorithm seeks to partition a set of data points into clusters such that the total intracluster variance is minimal. We propose to translate the involved notions into C++ in the following way, implemented in the class `cloud`, which represents a point cloud:
- The data points will be instances of your class `point` from the previous exercise.
- The dataset is stored as an array of data points, to which the class provides a pointer `point *points`. The number of data points is stored in the member variable `n`.
- A partition is stored as a mapping \(\sigma\) from data points to cluster numbers, which we also call labels. In practice, we will use the data member `label` in the class `point` to represent this mapping.
- For every cluster, we store its center point. We thus store a second array of (center) points, to which the class provides a pointer `point *centers`. The number of clusters (and hence also of cluster centers) is stored in the member variable `k`.
We have already set up the basic data structures described above in the provided code. Your task in the remaining exercises is to implement the \(k\)-means algorithm and some extensions in `kmeans.cpp`.
3.1 Intracluster variance
We want to calculate the total intracluster variance, which is the quality measure for our partitions in the \(k\)-means algorithm. Recall that the intracluster variance is defined as
\[ \frac{1}{n} \sum_{p\in P} \lVert p - c_{\sigma(p)}\rVert_2^2, \]
where \(P\) is the set of data points, \(n\) is its size, \(c_j\) is the center point of the \(j\)-th cluster, and \(\sigma\) is the partition function.
Complete the member function `double intracluster_variance() const`, which calculates the intracluster variance.
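As a hint on the structure of the computation, the definition above can be accumulated term by term. The sketch below uses plain `std::vector`s instead of the provided `cloud` class so that it is self-contained; the function name and parameters are illustrative only:

```cpp
#include <vector>
#include <cstddef>

// Sketch: total intracluster variance of a labeled point set.
// points[i]  : coordinates of the i-th data point
// centers[j] : coordinates of the j-th cluster center
// labels[i]  : cluster number sigma(i) of the i-th point
double intracluster_variance(const std::vector<std::vector<double>> &points,
                             const std::vector<std::vector<double>> &centers,
                             const std::vector<int> &labels)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < points.size(); i++) {
        const std::vector<double> &c = centers[labels[i]];
        for (std::size_t m = 0; m < points[i].size(); m++) {
            double diff = points[i][m] - c[m];
            sum += diff * diff; // squared Euclidean distance, coordinate by coordinate
        }
    }
    return sum / points.size(); // the 1/n factor from the definition
}
```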
You can test your implementation by typing the commands `make grader` then `./grader 2` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
3.2 Voronoi partition
The first of the two steps in the iterations of the \(k\)-means algorithm is to update the labels of the data points, i.e., the partition into clusters, according to the nearest center point. More precisely, we change the label of a data point to the cluster number that has the center point at minimal distance to the data point. In case of a tie, the cluster with the smallest index is selected.
Implement this Voronoi partitioning method in the member function `int set_voronoi_labels()`, returning the number of points that have had their label changed.
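The logic of this step can be sketched as follows, again on plain vectors rather than the `cloud` class (names hypothetical). Note how the strict comparison `<` makes ties go to the smaller cluster index:

```cpp
#include <vector>
#include <cstddef>

// Squared Euclidean distance between two coordinate vectors
static double sqdist(const std::vector<double> &a, const std::vector<double> &b)
{
    double s = 0.0;
    for (std::size_t m = 0; m < a.size(); m++)
        s += (a[m] - b[m]) * (a[m] - b[m]);
    return s;
}

// Sketch: relabel every point with the index of its nearest center and
// return the number of labels that changed.
int set_voronoi_labels(const std::vector<std::vector<double>> &points,
                       const std::vector<std::vector<double>> &centers,
                       std::vector<int> &labels)
{
    int changed = 0;
    for (std::size_t i = 0; i < points.size(); i++) {
        int best = 0;
        double best_d = sqdist(points[i], centers[0]);
        for (std::size_t j = 1; j < centers.size(); j++) {
            double dj = sqdist(points[i], centers[j]);
            if (dj < best_d) { // strict: ties keep the smaller index
                best_d = dj;
                best = (int)j;
            }
        }
        if (labels[i] != best) {
            labels[i] = best;
            changed++;
        }
    }
    return changed;
}
```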
You can test your implementation by typing the commands `make grader` then `./grader 3` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
3.3 Centroids
The second part of each iteration of the \(k\)-means algorithm is to update each cluster's center point to the centroid (i.e., the unweighted average) of the cluster's data points.
Complete the member function `void set_centroid_centers()` that does this update. If a cluster is empty, its center point is kept at its previous position. The time complexity of your function should be in \(O(d\cdot(n+k))\), where \(d\) is the dimension of the points and \(n\) is their number.
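One way to meet the \(O(d\cdot(n+k))\) bound is a single accumulation pass over the \(n\) points followed by one pass over the \(k\) centers, sketched below on plain vectors (names hypothetical):

```cpp
#include <vector>
#include <cstddef>

// Sketch: move each non-empty cluster's center to the centroid of its
// points. One pass over the n points, one pass over the k centers:
// O(d*(n+k)) in total.
void set_centroid_centers(const std::vector<std::vector<double>> &points,
                          const std::vector<int> &labels,
                          std::vector<std::vector<double>> &centers)
{
    std::size_t k = centers.size();
    std::size_t d = centers[0].size();
    std::vector<std::vector<double>> sum(k, std::vector<double>(d, 0.0));
    std::vector<int> count(k, 0);

    // Accumulate per-cluster coordinate sums and point counts
    for (std::size_t i = 0; i < points.size(); i++) {
        count[labels[i]]++;
        for (std::size_t m = 0; m < d; m++)
            sum[labels[i]][m] += points[i][m];
    }
    // Replace each non-empty cluster's center by the average
    for (std::size_t j = 0; j < k; j++) {
        if (count[j] == 0)
            continue; // empty cluster: keep the previous center
        for (std::size_t m = 0; m < d; m++)
            centers[j][m] = sum[j][m] / count[j];
    }
}
```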
You can test your implementation by typing the commands `make grader` then `./grader 4` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
3.4 Random-partition initialization
In the random-partition initialization method, we assign, uniformly at random, every data point to a cluster.
Complete the member function `void init_random_partition()` that implements the random-partition initialization method. That is, each point should be assigned to a uniformly random cluster, followed by a call to `set_centroid_centers()`. To generate a random integer between `0` and `k-1`, you can use `rand() % k`.
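The label-assignment part of this method can be sketched as follows (the subsequent `set_centroid_centers()` call is omitted here; the function name and parameters are hypothetical):

```cpp
#include <vector>
#include <cstdlib>
#include <cstddef>

// Sketch: give every point a label drawn uniformly from {0, ..., k-1}.
// In the cloud class this would be followed by set_centroid_centers().
void init_random_partition(std::vector<int> &labels, int k)
{
    for (std::size_t i = 0; i < labels.size(); i++)
        labels[i] = std::rand() % k;
}
```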
You can test your implementation by typing the commands `make grader` then `./grader 5` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
3.5 Putting it together
We can now combine the previous two functions to finish the implementation of the \(k\)-means algorithm. Starting from an initialization, which you can assume is already done and which you do not need to call inside the function, keep iterating the main loop of Lloyd’s algorithm until the clusters stop changing. Implement this procedure in the member function `void lloyd()`.
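The overall control flow is short once the two steps exist. The sketch below bundles minimal one-dimensional versions of both steps so that it runs standalone; all names are hypothetical and the real implementation would of course reuse your `cloud` member functions:

```cpp
#include <vector>
#include <cstddef>

// Minimal 1-D version of the Voronoi step: returns number of label changes
static int voronoi_step(const std::vector<double> &pts,
                        const std::vector<double> &ctr,
                        std::vector<int> &lab)
{
    int changed = 0;
    for (std::size_t i = 0; i < pts.size(); i++) {
        int best = 0;
        double bd = (pts[i] - ctr[0]) * (pts[i] - ctr[0]);
        for (std::size_t j = 1; j < ctr.size(); j++) {
            double dj = (pts[i] - ctr[j]) * (pts[i] - ctr[j]);
            if (dj < bd) { bd = dj; best = (int)j; }
        }
        if (lab[i] != best) { lab[i] = best; changed++; }
    }
    return changed;
}

// Minimal 1-D version of the centroid step
static void centroid_step(const std::vector<double> &pts,
                          const std::vector<int> &lab,
                          std::vector<double> &ctr)
{
    std::vector<double> sum(ctr.size(), 0.0);
    std::vector<int> count(ctr.size(), 0);
    for (std::size_t i = 0; i < pts.size(); i++) {
        sum[lab[i]] += pts[i];
        count[lab[i]]++;
    }
    for (std::size_t j = 0; j < ctr.size(); j++)
        if (count[j] > 0)
            ctr[j] = sum[j] / count[j];
}

// Lloyd's algorithm: alternate the two steps until no label changes.
// Initialization is assumed to have been done by the caller.
void lloyd(const std::vector<double> &pts,
           std::vector<int> &lab,
           std::vector<double> &ctr)
{
    while (voronoi_step(pts, ctr, lab) > 0)
        centroid_step(pts, lab, ctr);
}
```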
You can test your implementation by typing the commands `make grader` then `./grader 6` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
4. Alternative initializations (optional)
In the remaining exercises, we will treat several alternative initialization methods for the \(k\)-means algorithm.
4.1 Forgy initialization
In the Forgy initialization method, we choose the center point of each cluster uniformly at random from the set of points. However, we ensure that each cluster has a different center point.
Write a member function `void init_forgy()` that implements the Forgy initialization method.
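The distinctness requirement can be handled, for example, by rejection sampling: draw a random point index for each center and redraw whenever that index was already taken. A self-contained sketch on plain data (the function name is hypothetical, and it returns the chosen indices rather than setting centers):

```cpp
#include <vector>
#include <cstdlib>
#include <cstddef>

// Sketch: pick k distinct data-point indices, uniformly at random, to
// serve as the initial centers. Redraw whenever an index repeats.
std::vector<int> init_forgy_indices(int n, int k)
{
    std::vector<bool> taken(n, false);
    std::vector<int> chosen;
    while ((int)chosen.size() < k) {
        int i = std::rand() % n;
        if (!taken[i]) {       // reject duplicates
            taken[i] = true;
            chosen.push_back(i);
        }
    }
    return chosen;
}
```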
You can test your implementation by typing the commands `make grader` then `./grader 7` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.
4.2 K-means++
Write a member function `void init_plusplus()` that implements the \(k\)-means++ initialization method (see slide 12, p. 53 of the Data Science booklet). To generate a random number between `0` and `1`, you can use `(double)rand() / RAND_MAX`. Then, partition the interval \([0,1]\) into sub-intervals whose lengths are the relative squared distances to the nearest cluster center.
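The interval-partition step can be sketched as follows: given the squared distance of every point to its nearest already-chosen center, draw a uniform \(u \in [0,1]\) and walk through the points until the accumulated relative weight exceeds \(u\). The function name and parameters below are hypothetical:

```cpp
#include <vector>
#include <cstddef>

// Sketch of one k-means++ selection step: choose the index of the next
// center, where dist2[i] is the squared distance of point i to its
// nearest already-chosen center and u is a uniform draw from [0,1].
// Point i is picked with probability dist2[i] / sum(dist2).
int pick_next_center(const std::vector<double> &dist2, double u)
{
    double total = 0.0;
    for (std::size_t i = 0; i < dist2.size(); i++)
        total += dist2[i];

    // Walk through the sub-intervals of [0,1] of lengths dist2[i]/total
    double acc = 0.0;
    for (std::size_t i = 0; i < dist2.size(); i++) {
        acc += dist2[i] / total;
        if (u < acc)
            return (int)i;
    }
    return (int)dist2.size() - 1; // guard against rounding when u is near 1
}
```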
You can test your implementation by typing the commands `make grader` then `./grader 8` in the terminal. Once your program passes all (or at least some) of the tests, upload your file `kmeans.cpp`.