INF442 - TD2: K-Dimensional Tree (kd-tree)
Context
- We study the problem of k nearest neighbors search (kNN), which is to find the k nearest neighbors of a query point in a given data set. For this we will build kd-trees.
- The parts of the code to be completed correspond to basic operations and algorithms to build a kd-tree and use it to answer nearest-neighbor queries. Then, you will be able to:
- compare the three types of search: linear, defeatist and backtracking search,
- compute the average running time of each method, and
- compute the success rates of the two kd-tree based searches.
- If you are a beginner in C++, then we suggest that you start by reading the short text Pointers in C++, which illustrates the use of pointers in C++.
Structure
This TD is structured as follows:
- Section 0 gives a brief reminder of the theory of kd-trees.
- Section 1 gives a brief introduction on how to construct and use a kd-tree in python using the pandas and scikit-learn libraries. You will need this functionality to answer some of the questions of the quiz.
- Section 2 is devoted to the C++ implementation of the straightforward, linear-time approach for finding the nearest neighbor of a query point in a database.
- Section 3 is devoted to the C++ implementation of kd-trees and the various associated search strategies.
Before you begin
Download and unzip the archive INF442-td2-1-handin.zip.
It contains in particular:
- The file retrieval.cpp, which you will have to complete and/or modify,
- The header file retrieval.hpp, which you do not have to change,
- The file main.cpp, which is used for the various test scripts,
- A Makefile that you can use to compile the various tests (see the next sections).
In order to get full credit, you need to solve and submit exercises 1, 2, 3, 4, 5, 6, and 7 (exercise 8 is optional). Log in to be able to upload your files.
Section 0 (optional) — An overview of k-dimensional trees
You can safely skip this section if you attended the lecture.
Overview of the data structure
A kd-tree is a binary tree, where data in each node is a k-dimensional point in space. It is a space-partitioning data structure for organizing points in a k-dimensional space.
Kd-trees are useful data structures for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbor searches).
Fig. 1. Example of a 2d-tree.
Every non-leaf node (also called an internal node) corresponds to a splitting hyperplane that divides the space of points into two parts (half-spaces).
Points on one side of this hyperplane are included in the left subtree of that node and points on the other side of the hyperplane are represented in the right subtree.
For example, if a particular split is made along the x-coordinate: all points with a smaller x-value than the node's point will appear in the left subtree, and all points with a larger x-value will be in the right subtree.
Remark: Refer to the lecture slides for a detailed description of the kd-tree and further references.
Kd-tree Construction:
A kd-tree can be constructed as follows:
- Cycle through the coordinates in order, using one coordinate per level of the tree.
- If there is a single (k-dimensional) point, form a "leaf node" with that point.
- Otherwise, each internal node implements a spatial partition induced by a hyperplane H, splitting the point cloud into two (roughly) equal subsets: all points lying on one side of H go into the right subtree, all other points into the left subtree.
- Recursively construct kd-trees for the two resulting sub-sets of points.
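The steps above can be sketched in C++ as follows. This is an illustrative, self-contained version: the names KdNode and build are ours, not the handout's API, and it places the median with std::nth_element, whereas the TD builds the tree with its own helper functions (see Section 3).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative kd-tree construction sketch (not the handout's API).
// Points are std::vectors of 'dim' coordinates.
struct KdNode {
    int axis;                  // splitting coordinate (internal node)
    double split;              // median value along 'axis'
    std::vector<double> pt;    // point stored at this node
    KdNode *left, *right;
};

KdNode* build(std::vector<std::vector<double> >& pts, int begin, int end,
              int depth, int dim) {
    if (begin >= end) return NULL;
    int axis = depth % dim;               // cycle through the coordinates
    int mid = begin + (end - begin) / 2;
    // Rearrange so that the median along 'axis' lands at position 'mid':
    // smaller values before it, larger values after it.
    std::nth_element(pts.begin() + begin, pts.begin() + mid, pts.begin() + end,
                     [axis](const std::vector<double>& a,
                            const std::vector<double>& b) {
                         return a[axis] < b[axis];
                     });
    KdNode* n = new KdNode;               // nodes are never freed in this sketch
    n->axis = axis;
    n->split = pts[mid][axis];
    n->pt = pts[mid];
    n->left  = build(pts, begin, mid, depth + 1, dim);   // one half-space
    n->right = build(pts, mid + 1, end, depth + 1, dim); // the other
    return n;
}
```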
Searching for the nearest neighbor of a query point:
The computation of the neighbors of a point q (i.e., of those points that are closer to q than a given distance R) is performed by recursively traversing the tree (starting from the root).
At each stage it is necessary to test whether the ball of radius R centered at q intersects the separating hyperplane:
- if the ball intersects the splitting hyperplane (i.e. if the ball is not contained in a single sub-space/subtree), then we must continue the search recursively in both subtrees (left and right): the result of the query is the union of the results of the two recursive queries in the two subtrees.
- if the ball does not intersect the separating hyperplane, then it is entirely contained in one of the two half-spaces. It is then necessary to determine on which side of the splitting hyperplane the ball lies; the search only needs to be performed in the corresponding subtree.
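For an axis-aligned splitting hyperplane {x : x[c] = m}, the intersection test above reduces to comparing |q[c] - m| with R. A minimal sketch (the function name is ours, for illustration):

```cpp
#include <cassert>
#include <cmath>

// The ball B(q, R) intersects the axis-aligned hyperplane {x : x[c] == m}
// if and only if the distance from q to the hyperplane, |q[c] - m|,
// is at most R.
bool ball_crosses_hyperplane(const double* q, int c, double m, double R) {
    return std::fabs(q[c] - m) <= R;
}
```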
Section 1 — Quiz: querying the Iris data set
In the quiz you will answer some questions related to the content of the lecture to evaluate your understanding of the various notions. In addition, you will answer some “data science” questions related to the use of kd-trees. For these questions we use (a variant of) the famous iris data set that is widely used as a benchmark by the Machine Learning community. The dataset describes each iris by a vector of 4 numerical features measured on the flower (sepal length/width and petal length/width), together with the botanical name of the species. The data files are in the csv/ directory. There is no reason to modify these files.
Before proceeding with the quiz, let us write a small Python script that will allow us to create a kd-tree on the iris data set and then perform some (basic) query operations.
First, we must install the pandas and scikit-learn libraries (if not already done):
$ pip3 install --user --only-binary=:all: pandas scikit-learn
Now we open a new file kd-tree-search.py in the quiz/ directory, and we add the following import statements:
import pandas as pd
from sklearn.neighbors import KDTree
Then we read the iris data set file and we print (some of) the data that it contains. We assume that the Python script is in the quiz/ directory; otherwise you have to modify the variable fname to point to the data set.
fname = "../csv/iris_large.csv"
data = pd.read_csv(fname, delimiter=' ', header=0)
print(data)
Then we create the kd-tree, using the Scikit-learn library:
print('\nbuilding kd-tree...', end='')
kdt = KDTree(data.iloc[:,0:4], leaf_size=1, metric='euclidean')
print(' done')
Assume we have the following iris observation, which we want to query the database with:
x = [1.1, 2.2, 3.3, 4.4]
print('\n my measurements [sepal_len, sepal_wid, petal_len, petal_wid] are: ', x)
In order to find the type of the nearest neighbor of x we add the following:
dist, idx = kdt.query([x], k=1)
print(f' -> the iris {x} is closest to type {data.iloc[idx[0, 0],4]}')
To print the index (i.e., the position) and the distance of the nearest neighbor, we add the following:
print(f'\t Index/indices of the {len(idx[0])} closest point(s): {idx[0]}')
print(f'\t Distance(s) to the {len(dist[0])} closest point(s): {dist[0]}')
In the quiz, you will be asked to find the types of other query points, and also to find the average distance to the 5 nearest neighbors. To answer these questions, you will have to slightly modify the above Python code.
(Optional) Data Visualization.
For your information, the following figure shows, in matrix form, the plots of the data against each pair of variables (specified in the row and column labels). Each iris type is assigned a distinct color. The plots along the diagonal represent the distributions of the three iris types along the corresponding variable.
Section 2 — Nearest-neighbor search via naive linear scan
In this section we will implement linear scan as a naive strategy for nearest-neighbor queries in a database.
First, we need a C++ definition for point, i.e., a vector of features (this definition is in the file retrieval.hpp):
typedef double* point; // point = array of coordinates (doubles)
Thus, a point is simply a pointer to an array of doubles, with no indication of the length of this array. The code does not depend on a specific dimension. The dimension is given as an additional parameter to every function that needs it.
The print_point function that is already implemented in retrieval.cpp displays the coordinates of a point.
We shall complete the following functions in retrieval.cpp:
- (Ex 1) double dist(point p, point q, int dim) computes the Euclidean distance between two points
- (Ex 2) int linear_scan(point q, int dim, point* P, int n) computes the nearest neighbor of a query point in a database of points.
Ex 1 :: Function "dist"
The function double dist(point p, point q, int dim) in retrieval.cpp computes the Euclidean distance between two points p and q. It takes as arguments the two points and the dimension where they live, and it returns their Euclidean distance.
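One possible implementation (a sketch, to be adapted to your own file) accumulates the squared coordinate differences and takes a square root:

```cpp
#include <cassert>
#include <cmath>

typedef double* point; // as in retrieval.hpp

// Euclidean distance: square root of the sum of squared
// coordinate differences over the 'dim' coordinates.
double dist(point p, point q, int dim) {
    double s = 0.0;
    for (int i = 0; i < dim; i++) {
        double d = p[i] - q[i];
        s += d * d;
    }
    return std::sqrt(s);
}
```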
You can test your implementation by typing the commands make grader then ./grader 1 in the terminal.
The command make grader launches the compilation, using the makefile that we have provided. The command ./grader 1 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Ex 2 :: Function "linear_scan"
Now that you have implemented dist, you can complete the function int linear_scan(point q, int dim, point* P, int n) in retrieval.cpp. This function takes as arguments: a query point q, the dimension dim where the points live, a pointer P to an array of points (this is the database that we will scan to find the nearest neighbor to our query point), and the number n of points in this array. It returns the position of the point in the array P that is closest to the query point q. If there are two points in P at the same minimum distance to q, then it returns the position of the first one.
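A sketch of such a scan (assuming n >= 1) keeps the best index seen so far; the strict comparison guarantees that the first of several equidistant points is returned:

```cpp
#include <cassert>
#include <cmath>

typedef double* point; // as in retrieval.hpp

// Euclidean distance (as in Ex 1).
double dist(point p, point q, int dim) {
    double s = 0.0;
    for (int i = 0; i < dim; i++) {
        double d = p[i] - q[i];
        s += d * d;
    }
    return std::sqrt(s);
}

// Scan all n points, keeping the index of the closest one seen so far.
// The strict '<' keeps the first point in case of a tie. Assumes n >= 1.
int linear_scan(point q, int dim, point* P, int n) {
    int best = 0;
    double best_d = dist(q, P[0], dim);
    for (int i = 1; i < n; i++) {
        double d = dist(q, P[i], dim);
        if (d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return best;
}
```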
You can test your implementation by typing the commands make grader then ./grader 2 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Section 3 — Nearest-neighbor search via k-dimensional trees
In this section, we implement a kd-tree to answer nearest-neighbor queries faster.
We can implement a kd-tree in several ways. Here, we follow the outline given in the lecture. We assume that you have already implemented and tested the function dist.
A key step of the kd-tree implementation is splitting a list of points into two balanced sub-lists, according to some particular order. We will implement this as two separate tasks, with the functions compute_median and partition. Then, we will consider the definition of the tree structure. Finally, we will consider two algorithms for searching in a kd-tree.
We shall complete and/or modify the following functions in retrieval.cpp:
- (Ex 3) double compute_median(point* P, int start, int end, int c) computes the median of the c-th coordinates of a sub-array of points
- (Ex 4) int partition(point* P, int start, int end, int c) partitions an array with respect to its median value along the coordinate c
- (Ex 5) node* create_node(int _idx) and node* create_node(int _c, double _m, int _idx, node* _left, node* _right) create respectively a leaf node and an internal node of a kd-tree
- (Ex 6) void defeatist_search(node* n, point q, int dim, point* P, double& res, int& nnp) performs a defeatist search in a kd-tree
- (Ex 7) void backtracking_search(node* n, point q, int dim, point* P, double& res, int& nnp) performs a backtracking search in a kd-tree
Ex 3 :: Function "compute_median"
Complete the function double compute_median(point* P, int start, int end, int c). It takes as arguments: a pointer P to an array of points (this is our data set), of which only the sub-array of indices in the range [start..end[ will be considered, and a coordinate number c. It returns the median of the collection of c-th coordinates of the points in the sub-array (there are end - start of those). To compute this median you can simply sort the collection of c-th coordinates in the range [start..end[ and then take the ((end - start)/2)-th element of the sorted collection. The function std::sort(double* b, double* e) can perform the sort for you, provided that you give it pointers b and e to the beginning and end (respectively) of the array of double to be sorted.
After you implement your version of compute_median, you can test your code locally by typing the commands make grader then ./grader 3 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Ex 4 :: Function "partition"
Complete the function int partition(point* P, int start, int end, int c), which arranges the points around the median. It takes as arguments: a pointer P to an array of points, of which only the sub-array of indices in the range [start..end[ will be considered, and the coordinate c along which the partitioning is to be done. It returns the index idx of the point in P whose c-th coordinate is used as median. We assume here that all data points are distinct, in order to ensure a correct execution of our algorithm.
The function rearranges the points in the array P in such a way that every point within the subrange [start..idx] has a c-th coordinate less than or equal to the median value m (with P[idx][c] == m), while every point within the subrange [idx+1..end[ has a c-th coordinate strictly greater than the median.
After you implement your version of partition, you can test your code locally by typing the commands make grader then ./grader 4 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Hint 1: Take a look at the slides of the lecture (again).
Hint 2: You should use std::swap.
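One way to implement this (a sketch, assuming distinct coordinate values as stated above; it recomputes the median with the Ex 3 routine and then does a two-pointer pass with std::swap) is:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

typedef double* point; // as in retrieval.hpp

// Median of the c-th coordinates (as in Ex 3).
double compute_median(point* P, int start, int end, int c) {
    int n = end - start;
    double* coords = new double[n];
    for (int i = 0; i < n; i++)
        coords[i] = P[start + i][c];
    std::sort(coords, coords + n);
    double m = coords[n / 2];
    delete[] coords;
    return m;
}

// Rearrange P[start..end[ so that coordinates <= m come first and the
// median point sits exactly at the returned index idx.
int partition(point* P, int start, int end, int c) {
    double m = compute_median(P, start, end, c);
    int i = start, j = end - 1;
    // Two-pointer pass: values <= m to the left, values > m to the right.
    while (i <= j) {
        if (P[i][c] <= m)
            i++;
        else {
            std::swap(P[i], P[j]);
            j--;
        }
    }
    int idx = j; // last index whose c-th coordinate is <= m
    // Bring the median point itself to position idx.
    for (int k = start; k < idx; k++)
        if (P[k][c] == m) {
            std::swap(P[k], P[idx]);
            break;
        }
    return idx;
}
```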
Ex 5 :: Creating nodes and building the tree
In this part you will implement the struct node and appropriate functions to create internal and leaf nodes of the kd-tree.
The definition of struct node contains the coordinate for the split c (int), the split value m (double), the index idx (int) of the data point stored at the node, and pointers left and right (node*) to the children nodes.
typedef struct node { // node for the kd-tree
  int c;       // coordinate for split (at internal node)
  double m;    // split value (at internal node)
  int idx;     // index of data point stored at the node
  node* left;  // left child (NULL at leaf node)
  node* right; // right child (NULL at leaf node)
} node;
Complete the function node* create_node(int _idx) that creates leaf nodes. Then complete the function node* create_node(int _c, double _m, int _idx, node* _left, node* _right) that creates internal nodes.
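A possible sketch of the two functions (leaf children are marked with NULL, as in the struct's comments; the default values stored in a leaf's c and m fields are our choice):

```cpp
#include <cassert>
#include <cstddef>

typedef struct node { // node for the kd-tree (as in retrieval.hpp)
    int c;
    double m;
    int idx;
    node* left;
    node* right;
} node;

// Leaf node: only the data-point index is meaningful; NULL children
// mark the node as a leaf. c and m get placeholder values.
node* create_node(int _idx) {
    node* n = new node;
    n->c = 0;
    n->m = 0.0;
    n->idx = _idx;
    n->left = NULL;
    n->right = NULL;
    return n;
}

// Internal node: stores the split coordinate, the split value,
// the data-point index, and the two children.
node* create_node(int _c, double _m, int _idx, node* _left, node* _right) {
    node* n = new node;
    n->c = _c;
    n->m = _m;
    n->idx = _idx;
    n->left = _left;
    n->right = _right;
    return n;
}
```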
After you implement your versions of create_node, you can test your code locally by typing the commands make grader then ./grader 5 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Ex 6 :: Function "defeatist_search"
The following function defeatist_search implements the defeatist search algorithm.
void defeatist_search(node* n, point q, int dim, point* P, double& res) {
  if (n != NULL) {
    res = std::min(res, dist(q, P[n->idx], dim));
    if (n->left != NULL || n->right != NULL) // internal node
      defeatist_search((q[n->c] <= n->m) ? n->left : n->right, q, dim, P, res);
  }
}
This basic implementation only computes (in res) the distance of the query point q to its estimated nearest neighbor. Your task is to modify it so that, in addition, it computes (in an additional parameter int& nnp) the index of the estimated nearest neighbor in the array of points P.
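One way to carry out this modification (a sketch; dist is the Ex 1 function and the tiny tree in the test below is hard-coded for illustration) is to update nnp exactly when res improves:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

typedef double* point;
typedef struct node { int c; double m; int idx; node* left; node* right; } node;

// Euclidean distance (as in Ex 1).
double dist(point p, point q, int dim) {
    double s = 0.0;
    for (int i = 0; i < dim; i++) {
        double d = p[i] - q[i];
        s += d * d;
    }
    return std::sqrt(s);
}

// Defeatist search with the extra output parameter nnp: replacing
// std::min by an explicit comparison lets us record the index of the
// point that achieves the current best distance.
void defeatist_search(node* n, point q, int dim, point* P,
                      double& res, int& nnp) {
    if (n != NULL) {
        double d = dist(q, P[n->idx], dim);
        if (d < res) {          // res improves: remember the index too
            res = d;
            nnp = n->idx;
        }
        if (n->left != NULL || n->right != NULL) // internal node
            defeatist_search((q[n->c] <= n->m) ? n->left : n->right,
                             q, dim, P, res, nnp);
    }
}
```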
After you implement your version of defeatist_search, you can test your code locally by typing the commands make grader then ./grader 6 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Ex 7 :: Function "backtracking_search"
The following function backtracking_search implements part of the backtracking search algorithm.
void backtracking_search(node* n, point q, int dim, point* P, double& res) {
  if (n != NULL) {
    res = std::min(res, dist(q, P[n->idx], dim));
    if (n->left != NULL || n->right != NULL) { // internal node
      // ball intersects left side of H
      if (q[n->c] - res <= n->m)
        backtracking_search(n->left, q, dim, P, res);
      // ball intersects right side of H
    }
  }
}
In particular, the previous implementation searches only the left child of a node and computes (in res) the distance of the query point q to its estimated nearest neighbor.
Your task is to modify the code so that:
- it implements the backtracking search algorithm correctly, by also exploring the right child of a node, and
- it computes (in an additional parameter int& nnp) the index of the estimated nearest neighbor in the array of points P.
After you implement your version of backtracking_search, you can test your code locally by typing the commands make grader then ./grader 7 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Ex 8 :: (optional) Comparison of searching strategies
Now that you have implemented all the necessary functions, you can compare the speed and the accuracy of the three search algorithms, that is: linear, defeatist, and backtracking. For this you can use the provided program grader and issue the command make grader followed by ./grader 8 in the terminal. Equivalently, you can use the version of grader running on our server by uploading your file retrieval.cpp below: