INF442 - TD2: K-Dimensional Tree (kd-tree)
Context
- We study the problem of k nearest neighbors search (kNN), which is to find the k nearest neighbors of a query point in a given data set. For this we will build kd-trees.
- The parts of the code to be completed correspond to basic operations and algorithms to build a kd-tree and use it to answer nearest-neighbor queries. Then, you will be able to:
- compare the three types of search: linear, defeatist and backtracking search,
- compute the average running time of each method, and
- compute the success rates of the two kd-tree based searches.
- If you are a beginner in C++, then we suggest that you start by reading the short text Pointers in C++, which illustrates the use of pointers in C++.
Structure
This TD is structured as follows:
- Section 0 gives a brief reminder of the theory of kd-trees.
- Section 1 gives a brief introduction on how to construct and use a kd-tree in python using the pandas and scikit-learn libraries. You will need this functionality to answer some of the questions of the quiz.
- Section 2 is devoted to the C++ implementation of the straightforward, linear-time approach for finding the nearest neighbor of a query point in a database.
- Section 3 is devoted to the C++ implementation of kd-trees and the various associated search strategies.
Before you begin
Download and unzip the archive INF442-td2-1-handin.zip.
It contains in particular:
- The file retrieval.cpp, which you will have to complete and/or modify,
- The header file retrieval.hpp, which you do not have to change,
- The file main.cpp, which is used for the various test scripts,
- A Makefile that you can use to compile the various tests (see the next sections).
In order to get full credit, you need to solve and submit exercises 1, 2, 3, 4, 5, 6, and 7 (exercise 8 is optional). Log in to be able to upload your files.
Section 0 (optional) — An overview of k-dimensional trees
You can safely skip this section if you attended the lecture.
Overview of the data structure
A kd-tree is a binary tree, where data in each node is a k-dimensional point in space. It is a space-partitioning data structure for organizing points in a k-dimensional space.
Kd-trees are useful data structures for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbor searches).
Fig. 1. Example of a 2d-tree.
Every non-leaf node (also called an internal node) corresponds to a splitting hyperplane that divides the space of points into two parts (half-spaces).
Points on one side of this hyperplane are included in the left subtree of that node and points on the other side of the hyperplane are represented in the right subtree.
For example, if a particular split is made along the x-coordinate: all points with a smaller x-value than the node's point will appear in the left subtree, and all points with a larger x-value will be in the right subtree.
Remark: Refer to the lecture slides for a detailed description of the kd-tree and further references.
Kd-tree Construction:
A kd-tree can be constructed as follows:
- Cycle through the coordinates in order, using one coordinate per level of the tree.
- If there is a single (k-dimensional) point, form a "leaf node" with that point.
- Otherwise, each internal node implements a spatial partition induced by a hyperplane H, splitting the point cloud into two (roughly) equal subsets: all points lying on one side of H go into the right subtree, all other points into the left subtree.
- Recursively construct kd-trees for the two resulting sub-sets of points.
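The steps above can be sketched in C++ as follows. This is an illustrative, self-contained version: the names KdNode and build are ours, not the handout's API, and it places the median with std::nth_element, whereas the TD builds the tree with its own helper functions (see Section 3).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative kd-tree construction sketch (not the handout's API).
// Points are std::vectors of 'dim' coordinates.
struct KdNode {
    int axis;                  // splitting coordinate (internal node)
    double split;              // median value along 'axis'
    std::vector<double> pt;    // point stored at this node
    KdNode *left, *right;
};

KdNode* build(std::vector<std::vector<double> >& pts, int begin, int end,
              int depth, int dim) {
    if (begin >= end) return NULL;
    int axis = depth % dim;               // cycle through the coordinates
    int mid = begin + (end - begin) / 2;
    // Rearrange so that the median along 'axis' lands at position 'mid':
    // smaller values before it, larger values after it.
    std::nth_element(pts.begin() + begin, pts.begin() + mid, pts.begin() + end,
                     [axis](const std::vector<double>& a,
                            const std::vector<double>& b) {
                         return a[axis] < b[axis];
                     });
    KdNode* n = new KdNode;               // nodes are never freed in this sketch
    n->axis = axis;
    n->split = pts[mid][axis];
    n->pt = pts[mid];
    n->left  = build(pts, begin, mid, depth + 1, dim);   // one half-space
    n->right = build(pts, mid + 1, end, depth + 1, dim); // the other
    return n;
}
```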
Searching for the nearest neighbor of a query point:
The computation of the neighbors of a point q (i.e., of those points that are closer to q than a given distance R) is performed by recursively traversing the tree (starting from the root).
At each stage it is necessary to test whether the ball of radius R centered at q intersects the separating hyperplane:
- if the ball intersects the splitting hyperplane (i.e. if the ball is not contained in a single sub-space/subtree), then we must continue the search recursively in both subtrees (left and right): the result of the query is the union of the results of the two recursive queries in the two subtrees.
- if the ball does not intersect the separating hyperplane, then it is entirely contained in one of the two half-spaces. It is then necessary to determine on which side of the splitting hyperplane the ball lies; the search only needs to be performed in the corresponding subtree.
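For an axis-aligned splitting hyperplane {x : x[c] = m}, the intersection test above reduces to comparing |q[c] - m| with R. A minimal sketch (the function name is ours, for illustration):

```cpp
#include <cassert>
#include <cmath>

// The ball B(q, R) intersects the axis-aligned hyperplane {x : x[c] == m}
// if and only if the distance from q to the hyperplane, |q[c] - m|,
// is at most R.
bool ball_crosses_hyperplane(const double* q, int c, double m, double R) {
    return std::fabs(q[c] - m) <= R;
}
```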
Section 1 — Quiz: querying the Iris data set
In the quiz you will answer some questions related to the content of the lecture to evaluate your understanding of the various notions. In addition, you will answer some “data science” questions related to the use of kd-trees. For these questions we use (a variant of) the famous iris data set that is widely used as a benchmark by the Machine Learning community. The dataset describes each iris by a vector of 4 numerical features measured on the flower (sepal length/width and petal length/width), together with the botanical name of the species. The data files are in the csv/ directory. There is no reason to modify these files.
Before proceeding with the quiz, let us write a small Python script that will allow us to create a kd-tree on the iris data set and then perform some (basic) query operations.
First, we must install the pandas and scikit-learn libraries (if not already done):
$ pip3 install --user --only-binary=:all: pandas scikit-learn
Now we open a new file kd-tree-search.py in the quiz/ directory, and we add the following import statements:
import pandas as pd
from sklearn.neighbors import KDTree
Then we read the iris data set file and we print (some of) the data that it contains. We assume that the Python script is in the quiz/ directory; otherwise you have to modify the variable fname to point to the data set.
fname = "../csv/iris_large.csv"
data = pd.read_csv(fname, delimiter=' ', header=0)
print(data)
Then we create the kd-tree, using the Scikit-learn library:
print('\nbuilding kd-tree...', end='')
kdt = KDTree(data.iloc[:,0:4], leaf_size=1, metric='euclidean')
print(' done')
Assume we have the following iris observation, which we want to query the database with:
x = [1.1, 2.2, 3.3, 4.4]
print('\n my measurements [sepal_len, sepal_wid, petal_len, petal_wid] are: ', x)
In order to find the type of the nearest neighbor of x we add the following:
dist, idx = kdt.query([x], k=1)
print(f' -> the iris {x} is closest to type {data.iloc[idx[0, 0],4]}')
To print the index (i.e., the position) and the distance of the nearest neighbor, we add the following:
print(f'\t Index/indices of the {len(idx[0])} closest point(s): {idx[0]}')
print(f'\t Distance(s) to the {len(dist[0])} closest point(s): {dist[0]}')
In the quiz, you will be asked to find the types of other query points, and also to find the average distance to the 5 nearest neighbors. To answer these questions, you will have to slightly modify the above Python code.
(Optional) Data Visualization.
For your information, the following figure shows, in matrix form, the plots of the data against each pair of variables (specified in the row and column labels). Each iris type is assigned a distinct color. The plots along the diagonal represent the distributions of the three iris types along the corresponding variable.
Section 2 — Nearest-neighbor search via naive linear scan
In this section we will implement linear scan as a naive strategy for nearest-neighbor queries in a database.
First, we need a C++ definition for point, i.e., a vector of features (this definition is in the file retrieval.hpp):
typedef double* point; // point = array of coordinates (doubles)
Thus, a point is simply a pointer to an array of doubles, with no indication of the length of this array. The code does not depend on a specific dimension. The dimension is given as an additional parameter to every function that needs it.
The print_point function that is already implemented in retrieval.cpp displays the coordinates of a point.
We shall complete the following functions in retrieval.cpp:
- (Ex 1) double dist(point p, point q, int dim) computes the Euclidean distance between two points
- (Ex 2) int linear_scan(point q, int dim, point* P, int n) computes the nearest neighbor of a query point in a database of points.
Ex 1 :: Function "dist"
The function double dist(point p, point q, int dim) in retrieval.cpp computes the Euclidean distance between two points p and q. It takes as arguments the two points and the dimension where they live, and it returns their Euclidean distance.
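One possible implementation (a sketch, to be adapted to your own file) accumulates the squared coordinate differences and takes a square root:

```cpp
#include <cassert>
#include <cmath>

typedef double* point; // as in retrieval.hpp

// Euclidean distance: square root of the sum of squared
// coordinate differences over the 'dim' coordinates.
double dist(point p, point q, int dim) {
    double s = 0.0;
    for (int i = 0; i < dim; i++) {
        double d = p[i] - q[i];
        s += d * d;
    }
    return std::sqrt(s);
}
```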
You can test your implementation by typing the commands make grader then ./grader 1 in the terminal.
The command make grader launches the compilation, using the makefile that we have provided. The command ./grader 1 runs the tests for the exercise.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Ex 2 :: Function "linear_scan"
Now that you have implemented dist, you can complete the function int linear_scan(point q, int dim, point* P, int n) in retrieval.cpp. This function takes as arguments: a query point q, the dimension dim where the points live, a pointer P to an array of points (this is the database that we will scan to find the nearest neighbor to our query point), and the number n of points in this array. It returns the position of the point in the array P that is closest to the query point q. If there are two points in P at the same minimum distance to q, then it returns the position of the first one.
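A sketch of such a scan (assuming n >= 1) keeps the best index seen so far; the strict comparison guarantees that the first of several equidistant points is returned:

```cpp
#include <cassert>
#include <cmath>

typedef double* point; // as in retrieval.hpp

// Euclidean distance (as in Ex 1).
double dist(point p, point q, int dim) {
    double s = 0.0;
    for (int i = 0; i < dim; i++) {
        double d = p[i] - q[i];
        s += d * d;
    }
    return std::sqrt(s);
}

// Scan all n points, keeping the index of the closest one seen so far.
// The strict '<' keeps the first point in case of a tie. Assumes n >= 1.
int linear_scan(point q, int dim, point* P, int n) {
    int best = 0;
    double best_d = dist(q, P[0], dim);
    for (int i = 1; i < n; i++) {
        double d = dist(q, P[i], dim);
        if (d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return best;
}
```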
You can test your implementation by typing the commands make grader then ./grader 2 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Section 3 — Nearest-neighbor search via k-dimensional trees
In this section, we implement a kd-tree to answer nearest-neighbor queries faster.
We can implement a kd-tree in several ways. Here, we follow the outline given in the lecture. We assume that you have already implemented and tested the function dist.
A key step of the kd-tree implementation is splitting a list of points into two balanced sub-lists, according to some particular order. We will implement this as two separate tasks, with the functions compute_median and partition. Then, we will consider the definition of the tree structure. Finally, we will consider two algorithms for searching in a kd-tree.
We shall complete and/or modify the following functions in retrieval.cpp:
- (Ex 3) double compute_median(point* P, int start, int end, int c) computes the median of the c-th coordinates of a sub-array of points
- (Ex 4) int partition(point* P, int start, int end, int c) partitions an array with respect to its median value along the coordinate c
- (Ex 5) node* create_node(int _idx) and node* create_node(int _c, double _m, int _idx, node* _left, node* _right) create respectively a leaf node and an internal node of a kd-tree
- (Ex 6) void defeatist_search(node* n, point q, int dim, point* P, double& res, int& nnp) performs a defeatist search in a kd-tree
- (Ex 7) void backtracking_search(node* n, point q, int dim, point* P, double& res, int& nnp) performs a backtracking search in a kd-tree
Ex 3 :: Function "compute_median"
Complete the function double compute_median(point* P, int start, int end, int c). It takes as arguments: a pointer P to an array of points (this is our data set), of which only the sub-array of indices in the range [start..end[ will be considered, and a coordinate number c. It returns the median of the collection of c-th coordinates of the points in the sub-array (there are end - start of those). To compute this median you can simply sort the collection of c-th coordinates in the range [start..end[ and then take the ((end - start)/2)-th element of the sorted collection. The function std::sort(double* b, double* e) can perform the sort for you, provided that you give it pointers b and e to the beginning and end (respectively) of the array of double to be sorted.
After you implement your version of compute_median, you can test your code locally by typing the commands make grader then ./grader 3 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Ex 4 :: Function "partition"
Complete the function int partition(point* P, int start, int end, int c), which arranges the points around the median. It takes as arguments: a pointer P to an array of points, of which only the sub-array of indices in the range [start..end[ will be considered, and the coordinate c along which the partitioning is to be done. It returns the index idx of the point in P whose c-th coordinate is used as median. We assume here that all data points are distinct, in order to ensure a correct execution of our algorithm.
The function rearranges the points in the array P in such a way that every point within the subrange [start..idx] has a c-th coordinate less than or equal to the median value m (with P[idx][c] == m), while every point within the subrange [idx+1..end[ has a c-th coordinate strictly greater than the median.
After you implement your version of partition, you can test your code locally by typing the commands make grader then ./grader 4 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Hint 1: Take a look at the slides of the lecture (again).
Hint 2: You should use std::swap.
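One way to implement this (a sketch, assuming distinct coordinate values as stated above; it recomputes the median with the Ex 3 routine and then does a two-pointer pass with std::swap) is:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

typedef double* point; // as in retrieval.hpp

// Median of the c-th coordinates (as in Ex 3).
double compute_median(point* P, int start, int end, int c) {
    int n = end - start;
    double* coords = new double[n];
    for (int i = 0; i < n; i++)
        coords[i] = P[start + i][c];
    std::sort(coords, coords + n);
    double m = coords[n / 2];
    delete[] coords;
    return m;
}

// Rearrange P[start..end[ so that coordinates <= m come first and the
// median point sits exactly at the returned index idx.
int partition(point* P, int start, int end, int c) {
    double m = compute_median(P, start, end, c);
    int i = start, j = end - 1;
    // Two-pointer pass: values <= m to the left, values > m to the right.
    while (i <= j) {
        if (P[i][c] <= m)
            i++;
        else {
            std::swap(P[i], P[j]);
            j--;
        }
    }
    int idx = j; // last index whose c-th coordinate is <= m
    // Bring the median point itself to position idx.
    for (int k = start; k < idx; k++)
        if (P[k][c] == m) {
            std::swap(P[k], P[idx]);
            break;
        }
    return idx;
}
```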
Ex 5 :: Creating nodes and building the tree
In this part you will implement the struct node and appropriate functions to create internal and leaf nodes of the kd-tree.
The definition of struct node contains the coordinate for the split c (int), the split value m (double), the index idx (int) of the data point stored at the node, and pointers left and right (node*) to the children nodes.
typedef struct node { // node for the kd-tree
  int c;       // coordinate for split (at internal node)
  double m;    // split value (at internal node)
  int idx;     // index of data point stored at the node
  node* left;  // left child (NULL at leaf node)
  node* right; // right child (NULL at leaf node)
} node;
Complete the function node* create_node(int _idx) that creates leaf nodes. Then complete the function node* create_node(int _c, double _m, int _idx, node* _left, node* _right) that creates internal nodes.
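A possible sketch of the two functions (leaf children are marked with NULL, as in the struct's comments; the default values stored in a leaf's c and m fields are our choice):

```cpp
#include <cassert>
#include <cstddef>

typedef struct node { // node for the kd-tree (as in retrieval.hpp)
    int c;
    double m;
    int idx;
    node* left;
    node* right;
} node;

// Leaf node: only the data-point index is meaningful; NULL children
// mark the node as a leaf. c and m get placeholder values.
node* create_node(int _idx) {
    node* n = new node;
    n->c = 0;
    n->m = 0.0;
    n->idx = _idx;
    n->left = NULL;
    n->right = NULL;
    return n;
}

// Internal node: stores the split coordinate, the split value,
// the data-point index, and the two children.
node* create_node(int _c, double _m, int _idx, node* _left, node* _right) {
    node* n = new node;
    n->c = _c;
    n->m = _m;
    n->idx = _idx;
    n->left = _left;
    n->right = _right;
    return n;
}
```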
After you implement your versions of create_node, you can test your code locally by typing the commands make grader then ./grader 5 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Ex 6 :: Function "defeatist_search"
The following function defeatist_search implements the defeatist search algorithm.
void defeatist_search(node* n, point q, int dim, point* P, double& res) {
  if (n != NULL) {
    res = std::min(res, dist(q, P[n->idx], dim));
    if (n->left != NULL || n->right != NULL) // internal node
      defeatist_search((q[n->c] <= n->m) ? n->left : n->right, q, dim, P, res);
  }
}
This basic implementation only computes (in res) the distance of the query point q to its estimated nearest neighbor. Your task is to modify it so that, in addition, it computes (in an additional parameter int& nnp) the index of the estimated nearest neighbor in the array of points P.
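One way to carry out this modification (a sketch; dist is the Ex 1 function and the tiny tree in the test below is hard-coded for illustration) is to update nnp exactly when res improves:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

typedef double* point;
typedef struct node { int c; double m; int idx; node* left; node* right; } node;

// Euclidean distance (as in Ex 1).
double dist(point p, point q, int dim) {
    double s = 0.0;
    for (int i = 0; i < dim; i++) {
        double d = p[i] - q[i];
        s += d * d;
    }
    return std::sqrt(s);
}

// Defeatist search with the extra output parameter nnp: replacing
// std::min by an explicit comparison lets us record the index of the
// point that achieves the current best distance.
void defeatist_search(node* n, point q, int dim, point* P,
                      double& res, int& nnp) {
    if (n != NULL) {
        double d = dist(q, P[n->idx], dim);
        if (d < res) {          // res improves: remember the index too
            res = d;
            nnp = n->idx;
        }
        if (n->left != NULL || n->right != NULL) // internal node
            defeatist_search((q[n->c] <= n->m) ? n->left : n->right,
                             q, dim, P, res, nnp);
    }
}
```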
After you implement your version of defeatist_search, you can test your code locally by typing the commands make grader then ./grader 6 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Ex 7 :: Function "backtracking_search"
The following function backtracking_search implements part of the backtracking search algorithm.
void backtracking_search(node* n, point q, int dim, point* P, double& res) {
  if (n != NULL) {
    res = std::min(res, dist(q, P[n->idx], dim));
    if (n->left != NULL || n->right != NULL) { // internal node
      // ball intersects left side of H
      if (q[n->c] - res <= n->m)
        backtracking_search(n->left, q, dim, P, res);
      // ball intersects right side of H
    }
  }
}
In particular, the previous implementation searches only the left child of a node and computes (in res) the distance of the query point q to its estimated nearest neighbor.
Your task is to modify the code so that:
- it implements the backtracking search algorithm correctly, by also exploring the right child of a node, and
- it computes (in an additional parameter int& nnp) the index of the estimated nearest neighbor in the array of points P.
After you implement your version of backtracking_search, you can test your code locally by typing the commands make grader then ./grader 7 in the terminal.
Once your program passes all (or at least some) of the tests, upload your file retrieval.cpp:
Ex 8 :: (optional) Comparison of searching strategies
Now that you have implemented all the necessary functions, you can compare the speed and the accuracy of the three search algorithms, that is: linear, defeatist, and backtracking. For this you can use the provided program grader and issue the command make grader followed by ./grader 8 in the terminal. Equivalently, you can use the version of grader running on our server by uploading your file retrieval.cpp below: