CSC43042EP (formerly INF442) - Introduction and Nearest Neighbor Search
Context
Welcome to the TD sessions of INF442 / CSC43042EP!
The goal of this course is to introduce key Machine Learning concepts from a Computer Science perspective, thus focusing on complexity, approximate approaches, and implementation techniques.
In this first session, after a refresher on using the terminal, the Python interpreter, Object-Oriented Programming, and NumPy, we will implement nearest-neighbor search with linear scan and kd-trees.
Structure
This TD is structured as follows:
Section 0 provides a very brief introduction to the fundamental commands that you can use in a Unix/Linux terminal. More information is provided in a separate tutorial. If you are already comfortable with Unix and/or Linux, you can safely skip this section; however, we do recommend that you take at least a quick look to make sure that you master all the commands introduced therein.
Section 1 provides base Python reminders and a short introduction to NumPy, Pandas, Scikit-learn, and matplotlib.
Section 2 has you implement linear scan, kd-trees, and the defeatist and backtracking search algorithms.
Section 3 has you compare your implementation to off-the-shelf implementations in terms of execution speed and accuracy.
Section 4 has you compete for the best test set accuracy.
All subsequent TD sessions will be structured similarly, i.e. starting from a reimplementation of algorithms seen in the lecture, then a comparison with some off-the-shelf implementation, and lastly a “data challenge” where you apply your algorithm(s) on some (often curated but still real-life) dataset.
Before you begin
Download and unzip the archive CSC43042EP-td1-1-handin.zip.
It contains:
- a `scripts` folder with a starter script `introduction.py` for the following introduction, and an `iris.py` script for when you’re done implementing all the exercises,
- a `TD` folder which acts as a Python package (more on that later) where we’ll implement the exercises,
- a `grader.py` script which runs tests against your code and is used for autograding.
0 (optional) — Useful basic commands in Unix terminal
If you have a Windows-based laptop, it is not compulsory to install Linux (e.g. through WSL), but you’ll have to adapt some of the commands that we provide (e.g. `./grader.py` won’t work, but `python grader.py` should; if you use Windows PowerShell, many commands work almost identically though).
Open a terminal. Play a bit with the following commands:
`pwd` (Print Working Directory) — prints out the current working directory.

To access a directory or a file, one has to specify a path. Paths can be absolute or relative. An absolute path starts from the root, denoted by `/`, and goes all the way to the desired element (directory or file). The command `pwd` prints absolute paths. For instance, on Linux systems, the home directory of the user `student` is `/home/student`. This can be read as follows: from the root, go to the sub-directory `home`, then to the sub-directory `student`. (On other systems, the default home directory might be different. For instance, on macOS, it would be `/Users/student`.)
`ls` (LiSt) — lists the files and directories in the current working directory or in the directory given as a parameter. Compare the output of this command with different options: `ls /`, `ls -l /`, and `ls -lh /` (the three commands list the content of the root directory in different formats). A single dot `.` is used to refer to the current working directory; hence, the commands `ls` and `ls .` are equivalent. Two dots `..` refer to the parent directory of the current one.

`cd` (Change Directory) — moves us in the directory tree. It can use both absolute and relative paths. For instance, the following command line moves us to the home directory of our hypothetical user `student`, which becomes the new working directory:

```
$ cd /home/student
```

(Here, `$` is the command prompt — yours could look different — do not type it in the terminal.) In order to move to the home directory of another user `student1`, one can either issue the same command replacing `student` by `student1` or, instead, issue the following commands using relative paths:

```
$ cd ..
$ cd student1
$ pwd
/home/student1
```

The first command moves to the parent directory, i.e. `/home`. The second one moves from there to the subdirectory `student1`. `cd ~` (tilde) moves to the home directory of the current user, `cd -` (minus) moves to the previous location:

```
$ cd ~ ; pwd
/home/student
$ cd /home/student1 ; pwd
/home/student1
$ cd - ; pwd
/home/student
```

(Notice how we use `;` to chain two commands on the same line.)

`mkdir` (MaKe DIRectory) — creates a new directory. For instance, `$ mkdir mydir` creates a directory named `mydir` in the working directory.

`rm` — deletes a file.

`rmdir` — deletes an empty directory.
Streams are input and output communication channels between a computer program and its environment. There are three standard streams: `stdin`, `stdout`, and `stderr`. The first two correspond to the "normal" input and output channels, whereas the third one is used for printing error messages. By default, in Unix/Linux terminals, stdin accumulates the input coming from the keyboard. Similarly, by default, all content channeled through stdout and stderr is printed directly in the terminal. However, all these streams may be redirected to or from files using the `>`, `>>`, and `<` operators:

- `>` redirects stdout to the file given as an argument, erasing its previous content,
- `>>` appends stdout to the content of the file given as an argument,
- `<` reads the contents of the file given as an argument into stdin.

Finally:

- `echo` — prints its arguments on the standard output.
- `cat` (conCATenate) — prints the contents of the file given as an argument on the standard output (more complex in reality, but this is good enough for now).
Now you can play around with the above commands to create files for the INF442 exercises as follows:
- Move to your home directory, using `ls` and `cd`
- Create a directory `INF442`
- Change to the `INF442` directory
- Create a new subdirectory `TD1`
- Change to the `TD1` directory
During the next sessions, you will create subdirectories `TD2`, `TD3`, etc. For now, you can continue playing around with the above commands:

- Use `echo` and `>` to print some message into a file `test.txt`.
- Use `ls -l` and `>>` to append the contents of the root directory `/` to the file `test.txt`.
- Use `cat` to check that the content of the file `test.txt` corresponds to your expectations.
- Use `rm` to delete the file.
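One possible sequence of commands realising those four steps (the echoed message is just an example) is:

```shell
# write a message into test.txt (> overwrites any previous content)
echo "Here are the contents of the root dir:" > test.txt
# append the detailed listing of the root directory (>> appends)
ls -l / >> test.txt
# check the result
cat test.txt
# delete the file
rm test.txt
```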
Repeat until you manage to obtain something like this:
```
$ cat test.txt
Here are the contents of the root dir:
total 10
drwxrwxr-x+ 98 root admin 3136 Feb 28 20:15 Applications
drwxr-xr-x 83 root wheel 2656 Feb 26 01:35 Library
drwxr-xr-x@ 8 root wheel 256 Dec 5 2019 System
drwxr-xr-x 5 root admin 160 Dec 5 2019 Users
drwxr-xr-x 6 root wheel 192 Feb 28 20:46 Volumes
drwxr-xr-x@ 38 root wheel 1216 Feb 26 01:31 bin
drwxr-xr-x 3 root wheel 96 Dec 5 2019 com.apple.TimeMachine.localsnapshots
drwxr-xr-x 2 root wheel 64 Nov 9 2019 cores
dr-xr-xr-x 3 root wheel 4795 Feb 26 10:00 dev
lrwxr-xr-x@ 1 root admin 11 Dec 18 2019 etc -> private/etc
lrwxr-xr-x 1 root wheel 25 Feb 26 10:00 home -> /System/Volumes/Data/home
drwxr-xr-x 5 root wheel 160 Dec 18 2019 opt
drwxr-xr-x 6 root wheel 192 Feb 26 01:32 private
drwxr-xr-x@ 63 root wheel 2016 Feb 26 01:31 sbin
lrwxr-xr-x@ 1 root admin 11 Dec 18 2019 tmp -> private/tmp
drwxr-xr-x@ 11 root wheel 352 Dec 18 2019 usr
lrwxr-xr-x@ 1 root admin 11 Dec 18 2019 var -> private/var
```
1 — Python reminders
Some useful documentation can be found here:
Note that since we’re not in computer rooms, you’ll work on your own machine.
For the record, here are the versions that are used to test the labs, and, most importantly, autograde your work:
Ubuntu 24.04
Python 3.12
numpy 1.26.4
scipy 1.11.4
pandas 2.1.4
sklearn 1.4.1
Try to stick to these versions as closely as possible, as there are breaking changes e.g. between pandas major versions 1 and 2.
In terms of editors, we have no particular recommendations. VSCode seems to dominate the Python editor ecosystem, while some may find PyCharm or Spyder better. Code auto-completion may be desirable, but try to abstain from using AI-powered tools.
Basic built-in tools
Contrary to C, C++, or Java, Python is not statically typed, i.e. you can do:

```python
a = 1
a = "bob"
a
```
Nevertheless, in this course, whenever possible, we will give type hints, such as:

```python
from typing import List

def len(l: List[int]):
    cnt = 0
    for _ in l:  # "_" means "discard current value"
        cnt += 1
    return cnt
```
This doesn’t enforce anything, i.e. you can call `len` on a list of strings, but it helps with readability and, most importantly, helps you understand what is expected from the function (and you, implementing it).
Loops
Python has two primitive loop commands, while
and
for
. They can be used for iterating over a sequence (that
is either a list, a tuple, a dictionary, a set, a string, an iterator).
Looping through lists, tuples, dictionaries or sets gives you one
element contained in it until exhaustion (we’ll see later on how).
Similarly, looping through a string will give you each character.
Loops are associated with two special keywords: continue
(i.e. prematurely do the next iteration) and break
(prematurely get out of the loop).
```python
from typing import List

fruits = ["apple", "banana", "cherry"]

def find_banana(l: List[str]):
    for x in l:
        if x == "banana":
            print("found")
            break
        if x == "cherry":
            continue
        print("neither a banana nor a cherry")

find_banana(fruits)
```
```
neither a banana nor a cherry
found
```
Function definition, code organization
As can be seen from the earlier examples, a function can be defined
with the def
keyword, followed by its name, parentheses
possibly containing arguments and/or keyword arguments
(*args
and **kwargs
are generic “holders” for
these).
```python
def my_func(a, *args, b: int = 3, **kwargs):
    print(f"a: {a}")
    print(f"args: {args}")
    print(f"b: {b}")
    print(f"kwargs: {kwargs}")

my_func(1, 'a', 'b', 'c', b=2, c=4, d="toto")
```
```
a: 1
args: ('a', 'b', 'c')
b: 2
kwargs: {'c': 4, 'd': 'toto'}
```
When typing `python` in a terminal, you get an interpreter: you can type code which gets interpreted or executed. It would be quite cumbersome to have to write and execute a program that searches for the nearest neighbor in a set of points each time `python` is run.

Variables, functions, and classes (more on that later) can thus be written in modules, i.e. files with the `.py` extension. These files can easily be imported (meaning their content gets executed as if you had typed it in the current module or in your interpreter).
Modules can be organized in packages, i.e. folders containing modules (in particular a mandatory `__init__.py` module: this is the module that is imported when you import the package as if it were a module).
```
$ echo "print('This is module foo')" > module_foo.py
$ python
>>> import module_foo
This is module foo
```

Typing `exit()` or hitting Ctrl+D will make you exit the Python interpreter.
```
$ echo "print('This is package bar'); from .module_foo import *" > __init__.py
$ mkdir -p package_bar && mv __init__.py module_foo.py package_bar
$ python
>>> import package_bar
This is package bar
This is module foo
```
Data structures
Data structures or containers often refer to objects capable of holding several other objects (of the same or of various types) and performing common operations on them.
The most common such structure in Python is the `list`, capable of holding objects of any type and providing useful features like `pop(index)` (returning the index-th element and deleting it from the list), `append(element)` (adding an element at the end), and `extend(l)` (to concatenate it in place with another list).

Another common structure in Python is the `dict` (dictionary). It holds key–value pairs of any types.
A very Pythonic trick is to use `list` (resp. `dict`) comprehensions, that is, looping through each element inside an expression that returns a `list` (resp. `dict`):

```
>>> l = [1, 2, 3]
>>> [el + 1 for el in l]
[2, 3, 4]
```
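The same works for dictionaries; for instance, a dict comprehension mapping each element to its square:

```python
l = [1, 2, 3]
squares = {el: el ** 2 for el in l}  # dict comprehension
print(squares)  # {1: 1, 2: 4, 3: 9}
```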
Conventions and style
As in many programming languages, people have come up with many different ways of writing Python code. In this course, we will try to conform to PEP styling: a set of style recommendations that many developers and organizations follow.
In particular, modules, functions and attributes will use snake_case formatting while classes use CamelCase.
Object-Oriented Programming (OOP)
In this course, whenever useful and possible, we will use the OOP paradigm. It is particularly used in Python since everything is an object, even classes.
What’s a class? Encapsulation.
A class is usually appropriate when it is possible to encapsulate or bundle related information together in some meaningful way:
```python
class Person(object):
    def __init__(self, first_name: str, last_name: str):
        self.first_name = first_name
        self.last_name = last_name

person = Person("Adrien", "Ehrhardt")
print(person.first_name)
```
```
Adrien
```
Magic methods: abstraction.
Once information is encapsulated in some meaningful container, you
can abstract away some logical operation(s) on the objects by hiding how
this happens in class methods. Python comes with magic
methods that are “standard” class methods’ names and will be
automatically called when doing some operations. We already saw the
__init__
method that gets called when instantiating a new
object. Some other useful magic methods:
```python
class Person(object):
    def __init__(self, first_name: str, last_name: str):
        print("init")
        self.first_name = first_name
        self.last_name = last_name

    def __new__(cls, first_name: str, last_name: str):
        """Creates a new object and returns it, effectively passing it to __init__"""
        print("new")
        new_obj = super().__new__(cls)
        return new_obj

    def __call__(self):
        print(f"Hey {self.first_name}!")

    def __str__(self):
        return f"{self.first_name} {self.last_name}"

    def __repr__(self):
        return f"{id(self)} - {self.first_name} - {self.last_name}"

person = Person("Adrien", "Ehrhardt")
print(repr(person))
person()
print(f"Person person is {person}")
```
```
new
init
130002366817392 - Adrien - Ehrhardt
Hey Adrien!
Person person is Adrien Ehrhardt
```
Apart from object creation, calling and identification, you may want to add operators, such as `__len__` (which is called with `len(object)`), `__add__` (`obj1 + obj2`), `__gt__` (`>`), etc.
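As an illustration (the `Group` class below is a made-up example, not part of the TD code), wiring `__len__` and `__add__` looks like this:

```python
class Group:
    """A made-up container of names, to illustrate operator magic methods."""

    def __init__(self, names):
        self.names = list(names)

    def __len__(self):         # called by len(group)
        return len(self.names)

    def __add__(self, other):  # called by group1 + group2
        return Group(self.names + other.names)

g = Group(["Adrien"]) + Group(["Ehrhardt"])
print(len(g))  # 2
```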
Lastly (for this course), iterators can be made with the `yield` keyword. It represents a countable number of values that can be iterated over. Lists, tuples, dictionaries, and sets are all iterable objects, meaning all the objects they contain are already materialized (in particular, you can take the element at index i of a list of size greater than i). In contrast, iterators are objects of classes that can `yield` the next element to iterate over “as you go”.
```python
def my_iterator(length: int = 10):
    cur_length = 0
    while cur_length < length:
        yield (cur_length := (cur_length + 1))

for i in my_iterator():
    print(i)
```
```
1
2
3
4
5
6
7
8
9
10
```
At any point in time, the past values iterated over are discarded, and the future values are unknown, which saves a lot of time and memory:
```
>>> from timeit import timeit
>>> timeit(lambda: range(100), number=100)
3.3873002394102514e-05
>>> timeit(lambda: my_iterator(100), number=100)
3.6318000638857484e-05
>>> timeit(lambda: list(range(100)), number=100)
0.0002528949989937246
```
A look at inheritance and polymorphism through base/abstract classes
Let’s take the earlier `Person` example yet again, and assume that any person must have a job which gives them additional properties. This would be nicely implemented with an abstract class:
```python
from abc import ABC, abstractmethod

class Person(ABC):  # ABC means Person is now abstract
    def __init__(self, first_name: str, last_name: str):
        self.first_name = first_name
        self.last_name = last_name

    @abstractmethod  # any non-abstract inheriting class must implement this
    def occupation(self):
        pass

    def __str__(self):
        return f"{self.first_name} {self.last_name} {self.occupation()}"

class Teacher(Person):
    def occupation(self):
        return "teaches INF442"

print(Teacher("Adrien", "Ehrhardt"))
```
```
Adrien Ehrhardt teaches INF442
```
```python
Person("Adrien", "Ehrhardt")
```
```
TypeError: Can't instantiate abstract class Person without an implementation for abstract method 'occupation'
```
We will often rely on this mechanism, in particular in this TD, where we’ll provide a base abstract class `NearestNeighborSearch` and you’ll implement the inheriting child classes `LinearScan` and `KDTree`!
NumPy
One of Python’s strengths is its wide adoption. With wide adoption (usually) comes wide availability of third-party libraries (yet another name for a Python package that has been distributed), so that you don’t need to code matrix multiplication each time you use Python!

One such library is NumPy, which is actually written in C: it is essentially an interface between you (well, through Python) and operations that are implemented in C. Indeed, a number of “critical” things (things that must happen fast) in Python are implemented in “fast”/low-level languages, and what you get is just some API to interact with that lower-level library.
On NumPy specifically, these resources can be useful to the interested reader: Python programming for data science and Python programming for data science addendum.
NumPy provides vectors/matrices of arbitrary dimensions and implements some useful operations.
```
>>> import numpy as np
>>> np.zeros(3)
array([0., 0., 0.])
>>> np.ones((2, 3))
array([[1., 1., 1.],
       [1., 1., 1.]])
```
A NumPy array is relatively low-level, in the sense that it is, e.g., singly typed. Here, `1` will be cast to a string:

```
>>> np.array(['a', 1])
array(['a', '1'], dtype='<U21')
```

Adding arrays with incompatible dtypes will fail:

```
>>> np.ones(2) + np.array(['a', 'b'])
numpy.core._exceptions._UFuncNoLoopError: ufunc 'add' did not contain a loop with signature matching types (dtype('float64'), dtype('<U1')) -> None
```

The underlying type can be retrieved:

```
>>> np.ones(2).dtype
dtype('float64')
```

A 64-bit float uses 8 bytes:

```
>>> np.ones(2).itemsize
8
```

A 32-bit float, however, uses 4:

```
>>> np.ones(2).astype("float32").dtype
dtype('float32')
>>> np.ones(2).astype("float32").itemsize
4
```
Although low-level, it abstracts many very useful operations. The documentation provides many details, but some (if not all) useful NumPy commands for this course are:

- `np.zeros`, `np.ones`, which take a shape as argument and return an array of the requested shape filled with zeros (resp. ones); `np.full` to achieve the same behavior with an arbitrary `fill_value`.
- `np.transpose(array)`, or `array.T`, to transpose an array.
- `np.(nan)(arg){min,max}(array)` or `array.{min,max}()` to get the (arg){min,max} of an array, possibly discarding any null (nan) values encountered.
- `array[:, 1]` to slice the 2nd column and all rows.
Recommendation: always prefer “vector(ized)” operations (e.g. `array[:, 1] = 0`) over loops (e.g. `for row in range(len(array)): array[row, 1] = 0`).
NumPy allows broadcasting (which already took place in the earlier example: `0` was “interpreted” as an array of 0s on the right-hand side of `array[:, 1] = 0`). While this topic is out of the scope of the present course (the interested reader can go to this page for more details), you will most likely encounter errors like this:

```
ValueError: operands could not be broadcast together.
```

This usually means that you’re trying to perform an array operation involving arrays of incompatible shapes, e.g. multiplying a (3, 4) by a (2, 5) matrix:

```
>>> np.ones((3, 4)) @ np.ones((2, 5))
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 4)
```
Pandas
Pandas is another off-the-shelf library, geared towards data manipulation, that is widely adopted for Machine Learning purposes. Here are some useful resources:
Throughout this course, we will reimplement some classical methods using relatively low-level logic, including with NumPy. We will usually compare them to off-the-shelf implementations on some real dataset that will generally come as a Pandas DataFrame, so very basic knowledge of the Pandas API will come in handy.
Scikit-learn
Scikit-learn (abbreviated `sklearn` when imported in Python) is a Machine Learning library that implements popular algorithms in a common API. We will compare our implementations to those provided by Scikit-learn. No particular knowledge will be required, as the little code involved will be explained (or self-explanatory).
Plots
Whenever needed, we will plot things relying on `matplotlib`. If you ever have to write code yourself to display something, the command and a link to its documentation will usually be given.
2 — Nearest-neighbor search: linear scan vs. kd-trees
We study the problem of k nearest neighbors search (kNN), which is to find the k nearest neighbors of a query point from a given data set. For this we will build kd-trees, and we’ll stick to k=1 in what follows.

The parts of the code to be completed correspond to basic operations and algorithms to build a kd-tree and use it to answer nearest-neighbor queries. Then, you will be able to:
- compare the three types of search: linear, defeatist and backtracking search,
- compute the average running time of each method, and
- compute the success rates of the two kd-tree based searches.
(optional) — An overview of k-dimensional trees
You can safely skip this section if you attended the lecture.
Overview of the data structure
A kd-tree is a binary tree, where data in each node is a k-dimensional point in space. It is a space-partitioning data structure for organizing points in a k-dimensional space.
Kd-trees are useful data structures for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbor searches).
Fig. 1. Example of a 2d-tree.
Every non-leaf node (also called an internal node) corresponds to a splitting hyperplane that divides the space of points into two parts (half-spaces).
Points on one side of this hyperplane are included in the left subtree of that node and points on the other side of the hyperplane are represented in the right subtree.
For example, if a particular split separates points across the x-axis: all points in the subtree with a smaller x-value than the node will appear in the left subtree and all points with larger x-value will be in the right subtree.
Remark: Refer to the lecture slides for a detailed description of the kd-tree and further references.
Kd-tree Construction:
A kd-tree can be constructed as follows:
- Iterate over the dimensions according to their order
- If there is a single (k-dimensional) point, form a "leaf node" with that point.
- Otherwise, each internal node implements a spatial partition induced by a hyperplane H, splitting the point cloud into two equal subsets: all points lying on one side of H in the right subtree, all other points in the left subtree.
- Recursively construct kd-trees for the two resulting sub-sets of points.
Searching for the nearest neighbor of a query point:
The computation of neighbors of a point q (i.e., those points that are closer to q than a given distance R) is performed by recursively traversing the tree (starting from the root).
At each stage it is necessary to test whether the ball of radius R (distance) centered at q intersects the separating hyperplane:
- if the ball intersects the splitting hyperplane (i.e. if the ball is not contained in a single subspace/subtree), then we must continue the search recursively in both subtrees (left and right): the result of the query will be the union of the two recursive queries in the two subtrees.
- if there is no intersection with the separating hyperplane, then the ball of distance R around point q is contained in the subspace of q. This means that the search only needs to be performed in the corresponding subtree. It is then necessary to determine on which side of the splitting plane the ball lies.
Nearest-neighbor search via naive linear scan
In this section we will implement linear scan as a naive strategy for nearest-neighbor queries in a database.
First, we need a definition for a point (also called sample, example, …). A point in Python will be a row numpy array (that is, either a 1D array corresponding to one point, or a 2D array with rows corresponding to several points).
Ex1: distance
To perform nearest-neighbor search, we need to be able to compute distances between points. Complete the function `euclidean_distance` in `TD/nearest_neighbor.py`.
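A minimal sketch of what is expected (the exact signature and any vectorization requirements are defined in `TD/nearest_neighbor.py`, so treat this as an assumption):

```python
import numpy as np

def euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    # square root of the sum of squared coordinate differences
    return float(np.sqrt(np.sum((x - y) ** 2)))

print(euclidean_distance(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0
```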
While you’re there, take a look around! The class `NearestNeighborSearch` will be inherited by `LinearScan`; try to understand what that implies.
You can test your implementation of `euclidean_distance` by typing `./grader.py 1` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `nearest_neighbor.py`:
Ex2: linear scan
Now that we have a distance, we can implement `LinearScan` in `linear_scan.py`. In particular, complete the implementation of the `query` method, which loops through (hence the “linear” name) the points, computes their distance to the query point, and returns the minimum distance of the query point to any point in the set, together with the index of this point (in case of ties in distances, return the smallest index).
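As a standalone sketch (in the TD, this is a method of `LinearScan`, so names and attributes will differ):

```python
import numpy as np

def query(X: np.ndarray, x: np.ndarray) -> tuple[float, int]:
    """Return (distance, index) of the nearest point to x among the rows of X.
    A strict '<' keeps the smallest index in case of ties."""
    best_dist, best_idx = np.inf, -1
    for i in range(X.shape[0]):
        d = float(np.sqrt(np.sum((X[i] - x) ** 2)))
        if d < best_dist:
            best_dist, best_idx = d, i
    return best_dist, best_idx

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
print(query(X, np.array([1.9, 0.0])))  # nearest is index 2
```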
You can test your implementation of `query` by typing `./grader.py 2` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `linear_scan.py`:
Nearest-neighbor search via k-dimensional trees
In this section, we implement a kd-tree to answer nearest-neighbor queries faster.
We can implement a kd-tree in several ways. Here, we follow the outline given in the lecture. We assume that you have already implemented and tested the function `euclidean_distance`.
A key step of the kd-tree implementation is splitting a list of points into two balanced sub-lists, according to some particular order. We will implement this as two separate tasks, with the functions `median` and `partition`. Then, we will consider the definition of the tree structure. Finally, we will consider two algorithms for searching in a kd-tree.
We shall complete and/or modify the following functions in `kdtree.py`:

- (Ex 3) `median(X: np.ndarray, start: int, stop: int, c: int) -> float` computes the median of `X` between indices `start` and `stop` along the coordinate `c`
- (Ex 4) `partition(X: np.ndarray, start: int, stop: int, c: int) -> int` partitions an array with respect to its median value along the coordinate `c`
- (Ex 5) `_build(self, start: int, stop: int, c: int) -> Node | None` builds the kd-tree’s node given `start` and `stop` indices along the coordinate `c`
- (Ex 6) `_defeatist(self, node: Node | None, x: np.ndarray)` performs a defeatist search in a kd-tree by correctly updating `self._current_idx` and `self._current_dist`
- (Ex 7) `_backtracking(self, node: Node | None, x: np.ndarray)` performs a backtracking search in a kd-tree by correctly updating `self._current_idx` and `self._current_dist`
Ex3: median
Complete the function `median(X: np.ndarray, start: int, stop: int, c: int) -> float`. It takes as arguments a 2D array `X`, whose sub-array of indices in the range `[start..stop[` will be the only one considered, and a coordinate number `c`. It returns the median of the collection of `c`-th coordinates of the points in the sub-array (there are `stop-start` of those). To compute this median, you can simply sort the collection of `c`-th coordinates in the range `[start..stop[`, then take the (`(stop - start)/2`)-th element in the sorted collection. You can use NumPy’s `.sort()` to sort in place a copy of the appropriate values.
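Following that recipe, a sketch (assuming the signature above and integer division for the middle index):

```python
import numpy as np

def median(X: np.ndarray, start: int, stop: int, c: int) -> float:
    vals = X[start:stop, c].copy()  # copy so that X itself is untouched
    vals.sort()                     # NumPy in-place sort of the copy
    return float(vals[(stop - start) // 2])

X = np.array([[3.0], [1.0], [2.0]])
print(median(X, 0, 3, 0))  # 2.0
```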
After you implement your version of `median`, you can test your code locally by typing `./grader.py 3` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `kdtree.py`:
Ex4: partition
Complete the function `partition(X: np.ndarray, start: int, stop: int, c: int) -> int`, which arranges the points around the median. It takes as arguments a 2D array `X`, whose sub-array of indices in the range `[start..stop[` will be the only one considered, and the coordinate `c` along which the partitioning is to be done. It returns the index `idx` of the point in `X` whose `c`-th coordinate is used as the median. We here assume that all data points are unique in order to ensure a correct execution of our algorithm. The function rearranges the points in the array `X` in such a way that every point within the subrange `[start..idx]` has a `c`-th coordinate less than or equal to the median value `m` (with `X[idx, c] == m`), while every point within the subrange `[idx+1..stop[` has a `c`-th coordinate strictly greater than the median.
After you implement your version of `partition`, you can test your code locally by typing `./grader.py 4` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `kdtree.py`:
Hint 1: Take a look at the slides of the lecture (again).

Hint 2: You can (though it is not mandatory) use `median` and `swap`. Indeed, you may find a “better”, pure NumPy, alternative.
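For instance, one pure-NumPy sketch (it does a full sort rather than a true linear-time partition, so it is simple but not optimal, and it relies on the coordinate values in the range being unique):

```python
import numpy as np

def partition(X: np.ndarray, start: int, stop: int, c: int) -> int:
    # sort the rows of the subrange by their c-th coordinate
    order = np.argsort(X[start:stop, c])
    X[start:stop] = X[start:stop][order]
    # with unique values, the median now sits exactly in the middle
    return start + (stop - start) // 2

X = np.array([[3.0], [1.0], [2.0]])
idx = partition(X, 0, 3, 0)
print(idx, X[idx, 0])  # 1 2.0
```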
Ex5: build the tree
Complete the function `_build(self, start: int, stop: int, c: int) -> Node | None`, which builds the kd-tree’s node given `start` and `stop` indices along the coordinate `c`. That is, it (1) partitions the data in `[start..stop[` along the coordinate `c`, (2) retrieves the median, and (3) returns the appropriate `Node`. To populate `left` and `right`, you need to appropriately recurse.
After you implement your version of `_build`, you can test your code locally by typing `./grader.py 5` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `kdtree.py`:
Ex6: perform defeatist search
Implement defeatist search as it is presented in your lecture notes.
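As a hedged, self-contained sketch (in the TD, `_defeatist` instead updates `self._current_idx` and `self._current_dist`; the `Node` here is a hypothetical stand-in that stores its splitting coordinate `c`):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Node:  # hypothetical stand-in for the Node class in kdtree.py
    idx: int
    c: int                      # splitting coordinate of this node
    left: "Node | None" = None
    right: "Node | None" = None

def defeatist(X, node, x, best=(np.inf, -1)):
    """Descend only on the query's side of each split, updating the best
    (distance, index) with every node visited; never explore the far side."""
    if node is None:
        return best
    d = float(np.linalg.norm(X[node.idx] - x))
    if d < best[0]:
        best = (d, node.idx)
    if x[node.c] <= X[node.idx, node.c]:
        return defeatist(X, node.left, x, best)
    return defeatist(X, node.right, x, best)
```

Because the far side of each split is never explored, the search is fast but can miss the true nearest neighbor; the grader's second test measures exactly that.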
After you implement your version of `_defeatist`, you can test your code locally by typing `./grader.py 6` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `kdtree.py`:
N.B. 1: The second test compares the precision of `defeatist` versus `linear_scan`, i.e. how many times `defeatist` got the nearest neighbor wrong.
Ex7: perform backtracking search
Implement backtracking search as it is presented in your lecture notes.
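Again as a hedged, self-contained sketch (the `Node` is a hypothetical stand-in storing its splitting coordinate; in the TD, `_backtracking` updates `self._current_idx` and `self._current_dist` instead of returning a pair):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Node:  # hypothetical stand-in for the Node class in kdtree.py
    idx: int
    c: int                      # splitting coordinate of this node
    left: "Node | None" = None
    right: "Node | None" = None

def backtracking(X, node, x, best=(np.inf, -1)):
    """Like defeatist search, but also explore the far side whenever the
    splitting hyperplane is closer than the current best distance."""
    if node is None:
        return best
    d = float(np.linalg.norm(X[node.idx] - x))
    if d < best[0]:
        best = (d, node.idx)
    gap = float(x[node.c] - X[node.idx, node.c])  # signed distance to hyperplane
    near, far = (node.left, node.right) if gap <= 0 else (node.right, node.left)
    best = backtracking(X, near, x, best)
    if abs(gap) < best[0]:      # the ball around x crosses the hyperplane
        best = backtracking(X, far, x, best)
    return best
```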
After you implement your version of `_backtracking`, you can test your code locally by typing `./grader.py 7` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `kdtree.py`:
N.B. 1: The second test compares the run-time speed of `defeatist` versus `backtracking`.
N.B. 2: The last test compares your implementation to SciPy’s. If only this test is failing, read the next section to understand how we perform this comparison.
3 — Compare with off-the-shelf implementations
SciPy implements kd-trees as well. So, we naturally ask ourselves how our implementation compares.
Have a look at `scripts/iris.py`. It:

- downloads the Iris dataset;
- instantiates both (SciPy’s and yours) implementations of kd-trees;
- compares predictions both in terms of distance to the nearest neighbor and indices;
- computes an error rate:
  - for each query point in `X_test`, compute the nearest neighbor index in `X_train`;
  - take the label (Setosa, Versicolour, Virginica) of the nearest neighbor of each query point as the prediction for the label of that query point;
  - count a correct prediction as 0, an incorrect one as 1 (“0-1 loss”), and divide by the number of query points.
Remarks:

- The script can be run with `scripts/iris.py`.
- The train/test split done by `sklearn` (see the `iris()` function definition) is random; results will change when the script is rerun.
- In case of equal distances, both implementations may return different indices, corresponding to different neighbors, which means both implementations may give the same distances but different neighbors, hence different error rates.
4 — One-Nearest-Neighbor in practice
For each TD, we will try to end the session with some “Kaggle-like” competition. We provide some training data (and possibly some validation data), and you provide some “model” (most of the time, you will provide hyperparameters, see below). The grader server will then instantiate (and fit, whenever that makes sense) your implementation on the same data we provided, with the hyperparameters you provided, and compute some error over a private test set (hidden from you so that you don’t overfit the data).

We will define some performance threshold (given by a benchmark model/implementation of our choice, which you won’t have access to): you’ll get the full grade if you are above it, and no points at all if you are below.
As a practical example for subsequent TDs, have a look at `xaggle` in `grader.py`. It takes the `mode` as defined in `set_xaggle_config` in `kdtree.py` to perform a query, either `defeatist` or `backtracking` (you have to choose!), given the iris dataset train data (which you have access to!) for test points that are part of some private test set (which you don’t have access to!).
Complete `set_xaggle_config` with an appropriate query method that yields an error rate on the private test set which passes the online grader. Here, it is pretty simple to brute-force the problem and not have to think much and/or play with the training data. In the subsequent TDs, the combinatorics in the `set_xaggle_config` method won’t let you do that!