CSC43042EP (formerly INF442) - Introduction and Nearest Neighbor Search
Context
Welcome to the TD sessions of INF442 / CSC43042EP!
The goal of this course is to introduce key Machine Learning concepts from a Computer Science perspective, thus focusing on complexity, approximate approaches, and implementation techniques.
In this first session, after a refresher on using the terminal, the Python interpreter, Object-Oriented Programming, and NumPy, we will implement nearest-neighbor search with linear scan and kd-trees.
Structure
This TD is structured as follows:
Section 0 provides a very brief introduction to the fundamental commands that you can use in a Unix/Linux terminal. More information is provided in a separate tutorial. If you are already comfortable with Unix and/or Linux, you can safely skip this section; however, we do recommend that you take at least a quick look to make sure that you master all the commands introduced therein.
Section 1 provides base Python reminders and a short introduction to NumPy, Pandas, Scikit-learn, and matplotlib.
Section 2 has you implement linear scan, kd-trees, and the defeatist and backtracking search algorithms.
Section 3 has you compare your implementation to off-the-shelf implementations in terms of execution speed and accuracy.
Section 4 has you compete for the best test set accuracy.
All subsequent TD sessions will be structured similarly, i.e. starting from a reimplementation of algorithms seen in the lecture, then a comparison with some off-the-shelf implementation, and lastly a “data challenge” where you apply your algorithm(s) on some (often curated but still real-life) dataset.
Before you begin
Download and unzip the archive CSC43042EP-td1-1-handin.zip.
It contains:
- a `scripts` folder with a starter script `introduction.py` for the following introduction, and an `iris.py` script for when you’re done implementing all the exercises,
- a `TD` folder which acts as a Python package (more on that later) where we’ll implement the exercises,
- a `grader.py` script which runs tests against your code and is used for autograding.
0 (optional) — Useful basic commands in Unix terminal
If you have a Windows-based laptop, it is not compulsory to install Linux (e.g. through WSL), but you’ll have to adapt some of the commands that we provide (e.g. `./grader.py` won’t work, but `python grader.py` should; if you use Windows PowerShell, many commands work almost identically though).
Open a terminal. Play a bit with the following commands:
`pwd` (Print Working Directory) — prints out the current working directory.

To access a directory or a file, one has to specify a path. Paths can be absolute or relative. An absolute path starts from the root, denoted by `/`, and goes all the way to the desired element (directory or file). The command `pwd` prints absolute paths. For instance, on Linux systems, the home directory of the user `student` is `/home/student`. This can be read as follows: from the root, go to the sub-directory `home`, then to the sub-directory `student`. (On other systems, the default home directory might be different. For instance, on macOS, it would be `/Users/student`.)
`ls` (LiSt) — lists the files and directories in the current working directory or in the directory given as a parameter. Compare the output of this command with different options: `ls /`, `ls -l /`, and `ls -lh /` (the three commands list the content of the root directory in different formats). A single dot `.` is used to refer to the current working directory; hence, the commands `ls` and `ls .` are equivalent. Two dots `..` refer to the parent directory of the current one.

`cd` (Change Directory) — moves us in the directory tree. It can use both absolute and relative paths. For instance, the following command line moves us to the home directory of our hypothetical user `student`, which becomes the new working directory:

```
$ cd /home/student
```

(Here, `$` is the command prompt — yours could look different — do not type it in the terminal.) In order to move to the home directory of another user `student1`, one can either issue the same command replacing `student` by `student1` or, instead, issue the following commands using relative paths:

```
$ cd ..
$ cd student1
$ pwd
/home/student1
```

The first command moves to the parent directory, i.e. `/home`. The second one moves from there to the subdirectory `student1`. `cd ~` (tilde) moves to the home directory of the current user, `cd -` (minus) moves to the previous location:

```
$ cd ~ ; pwd
/home/student
$ cd /home/student1 ; pwd
/home/student1
$ cd - ; pwd
/home/student
```

(Notice how we use `;` to chain two commands on the same line.)

`mkdir` (MaKe DIRectory) — creates a new directory. For instance, `$ mkdir mydir` creates a directory named `mydir` in the working directory.

`rm` — deletes a file.

`rmdir` — deletes an empty directory.
Streams are input and output communication channels between a computer program and its environment. There are three standard streams: `stdin`, `stdout`, and `stderr`. The first two correspond to the "normal" input and output channels, whereas the third one is used for printing error messages. By default, in Unix/Linux terminals, stdin accumulates the input coming from the keyboard. Similarly, by default, all content channeled through stdout and stderr is printed directly in the terminal. However, all these streams may be redirected to or from files using the `>`, `>>`, and `<` operators:

- `>` redirects stdout to the file given as an argument, erasing its previous content,
- `>>` appends stdout to the content of the file given as an argument,
- `<` reads the contents of the file given as an argument into stdin.

Finally:

- `echo` — prints its arguments on the standard output.
- `cat` (conCATenate) — prints the contents of the file given as an argument on the standard output (more complex in reality, but this is good enough for now).
Now you can play around with the above commands to create files for the INF442 exercises as follows:
- Move to your home directory, using `ls` and `cd`
- Create a directory `INF442`
- Change to the `INF442` directory
- Create a new subdirectory `TD1`
- Change to the `TD1` directory
During the next sessions, you will create subdirectories `TD2`, `TD3`, etc. For now, you can continue playing around with the above commands:

- Use `echo` and `>` to print some message into a file `test.txt`.
- Use `ls -l` and `>>` to append the contents of the root directory `/` to the file `test.txt`.
- Use `cat` to check that the content of the file `test.txt` corresponds to your expectations.
- Use `rm` to delete the file.
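One possible sequence of commands realising those four steps (the echoed message is just an example) is:

```shell
# write a message into test.txt (> overwrites any previous content)
echo "Here are the contents of the root dir:" > test.txt
# append the detailed listing of the root directory (>> appends)
ls -l / >> test.txt
# check the result
cat test.txt
# delete the file
rm test.txt
```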
Repeat until you manage to obtain something like this:
```
$ cat test.txt
Here are the contents of the root dir:
total 10
drwxrwxr-x+ 98 root admin 3136 Feb 28 20:15 Applications
drwxr-xr-x 83 root wheel 2656 Feb 26 01:35 Library
drwxr-xr-x@ 8 root wheel 256 Dec 5 2019 System
drwxr-xr-x 5 root admin 160 Dec 5 2019 Users
drwxr-xr-x 6 root wheel 192 Feb 28 20:46 Volumes
drwxr-xr-x@ 38 root wheel 1216 Feb 26 01:31 bin
drwxr-xr-x 3 root wheel 96 Dec 5 2019 com.apple.TimeMachine.localsnapshots
drwxr-xr-x 2 root wheel 64 Nov 9 2019 cores
dr-xr-xr-x 3 root wheel 4795 Feb 26 10:00 dev
lrwxr-xr-x@ 1 root admin 11 Dec 18 2019 etc -> private/etc
lrwxr-xr-x 1 root wheel 25 Feb 26 10:00 home -> /System/Volumes/Data/home
drwxr-xr-x 5 root wheel 160 Dec 18 2019 opt
drwxr-xr-x 6 root wheel 192 Feb 26 01:32 private
drwxr-xr-x@ 63 root wheel 2016 Feb 26 01:31 sbin
lrwxr-xr-x@ 1 root admin 11 Dec 18 2019 tmp -> private/tmp
drwxr-xr-x@ 11 root wheel 352 Dec 18 2019 usr
lrwxr-xr-x@ 1 root admin 11 Dec 18 2019 var -> private/var
```
1 — Python reminders
Some useful documentation can be found here:
Note that since we’re not in computer rooms, you’ll work on your own machine.
For the record, here are the versions that are used to test the labs, and, most importantly, autograde your work:
Ubuntu 24.04
Python 3.12
numpy 1.26.4
scipy 1.11.4
pandas 2.1.4
sklearn 1.4.1
Try to stick to these versions as closely as possible, as there are breaking changes e.g. between pandas major versions 1 and 2.
In terms of editors, we have no particular recommendations. VSCode seems to dominate the Python editor ecosystem, while some may find PyCharm or Spyder better. Code auto-completion may be desirable, but try to abstain from using AI-powered tools.
Basic built-in tools
Contrary to C, C++, or Java, Python is not statically typed, i.e. you can do:

```python
a = 1
a = "bob"
a
```
Nevertheless, in this course, whenever possible, we will give type hints, such as:

```python
from typing import List

def len(l: List[int]):
    cnt = 0
    for _ in l:  # "_" means "discard current value"
        cnt += 1
    return cnt
```
This doesn’t enforce anything, i.e. you can call `len` on a list of strings, but it helps with readability and, most importantly, helps you understand what is expected from the function (and you, implementing it).
Loops
Python has two primitive loop commands, while
and
for
. They can be used for iterating over a sequence (that
is either a list, a tuple, a dictionary, a set, a string, an iterator).
Looping through lists, tuples, dictionaries or sets gives you one
element contained in it until exhaustion (we’ll see later on how).
Similarly, looping through a string will give you each character.
Loops are associated with two special keywords: continue
(i.e. prematurely do the next iteration) and break
(prematurely get out of the loop).
```python
from typing import List

fruits = ["apple", "banana", "cherry"]

def find_banana(l: List[str]):
    for x in l:
        if x == "banana":
            print("found")
            break
        if x == "cherry":
            continue
        print("neither a banana nor a cherry")

find_banana(fruits)
```
```
neither a banana nor a cherry
found
```
Function definition, code organization
As can be seen from the earlier examples, a function can be defined
with the def
keyword, followed by its name, parentheses
possibly containing arguments and/or keyword arguments
(*args
and **kwargs
are generic “holders” for
these).
```python
def my_func(a, *args, b: int = 3, **kwargs):
    print(f"a: {a}")
    print(f"args: {args}")
    print(f"b: {b}")
    print(f"kwargs: {kwargs}")

my_func(1, 'a', 'b', 'c', b=2, c=4, d="toto")
```
```
a: 1
args: ('a', 'b', 'c')
b: 2
kwargs: {'c': 4, 'd': 'toto'}
```
When typing `python` in a terminal, you get an interpreter: you can type code which gets interpreted or executed. It would be quite cumbersome to have to write and execute a program that searches for the nearest neighbor in a set of points each time `python` is run.

Variables, functions, and classes (more on that later) can thus be written in modules, i.e. files with the `.py` extension. These files can easily be imported (meaning their content gets executed as if you had typed it in the current module or in your interpreter).
Modules can be organized in packages, i.e. folders containing modules (in particular a mandatory `__init__.py` module: this is the module that is imported when you import the package as if it were a module).
```
$ echo "print('This is module foo')" > module_foo.py
$ python
>>> import module_foo
This is module foo
```

Typing `exit()` or hitting Ctrl+D will make you exit the Python interpreter.
```
$ echo "print('This is package bar'); from .module_foo import *" > __init__.py
$ mkdir -p package_bar && mv __init__.py module_foo.py package_bar
$ python
>>> import package_bar
This is package bar
This is module foo
```
Data structures
Data structures or containers often refer to objects capable of holding several other objects (of the same or of various types) and performing common operations on them.
The most common such structure in Python is the `list`, capable of holding objects of any type and providing useful features like `pop(index)` (returning the index-th element and deleting it from the list), `append(element)` (adding an element at the end), and `extend(l)` (to concatenate it in place with another list).

Another common structure in Python is the `dict` (dictionary). It holds key–value pairs of any types.
A very Pythonic trick is to use `list` (resp. `dict`) comprehensions, that is, looping through each element inside an expression that returns a `list` (resp. `dict`):

```
>>> l = [1, 2, 3]
>>> [el + 1 for el in l]
[2, 3, 4]
```
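The same works for dictionaries; for instance, a dict comprehension mapping each element to its square:

```python
l = [1, 2, 3]
squares = {el: el ** 2 for el in l}  # dict comprehension
print(squares)  # {1: 1, 2: 4, 3: 9}
```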
Conventions and style
As in many programming languages, people have come up with many different ways of writing Python code. In this course, we will try to conform to PEP styling: a set of style recommendations that many developers and organizations follow.
In particular, modules, functions and attributes will use snake_case formatting while classes use CamelCase.
Object-Oriented Programming (OOP)
In this course, whenever useful and possible, we will use the OOP paradigm. It is particularly used in Python since everything is an object, even classes.
What’s a class? Encapsulation.
A class is usually appropriate when it is possible to encapsulate or bundle related information together in some meaningful way:
```python
class Person(object):
    def __init__(self, first_name: str, last_name: str):
        self.first_name = first_name
        self.last_name = last_name

person = Person("Adrien", "Ehrhardt")
print(person.first_name)
```
```
Adrien
```
Magic methods: abstraction.
Once information is encapsulated in some meaningful container, you
can abstract away some logical operation(s) on the objects by hiding how
this happens in class methods. Python comes with magic
methods that are “standard” class methods’ names and will be
automatically called when doing some operations. We already saw the
__init__
method that gets called when instantiating a new
object. Some other useful magic methods:
```python
class Person(object):
    def __init__(self, first_name: str, last_name: str):
        print("init")
        self.first_name = first_name
        self.last_name = last_name

    def __new__(cls, first_name: str, last_name: str):
        """Creates a new object and returns it, effectively passing it to __init__"""
        print("new")
        new_obj = super().__new__(cls)
        return new_obj

    def __call__(self):
        print(f"Hey {self.first_name}!")

    def __str__(self):
        return f"{self.first_name} {self.last_name}"

    def __repr__(self):
        return f"{id(self)} - {self.first_name} - {self.last_name}"

person = Person("Adrien", "Ehrhardt")
print(repr(person))
person()
print(f"Person person is {person}")
```
```
new
init
130002366817392 - Adrien - Ehrhardt
Hey Adrien!
Person person is Adrien Ehrhardt
```
Apart from object creation, calling and identification, you may want to add operators, such as `__len__` (which is called with `len(object)`), `__add__` (`obj1 + obj2`), `__gt__` (`>`), etc.
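As an illustration (the `Group` class below is a made-up example, not part of the TD code), wiring `__len__` and `__add__` looks like this:

```python
class Group:
    """A made-up container of names, to illustrate operator magic methods."""

    def __init__(self, names):
        self.names = list(names)

    def __len__(self):         # called by len(group)
        return len(self.names)

    def __add__(self, other):  # called by group1 + group2
        return Group(self.names + other.names)

g = Group(["Adrien"]) + Group(["Ehrhardt"])
print(len(g))  # 2
```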
Lastly (for this course), iterators can be made with the `yield` keyword. It represents a countable number of values that can be iterated over. Lists, tuples, dictionaries, and sets are all iterable objects, meaning all the objects they contain are already materialized (in particular, you can take the element at index i of a list of size greater than i). In contrast, iterators are objects of classes that can `yield` the next element to iterate over “as you go”.
```python
def my_iterator(length: int = 10):
    cur_length = 0
    while cur_length < length:
        yield (cur_length := (cur_length + 1))

for i in my_iterator():
    print(i)
```
```
1
2
3
4
5
6
7
8
9
10
```
At any point in time, the past values iterated over are discarded, and the future values are unknown, which saves a lot of time and memory:
```
>>> from timeit import timeit
>>> timeit(lambda: range(100), number=100)
3.3873002394102514e-05
>>> timeit(lambda: my_iterator(100), number=100)
3.6318000638857484e-05
>>> timeit(lambda: list(range(100)), number=100)
0.0002528949989937246
```
A look at inheritance and polymorphism through base/abstract classes
Let’s take the earlier `Person` example yet again, and assume that any person must have a job which gives them additional properties. This would be nicely implemented with an abstract class:
```python
from abc import ABC, abstractmethod

class Person(ABC):  # ABC means Person is now abstract
    def __init__(self, first_name: str, last_name: str):
        self.first_name = first_name
        self.last_name = last_name

    @abstractmethod  # any non-abstract inheriting class must implement this
    def occupation(self):
        pass

    def __str__(self):
        return f"{self.first_name} {self.last_name} {self.occupation()}"

class Teacher(Person):
    def occupation(self):
        return "teaches INF442"

print(Teacher("Adrien", "Ehrhardt"))
```
```
Adrien Ehrhardt teaches INF442
```
```python
Person("Adrien", "Ehrhardt")
```
```
TypeError: Can't instantiate abstract class Person without an implementation for abstract method 'occupation'
```
We will often rely on this mechanism, in particular in this TD, where we’ll provide a base abstract class `NearestNeighborSearch` and you’ll implement the inheriting child classes `LinearScan` and `KDTree`!
NumPy
One of Python’s strengths is its wide adoption. With wide adoption (usually) comes wide availability of third-party libraries (yet another name for a Python package that has been distributed), so that you don’t need to code matrix multiplication each time you use Python!

One such library is NumPy, which is actually written in C: it is essentially an interface between you (well, through Python) and operations that are implemented in C. Indeed, a number of “critical” things (things that must happen fast) in Python are implemented in “fast”/low-level languages, and what you get is just some API to interact with that lower-level library.
On NumPy specifically, these resources can be useful to the interested reader: Python programming for data science and Python programming for data science addendum.
NumPy provides vectors/matrices of arbitrary dimensions and implements some useful operations.
```
>>> import numpy as np
>>> np.zeros(3)
array([0., 0., 0.])
>>> np.ones((2, 3))
array([[1., 1., 1.],
       [1., 1., 1.]])
```
A NumPy array is relatively low-level, in the sense that it is, e.g., singly typed. Here, `1` will be cast to a string:

```
>>> np.array(['a', 1])
array(['a', '1'], dtype='<U21')
```

Adding arrays with incompatible dtypes will fail:

```
>>> np.ones(2) + np.array(['a', 'b'])
numpy.core._exceptions._UFuncNoLoopError: ufunc 'add' did not contain a loop with signature matching types (dtype('float64'), dtype('<U1')) -> None
```

The underlying type can be retrieved:

```
>>> np.ones(2).dtype
dtype('float64')
```

A 64-bit float uses 8 bytes:

```
>>> np.ones(2).itemsize
8
```

A 32-bit float, however, uses 4:

```
>>> np.ones(2).astype("float32").dtype
dtype('float32')
>>> np.ones(2).astype("float32").itemsize
4
```
Although low-level, it abstracts many very useful operations. The documentation provides many details, but some (if not all) useful NumPy commands for this course are:

- `np.zeros`, `np.ones`, which take a shape as argument and return an array of the requested shape filled with zeros (resp. ones); `np.full` to achieve the same behavior with an arbitrary `fill_value`.
- `np.transpose(array)`, or `array.T`, to transpose an array.
- `np.(nan)(arg){min,max}(array)` or `array.{min,max}()` to get the (arg){min,max} of an array, possibly discarding any null (nan) values encountered.
- `array[:, 1]` to slice the 2nd column and all rows.
Recommendation: always prefer “vector(ized)” operations (e.g. `array[:, 1] = 0`) over loops (e.g. `for row in range(len(array)): array[row, 1] = 0`).
NumPy allows broadcasting (which already took place in the earlier example: `0` was “interpreted” as an array of 0s on the right-hand side of `array[:, 1] = 0`). While this topic is out of the scope of the present course (the interested reader can go to this page for more details), you will most likely encounter errors like this:

```
ValueError: operands could not be broadcast together.
```

This usually means that you’re trying to perform an array operation involving arrays of incompatible shapes, e.g. multiplying a (3, 4) by a (2, 5) matrix:

```
>>> np.ones((3, 4)) @ np.ones((2, 5))
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 4)
```
Pandas
Pandas is another off-the-shelf library, geared towards data manipulation, that is widely adopted for Machine Learning purposes. Here are some useful resources:
Throughout this course, we will reimplement some classical methods using relatively low-level logic, including with NumPy. We will usually compare them to off-the-shelf implementations on some real dataset that will generally come as a Pandas DataFrame, so very basic knowledge of the Pandas API will come in handy.
Scikit-learn
Scikit-learn (abbreviated `sklearn` when imported in Python) is a Machine Learning library that implements popular algorithms in a common API. We will compare our implementations to those provided by Scikit-learn. No particular knowledge will be required, as the little code involved will be explained (or self-explanatory).
Plots
Whenever needed, we will plot things relying on `matplotlib`. If you ever have to write code yourself to display something, the command and a link to its documentation will usually be given.
2 — Nearest-neighbor search: linear scan vs. kd-trees
We study the problem of k nearest neighbors search (kNN), which is to find the k nearest neighbors of a query point from a given data set. For this we will build kd-trees, and we’ll stick to k=1 in what follows.

The parts of the code to be completed correspond to basic operations and algorithms to build a kd-tree and use it to answer nearest-neighbor queries. Then, you will be able to:
- compare the three types of search: linear, defeatist and backtracking search,
- compute the average running time of each method, and
- compute the success rates of the two kd-tree based searches.
(optional) — An overview of k-dimensional trees
You can safely skip this section if you attended the lecture.
Overview of the data structure
A kd-tree is a binary tree, where data in each node is a k-dimensional point in space. It is a space-partitioning data structure for organizing points in a k-dimensional space.
Kd-trees are useful data structures for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbor searches).
Fig. 1. Example of a 2d-tree.
Every non-leaf node (also called an internal node) corresponds to a splitting hyperplane that divides the space of points into two parts (half-spaces).
Points on one side of this hyperplane are included in the left subtree of that node and points on the other side of the hyperplane are represented in the right subtree.
For example, if a particular split separates points across the x-axis: all points in the subtree with a smaller x-value than the node will appear in the left subtree and all points with larger x-value will be in the right subtree.
Remark: Refer to the lecture slides for a detailed description of the kd-tree and further references.
Kd-tree Construction:
A kd-tree can be constructed as follows:
- Iterate over the dimensions according to their order
- If there is a single (k-dimensional) point, form a "leaf node" with that point.
- Otherwise, each internal node implements a spatial partition induced by a hyperplane H, splitting the point cloud into two equal subsets: all points lying on one side of H in the right subtree, all other points in the left subtree.
- Recursively construct kd-trees for the two resulting sub-sets of points.
Searching for the nearest neighbor of a query point:
The computation of neighbors of a point q (i.e., those points that are closer to q than a given distance R) is performed by recursively traversing the tree (starting from the root).
At each stage it is necessary to test whether the ball of radius R (distance) centered at q intersects the separating hyperplane:
- if the ball intersects the splitting hyperplane (i.e. if the ball is not contained in a single subspace/subtree), then we must continue the search recursively in both subtrees (left and right): the result of the query will be the union of the two recursive queries in the two subtrees.
- if there is no intersection with the separating hyperplane, then the ball of distance R around point q is contained in the subspace of q. This means that the search only needs to be performed in the corresponding subtree. It is then necessary to determine on which side of the splitting plane the ball lies.
Nearest-neighbor search via naive linear scan
In this section we will implement linear scan as a naive strategy for nearest-neighbor queries in a database.
First, we need a definition for a point (also called sample, example, …). A point in Python will be a row numpy array (that is, either a 1D array corresponding to one point, or a 2D array with rows corresponding to several points).
Ex1: distance
To perform nearest-neighbor search, we need to be able to compute distances between points. Complete the function `euclidean_distance` in `TD/nearest_neighbor.py`.
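A minimal sketch of what is expected (the exact signature and any vectorization requirements are defined in `TD/nearest_neighbor.py`, so treat this as an assumption):

```python
import numpy as np

def euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    # square root of the sum of squared coordinate differences
    return float(np.sqrt(np.sum((x - y) ** 2)))

print(euclidean_distance(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0
```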
While you’re there, take a look around! The class `NearestNeighborSearch` will be inherited by `LinearScan`; try to understand what that implies.
You can test your implementation of `euclidean_distance` by typing `./grader.py 1` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `nearest_neighbor.py`:
Ex2: linear scan
Now that we have a distance, we can implement `LinearScan` in `linear_scan.py`. In particular, complete the implementation of the `query` method, which loops through (hence the “linear” name) the points, computes their distance to the query point, and returns the minimum distance of the query point to any point in the set, together with the index of this point (in case of ties in distances, return the smallest index).
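As a standalone sketch (in the TD, this is a method of `LinearScan`, so names and attributes will differ):

```python
import numpy as np

def query(X: np.ndarray, x: np.ndarray) -> tuple[float, int]:
    """Return (distance, index) of the nearest point to x among the rows of X.
    A strict '<' keeps the smallest index in case of ties."""
    best_dist, best_idx = np.inf, -1
    for i in range(X.shape[0]):
        d = float(np.sqrt(np.sum((X[i] - x) ** 2)))
        if d < best_dist:
            best_dist, best_idx = d, i
    return best_dist, best_idx

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
print(query(X, np.array([1.9, 0.0])))  # nearest is index 2
```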
You can test your implementation of `query` by typing `./grader.py 2` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `linear_scan.py`:
Nearest-neighbor search via k-dimensional trees
In this section, we implement a kd-tree to answer nearest-neighbor queries faster.
We can implement a kd-tree in several ways. Here, we follow the outline given in the lecture. We assume that you have already implemented and tested the function `euclidean_distance`.
A key step of the kd-tree implementation is splitting a list of points into two balanced sub-lists, according to some particular order. We will implement this as two separate tasks, with the functions `median` and `partition`. Then, we will consider the definition of the tree structure. Finally, we will consider two algorithms for searching in a kd-tree.
We shall complete and/or modify the following functions in `kdtree.py`:

- (Ex 3) `median(X: np.ndarray, start: int, stop: int, c: int) -> float` computes the median of `X` between indices `start` and `stop` along the coordinate `c`
- (Ex 4) `partition(X: np.ndarray, start: int, stop: int, c: int) -> int` partitions an array with respect to its median value along the coordinate `c`
- (Ex 5) `_build(self, start: int, stop: int, c: int) -> Node | None` builds the kd-tree’s node given `start` and `stop` indices along the coordinate `c`
- (Ex 6) `_defeatist(self, node: Node | None, x: np.ndarray)` performs a defeatist search in a kd-tree by correctly updating `self._current_idx` and `self._current_dist`
- (Ex 7) `_backtracking(self, node: Node | None, x: np.ndarray)` performs a backtracking search in a kd-tree by correctly updating `self._current_idx` and `self._current_dist`
Ex3: median
Complete the function `median(X: np.ndarray, start: int, stop: int, c: int) -> float`. It takes as arguments a 2D array `X`, whose sub-array of indices in the range `[start..stop[` will be the only one considered, and a coordinate number `c`. It returns the median of the collection of `c`-th coordinates of the points in the sub-array (there are `stop-start` of those). To compute this median, you can simply sort the collection of `c`-th coordinates in the range `[start..stop[`, then take the (`(stop - start)/2`)-th element in the sorted collection. You can use NumPy’s `.sort()` to sort in place a copy of the appropriate values.
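Following that recipe, a sketch (assuming the signature above and integer division for the middle index):

```python
import numpy as np

def median(X: np.ndarray, start: int, stop: int, c: int) -> float:
    vals = X[start:stop, c].copy()  # copy so that X itself is untouched
    vals.sort()                     # NumPy in-place sort of the copy
    return float(vals[(stop - start) // 2])

X = np.array([[3.0], [1.0], [2.0]])
print(median(X, 0, 3, 0))  # 2.0
```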
After you implement your version of `median`, you can test your code locally by typing `./grader.py 3` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `kdtree.py`:
Ex4: partition
Complete the function `partition(X: np.ndarray, start: int, stop: int, c: int) -> int`, which arranges the points around the median. It takes as arguments a 2D array `X`, whose sub-array of indices in the range `[start..stop[` will be the only one considered, and the coordinate `c` along which the partitioning is to be done. It returns the index `idx` of the point in `X` whose `c`-th coordinate is used as the median. We here assume that all data points are unique in order to ensure a correct execution of our algorithm. The function rearranges the points in the array `X` in such a way that every point within the subrange `[start..idx]` has a `c`-th coordinate less than or equal to the median value `m` (with `X[idx, c] == m`), while every point within the subrange `[idx+1..stop[` has a `c`-th coordinate strictly greater than the median.
After you implement your version of `partition`, you can test your code locally by typing `./grader.py 4` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `kdtree.py`:
Hint 1: Take a look at the slides of the lecture (again).

Hint 2: You can (though it is not mandatory) use `median` and `swap`. Indeed, you may find a “better”, pure NumPy, alternative.
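For instance, one pure-NumPy sketch (it does a full sort rather than a true linear-time partition, so it is simple but not optimal, and it relies on the coordinate values in the range being unique):

```python
import numpy as np

def partition(X: np.ndarray, start: int, stop: int, c: int) -> int:
    # sort the rows of the subrange by their c-th coordinate
    order = np.argsort(X[start:stop, c])
    X[start:stop] = X[start:stop][order]
    # with unique values, the median now sits exactly in the middle
    return start + (stop - start) // 2

X = np.array([[3.0], [1.0], [2.0]])
idx = partition(X, 0, 3, 0)
print(idx, X[idx, 0])  # 1 2.0
```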
Ex5: build the tree
Complete the function `_build(self, start: int, stop: int, c: int) -> Node | None`, which builds the kd-tree’s node given `start` and `stop` indices along the coordinate `c`. That is, it (1) partitions the data in `[start..stop[` along the coordinate `c`, (2) retrieves the median, and (3) returns the appropriate `Node`. To populate `left` and `right`, you need to appropriately recurse.
After you implement your version of `_build`, you can test your code locally by typing `./grader.py 5` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `kdtree.py`:
Ex6: perform defeatist search
Implement defeatist search as it is presented in your lecture notes.
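As a hedged, self-contained sketch (in the TD, `_defeatist` instead updates `self._current_idx` and `self._current_dist`; the `Node` here is a hypothetical stand-in that stores its splitting coordinate `c`):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Node:  # hypothetical stand-in for the Node class in kdtree.py
    idx: int
    c: int                      # splitting coordinate of this node
    left: "Node | None" = None
    right: "Node | None" = None

def defeatist(X, node, x, best=(np.inf, -1)):
    """Descend only on the query's side of each split, updating the best
    (distance, index) with every node visited; never explore the far side."""
    if node is None:
        return best
    d = float(np.linalg.norm(X[node.idx] - x))
    if d < best[0]:
        best = (d, node.idx)
    if x[node.c] <= X[node.idx, node.c]:
        return defeatist(X, node.left, x, best)
    return defeatist(X, node.right, x, best)
```

Because the far side of each split is never explored, the search is fast but can miss the true nearest neighbor; the grader's second test measures exactly that.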
After you implement your version of `_defeatist`, you can test your code locally by typing `./grader.py 6` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `kdtree.py`:
N.B. 1: The second test compares the precision of `defeatist` versus `linear_scan`, i.e. how many times `defeatist` got the nearest neighbor wrong.
Ex7: perform backtracking search
Implement backtracking search as it is presented in your lecture notes.
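Again as a hedged, self-contained sketch (the `Node` is a hypothetical stand-in storing its splitting coordinate; in the TD, `_backtracking` updates `self._current_idx` and `self._current_dist` instead of returning a pair):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Node:  # hypothetical stand-in for the Node class in kdtree.py
    idx: int
    c: int                      # splitting coordinate of this node
    left: "Node | None" = None
    right: "Node | None" = None

def backtracking(X, node, x, best=(np.inf, -1)):
    """Like defeatist search, but also explore the far side whenever the
    splitting hyperplane is closer than the current best distance."""
    if node is None:
        return best
    d = float(np.linalg.norm(X[node.idx] - x))
    if d < best[0]:
        best = (d, node.idx)
    gap = float(x[node.c] - X[node.idx, node.c])  # signed distance to hyperplane
    near, far = (node.left, node.right) if gap <= 0 else (node.right, node.left)
    best = backtracking(X, near, x, best)
    if abs(gap) < best[0]:      # the ball around x crosses the hyperplane
        best = backtracking(X, far, x, best)
    return best
```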
After you implement your version of `_backtracking`, you can test your code locally by typing `./grader.py 7` in the terminal.
Once your program passes all (or at least some) of the tests, upload your file `kdtree.py`:
N.B. 1: The second test compares the run-time speed of `defeatist` versus `backtracking`.
N.B. 2: The last test compares your implementation to SciPy’s. If only this test is failing, read the next section to understand how we perform this comparison.
3 — Compare with off-the-shelf implementations
SciPy implements kd-trees as well. So, we naturally ask ourselves how our implementation compares.
Have a look at `scripts/iris.py`. It:

- downloads the Iris dataset;
- instantiates both (SciPy’s and yours) implementations of kd-trees;
- compares predictions both in terms of distance to the nearest neighbor and indices;
- computes an error rate:
  - for each query point in `X_test`, compute the nearest neighbor index in `X_train`;
  - take the label (Setosa, Versicolour, Virginica) of the nearest neighbor of each query point as the prediction for the label of that query point;
  - count a correct prediction as 0, an incorrect one as 1 (“0-1 loss”), and divide by the number of query points.
Remarks:

- The script can be run with `scripts/iris.py`.
- The train/test split done by `sklearn` (see the `iris()` function definition) is random; results will change when the script is rerun.
- In case of equal distances, both implementations may return different indices, corresponding to different neighbors, which means both implementations may give the same distances but different neighbors, hence different error rates.
4 — One-Nearest-Neighbor in practice
For each TD, we will try to end the session with some “Kaggle-like” competition. We provide some training data (and possibly some validation data), and you provide some “model” (most of the time, you will provide hyperparameters, see below). The grader server will then instantiate (and fit, whenever that makes sense) your implementation on the same data we provided, with the hyperparameters you provided, and compute some error over a private test set (hidden from you so that you don’t overfit the data).

We will define some performance threshold (given by a benchmark model/implementation of our choice, which you won’t have access to): you’ll get the full grade if you are above it, and no points at all if you are below.
As a practical example for subsequent TDs, have a look at `xaggle` in `grader.py`. It takes the `mode` as defined in `set_xaggle_config` in `kdtree.py` to perform a query, either `defeatist` or `backtracking` (you have to choose!), given the iris dataset train data (which you have access to!) for test points that are part of some private test set (which you don’t have access to!).
Complete `set_xaggle_config` with an appropriate query method that yields an error rate on the private test set which passes the online grader. Here, it is pretty simple to brute-force the problem and not have to think much and/or play with the training data. In the subsequent TDs, the combinatorics in the `set_xaggle_config` method won’t let you do that!