Tutorial 12: Indexing a Text
Log in to enable code uploads
You need to log in to be able to upload your work as you go. Use your Polytechnique LDAP login here.
Skills we will practice today: recursive functions and data structures
Setup: before you start
Create a new empty project in Spyder, and call it “Tutorial_12”. Then
create a file index.py
in which you will write all your
code.
Files to download
You will need to download the following files (right-clicking on the link and selecting “Save link as…”), saving them in your Tutorial_12 project directory:
Context and Objectives
The goal of today’s exercises is to write a program which can create an index of a text. An index is a traditional part of a book, usually situated at the end, which contains pointers to important concepts treated in the book.
Typically, a book index contains an alphabetized list of indexable entries, and its pointers are page numbers. Here, we will instead index all of the words in the text, and the pointers will be a list of line numbers where those words appear.
Our goal is to write a program which, given an input text such as
The quick brown fox fox
jumps over the
lazy brown dog dog
will produce the index
brown: [1, 3]
dog: [3]
fox: [1]
jumps: [2]
lazy: [3]
over: [2]
quick: [1]
the: [1, 2]
Observe the following features of our index:
- the entries are ordered lexicographically (i.e., in alphabetical order);
- no difference is made between lower- and upper-case letters
(e.g. Line 1’s
The
and Line 2’sthe
are treated as the same word); - the line numbers for each entry appear in increasing order;
- if an entry appears more than once in a line, then only one occurrence is recorded (that is, the line numbers in each list are never repeated).
Binary Search Trees
We will use binary search trees (BSTs) to store and process our data. A BST is a labeled binary tree, i.e. each node has a key and a value and at most two children, with the extra property that for every node,
- all keys in its left subtree are strictly smaller than the node’s key;
- all keys in its right subtree are strictly greater than the node’s key.
Of the binary trees in the following example, the left one is a BST, whereas the right one is not. (Here we use integers as keys; the values are not shown.)
Let us check the above claims. First, the left tree is a BST because
- the keys in the left subtree of the root are 3, 4, 6, 7, which are all strictly smaller than 8;
- the keys in the right subtree of the root are 10, 13, 14, which are all strictly greater than 8;
- the only key in the left subtree of the node with key 4 is 3, which is strictly smaller than 4;
- the keys in the right subtree of the node with key 4 are 6, 7, which are all strictly greater than 4;
- the only key in the right subtree of the node with key 6 is 7, which is strictly greater than 6;
- etc.
The right tree of the figure is not a BST because the right subtree of the node with key 4 contains the key 2.
The BST property makes it possible to use binary search for accessing keys: start at the root and descend left or right, recursively, depending on comparison with the current node’s key.
Exercise 1: A Class for Binary Trees
We will represent trees recursively, using the idea that a node has a key and a value, and possibly a left and/or right child, which are also nodes.
Copy the following class definition to your file
index.py
:
class Node:
def __init__(self, key, value, left, right):
self.key = key
self.value = value
self.left = left
self.right = right
def __eq__(self, other):
return self.key == other.key \
and self.value == other.value \
and self.left == other.left \
and self.right == other.right
def __repr__(self):
return f'Node({repr(self.key)}, {repr(self.value)}, {repr(self.left)}, {repr(self.right)})'
def __str__(self):
= self._str_aux()
lines, _ return '\n'.join(lines)
def _str_aux(self):
# Recursive helper for __str__.
# Returns lines (to be joined) and the horizontal position where
# a branch from an eventual parent should be attached.
= f'{self.key}: {self.value}'
label
# Leaf case
if self.right is None and self.left is None:
return [label], len(label) // 2
if self.left is None:
= [], 0, 0, '', '', ''
llines, lpos, lwidth, ltop0, ltop1, lfill else: # Recurse left
= self.left._str_aux()
llines, lpos = len(llines[0])
lwidth = lpos*' ' + ' ' + (lwidth - lpos - 1)*'_'
ltop0 = lpos*' ' + '/' + (lwidth - lpos - 1)*' '
ltop1 = lwidth*' '
lfill
if self.right is None:
= [], 0, 0, '', '', ''
rlines, rpos, rwidth, rtop0, rtop1, rfill else: # Recurse right
= self.right._str_aux()
rlines, rpos = len(rlines[0])
rwidth = rpos*'_' + ' ' + (rwidth - rpos - 1)*' '
rtop0 = rpos*' ' + '\\' + (rwidth - rpos - 1)*' '
rtop1 = rwidth*' '
rfill
= len(label)*' '
cfill
# Extend llines or rlines to same length, filling with spaces (or '')
= max(len(llines), len(rlines))
maxlen for _ in range(maxlen - len(llines)))
llines.extend(lfill for _ in range(maxlen - len(rlines)))
rlines.extend(rfill
= []
res + label + rtop0)
res.append(ltop0 + cfill + rtop1)
res.append(ltop1 + cfill + rline for (lline, rline) in zip(llines, rlines))
res.extend(lline
return res, lwidth + len(label) // 2
Notice that the __str__
method has a (rather long)
recursive subroutine. It will be easier to see how it works when you try
some examples:
In [1]: n1 = Node(2, 'Mathilde', None, None) # a leaf
In [2]: print(n1) # implicitly calls str
2: Mathilde
In [3]: n2 = Node(0, 'Stefan', None, None) # another leaf
In [4]: n0 = Node(1, 'Ben', n2, n1) # a node with left child 'Stefan' and right child 'Mathilde'
In [5]: n0
Out[5]: Node(1, 'Ben', Node(0, 'Stefan', None, None), Node(2, 'Mathilde', None, None))
In [6]: n0.left
Out[6]: Node(0, 'Stefan', None, None)
In [7]: n0.right
Out[7]: Node(2, 'Mathilde', None, None)
In [8]: print(n0)
____1: Ben_____
/ \
0: Stefan 2: Mathilde
Now write a function example_bst()
(outside the class Node
) which defines the BST in
the figure below and returns its root node:
The keys are the numbers in the circles. The corresponding values are
string representations of the keys: for example, the value for the node
with key 8
would be 'Eight'
.
Note that your function only needs to construct the BST above; nothing else.
Test your function in the console:
In [9]: root = example_bst()
In [10]: print(root)
_________________8: Eight___
/ \
___4: Four___ 10: Ten__________________
/ \ \
3: Three 6: Six____ _____14: Fourteen
\ /
7: Seven 13: Thirteen
Upload your file index.py
:
Exercise 2: Searching for a Key in a BST
The following algorithm is used to search for keys in a BST. Assume
we want to find a key k
in a BST given as
node
:
- If
k == node.key
, then we have found our keyk
. - if
k < node.key
, then we must search fork
in the left subtree ofnode
. Now ifnode.left
isNone
, then we know thatk
is nowhere to be found; otherwise, we recursively search innode.left
. - Similarly, if
k > node.key
, then we search fork
in the right subtree ofnode
if this exists.
Add a recursive method
search(self, key)
to your Node
class which
returns the value
of the node with key key
,
if key
is found in the tree; otherwise, it
returns None
.
Test your work in the console:
In [11]: root = example_bst()
In [12]: root.search(8)
Out[12]: 'Eight'
In [13]: root.search(4)
Out[13]: 'Four'
In [14]: root.search(2) is None
Out[14]: True
Upload your file index.py
:
Exercise 3: Printing Keys and Values in Order
Add a recursive method
print_in_order(self)
to your Node
class which
prints all keys and values in the tree, sorted in increasing order.
Because of the BST property, your method needs to do an in-order traversal of the tree, as follows:
- first, if
node
has a left child, recursively callprint_in_order()
on the left child; - then, print the node’s key and value;
- finally, if
node
has a right child, recursively callprint_in_order()
on the right child.
Test your method in the console:
In [15]: root = example_bst()
In [16]: root.print_in_order()
3: Three
4: Four
6: Six
7: Seven
8: Eight
10: Ten
13: Thirteen
14: Fourteen
Upload your file index.py
:
Exercise 4: Adding a Key to a BST for Indexing
We are now, finally, ready to use our Node
class for
indexing texts.
First we implement the standard algorithm for adding a key-value pair to a binary search tree. The only difference is that each node stores a list of line numbers, so we have to be careful with the way we insert line numbers.
Assume that we want to add a key k
(a word in our text)
together with a value v
(a line number) to a BST given as
node
:
- If
k < node.key
, thenk
should be inserted into the left subtree ofnode
. Hence,- if
node.left
isNone
, then we create a new node atnode.left
with keyk
and value[v]
(that is, we start a new list of line numbers for that word). - If
node.left
is notNone
, then we repeat the procedure recursively onnode.left
.
- if
- Similarly, if
k > node.key
, then we either create a new node atnode.right
with keyk
and value[v]
, or we recurse on the right subtree ofnode
. - If
k == node.key
, then the wordk
is already in the tree, so we check ifv
is already innode.value
(the list of line numbers wherek
appears); if it does not yet appear there, then we append it.
Note that as we are going through the text line-by-line to
construct our index, the line numbers in our tree’s values will
automatically be sorted. Hence you should check whether
v
is in node.value
by comparing it with
node.value[-1]
instead of using in
.
Add a recursive method
add(self, key, value)
to your Node
class which
implements the above algorithm.
Test your work in the console:
In [17]: root = Node('quick', [1], None, None)
In [18]: root.add('the', 1)
In [19]: root.add('fox', 1)
In [20]: root.add('fox', 1)
In [21]: root.add('fox', 2)
In [22]: root
Out[22]: Node('quick', [1], Node('fox', [1, 2], None, None), Node('the', [1], None, None))
In [23]: print(root)
_____quick: [1]____
/ \
fox: [1, 2] the: [1]
Upload your file index.py
:
Exercise 5: Constructing a BST for Indexing
Copy the following function definition to your file
index.py
(outside the class Node
).
The function split_in_words_and_lowercase(line)
splits an
input line (from a file) into words and converts these to lower case.
(Do you understand how it works?)
def split_in_words_and_lowercase(line):
"""Split a line of text into a list of lower-case words."""
= line.strip('\n').replace('-', ' ').replace("'", " ").replace('"', ' ').split()
parts = [p.strip('",._;?!:()[]').lower() for p in parts]
parts return [p for p in parts if p != '']
Now write a function
construct_bst_for_indexing(filename)
(still
outside the Node
class) which finds all words in
the file named filename
and returns the root of a BST with
these words, ordered lexicographically, as keys and, for each key, a
list of lines in which the word appears, as value. Use the function
split_in_words_and_lowercase(line)
and the
Node
class’ add(key, value)
method.
Test your work in the console:
In [24]: root = construct_bst_for_indexing('foxdog.txt')
In [25]: print(root)
____the: [1, 2]
/
__________________________________________________quick: [1]
/
brown: [1, 3]____________
\
___fox: [1]_____
/ \
dog: [3] jumps: [2]_____________
\
____over: [2]
/
lazy: [3]
In [26]: root.print_in_order()
brown: [1, 3]
dog: [3]
fox: [1]
jumps: [2]
lazy: [3]
over: [2]
quick: [1]
the: [1, 2]
(Note that you have just produced an index of the file
foxdog.txt
.)
Test your work with this other example:
In [27]: root = construct_bst_for_indexing('ouch.txt')
Does it take more time than the previous example? If this is clearly
the case, this is probably because, contrary to the indication given,
you used in
in your add
method. Modify your
add
method consequently.
Upload your file index.py
:
Exercise 6: Writing an Index to a File
The last step in our index production is to write a generated index to a file.
Add a method
write_in_order(self, filename)
to your Node
class which writes all keys and values, in increasing order, to a file
named filename
. Your method should open the file for
writing and then call another recursive method
write_in_order_rec(self, file)
which, like
print_in_order
, writes one line at a time to the file. Be
careful, file
in the method
write_in_order_rec(self, file)
is a file
object and not a filename.
We give you the first of the methods below; your task is to implement the second, recursive, one:
def write_in_order(self, filename):
"""Write all key: value pairs in the index tree
to the named file, one entry per line.
"""
with open(filename, 'w') as file:
self.write_in_order_rec(file)
def write_in_order_rec(self, file):
"""Recursive helper method for write_in_order."""
pass # replace with your own code
Test your method in the console: running
In [28]: root = construct_bst_for_indexing('foxdog.txt')
In [29]: root.write_in_order('test.txt')
should create a file test.txt
with precisely the
following contents:
brown: [1, 3]
dog: [3]
fox: [1]
jumps: [2]
lazy: [3]
over: [2]
quick: [1]
the: [1, 2]
Upload your file index.py
:
Exercise 7: Collecting the Parts
Now write a function
generate_index(textfile, indexfile)
(outside the class
Node
) which constructs an index BST for the file named
textfile
and then writes the sorted index to a file
indexfile
.
Test your work in the console: running
In [30]: generate_index('foxdog.txt', 'index.txt')
should create a file index.txt
with the same contents as
test.txt
above.
For a more comprehensive test, we can now index Pride and Prejudice:
In [31]: generate_index('pp.txt', 'pp_index.txt')
This should take just a bit of time; but look at the generated
pp_index.txt
! (Also, we now find out that there is no word
in Pride and Prejudice which starts with a “z”?!)
Upload your file index.py
:
Exercise 8: Computing the Height of a BST
The next two exercises are concerned with optimization: in some cases, the BSTs we generate may be unbalanced, that is to say, they more resemble lists than trees. We first need a method to compute heights of BSTs.
The height of a tree is the length of a longest path from the root to a leaf. For example,
- the height of a tree with only the root node is 0;
- the height of our example BST is 3;
- the height of its subtree under the 4-valued node is 2.
Define a recursive method
height(self)
in your Node
class which returns
the height of the tree given as node
.
Test your method in the console:
In [32]: root = example_bst()
In [33]: root.height()
Out[33]: 3
In [34]: root.left.height()
Out[34]: 2
In [35]: pp = construct_bst_for_indexing('pp.txt')
In [36]: pp.height()
Out[36]: 35
Upload your file index.py
:
Exercise 9: Balancing a BST
The provided file clara.txt
contains the sentence:
At bathtime,
Clara demanded eigenvalues
for great high-intensity Jovian Krypton lasers
(more naturally, on principle);
quietly,
Russians sent tanks under veils with xylophones,
yelling "ZACHARYYY!"
Run the following commands in the console:
In [37]: clara = construct_bst_for_indexing('clara.txt')
In [38]: clara.height()
Out[38]: 25
There are twenty-six words in the file, and the height of our
indexing BST is twenty-five. That is to say, our BST is extremely
unbalanced: none of its nodes have left children! This BST is,
in essence, a linked list. (You can confirm this by running
clara
in the console.)
Can you explain why this is the case?
In a balanced binary tree, the heights of the two subtrees under any internal node differ by at most one.
The time it takes to search for a key in a BST is proportional to the height of the BST. (Why?) Hence searching is most efficient if the BST is balanced.
The following algorithm can be used to balance a BST: First, convert
the BST to a sorted list l
. Then, recursively construct a
new BST:
- the root is the middle element, with index
len(l) // 2
, in the list; - its left child is the balanced BST constructed from the left part of the list;
- its right child is the balanced BST constructed from the right part of the list.
Implement this algorithm. First, add a
recursive method list_in_order(self)
to your
Node
class which converts a BST to a sorted list of
key-value pairs.
Test your work in the console:
In [39]: root = example_bst()
In [40]: root.list_in_order()
Out[40]:
[(3, 'Three'),
(4, 'Four'),
(6, 'Six'),
(7, 'Seven'),
(8, 'Eight'),
(10, 'Ten'),
(13, 'Thirteen'),
(14, 'Fourteen')]
Then, write a function
balanced_bst(sorted_list)
(outside the class
Node
) which constructs a balanced BST from a sorted list of
key-value tuples. For performance reasons, your function should call a
recursive function
balanced_bst_rec(sorted_list, lower, upper)
which takes as
extra input the part of the list (from lower
to
upper - 1
) which is to be treated.
We give you the first of the functions below; your task is to implement the second, recursive, one:
def balanced_bst(sorted_list):
"""Return balanced BST constructed from sorted list."""
return balanced_bst_rec(sorted_list, 0, len(sorted_list))
def balanced_bst_rec(sorted_list, lower, upper):
"""Recursive helper function for balanced_bst."""
pass # replace with your own code
Test your work in the console:
In [41]: test = balanced_bst([(0, 'Zero'), (1, 'One'), (2, 'Two'), (3, 'Three'), (4, 'Four'), (5, 'Five'), (6, 'Six'), (7, 'Seven'), (8, 'Eight'), (9, 'Nine')])
In [42]: print(test)
_________________5: Five__________________
/ \
__2: Two___________ ___8: Eight___
/ \ / \
___1: One ___4: Four __7: Seven 9: Nine
/ / /
0: Zero 3: Three 6: Six
In [43]: clara = construct_bst_for_indexing('clara.txt')
In [44]: (clara.key, clara.value)
Out[44]: ('at', [1])
In [45]: clara.height()
Out[45]: 25
In [46]: mylist = clara.list_in_order()
In [47]: clara_bal = balanced_bst(mylist)
In [48]: (clara_bal.key, clara_bal.value)
('naturally', [4])
In [49]: clara_bal.height()
Out[49]: 4
In [50]: pp = construct_bst_for_indexing('pp.txt')
In [51]: pp_bal = balanced_bst(pp.list_in_order())
In [52]: pp_bal.height()
Out[52]: 12
We have condensed all of Pride and Prejudice into a tree of height 12!
Upload your file index.py
: