This assignment will be closed on January 21, 2025 (23:59:59).
You must be authenticated to submit your files

Tutorial 12: Indexing a Text

U. Fahrenberg

Log in to enable code uploads

You need to log in to be able to upload your work as you go. Use your Polytechnique LDAP login here.

Skills we will practice today: recursive functions and data structures

Setup: before you start

Create a new empty project in Spyder, and call it “Tutorial_12”. Then create a file index.py in which you will write all your code.

Files to download

You will need to download the following files (right-clicking on the link and selecting “Save link as…”), saving them in your Tutorial_12 project directory:

Context and Objectives

The goal of today’s exercises is to write a program which can create an index of a text. An index is a traditional part of a book, usually situated at the end, which contains pointers to important concepts treated in the book.

First page of the index of Atlas Maior

Typically, a book index contains an alphabetized list of indexable entries, and its pointers are page numbers. Here, we will instead index all of the words in the text, and the pointers will be a list of line numbers where those words appear.

Our goal is to write a program which, given an input text such as

The quick brown fox fox
jumps over the
lazy brown dog dog

will produce the index

brown: [1, 3]
dog: [3]
fox: [1]
jumps: [2]
lazy: [3]
over: [2]
quick: [1]
the: [1, 2]

Observe the following features of our index:

Binary Search Trees

We will use binary search trees (BSTs) to store and process our data. A BST is a labeled binary tree, i.e. each node has a key and a value and at most two children, with the extra property that for every node,

Of the binary trees in the following example, the left one is a BST, whereas the right one is not. (Here we use integers as keys; the values are not shown.)

On the left: a binary search tree. On the right: a tree that is not a BST.

Let us check the above claims. First, the left tree is a BST because

The right tree of the figure is not a BST because the right subtree of the node with key 4 contains the key 2.

The BST property makes it possible to use binary search for accessing keys: start at the root and descend left or right, recursively, depending on comparison with the current node’s key.

Exercise 1: A Class for Binary Trees

We will represent trees recursively, using the idea that a node has a key and a value, and possibly a left and/or right child, which are also nodes.

Copy the following class definition to your file index.py:

class Node:
    def __init__(self, key, value, left, right):
        self.key = key
        self.value = value
        self.left = left
        self.right = right

    def __eq__(self, other):
        return self.key == other.key \
           and self.value == other.value \
           and self.left == other.left \
           and self.right == other.right

    def __repr__(self):
        return f'Node({repr(self.key)}, {repr(self.value)}, {repr(self.left)}, {repr(self.right)})'

    def __str__(self):
        lines, _ = self._str_aux()
        return '\n'.join(lines)

    def _str_aux(self):
        # Recursive helper for __str__.
        # Returns lines (to be joined) and the horizontal position where
        # a branch from an eventual parent should be attached.
        label = f'{self.key}: {self.value}'

        # Leaf case
        if self.right is None and self.left is None:
            return [label], len(label) // 2
    
        if self.left is None:
            llines, lpos, lwidth, ltop0, ltop1, lfill = [], 0, 0, '', '', ''
        else:  # Recurse left
            llines, lpos = self.left._str_aux()
            lwidth = len(llines[0])
            ltop0 = lpos*' ' + ' ' + (lwidth - lpos - 1)*'_'
            ltop1 = lpos*' ' + '/' + (lwidth - lpos - 1)*' '
            lfill = lwidth*' '
            
        if self.right is None:
            rlines, rpos, rwidth, rtop0, rtop1, rfill = [], 0, 0, '', '', ''
        else:  # Recurse right
            rlines, rpos = self.right._str_aux()
            rwidth = len(rlines[0])
            rtop0 = rpos*'_' + ' ' + (rwidth - rpos - 1)*' '
            rtop1 = rpos*' ' + '\\' + (rwidth - rpos - 1)*' '
            rfill = rwidth*' '

        cfill = len(label)*' '
        
        # Extend llines or rlines to same length, filling with spaces (or '')
        maxlen = max(len(llines), len(rlines))
        llines.extend(lfill for _ in range(maxlen - len(llines)))
        rlines.extend(rfill for _ in range(maxlen - len(rlines)))
          
        res = []
        res.append(ltop0 + label + rtop0)
        res.append(ltop1 + cfill + rtop1)
        res.extend(lline + cfill + rline for (lline, rline) in zip(llines, rlines))
        
        return res, lwidth + len(label) // 2

Notice that the __str__ method has a (rather long) recursive subroutine. It will be easier to see how it works when you try some examples:

In [1]: n1 = Node(2, 'Mathilde', None, None)  # a leaf

In [2]: print(n1)  # implicitly calls str
2: Mathilde

In [3]: n2 = Node(0, 'Stefan', None, None)  # another leaf

In [4]: n0 = Node(1, 'Ben', n2, n1)  # a node with left child 'Stefan' and right child 'Mathilde'

In [5]: n0
Out[5]: Node(1, 'Ben', Node(0, 'Stefan', None, None), Node(2, 'Mathilde', None, None))

In [6]: n0.left
Out[6]: Node(0, 'Stefan', None, None)

In [7]: n0.right
Out[7]: Node(2, 'Mathilde', None, None)

In [8]: print(n0)
     ____1: Ben_____      
    /               \     
0: Stefan      2: Mathilde

Now write a function example_bst() (outside the class Node) which defines the BST in the figure below and returns its root node:

The keys are the numbers in the circles. The corresponding values are string representations of the keys: for example, the value for the node with key 8 would be 'Eight'.

Note that your function only needs to construct the BST above; nothing else.

Test your function in the console:

In [9]: root = example_bst()

In [10]: print(root)
            _________________8: Eight___                            
           /                            \                           
     ___4: Four___                   10: Ten__________________      
    /             \                                           \     
3: Three       6: Six____                          _____14: Fourteen
                         \                        /                 
                     7: Seven               13: Thirteen 

Upload your file index.py:

Upload form is only available when connected

Exercise 2: Searching for a Key in a BST

The following algorithm is used to search for keys in a BST. Assume we want to find a key k in a BST given as node:

  • If k == node.key, then we have found our key k.
  • if k < node.key, then we must search for k in the left subtree of node. Now if node.left is None, then we know that k is nowhere to be found; otherwise, we recursively search in node.left.
  • Similarly, if k > node.key, then we search for k in the right subtree of node if this exists.

Add a recursive method search(self, key) to your Node class which returns the value of the node with key key, if key is found in the tree; otherwise, it returns None.

Test your work in the console:

In [11]: root = example_bst()

In [12]: root.search(8)
Out[12]: 'Eight'

In [13]: root.search(4)
Out[13]: 'Four'

In [14]: root.search(2) is None
Out[14]: True

Upload your file index.py:

Upload form is only available when connected

Exercise 3: Printing Keys and Values in Order

Add a recursive method print_in_order(self) to your Node class which prints all keys and values in the tree, sorted in increasing order.

Because of the BST property, your method needs to do an in-order traversal of the tree, as follows:

  • first, if node has a left child, recursively call print_in_order() on the left child;
  • then, print the node’s key and value;
  • finally, if node has a right child, recursively call print_in_order() on the right child.

Test your method in the console:

In [15]: root = example_bst()

In [16]: root.print_in_order()
3: Three
4: Four
6: Six
7: Seven
8: Eight
10: Ten
13: Thirteen
14: Fourteen

Upload your file index.py:

Upload form is only available when connected

Exercise 4: Adding a Key to a BST for Indexing

We are now, finally, ready to use our Node class for indexing texts.

First we implement the standard algorithm for adding a key-value pair to a binary search tree. The only difference is that each node stores a list of line numbers, so we have to be careful with the way we insert line numbers.

Assume that we want to add a key k (a word in our text) together with a value v (a line number) to a BST given as node:

  • If k < node.key, then k should be inserted into the left subtree of node. Hence,
    • if node.left is None, then we create a new node at node.left with key k and value [v] (that is, we start a new list of line numbers for that word).
    • If node.left is not None, then we repeat the procedure recursively on node.left.
  • Similarly, if k > node.key, then we either create a new node at node.right with key k and value [v], or we recurse on the right subtree of node.
  • If k == node.key, then the word k is already in the tree, so we check if v is already in node.value (the list of line numbers where k appears); if it does not yet appear there, then we append it.

Note that as we are going through the text line-by-line to construct our index, the line numbers in our tree’s values will automatically be sorted. Hence you should check whether v is in node.value by comparing it with node.value[-1] instead of using in.

Add a recursive method add(self, key, value) to your Node class which implements the above algorithm.

Test your work in the console:

In [17]: root = Node('quick', [1], None, None)

In [18]: root.add('the', 1)

In [19]: root.add('fox', 1)

In [20]: root.add('fox', 1)

In [21]: root.add('fox', 2)

In [22]: root
Out[22]: Node('quick', [1], Node('fox', [1, 2], None, None), Node('the', [1], None, None))

In [23]: print(root)
      _____quick: [1]____    
     /                   \   
fox: [1, 2]          the: [1]

Upload your file index.py:

Upload form is only available when connected

Exercise 5: Constructing a BST for Indexing

Copy the following function definition to your file index.py (outside the class Node). The function split_in_words_and_lowercase(line) splits an input line (from a file) into words and converts these to lower case. (Do you understand how it works?)

def split_in_words_and_lowercase(line):
    """Split a line of text into a list of lower-case words."""
    parts = line.strip('\n').replace('-', ' ').replace("'", " ").replace('"', ' ').split()
    parts = [p.strip('",._;?!:()[]').lower() for p in parts]
    return [p for p in parts if p != '']

Now write a function construct_bst_for_indexing(filename) (still outside the Node class) which finds all words in the file named filename and returns the root of a BST with these words, ordered lexicographically, as keys and, for each key, a list of lines in which the word appears, as value. Use the function split_in_words_and_lowercase(line) and the Node class’ add(key, value) method.

Test your work in the console:

In [24]: root = construct_bst_for_indexing('foxdog.txt')

In [25]: print(root)
                                                               ____the: [1, 2]
                                                              /               
       __________________________________________________quick: [1]           
      /                                                                       
brown: [1, 3]____________                                                     
                         \                                                    
                  ___fox: [1]_____                                            
                 /                \                                           
             dog: [3]        jumps: [2]_____________                          
                                                    \                         
                                            ____over: [2]                     
                                           /                                  
                                       lazy: [3]                              

In [26]: root.print_in_order()
brown: [1, 3]
dog: [3]
fox: [1]
jumps: [2]
lazy: [3]
over: [2]
quick: [1]
the: [1, 2]

(Note that you have just produced an index of the file foxdog.txt.)

Test your work with this other example:

In [27]: root = construct_bst_for_indexing('ouch.txt')

Does it take more time than the previous example? If this is clearly the case, this is probably because, contrary to the indication given, you used in in your add method. Modify your add method consequently.

Upload your file index.py:

Upload form is only available when connected

Exercise 6: Writing an Index to a File

The last step in our index production is to write a generated index to a file.

Add a method write_in_order(self, filename) to your Node class which writes all keys and values, in increasing order, to a file named filename. Your method should open the file for writing and then call another recursive method write_in_order_rec(self, file) which, like print_in_order, writes one line at a time to the file. Be careful, file in the method write_in_order_rec(self, file) is a file object and not a filename.

We give you the first of the methods below; your task is to implement the second, recursive, one:

    def write_in_order(self, filename):
        """Write all key: value pairs in the index tree
        to the named file, one entry per line.
        """
        with open(filename, 'w') as file:
            self.write_in_order_rec(file)

    def write_in_order_rec(self, file):
        """Recursive helper method for write_in_order."""
        pass # replace with your own code

Test your method in the console: running

In [28]: root = construct_bst_for_indexing('foxdog.txt')

In [29]: root.write_in_order('test.txt')

should create a file test.txt with precisely the following contents:

brown: [1, 3]
dog: [3]
fox: [1]
jumps: [2]
lazy: [3]
over: [2]
quick: [1]
the: [1, 2]

Upload your file index.py:

Upload form is only available when connected

Exercise 7: Collecting the Parts

Now write a function generate_index(textfile, indexfile) (outside the class Node) which constructs an index BST for the file named textfile and then writes the sorted index to a file indexfile.

Test your work in the console: running

In [30]: generate_index('foxdog.txt', 'index.txt')

should create a file index.txt with the same contents as test.txt above.

For a more comprehensive test, we can now index Pride and Prejudice:

In [31]: generate_index('pp.txt', 'pp_index.txt')

This should take just a bit of time; but look at the generated pp_index.txt! (Also, we now find out that there is no word in Pride and Prejudice which starts with a “z”?!)

Upload your file index.py:

Upload form is only available when connected

Exercise 8: Computing the Height of a BST

The next two exercises are concerned with optimization: in some cases, the BSTs we generate may be unbalanced, that is to say, they more resemble lists than trees. We first need a method to compute heights of BSTs.

The height of a tree is the length of a longest path from the root to a leaf. For example,

  • the height of a tree with only the root node is 0;
  • the height of our example BST is 3;
  • the height of its subtree under the 4-valued node is 2.

Define a recursive method height(self) in your Node class which returns the height of the tree given as node.

Test your method in the console:

In [32]: root = example_bst()

In [33]: root.height()
Out[33]: 3

In [34]: root.left.height()
Out[34]: 2

In [35]: pp = construct_bst_for_indexing('pp.txt')

In [36]: pp.height()
Out[36]: 35

Upload your file index.py:

Upload form is only available when connected

Exercise 9: Balancing a BST

The provided file clara.txt contains the sentence:

At bathtime,
Clara demanded eigenvalues
for great high-intensity Jovian Krypton lasers
(more naturally, on principle);
quietly,
Russians sent tanks under veils with xylophones,
yelling "ZACHARYYY!"

Run the following commands in the console:

In [37]: clara = construct_bst_for_indexing('clara.txt')

In [38]: clara.height()
Out[38]: 25

There are twenty-six words in the file, and the height of our indexing BST is twenty-five. That is to say, our BST is extremely unbalanced: none of its nodes have left children! This BST is, in essence, a linked list. (You can confirm this by running clara in the console.)

Can you explain why this is the case?

In a balanced binary tree, the heights of the two subtrees under any internal node differ by at most one.

The time it takes to search for a key in a BST is proportional to the height of the BST. (Why?) Hence searching is most efficient if the BST is balanced.

The following algorithm can be used to balance a BST: First, convert the BST to a sorted list l. Then, recursively construct a new BST:

  • the root is the middle element, with index len(l) // 2, in the list;
  • its left child is the balanced BST constructed from the left part of the list;
  • its right child is the balanced BST constructed from the right part of the list.

Implement this algorithm. First, add a recursive method list_in_order(self) to your Node class which converts a BST to a sorted list of key-value pairs.

Test your work in the console:

In [39]: root = example_bst()

In [40]: root.list_in_order()
Out[40]: 
[(3, 'Three'),
 (4, 'Four'),
 (6, 'Six'),
 (7, 'Seven'),
 (8, 'Eight'),
 (10, 'Ten'),
 (13, 'Thirteen'),
 (14, 'Fourteen')]

Then, write a function balanced_bst(sorted_list) (outside the class Node) which constructs a balanced BST from a sorted list of key-value tuples. For performance reasons, your function should call a recursive function balanced_bst_rec(sorted_list, lower, upper) which takes as extra input the part of the list (from lower to upper - 1) which is to be treated.

We give you the first of the functions below; your task is to implement the second, recursive, one:

def balanced_bst(sorted_list):
    """Return balanced BST constructed from sorted list."""
    return balanced_bst_rec(sorted_list, 0, len(sorted_list))

def balanced_bst_rec(sorted_list, lower, upper):
    """Recursive helper function for balanced_bst."""
    pass # replace with your own code

Test your work in the console:

In [41]: test = balanced_bst([(0, 'Zero'), (1, 'One'), (2, 'Two'), (3, 'Three'), (4, 'Four'), (5, 'Five'), (6, 'Six'), (7, 'Seven'), (8, 'Eight'), (9, 'Nine')])

In [42]: print(test)
                 _________________5: Five__________________           
                /                                          \          
           __2: Two___________                      ___8: Eight___    
          /                   \                    /              \   
    ___1: One           ___4: Four           __7: Seven        9: Nine
   /                   /                    /                         
0: Zero            3: Three              6: Six                       

In [43]: clara = construct_bst_for_indexing('clara.txt')

In [44]: (clara.key, clara.value)
Out[44]: ('at', [1])

In [45]: clara.height()
Out[45]: 25

In [46]: mylist = clara.list_in_order()

In [47]: clara_bal = balanced_bst(mylist)

In [48]: (clara_bal.key, clara_bal.value)
('naturally', [4])

In [49]: clara_bal.height()
Out[49]: 4

In [50]: pp = construct_bst_for_indexing('pp.txt')

In [51]: pp_bal = balanced_bst(pp.list_in_order())

In [52]: pp_bal.height()
Out[52]: 12

We have condensed all of Pride and Prejudice into a tree of height 12!

Upload your file index.py:

Upload form is only available when connected