You must be authenticated to submit your files

CSE 102 – Tutorial 5 – Programming with trees

General instructions
Binary trees and binary search trees
If you don’t know the right word to use, trie guessing (optional)

General instructions

Sections tagged with a green marginal line are mandatory.

Sections tagged with an orange marginal line are either more advanced or supplementary, and are therefore considered optional. They will help you deepen your understanding of the material in the course. You may skip these sections on a first pass through the tutorial, and come back to them after finishing the mandatory exercises.

When submitting a solution, the system will do some basic validation of your files and run some automated tests on your solutions. A green tick tells you that your file compiles and has passed all of the automated tests. A red tick tells you either that your file is badly broken (e.g., a syntax error) or that your solutions fail one or more of the automated tests. You can see the result of the tests by clicking on “See my uploads” in the upper-left menu, and then clicking on the right arrow button next to each file you submitted.

Although passing all of our tests provides evidence that your solutions are correct, it might not guarantee it, as typically we would need to run infinitely many tests! You are encouraged to write your own tests, and more generally to think about why your code works and not just rely on the automated tests.

Discussion of the tutorials with friends and classmates is allowed, with acknowledgment, but please read the Moodle page on “Collaboration vs plagiarism/cheating”.

Binary trees and binary search trees

We expect you to write your solution to this problem in a file named trees.py and to upload it using the form below. You can resubmit your solution as many times as you like.

Upload form is only available when connected

In the problems below, we will use the following class to represent the nodes of a binary tree:

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

Each node stores a value, and optionally pointers to its left and right children nodes (otherwise set to None if the node does not have a left/right child). A tree is specified by its root node, with the empty tree represented by None.

For example, the binary tree

is represented by the following Python value:

  T1 = Node(0, Node(1, Node(2), Node(3)), Node(4, Node(5)))

while the pair of binary trees

are represented by the following pair of Python values, respectively:

  T2 = Node(0, Node(1), Node(2, Node(3), Node(4, Node(5))))
  T3 = Node(11, Node(10, Node(9, None, Node(8)), Node(7)), Node(6))

Write a function size(root) that takes the root node of a tree and returns the number of nodes in the tree. For example, all of the trees T1, T2, and T3 above have size 6.
Write a function sum_values(root) that takes the root node of a tree and returns the sum of the values stored at the nodes. For example, for the trees defined above, we have:
```
>>> sum_values(T1)
15
>>> sum_values(T2)
15
>>> sum_values(T3)
51
```
Write a function mirrored(lroot, rroot) that returns True if the tree rroot is the mirror image of the tree lroot, and False otherwise.

For example, the two trees

are the mirror images of each other.
Write a function check_symmetry(root) that returns True if the tree is (vertically) symmetric and False otherwise.

For example, of the following two trees, the one on the left is symmetric, but the one on the right is not:

The complexity of check_symmetry should be \(\mathrm{O}(n)\), where \(n\) is the number of nodes of the tree.

Hint: use the previous problem.

For the following problems we will consider binary search trees (BSTs), which you saw previously in CSE101. Recall that a binary tree has the BST property if for every node of the tree, the value at that node is greater than or equal to every value in its left subtree, and less than or equal to every value in its right subtree.

For example, of the following two trees, the one on the right has the BST property, but the one on the left does not:

Write a function check_BST(root) that takes the root of an arbitrary binary tree as input and returns True if the tree has the BST property, or else False.

The time complexity of the function should be linear in the number nodes. (In particular, it should be fast for trees with 10000 nodes.)

Hint: a good strategy is to introduce a helper function that does something more than just check the BST property. For example, you can use a helper function that returns, in addition to the desired True/False, the maximum and the minimum in the tree. Or, alternatively, a helper function that takes as extra parameters two values minv and maxv, and checks that the input tree is a BST whose values are all in the range minv .. maxv.

(Optional theory question.) What is the minimum number of comparisons needed to check that a binary tree with \(n\) nodes is a BST, in the worst case? Does your implementation of check_BST achieve this theoretical minimum? If not, how would you modify it to make the minimum number of comparisons? (Record your observations as a comment.)

Write a function min_BST(root) that takes as input the root of a binary search tree and returns the smallest value in the tree, or math.inf if the tree is empty. The complexity should be \(\mathrm{O}(h)\), where \(h\) is the height of the tree.

Write a function min_diff(root) that takes a binary search tree and computes the smallest absolute value of the difference between the values in different nodes. The complexity should be \(\mathrm{O}(n)\).

For example, the following BST has min_diff equal to 1:

Hint: use an auxiliary function similar like for check_BST.
Write a function count_distinct(root) that takes a binary search tree and computes the number of distinct values present in the tree. You should not compute the set of all values and return its length. Instead, your program should only use constant memory.

Searching and insertion into a binary search tree can be performed by recursively choosing to follow the left or right branch depending on the value being searched/inserted. Let us remind you how to do this in Python:

def bst_search(node, x):
    if node is None:
        return False
    elif node.value == x:
        return True
    elif x < node.value:
        return bst_search(node.left, x)
    else:
        return bst_search(node.right, x)
    
def bst_insert(node, x):
    if node is None:
        return Node(x)
    elif x <= node.value:
        node.left = bst_insert(node.left, x)
    else:
        node.right = bst_insert(node.right, x)
    return node

We are now interested in implementing removal of a value from a BST. The procedure works as follows. We start by looking for the node \(N\) of the tree containing the given value to be deleted. Once it has been found, several cases need to be considered:

\(N\) is a leaf: since the node has no children, we just remove it.
\(N\) has exactly one child: we remove \(N\) from the tree and replace it with its child.
\(N\) has two children: In that case, we do not remove \(N\), but instead we replace its value with the value of its in-order successor \(S\) (= the left-most node of the right-subtree of \(N\)). Then, we remove \(S\): since \(S\) has at most one child, we are back in one of the two previous cases.

First, try to convince yourself that this procedure preserves the BST structure. Then, write a function bst_remove(root, x) that removes x from the BST root and returns the resulting BST. (If x does not appear in root, the function does nothing. If x appears multiple times in root, only one occurrence of x should be removed.)

If you don’t know the right word to use, trie guessing (optional)

We expect you to write your solution to this problem in a file named wordrects.py and to upload it using the form below. You can resubmit your solution as many times as you like.

Upload form is only available when connected

A word square is a square of words that can be read both horizontally and vertically, for example:

===========
| L O V E |
| O V A L |
| V A N S |
| E L S E |
===========

In this case the square contains the same words both horizontally and vertically, but in general a word square does not necessarily have to be symmetric (although apparently this restriction is enforced for carrées magiques). For example, the following is a non-symmetric word square:

===========
| L O V E |
| O M E N |
| S E N D |
| E N D S |
===========

The notion naturally generalizes to word rectangles, such as the following 4 x 6 rectangle:

===============
| P Y T H O N |
| E U R O P E |
| E L O P E S |
| D E T E C T |
===============

In Volume 4B of The Art of Computer Programming, Donald Knuth suggests a procedure for constructing \(m \times n\) word rectangles using backtracking combined with a trie data structure. A trie is a kind of dictionary that supports \(O(k)\) worst-case search and insertion time for any given string of length \(k\) (in particular, the time does not depend on the number of strings in the dictionary). The key property of tries that make them useful for backtracking is that they permit efficiently testing whether a string is a valid prefix of a collection of strings, and of finding out what are its possible suffixes. Before discussing the backtracking algorithm itself, let us take a moment to describe tries in more detail, beginning with how to visualize them.

Here is a picture of a trie (adapted from the Wikipedia article linked above) containing the seven strings 'to', 'tea', 'ted', 'ten', 'A', 'in', and 'inn':

The idea is to organize strings into a tree based on their common prefixes. Each of the eleven labelled nodes logically represents a valid prefix of one of the strings in the dictionary, e.g., node 5 represents the prefix 'te', and has three children corresponding to the possible extensions of the prefix by one letter. A red circle indicates that the trie accepts the current prefix as a string, i.e., that it is a complete word in the dictionary. Observe that all of the leaves are accepting, but so is the internal node 6 in this example, since 'in' is both a valid prefix of 'inn' and also a valid string itself. As usual when talking about trees, we think of the trie itself as represented by its root node, i.e., node 0 representing the empty prefix (a.k.a. ε, which is not a word in our dictionary although it is a prefix of every word!).

Now here is a description of Knuth’s algorithm.

Beginning with an empty rectangle (or perhaps beginning with a particular word on the first row, such as “PYTHON”), we start filling in letters contiguously, going across the rows and then down the columns. At each step, we ensure that all of the rows contain valid prefixes of \(n\)-letter words, and that all of the columns contains valid prefixes of \(m\)-letter words. Indeed, to ensure this we just need to check that the newly introduced letter does not create any invalid prefixes. At the end of this process, if we have managed to fill in all of the \(m \times n\) letters then we have a valid word rectangle.

Let’s illustrate with a simple example for the case \(m = n = 4\). Suppose we have already filled in the first two rows and the first two letters of the third row:

===========
| L O V E |
| O M E N |
| S E ? ? |
| ? ? ? ? |
===========

At this point we consider the letters that can possibly follow “SE” and “VE”. Our dictionary of 4-letter words contains 26 possible suffixes for “SE-” (“-AL”, “-AM”, …, “-WS”, “-XY”) corresponding to 10 distinct initial letters, and 12 possible suffixes for “VE-” (“-AL”, “-ER”, …, “-TO”, “-TS”) corresponding to 7 distinct initial letters. By considering the intersection of the sets of initial letters, we see that there are only 5 possible letters that can go in the third column of the third row: A, E, N, R, or T. Suppose we go with E:

===========
| L O V E |
| O M E N |
| S E E ? |
| ? ? ? ? |
===========

We continue filling in letters, say until we end up with the following square:

===========
| L O V E |
| O M E N |
| S E E D |
| T ? ? ? |
===========

At this point we’re stuck, because although our dictionary of 4-letter words contains 156 possible suffixes for “T-” corresponding to 11 distinct initial letters, it only has one possible suffix for “OME-” (“-N”), and the intersection of these sets is empty.

We can backtrack on the choice of T for the first letter of the fourth row, making more progress by instead choosing E:

===========
| L O V E |
| O M E N |
| S E E D |
| E N ? ? |
===========

However, at this point we get stuck again, since the only possible suffix of “VEE-” is “-R”, and “ENR-” is not a valid prefix of a 4-letter word. Eventually, we’ll backtrack to the point where we chose E as the third letter of the third row and replace it by another letter, say N:

===========
| L O V E |
| O M E N |
| S E N ? |
| ? ? ? ? |
===========

And now we can work our way forwards again, until we arrive at the following (unique) solution:

===========
| L O V E |
| O M E N |
| S E N D |
| E N D S |
===========

Implementing tries

Your first goal is to understand and complete the following partially-completed implementation of tries:

class Trie:
    def __init__(self, words = []):
        '''Creates a trie from a list of words.'''
        # initialize an empty trie
        self.__accept = False
        self.__step = {}
        # insert the list of words into the trie
        for w in words:
            self.insert(w)

    def __bool__(self):
        '''Returns True iff the trie contains any words.'''
        return self.__accept or self.__step

    def __contains__(self, w):
        '''Returns True iff w is in the trie.'''
        node = self.prefix(w)
        return node.__accept if node is not None else False

    def __iter__(self):
        '''Iterates over the strings in the trie.'''
        if self.__accept:
            yield ''
        for (c, t) in self.__step.items():
            yield from [c + w for w in t]

    def follows(self, w):
        '''Returns the list of characters that can possibly follow w
        in the trie, assuming w is a valid prefix.'''
        return [c for c in self.prefix(w).__step.keys()]

    def prefix(self, w):
        '''Returns the node of the trie corresponding to prefix w, 
        or else None if w is not a valid prefix.'''
        pass  # your code here

    def insert(self, w):
        '''Inserts w into the trie.'''
        pass  # your code here

In particular, you have to implement the two methods:

prefix(self, w)
insert(self, w)

which are themselves used in the implementation of some of the other methods. To define prefix and insert correctly, you should understand the interpretations of the two fields stored at each node of the trie:

__accept: this is a boolean indicating whether the current prefix is a word of the dictionary (corresponding to the color of the node in the visualization we saw earlier, with red = True, black = False);
__step: this is a Python dictionary associating single-letter extensions of the current prefix to corresponding child nodes.

(Note that a trie can also be thought of as a simple finite-state machine, where the states correspond to valid prefixes of words. Under this interpretation, the __accept field indicates whether the current state is accepting, and the __step field encodes its outgoing transitions.)

Here is an example of a simple test you can try on your implementation:

def test_trie():
    trie = Trie(["CAT", "CATEGORY", "CATION", "DATA", "DO", "DOG"])
    for w in trie.prefix("CAT"):
        print("CAT-", w)
    for w in trie.prefix("DO"):
        print("DO-", w)
    for c in trie.follows("CAT"):
        print("CAT" + c + "?...")

It should output:

CAT- 
CAT- EGORY
CAT- ION
DO- 
DO- G
CATE?...
CATI?...

(The grader server will also run some automatic tests.)

Generating word rectangles

We will represent a partially-filled \(m \times n\) word rectangle as a two-dimensional matrix rect, where rect[i][j] contains the letter (string of length 1) at row i, column j, or else None if that position is unfilled. We have provided a few utility functions that you can use for constructing and printing partially-filled word rectangles:

def partial_rect(m, n, topstrings = []):
    rect = [[None for _ in range(n)] for _ in range(m)]
    for i in range(len(topstrings)):
        s = topstrings[i]
        for j in range(len(s)):
            rect[i][j] = s[j]
    return rect

def rect_print(m, n, rect):
    print('='*(2*n+3))
    for i in range(m):
        print('| ' + ' '.join([c if c else '?' for c in rect[i]]) + ' |')
    print('='*(2*n+3))

Example:

>>> rect = partial_rect(4, 4, ["LOVE","OMEN","SE"])
>>> rect_print(4, 4, rect)
===========
| L O V E |
| O M E N |
| S E ? ? |
| ? ? ? ? |
===========

We have also provided a few different dictionaries that you can use to test your implementation:

common100.txt: a list of the 100 most common words in English
common10000.txt: a list of the 10000 most common words in English
francais1.txt: a dictionary containing 22739 words in French
english2.txt: a dictionary containing 65199 words in English
francais2.txt: a dictionary containing 208916 words in French

On most Unix systems there is also a standard dictionary file, available at /usr/share/dict/words. Or feel free to supply your own dictionary!

You can use the following function to load a dictionary file containing one word per line, and return the list of words all converted to uppercase:

def readdict(name):
    with open(name, 'r', encoding='utf-8') as stream:
        return [x.strip().upper() for x in stream]

Finally you are ready to fill some rectangles!

Write a function wordrects(dict, m, n, rect) that takes a list of strings dict, natural numbers m and n, and a partially filled word rectangle rect, and generates (by calling yield) all possible extensions of rect to a full \(m \times n\) word rectangle containing only words from dict. You can assume that rect is filled contiguously from the first row and column reading across the rows, in the sense that rect[i][j] == None implies rect[ii][jj] == None if either ii > i or ii == i and jj > j. Moreover, you are allowed to update rect in place when generating solutions, although you should be careful to appropriately undo these updates when backtracking.

Hint: introduce an auxiliary function solve_wordrects(rowtrie, coltrie, rect, unfilled) which takes a pair of tries rowtrie and coltrie containing all possible row words and column words respectively, as well as a list unfilled of the currently unfilled positions in rect.