CSE 102 – Tutorial 8 – Binary representation

Sections tagged with a green marginal line are mandatory.

Sections tagged with an orange marginal line are either more advanced or supplementary, and are therefore considered optional. They will help you deepen your understanding of the material in the course. You may skip these sections on a first pass through the tutorial, and come back to them after finishing the mandatory exercises.

After you are finished with the tutorial, please upload your submission to Moodle by the indicated deadline (end of the day Friday, May 10th).

Warmup: bit operations in Python

We expect you to write your solution to this problem in a file named bits.py and to upload it using the form below. You can resubmit your solution as many times as you like.

For this section, you are NOT allowed to use bin or int or any function that performs a conversion from/to a binary representation (as a string) of an integer.

For this section, we will assume a word width of 16 bits, just to keep things simple.

Recall the following bitwise operations in Python.

# To print the binary representation of a number, use the function `bin`...
print(bin(42))
# ...or the format spec {:b}.
print('0b{:b}'.format(42))

# Bitwise-or is done with |
x = 9
y = 10
z = x | y
print(f'{x} | {y} = {z}')
print(f'0b{x:b} | 0b{y:b} = 0b{z:b}')

# Bitwise-and is done with &
z = x & y
print(f'{x} & {y} = {z}')
print(f'0b{x:b} & 0b{y:b} = 0b{z:b}')

# Bitwise-xor is done with ^
z = x ^ y
print(f'{x} ^ {y} = {z}')
print(f'0b{x:b} ^ 0b{y:b} = 0b{z:b}')

# Bitwise-not is done with ~
z = ~ x
print(f'~ {x} = {z}')
print(f'~ 0b{x:b} = 0b{z:b}')

# Bitwise shift-left is << and shift-right is >>
z = y << 2
w = y >> 2
print(f'{y} << 2 = {z}')
print(f'0b{y:b} << 2 = 0b{z:b}')
print(f'{y} >> 2 = {w}')
print(f'0b{y:b} >> 2 = 0b{w:b}')

You can find a review of Python’s bitwise operators here.

We remind you that the 0b notation is a way of writing a binary numerical literal in Python. So, just like you would write 42, 100, 1234, etc., you could write 0b1011, 0b1010100, etc.

  1. Write a function uint16_to_bitstring(x) that returns a list of length 16 where each element of the list is 0 or 1. The list should be the bitstring representation of x, interpreted as an unsigned integer (i.e., ignore its sign) 16 bits wide. You can assume that x is in the range \([0 \ldots 2^{16}-1]\). You can test your function with:

    assert(uint16_to_bitstring(42) == [0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0])
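
    One possible sketch, reading the bits from most to least significant with shifts and masks:

    def uint16_to_bitstring(x):
        """Return the bits of x as a list of length 16, most significant bit first."""
        return [(x >> k) & 1 for k in range(15, -1, -1)]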
  2. Write a function bitstring_to_uint16(bs) that is the inverse of the previous function. You can test your function with:

    assert(42 == bitstring_to_uint16([0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0]))
    
    # check that all 16 bit numbers survive a roundtrip through bitstrings
    for n in range(1<<16):
        bs = uint16_to_bitstring(n)
        nn = bitstring_to_uint16(bs)
        if nn != n:
            print('Oops! Error found with the following number, to_bitstring, and to_int: ', n, bs, nn)
            break
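
    One possible sketch, rebuilding the number by shifting in one bit at a time:

    def bitstring_to_uint16(bs):
        """Inverse of uint16_to_bitstring."""
        x = 0
        for b in bs:
            x = (x << 1) | b   # make room for the next bit, then add it
        return x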
  3. Write a function mod_pow2(x, k) that computes x % (2 ** k) without using the operators % or **, and without using iteration. The input x is assumed to be an integer between 0 and 2 ** 16 - 1, and k is assumed to be an integer between 0 and 15.

    Hint: As an example, \(944\) modulo \(2^8 = 256\) is \(176\), and in binary, \(176 = 0b10110000\) can be obtained from \(944 = 0b1110110000\) by clearing all of the bits other than the eight least significant bits to zero.
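
    Following the hint, a minimal sketch: the mask (1 << k) - 1 has exactly the k least significant bits set.

    def mod_pow2(x, k):
        """Compute x % (2 ** k) by keeping only the k low bits of x."""
        return x & ((1 << k) - 1)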

  4. Implement the function is_pow2(x) that returns True if x is of the form 2 ** n for some natural number n, and False otherwise, using as few arithmetic/bitwise operations and comparisons as possible. You can assume that x is a non-negative integer.

    Hint: If the number is a power of two, then only one bit is set to 1 in its binary representation. If we subtract 1 from a power of 2, then all the bits below the set bit become 1 and the set bit itself becomes 0. E.g.:

    N    = 4,  000100
    N-1  = 3,  000011
    N    = 16, 010000
    N-1  = 15, 001111
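
    Following the hint, a possible sketch: for a power of two, x and x - 1 share no set bit.

    def is_pow2(x):
        """True iff exactly one bit of x is set."""
        return x != 0 and (x & (x - 1)) == 0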
  5. Recall the functions which set, toggle and clear individual bits in a word:

    def set_bit(w, k):
        """set the k'th bit of w to 1"""
        return w | (1 << k)
    
    def toggle_bit(w, k):
        """toggle the k'th bit of w"""
        return w ^ (1 << k)
    
    def clear_bit(w, k):
        """set the k'th bit of w to 0"""
        return w & (~ (1 << k))

    Generalize the three functions from above to work on entire bit masks. That is, write the following functions:

    def set_mask(w, m):
       """set every bit position which is 1 in m, to 1 in w"""
       pass # Your code goes here
    
    def toggle_mask(w, m):
       """toggle every bit position which is 1 in m, in w"""
       pass # Your code goes here
    
    def clear_mask(w, m):
       """set every bit position which is 1 in m, to 0 in w"""
       pass # Your code goes here

    Do this by only using bit operations.
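
    A possible sketch, mirroring the single-bit versions with the mask m taking the place of 1 << k:

    def set_mask(w, m):
        """set every bit position which is 1 in m, to 1 in w"""
        return w | m

    def toggle_mask(w, m):
        """toggle every bit position which is 1 in m, in w"""
        return w ^ m

    def clear_mask(w, m):
        """set every bit position which is 1 in m, to 0 in w"""
        return w & (~ m)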

Generating pseudorandom numbers

We expect you to write your solution to this problem in a file named lfsr.py and to upload it using the form below. You can resubmit your solution as many times as you like.

For this section, you are NOT allowed to use bin or int or any function that performs a conversion from/to a binary representation (as a string) of an integer.

Recall from Lecture 4 that a pseudorandom number generator (PRNG) is a deterministic algorithm generating a sequence of numbers (or other objects) whose properties approximate that of a random sequence. In this section, you will implement a simple kind of PRNG known as a linear feedback shift register (LFSR). To explain how this works, let’s consider the following illustration of an LFSR generating 16-bit numbers:

In the diagram, the bit vector may be interpreted as holding the current value of the sequence, namely 44257 (= 0b1010110011100001). Note that the positions are numbered from right to left, running from the least significant digit to the most significant digit. The indices 0, 2, 3, and 5 are called tap positions, and are used to compute the next value of the sequence as follows: shift all of the bits of the current value over to the right, while taking the sum modulo 2 (also known as the exclusive or, or xor) of the bits in the tap positions as the new value for the leftmost digit (index 15). In this example, the LFSR will produce 22128 (= 0b0101011001110000) as the next value of the sequence.

More formally, for any 16-bit number \(X\) and index \(0 \le i < 16\), let us write \(X[i]\) for the bit in the \(i\)th position of \(X\). Let \(I \subseteq \{0,\dots,15\}\) be any subset of tap positions. The LFSR based on \(I\) generates a sequence of 16-bit numbers \((X_n)_{n\in\mathbb{N}}\) defined as follows:

  • \(X_0\) is some seed value
  • \(X_{n+1}\) has:
    • \(X_{n+1}[15] = \sum_{i \in I} X_n[i]\) (modulo 2)
    • \(X_{n+1}[i] = X_n[i+1]\) for all \(0 \le i < 15\)
  1. The operation of “tapping” the \(i\)th least significant bit of a 16-bit number can be naively implemented in terms of uint16_to_bitstring, although this is inefficient:

    def slowtap_uint16(x, i):
        return uint16_to_bitstring(x)[15-i]

    Complete the following by writing a better version of the tap operation, which computes the result directly using bitwise arithmetic:

    def tap_uint16(x, i):
        """ Return 1 or 0 depending on whether the ith-least significant bit
            of x is 1 or 0.
        """
        pass #FIXME

    You can test your function by comparing it to the result of slowtap_uint16 above.
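
    One possible completion: shift the wanted bit down to position 0 and mask everything else off.

    def tap_uint16(x, i):
        """ Return 1 or 0 depending on whether the ith-least significant bit
            of x is 1 or 0.
        """
        return (x >> i) & 1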

  2. Write a function that takes a 16-bit number and a list of tap positions, and returns the sum modulo 2 of all the values in the tapped positions.

    def polytap_uint16(x, I):
        """ Tap x at all the positions in I (which is a list of tap
            positions) and return the xor of all of these tapped values.
        """
        pass #FIXME

    For example, you should have polytap_uint16(44257, [0, 2, 3, 5]) = 0.
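
    One possible completion, xoring the tapped bits one by one:

    def polytap_uint16(x, I):
        """Xor of the bits of x at all the positions in I."""
        r = 0
        for i in I:
            r ^= tap_uint16(x, i)
        return r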

  3. Finish the implementation of the LFSR function below. It should compute the next number in the sequence \((X_n)\) defined above.

    def lfsr_uint16(x, I):
        """Return the successor of x in an LFSR based on tap positions I"""
        pass #FIXME

    You can test it using the following code:

    def test_lfsr_uint16(seed):
        I = [0, 2, 3, 5, 15]
        xs = [seed]
        for _ in range(10):
            xs.append(lfsr_uint16(xs[-1], I))
        return xs
    
    assert test_lfsr_uint16(42) == [42, 21, 10, 32773, 49154,
                                    57345, 28672, 14336, 7168,
                                    3584, 1792]
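
    One possible completion, combining the right shift with the freshly computed top bit:

    def lfsr_uint16(x, I):
        """Return the successor of x in an LFSR based on tap positions I"""
        return (x >> 1) | (polytap_uint16(x, I) << 15)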

Huffman coding

Lossless data compression is the operation of transforming a sequence of bits into a shorter sequence of bits that encodes the same information – the initial sequence being recovered by a decompression algorithm. In short, it is a coding operation that reduces the size of data (for transmission or storage) at the cost of the computational work of compressing it.

Huffman coding is a variable-length coding, developed by David Albert Huffman while he was a student at MIT. The algorithm was published in 1952 in a paper entitled “A Method for the Construction of Minimum-Redundancy Codes”.

In this section, we are interested in compressing (character) strings. Ordinarily, when using the extended ASCII encoding, strings are encoded using 8 bits per character. To get the binary representation of a string, we can use the .encode() method. For example, the following function takes a string, gets its ASCII encoding as an array of bytes (represented using the class bytes in Python), and formats it so that it is human-readable: a string made of 0s and 1s, one character per bit, where groups of 8 bits are separated by a space (again, for readability).

def bin_repr_of_ascii_string(s):
    return ' '.join(format(x, '08b') for x in s.encode('ascii'))

For instance, on the string "abc", we obtain the following representation:

01100001 01100010 01100011

We see that the ASCII encoding of "abc" is 3 bytes long (as expected) and one can check, for example, that the byte 0b01100001 (97 in decimal notation) corresponds to the character a in the ASCII table.

(Note that in extended ASCII, since characters are encoded using 8-bit words, you are limited to 256 characters. Basically, the encoding space is barely large enough to encode Latin characters without diacritic marks. As of today, there exist dozens of different encodings, some of them using multiple bytes per character, or even variable-width ones, i.e. the encoding of one character may vary in size. For this tutorial, we will stick to extended ASCII.)

Under Huffman coding, compression is achieved by replacing the most frequent characters with short codes (< 8 bits) and the least frequent characters with long codes (> 8 bits). On average, size reductions of between 20% and 90% are observed. It is a widely used compression method, often as a complement to other algorithms.

Huffman coding uses a particular approach to choose the representation of each coded symbol: no code should be a prefix of any other code. In other words, none of the chosen codes should be the beginning of another code. We will see this in more detail later.

Description of the Huffman coding

The principle of Huffman coding is based on the creation of a tree structure called a Huffman tree. That tree is built as follows: suppose the sentence \(s\) to be compressed is

Forsan et haec olim meminisse juvabit

We first compute the probability of appearance, in the sentence, of each character. For example, in \(s\), the character c appears once, while the character a and the space character appear \(3\) and \(5\) times, respectively. Since \(s\) is of length \(37\), the probabilities of appearance of c, a and the space character are resp. equal to \(\frac{1}{37}\), \(\frac{3}{37}\) and \(\frac{5}{37}\). We give below the probability of appearance for all the characters of the sentence:

\(c\)     \(p\)       \(c\)     \(p\)       \(c\)       \(p\)
F         0.02703     r         0.02703     a           0.08108
b         0.02703     u         0.02703     m           0.08108
c         0.02703     v         0.02703     s           0.08108
h         0.02703     n         0.05405     e           0.10811
j         0.02703     o         0.05405     i           0.10811
l         0.02703     t         0.05405     (space)     0.13514

Then, the tree structure is constructed as follows. For each character \(c\) with probability \(p\), we start by constructing a set of isolated leaf nodes with value \(c\) and associated probability \(p\). Then, we iteratively repeat the following process: we select the two nodes with the lowest associated probabilities and remove them from the set – let’s call them \(n_1\) and \(n_2\), with associated probabilities \(p_1\) and \(p_2\). We then create a new node with children \(n_1\) and \(n_2\) (the order is irrelevant here), with no data and with associated probability \(p_1 + p_2\), and add it to the set. Eventually, the set contains a single node: the root of the Huffman tree.

To obtain the Huffman code of a character \(c\), we walk down the tree from the root to the leaf that contains \(c\), adding a 0 (resp. a 1) to the code each time we take a left-edge (resp. a right-edge).

The construction of a Huffman tree for the probability table given above is shown below:

Here we have also indicated a path from the root down to a leaf node u, which goes right, then left twice, then right, and finally left. We thus obtain the code 10010 for u, the code 0101 for t, and so on. You can check that in general, the letters with a higher probability of appearance have shorter codes.

Last, one can encode an entire string by encoding its characters one by one, using their respective Huffman codes, and by concatenating all the obtained codes.

Implementation

We expect you to write your solution to this problem in a file named huffman.py and to upload it using the form below. You can resubmit your solution as many times as you like.

We now have all the information for implementing the Huffman coding. In this part, we are not going to manipulate bits (and bit strings) but are going to represent codes with standard strings made of 0s and 1s. Note that this somewhat defeats the goal of Huffman coding since we are going to represent \(1\) bit with a character which in turn requires (at least) \(8\) bits to be stored — but still it will help us to get an understanding of the underlying principles of Huffman coding!

  1. Write a function huffman_stats(s) that takes a string s and returns a dictionary whose keys are the characters \(c\) appearing in s and whose associated value \(p\) is the appearance ratio of \(c\) in s. (A sketch follows the remarks below.)

Some remarks about floating point arithmetic. Floating point numbers are not real numbers in the mathematical sense — there are only finitely many of them, so uncountably many real numbers have no floating point representation. Moreover, due to their internal representation, floating point numbers can behave in surprising ways. For example:

  • Addition is not associative:

    In [1]: (0.1 + 0.2) - 0.3
    Out[1]: 5.551115123125783e-17
    In [2]: 0.1 + (0.2 - 0.3)
    Out[2]: 2.7755575615628914e-17
  • Even though the literal 0.3 denotes a perfectly valid floating point number (the representable number closest to 3/10), we do not have 0.1 + 0.2 == 0.3; instead, we have:

    In [3]: 0.1+0.2
    Out[3]: 0.30000000000000004
  • Sometimes, the result can be totally chaotic. The function f implements a sequence that mathematically converges to \(6\).

    def f(n):
        u, v = 2.0, -4.0
        while n > 1:
            u, v, n = v, 111 - (1130 / v) + 3000 / (u * v), n-1
        return u

    However, its floating point implementation converges to 100.0 (you can see that the sequence starts out approaching 6.0, and then, at some point, veers away):

    In [4]: print([f(n) for n in range(100)])
    Out[4]: [2.0, 2.0, -4.0, 18.5, 9.378378378378379, 7.801152737752169, 7.154414480975333, 6.806784736924811, 6.592632768721792, 6.449465934053933, 6.348452060746624, 6.274438662728116, 6.218696768582163, 6.17585385581539, 6.142627170481006, 6.120248704570159, 6.166086559598099, 7.235021165534931, 22.062078463525793, 78.57557488787224, 98.34950312216536, 99.8985692661829, 99.99387098890278, 99.99963038728635, 99.99997773067949, 99.99999865921669, 99.99999991932181, 99.99999999514776, 99.99999999970828, 99.99999999998246, 99.99999999999893, 99.99999999999993, 99.99999999999999, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0]
  • Closer to our question, you can check that \(\sum_{i < n} \frac{1}{n} \neq 1\):

    In [5]: sum([1/100 for _ in range(100)])
    Out[5]: 1.0000000000000007

    This means that for computing the probability of appearance of a character \(c\) that appears \(k\) times in a string of length \(n\), it is better to compute \(\frac{k}{n}\) than to compute \(\sum_{i < k} \frac{1}{n}\).
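
With these remarks in mind, here is one possible sketch of huffman_stats: count the occurrences of each character, then divide each count by the length of the string.

def huffman_stats(s):
    """Map each character of s to its appearance ratio in s."""
    counts = {}
    for c in s:
        counts[c] = counts.get(c, 0) + 1
    # Compute k / n directly rather than summing k copies of 1 / n.
    return {c: k / len(s) for c, k in counts.items()}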

  2. Write a function huffman_tree(d) that takes a dictionary d associating frequencies to characters, constructs a Huffman tree for d, and returns its root. The frequencies can be either relative frequencies (i.e., probabilities, as returned by huffman_stats) or absolute frequencies (i.e., total counts) — what matters is that the frequencies associated to characters can be compared, so that the characters with the lowest frequency may be iteratively extracted to construct the tree.

    For representing the nodes of the Huffman tree, you can create your own class or use the following one:

    class Node:
        def __init__(self, value, left = None, right = None):
            self.value = value
            self.left  = left
            self.right = right

    (The auto-grader script expects your nodes to have the three attributes value, left & right. Of course, you can add more attributes if needed.) The value attribute for leaves should contain the character the leaf is associated to. (For internal nodes, this attribute is simply ignored and you can use it at your own discretion.)
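
    As a sketch of one possible approach, the standard module heapq can maintain the set of nodes as a priority queue (the tiebreaker index avoids comparing Node objects when two frequencies are equal):

    import heapq

    def huffman_tree(d):
        """Build a Huffman tree for the frequency dictionary d, return its root."""
        heap = [(p, i, Node(c)) for i, (c, p) in enumerate(d.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            # Extract the two nodes with the lowest frequencies...
            p1, _, n1 = heapq.heappop(heap)
            p2, _, n2 = heapq.heappop(heap)
            # ...and merge them under a fresh internal node.
            heapq.heappush(heap, (p1 + p2, count, Node(None, n1, n2)))
            count += 1
        return heap[0][2]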

  3. Write a function huffman_codes(tree) that takes a Huffman tree tree and that returns a dictionary associating to any character \(c\) coded by the tree tree its Huffman code as a string made of 0’s and 1’s.
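
    A possible sketch, walking down the tree while accumulating the code of the current path:

    def huffman_codes(tree):
        """Map each character coded by tree to its code as a string of 0s and 1s."""
        codes = {}
        def walk(node, code):
            if node.left is None and node.right is None:
                codes[node.value] = code   # reached a leaf: record its code
            else:
                walk(node.left, code + '0')
                walk(node.right, code + '1')
        walk(tree, '')
        return codes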

  4. Write a function huffman_encode(tree, s) that takes a Huffman tree tree and a string s and that returns its Huffman encoding as a string made of 0’s and 1’s.
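
    A possible sketch on top of huffman_codes:

    def huffman_encode(tree, s):
        """Encode s as a string of 0s and 1s."""
        codes = huffman_codes(tree)
        return ''.join(codes[c] for c in s)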

  5. Write a function huffman_decode(tree, s) that takes a Huffman tree tree and a Huffman encoding s and that decodes s w.r.t. tree. Here too, s is a string made of 0’s and 1’s.
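
    A possible sketch: walk down the tree bit by bit and restart from the root at each leaf. This works precisely because no code is a prefix of another.

    def huffman_decode(tree, s):
        """Decode a string s of 0s and 1s w.r.t. tree."""
        chars, node = [], tree
        for bit in s:
            node = node.left if bit == '0' else node.right
            if node.left is None and node.right is None:
                chars.append(node.value)   # reached a leaf: emit its character
                node = tree                # and restart from the root
        return ''.join(chars)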

  6. Write a function huffman_compress(s) that takes a string s, does its frequency analysis (huffman_stats), computes the Huffman tree based on this frequency analysis (huffman_tree) and encodes s under this tree (huffman_encode). Your function should return a triplet composed of: 1. the computed Huffman tree, 2. the encoding of s and 3. the compression ratio.

    The compression ratio is defined as \(\frac{|s|}{|h(s)|}\) where \(|s|\) is the size of the initial string \(s\) in bits (assuming the ASCII encoding where characters are encoded using 8 bits) and \(|h(s)|\) is the size of the Huffman encoding of \(s\) in bits. (We remind you that characters in \(h(s)\) represent bits – i.e. one character in \(h(s)\) counts as one bit.)
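
    A possible sketch, chaining the previous functions (degenerate strings over a one-character alphabet, whose encoding is empty, are left aside):

    def huffman_compress(s):
        """Return (tree, encoding of s, compression ratio)."""
        tree = huffman_tree(huffman_stats(s))
        h = huffman_encode(tree, s)
        return (tree, h, (8 * len(s)) / len(h))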

Binary Huffman Implementation (optional)

We expect you to write your solution to this problem in a file named huffman.py and to upload it using the form below. You can resubmit your solution as many times as you like.

In the previous section, we saw the principle of Huffman encoding and developed basic encoding & decoding functionalities. However, to obtain a fully functional implementation, some work remains.

As a first step, we will adapt the functions huffman_encode & huffman_decode to manipulate bitstrings. First, we need to modify huffman_codes:

  1. Write a function huffman_bcodes(tree) that takes a Huffman tree tree and that returns a dictionary associating to any character \(c\) coded by the tree tree its Huffman code as a pair \((n, h)\) of two integers, where \(n\) is the length (in bits) of the Huffman code of \(c\) and \(h\) is the actual Huffman code – i.e. the binary representation of \(h\) (using \(n\) bits) is the Huffman code of \(c\). For example, if \(c\) is coded by 010, then huffman_bcodes should associate (3, 2) to it (the binary representation of 2 being 0b010 when using 3 bits).
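
    A possible sketch, accumulating the code value with shifts instead of string concatenation:

    def huffman_bcodes(tree):
        """Map each character coded by tree to the pair (n, h)."""
        bcodes = {}
        def walk(node, n, h):
            if node.left is None and node.right is None:
                bcodes[node.value] = (n, h)
            else:
                walk(node.left, n + 1, h << 1)          # left edge: append a 0
                walk(node.right, n + 1, (h << 1) | 1)   # right edge: append a 1
        walk(tree, 0, 0)
        return bcodes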

We can now modify the huffman_encode function to return actual bit-strings. However, since the minimal unit of information a computer can address is a byte, we are going to use byte arrays (or byte strings) to store our encoded messages. Python comes with two classes for manipulating byte arrays: bytes and bytearray. The type bytes (documentation) is very similar to the type of strings: it is an immutable array of bytes (i.e. integers in the range \([0 \ldots 255]\)) and you can iterate over it or randomly access its elements. The type bytearray (documentation), on the other hand, represents mutable arrays of bytes. Its interface is close to the one of lists and uses the same syntax for accessing, adding or removing elements. Here are some examples of their usage:

# Create an immutable byte-string of length 4 containing the bytes:
#
# Decimal      Binary
# -------------------
#      97  0b01100001
#      98  0b01100010
#      99  0b01100011
#       1  0b00000001
#
b = bytes([97, 98, 99, 1])
print(b)

# Note that if you print `b`, Python is going to display this
#
#   b'abc\x01'
#
# First note the `b` in front of quote - it indicates that this
# is a byte string literal. Then, we can see that bytes are represented
# by characters! Indeed, Python displays bytes using the associated
# character in the ASCII encoding (when possible). If not, it displays
# \x?? where ?? is the value of the byte in hexadecimal notation (i.e.
# in base 16, where the digits are 0123456789abcdef) -- recall that 'a'
# (resp. 'b' & 'c') are encoded by 97 (resp. 98 & 99) in ASCII.

# We can now access different properties of b:
assert len(b) == 4     # b contains 4 bytes
assert b[0] == 97      # its first byte is 97
assert b[1:3] == b'bc' # slicing gives back a byte-string

# We can then create a mutable byte-array...
mb = bytearray([97, 98, 99, 1])
# or: mb = bytearray(b'abc\x01')

# ...and modify it:
mb[0] = 100
mb.append(101)

# ...and get its content:
print(len(mb))
print(bytes(mb))

Python also provides all the necessary functions for converting characters to bytes and bytes to characters:

# We can convert a string to its ASCII encoding...
s = 'abc'
e = s.encode('ascii')

# ...and convert it back to a string
s2 = e.decode('ascii')
assert s == s2

# Moreover, the function `chr(x)` gives the character whose
# ASCII code is `x` (as an integer). Conversely, the function
# `ord(c)` gives the ASCII code (as an integer) of the
# character `c`

We are now ready to write binary versions of huffman_encode and huffman_decode:

  2. Write a function huffman_bencode(tree, s) that takes a Huffman tree tree and a string s and that returns its Huffman encoding as a byte-string. If the encoding is not byte-aligned (i.e. if it has a number of bits that is not a multiple of 8), the result should be padded with 0s.
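
    A possible sketch: push the code bits into an integer buffer and flush a byte (most significant bits first) whenever eight of them are available.

    def huffman_bencode(tree, s):
        """Huffman-encode s as a byte-string, padded with 0s."""
        bcodes = huffman_bcodes(tree)
        out, buf, nbits = bytearray(), 0, 0
        for c in s:
            n, h = bcodes[c]
            buf, nbits = (buf << n) | h, nbits + n
            while nbits >= 8:                    # flush complete bytes
                nbits -= 8
                out.append((buf >> nbits) & 0xff)
                buf &= (1 << nbits) - 1
        if nbits > 0:                            # pad the last byte with 0s
            out.append((buf << (8 - nbits)) & 0xff)
        return bytes(out)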

  3. Write a function huffman_bdecode(tree, bs, n) that takes a Huffman tree tree, a Huffman encoding bs (as a byte string) and an integer n and that decodes bs w.r.t. tree. The integer n represents the length (in characters) of the original message. It is necessary for two reasons:

    • because huffman_bencode pads its output with 0s, without knowing the length of the original message, the decoder would not be able to decide whether the final 0s correspond to actual characters or to padding.

    • moreover, as we saw, the Huffman encoding of a string made of a single repeated character is the empty bit-string – regardless of the length of the original string. Here again, we need that length to recreate the string.
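
    A possible sketch, reading the bytes bit by bit (most significant bit first) and stopping after n decoded characters:

    def huffman_bdecode(tree, bs, n):
        """Decode the first n characters encoded in the byte-string bs."""
        if tree.left is None and tree.right is None:
            return tree.value * n                # degenerate single-leaf tree
        chars, node = [], tree
        for byte in bs:
            for k in range(7, -1, -1):
                bit = (byte >> k) & 1
                node = node.left if bit == 0 else node.right
                if node.left is None and node.right is None:
                    chars.append(node.value)
                    if len(chars) == n:          # ignore the trailing padding
                        return ''.join(chars)
                    node = tree
        return ''.join(chars)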

Ok, we are not far from the end. What remains is to (binary) encode the Huffman tree and to pack everything together. There are numerous ways of encoding a Huffman tree. Here is one possible way:
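
We assume the following classical pre-order scheme in the sketches below (any unambiguous convention works):

  • a leaf containing a character \(c\) is encoded as the bit 1 followed by the 8-bit ASCII code of \(c\);
  • an internal node is encoded as the bit 0, followed by the encoding of its left child and then the encoding of its right child.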

Finally, the binary encoding of a Huffman tree is simply the binary encoding of its root padded with 0s at the end if the encoding length (in bits) is not a multiple of 8.

  4. Write a function huffman_bencode_tree(tree) that takes a Huffman tree and returns its binary encoding, as described above.
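
    A possible sketch under the pre-order scheme above: collect the bits in a list, pad, then pack them eight at a time.

    def huffman_bencode_tree(tree):
        """Encode a Huffman tree as a byte-string."""
        bits = []
        def walk(node):
            if node.left is None and node.right is None:
                bits.append(1)                   # leaf marker...
                bits.extend((ord(node.value) >> k) & 1 for k in range(7, -1, -1))
            else:
                bits.append(0)                   # internal node marker
                walk(node.left)
                walk(node.right)
        walk(tree)
        bits.extend([0] * (-len(bits) % 8))      # pad to a byte boundary
        out = bytearray()
        for i in range(0, len(bits), 8):         # pack 8 bits per byte, MSB first
            byte = 0
            for b in bits[i:i+8]:
                byte = (byte << 1) | b
            out.append(byte)
        return bytes(out)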

We are now ready to write our final compression function. The encoding of a string s is the concatenation of: the binary encoding of its Huffman tree, the length of the original string s, and the Huffman encoding of s itself. (The exact layout is a convention; the sketches below assume this order, with the length stored on 4 bytes, most significant byte first.)

  5. Write a function huffman_bcompress(s) that takes a string s and compresses it, as described just above, and returns the result of the compression.

    Compute the compression ratio for different strings (in terms of character sets and length – you can generate random strings here). What do you observe?
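
    A possible sketch of the assumed layout (tree encoding, then the length of s on 4 bytes, then the encoded message):

    def huffman_bcompress(s):
        """Compress s into a single byte-string."""
        tree = huffman_tree(huffman_stats(s))
        length = bytes([(len(s) >> k) & 0xff for k in (24, 16, 8, 0)])
        return huffman_bencode_tree(tree) + length + huffman_bencode(tree, s)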

  6. Write a function huffman_buncompress(s) that takes a compressed byte-string s and uncompresses it. It must satisfy the equation huffman_buncompress(huffman_bcompress(s)) = s for every string s.
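
    A possible sketch, reading the stream back in the same order: the pre-order tree encoding is self-delimiting, after which we skip its padding, read the 4-byte length, and decode the rest.

    def huffman_buncompress(bs):
        """Inverse of huffman_bcompress."""
        pos = [0]                                 # current bit position in bs

        def read_bit():
            byte, k = divmod(pos[0], 8)
            pos[0] += 1
            return (bs[byte] >> (7 - k)) & 1

        def read_tree():
            if read_bit() == 1:                   # leaf: read an 8-bit character
                c = 0
                for _ in range(8):
                    c = (c << 1) | read_bit()
                return Node(chr(c))
            left = read_tree()                    # internal node: two children
            return Node(None, left, read_tree())

        tree = read_tree()
        i = (pos[0] + 7) // 8                     # skip the tree's padding
        n = (bs[i] << 24) | (bs[i+1] << 16) | (bs[i+2] << 8) | bs[i+3]
        return huffman_bdecode(tree, bs[i+4:], n)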


  1. 🎼 I am telling you of a time that those under twenty cannot have known. 🎶