CSE 102 – Tutorial 7 – Binary representation
Sections tagged with a green marginal line are mandatory.
Sections tagged with an orange marginal line are either more advanced or supplementary, and are therefore considered optional. They will help you deepen your understanding of the material in the course. You may skip these sections on a first pass through the tutorial, and come back to them after finishing the mandatory exercises.
After you are finished with the tutorial, please upload your submission to Moodle by completing “Tutorial 7 assignment” before the indicated deadline.
Warmup: bit operations in Python
We expect you to write your solution to this problem in a file named `bits.py` and to upload it using the form below. You can resubmit your solution as many times as you like.
For this section, you are NOT allowed to use `bin` or `int` or any function that performs a conversion from/to a binary representation (as a string) of an integer.
For this section, we will assume a word width of 16 just to keep things simple.
Recall the following bitwise operations in Python.
```python
# To print the binary representation of a number, use the function `bin`...
print(bin(42))
# ...or the format spec {:b}.
print('0b{:b}'.format(42))

# Bitwise-or is done with |
x = 9
y = 10
z = x | y
print(f'{x} | {y} = {z}')
print(f'0b{x:b} | 0b{y:b} = 0b{z:b}')

# Bitwise-and is done with &
z = x & y
print(f'{x} & {y} = {z}')
print(f'0b{x:b} & 0b{y:b} = 0b{z:b}')

# Bitwise-xor is done with ^
z = x ^ y
print(f'{x} ^ {y} = {z}')
print(f'0b{x:b} ^ 0b{y:b} = 0b{z:b}')

# Bitwise-not is done with ~
z = ~ x
print(f'~ {x} = {z}')
print(f'~ 0b{x:b} = 0b{z:b}')

# Bitwise shift-left is << and shift-right is >>
z = y << 2
w = y >> 2
print(f'{y} << 2 = {z}')
print(f'0b{y:b} << 2 = 0b{z:b}')
print(f'{y} >> 2 = {w}')
print(f'0b{y:b} >> 2 = 0b{w:b}')
```
You can find a review of Python’s bitwise operators here.
We remind you that the `0b` notation is a way of writing a binary numerical literal in Python. So, just like you would write `42`, `100`, `1234`, etc., you could write `0b1011`, `0b1010100`, etc.
Write a function `uint16_to_bitstring(x)` that returns a list of length 16 where each element of the list is `0` or `1`. The list should be the bitstring representation of `x`, interpreted as an unsigned integer (i.e., ignore its sign) 16 bits wide. You can assume that `x` is in the range \([0 \ldots 2^{16}-1]\). You can test your function with:

```python
assert(uint16_to_bitstring(42) == [0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0])
```
Write a function `bitstring_to_uint16(bs)` that is the inverse of the previous function. You can test your function with:

```python
assert(42 == bitstring_to_uint16([0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0]))

# check that all 16 bit numbers survive a roundtrip through bitstrings
for n in range(1 << 16):
    bs = uint16_to_bitstring(n)
    nn = bitstring_to_uint16(bs)
    if nn != n:
        print('Oops! Error found with the following number, to_bitstring, and to_int: ', n, bs, nn)
        break
```
Write a function `mod_pow2(x, k)` that computes `x % (2 ** k)` without using the operators `%` or `**`, and without using iteration. The input `x` is assumed to be an integer between `0` and `2 ** 16 - 1`, and `k` is assumed to be an integer between `0` and `15`.

Hint: As an example, \(944\) modulo \(2^8 = 256\) is \(176\), and in binary, \(176 = 0b10110000\) can be obtained from \(944 = 0b1110110000\) by clearing all of the bits other than the eight least significant bits to zero.
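Following the hint, here is one possible sketch (a starting point, not the only valid solution): the number `(2 ** k) - 1` has exactly `k` low bits set, and it can be built with a shift instead of `**`.

```python
def mod_pow2(x, k):
    """Possible sketch: keep only the k least significant bits of x.
    (1 << k) - 1 is a mask of k ones, so the bitwise-and clears
    everything above bit k - 1, which is exactly x % (2 ** k)."""
    return x & ((1 << k) - 1)

print(mod_pow2(944, 8))  # 176, matching the hint's example
```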
Implement the function `is_pow2(x)` that returns `True` if `x` is of the form `2 ** n` for some natural number `n`, and `False` otherwise, using as few arithmetic/bitwise operations and comparisons as possible. You can assume that `x` is a non-negative integer.

Hint: If the number is a power of two, then only one bit will be set to 1 in its binary representation. If we subtract 1 from a number which is a power of 2, then all the bits after the set bit will become set and the set bit will be unset. E.g.:

```
N = 4,      000100
N - 1 = 3,  000011
N = 16,     010000
N - 1 = 15, 001111
```
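The hint translates into a constant number of operations; one hedged sketch (treat it as an illustration of the hint, not the reference solution):

```python
def is_pow2(x):
    """Sketch based on the hint: a power of two has a single set bit,
    so x & (x - 1) clears that bit and leaves 0. The x != 0 guard
    rules out 0, which is not of the form 2 ** n."""
    return x != 0 and (x & (x - 1)) == 0

print([n for n in range(20) if is_pow2(n)])  # [1, 2, 4, 8, 16]
```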
Recall the functions which set, toggle and clear individual bits in a word:

```python
def set_bit(w, k):
    """set the k'th bit of w to 1"""
    return w | (1 << k)

def toggle_bit(w, k):
    """toggle the k'th bit of w"""
    return w ^ (1 << k)

def clear_bit(w, k):
    """set the k'th bit of w to 0"""
    return w & (~ (1 << k))
```
Generalize the three functions from above to work on entire bit masks. That is, write the following functions:
```python
def set_mask(w, m):
    """set every bit position which is 1 in m, to 1 in w"""
    pass # Your code goes here

def toggle_mask(w, m):
    """toggle every bit position which is 1 in m, in w"""
    pass # Your code goes here

def clear_mask(w, m):
    """set every bit position which is 1 in m, to 0 in w"""
    pass # Your code goes here
```
Do this by only using bit operations.
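For reference, one possible sketch: the generalization follows the single-bit versions with `1 << k` replaced by the mask `m` (check your own answer against these, they are not the only way to write it):

```python
def set_mask(w, m):
    """set every bit position which is 1 in m, to 1 in w"""
    return w | m

def toggle_mask(w, m):
    """toggle every bit position which is 1 in m, in w"""
    return w ^ m

def clear_mask(w, m):
    """set every bit position which is 1 in m, to 0 in w"""
    return w & ~m

assert set_mask(0b1010, 0b0110) == 0b1110
assert toggle_mask(0b1010, 0b0110) == 0b1100
assert clear_mask(0b1010, 0b0110) == 0b1000
```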
Generating pseudorandom numbers
We expect you to write your solution to this problem in a file named `lfsr.py` and to upload it using the form below. You can resubmit your solution as many times as you like.
For this section, you are NOT allowed to use `bin` or `int` or any function that performs a conversion from/to a binary representation (as a string) of an integer.
Recall from Lecture 4 that a pseudorandom number generator (PRNG) is a deterministic algorithm generating a sequence of numbers (or other objects) whose properties approximate that of a random sequence. In this section, you will implement a simple kind of PRNG known as a linear feedback shift register (LFSR). To explain how this works, let’s consider the following illustration of an LFSR generating 16-bit numbers:
In the diagram, the bit vector may be interpreted as holding the current value of the sequence, namely 44257 (= `0b1010110011100001`). Note that the positions are numbered from right to left, running from the least significant digit to the most significant digit. The indices 0, 2, 3, and 5 are called tap positions, and are used to compute the next value of the sequence as follows: shift all of the bits of the current value over to the right, while taking the sum modulo 2 (also known as the exclusive or, or xor) of the bits in the tap positions as the new value for the leftmost digit (index 15). In this example, the LFSR will produce 22128 (= `0b0101011001110000`) as the next value of the sequence.
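We can check the arithmetic of this worked example directly in Python (this only verifies the numbers in the diagram, step by step):

```python
x = 44257
assert x == 0b1010110011100001

# tap positions 0, 2, 3, 5: extract each bit and xor them together
b0 = (x >> 0) & 1    # 1
b2 = (x >> 2) & 1    # 0
b3 = (x >> 3) & 1    # 0
b5 = (x >> 5) & 1    # 1
new_msb = b0 ^ b2 ^ b3 ^ b5    # 0

# shift right and install the new bit at index 15
nxt = (x >> 1) | (new_msb << 15)
print(nxt)  # 22128
assert nxt == 0b0101011001110000
```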
More formally, for any 16-bit number \(X\) and index \(0 \le i < 16\), let us write \(X[i]\) for the bit in the \(i\)th position of \(X\). Let \(I \subseteq \{0,\dots,15\}\) be any subset of tap positions. The LFSR based on \(I\) generates a sequence of 16-bit numbers \((X_n)_{n\in\mathbb{N}}\) defined as follows:
- \(X_0\) is some seed value
- \(X_{n+1}\) has:
- \(X_{n+1}[15] = \sum_{i \in I} X_n[i]\) (modulo 2)
- \(X_{n+1}[i] = X_n[i+1]\) for all \(0 \le i < 15\)
The operation of “tapping” the \(i\)th least significant bit of a 16-bit number can be naively implemented in terms of `uint16_to_bitstring`, although this is inefficient:

```python
def slowtap_uint16(x, i):
    return uint16_to_bitstring(x)[15-i]
```
Complete the following by writing a better version of the tap operation, which computes the result directly using bitwise arithmetic:

```python
def tap_uint16(x, i):
    """Return 1 or 0 depending on whether the ith-least significant
    bit of x is 1 or 0."""
    pass # FIXME
```
You can test your function by comparing it to the result of `slowtap_uint16` above.

Write a function that takes a 16-bit number and a list of tap positions, and returns the sum modulo 2 of all the values in the tapped positions.

```python
def polytap_uint16(x, I):
    """Tap x at all the positions in I (which is a list of tap positions)
    and return the xor of all of these tapped values."""
    pass # FIXME
```
For example, you should have `polytap_uint16(44257, [0, 2, 3, 5]) = 0`.

Finish the implementation of the LFSR function below. It should compute the next number in the sequence \(X_i\) defined above.

```python
def lfsr_uint16(x, I):
    """Return the successor of x in an LFSR based on tap positions I"""
    pass # FIXME
```
You can test it using the following code:
```python
def test_lfsr_uint16(seed):
    I = [0, 2, 3, 5, 15]
    xs = [seed]
    for _ in range(10):
        xs.append(lfsr_uint16(xs[-1], I))
    return xs

assert test_lfsr_uint16(42) == [42, 21, 10, 32773, 49154, 57345, 28672, 14336, 7168, 3584, 1792]
```
Huffman coding
Lossless data compression is the operation of transforming a sequence of bits into a shorter sequence of bits that encodes the same information, the initial sequence of bits being recoverable using a decompression algorithm. In short, it is a coding operation which reduces the size (for transmission or storage) of the data at the cost of compression work.

Huffman coding is a variable-length coding, developed by David Albert Huffman while a student at MIT. The algorithm was published in 1952 in a paper entitled “A Method for the Construction of Minimum-Redundancy Codes”.
In this section, we are interested in compressing (character) strings. In a normal mode of operation, when using the extended ASCII encoding, strings are encoded using 8 bits per character. To get the binary representation of a string, we can use the `.encode()` method. For example, the following function takes a string, gets its ASCII encoding as an array of bytes (represented using the class `bytes` in Python), and formats it so that it is human-readable: a string made of `0`s and `1`s, one character per bit, where groups of 8 bits are separated by a space (again, for readability).

```python
def bin_repr_of_ascii_string(s):
    return ' '.join(format(x, '08b') for x in s.encode('ascii'))
```
For instance, on the string `"abc"`, we obtain the following representation:

```
01100001 01100010 01100011
```

We see that the ASCII encoding of `"abc"` is 3 bytes long (as expected) and one can check, for example, that the byte `0b01100001` (97 in decimal notation) corresponds to the character `a` in the ASCII table.
(Note that in extended ASCII, since characters are encoded using 8-bit words, you are limited to 256 characters. Basically, the encoding space is barely large enough to encode latin characters without diacritic marks. As of today, there exist dozens of different encodings, some of them using multiple bytes per character, or even variable-width, i.e. the encoding of one character may vary in size. For this tutorial, we will stick to extended ASCII.)
Under Huffman coding, compression is achieved by replacing the most frequent characters with short codes (< 8 bits) and the least frequent characters with long codes (> 8 bits). On average, size reductions of between 20 and 90% are observed. It is a widely used compression method, often as a complement to other algorithms.

Huffman coding uses a particular approach to choose the representation of each coded symbol. The idea is that each code should not be a prefix of any other code. In other words, none of the codes chosen should be the beginning of another code. We will see this in more detail later.
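As a small illustration of why the prefix property matters (with a made-up code, not the one constructed later): suppose `a = 0`, `b = 10`, `c = 11`. No code is a prefix of another, so a bit string can be decoded greedily from left to right, without any separators between codes:

```python
# hypothetical prefix-free code, for illustration only
codes = {'a': '0', 'b': '10', 'c': '11'}

def greedy_decode(bits, codes):
    """Read bits left to right; emit a character as soon as the
    accumulated bits match a code. This is unambiguous precisely
    because no code is a prefix of another."""
    rev = {v: k for k, v in codes.items()}
    out, acc = [], ''
    for bit in bits:
        acc += bit
        if acc in rev:
            out.append(rev[acc])
            acc = ''
    return ''.join(out)

print(greedy_decode('010110', codes))  # 'abca'
```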
Description of the Huffman coding
The principle of Huffman coding is based on the creation of a tree structure called a Huffman tree. That tree is built as follows: suppose the sentence \(s\) to be compressed is
Forsan et haec olim meminisse juvabit
We first compute the probability of appearance, in the sentence, of each character. For example, in \(s\), the character `c` appears once, while the character `a` and the space character appear \(3\) and \(5\) times, respectively. Since \(s\) is of length \(37\), the probabilities of appearance of `c`, `a` and the space character are respectively equal to \(\frac{1}{37}\), \(\frac{3}{37}\) and \(\frac{5}{37}\). We give below the probability of appearance for all the characters of the sentence:
| \(c\) | \(p\) | \(c\) | \(p\) | \(c\) | \(p\) |
|---|---|---|---|---|---|
| F | 0.02703 | r | 0.02703 | a | 0.08108 |
| b | 0.02703 | u | 0.02703 | m | 0.08108 |
| c | 0.02703 | v | 0.02703 | s | 0.08108 |
| h | 0.02703 | n | 0.05405 | e | 0.10811 |
| j | 0.02703 | o | 0.05405 | i | 0.10811 |
| l | 0.02703 | t | 0.05405 | (space) | 0.13514 |
Then, the tree structure is constructed as follows. For each character \(c\) with probability \(p\), we start by constructing a set of isolated leaf nodes with value \(c\) and associated probability \(p\). Then, we iteratively repeat the following process: we select two nodes with lowest associated probabilities and remove them from the set – let’s call them \(n_1\) and \(n_2\), with associated probabilities \(p_1\) and \(p_2\). We then create a new node with children \(n_1\) and \(n_2\) (the order is irrelevant here), with no data and with associated probability \(p_1 + p_2\), and add it to the set. Eventually, the set will only contain one node that forms the Huffman tree.
To obtain the Huffman code of a character \(c\), we walk down the tree from the root to the leaf that contains \(c\), adding a `0` (resp. a `1`) to the code each time we take a left-edge (resp. a right-edge).
The construction of a Huffman tree for the probability table given above is shown below:

Here we have also indicated a path from the root down to a leaf node `u`, which goes right, then left twice, then right, and finally left. We thus obtain the code `10010` for `u`, the code `0101` for `t`, and so on. You can check that in general, the letters with a higher probability of appearance have shorter codes.
Last, one can encode an entire string by encoding its characters one by one, using their respective Huffman codes, and by concatenating all the obtained codes.
Representing Huffman trees
We will use the following class to represent the nodes of the Huffman tree:
```python
class Node:
    def __init__(self, value, left = None, right = None):
        self.value = value
        self.left = left
        self.right = right
```
The `left` and `right` attributes point to the left and right children of a node. The `value` attribute for leaves should contain the character the leaf is associated to. For internal nodes, this attribute is simply ignored and you can use it at your own discretion.
(The grader server expects your nodes to have these three attributes, but you can add more attributes if needed.)
Priority queues
In the high-level description of the algorithm above for constructing the Huffman tree, we explained that we need to iteratively select the two nodes with the lowest associated probabilities. Rather than scanning over the entire list of value-probability pairs each time to find the minimum (which takes time \(O(n)\)), a better approach is to use a priority queue, which is a data structure that lets you insert values with associated priorities and efficiently remove the value with the highest/lowest associated priority (typically in time \(O(\log n)\)).
Priority queues are often implemented using a heap data structure. The Python standard library module `heapq` provides functions `heappush` and `heappop` that do almost exactly what we want, and below we use them to implement priority queues through a small wrapper class, `PQueue`, that exports the following methods:

- `put` adds an item to the queue along with its priority (weight).
- `get` removes and returns the item with the lowest priority.
One subtle issue when using the heapq
operations
directly is that if two elements have the same priority, Python’s heap
implementation tries to compare them to determine their order, which can
cause an error if the values are not inherently comparable (e.g., if
they are objects rather than numbers or strings). To avoid this, we
introduce an additional uid
field that ensures all elements
are uniquely ordered in the heap, even when they have the same
weight.
```python
import heapq

class PQueue:
    def __init__(self):
        '''initialize an empty priority queue'''
        self.pqueue = []
        self.uid = 0

    def __len__(self):
        '''return the number of elements in the queue'''
        return len(self.pqueue)

    def put(self, value, weight):
        '''add a value to the queue with associated weight'''
        heapq.heappush(self.pqueue, (weight, self.uid, value))
        self.uid += 1

    def get(self):
        '''remove and return the value with lowest associated weight'''
        w, _, v = heapq.heappop(self.pqueue)
        return (v, w)
```
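A quick usage example of `PQueue` (the class is repeated inside the snippet only so that it runs standalone; the values and weights are arbitrary):

```python
import heapq

# PQueue as defined above, repeated so this example is self-contained
class PQueue:
    def __init__(self):
        self.pqueue = []
        self.uid = 0
    def __len__(self):
        return len(self.pqueue)
    def put(self, value, weight):
        heapq.heappush(self.pqueue, (weight, self.uid, value))
        self.uid += 1
    def get(self):
        w, _, v = heapq.heappop(self.pqueue)
        return (v, w)

pq = PQueue()
pq.put('e', 0.108)
pq.put('c', 0.027)
pq.put('a', 0.081)

print(pq.get())   # ('c', 0.027): the lowest weight comes out first
print(pq.get())   # ('a', 0.081)
print(len(pq))    # 1
```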
Implementation of Huffman coding
We expect you to write your solution to this problem in a file named
huffman.py
and to upload it using the form below. You can
resubmit your solution as many times as you like.
We now have all the information for implementing the Huffman coding. In this part, we are not going to manipulate bits (and bit strings) but are going to represent codes with standard strings made of `0`s and `1`s. Note that this somewhat defeats the goal of Huffman coding since we are going to represent \(1\) bit with a character which in turn requires (at least) \(8\) bits to be stored, but it will still help us to get an understanding of the underlying principles of Huffman coding.
- Write a function `huffman_stats(s)` that takes a string `s` and returns a dictionary whose keys are the characters \(c\) appearing in `s` and whose associated value \(p\) is the appearance ratio of \(c\) in `s`.
Some remarks about floating point arithmetic. Floating point numbers are not real numbers in the mathematical sense: there are not even close to enough floating point numbers, since uncountably many real numbers have no representation. Moreover, due to their internal representation, floating point numbers can have weird behavior. For example:
Addition is not associative:
```
In [1]: (0.1 + 0.2) - 0.3
Out[1]: 5.551115123125783e-17

In [2]: 0.1 + (0.2 - 0.3)
Out[2]: 2.7755575615628914e-17
```
Even if `0.3` can be represented using floating point numbers, we do not have `0.1 + 0.2 == 0.3`; instead, we have:

```
In [3]: 0.1 + 0.2
Out[3]: 0.30000000000000004
```
Sometimes, the result can be totally chaotic. The function `f` implements a sequence that mathematically converges to \(6\).

```python
def f(n):
    u, v = 2.0, -4.0
    while n > 1:
        u, v, n = v, 111 - (1130 / v) + 3000 / (u * v), n - 1
    return u
```
However, its implementation converges to `100.0` (you can see that the sequence starts to converge to `6.0`, and then, at some point, it diverges):

```
In [4]: print([f(n) for n in range(100)])
[2.0, 2.0, -4.0, 18.5, 9.378378378378379, 7.801152737752169, 7.154414480975333, 6.806784736924811, 6.592632768721792, 6.449465934053933, 6.348452060746624, 6.274438662728116, 6.218696768582163, 6.17585385581539, 6.142627170481006, 6.120248704570159, 6.166086559598099, 7.235021165534931, 22.062078463525793, 78.57557488787224, 98.34950312216536, 99.8985692661829, 99.99387098890278, 99.99963038728635, 99.99997773067949, 99.99999865921669, 99.99999991932181, 99.99999999514776, 99.99999999970828, 99.99999999998246, 99.99999999999893, 99.99999999999993, 99.99999999999999, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0]
```
More related to the question, you can check that \(1 \neq \sum_{i < n} \frac{1}{n}\):
```
In [5]: sum([1/100 for _ in range(100)])
Out[5]: 1.0000000000000007
```
This means that for computing the probability of appearance of a character \(c\) that appears \(k\) times in a string of length \(n\), it is better to compute \(\frac{k}{n}\) than to compute \(\sum_{i < k} \frac{1}{n}\).
- Write a function `huffman_tree(d)` that takes a dictionary `d` associating frequencies to characters and returns a Huffman tree for `d` constructed via the algorithm described above. You should use some form of priority queue (for example using the class `PQueue` provided above) to efficiently extract the nodes with the lowest associated weights. You should not assume that the frequencies provided as input are necessarily relative frequencies (i.e., probabilities, as returned by `huffman_stats`), since they could also be absolute frequencies (i.e., total counts). The only thing you should assume is that the frequencies associated to characters can be compared, so that the characters with the lowest frequency may be iteratively extracted to construct the tree.
- Write a function `huffman_codes(tree)` that takes a Huffman tree `tree` and that returns a dictionary associating to any character \(c\) coded by the tree `tree` its Huffman code as a string made of `0`'s and `1`'s.
- Write a function `huffman_encode(tree, s)` that takes a Huffman tree `tree` and a string `s` and that returns its Huffman encoding as a string made of `0`'s and `1`'s.
- Write a function `huffman_decode(tree, s)` that takes a Huffman tree `tree` and a Huffman encoding `s` and that decodes `s` w.r.t. `tree`. Here too, `s` is a string made of `0`'s and `1`'s.
- Write a function `huffman_compress(s)` that takes a string `s`, does its frequency analysis (`huffman_stats`), computes the Huffman tree based on this frequency analysis (`huffman_tree`) and encodes `s` under this tree (`huffman_encode`). Your function should return a triplet composed of: 1. the computed Huffman tree, 2. the encoding of `s`, and 3. the compression ratio. The compression ratio is defined as \(\frac{|s|}{|h(s)|}\) where \(|s|\) is the size of the initial string \(s\) in bits (assuming the ASCII encoding where characters are encoded using 8 bits) and \(|h(s)|\) is the size of the Huffman encoding of \(s\) in bits. (We remind you that characters in \(h(s)\) represent bits, i.e. one character in \(h(s)\) counts for one bit.)
Binary Huffman Implementation (optional)
We expect you to write your solution to this problem in a file named `huffman.py` and to upload it using the form below. You can resubmit your solution as many times as you like.
In the previous section, we saw the principle of Huffman encoding and developed basic encoding & decoding functionalities. However, to obtain a functional implementation, some work remains:

- first, for representing bit-strings, we used (character) strings made of `0`s and `1`s. That is, we used 1 character (i.e. 8 bits if we assume the ASCII encoding of characters) for encoding 1 bit of information! In that setting, the function `huffman_encode` is more an anti-data-compression implementation than a data-compression one.
- then, the functions `huffman_encode` & `huffman_decode` assume that you have the relevant Huffman tree at your disposal. For `huffman_encode`, this is not a problem since you have the non-encoded message at hand and the tree can be reconstructed from it. But for the `huffman_decode` function, you cannot assume that: the goal of the function is to decode the message without already having access to its decoding! In consequence, the Huffman tree must be part of the encoded message, i.e. we have to find a way to encode a Huffman tree using bit-strings. Note that this encoding must be efficient (in terms of space) because now, the encoding of the Huffman tree is going to be part of the compressed data: the more space it takes, the lower the compression ratio will be. (There are other solutions. E.g., the Huffman tree could be fixed a priori and be pre-shared knowledge between the compressor and the decompressor. For example, fax machines use a modified Huffman algorithm where the Huffman tree is fixed for all machines & transmissions.)
- last, for strings that are made from only one character (e.g. `aaaaaa`), the Huffman tree is made of a single node. In consequence, the repeated character is going to be encoded with a code of length \(0\). And thus, all strings of that form are going to be compressed as the empty bit-string and uncompressed as the empty string (you can check this with your implementation). We will have to detect such cases.
As a first step, we will adapt the functions `huffman_encode` & `huffman_decode` to manipulate bitstrings. First, we need to modify `huffman_codes`:
- Write a function `huffman_bcodes(tree)` that takes a Huffman tree `tree` and that returns a dictionary associating to any character \(c\) coded by the tree `tree` its Huffman code as a pair \((n, h)\) of two integers, where \(n\) is the length (in bits) of the Huffman code of \(c\) and \(h\) is the actual Huffman code, i.e. the binary representation of \(h\) (using \(n\) bits) is the Huffman code of \(c\). For example, if \(c\) is coded by `010`, then `huffman_bcodes` should associate `(3, 2)` to it (the binary representation of `2` being `0b010` when using 3 bits).
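To make the \((n, h)\) representation concrete, here is a hedged sketch of the conversion from a code string (as returned by `huffman_codes`) to such a pair; the helper name `code_to_pair` is hypothetical, not part of the required interface:

```python
def code_to_pair(code):
    """Hypothetical helper: convert a code like '010' into (length, value).
    Bits are accumulated MSB-first: shift the accumulator left, then
    or in the next bit."""
    h = 0
    for c in code:
        h = (h << 1) | (1 if c == '1' else 0)
    return (len(code), h)

print(code_to_pair('010'))  # (3, 2)
```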
We can now modify the `huffman_encode` function to return actual bit-strings. However, since the minimal unit of information a computer can manipulate is a byte, we are going to use byte arrays (or byte strings) to store our encoded messages. Python comes with two classes for manipulating byte arrays: `bytes` and `bytearray`. The type `bytes` (documentation) is very similar to the one for strings: it is an immutable array of bytes (i.e. integers in the range \([0 \ldots 255]\)) and you can iterate over it or randomly access its elements. The type `bytearray` (documentation), on the other hand, represents mutable arrays of bytes. Its interface is close to the one of lists and uses the same syntax for accessing, adding or removing elements. Here are some examples of their usage:
```python
# Create an immutable byte-string of length 4 containing the bytes:
#
#  Decimal  Binary
#  -------------------
#  97       0b01100001
#  98       0b01100010
#  99       0b01100011
#  1        0b00000001
#
b = bytes([97, 98, 99, 1])
print(b)

# Note that if you print `b`, Python is going to display this:
#
#   b'abc\x01'
#
# First note the `b` in front of the quote - it indicates that this
# is a byte string literal. Then, we can see that bytes are represented
# by characters! Indeed, Python displays bytes using the associated
# character in the ASCII encoding (when possible). If not, it displays
# \x?? where ?? is the value of the byte in hexadecimal notation (i.e.
# in base 16, where the digits are 0123456789abcdef) -- recall that 'a'
# (resp. 'b' & 'c') are encoded by 97 (resp. 98 & 99) in ASCII.

# We can now access different properties of b:
assert len(b) == 4      # b contains 4 bytes
assert b[0] == 97       # its first byte is 97
assert b[1:3] == b'bc'

# We can then create a mutable byte-array...
mb = bytearray([97, 98, 99, 1])
# or: mb = bytearray(b'abc\x01')

# ...and modify it:
mb[0] = 100
mb.append(101)

# ...and get its content:
print(len(mb))
print(bytes(mb))
```
Python also provides all the necessary functions for converting characters to bytes and bytes to characters:

```python
# We can convert a string to its ASCII encoding...
s = 'abc'
e = s.encode('ascii')

# ...and convert it back to a string
s2 = e.decode('ascii')
assert s == s2

# Moreover, the function `chr(x)` gives the character whose
# ASCII code is `x` (as an integer). Conversely, the function
# `ord(c)` gives the ASCII code (as an integer) of the
# character `c`.
```
We are now ready to write binary versions of `huffman_encode` and `huffman_decode`:
- Write a function `huffman_bencode(tree, s)` that takes a Huffman tree `tree` and a string `s` and that returns its Huffman encoding as a byte-string. If the encoding is not even (i.e. if the encoding has a number of bits that is not a multiple of 8), the result should be padded with `0`s.
- Write a function `huffman_bdecode(tree, bs, n)` that takes a Huffman tree `tree`, a Huffman encoding `bs` (as a byte string) and an integer `n` and that decodes `bs` w.r.t. `tree`. The integer `n` represents the length (in characters) of the original message. It is necessary for two reasons:
  - because `huffman_bencode` pads its output with `0`s, without knowing the length of the original message, the decoder would not be able to decide whether the final `0`s correspond to actual characters or padding.
  - moreover, as we saw, the Huffman encoding of a string that is made from a single character is the empty bit-string, regardless of the length of the original string. Here again, we need the length of the original string to recreate it.
Ok, we are not far from the end. What remains is to (binary) encode the Huffman tree and to pack everything together. There are numerous ways of encoding a Huffman tree. Here is one possible way:

- the bit encoding of a leaf node is the bit `1` followed by the 8-bit long ASCII code of the character this node encodes (as returned by `ord`),
- the bit encoding of an internal node is the bit `0` followed by the encoding of its left child and its right child (in that order).
Finally, the binary encoding of a Huffman tree is simply the binary encoding of its root, padded with `0`s at the end if the encoding length (in bits) is not a multiple of 8.
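As a worked example of this scheme (using a hypothetical two-leaf tree, and representing bits as a `'0'`/`'1'` string for readability): a root whose left child is the leaf `a` and whose right child is the leaf `b` encodes as `0`, then `1` plus 8 bits for `a`, then `1` plus 8 bits for `b`, i.e. 19 bits, padded to 24:

```python
# hypothetical mini-tree: an internal root with two leaves 'a' and 'b'
bits = '0' + '1' + format(ord('a'), '08b') + '1' + format(ord('b'), '08b')
print(bits)        # 0101100001101100010
print(len(bits))   # 19

# pad with '0's up to the next multiple of 8
padded = bits + '0' * (-len(bits) % 8)
print(len(padded)) # 24
```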
- Write a function `huffman_bencode_tree(tree)` that takes a Huffman tree and returns its binary encoding, as described above.
We are now ready to write our final compression function. The encoding of a string `s` is the concatenation of:

- its length (on 4 bytes, using little endian). In Python, you can obtain this encoding using `int.to_bytes(x, 4, 'little')` where `x` is the integer to encode (you can decode it using `int.from_bytes(b, 'little')`).
- the binary encoding of the Huffman tree `t` obtained from the frequency analysis of `s`.
- the binary encoding of `s` under the tree `t`.
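For instance, the 4-byte little-endian length header for a message of a hypothetical length 1000 looks like this:

```python
n = 1000
header = int.to_bytes(n, 4, 'little')
print(header)    # b'\xe8\x03\x00\x00'  (1000 = 0x03E8, least significant byte first)
assert int.from_bytes(header, 'little') == n
```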
- Write a function `huffman_bcompress(s)` that takes a string `s`, compresses it as described just above, and returns the result of the compression.
- Compute the compression ratio for different strings (in terms of character sets and length; you can generate random strings here). What do you observe?
- Write a function `huffman_buncompress(s)` that takes a string `s` and uncompresses it. It must satisfy the equation `huffman_buncompress(huffman_bcompress(s)) = s` for every string `s`.