Huffman Codes

Posted on March 27, 2015

Firstly, this is an example of a greedy algorithm. We can represent data in files by using binary character code in which each character is represented by a unique binary string, which we call as a codeword. We may use a fixed-length code or a variable-length code. Typically, variable-length code is minimizes the amount of data required to store a file.

Prefix Codes

In this case, no codeword is a prefix of another (and hence the name prefix codes). One can show that a prefix code can always achieve the optimal data compression among any character code. Since no code is a prefix of another, decoding is easy—read the code till you get a match.

We see that a binary tree whose leaves are the given characters provide a nice representation of prefix codes. The binary representation of the prefix code can be interpreted in the following way: if a one is found it means to go to the right child and if a zero is found, go to the left child.

One can show that an optimal code for a file is a always represented by a full binary tree (a full binary tree is a tree in which every non-leaf node has two children).

If \(C\) is the alphabet from which characters are drawn, and all character frequencies are positive, then the tree for an optimal prefix code has exactly \(\vert, C\vert\) leaves, one for each letter of the alphabet and exactly \(\vert C\vert -1\) internal nodes.

Given a tree \(T\) corresponding to a prefix code, we can compute the number of bits required to encode a file. For each character \(c\) in the alphabet \(C\), let \(c.freq\) denote the frequency of \(c\) in the file and \(d_T(c)\) denote the depth of \(c\)’s leaf in the tree. Thus the number of bits required to encode a file is thus

\[B(T) = \sum_{c\in C} c.freq \cdot d_T(C)\]

Constructing a Huffman code

The algorithm proceeds as follows

Modelling algorithm using naive recursion

Using naive recursion is trivial and the running time take \(\mathcal O n^2\). But this can be improved.

Modelling the algorithm using a min heap

Let the list, which we referred to in the last paragraph be \(Q\), we use a min-heap to model the list. We shall do the extract-min operation twice and then insert the element which is the sum of two elements into the heap.

The running time will then be \(\mathcal O (n\lg n)\), where there are \(n\) calls to the \(Q\) and at each call, BUILD-MIN-HEAP is to be called.

Modelling algorithm using two Queues

If the elements are sorted (or if sorting is done as a pre-processing step), we can design a \(\mathcal O(n)\) algorithm (See Wikipedia). The following is a brief outline of the method:

It should be noted that the assumption that the elements are sorted in increasing order is important for proving the correctness of the above method.


Correctness of Huffman’s Greedy algorithm


Notes