Huffman Codes
Firstly, this is an example of a greedy algorithm. We can represent data in files by using a binary character code in which each character is represented by a unique binary string, which we call a codeword. We may use a fixed-length code or a variable-length code. Typically, a variable-length code minimizes the amount of data required to store a file.
Prefix Codes
In this case, no codeword is a prefix of another (hence the name prefix codes). One can show that a prefix code can always achieve the optimal data compression among character codes. Since no codeword is a prefix of another, decoding is easy: read bits until you match a codeword, emit the corresponding character, and continue.
We see that a binary tree whose leaves are the given characters provides a nice representation of prefix codes. The binary representation of a codeword can be interpreted as a walk from the root: a zero means go to the left child, and a one means go to the right child.
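A minimal decoding sketch in Python, under the assumption that the tree is given as nested pairs where a leaf is a one-character string and an internal node is a (left, right) tuple (this representation and the name `decode` are my own choices, not from the text):

```python
def decode(bits, root):
    """Decode a bit string by walking the code tree.

    The tree is nested pairs: a leaf is a one-character string, and an
    internal node is a (left_subtree, right_subtree) tuple.
    """
    out, node = [], root
    for bit in bits:
        node = node[0] if bit == "0" else node[1]  # 0: left child, 1: right child
        if isinstance(node, str):                  # reached a leaf: a codeword matched
            out.append(node)
            node = root                            # restart at the root
    return "".join(out)
```

For example, `decode("00011", (("a", "b"), "c"))` returns `"abc"`, since no codeword is a prefix of another.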
One can show that an optimal code for a file is always represented by a full binary tree (a full binary tree is a tree in which every non-leaf node has two children).
If \(C\) is the alphabet from which characters are drawn, and all character frequencies are positive, then the tree for an optimal prefix code has exactly \(\vert C\vert\) leaves, one for each letter of the alphabet, and exactly \(\vert C\vert - 1\) internal nodes.
Given a tree \(T\) corresponding to a prefix code, we can compute the number of bits required to encode a file. For each character \(c\) in the alphabet \(C\), let \(c.freq\) denote the frequency of \(c\) in the file and \(d_T(c)\) denote the depth of \(c\)’s leaf in the tree. The number of bits required to encode the file is then
\[B(T) = \sum_{c\in C} c.freq \cdot d_T(c)\]
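For example (the frequencies here are made up), if \(a\), \(b\), \(c\) occur \(5\), \(3\), and \(2\) times and the tree places \(a\) at depth \(1\) and \(b\) and \(c\) at depth \(2\), then \(B(T) = 5\cdot 1 + 3\cdot 2 + 2\cdot 2 = 15\) bits.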
Constructing a Huffman code
The algorithm proceeds as follows:
- It first selects the two elements with minimum frequency, (virtually) combines the two of them, and then (virtually) adds the combined element back into the list, its frequency being the sum of the frequencies of the two elements selected. The added element can be visualized as a tree whose leaf nodes are the two elements selected and whose root contains the sum of their frequencies.
- The above process is repeated until only one element, the root of the Huffman tree, is left in the list (a small trace follows this list).
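For instance, with the made-up frequencies \(a\colon 1\), \(b\colon 2\), \(c\colon 4\), the algorithm first combines \(a\) and \(b\) into an element of frequency \(3\), then combines that element with \(c\) into the root of frequency \(7\); with the convention that the smaller-frequency subtree goes to the left, \(a\) gets codeword \(00\), \(b\) gets \(01\), and \(c\) gets \(1\).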
Modelling the algorithm using naive recursion
Using naive recursion is trivial: finding the two minimum-frequency elements takes \(\mathcal O(n)\) per step, and there are \(n-1\) merge steps, so the running time is \(\mathcal O(n^2)\). But this can be improved.
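A minimal Python sketch of this quadratic approach (the name `huffman_naive` and the nested-tuple tree representation are my own choices; the two linear scans for the minima are what cost \(\mathcal O(n)\) per merge):

```python
def huffman_naive(freqs):
    """Build a Huffman tree by repeatedly scanning for the two minima.

    freqs: dict mapping character -> frequency.
    Returns the tree as nested tuples: a leaf is a character,
    an internal node is a (left_subtree, right_subtree) pair.
    """
    nodes = [(f, ch) for ch, f in freqs.items()]
    while len(nodes) > 1:
        # Two O(n) scans per merge, over n - 1 merges: O(n^2) overall.
        first = min(nodes, key=lambda node: node[0])
        nodes.remove(first)
        second = min(nodes, key=lambda node: node[0])
        nodes.remove(second)
        nodes.append((first[0] + second[0], (first[1], second[1])))
    return nodes[0][1]
```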
Modelling the algorithm using a min-heap
Let \(Q\) be the list referred to in the last paragraph; we model it with a min-heap. At each step we perform the extract-min operation twice and then insert the element whose frequency is the sum of the two extracted elements back into the heap.
BUILD-MIN-HEAP is called once, in \(\mathcal O(n)\) time, to initialize \(Q\); after that, each of the \(\mathcal O(n)\) extract-min and insert operations on \(Q\) costs \(\mathcal O(\lg n)\), so the running time is \(\mathcal O(n\lg n)\).
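A sketch using Python's heapq module (the name `huffman_heap` is my own; a counter is included as a tie-breaker so that tuples containing subtrees, which are not comparable, never get compared):

```python
import heapq
import itertools

def huffman_heap(freqs):
    """O(n lg n) Huffman construction; freqs maps character -> frequency."""
    counter = itertools.count()  # tie-breaker: subtrees themselves are not comparable
    q = [(f, next(counter), ch) for ch, f in freqs.items()]
    heapq.heapify(q)             # BUILD-MIN-HEAP, O(n), called once
    while len(q) > 1:
        # Two EXTRACT-MIN operations and one INSERT, each O(lg n).
        f1, _, t1 = heapq.heappop(q)
        f2, _, t2 = heapq.heappop(q)
        heapq.heappush(q, (f1 + f2, next(counter), (t1, t2)))
    return q[0][2]               # the root of the Huffman tree
```

Walking the returned tree and emitting a zero for each left branch and a one for each right branch recovers the codewords, exactly as in the decoding sketch above.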
Modelling the algorithm using two queues
If the elements are sorted (or if sorting is done as a pre-processing step), we can design an \(\mathcal O(n)\) algorithm (see Wikipedia). The following is a brief outline of the method:
- Enqueue the elements into the first queue in increasing order of frequency;
- Dequeue the two smallest elements overall; since both queues are kept in increasing order, these are either the first two elements of the first queue, the first two elements of the second queue, or the front element of each.
- Combine them into a new element whose frequency is the sum of their frequencies, and enqueue it into the second queue.
- Repeat the process until there is one element left.
It should be noted that the assumption that the elements are sorted in increasing order is important for proving the correctness of the above method: each newly combined frequency is at least as large as every frequency combined before it, so the second queue also remains in increasing order.
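A sketch of the two-queue method (the name `huffman_two_queues` is my own; the key invariant, noted above, is that merged frequencies are produced in non-decreasing order, so the second queue stays sorted for free):

```python
from collections import deque

def huffman_two_queues(sorted_freqs):
    """Linear-time Huffman construction.

    sorted_freqs: a list of (frequency, character) pairs, already sorted
    in increasing order of frequency (the pre-processing assumption).
    Trees are nested tuples, as in the earlier sketches.
    """
    q1 = deque(sorted_freqs)   # leaves, in increasing frequency order
    q2 = deque()               # merged nodes; stays sorted automatically

    def pop_min():
        # The overall minimum sits at the front of q1 or q2.
        if not q2 or (q1 and q1[0][0] <= q2[0][0]):
            return q1.popleft()
        return q2.popleft()

    while len(q1) + len(q2) > 1:
        f1, t1 = pop_min()
        f2, t2 = pop_min()
        q2.append((f1 + f2, (t1, t2)))  # sums are non-decreasing, so q2 stays sorted
    return (q1 or q2)[0][1]
```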
Correctness of Huffman’s greedy algorithm
- A code tree that is not full is not optimal: an internal node with only one child can be contracted, shortening every codeword in its subtree.
- Let \(T\) be the tree we get from Huffman’s algorithm and \(T'\) be the tree for an optimal prefix code. Let \(a\) and \(b\) be the two elements with the least frequencies. By exchanging \(a\) and \(b\) with the pair of sibling leaves of greatest depth in \(T'\), we can only improve (never increase) the average code length of the tree; hence some optimal tree has \(a\) and \(b\) as sibling leaves of greatest depth.
- Let the inductive hypothesis be the following: \(p(n)\): for an \(n\)-element alphabet, the Huffman algorithm produces an optimal prefix code.
- The base case, \(n=1\), is trivially true.
- Suppose \(p(n-1)\) is true, and let \(T_1\) be the tree generated for the alphabet \((C-\{a,b\}) \cup \{ab\}\), where the merged character \(ab\) has frequency \(a.freq + b.freq\). Then \(B(T) = B(T_1) + a.freq + b.freq\) (see the computation below), which must be the minimum possible, since \(B(T_1)\) is minimum by the inductive hypothesis.
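To see why the cost decomposes this way, suppose \(a\) and \(b\) are sibling leaves at depth \(d\) in \(T\); in \(T_1\) their parent, the merged character, is a leaf at depth \(d-1\), and every other character has the same depth in both trees. Hence

\[B(T) - B(T_1) = (a.freq + b.freq)\, d - (a.freq + b.freq)(d-1) = a.freq + b.freq.\]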
Notes
- The same algorithm can be used for a ternary prefix code, where each internal node of the tree has three children (For a proof, see here).
- If the frequencies of the alphabet are the first \(n\) Fibonacci numbers, then the tree is completely skewed: each merge combines the accumulated subtree with the next Fibonacci number, so the codeword lengths are \(1, 2, \ldots, n-2, n-1, n-1\), from the most frequent character down to the two least frequent.
- One can show that a binary tree that is not full cannot correspond to an optimal prefix code.
- Suppose that the relative frequency of the most frequent character is less than or equal to \(0.33\); then every codeword in the Huffman code of the alphabet has length at least two bits.
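As a quick sanity check of the Fibonacci note, one can reuse the hypothetical `huffman_heap` sketch from above to print codeword lengths (the helper `codeword_lengths` is likewise my own):

```python
def codeword_lengths(tree, depth=0):
    """Map each character to its depth in the tree, i.e. its codeword length."""
    if isinstance(tree, str):                     # a leaf is a character
        return {tree: depth}
    left, right = tree
    lengths = codeword_lengths(left, depth + 1)
    lengths.update(codeword_lengths(right, depth + 1))
    return lengths

fib = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 5, "f": 8, "g": 13, "h": 21}
print(codeword_lengths(huffman_heap(fib)))
# depths: h->1, g->2, f->3, e->4, d->5, c->6, b->7, a->7 (completely skewed)
```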