Huffman Trees/Codes - Why are they different?

2024-11-19 Software-Craftmanship Augmented-by-AI Functional-Programming Nerdy Scala

Right now I am solving the 99-scala-problems. Again!!!

This time around I am using them to learn more about what it means to be an Augmented Software Engineer (ASE). Just subscribe to the podcast to follow along.

P50 is to build a Huffman tree and to return the set of Huffman codes for the given string.

codeium found a good implementation, but struggled to find tests that worked.

I then tried various websites ([1], [2], [3], [4]) to test various strings and noticed that the trees and the resulting set of codes are different between the implementations. Even more interesting for the same implementation the set is different when you reverse the string.

That felt strange to me. My expectation was that the same set of (character, frequency) tuples will always create the same set of Huffman codes.

So I started digging. Here is what chatGPT had to say about it …

You're absolutely correct to ask this question. When there are multiple
nodes with the same weight in the priority queue, the behavior can indeed
depend on how the priority queue resolves ties. This can result in different
tree structures, which could lead to different Huffman codes, even though
all such codes would still represent valid Huffman encodings (i.e., they
would satisfy the prefix-free property and correspond to the correct
frequencies).

So it seems that Huffman codes can be different for the same set of (character, frequency) pairs, because there is an edge case where you need to dequeue the top two nodes from the priority queue with the lowest weight. If there are three or more nodes that have the same lowest weight it depends on the order the nodes where put into the queue and the internal implementation of the queue, which two of the three or more nodes are the top nodes that get dequeued.

To address this and to make the set of generated Huffman codes deterministic I changed the implementation to keep track of the largest character in the sub-tree and sort the queue by weight(descending) and maxChar(ascending) and use the guaranteed-to-be-unique maxChar as a tie-breaker for the sorting of the priority queue. This will ensure that we always dequeue the same two nodes, regardless of the order the nodes were put into the queue and regardless of the internal implementation of the queue.

Note: This would (obviously) also work with minChar as a tie-breaker. Your call.

That way we will always get the same set of Huffman codes for the same set of frequencies (means huffman(text) == huffman(text.reverse), which is not true for most of the implementations out there.

That was an interesting one.

And it drives home one of the main/major points that you need to consider as an Augmented Software Engineer: The code you get is a suggestion. A starting point. You are well-advised to reason about it and dig into it, if/when you do not understand something.

I was very close to moving on to the next problem, because getting the first/initial implementation and the first/initial tests work was a very fast and smooth experience (with the help of codeium).

It was only when I started to question the implementation and the test against my own understanding, that I realized that something is not quite right.

With that additional step/work we are now in a good place: We saved some time to get the initial implementation done, but I also have a much deeper understanding how Huffman trees work and how to implement them so that they are more deterministic.

Feel free to check out the code.