So you have some data you want to compress. Realistically, you’re gonna use gzip or zip or a library that implements those, but let’s assume you’re really stubborn and insist on implementing it yourself (weirdo). A bit of research or insight will reveal that there are two general ways in which data takes up more space than it needs to: one is repetition (‘aaaaaaab’ is significantly longer than ‘7a+b’) and the other is code length (each of those ‘a’s and ‘b’s takes up eight bits of memory, despite there being significantly more ‘a’s). A couple of smart guys (Lempel and Ziv) tackled the first problem back in ’77, Huffman had solved the second one decades earlier, and Phil Katz later combined the two into an algorithm called DEFLATE. We’re gonna implement a simple version of that.
Repeat compression
If you have the same pattern in the data more than once, it makes sense to only store it once, and then somehow tell the decompressor to reuse it whenever needed. For example, since I’m using the word ‘data’ all the time in this text, it’d make sense to come up with some shorter symbol to use instead. One early family of compressors (LZ78, and later LZW) did exactly that - going through the data, coming up with a symbol for each new pattern, and using those symbols whenever it saw the patterns repeat. The issue is that it makes a new symbol for every new combination of letters it finds, and quickly becomes overwhelmed with symbols. For short segments it still works out okay, but there is a better solution, and it’s the one LZ77 (and thus DEFLATE) uses: have a symbol that means ‘when you encounter me, go back X letters and repeat the next Y letters’, with X and Y encoded by the next two symbols in the encoded data (well, in practice you’d have multiple such symbols that tell you how many of the following symbols make up X and Y, but that’s mostly irrelevant for this example). Thus, each time we need to use the word ‘data’, we just tell the decompressor where to find the previous instance of that word and how long it is - and since we can do this in only three bytes, we end up using less data to encode it!
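To make the back-reference idea concrete, here is a toy decoder sketch in Python - the token format is made up purely for illustration, not DEFLATE’s actual format:

```python
# Toy illustration only: tokens are either ('lit', byte) or ('copy', distance, length).
# A real format packs these into bits/bytes, but the decoder's job is the same.
def decode(tokens):
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:  # 'copy': go back `distance` bytes and repeat `length` bytes
            _, distance, length = tok
            start = len(out) - distance
            for i in range(length):
                # copy byte by byte so overlapping copies work
                # (e.g. distance 1, length 7 repeats the last byte seven times)
                out.append(out[start + i])
    return bytes(out)

# 'data' appears twice; the second occurrence is a 4-byte copy from 9 bytes back
tokens = [('lit', ord(c)) for c in 'data and '] + [('copy', 9, 4)]
assert decode(tokens) == b'data and data'
```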
That might’ve been a bit hard to understand, so let me give another example. You are probably reading this on my website. Have you noticed how every webpage has the exact same menu to the left? I don’t actually store that menu in every single webpage - I store it once, and each page then has a little blurb at the beginning that says “hey, go fetch the menu from such-and-such URL, and display that first”. Since that blurb ends up being shorter than the menu itself, I am saving on space :3 (okay, that’s not exactly how it works (I use a Python script to assemble the pages, very helpful with the fanfic chapter lists), but it’s essentially the same idea.)
Implementation
First you’re gonna need a list of symbols that actually show up in the data. For text, you can usually use ASCII (unless you’re using Unicode, but then you have a wholly different tangled mess of problems to deal with); in my game I use the color palette (the game only uses 32 colors, so I only need 32 different symbols - or five bits). We will also need at least one symbol to signify repeats - sticking with ASCII as an example, it always leaves the uppermost bit unused, so we can set that bit to signify a repeat. The remainder of the symbol byte can be the length, and the next byte the offset (since repeats are usually shorter than they are apart) (also, unlike in the introductory example, we only use two bytes for this). We could also set up more symbols for different types of repeat encodings (e.g. one that only uses one byte for really short/close matches, or one that lets us have really distant or long ones), but the idea remains the same.
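A sketch of that token layout (the exact limits here are my own picks, not a fixed format):

```python
# A possible token layout following the scheme above; the limits are assumptions.
# Literal:  0xxxxxxx            (a plain ASCII byte, top bit clear)
# Repeat:   1LLLLLLL OOOOOOOO   (7 bits of length, then one byte of offset)
def emit_literal(out: bytearray, byte: int):
    assert byte < 0x80           # plain ASCII only, so the top bit stays free
    out.append(byte)

def emit_repeat(out: bytearray, offset: int, length: int):
    assert 3 <= length <= 0x7F   # shorter matches aren't worth the two bytes
    assert 1 <= offset <= 0xFF   # only one byte of offset in this toy format
    out.append(0x80 | length)    # top bit set marks a repeat
    out.append(offset)
```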
Then we’ll have to find matches. This is the most computationally intensive part of the algorithm. We need a list of lists. In the first list, we look up the current symbols (since we encode matches in two bytes, it only makes sense to encode matches longer than that - thus, the keys are all strings of length three), and each lookup gives us a list of all the places we’ve seen that pattern in the data before (as an offset from the beginning of the file, which we can subtract from the current offset to get the offset from the current position - which is what we encode). Usually, this is done as an array of linked lists - we use the array for fast lookup (or a sparse array - optimise this to taste), and the linked list is superb for insertions (more on that later) without sacrificing performance in traversal (“going through it item by item”). If we HAVE seen the pattern before, we figure out how long the match is. Once we have that, it’s a good idea to also (recursively) search for a match starting from the next character - sometimes that gives a better opportunity for compression which we might’ve missed if we took the first match as it was. This is referred to as “lazy matching”. Then we insert the corresponding code into the output stream (or the raw data if we didn’t find a match), and then we insert the current position into the lookup table - this way, if we come across this pattern again, we can repeat it as well. The linked lists above turn out to be relevant here, as they make it trivial to insert a new entry. More relevantly, we insert it at the beginning of the list - thus, it gets checked first, which keeps the offsets as short as possible.
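A minimal sketch of that match finder, with a plain Python dict of lists standing in for the array of linked lists (lazy matching and the offset limit are left out for brevity):

```python
from collections import defaultdict

MIN_MATCH, MAX_MATCH = 3, 127        # 127 so the length fits the 7-bit field above

def find_match(data, pos, heads):
    """Return (offset, length) of the best earlier match at `pos`, or None."""
    if pos + MIN_MATCH > len(data):
        return None
    key = bytes(data[pos:pos + MIN_MATCH])
    best = None
    for prev in heads[key]:          # newest positions first, so shortest offsets first
        length = 0
        while (length < MAX_MATCH and pos + length < len(data)
               and data[prev + length] == data[pos + length]):
            length += 1
        if best is None or length > best[1]:
            best = (pos - prev, length)
    return best

def remember(heads, data, pos):
    """Record that the three-byte pattern at `pos` exists (newest first, like a list head)."""
    if pos + MIN_MATCH <= len(data):
        heads[bytes(data[pos:pos + MIN_MATCH])].insert(0, pos)

# heads = defaultdict(list)
```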
Many implementations put a limit on how far back matches can be made. This is accomplished via a simple check when traversing the linked list (plus a step that deletes the remainder of the list if it is found to reach too far back). This is useful, as it limits the amount of data the computer needs to keep track of (in other words, it keeps RAM requirements down) - there are also a couple of other cool tricks in that vein (buffer swapping), but so far I have not seen a need for them, as all the data I’ve compressed comfortably fits within working memory when decompressed.
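Continuing the sketch above, that pruning step could look like this (the window size is an arbitrary pick, not a requirement):

```python
WINDOW_SIZE = 32 * 1024              # assumed window; anything further back is forgotten

def prune_chain(chain, pos):
    """Drop every candidate that has fallen out of the window.
    The chain is newest-first, so once one entry is too old, the rest are too."""
    for i, prev in enumerate(chain):
        if pos - prev > WINDOW_SIZE:
            del chain[i:]
            break
```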
Frequency compression
This one deals with compressing individual symbols. In the English language, the letter ‘e’ is the most frequent (about one in eight letters is an ‘e’), while ‘z’ is the least used (roughly once every 1350 letters). However, they both use eight bits on a computer. Wouldn’t it make more sense if ‘e’ used very few bits, and ‘z’ many? Morse code does a version of this - ‘e’ is a single dot, ‘t’ (the second most frequent) is a single dash, and ‘z’ is ‘--..’. It took about a century before a certain Huffman formalised the process - in other words, came up with an algorithm that can take any data and produce optimal codes for each symbol, usually represented as a binary tree. Huffman coding’s biggest flaw is that this tree also needs to be transmitted alongside the compressed data (often making it useless for shorter segments of data), but this can be optimised by sending an encoded version of the frequencies of each symbol, and having the receiver/decompressor reconstruct the tree from those. DEFLATE also offers the option of using a standard, predefined tree (its ‘fixed’ Huffman codes), completely circumventing the need to transmit one.
P.S. It is apparently not obvious how the encoder/decoder can use codes of different lengths without mixing them up. I thought the comparison with Morse code made that clear, but alas, I must elaborate. The ‘secret’ is in how the codes are assigned. For an easy example, let’s say we need three codes that need to be short, and thirteen that need to be long. We can then say that all codes that start with a 0 are only three bits long, but if a code starts with a 1, it takes five bits. Then we can assign the codes as follows:
000
001
010
10001
10010
10011
10100
...
11101
Since we know what codes we are using, and each code starts where the previous one ends, we always know where to start reading a code, and once we read its first bit, we know how long the code is - and thus where the next one begins.
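A small sketch of a decoder for this toy code - the symbols A, B, C, … are made up for the example, and the bits are kept as a string of ‘0’/‘1’ characters purely for readability:

```python
# Decoding the toy prefix code above: codes starting with 0 are 3 bits,
# codes starting with 1 are 5 bits.
CODES = {'000': 'A', '001': 'B', '010': 'C',
         '10001': 'D', '10010': 'E'}         # ...and so on, up to 11101

def decode_bits(bits):
    out, i = [], 0
    while i < len(bits):
        length = 3 if bits[i] == '0' else 5  # the first bit tells us the code length
        out.append(CODES[bits[i:i + length]])
        i += length
    return out

assert decode_bits('00110010000') == ['B', 'E', 'A']
```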
Implementation
The process starts with counting the occurrences of all symbols (shocking, I know). The tree is constructed by repeatedly taking the two entries with the fewest occurrences and joining them in a subtree (which is then inserted back into the list, having the combined occurrences of its children). It is relevant to have a set order to the children - traditionally, the right child has fewer occurrences, or, failing that, the higher lexicographical order. This is what makes the tree unique, allowing the receiver/decompressor to reconstruct it from frequencies alone.
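Here is a minimal sketch of that construction in Python, with heapq standing in for “the list”; the tie-breaking is simplified compared to the strict rule described above:

```python
import heapq
from collections import Counter

def build_tree(data):
    counts = Counter(data)
    # Heap entries are (count, tie_breaker, node); a node is either a symbol
    # or a (left, right) pair. The tie_breaker just keeps comparisons sane;
    # a strict ordering rule like the one above would go here if you need
    # everyone to build the exact same tree.
    heap = [(count, i, sym) for i, (sym, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)    # the two least frequent entries...
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next_id, (left, right)))  # ...joined as a subtree
        next_id += 1
    return heap[0][2]
```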
This is repeated until there is a single element left in the list, which is the resulting tree. As the tree itself is clunky to work with, it is used to calculate a list of codes and code lengths, and those are what the actual compression uses. Combine that with a bitpacker (which takes arbitrary-length codes and fits them into the eight-bits-per-byte limitation modern computers work with) and you’re good to go.
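A sketch of the code extraction plus a (string-based, thoroughly unoptimised) bitpacker - the left = 0, right = 1 convention is my own assumption:

```python
def assign_codes(tree, prefix='', table=None):
    """Walk the tree; the path to each leaf (left = '0', right = '1') is its code."""
    table = {} if table is None else table
    if isinstance(tree, tuple):                  # internal node: recurse into children
        assign_codes(tree[0], prefix + '0', table)
        assign_codes(tree[1], prefix + '1', table)
    else:                                        # leaf: the accumulated prefix is the code
        table[tree] = prefix
    return table

def pack_bits(codes):
    """Join the variable-length codes and pack them into bytes (last byte zero-padded)."""
    bits = ''.join(codes)
    bits += '0' * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# usage: table = assign_codes(build_tree(data)); packed = pack_bits(table[s] for s in data)
```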
Sending the frequency data (to reconstruct the tree) is made trivial by the fact that we don’t need the exact frequencies/counts of all symbols, only the bit lengths of their codes - those encode their relative frequencies to exactly the level of precision needed to recalculate the codes. The DEFLATE spec also does some shenanigans with how those lengths are encoded (believe it or not, it’s another Huffman code!) - I’ll update this when I get through that.
Recreating the tree for decoding is a bit more involved. Symbols are sorted by code length, with ties broken lexicographically (this is where the ordering from the earlier paragraph becomes relevant), and then assigned codes by simply counting upwards, shifting left whenever the code length increases. This is where an example is more helpful than an explanation: if we have a2 b3 c2 d3 e2 (symbol and code length), we sort them to a2 c2 e2 b3 d3, and enumerate them as a=00, c=01, e=10, b=110, d=111 - since e was code 10, the remaining (longer) codes must start with 11.

Just like with encoding, using an actual tree is the slowest and clunkiest option. Instead, we once again use a table of codes and their lengths (remember to advance the code according to the length of the previous one - this ends up being messy, because bitpackers are not that simple to run in reverse). Matching is a bit more complicated - we either match against each entry only up to its length (meaning we cannot use a plain array lookup), or we list every code with all possible fillings of its unused bits (reusing the above example, we’d use a table of length 8 (i.e. matching three bits at a time), and have the entries 100 and 101 both point to e), which results in a huge table. Luckily, we can make things a bit more complicated but work a lot better - we can intermix tables: use a four-entry table (i.e. two bits), and have the last entry, 11, point to another table which matches the remaining bits of the longer codes. This adds a bit of complexity, but ends up having the best of both worlds.
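Here is a sketch of both the reconstruction from code lengths and the “huge table” variant, using the a/b/c/d/e example above:

```python
def canonical_codes(lengths):
    """Rebuild codes from {symbol: code length}, exactly as described above."""
    codes, code, prev_len = {}, 0, 0
    # sort by code length, ties broken lexicographically, then count upwards,
    # shifting left whenever the code length grows
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len
        codes[sym] = format(code, '0{}b'.format(length))
        code += 1
        prev_len = length
    return codes

def full_table(codes, max_len):
    """The 'huge table' variant: every possible padding of a code maps to (symbol, length)."""
    table = {}
    for sym, code in codes.items():
        free = max_len - len(code)
        for pad in range(2 ** free):
            suffix = format(pad, '0{}b'.format(free)) if free else ''
            table[code + suffix] = (sym, len(code))
    return table

codes = canonical_codes({'a': 2, 'b': 3, 'c': 2, 'd': 3, 'e': 2})
assert codes == {'a': '00', 'c': '01', 'e': '10', 'b': '110', 'd': '111'}
assert full_table(codes, 3)['100'] == ('e', 2) and full_table(codes, 3)['101'] == ('e', 2)
```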
Sources
- An article that covers more or less everything listed here, without saying how to actually do it
- Very useful PDF
- Very very useful site regarding implementation, even though it has some superfluous explanations before that.
- The Wikipedia articles turned out to be rather useless, since they describe what is done, but not how.
(written 25032024)