A simple implementation of byte-pair encoding

Friday 7 May 2010

Byte pair encoding is a simple but (sometimes) effective algorithm for compressing data (especially textual data). Best of all, it is very easy to understand! Therefore, in this experiment I have created a simple Python script that implements byte pair encoding on files.

Basically, byte pair encoding works by finding the most common pair of bytes in a file, and replacing them with a single unused byte, at the same time storing that association in a dictionary. It then does this over and over again until there are either no pairs that occur more than two times or no unused characters left.

This script doesn’t implement all the concepts described in the original paper on the encoding method (Gage 1994), but should still provide a nice example of how it works.

Note that while this method of encoding may sound cool, it has its limitations. Think about binary files (e.g. images, programs, and pretty much all other files besides plain-text ones) – how much room would there be for additional compression through byte pairs? Try it!