Run-Length Encoding

In this article we'll go over how the run-length encoding algorithm works, what it's used for, and how to implement its encode and decode functions in Python.

Run-length encoding (RLE) is a very simple form of data compression in which a stream of data is given as the input (i.e. "AAABBCCCC") and the output is a sequence of counts of consecutive data values in a row (i.e. "3A2B4C"). This type of data compression is lossless, meaning that when decompressed, all of the original data will be recovered when decoded. Its simplicity in both the encoding (compression) and decoding (decompression) is one of the most attractive features of the algorithm.

Here you can see a simple example of a stream ("run") of data in its original form and encoded form:

Input data:


Output data:


In this example we were able to compress the data from 34 characters down to 13.

As you may have noticed, the more consecutive values in a row, the more space we save in the resulting compression. On the other hand, if you have a sequence of data that frequently changes between values (i.e. "BEFEFADED") then we won't save much space at all. In fact, we could even increase the size of our data since a single instance of a character results in 2 characters (i.e. "A" becomes "1A") in the output of the encoding.

Because of this, RLE is only good for certain types of data and applications. For example, the Pixy camera, which is a robotics camera that helps you easily track objects, uses RLE to compress labeled video data before transferring it from the embedded camera device to an external application. Each pixel is given a label of "no object", "object 1", "object 2", etc. This is the perfect encoding for this application because of its simplicity, speed, and ability to compress the low-entropy label data.


In order to encode a string of data, your code will need to loop through each character of the data and count the occurrences. Once you see a character that is different from the previous character, you will append the number of occurrences and the character to your encoding.

Below you'll find a simple implementation in Python:


def rle_encode(data):  
    encoding = ''
    prev_char = ''
    count = 1

    if not data: return ''

    for char in data:
        # If the prev and current characters
        # don't match...
        if char != prev_char:
            # ...then add the count and character
            # to our encoding
            if prev_char:
                encoding += str(count) + prev_char
            count = 1
            prev_char = char
            # Or increment our counter
            # if the characters do match
            count += 1
        # Finish off the encoding
        encoding += str(count) + prev_char
        return encoding

From the comments you should be able to tell what is going on throughout the code. If not, it would be a good exercise to run through the code with a debugger and see it in action.

Continuing with the same file as above, here is an example of the code being executed:


And the output:

$ python


Decoding an RLE-encoded stream of data is actually even easier than encoding it. Like before, you iterate through the data stream one character at a time. If you see a numeric character then you add it to your count, and if you see a non-numeric character then you add count of those characters to your decoding, which is returned to the caller once you iterate through all of the inputdata.

Here is the algorithm implemented in Python:


def rle_decode(data):  
    decode = ''
    count = ''
    for char in data:
        # If the character is numerical...
        if char.isdigit():
            # ...append it to our count
            count += char
            # Otherwise we've seen a non-numerical
            # character and need to expand it for
            # the decoding
            decode += char * int(count)
            count = ''
    return decode

We can run this code on the same output that we got from our encoding:

decoded_val = rle_decode('6A1F2D7C1A17E')  

And the output is the same as our original input to the encoding function:

$ python

Note that this implementation doesn't do any error checking to ensure that we have a valid RLE data stream. If any of the input data isn't formatted properly then you'll likely encounter an error.