In this article we'll go over how the run-length encoding algorithm works, what it's used for, and how to implement its encode and decode functions in Python.
Run-length encoding (RLE) is a very simple form of data compression in which a stream of data is given as the input (i.e. "AAABBCCCC") and the output is a sequence of counts of consecutive data values in a row (i.e. "3A2B4C"). This type of data compression is lossless, meaning that when decompressed, all of the original data will be recovered when decoded. Its simplicity in both the encoding (compression) and decoding (decompression) is one of the most attractive features of the algorithm.
Here you can see a simple example of a stream ("run") of data in its original form and encoded form:
Input data:
AAAAAAFDDCCCCCCCAEEEEEEEEEEEEEEEEE
Output data:
6A1F2D7C1A17E
In this example we were able to compress the data from 34 characters down to 13.
As you may have noticed, the more consecutive values in a row, the more space we save in the resulting compression. On the other hand, if you have a sequence of data that frequently changes between values (i.e. "BEFEFADED") then we won't save much space at all. In fact, we could even increase the size of our data since a single instance of a character results in 2 characters (i.e. "A" becomes "1A") in the output of the encoding.
Because of this, RLE is only good for certain types of data and applications. For example, the Pixy camera, which is a robotics camera that helps you easily track objects, uses RLE to compress labeled video data before transferring it from the embedded camera device to an external application. Each pixel is given a label of "no object", "object 1", "object 2", etc. This is the perfect encoding for this application because of its simplicity, speed, and ability to compress the low-entropy label data.
Encoding
In order to encode a string of data, your code will need to loop through each character of the data and count the occurrences. Once you see a character that is different from the previous character, you will append the number of occurrences and the character to your encoding.
Below you'll find a simple implementation in Python:
# rle-encode.py
def rle_encode(data):
encoding = ''
prev_char = ''
count = 1
if not data: return ''
for char in data:
# If the prev and current characters
# don't match...
if char != prev_char:
# ...then add the count and character
# to our encoding
if prev_char:
encoding += str(count) + prev_char
count = 1
prev_char = char
else:
# Or increment our counter
# if the characters do match
count += 1
else:
# Finish off the encoding
encoding += str(count) + prev_char
return encoding
From the comments you should be able to tell what is going on throughout the code. If not, it would be a good exercise to run through the code with a debugger and see it in action.
Continuing with the same file as above, here is an example of the code being executed:
encoded_val = rle_encode('AAAAAAFDDCCCCCCCAEEEEEEEEEEEEEEEEE')
print(encoded_val)
And the output:
$ python rle-encode.py
6A1F2D7C1A17E
Decoding
Decoding an RLE-encoded stream of data is actually even easier than encoding it. Like before, you iterate through the data stream one character at a time. If you see a numeric character then you add it to your count
, and if you see a non-numeric character then you add count
of those characters to your decoding, which is returned to the caller once you iterate through all of the inputdata
.
Here is the algorithm implemented in Python:
# rle-decode.py
def rle_decode(data):
decode = ''
count = ''
for char in data:
# If the character is numerical...
if char.isdigit():
# ...append it to our count
count += char
else:
# Otherwise we've seen a non-numerical
# character and need to expand it for
# the decoding
decode += char * int(count)
count = ''
return decode
We can run this code on the same output that we got from our encoding:
Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!
decoded_val = rle_decode('6A1F2D7C1A17E')
print(decoded_val)
And the output is the same as our original input to the encoding function:
$ python rle-decode.py
AAAAAAFDDCCCCCCCAEEEEEEEEEEEEEEEEE
Note that this implementation doesn't do any error checking to ensure that we have a valid RLE data stream. If any of the input data isn't formatted properly then you'll likely encounter an error.