Introduction
Writing text is a creative process that is based on thoughts and ideas which come to our mind. The way that the text is written reflects our personality and is also very much influenced by the mood we are in, the way we organize our thoughts, the topic itself and by the people we are addressing it to - our readers.
In the past it happened that two or more authors had the same idea, wrote it down separately, published it under their name and created something that was very similar. Prior to electronic publications their ideas took a while to circulate and therefore led to conflicts about the real inventor and who should be the one to be honored for it.
Today, every article is immediately available online in a digital format. Online articles are indexed correctly and linked to other documents, which makes it easy to find them quickly. On the one hand this way of working simplifies the exchange of ideas as well as the research about a topic but on the other hand the accessibility opens doors to just copy and paste others work without permission or acknowledging them, called plagiarism.
At this point methods come into play that deal with the similarity of different texts. The main idea behind this is to be able to answer the questions if two texts (or datasets in general) are entirely or at least partly similar, if they are related to each other in terms of the same topic and how many edits have to be done to transform one text to the other.
As an example, this technology is used by information retrieval systems, search engines, automatic indexing systems, text summarizers, categorization systems, plagiarism checkers, speech recognition, rating systems, DNA analysis, and profiling algorithms (IR/AI programs to automatically link data between people and what they do).
Search and Comparison Methods
All of us are familiar with searching a text for a specified word or character sequence (pattern). The goal is to either find the exact occurrence (match) or to find an in-exact match using characters with a special meaning, for example by regular expressions or by fuzzy logic. Mostly, it is a sequence of characters that is similar to another one.
Furthermore, the similarity can be measured by the way words sound -- do they sound similar but are written in a different way? Translations from one alphabet to another often gives more than one result depending on the language, so to find relatives based on the different spellings of their surname and name the Soundex algorithm was created and is still one of the most popular and widespread ones today.
Last but not least, how many changes (edits) are necessary to get from one word to the other? The less edits to be done the higher is the similarity level. This category of comparison contains the Levenshtein distance that we will focus on in more detail below.
Table 1 covers a selection of ways to search and compare text data. The right column of the table contains a selection of the corresponding Python modules to achieve these tasks.
Category | Method or Algorithm | Python packages |
---|---|---|
Exact search | Boyer-Moore string search, Rabin-Karp string search, Knuth-Morris-Pratt (KMP), Regular Expressions | string, re, Advas |
In-exact search | bigram search, trigram search, fuzzy logic | Fuzzy |
Phonetic algorithms | Soundex, Metaphone, Double Metaphone, Caverphone, NYSIIS, Kölner Phonetik, Match Rating codex | Advas, Fuzzy, jellyfish, phonetics, kph |
Changes or edits | Levenshtein distance, Hamming distance, Jaro distance, Jaro-Winkler distance | editdistance, python-Levenshtein, jellyfish |
Table 1
The Levenshtein Distance
This method was invented in 1965 by the Russian Mathematician Vladimir Levenshtein (1935-2017). The distance value describes the minimal number of deletions, insertions, or substitutions that are required to transform one string (the source) into another (the target). Unlike the Hamming distance, the Levenshtein distance works on strings with an unequal length.
The greater the Levenshtein distance, the greater the difference between the strings. For example, from "test" to "test" the Levenshtein distance is 0 because both the source and target strings are identical. No transformations are needed. In contrast, from "test" to "team" the Levenshtein distance is 2 - two substitutions have to be done to turn "test" into "team".
Here is a great video explaining how the algorithm works:
Implementing Levenshtein Distance in Python
For Python, there are quite a few different implementations available online [9,10] as well as from different Python packages (see table above). This includes versions following the Dynamic programming concept as well as vectorized versions. The version we show here is an iterative version that uses the NumPy package and a single matrix to do the calculations. As an example we would like to find out the edit distance between "test" and "text".
It starts with an empty matrix that has the size of the length of the strings. Both the first row and column, starting from zero, are indexed increasingly:
t e s t
[[ 0. 1. 2. 3. 4.]
t [ 1. 0. 0. 0. 0.]
e [ 2. 0. 0. 0. 0.]
x [ 3. 0. 0. 0. 0.]
t [ 4. 0. 0. 0. 0.]]
Next, two loops follow to compare the strings letter by letter - row-wise, and column-wise. If two letters are equal, the new value at position [x, y]
is the minimum between the value of position [x-1, y] + 1
, position [x-1, y-1]
, and position [x, y-1] + 1
.
[+0.] [+1.]
[+1.] [ ]
Otherwise, it is the minimum between the value of position [x-1, y] + 1
, position [x-1, y-1] + 1
, and position [x, y-1] + 1
. Again, this can be visualized as a two by two sub-matrix where you are calculating the missing value in the bottom right position as below:
Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!
[+1.] [+1.]
[+1.] [ ]
Note there are three possible types of change if the two characters are different - insert, delete and substitute. Finally, the matrix looks as follows:
t e s t
[[ 0. 1. 2. 3. 4.]
t [ 1. 0. 1. 2. 3.]
e [ 2. 1. 0. 1. 2.]
x [ 3. 2. 1. 1. 2.]
t [ 4. 3. 2. 1. 1.]]
The edit distance is the value at position [4, 4] - at the lower right corner - which is 1, actually. Note that this implementation is in O(N*M)
time, for N
and M
the lengths of the two strings. Other implementations may run in less time but are more ambitious to understand.
Here is the corresponding code for the Levenshtein distance algorithm I just described:
import numpy as np
def levenshtein(seq1, seq2):
size_x = len(seq1) + 1
size_y = len(seq2) + 1
matrix = np.zeros ((size_x, size_y))
for x in xrange(size_x):
matrix [x, 0] = x
for y in xrange(size_y):
matrix [0, y] = y
for x in xrange(1, size_x):
for y in xrange(1, size_y):
if seq1[x-1] == seq2[y-1]:
matrix [x,y] = min(
matrix[x-1, y] + 1,
matrix[x-1, y-1],
matrix[x, y-1] + 1
)
else:
matrix [x,y] = min(
matrix[x-1,y] + 1,
matrix[x-1,y-1] + 1,
matrix[x,y-1] + 1
)
print (matrix)
return (matrix[size_x - 1, size_y - 1])
References
- [1] Python re module
- [2] Python Levenshtein module
- [3] Python editdistance module
- [4] Python AdvaS module
- [5] Python fuzzy module
- [6] Python jellyfish module
- [7] Python phonetics module
- [8] Python kph module
- [9] https://www.python-course.eu/levenshtein_distance.php
- [10] https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
Acknowledgements
The author would like to thank
Axel Beckert, Mandy Neumeyer, Gerold Rupprecht and Zoleka Hatitongwe for their support while preparing the article.