Phonetic Similarity of Words: A Vectorized Approach in Python

In an earlier article I gave you an introduction into phonetic algorithms, and shows their variety. In more detail we had a look at the edit distance, which is also known as the Levenshtein Distance. This algorithm was developed in order to calculate the number of letter substitutions to get from one word to the next.

As you may have already noted in the previous article, there are different methods to calculate the sound of a word like Soundex, Metaphone, and the Match Rating codex. Some of them are more common than others. As an example, an implementation of Soundex is part of every programming language as well as Database Management Systems (DBMS) like Oracle, MySQL, and PostgreSQL. In contrast both Metaphone and the Match Rating codex are rarely used, and in most cases require additional software libraries to be installed on your system.

Seen as a proposal, this article demonstrates how to combine different phonetic algorithms in a vectorized approach, and to use their peculiarities in order to achieve a better comparison result than using the single algorithms separately. To implement this, the Python-based library named AdvaS Advanced Search on SourceForge comes into play. AdvaS already includes a method in order to calculate several phonetic codes for a word in a single step.

Phonetic Algorithms Explained

To be more precise, each of these algorithms creates a specific phonetic representation of a single word. Usually, such a representation is either a fixed-length, or a variable-length string that consists of only letters, or a combination of both letters and digits. The detailed structure of the representation depends on the algorithm. Actually, if two representations - calculated using the same algorithm - are similar the two original words are pronounced in the same way no matter how they are written. In reality, this helps to detect similar-sounding words even if they are spelt differently - no matter if done on purpose, or by accident.

Each of these algorithms were designed with a certain language or purpose in mind, and do not fit in each others languages in exactly the same way. Keep in mind that the representations are not always optimal but intended to fit as close as possible. As an example, the original Soundex algorithm focuses on the English language, whereas the Kölner Phonetik focuses on the German language, which contains umlauts, and other special characters like an "ß".

Next, we will have a short look at a selection of phonetic algorithms. For a more detailed description follow the links given below. Be warned that the level of documentation of the algorithms is quite different - from very detailed to quite sparse.

Soundex

The resulting representation from the Soundex algorithm is a four letter word. This is based on a character followed by three numerical digits. As an example, the Soundex value of "Knuth" is K530 which is similar to "Kant". This simplicity leads to quite a few misleading representations. Although, in general the results are quite good. Originally designed for American English, Soundex is today available in different language-specific versions like French, German, and Hebrew.

Developed by Robert C. Russell and Margaret King Odell at the beginning of the 20th century, Soundex was designed with the English language in mind. It was widely used to detect similar-sounding family names as part of the US census in the 1930s.

Metaphone

Developed by Lawrence Phillips in 1990, Metaphone was also designed with the English language in mind. He tried to improve the Soundex mechanism by using information on variations and inconsistencies in English spelling/pronunciation to produce more accurate encodings. As a result the phonetic representation is a variable-length word based on the 16 consonants "0BFHJKLMNPRSTWXY". The 5 vowels "AEIOU" are allowed, too, but only at the beginning of the representation.

The original description of the Metaphone algorithm was rather inexact and led to the development of both Double Metaphone and Metaphone 3. The latter one can correct thousands of miscodings that are be produced by the first two versions. Metaphone 3 is available as a commercial software and supports both German and Spanish pronunciation.

Figure 1 below is a screenshot taken from a Dutch genealogy website, and shows the different representations for Soundex, Metaphone, and Double Metaphone for the name "Knuth". Also, the figure displays a selection of words that are represented in the same way and have the same phonetic code ("Gleiche Kodierung wie"). The more distinctive the algorithm the less number of words with the same phonetic code is best.

Figure 1

The Metaphone algorithm is a standard part of only a few programming languages, for example PHP. For Python, both Metaphone and Double Metaphone are part of the Phonetics package. Commercial implementations are available for the programming languages C++, C#, Java, Python, and Ruby.

Caverphone

The Caverphone algorithm was created by David Hood in 2002. A revised version was released in 2004. The project environment is the Caversham Project at the University of Otago, New Zealand. The background for the algorithm was to assist with matching electoral rolls data between late 19th century and early 20th century, where names only needed to be in a 'commonly recognizable form'. The algorithm is named after the municipality the university is located, and optimized for language-specific letter combinations where the research of the names took place.

By default, a Caverphone representation consists of six characters and numbers. Some implementations allow to extend the length up to ten characters and numbers. As an example, "Thompson" is transformed into the code "TMPSN1". Currently, the algorithm is available for C#, Python (revised version), Java (both the original and revised version), and R.

New York State Identification and Intelligence System

This algorithm was developed in the 1970s as part of the New York State Identification and Intelligence System (NYSIIS). Still in use today its quality is said to be close to the Soundex algorithm.

The design was optimized to match specifically with American names. So, the two names "Webberley" and "Wibberley" are represented by the phonetic code "WABARLY".

Kölner Phonetik

Based on the Soundex algorithm, in 1969 Hans Joachim Postel developed the Kölner Phonetik. It is targeted towards the German language, and later became part of the SAP systems. The phonetic representation is just a variable-length string of digits.

Currently, implementations in Perl, PHP, and JavaScript are known.

Match Rating Approach

The Match rating approach (MRA) codex was developed in 1977 by Western Airlines. The idea was to detect homophonous names on passenger lists with a strong focus on the English language. As an example, the representation for "Smith" is "SMTH", whereas "Smyth" is encoded by "SMYTH".

Currently, MRA is available as a C# implementation from an archived website, and as a Python method in the Jellyfish module.

Implementation

The calculation of the degree of similarity is based on three vectors denominated as codeList1, codeList2, and weight in the source code listing below. In Python a vector can be implemented as an array, for example using the NumPy package. Vector number one and two represent the phonetic code for the two different words. Vector number three represents the specific algorithm weight, and contains a fractional value between 0 and 1 in order to describe that weight. The total of the single values of vector three is the exact value of 1, and should neither be lower or higher than that. In case this happens the single values of vector three have to be normalized beforehand.

Figure 2 displays the three vectors.

Figure 2 Three vectors used to keep the data

The calculated degree of similarity between the two words is a decimal value based on a calculation per phonetic algorithm (subtotal). Each subtotal is the product of the Levenshtein distance between the specific phonetic representation of codeList1 and codeList2, and the according weight for the specific phonetic algorithm. For NYSIIS, it is calculated as follows:

nysiis = Levenshtein(codeList1["nysiis"], codeList2["nysiis"]) * weight["nysiis"]
= Levenshtein("Knatt", "Kand") * 0.1
= 3 * 0.1
= 0.3


As described in the previous article, Levenshtein distance returns the number of edits required to come from one word to the next. In our case the two words are phonetic codes that are calculated per algorithm. The lower the number of changes (edits) between the codes the higher the level of phonetic similarity between the original words as seen from the point of view of the algorithm.

The Python code below uses the Phonetics class from the AdvaS module, as well as the NumPy module. The definition of the Levenshtein function is similar to the earlier article on Levenshtein distance, and just included for completeness. Next, the three vectors are initialized as shown in Figure 2, the subtotals are calculated in a loop, and the total is printed to stdout.

# -*- coding: utf-8 -*-

from phonetics import Phonetics
import numpy as np

def levenshtein(seq1, seq2):
size_x = len(seq1) + 1
size_y = len(seq2) + 1
matrix = np.zeros ((size_x, size_y))
for x in xrange(size_x):
matrix [x, 0] = x
for y in xrange(size_y):
matrix [0, y] = y

for x in xrange(1, size_x):
for y in xrange(1, size_y):
if seq1[x-1] == seq2[y-1]:
matrix [x,y] = min(
matrix[x-1, y] + 1,
matrix[x-1, y-1],
matrix[x, y-1] + 1
)
else:
matrix [x,y] = min(
matrix[x-1,y] + 1,
matrix[x-1,y-1] + 1,
matrix[x,y-1] + 1
)
return (matrix[size_x - 1, size_y - 1])

# -- initialize phonetics object

word1 = Phonetics("Knuth")
word2 = Phonetics("Kant")
print ("Comparing %s with %s" % (word1.getText(), word2.getText()))

# -- phonetic code
codeList1 = word1.phoneticCode()
codeList2 = word2.phoneticCode()

# -- weight
weight = {
"soundex": 0.2,
"caverphone": 0.2,
"metaphone": 0.5,
"nysiis": 0.1
}

# -- algorithms
algorithms = ["soundex", "caverphone", "metaphone", "nysiis"]

# -- total
total = 0.0
for entry in algorithms:
code1 = codeList1[entry]
code2 = codeList2[entry]
lev = levenshtein (code1, code2)
currentWeight = weight[entry]
print ("comparing %s with %s for %s (%0.2f: weight %0.2f)" % (code1, code2, entry, lev, currentWeight))
subtotal = lev * currentWeight
total += subtotal

print ("total: %0.2f" % total)


Assuming the source code is stored in the file phonetics-vector.py the output is as follows:

$python phonetics-vector.py Comparing Knuth with Kant comparing K530 with K530 for soundex (0.00: weight 0.20) comparing KNT1111111 with KNT1111111 for caverphone (0.00: weight 0.20) comparing n0h with nt for metaphone (2.00: weight 0.50) comparing Knatt with Kand for nysiis (3.00: weight 0.20) total: 1.60$


The smaller the degree of similarity the more identical are the two words in terms of pronunciation. As demonstrated in the example above "Knuth" and "Kant" the calculated value is 1.6, and quite low.

Conclusion

The approach explained here helps to find a solution to balance the peculiarities of the different phonetic methods. So far, the first result is promising but may not be optimal yet. The weight vector is used to regulate the influence of each specific phonetic algorithm. Further research is required to find the appropriate weight value distribution per language. Also, the list of algorithms that are taken into account can be easily extended.

Acknowledgements

The author would like to thank Gerold Rupprecht and Zoleka Hatitongwe for their support while preparing the article.

Berlin -- Genève -- Cape Town
IT developer, trainer, and author. Coauthor of the Debian Package Management Book (http://www.dpmb.org/).