Levenshtein Distance

/ˌlɛvə(n)ˈʃtaɪn ˈdɪstəns/

noun — "a metric for measuring string differences."

Levenshtein Distance is a metric used in computer science and computational linguistics to quantify the difference between two sequences, typically strings, by counting the minimum number of single-character edits required to transform one string into the other. The allowed edits are insertion, deletion, and substitution of a single character. This concept is fundamental in applications such as spell checking, approximate string matching, DNA sequence analysis, and natural language processing.

Formally, given two strings s of length m and t of length n, the Levenshtein distance d(m,n) can be computed using a recursive relation:


d(i, j) = 
    0                                         if i = 0 and j = 0
    j                                         if i = 0
    i                                         if j = 0
    min(
        d(i-1, j) + 1,                        // deletion
        d(i, j-1) + 1,                        // insertion
        d(i-1, j-1) + cost                     // substitution
    )                                         where cost = 0 if s[i] = t[j] else 1

The algorithm is typically implemented using a two-dimensional matrix of size (m+1) × (n+1), where each cell represents the minimal edit distance between prefixes of the strings. Dynamic programming is used to fill the matrix efficiently, avoiding redundant recursive calls and reducing time complexity from exponential to O(m × n).

In practical applications, Levenshtein Distance is used to rank similarity between strings. For example, in search engines, it allows the system to find approximate matches for user queries with typographical errors. In bioinformatics, it measures evolutionary divergence between DNA, RNA, or protein sequences by quantifying insertions, deletions, or substitutions. In text comparison, it can be combined with thresholds or weighted operations to prioritize certain edits over others.

A simple usage example in pseudo-code for spell checking:


dictionary = ["apple", "orange", "banana"]
input_word = "appl"
closest_word = None
min_distance = infinity

for word in dictionary:
    distance = LevenshteinDistance(input_word, word)
    if distance < min_distance:
        min_distance = distance
        closest_word = word
return closest_word

Optimizations exist for memory and performance. For instance, only two rows of the dynamic programming matrix need to be stored at any time, reducing space complexity from O(m × n) to O(min(m, n)). Other variations include weighted Levenshtein distance, where different edits have different costs, and Damerau-Levenshtein distance, which adds transposition of adjacent characters as an allowed operation.

Conceptually, Levenshtein Distance can be visualized as a grid or lattice where each path from the top-left to bottom-right represents a sequence of edits. The shortest path corresponds to the minimal edits needed to convert one string into another, analogous to finding the least costly route through a maze.

See String Metrics, Approximate String Matching, Damerau-Levenshtein Distance.

Algorithm

String

Processing