Swizec Teller - a geek with a hatswizec.com

    Levenshtein distance in Haskell

    When you have to figure out the difference between two strings your best bet is either Hamming distance or Levenshtein distance.

    Because hamming distance only works on strings of equal lengths, you're usually better off using levenshtein distance. Its name is also harder to spell so you know it's better!

    It is defined as the minimum amount of edits needed to transform one string into another. Like this:

    1. kitten → sitten (substitution of 's' for 'k')
    2. sitten → sittin (substitution of 'i' for 'e')
    3. sittin → sitting (insertion of 'g' at the end).

    Levenshtein distance also happens to be a great case for learning about dynamic programming, because one of the simplest implementations involves nothing more than keeping track of the already known distances and deciding whether you should increase that number for the next character.

    Doing this in a 2D array is trivially simple because it lest you travel through the distance space very efficiently without forgetting that kitten and kiten only have a levenshtein distance of 1.

    The meat of the algorithm goes like this:

    if s[i] = t[j] then
    d[i, j] := d[i-1, j-1] // no operation required
    else
    d[i, j] := minimum
    (
    d[i-1, j] + 1, // a deletion
    d[i, j-1] + 1, // an insertion
    d[i-1, j-1] + 1 // a substitution
    )

    Dynamic programming and Haskell

    Doing this in Haskell becomes tricky because we don't have state so we can't really keep track of what we already know. Translating this into a recursive problem is our only option.

    The key insight comes from looking at what those index manipulations are doing - traveling in the space of two strings.

    So, if we start from the other end of the "matrix" - bottom right instead of top left - we can express the levenshtein distance recursively:

    • if last two characters are same, ignore them
    • otherwise take the minimum of ignoring either or both the last characters
    • recurse

    In Haskell the function ends up looking like this:

    -- calculate levenshtein distance between two strings
    levenshtein::[Char] -> [Char] -> Int
    levenshtein "" "" = 0
    levenshtein "" s2 = length s2
    levenshtein s1 "" = length s1
    levenshtein s1 s2
    | last s1 == last s2 = levenshtein (init s1) (init s2)
    | otherwise = minimum [1 + levenshtein (init s1) s2,
    1 + levenshtein s1 (init s2),
    1 + levenshtein (init s1) (init s2)]

    Better Haskellers than me can probably write this a bit cleaner - for instance you don't need the levenshtein "" "" = 0 line, but I think it's more readable when all the border cases are spelled out explicitly.

    However, the problem is that this is incredibly slow when comparing a long string to a short one. So slow in fact, I never waited for the code to stop executing (takes 20s+).

    Luckily, speeding up the code is trivial - look at those border cases. We can just handle those first and then go into the meaty algorithm!

    The code becomes somewhat uglier to look at, but is usefully quick for all cases:

    -- calculate levenshtein distance between two strings
    levenshtein::[Char] -> [Char] -> Int
    -- this part is mostly a speed optimiziation
    levenshtein s1 s2
    | length s1 > length s2 = levenshtein s2 s1
    | length s1 < length s2 =
    let d = length s2 - length s1
    in d + levenshtein s1 (take (length s2 - d) s2)
    -- the meat of the algorithm
    levenshtein "" "" = 0
    levenshtein s1 s2
    | last s1 == last s2 = levenshtein (init s1) (init s2)
    | otherwise = minimum [1 + levenshtein (init s1) s2,
    1 + levenshtein s1 (init s2),
    1 + levenshtein (init s1) (init s2)]

    If anyone's got an even better idea for implementing Levenshtein distance in Haskell, I'd love to hear!

    edit: A Reddit commenter provides a far better Levenshtein distance function.





    Enhanced by Zemanta

    Did you enjoy this article?

    Published on July 6th, 2012 in Algorithm, Dynamic programming, Edit distance, Haskell, Languages, Levenshtein distance, Programming, String searching algorithm, Uncategorized,

    Learned something new?
    Read more Software Engineering Lessons from Production

    I write articles with real insight into the career and skills of a modern software engineer. "Raw and honest from the heart!" as one reader described them. Fueled by lessons learned over 20 years of building production code for side-projects, small businesses, and hyper growth startups. Both successful and not.

    Subscribe below 👇

    Software Engineering Lessons from Production

    Join Swizec's Newsletter and get insightful emails 💌 on mindsets, tactics, and technical skills for your career. Real lessons from building production software. No bullshit.

    "Man, love your simple writing! Yours is the only newsletter I open and only blog that I give a fuck to read & scroll till the end. And wow always take away lessons with me. Inspiring! And very relatable. 👌"

    ~ Ashish Kumar

    Join 15,883+ engineers learning lessons from my "raw and honest from the heart" emails.

    ⭐️⭐️⭐️⭐️✨
    4.5 stars average rating

    Have a burning question that you think I can answer? Hit me up on twitter and I'll do my best.

    Who am I and who do I help? I'm Swizec Teller and I turn coders into engineers with "Raw and honest from the heart!" writing. No bullshit. Real insights into the career and skills of a modern software engineer.

    Want to become a true senior engineer? Take ownership, have autonomy, and be a force multiplier on your team. The Senior Engineer Mindset ebook can help 👉 swizec.com/senior-mindset. These are the shifts in mindset that unlocked my career.

    Curious about Serverless and the modern backend? Check out Serverless Handbook, for frontend engineers 👉 ServerlessHandbook.dev

    Want to Stop copy pasting D3 examples and create data visualizations of your own? Learn how to build scalable dataviz React components your whole team can understand with React for Data Visualization

    Want to get my best emails on JavaScript, React, Serverless, Fullstack Web, or Indie Hacking? Check out swizec.com/collections

    Want to brush up on modern JavaScript syntax? Check out my interactive cheatsheet: es6cheatsheet.com

    Did someone amazing share this letter with you? Wonderful! You can sign up for my weekly letters for software engineers on their path to greatness, here: swizec.com/blog

    Want to brush up on your modern JavaScript syntax? Check out my interactive cheatsheet: es6cheatsheet.com

    By the way, just in case no one has told you it yet today: I love and appreciate you for who you are ❤️

    Created by Swizec with ❤️