levenshtein

(PHP 4 >= 4.0.1, PHP 5, PHP 7)

levenshtein - Calculate Levenshtein distance between two strings

Description

int levenshtein ( string $str1 , string $str2 )
int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )

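The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. The complexity of the algorithm is O(m*n), where n and m are the lengths of str1 and str2 (rather good when compared to similar_text(), which is O(max(n,m)**3), but still expensive).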

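In its simplest form the function takes only the two strings as parameters and calculates just the number of insert, replace and delete operations needed to transform str1 into str2.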

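A second variant takes three additional parameters that define the cost of insert, replace and delete operations. This variant is more general and adaptive than the first one, but not as efficient.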
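As a small illustration of the two variants, the sketch below compares the default distance with a weighted one. The word pair and the cost values are arbitrary choices for this example; the argument order (insert, replace, delete) follows the signature shown above.

```php
<?php
// Default variant: every insert, replace and delete counts as 1.
$default = levenshtein('kitten', 'sitting');

// Weighted variant: cost_ins = 1, cost_rep = 2, cost_del = 1.
// With these (assumed) costs a replacement is as expensive as
// a delete followed by an insert.
$weighted = levenshtein('kitten', 'sitting', 1, 2, 1);

echo "default: $default, weighted: $weighted\n"; // default: 3, weighted: 5
?>
```

Raising the replacement cost is one way to make substitutions count more than simple insertions or deletions.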

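Parameters

str1

One of the strings being evaluated for Levenshtein distance.

str2

The other string being evaluated for Levenshtein distance.

cost_ins

Defines the cost of insertion.

cost_rep

Defines the cost of replacement.

cost_del

Defines the cost of deletion.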

Return Values

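This function returns the Levenshtein distance between the two argument strings, or -1 if one of the argument strings is longer than the limit of 255 characters.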
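Because -1 can be mistaken for a very small distance, callers may want to check the length limit explicitly. The following is only a sketch; the wrapper name levenshtein_or_null is an assumption made for this example and is not part of PHP.

```php
<?php
// Hypothetical wrapper (not a PHP built-in): returns null instead of -1
// when either string exceeds the 255-character limit described above.
function levenshtein_or_null($str1, $str2)
{
    if (strlen($str1) > 255 || strlen($str2) > 255) {
        return null;
    }
    return levenshtein($str1, $str2);
}

var_dump(levenshtein_or_null('carrrot', 'carrot'));       // int(1)
var_dump(levenshtein_or_null(str_repeat('a', 256), 'a')); // NULL
?>
```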

Examples

Example #1 levenshtein() example

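<?php
// input misspelled word
$input = 'carrrot';

// array of words to check against
$words  = array('apple','pineapple','banana','orange',
                'radish','carrot','pea','bean','potato');

// no shortest distance found, yet
$shortest = -1;

// loop through words to find the closest
foreach ($words as $word) {

    // calculate the distance between the input word
    // and the current word
    $lev = levenshtein($input, $word);

    // check for an exact match
    if ($lev == 0) {

        // closest word is this one (exact match)
        $closest = $word;
        $shortest = 0;

        // break out of the loop; we've found an exact match
        break;
    }

    // if this distance is shorter than the shortest found so far,
    // OR if no close word has been found yet
    if ($lev <= $shortest || $shortest < 0) {
        // set the closest match, and shortest distance
        $closest  = $word;
        $shortest = $lev;
    }
}

echo "Input word: $input\n";
if ($shortest == 0) {
    echo "Exact match found: $closest\n";
} else {
    echo "Did you mean: $closest?\n";
}

?>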

The above example will output:

Input word: carrrot
Did you mean: carrot?

See Also

User Contributed Notes 28 notes

luciole75w at no dot spam dot gmail dot com
4 years ago
The levenshtein function processes each byte of the input string individually. So for multibyte encodings, such as UTF-8, it may give misleading results.

Example with a French accented word:
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)

You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.

Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.

Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.

<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
//
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
   
    // find all multibyte characters (cf. utf-8 encoding specs)
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
        return $str; // plain ascii string

    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));

    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
   
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);

    return levenshtein($s1, $s2);
}
?>

Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms
Johan Gennesson php at genjo dot fr
1 year ago
Please, be aware that:

<?php
// Levenshtein Apostrophe (U+0027 &#39;) and Right Single Quotation Mark (U+2019 &#8217;)
echo levenshtein("'", "’");
?>

will output 3!
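The result follows from byte counting: U+2019 is three bytes in UTF-8, so turning the one-byte apostrophe into it takes one replacement plus two insertions. A rough sketch of a code-point-based distance is shown below. The function name levenshtein_codepoints is hypothetical, the code assumes both inputs are valid UTF-8, and it will be much slower than the built-in function.

```php
<?php
// Hypothetical helper (not part of PHP): Levenshtein distance counted in
// UTF-8 code points instead of bytes. Assumes valid UTF-8 input.
function levenshtein_codepoints($a, $b)
{
    // Split each string into an array of code points.
    $a = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $b = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($b);

    // $prev[$j]: distance from the current prefix of $a to the first $j
    // code points of $b. It starts as the row for the empty prefix.
    $prev = range(0, $n);
    foreach ($a as $i => $ca) {
        $curr = array($i + 1);
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($ca === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min($prev[$j] + 1,          // delete from $a
                            $curr[$j - 1] + 1,      // insert into $a
                            $prev[$j - 1] + $cost); // replace or match
        }
        $prev = $curr;
    }
    return $prev[$n];
}

$quote = "\xE2\x80\x99"; // U+2019 written as raw UTF-8 bytes

echo strlen($quote), "\n";                       // 3 (bytes)
echo levenshtein("'", $quote), "\n";             // 3 (byte-based)
echo levenshtein_codepoints("'", $quote), "\n";  // 1 (code-point-based)
?>
```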
paulrowe at iname dot com
9 years ago
[EDITOR'S NOTE: original post and 2 corrections combined into 1 -- mgf]

Here is an implementation of the Levenshtein Distance calculation that only uses a one-dimensional array and doesn't have a limit to the string length. This implementation was inspired by maze generation algorithms that also use only one-dimensional arrays.

I have tested this function with two 532-character strings and it completed in 0.6-0.8 seconds.

<?php
/*
* This function starts out with several checks in an attempt to save time.
*   1.  The shorter string is always used as the "right-hand" string (as the size of the array is based on its length). 
*   2.  If the left string is empty, the length of the right is returned.
*   3.  If the right string is empty, the length of the left is returned.
*   4.  If the strings are equal, a zero-distance is returned.
*   5.  If the left string is contained within the right string, the difference in length is returned.
*   6.  If the right string is contained within the left string, the difference in length is returned.
* If none of the above conditions were met, the Levenshtein algorithm is used.
*/
function LevenshteinDistance($s1, $s2)
{
 
  $sLeft = (strlen($s1) > strlen($s2)) ? $s1 : $s2;
  $sRight = (strlen($s1) > strlen($s2)) ? $s2 : $s1;
  $nLeftLength = strlen($sLeft);
  $nRightLength = strlen($sRight);
  if ($nLeftLength == 0)
    return $nRightLength;
  else if ($nRightLength == 0)
    return $nLeftLength;
  else if ($sLeft === $sRight)
    return 0;
  else if (($nLeftLength < $nRightLength) && (strpos($sRight, $sLeft) !== FALSE))
    return $nRightLength - $nLeftLength;
  else if (($nRightLength < $nLeftLength) && (strpos($sLeft, $sRight) !== FALSE))
    return $nLeftLength - $nRightLength;
  else {
    // Row 0 of the distance table: turning the empty prefix of the left
    // string into the first j characters of the right string costs j.
    $nsDistance = range(0, $nRightLength);
    for ($nLeftPos = 1; $nLeftPos <= $nLeftLength; ++$nLeftPos)
    {
      $cLeft = $sLeft[$nLeftPos - 1];
      // value of the cell diagonally up-left of column 1
      $nDiagonal = $nLeftPos - 1;
      $nsDistance[0] = $nLeftPos;
      for ($nRightPos = 1; $nRightPos <= $nRightLength; ++$nRightPos)
      {
        $cRight = $sRight[$nRightPos - 1];
        $nCost = ($cRight == $cLeft) ? 0 : 1;
        // remember the old value before overwriting it; it is the
        // diagonal for the next column
        $nNewDiagonal = $nsDistance[$nRightPos];
        $nsDistance[$nRightPos] =
          min($nsDistance[$nRightPos] + 1,
              $nsDistance[$nRightPos - 1] + 1,
              $nDiagonal + $nCost);
        $nDiagonal = $nNewDiagonal;
      }
    }
    return $nsDistance[$nRightLength];
  }
}
?>
WiLDRAGoN
2 years ago
Some small changes allow you to calculate multiple words.

<?php

$input       = array();
$dictionary  = array();
foreach ($input as $output) {
    $shortest = -1;
    foreach ($dictionary as $word) {
        $lev = levenshtein($output, $word);
        if ($lev == 0) {
            $closest = $word;
            $shortest = 0;
        }
        if ($lev <= $shortest || $shortest < 0) {
            $closest  = $word;
            $shortest = $lev;
        }
    }
    echo "Input word: $output\n";
    if ($shortest == 0) {
        echo "Exact match found: $closest\n";
    } else {
        echo "Did you mean: $closest?\n";
    }
}

?>
bisqwit at iki dot fi
15 years ago
At the time of this manual note the user defined thing 
in levenshtein() is not implemented yet. I wanted something
like that, so I wrote my own function. Note that this
doesn't return levenshtein() difference, but instead
an array of operations to transform a string to another.

Please note that the difference finding part (resync)
may be extremely slow on long strings.

<?php

/* matchlen(): returns the length of matching
* substrings at beginning of $a and $b
*/
function matchlen($a, $b)
{
  $c = 0;
  $alen = strlen($a);
  $blen = strlen($b);
  $d = min($alen, $blen);
  // check the bound first so we never read past the end of a string
  while ($c < $d && $a[$c] == $b[$c])
    $c++;
  return $c;
}

/* Returns a table describing
* the differences of $a and $b */
function calcdiffer($a, $b)
{
  $alen = strlen($a);
  $blen = strlen($b);
  $aptr = 0;
  $bptr = 0;

  $ops = array();

  while ($aptr < $alen && $bptr < $blen)
  {
    $matchlen = matchlen(substr($a, $aptr), substr($b, $bptr));
    if ($matchlen)
    {
      $ops[] = array('=', substr($a, $aptr, $matchlen));
      $aptr += $matchlen;
      $bptr += $matchlen;
      continue;
    }
    /* Difference found: look for the longest match to resync on */

    $bestlen = 0;
    $bestpos = array(0, 0);
    for ($atmp = $aptr; $atmp < $alen; $atmp++)
    {
      for ($btmp = $bptr; $btmp < $blen; $btmp++)
      {
        $matchlen = matchlen(substr($a, $atmp), substr($b, $btmp));
        if ($matchlen > $bestlen)
        {
          $bestlen = $matchlen;
          $bestpos = array($atmp, $btmp);
        }
        if ($matchlen >= $blen - $btmp) break;
      }
    }
    if (!$bestlen) break;

    $adifflen = $bestpos[0] - $aptr;
    $bdifflen = $bestpos[1] - $bptr;

    if ($adifflen)
    {
      $ops[] = array('-', substr($a, $aptr, $adifflen));
      $aptr += $adifflen;
    }
    if ($bdifflen)
    {
      $ops[] = array('+', substr($b, $bptr, $bdifflen));
      $bptr += $bdifflen;
    }
    $ops[] = array('=', substr($a, $aptr, $bestlen));
    $aptr += $bestlen;
    $bptr += $bestlen;
  }
  if ($aptr < $alen)
  {
    /* a has leftover characters that b lacks */
    $ops[] = array('-', substr($a, $aptr));
  }
  if ($bptr < $blen)
  {
    /* a has too little stuff */
    $ops[] = array('+', substr($b, $bptr));
  }
  return $ops;
}
?>

Example:

<?php
$tab = calcdiffer('Tm on jonkinlainen testi',
                  'Tm ei ole minknlainen testi.');
$ops = array('='=>'Ok', '-'=>'Remove', '+'=>'Add');
foreach ($tab as $k)
  echo $ops[$k[0]], " '", $k[1], "'\n";
?>

Example output:

Ok 'Tm '
Remove 'on jonki'
Add 'ei ole mink'
Ok 'nlainen testi'
Add '.'
dschultz at protonic dot com
17 years ago
It's also useful if you want to make some sort of registration page and you want to make sure that people who register don't pick usernames that are very similar to their passwords.
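A check along those lines might look like the sketch below. The function name password_too_similar and the threshold of 3 are assumptions chosen for illustration, not recommended values.

```php
<?php
// Hypothetical check: reject a password whose edit distance to the
// username is below an (arbitrary) minimum. Comparison is lowercased
// because levenshtein() is case-sensitive.
function password_too_similar($username, $password, $minDistance = 3)
{
    $distance = levenshtein(strtolower($username), strtolower($password));
    return $distance < $minDistance;
}

var_dump(password_too_similar('johnsmith', 'JohnSmith1'));            // bool(true)
var_dump(password_too_similar('johnsmith', 'correct horse battery')); // bool(false)
?>
```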
gzink at zinkconsulting dot com
14 years ago
Try combining this with metaphone() for a truly amazing fuzzy search function. Play with it a bit, the results can be plain scary (users thinking the computer is almost telepathic) when implemented properly. I wish spell checkers worked as well as the code I've written.

I would release my complete code if reasonable, but it's not, due to copyright issues. I just hope that somebody can learn from this little tip!
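The poster's code is not available, so the following is only one possible reading of the tip: compare the metaphone() keys of two words with levenshtein() instead of the raw spellings. The helper name phonetic_distance and the word list are made up for this sketch.

```php
<?php
// Hypothetical helper: edit distance between the metaphone keys of two
// words, so that words which sound alike score as close.
function phonetic_distance($a, $b)
{
    return levenshtein(metaphone($a), metaphone($b));
}

$input = 'carrrot';
$words = array('apple', 'carrot', 'banana');

// Pick the word whose phonetic key is closest to the input's key.
$best = null;
$bestScore = null;
foreach ($words as $word) {
    $score = phonetic_distance($input, $word);
    if ($bestScore === null || $score < $bestScore) {
        $best = $word;
        $bestScore = $score;
    }
}

echo "Closest by sound: $best\n";
?>
```

In practice the phonetic score would probably be combined with the plain levenshtein() distance to break ties between words that share the same key.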
justin at visunet dot ie
12 years ago
<?php

   
    /*********************************************************************
    * The below func, btlfsa, (better than levenshtein for spelling apps)
    * produces better results when comparing words like haert against
    * haart and heart.
    *
    * For example here is the output of levenshtein compared to btlfsa
    * when comparing 'haert' to 'herat, haart, heart, harte'
    *
    * btlfsa('haert','herat'); output is.. 3
    * btlfsa('haert','haart'); output is.. 3
    * btlfsa('haert','harte'); output is.. 3
    * btlfsa('haert','heart'); output is.. 2
    *
    * levenshtein('haert','herat'); output is.. 2
    * levenshtein('haert','haart'); output is.. 1
    * levenshtein('haert','harte'); output is.. 2
    * levenshtein('haert','heart'); output is.. 2
    *
    * In other words, if you used levenshtein, 'haart' would be the
    * closest match to 'haert', whereas btlfsa sees that it should be
    * 'heart'
    */

    function btlfsa($word1,$word2)
    {
        $score = 0;

        // For each char that is different add 2 to the score
        // as this is a BIG difference

        $remainder  = preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word1)."]/i",'',$word2);
        $remainder .= preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word2)."]/i",'',$word1);
        $score      = strlen($remainder)*2;

        // Take the difference in string length and add it to the score
        $w1_len  = strlen($word1);
        $w2_len  = strlen($word2);
        $score  += $w1_len > $w2_len ? $w1_len - $w2_len : $w2_len - $w1_len;

        // Calculate how many letters are in different locations
        // And add it to the score i.e.
        //
        // h e a r t
        // 1 2 3 4 5
        //
        // h a e r t     a e        = 2
        // 1 2 3 4 5   1 2 3 4 5
        //

        $w1 = $w1_len > $w2_len ? $word1 : $word2;
        $w2 = $w1_len > $w2_len ? $word2 : $word1;

        for ($i=0; $i < strlen($w1); $i++)
        {
            if ( !isset($w2[$i]) || $w1[$i] != $w2[$i] )
            {
                $score++;
            }
        }

        return $score;
    }

    // *************************************************************
    // Here is a full code example showing the difference

    $misspelled = 'haert';

    // Imagine that these are sample suggestions thrown back by soundex or metaphone..
    $suggestions = array('herat', 'haart', 'heart', 'harte');

    // Firstly order an array based on levenshtein
    $levenshtein_ordered = array();
    foreach ( $suggestions as $suggestion )
    {
        $levenshtein_ordered[$suggestion] = levenshtein($misspelled,$suggestion);
    }
    asort($levenshtein_ordered, SORT_NUMERIC );

    print "<b>Suggestions as ordered by levenshtein...</b><ul><pre>";
    print_r($levenshtein_ordered);
    print "</pre></ul>";

    // Secondly order an array based on btlfsa
    $btlfsa_ordered = array();
    foreach ( $suggestions as $suggestion )
    {
        $btlfsa_ordered[$suggestion] = btlfsa($misspelled,$suggestion);
    }
    asort($btlfsa_ordered, SORT_NUMERIC );

    print "<b>Suggestions as ordered by btlfsa...</b><ul><pre>";
    print_r($btlfsa_ordered);
    print "</pre></ul>";

?>
carey at NOSPAM dot internode dot net dot au
11 years ago
I have found that levenshtein is actually case-sensitive (in PHP 4.4.2 at least).

<?php
$distance = levenshtein('hello','ELLO');
echo "$distance";
?>

Outputs: "5", instead of "1". If you are implementing a fuzzy search feature that makes use of levenshtein, you will probably need to find a way to work around this.
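One simple workaround, sketched below, is to normalize the case of both strings before comparing them. strtolower() only handles single-byte characters, so this assumes plain ASCII input.

```php
<?php
// Case-sensitive comparison: every character differs, plus one deletion.
$sensitive = levenshtein('hello', 'ELLO');

// Lowercase both strings first so only real spelling differences count.
$insensitive = levenshtein(strtolower('hello'), strtolower('ELLO'));

echo "$sensitive $insensitive\n"; // 5 1
?>
```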
engineglue at gmail dot com
5 years ago
I really like [the manual's] example for the use of the levenshtein function to match against an array. I ran into the need to specify the sensitivity of the result. There are circumstances when you want it to return false if the match is way out of line. I wouldn't want "marry had a little lamb" to match with "saw viii" simply because it was the best match in the array. Hence the need for sensitivity:

<?php
   
    function wordMatch($words, $input, $sensitivity){
        $shortest = -1;
        foreach ($words as $word) {
            $lev = levenshtein($input, $word);
            if ($lev == 0) {
                $closest = $word;
                $shortest = 0;
                break;
            }
            if ($lev <= $shortest || $shortest < 0) {
                $closest  = $word;
                $shortest = $lev;
            }
        }
        if ($shortest <= $sensitivity){
            return $closest;
        } else {
            return 0;
        }
    }

    $word = 'PINEEEEAPPLE';

    $words  = array('apple','pineapple','banana','orange',
                    'radish','carrot','pea','bean','potato');

    echo wordMatch($words, strtolower($word), 2);
?>
yhoko at yhoko dot com
1 year ago
Note that this function might cause problems when working with multibyte characters, as in UTF-8. Example:

<?php
print( similar_text( 'hä', 'hà' ) ); // Returns 2 where only 1 character matches
?>
Chaim Chaikin
6 years ago
Regarding Example #1 above, would it not be more efficient to first use a simple PHP == comparison to check whether the strings are equal, even before testing the word with levenshtein()?

Something like this:

<?php
// input misspelled word
$input = 'carrrot';

// array of words to check against
$words  = array('apple','pineapple','banana','orange',
                'radish','carrot','pea','bean','potato');

// no shortest distance found, yet
$shortest = -1;

// loop through words to find the closest
foreach ($words as $word) {

    // check for an exact match
    if ($input == $word) {

        // closest word is this one (exact match)
        $closest = $word;
        $shortest = 0;

        // break out of the loop; we've found an exact match
        break;
    }

    // calculate the distance between the input word,
    // and the current word
    $lev = levenshtein($input, $word);

    // if this distance is less than the next found shortest
    // distance, OR if a next shortest word has not yet been found
    if ($lev <= $shortest || $shortest < 0) {
        // set the closest match, and shortest distance
        $closest  = $word;
        $shortest = $lev;
    }
}

echo "Input word: $input\n";
if ($shortest == 0) {
    echo "Exact match found: $closest\n";
} else {
    echo "Did you mean: $closest?\n";
}

?>
dale3h
9 years ago
Using PHP's example along with Patrick's comparison percentage function, I have come up with a function that returns the closest word from an array, and assigns the percentage to a referenced variable:

<?php
 
  function closest_word($input, $words, &$percent = null) {
    $shortest = -1;
    foreach ($words as $word) {
      $lev = levenshtein($input, $word);

      if ($lev == 0) {
        $closest = $word;
       
$shortest