Approximate string matching

In computing, approximate string matching is the technique of finding approximate matches to a pattern in a string.

The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. The usual primitive operations are:

  • insertion (e.g., changing cot to coat),
  • deletion (e.g., changing coat to cot), and
  • substitution (e.g., changing coat to cost).

Some approximate matchers also treat transposition, in which two adjacent letters in the string swap positions, as a primitive operation. Changing cost to cots is an example of a transposition.
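As a rough illustration of how these operations are counted, the sketch below uses the standard dynamic-programming recurrence with every operation costing one unit. The function name edit_distance and the allow_transposition flag are hypothetical names chosen for this example; the transposition rule shown is the restricted "adjacent swap" variant, which is only one of several ways a matcher might define it.

```python
def edit_distance(source: str, target: str, allow_transposition: bool = False) -> int:
    """Minimum number of unit-cost operations that turn source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] holds the distance between source[:i] and target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                       # deletion
                dist[i][j - 1] + 1,                       # insertion
                dist[i - 1][j - 1] + (0 if same else 1),  # match or substitution
            )
            # Optional fourth operation: swap two adjacent characters.
            if (allow_transposition and i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("cot", "coat"))    # 1 (one insertion)
print(edit_distance("coat", "cot"))    # 1 (one deletion)
print(edit_distance("coat", "cost"))   # 1 (one substitution)
print(edit_distance("cost", "cots"))   # 2 (no transposition allowed)
print(edit_distance("cost", "cots", allow_transposition=True))  # 1
```

Without the transposition rule, cost and cots are two operations apart, which is why some matchers add it as a primitive.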

Different approximate matchers impose different constraints. Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern. For example, if the pattern is coil, foil differs by one substitution, coils by one insertion, oil by one deletion, and foal by two substitutions. If all operations count as a single unit of cost and the limit is set to one, foil, coils, and oil will count as matches while foal will not.
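The coil example can be reproduced with a minimal sketch of a unit-cost matcher. The names levenshtein and matches_within are hypothetical, and the code assumes the three basic operations only (no transposition).

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance using two rows of the dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # match (0) or substitution (1)
            ))
        previous = current
    return previous[-1]


def matches_within(pattern: str, candidates: list[str], limit: int) -> list[str]:
    """Keep the candidates whose total cost does not exceed the limit."""
    return [c for c in candidates if levenshtein(c, pattern) <= limit]


print(matches_within("coil", ["foil", "coils", "oil", "foal"], limit=1))
# ['foil', 'coils', 'oil']
```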

Other matchers specify the number of operations of each type separately, while still others set a total cost but allow different weights to be assigned to different operations. Some matchers allow separate assignments of limits and weights to individual groups in the pattern.
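A weighted total cost might be sketched as follows. The Weights class and weighted_distance function are hypothetical names for this illustration, and the code assumes costs are measured in the direction of converting the candidate into the pattern; per-group limits are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Per-operation costs (illustrative defaults: every operation costs 1)."""
    insertion: float = 1.0
    deletion: float = 1.0
    substitution: float = 1.0


def weighted_distance(candidate: str, pattern: str, w: Weights = Weights()) -> float:
    """Cheapest total cost of converting candidate into pattern."""
    previous = [j * w.insertion for j in range(len(pattern) + 1)]
    for i, cc in enumerate(candidate, start=1):
        current = [i * w.deletion]
        for j, pc in enumerate(pattern, start=1):
            current.append(min(
                previous[j] + w.deletion,       # drop cc from the candidate
                current[j - 1] + w.insertion,   # add pc to the candidate
                previous[j - 1] + (0.0 if cc == pc else w.substitution),
            ))
        previous = current
    return previous[-1]


costly_substitution = Weights(substitution=2.0)
print(weighted_distance("foil", "coil", costly_substitution))   # 2.0
print(weighted_distance("coils", "coil", costly_substitution))  # 1.0
print(weighted_distance("oil", "coil", Weights(insertion=0.5))) # 0.5
```

With substitutions weighted at 2 and a total limit of 1, foil would no longer match coil, while coils still would.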

Most approximate matchers used for text processing are regular expression matchers. The distance between a candidate and the pattern is therefore computed as the minimum distance between the candidate and a fixed string matching the regular expression. Thus, if the pattern is co.l, using the POSIX notation in which a dot matches any single character, both coal and coil are exact matches, while soil differs by one substitution.
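For the co.l example, a minimal sketch is enough if the pattern language is restricted to literal characters and the dot wildcard, which is an assumption made here; full regular expressions (repetition, alternation, character classes) need a more general algorithm. The function name distance_to_dot_pattern is hypothetical.

```python
def distance_to_dot_pattern(candidate: str, pattern: str) -> int:
    """Minimum unit-cost distance from candidate to any string matching pattern.

    Only literal characters and '.' (any single character) are supported.
    """
    previous = list(range(len(pattern) + 1))
    for i, cc in enumerate(candidate, start=1):
        current = [i]
        for j, pc in enumerate(pattern, start=1):
            position_matches = pc == "." or pc == cc
            current.append(min(
                previous[j] + 1,        # deletion from the candidate
                current[j - 1] + 1,     # insertion into the candidate
                previous[j - 1] + (0 if position_matches else 1),
            ))
        previous = current
    return previous[-1]


for word in ["coal", "coil", "soil"]:
    print(word, distance_to_dot_pattern(word, "co.l"))
# coal 0
# coil 0
# soil 1
```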

Until recently, the most common application of approximate matchers was spell checking. With the availability of large amounts of DNA data, matching of nucleotide sequences has become an important application. Approximate matching is also used to identify pieces of music from small snatches and in spam filtering.

More available info
  • Fuzzy string searching
  • Levenshtein distance
  • Needleman-Wunsch algorithm
  • Soundex
  • Agrep
  • Zsh

This page was last modified by Admin. The previous modification to this article was made at 21:13, 10 Nov 2006 by Wikipedia user QTJ. Based on work by Wikipedia users Billposer, JonHarder, Macha, Bluebot, Gflores, Pablo-flores and Emj.
Pub date: 2009-05-15 10:49:23