Subscribe to this thread
Home - General / All posts - fuzzy look up and/or string-matching algorithm implementation in manifold?
Corentin
159 post(s)
#10-Dec-20 18:45

Hi

I have two database of village in madagascar, I need to join them. the first one is from an old census the second one from the last census. Let's call them A & B (B being the most recent data)

A & B both have position, region, district, commune, name and position, but there are sliglth differences between A & B (small position change : 0-1km, orthograph change in names...). A has 17000 points, B has 20000 points. I have some information in A database that I need to put into B database

I tried a geographical join (through a voronoi area approach) but only ~65% of A point fall uniquely in a B voronoi Cell. And when I have a unique A point into a B cell sometimes it is not the right point (for example the village names do not match)

I would like to try another method based on "full name" (concatenation of region-district-commune-name), but I am facing orthograph problem, this way I have only 60% of matching before orthograph correction. So I would need to know if there is some fuzzy lookup algorithm in Manifold (8 or9, preferably8) that helps join database base on almost exact primary key column (like in excel : fuzzy look up addin https://www.microsoft.com/en-us/download/details.aspx?id=15011).

If no, second option would be to analyse if my "full name" are similar to decide wether points in A & B table should match or not based on a note given by a string-matching function. Is such a function exist in manifolf? As someone already used this kind of stuff?

Corentin
159 post(s)
#13-Jan-21 09:17

No one has a tip on this question ?

rk
621 post(s)
#13-Jan-21 09:29

You could try soundex.

M8

SQL Reserved Words / Index (georeference.org)

SoundsLike(string, string)

Returns True if strings sound like each other, and False otherwise. Uses Soundex algorithm for English language.

M9 String SQL Functions (manifold.net)

StringSoundex(<string>) : <string>

Given a string returns the Soundex code for that string. The Soundex algorithm generates short strings of alphanumeric codes based on how an English word sounds. English words that are pronounced with similar sounds have the same Soundex codes.

? StringSoundex('Sonora')

Returns:

S560

Example:

? StringSoundex('Sanara')

Returns:

S560

Soundex codes may be used to find matches to words entered by users who use inexact spelling. For example, if we have a table of provinces in Mexico and a user enters Sanara we could use a query to find possible matches:

SELECT [NAME] FROM [Mexico Table]

  WHERE StringSoundex('Sanara') = StringSoundex([NAME]);

Dimitri


7,413 post(s)
#13-Jan-21 11:39

That's a great idea from rk. Soundex was developed for English but often works remarkably well in other languages that use Latin characters, even for French.

There are many other "fuzzy" matching algorithms that are similar to or are alternatives to Soundex (use a search engine to find them, along with sample code to implement them).

Here's a page on another English language fuzzy matching algorithm: https://seanstuber.com/2019/11/21/soundex-alternatives-part-1/

Here's a French alternative to Soundex: https://hackage.haskell.org/package/Sonnex - That's written in Haskell, but you might be able to find an example written in a more popular language.

Based on what you find on the web for matching algorithms that look like they'd work for whatever text is involved in your concatenated location name strings, it's a simple matter to write a function that you could call in SQL, your own alternative to StringSoundex.

Manifold User Community Use Agreement Copyright (C) 2007-2021 Manifold Software Limited. All rights reserved.