You are here
Home > Programming > Simil Is An Algorithm Used To Find Similar Strings – And It Is Awesome!

Simil Is An Algorithm Used To Find Similar Strings – And It Is Awesome!

Simil

Simil is Awesome

Simil is one of the coolest algorithms I have come across in a long while. Developers usually know that all user input is evil, and we really need to implement code contracts and all the fail safes afforded to us by the numerous development standards out there. Unfortunately, spelling mistakes are human and we all make them.

Not what you are looking for?

Check out some of these other great resources

Simil Solves A Common Problem

Duplicate or similar database entries have been an issue for as long as I can remember. Yet many developers tend to cater for this problem at the end of development when the system is almost complete. Adding the final touches and tweaks somehow reminds them that they should probably sanitize their user entries. The proof is in the pudding here. I have inherited many systems where the occurrence of similar strings have been a problem.

Consider the following few strings found in a database:

  • Valen Corporation
  • Vallen Corporation
  • Vaylen Corp

These are all the same supplier in a table of 1 million supplier names. I used Redgate’s SQL Data Generator to create this table.

Simil

I then updated 3 records in the database with this supplier name.

Simil

The developer knows that the supplier name is ‘Valen Corporation’, so a typical LIKE query on all entries like ‘Val’ returns two of the three entries.

Simil

Trying the same LIKE query with the string ‘Va’ returns 35,629 rows. That is far too many to verify. In this situation, the incorrectly entered string ‘Vaylen Corp’ seems to have slipped through the cracks. This is because the developer could not possibly know the different misspellings.

Simil

Unless different combinations of the perceived spelling is tested by the developer, ‘Vaylen Corp’ will remain undiscovered for a long time.

Simil Solves This Problem

In situations such as these, Simil can be very beneficial and efficient at weeding out similar names in a table of hundreds of thousands of rows.

Simil

Compare the above screenshot of Simil searching for similar words to the string ‘Valen Corporation’. The Simil stored procedure is executed by passing it the string to search for as well as the threshold value (value between 0 and 1). The lower this threshold value is, the more results are returned (0 = completely different and 1 = identical). This way you can fine tune the threshold value instead of trying to figure out different misspellings of the string. Best of all, Simil took 3 seconds to return all three entries in a database of a million rows.

Implementing Simil

The implementation of Simil is really easy. Tom van Stiphout takes us through Simil in his blog post Simil: An algorithm to look for similar strings. Below is the SQL I used to create a Simil Stored Procedure on my SQL database. You can now implement this string check easily from your .NET application by simply executing the Stored Procedure and passing it a threshold and the string in question. Think of when a user enters (for example) a Supplier name in a web form. You can now return all the similar names in the database and ask them if the supplier isn’t perhaps one of the following returned by the Stored Procedure.

Dirk Strauss
Dirk is a Software Developer and Microsoft MVP from South Africa. He loves all things Technology and is slightly addicted to Twitter and Jimi Hendrix. Apart from writing code, he also enjoys writing human readable articles. "I love sharing knowledge and connecting with people from around the world. It's the diversity that makes life so beautiful." Dirk feels very strongly that pizza is simply not complete without Tabasco, that you can never have too much garlic, and that cooking the perfect steak is an art he has yet to master.

Similar Articles

Top