Simil is one of the coolest algorithms I have come across in a long while. Developers usually know that all user input is evil, and we really need to implement code contracts and all the fail safes afforded to us by the numerous development standards out there. Unfortunately, spelling mistakes are human and we all make them.
Simil Solves A Common Problem
Duplicate or similar database entries have been an issue for as long as I can remember. Yet many developers tend to cater for this problem at the end of development when the system is almost complete. Adding the final touches and tweaks somehow reminds them that they should probably sanitize their user entries. The proof is in the pudding here. I have inherited many systems where the occurrence of similar strings have been a problem.
Consider the following few strings found in a database:
- Valen Corporation
- Vallen Corporation
- Vaylen Corp
These are all the same supplier in a table of 1 million supplier names. I used Redgate’s SQL Data Generator to create this table.
I then updated 3 records in the database with this supplier name.
The developer knows that the supplier name is ‘Valen Corporation’, so a typical LIKE query on all entries like ‘Val’ returns two of the three entries.
Trying the same LIKE query with the string ‘Va’ returns 35,629 rows. That is far too many to verify. In this situation, the incorrectly entered string ‘Vaylen Corp’ seems to have slipped through the cracks. This is because the developer could not possibly know the different misspellings.
Unless different combinations of the perceived spelling is tested by the developer, ‘Vaylen Corp’ will remain undiscovered for a long time.
Simil Solves This Problem
In situations such as these, Simil can be very beneficial and efficient at weeding out similar names in a table of hundreds of thousands of rows.
Compare the above screenshot of Simil searching for similar words to the string ‘Valen Corporation’. The Simil stored procedure is executed by passing it the string to search for as well as the threshold value (value between 0 and 1). The lower this threshold value is, the more results are returned (0 = completely different and 1 = identical). This way you can fine tune the threshold value instead of trying to figure out different misspellings of the string. Best of all, Simil took 3 seconds to return all three entries in a database of a million rows.
The implementation of Simil is really easy. Tom van Stiphout takes us through Simil in his blog post Simil: An algorithm to look for similar strings. Below is the SQL I used to create a Simil Stored Procedure on my SQL database. You can now implement this string check easily from your .NET application by simply executing the Stored Procedure and passing it a threshold and the string in question. Think of when a user enters (for example) a Supplier name in a web form. You can now return all the similar names in the database and ask them if the supplier isn’t perhaps one of the following returned by the Stored Procedure.