Fuzzy Match Join

I have a scenario where I think ApproximativeJoin would have helped, but from what I can tell, that join was retired and nothing has taken it’s place. I was hoping someone might have a better solution than what I’ve done.

Scenario: I need to compare policy numbers from two different tables and pull out any matches. The problem is that in either table the same policy might have the policy number formatted slightly differently in each table. The most common variance is leading or trailing zeros, but there are also probably 20 or 30 other common variances like having a single letter at the end or beginning.

That probably makes sense but here are some examples to further illustrate what i’m talking about:

Table 1 = Table 2

123450 = 00123450
12345a = 12345
10012345 = 12345MD
0012345 = M1234500

My current solution is to create a series of nested Joins and reformats that cover each of the common variances. This means I have a lot of join statements that can take up a lot of time that I have a feeling could be done more simply and faster. I also am working to fix all the variances before they get into the two tables that are being compared, but that is going to be a very long process that ultimately might not be successful because of how we get the information.

Any suggestions or guidance would be appreciated.

thanks,
Mike

Hi Michael,

well it depends a bit on what version of CloverETL do you have. I believe this component was deprecated for a moment but in later versions reinstated.

Anyway, what you should have as well (behaves similarly and is easier to configure) is this: http://doc.cloveretl.com/documentation/UserGuide/topic/com.cloveretl.gui.docs/docs/aspell-lookup-table.html. The question is, about how many records are we talking about here…? Keep in mind, both component and lookup uses memory to store “slave” records (the ones you’re joining against).

Hi Mike,
as the_goat already mentioned, the Aspell Lookup feature looks like a good candidate for your case (in consideration of the memory consumption as stated above). Let me provide you with an example project where the usage of the Aspell Lookup is demonstrated.

  • The idea is to first configure the Aspell Lookup table, namely its source file (Data file URL) which would be one of the 2 input files of yours, its metadata and lookup field key.

  • In the graph itself, you would read your second input file and utilize a LookupJoin component for joining it with the lookup table (your first input file).

  • The key feature of the Aspell Lookup can be configured (again in the lookup edit mode) by defining the Spelling threshold and Costs. The higher the threshold, the more tolerant is the component to spelling errors. Words with this value higher than the specified limit are not included in the results.
    In this example, I set the Spelling threshold to 60 and the cost of every operation to 20. For instance, when Clover tries to join the first master record “AAA1234567890” with the slave record “1234567890” it needs to perform 3 INSERT operations on the slave record (insert “A” character 3 times). As the cost of the INSERT operation is 20, the total cost of the used operations is 60 which is exactly within the given threshold so the records are joined successfully.
    When Clover tries to join the second master record “001122334455_1” to the slave record “1122334455” it needs to perform 4 INSERT operations on the slave record. As the cost of the INSERT operation is 20, the total cost of the used operations is 80 which is above the given threshold and the records are not joined.
    You can try changing the Spelling threshold to various values (20, 40, 80 and 120) and see how the results change.

Kind regards,