Record matching in humanitarian data
Posted on Sun 20 September 2020 in How-to
A reoccurring scenario in information management in humanitarian response is one where you have to compile messy data from a number of disparate sources within a very short time window. This post is an attempt to summarise some of the tools and approaches I use to tackle this problem.
Excel: VLOOKUP, XLOOKUP and INDEXMATCH
While I wont cover them here, using VLOOKUP, INDEX MATCH and the newer XLOOKUP in Excel are really useful formula when trying to combine different datasets in Excel. Learning these for the first time really feels like unlocking an analysis superpower.
=XLOOKUP (lookup, lookup_array, return_array)
When using VLOOKUP or XLOOKUP you can shoose to use ‘FALSE” at the end of the formula or ‘TRUE’ to find an approximate match. ‘TRUE(exact match)’ uses a linear search algorithm to step through the values looking for an exact match using a linear search algorithm. ‘FALSE(approximate match)’ is perhaps somewhat misleading, as it doesnt match the closest textual match, but rather, returns the second highest value. It uses a binary search algoritms to continually search at the mid-point value. While this may be faster, it only works on sorted numerical data and if an exact match isnt found it returns the second largest value.
Excel fuzzy matching
This is not terribly useful. What we are looking for is something more powerful, which can help match text with slight spelling discrepencies or can search for matches acrosss multiple columns. The tool we need is called Fuzzy Matching, or more boardly speaking Probabilistic record linkage. This works by assigning a a probability score that the text in one cell matches another, based on a chosen method. Recently Microsoft released a tool to merge tables using fuzzy matching (windows only). This allows you to set the following matching options:
-
Similarity Threshold – This option indicates how similar two values need to be in order to match. The minimum value of 0.00 will cause all values to match each other, and the maximum value of 1.00 will only allow exact matches. The default is 0.80.
-
Ignore case – This option indicates whether text values should be compared in a case sensitive or insensitive setting. The default behavior is case insensitive, which ignores casing.
-
Maximum number of matches – This option controls the maximum number of matching rows that will be returned for each input row. For example, if you only want to find one matching row for each input row, specify a value of 1. The default behavior is to return all matches.
-
Transformation table – This option allows users to specify another query that holds a mapping table, so that some values can be auto-mapped as part of the matching logic. For example, defining a two-column table with a “From” and “To” text columns with values “Microsoft” and “MSFT” will make these two values be considered the same (similarity score of 1.00) by the matching logic.
Record matching using Python
If you want more options for your record linkages, you use macOS, or if you want more control over your workflow, Python is a good option, especially using the fuzzymatcher or Python Record Linkage Toolkit libraries.
fuzzymatcher examples
Here is a basic example from the fuzzymatcher docs linking two tables.
Table A
id | ons_name |
---|---|
0 | Darlington |
1 | Monmouthshire |
2 | Havering |
3 | Knowsley |
4 | Charnwood |
and Table B
id | os_name |
---|---|
0 | Darlington (B) |
1 | Havering London Boro |
2 | Sir Fynwy - Monmouthshire |
3 | Knowsley District (B) |
4 | Charnwood District (B) |
Writing the following
import fuzzymatcher
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on = "ons_name", right_on = "os_name")
gives you:
best_match_score | ons_name | os_name |
---|---|---|
0.178449 | Darlington | Darlington (B) |
0.133371 | Monmouthshire | Sir Fynwy - Monmouthshire |
0.102473 | Havering | Havering London Boro |
0.155775 | Knowsley | Knowsley District (B) |
0.155775 | Charnwood | Charnwood District (B) |