Performing Fuzzy Table Joins#

Introduction#

When merging data from different sources, you often need to match records whose names have slightly different spellings. Instead of requiring an exact match, we can perform a fuzzy match to find matching records from two different tables that have slight variations. The Python ecosystem offers an excellent fuzzy-matching package, TheFuzz (formerly FuzzyWuzzy). We use it along with GeoPandas to do a fuzzy table join. The tutorial also covers advanced data-cleaning techniques required for working with messy real-world datasets.
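To get a feel for the scores involved, here is a minimal sketch using only the standard library's difflib. TheFuzz produces comparable 0-100 scores, though the exact numbers can differ slightly depending on its backend; the names below are hypothetical examples.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score, comparable to a fuzzy-match score."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

# A one-letter spelling difference still scores high
print(similarity('higurakgoda', 'hingurakgoda'))  # → 96
# Unrelated names score low
print(similarity('colombo', 'kandy'))             # → 0
```

A near-miss spelling scores in the high 90s while unrelated names score near zero; this is the property the fuzzy join in this tutorial relies on.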

Overview of the task#

We will be working with two data layers for Sri Lanka. Given a shapefile of Admin Level 4 regions and a CSV file containing division-wise population statistics, we will learn how to merge them to display these indicators on a map.

Input Layers:

  • lka_admbnda_adm4_slsd_20220816.shp: A shapefile of all Grama Niladhari (GN) Divisions (Admin Level 4) of Sri Lanka.

  • GN_Division_Age_Group_of_Population.csv: Age-wise population for all GN Divisions of Sri Lanka.

Output:

  • srilanka.gpkg: A GeoPackage containing the merged dataset.

Data Credit:

  • Sri Lanka Population Statistics: Department of Census and Statistics - Sri Lanka. Downloaded from the DCS Map Portal (http://map.statistics.gov.lk/).

  • Sri Lanka - Subnational Administrative Boundaries: Sri Lanka administrative levels 0-4 shapefiles and gazetteer. Downloaded from HDX portal.

Setup and Data Download#

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install thefuzz
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from thefuzz import process
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)
def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    if not os.path.exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

data_url = 'https://github.com/spatialthoughts/geopython-tutorials/releases/download/data/'

shapefile = 'lka_admbnda_adm4_slsd_20220816'
exts = ['.shp', '.shx', '.dbf', '.prj']
csv_file = 'GN_Division_Age_Group_of_Population.csv'

for ext in exts:
    download(data_url + shapefile + ext)

download(data_url + csv_file)

Procedure#

Pre-Process Data Table#

The data table containing population counts is organized as one row per GN Division, but it has no identifier that uniquely identifies a GN Division. Since GN Division names are not unique among different provinces, we must create a unique identifier to be able to match records against another data source.

csv_path = os.path.join(data_folder, csv_file)
df = pd.read_csv(csv_path)
df.iloc[:, :5]
GN Division DS Division District Province Total Population
0 Sammanthranapura Colombo Colombo Western 7829
1 Mattakkuliya Colombo Colombo Western 28003
2 Modara Colombo Colombo Western 17757
3 Madampitiya Colombo Colombo Western 12970
4 Mahawatta Colombo Colombo Western 8809
... ... ... ... ... ...
13984 Demeda Deraniyagala Kegalle Sabaragamuwa 971
13985 Kosgahakanda Deraniyagala Kegalle Sabaragamuwa 2336
13986 Yatiwala Deraniyagala Kegalle Sabaragamuwa 1882
13987 Magala Deraniyagala Kegalle Sabaragamuwa 1076
13988 Dikellakanda Deraniyagala Kegalle Sabaragamuwa 684

13989 rows × 5 columns

GN Division names are not unique, so they can’t be used as identifiers for a join.

df[df['GN Division'].duplicated(keep=False)].sort_values('GN Division').iloc[:, :5]
GN Division DS Division District Province Total Population
8944 Abakolawewa North Ambanpola Kurunegala North Western 561
6521 Abakolawewa North Weeraketiya Hambantota Southern 648
8954 Abakolawewa South Ambanpola Kurunegala North Western 488
6520 Abakolawewa South Weeraketiya Hambantota Southern 322
6303 Abhayapura Lunugamvehera Hambantota Southern 624
... ... ... ... ... ...
2375 Yatiyana East Agalawatta Kalutara Western 1493
978 Yatiyana East Minuwangoda Gampaha Western 2004
13634 Yodhagama Aranayaka Kegalle Sabaragamuwa 760
13493 Yodhagama Rambukkana Kegalle Sabaragamuwa 1421
13371 Yodhagama Embilipitiya Ratnapura Sabaragamuwa 5657

2511 rows × 5 columns

Let’s create a unique identifier by merging the Admin4, Admin3, Admin2 and Admin1 names. We also remove whitespace and convert the names to lowercase so the match is case-insensitive.

df_processed = df.copy()
gn_division = df_processed['GN Division'].str.lower().str.replace(' ', '')
ds_division = df_processed['DS Division'].str.lower().str.replace(' ', '')
district = df_processed['District'].str.lower().str.replace(' ', '')
province = df_processed['Province'].str.lower().str.replace(' ', '')

df_processed['joinkey'] = gn_division + ds_division + district + province
df_processed[['GN Division', 'DS Division', 'District', 'Province', 'joinkey']]
GN Division DS Division District Province joinkey
0 Sammanthranapura Colombo Colombo Western sammanthranapuracolombocolombowestern
1 Mattakkuliya Colombo Colombo Western mattakkuliyacolombocolombowestern
2 Modara Colombo Colombo Western modaracolombocolombowestern
3 Madampitiya Colombo Colombo Western madampitiyacolombocolombowestern
4 Mahawatta Colombo Colombo Western mahawattacolombocolombowestern
... ... ... ... ... ...
13984 Demeda Deraniyagala Kegalle Sabaragamuwa demedaderaniyagalakegallesabaragamuwa
13985 Kosgahakanda Deraniyagala Kegalle Sabaragamuwa kosgahakandaderaniyagalakegallesabaragamuwa
13986 Yatiwala Deraniyagala Kegalle Sabaragamuwa yatiwaladeraniyagalakegallesabaragamuwa
13987 Magala Deraniyagala Kegalle Sabaragamuwa magaladeraniyagalakegallesabaragamuwa
13988 Dikellakanda Deraniyagala Kegalle Sabaragamuwa dikellakandaderaniyagalakegallesabaragamuwa

13989 rows × 5 columns

We still have some duplicates.

duplicate_df = df_processed[df_processed.duplicated('joinkey', keep=False)] \
  .sort_values('joinkey')
duplicate_df[['GN Division', 'DS Division', 'District', 'Province', 'joinkey']].head()
GN Division DS Division District Province joinkey
10341 Ahugoda West Polgahawela Kurunegala North Western ahugodawestpolgahawelakurunegalanorthwestern
10335 Ahugoda West Polgahawela Kurunegala North Western ahugodawestpolgahawelakurunegalanorthwestern
10694 Aluthwatta Chilaw Puttalam North Western aluthwattachilawputtalamnorthwestern
10693 Aluthwatta Chilaw Puttalam North Western aluthwattachilawputtalamnorthwestern
2781 Dewahandiya East Udadumbara Kandy Central dewahandiyaeastudadumbarakandycentral

Since these rows refer to the same region, we merge the data for duplicate Admin4 regions. While merging the rows, we sum the population counts from each row having the same key.

df_processed.columns
Index(['GN Division', 'DS Division', 'District', 'Province',
       'Total Population', 'Less than 10', '10 ~ 19', '20 ~ 29', '30 ~ 39',
       '40 ~ 49', '50 ~ 59', '60 ~ 69', '70 ~ 79', '80 ~ 89', '90 and above',
       'joinkey'],
      dtype='object')
df_dissolved = df_processed.groupby(['joinkey'],as_index=False).agg({
    'GN Division': 'first',
    'DS Division': 'first',
    'District': 'first',
    'Province': 'first',
    'Total Population': 'sum',
    'Less than 10': 'sum',
    '10 ~ 19': 'sum',
    '20 ~ 29': 'sum',
    '30 ~ 39': 'sum',
    '40 ~ 49': 'sum',
    '50 ~ 59': 'sum',
    '60 ~ 69': 'sum',
    '70 ~ 79': 'sum',
    '80 ~ 89': 'sum',
    '90 and above': 'sum'
})
df_dissolved = df_dissolved[[
    'joinkey', 'Total Population', 'Less than 10',
    '10 ~ 19', '20 ~ 29', '30 ~ 39','40 ~ 49',
    '50 ~ 59', '60 ~ 69', '70 ~ 79', '80 ~ 89',
    '90 and above']]
print('Total Records of original table', len(df))
print('Total Records of processed table', len(df_dissolved))
Total Records of original table 13989
Total Records of processed table 13973
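Listing every column in the `agg` dictionary works, but the same dictionary can also be built programmatically: 'sum' for numeric columns and 'first' for everything else. A minimal sketch with a hypothetical toy table in place of the real one:

```python
import pandas as pd

# Hypothetical toy table with a duplicated join key
df = pd.DataFrame({
    'joinkey': ['ahugodawest', 'ahugodawest', 'aluthwatta'],
    'GN Division': ['Ahugoda West', 'Ahugoda West', 'Aluthwatta'],
    'Total Population': [561, 488, 200],
})

# 'sum' for numeric columns, 'first' for text columns
aggs = {col: 'sum' if pd.api.types.is_numeric_dtype(df[col]) else 'first'
        for col in df.columns if col != 'joinkey'}

dissolved = df.groupby('joinkey', as_index=False).agg(aggs)
print(dissolved)
```

This avoids having to keep the aggregation dictionary in sync with the table's columns by hand.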

Pre-Process Shapefile#

Our shapefile comes from a different source and thus has a different structure. We need to match each feature with a record from the data table. We process the data to create a similar ‘joinkey’ and remove duplicates.

shapefile_path = os.path.join(data_folder, shapefile + '.shp')
gdf = gpd.read_file(shapefile_path, encoding='utf-8')
gdf.iloc[:, :5]
Shape_Leng Shape_Area ADM4_EN ADM4_SI ADM4_TA
0 0.019689 0.000015 Sammanthranapura සම්මන්ත්‍රණපුර None
1 0.072878 0.000177 Mattakkuliya මට්ටක්කුලිය None
2 0.052290 0.000064 Modara මෝදර None
3 0.019201 0.000023 Madampitiya මාදම්පිටිය None
4 0.033889 0.000042 Mahawatta මහවත්ත None
... ... ... ... ... ...
14038 0.149196 0.000260 Demeda දෙමැද None
14039 0.143733 0.000765 Kosgahakanda කොස්ගහකන්ද None
14040 0.122895 0.000583 Yatiwala යටිවල None
14041 0.246448 0.001756 Magala මාගල None
14042 0.275330 0.002457 Dikellakanda දික්ඇල්ලකන්ද None

14043 rows × 5 columns

gdf_processed = gdf.copy()
gdf_processed = gdf_processed.rename(columns = {
    'ADM4_EN': 'GN Division',
    'ADM3_EN': 'DS Division',
    'ADM2_EN': 'District',
    'ADM1_EN': 'Province'})

gn_division = gdf_processed['GN Division'].str.lower().str.replace(' ', '')
ds_division = gdf_processed['DS Division'].str.lower().str.replace(' ', '')
district = gdf_processed['District'].str.lower().str.replace(' ', '')
province = gdf_processed['Province'].str.lower().str.replace(' ', '')

gdf_processed['joinkey'] = gn_division + ds_division + district + province
gdf_processed = gdf_processed[
    ['GN Division', 'DS Division', 'District', 'Province', 'joinkey', 'geometry']]

There are some duplicate polygons having the same ‘joinkey’.

duplicate_gdf = gdf_processed[gdf_processed.duplicated('joinkey', keep=False)] \
  .sort_values('joinkey')
duplicate_gdf[['GN Division', 'DS Division', 'District', 'Province', 'joinkey']]
GN Division DS Division District Province joinkey
7879 [unknown] Eravur Pattu Batticaloa Eastern [unknown]eravurpattubatticaloaeastern
7877 [unknown] Eravur Pattu Batticaloa Eastern [unknown]eravurpattubatticaloaeastern
7876 [unknown] Eravur Pattu Batticaloa Eastern [unknown]eravurpattubatticaloaeastern
7875 [unknown] Eravur Pattu Batticaloa Eastern [unknown]eravurpattubatticaloaeastern
7838 [unknown] Eravur Pattu Batticaloa Eastern [unknown]eravurpattubatticaloaeastern
... ... ... ... ... ...
2455 Miriswatta Walallawita Kalutara Western miriswattawalallawitakalutarawestern
1389 Sapugasthenna Attanagalla Gampaha Western sapugasthennaattanagallagampahawestern
1334 Sapugasthenna Attanagalla Gampaha Western sapugasthennaattanagallagampahawestern
10612 Udappuwa Mundel Puttalam North Western udappuwamundelputtalamnorthwestern
10613 Udappuwa Mundel Puttalam North Western udappuwamundelputtalamnorthwestern

73 rows × 5 columns

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(10,10)

# Plot all features and highlight duplicates
gdf_processed.plot(ax=ax, facecolor='none', edgecolor='#bdbdbd', linewidth=0.2)
duplicate_gdf.plot(
    'joinkey', ax=ax, categorical=True, edgecolor='#ffffff', linewidth=0.5)

# Zoom to a specific region
selected = gdf_processed[gdf_processed['DS Division'] == 'Mundel']
minx, miny, maxx, maxy = selected.total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)

ax.set_axis_off()
ax.set_title('Duplicate Polygons', size = 18)

plt.show()
[Figure: ‘Duplicate Polygons’ — duplicates highlighted over the Admin4 boundaries, zoomed to the Mundel DS Division]

We dissolve (merge) these polygon boundaries to create a single region from multiple polygons having the same attributes.

gdf_dissolved = gdf_processed.dissolve(by='joinkey', aggfunc={
    'GN Division': 'first',
    'DS Division': 'first',
    'District': 'first',
    'Province': 'first'})
gdf_dissolved = gdf_dissolved.reset_index()
print('Total Records of original shapefile', len(gdf))
print('Total Records of processed shapefile', len(gdf_dissolved))
Total Records of original shapefile 14043
Total Records of processed shapefile 13987

Table Join#

Finally, we do a table join using the GeoPandas merge() method.

We now have an almost equal count of features between the shapefile and the table. There are still a few polygons that do not have corresponding population counts; they will be removed from the output layer.

print('Total Records of processed shapefile', len(gdf_dissolved))
print('Total Records of processed table', len(df_dissolved))
Total Records of processed shapefile 13987
Total Records of processed table 13973
merged = gdf_dissolved.merge(df_dissolved, on='joinkey')
print('Total Records of merged shapefile', len(merged))
Total Records of merged shapefile 10747

There are still many records that do not have an exact match between the shapefile and the population table, due to small differences in their spellings. We can instead try a fuzzy match.
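Before switching to fuzzy matching, it helps to see which keys failed the exact join; a left merge with `indicator=True` flags them. A sketch with hypothetical keys, where the misspelling mimics the kind of variation found in the real data:

```python
import pandas as pd

# Hypothetical join keys: 'higurakgoda' is a misspelling of 'hingurakgoda'
left = pd.DataFrame({'joinkey': ['modaracolombo', 'higurakgoda']})
right = pd.DataFrame({'joinkey': ['modaracolombo', 'hingurakgoda'],
                      'Total Population': [17757, 5000]})

# indicator=True adds a '_merge' column showing where each row matched
check = left.merge(right, on='joinkey', how='left', indicator=True)
unmatched = check[check['_merge'] == 'left_only']['joinkey'].tolist()
print(unmatched)  # the misspelled key fails the exact match
```

Inspecting the unmatched keys this way is a quick sanity check that the failures really are spelling variations rather than missing regions.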

Fuzzy Join#

Let’s perform the fuzzy match. Fuzzy matching on large tables can take a long time, so we use the tqdm package to track progress.

gdf_dissolved.rename(columns = {'joinkey':'joinkey_exact'}, inplace=True)
from tqdm import tqdm

def getFuzzyMatch(row):
    joinkey = row['joinkey_exact']
    results = process.extractOne(joinkey, df_dissolved.joinkey)
    match = results[0]
    score = results[1]
    return match, score

tqdm.pandas()
gdf_dissolved[['match', 'score']] = gdf_dissolved.progress_apply(
    getFuzzyMatch, axis=1, result_type='expand')
100%|██████████| 13987/13987 [07:50<00:00, 29.71it/s]

The matches with scores less than 100 are fuzzy matches.

gdf_dissolved[gdf_dissolved['score'] != 100] \
 [['joinkey_exact', 'match', 'score']] \
 .sort_values('score', ascending=False)
joinkey_exact match score
0 041/2kanuwarajanganayaanuradhapuranorthcentral 41/2kanuwarajanganayaanuradhapuranorthcentral 99
10708 puthukudiyiruppuwestputhukkudiyiruppumullaitiv... puthukkudiyiruppuwestputhukkudiyiruppumullaiti... 99
3936 hingurakgodahigurakgodapolonnaruwanorthcentral hingurakgodahingurakgodapolonnaruwanorthcentral 99
3933 hingurakahigurakgodapolonnaruwanorthcentral hingurakahingurakgodapolonnaruwanorthcentral 99
3932 hingurakabandaragamahigurakgodapolonnaruwanort... hingurakabandaragamahingurakgodapolonnaruwanor... 99
... ... ... ...
12459 udumullamadampagamagallesouthern udumullahikkaduwagallesouthern 77
6803 loinonnorwoodnuwaraeliyacentral loinonambagamuwanuwaraeliyacentral 77
8 [unknown]ambangangamatalecentral hunuketeambangangakoralematalecentral 76
4473 injusternorwoodnuwaraeliyacentral injustryambagamuwanuwaraeliyacentral 75
8969 newvalleygamanorwoodnuwaraeliyacentral manelwalawalapanenuwaraeliyacentral 74

3236 rows × 3 columns

We can use all matches with a score above a threshold. You need to use your discretion and local knowledge to determine an appropriate threshold. Here we see all matches with a low score.

score_threshold = 90
gdf_dissolved[gdf_dissolved['score'] < score_threshold] \
  .sort_values('score') \
  [['joinkey_exact', 'match', 'score']]
joinkey_exact match score
8969 newvalleygamanorwoodnuwaraeliyacentral manelwalawalapanenuwaraeliyacentral 74
4473 injusternorwoodnuwaraeliyacentral injustryambagamuwanuwaraeliyacentral 75
8 [unknown]ambangangamatalecentral hunuketeambangangakoralematalecentral 76
5527 kawattikudatown&gravetstrincomaleeeastern kavaddikkudatrincomaleetownandgravetstrincomal... 77
9812 panmurnorwoodnuwaraeliyacentral panmurambagamuwanuwaraeliyacentral 77
... ... ... ...
13575 welipathayayakalthotaratnapurasabaragamuwa welipathayayabalangodaratnapurasabaragamuwa 89
20 [unknown]manmunaisouth&eruvilpattubatticaloaea... mankadumanmunaisouth&eruvilpattubatticaloaeastern 89
5853 kirimichchaikoralaipattunorthbatticaloaeastern kirimichaikoralaipattunorth(vaharai)batticaloa... 89
5131 kankeyanodaisouthmanmunaipattubatticaloaeastern kankeyanodaisouthmanmunaipattu(araipattai)batt... 89
8715 narangamuwalaggalamatalecentral narangamuwamatalematalecentral 89

550 rows × 3 columns
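One way to choose a threshold is to tabulate how many matches survive at each cutoff and inspect where dubious matches begin to appear. A sketch using a hypothetical handful of scores in place of the full score column computed above:

```python
import pandas as pd

# Hypothetical stand-in for gdf_dissolved['score']
scores = pd.Series([100, 100, 99, 95, 92, 89, 77, 74])

for threshold in [75, 85, 90, 95, 100]:
    kept = int((scores >= threshold).sum())
    print(f'score >= {threshold}: {kept} matches kept')
```

Plotting or printing this table for the real score column shows the trade-off: a lower threshold keeps more regions but risks joining the wrong ones.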

Save the high-score matches as the joinkey.

gdf_dissolved['joinkey'] = np.where(
    gdf_dissolved['score'] >= score_threshold,
    gdf_dissolved['match'],
    '')

Use the new joinkey, which contains the fuzzy matches from the df_dissolved table. We now get many more matching records.

merged = gdf_dissolved.merge(df_dissolved, on='joinkey')
print('Total Records of merged shapefile', len(merged))
Total Records of merged shapefile 13437

Save the datasets#

output_filename = 'srilanka.gpkg'
output_path = os.path.join(output_folder, output_filename)

gdf_dissolved.to_file(driver='GPKG', filename=output_path, layer='admin4polygons', encoding='utf-8')
merged.to_file(driver='GPKG', filename=output_path, layer='merged', encoding='utf-8')

print('Successfully written output file at {}'.format(output_path))
Successfully written output file at output/srilanka.gpkg
