Performing Fuzzy Table Joins#
Introduction#
When merging data from different sources, we often need to match records whose names have slightly different spellings. Instead of requiring an exact match, we can perform a fuzzy match to find matching records from two different tables that have slight variations. The Python ecosystem offers an excellent fuzzy-matching package, TheFuzz (formerly FuzzyWuzzy). We use it along with GeoPandas to do a fuzzy table join. The tutorial also covers advanced data-cleaning techniques required for working with messy real-world datasets.
Overview of the task#
We will be working with two data layers for Sri Lanka. Given a shapefile of Admin4 regions and a CSV file containing division-wise population statistics, we will learn how to merge them to display these indicators on a map.
Input Layers:
lka_admbnda_adm4_slsd_20220816.shp
: A shapefile of all Grama Niladhari (GN) Divisions (Admin Level 4) of Sri Lanka.
GN_Division_Age_Group_of_Population.csv
: Age-wise population counts for all GN Divisions of Sri Lanka.
Output:
srilanka.gpkg
: A GeoPackage containing the merged dataset.
Data Credit:
Sri Lanka Population Statistics: Department of Census and Statistics - Sri Lanka. Downloaded from the [DCS Map Portal](http://map.statistics.gov.lk/).
Sri Lanka - Subnational Administrative Boundaries: Sri Lanka administrative levels 0-4 shapefiles and gazetteer. Downloaded from HDX portal.
Setup and Data Download#
The following blocks of code will install the required packages and download the datasets to your Colab environment.
%%capture
if 'google.colab' in str(get_ipython()):
    !pip install thefuzz
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from thefuzz import process
data_folder = 'data'
output_folder = 'output'
if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)
def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    if not os.path.exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
data_url = 'https://github.com/spatialthoughts/geopython-tutorials/releases/download/data/'
shapefile = 'lka_admbnda_adm4_slsd_20220816'
exts = ['.shp', '.shx', '.dbf', '.prj']
csv_file = 'GN_Division_Age_Group_of_Population.csv'
for ext in exts:
    download(data_url + shapefile + ext)
download(data_url + csv_file)
Procedure#
Pre-Process Data Table#
The data table containing population counts is organized as one row per GN Division. We do not have an identifier that uniquely matches a GN Division. Since GN Division names are not unique across different provinces, we must create a unique identifier to be able to match records against another data source.
csv_path = os.path.join(data_folder, csv_file)
df = pd.read_csv(csv_path)
df.iloc[:, :5]
GN Division | DS Division | District | Province | Total Population | |
---|---|---|---|---|---|
0 | Sammanthranapura | Colombo | Colombo | Western | 7829 |
1 | Mattakkuliya | Colombo | Colombo | Western | 28003 |
2 | Modara | Colombo | Colombo | Western | 17757 |
3 | Madampitiya | Colombo | Colombo | Western | 12970 |
4 | Mahawatta | Colombo | Colombo | Western | 8809 |
... | ... | ... | ... | ... | ... |
13984 | Demeda | Deraniyagala | Kegalle | Sabaragamuwa | 971 |
13985 | Kosgahakanda | Deraniyagala | Kegalle | Sabaragamuwa | 2336 |
13986 | Yatiwala | Deraniyagala | Kegalle | Sabaragamuwa | 1882 |
13987 | Magala | Deraniyagala | Kegalle | Sabaragamuwa | 1076 |
13988 | Dikellakanda | Deraniyagala | Kegalle | Sabaragamuwa | 684 |
13989 rows × 5 columns
GN Division names are not unique, so they can’t be used as identifiers for a join.
df[df['GN Division'].duplicated(keep=False)].sort_values('GN Division').iloc[:, :5]
GN Division | DS Division | District | Province | Total Population | |
---|---|---|---|---|---|
8944 | Abakolawewa North | Ambanpola | Kurunegala | North Western | 561 |
6521 | Abakolawewa North | Weeraketiya | Hambantota | Southern | 648 |
8954 | Abakolawewa South | Ambanpola | Kurunegala | North Western | 488 |
6520 | Abakolawewa South | Weeraketiya | Hambantota | Southern | 322 |
6303 | Abhayapura | Lunugamvehera | Hambantota | Southern | 624 |
... | ... | ... | ... | ... | ... |
2375 | Yatiyana East | Agalawatta | Kalutara | Western | 1493 |
978 | Yatiyana East | Minuwangoda | Gampaha | Western | 2004 |
13634 | Yodhagama | Aranayaka | Kegalle | Sabaragamuwa | 760 |
13493 | Yodhagama | Rambukkana | Kegalle | Sabaragamuwa | 1421 |
13371 | Yodhagama | Embilipitiya | Ratnapura | Sabaragamuwa | 5657 |
2511 rows × 5 columns
Let’s create a unique identifier by merging the Admin4, Admin3, Admin2 and Admin1 names. We also remove whitespace and convert the names to lowercase to make the match case-insensitive.
df_processed = df.copy()
gn_division = df_processed['GN Division'].str.lower().str.replace(' ', '')
ds_division = df_processed['DS Division'].str.lower().str.replace(' ', '')
district = df_processed['District'].str.lower().str.replace(' ', '')
province = df_processed['Province'].str.lower().str.replace(' ', '')
df_processed['joinkey'] = gn_division + ds_division + district + province
df_processed[['GN Division', 'DS Division', 'District', 'Province', 'joinkey']]
GN Division | DS Division | District | Province | joinkey | |
---|---|---|---|---|---|
0 | Sammanthranapura | Colombo | Colombo | Western | sammanthranapuracolombocolombowestern |
1 | Mattakkuliya | Colombo | Colombo | Western | mattakkuliyacolombocolombowestern |
2 | Modara | Colombo | Colombo | Western | modaracolombocolombowestern |
3 | Madampitiya | Colombo | Colombo | Western | madampitiyacolombocolombowestern |
4 | Mahawatta | Colombo | Colombo | Western | mahawattacolombocolombowestern |
... | ... | ... | ... | ... | ... |
13984 | Demeda | Deraniyagala | Kegalle | Sabaragamuwa | demedaderaniyagalakegallesabaragamuwa |
13985 | Kosgahakanda | Deraniyagala | Kegalle | Sabaragamuwa | kosgahakandaderaniyagalakegallesabaragamuwa |
13986 | Yatiwala | Deraniyagala | Kegalle | Sabaragamuwa | yatiwaladeraniyagalakegallesabaragamuwa |
13987 | Magala | Deraniyagala | Kegalle | Sabaragamuwa | magaladeraniyagalakegallesabaragamuwa |
13988 | Dikellakanda | Deraniyagala | Kegalle | Sabaragamuwa | dikellakandaderaniyagalakegallesabaragamuwa |
13989 rows × 5 columns
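Note that `str.replace(' ', '')` removes only spaces. If your names contain other punctuation (hyphens, slashes), a regex-based variant — a sketch on made-up names, not used in this tutorial — strips everything except letters and digits:

```python
import pandas as pd

names = pd.Series(['Yatiyana East', '041/2 Kanuwa'])
# Lowercase, then drop every character that is not a letter or digit
keys = names.str.lower().str.replace(r'[^a-z0-9]', '', regex=True)
print(keys.tolist())  # ['yatiyanaeast', '0412kanuwa']
```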
We still have some duplicates.
duplicate_df = df_processed[df_processed.duplicated('joinkey', keep=False)] \
.sort_values('joinkey')
duplicate_df[['GN Division', 'DS Division', 'District', 'Province', 'joinkey']].head()
GN Division | DS Division | District | Province | joinkey | |
---|---|---|---|---|---|
10341 | Ahugoda West | Polgahawela | Kurunegala | North Western | ahugodawestpolgahawelakurunegalanorthwestern |
10335 | Ahugoda West | Polgahawela | Kurunegala | North Western | ahugodawestpolgahawelakurunegalanorthwestern |
10694 | Aluthwatta | Chilaw | Puttalam | North Western | aluthwattachilawputtalamnorthwestern |
10693 | Aluthwatta | Chilaw | Puttalam | North Western | aluthwattachilawputtalamnorthwestern |
2781 | Dewahandiya East | Udadumbara | Kandy | Central | dewahandiyaeastudadumbarakandycentral |
Since these rows refer to the same region, we merge the data for duplicate Admin4 regions. While merging the rows, we sum the population counts from each row having the same key.
df_processed.columns
Index(['GN Division', 'DS Division', 'District', 'Province',
'Total Population', 'Less than 10', '10 ~ 19', '20 ~ 29', '30 ~ 39',
'40 ~ 49', '50 ~ 59', '60 ~ 69', '70 ~ 79', '80 ~ 89', '90 and above',
'joinkey'],
dtype='object')
df_dissolved = df_processed.groupby(['joinkey'], as_index=False).agg({
    'GN Division': 'first',
    'DS Division': 'first',
    'District': 'first',
    'Province': 'first',
    'Total Population': 'sum',
    'Less than 10': 'sum',
    '10 ~ 19': 'sum',
    '20 ~ 29': 'sum',
    '30 ~ 39': 'sum',
    '40 ~ 49': 'sum',
    '50 ~ 59': 'sum',
    '60 ~ 69': 'sum',
    '70 ~ 79': 'sum',
    '80 ~ 89': 'sum',
    '90 and above': 'sum'
})
df_dissolved = df_dissolved[[
    'joinkey', 'Total Population', 'Less than 10',
    '10 ~ 19', '20 ~ 29', '30 ~ 39', '40 ~ 49',
    '50 ~ 59', '60 ~ 69', '70 ~ 79', '80 ~ 89',
    '90 and above']]
print('Total Records of original table', len(df))
print('Total Records of processed table', len(df_dissolved))
Total Records of original table 13989
Total Records of processed table 13973
Pre-Process Shapefile#
Our shapefile comes from a different source and thus has a different structure. We need to match each feature with a record from the data table. We process the data to create a similar ‘joinkey’ and remove duplicates.
shapefile_path = os.path.join(data_folder, shapefile + '.shp')
gdf = gpd.read_file(shapefile_path, encoding='utf-8')
gdf.iloc[:, :5]
Shape_Leng | Shape_Area | ADM4_EN | ADM4_SI | ADM4_TA | |
---|---|---|---|---|---|
0 | 0.019689 | 0.000015 | Sammanthranapura | සම්මන්ත්රණපුර | None |
1 | 0.072878 | 0.000177 | Mattakkuliya | මට්ටක්කුලිය | None |
2 | 0.052290 | 0.000064 | Modara | මෝදර | None |
3 | 0.019201 | 0.000023 | Madampitiya | මාදම්පිටිය | None |
4 | 0.033889 | 0.000042 | Mahawatta | මහවත්ත | None |
... | ... | ... | ... | ... | ... |
14038 | 0.149196 | 0.000260 | Demeda | දෙමැද | None |
14039 | 0.143733 | 0.000765 | Kosgahakanda | කොස්ගහකන්ද | None |
14040 | 0.122895 | 0.000583 | Yatiwala | යටිවල | None |
14041 | 0.246448 | 0.001756 | Magala | මාගල | None |
14042 | 0.275330 | 0.002457 | Dikellakanda | දික්ඇල්ලකන්ද | None |
14043 rows × 5 columns
gdf_processed = gdf.copy()
gdf_processed = gdf_processed.rename(columns={
    'ADM4_EN': 'GN Division',
    'ADM3_EN': 'DS Division',
    'ADM2_EN': 'District',
    'ADM1_EN': 'Province'})
gn_division = gdf_processed['GN Division'].str.lower().str.replace(' ', '')
ds_division = gdf_processed['DS Division'].str.lower().str.replace(' ', '')
district = gdf_processed['District'].str.lower().str.replace(' ', '')
province = gdf_processed['Province'].str.lower().str.replace(' ', '')
gdf_processed['joinkey'] = gn_division + ds_division + district + province
gdf_processed = gdf_processed[
['GN Division', 'DS Division', 'District', 'Province', 'joinkey', 'geometry']]
There are some duplicate polygons having the same ‘joinkey’.
duplicate_gdf = gdf_processed[gdf_processed.duplicated('joinkey', keep=False)] \
.sort_values('joinkey')
duplicate_gdf[['GN Division', 'DS Division', 'District', 'Province', 'joinkey']]
GN Division | DS Division | District | Province | joinkey | |
---|---|---|---|---|---|
7879 | [unknown] | Eravur Pattu | Batticaloa | Eastern | [unknown]eravurpattubatticaloaeastern |
7877 | [unknown] | Eravur Pattu | Batticaloa | Eastern | [unknown]eravurpattubatticaloaeastern |
7876 | [unknown] | Eravur Pattu | Batticaloa | Eastern | [unknown]eravurpattubatticaloaeastern |
7875 | [unknown] | Eravur Pattu | Batticaloa | Eastern | [unknown]eravurpattubatticaloaeastern |
7838 | [unknown] | Eravur Pattu | Batticaloa | Eastern | [unknown]eravurpattubatticaloaeastern |
... | ... | ... | ... | ... | ... |
2455 | Miriswatta | Walallawita | Kalutara | Western | miriswattawalallawitakalutarawestern |
1389 | Sapugasthenna | Attanagalla | Gampaha | Western | sapugasthennaattanagallagampahawestern |
1334 | Sapugasthenna | Attanagalla | Gampaha | Western | sapugasthennaattanagallagampahawestern |
10612 | Udappuwa | Mundel | Puttalam | North Western | udappuwamundelputtalamnorthwestern |
10613 | Udappuwa | Mundel | Puttalam | North Western | udappuwamundelputtalamnorthwestern |
73 rows × 5 columns
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(10,10)
# Plot all features and highlight duplicates
gdf_processed.plot(ax=ax, facecolor='none', edgecolor='#bdbdbd', linewidth=0.2)
duplicate_gdf.plot(
'joinkey', ax=ax, categorical=True, edgecolor='#ffffff', linewidth=0.5)
# Zoom to a specific region
selected = gdf_processed[gdf_processed['DS Division'] == 'Mundel']
minx, miny, maxx, maxy = selected.total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
ax.set_axis_off()
ax.set_title('Duplicate Polygons', size = 18)
plt.show()

We dissolve (merge) these polygon boundaries to create a single region from multiple polygons having the same attributes.
gdf_dissolved = gdf_processed.dissolve(by='joinkey', aggfunc={
    'GN Division': 'first',
    'DS Division': 'first',
    'District': 'first',
    'Province': 'first'})
gdf_dissolved = gdf_dissolved.reset_index()
print('Total Records of original shapefile', len(gdf))
print('Total Records of processed shapefile', len(gdf_dissolved))
Total Records of original shapefile 14043
Total Records of processed shapefile 13987
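The effect of `dissolve()` can be seen on a minimal made-up example: two adjacent squares sharing the same key become a single polygon, with attributes aggregated as specified by `aggfunc`:

```python
import geopandas as gpd
from shapely.geometry import box

# Two unit squares with the same key and different attribute values
toy = gpd.GeoDataFrame(
    {'key': ['a', 'a'], 'pop': [10, 20]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)])
dissolved = toy.dissolve(by='key', aggfunc={'pop': 'sum'})
print(len(dissolved))                   # 1
print(dissolved.loc['a', 'pop'])        # 30
print(dissolved.geometry.iloc[0].area)  # 2.0
```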
Table Join#
Finally, we do a table join using the GeoPandas merge() method.
We now have an almost equal count of features between the shapefile and the table. There are still a few polygons that do not have corresponding population counts; they will be removed from the output layer.
print('Total Records of processed shapefile', len(gdf_dissolved))
print('Total Records of processed table', len(df_dissolved))
Total Records of processed shapefile 13987
Total Records of processed table 13973
merged = gdf_dissolved.merge(df_dissolved, on='joinkey')
print('Total Records of merged shapefile', len(merged))
Total Records of merged shapefile 10747
There are still many records that do not have an exact match between the shapefile and the population table. This is due to small differences in their spellings. We can instead try a fuzzy match.
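To inspect which records failed to match, pandas' `merge()` accepts `how='left'` and `indicator=True`. A sketch on made-up keys (not the tutorial's data):

```python
import pandas as pd

left = pd.DataFrame({'joinkey': ['colombo', 'kandy', 'galle']})
right = pd.DataFrame({'joinkey': ['colombo', 'kandi'], 'pop': [100, 50]})
# An inner merge would silently drop unmatched rows;
# indicator=True adds a '_merge' column showing each row's fate
merged = left.merge(right, on='joinkey', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']['joinkey'].tolist()
print(unmatched)  # ['kandy', 'galle']
```

The `left_only` rows are exactly the candidates for a fuzzy match.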
Fuzzy Join#
Let’s perform the fuzzy match. Fuzzy matching on large tables can take a long time, so we use the `tqdm` package to track the progress.
gdf_dissolved.rename(columns = {'joinkey':'joinkey_exact'}, inplace=True)
from tqdm import tqdm
def getFuzzyMatch(row):
    joinkey = row['joinkey_exact']
    results = process.extractOne(joinkey, df_dissolved.joinkey)
    match = results[0]
    score = results[1]
    return match, score
tqdm.pandas()
gdf_dissolved[['match', 'score']] = gdf_dissolved.progress_apply(
getFuzzyMatch, axis=1, result_type='expand')
100%|██████████| 13987/13987 [07:50<00:00, 29.71it/s]
The matches with scores less than 100 are fuzzy matches.
gdf_dissolved[gdf_dissolved['score'] != 100] \
[['joinkey_exact', 'match', 'score']] \
.sort_values('score', ascending=False)
joinkey_exact | match | score | |
---|---|---|---|
0 | 041/2kanuwarajanganayaanuradhapuranorthcentral | 41/2kanuwarajanganayaanuradhapuranorthcentral | 99 |
10708 | puthukudiyiruppuwestputhukkudiyiruppumullaitiv... | puthukkudiyiruppuwestputhukkudiyiruppumullaiti... | 99 |
3936 | hingurakgodahigurakgodapolonnaruwanorthcentral | hingurakgodahingurakgodapolonnaruwanorthcentral | 99 |
3933 | hingurakahigurakgodapolonnaruwanorthcentral | hingurakahingurakgodapolonnaruwanorthcentral | 99 |
3932 | hingurakabandaragamahigurakgodapolonnaruwanort... | hingurakabandaragamahingurakgodapolonnaruwanor... | 99 |
... | ... | ... | ... |
12459 | udumullamadampagamagallesouthern | udumullahikkaduwagallesouthern | 77 |
6803 | loinonnorwoodnuwaraeliyacentral | loinonambagamuwanuwaraeliyacentral | 77 |
8 | [unknown]ambangangamatalecentral | hunuketeambangangakoralematalecentral | 76 |
4473 | injusternorwoodnuwaraeliyacentral | injustryambagamuwanuwaraeliyacentral | 75 |
8969 | newvalleygamanorwoodnuwaraeliyacentral | manelwalawalapanenuwaraeliyacentral | 74 |
3236 rows × 3 columns
We can use all matches with a score above a threshold. You need to use your discretion and local knowledge to determine the appropriate threshold. Here we see all matches with a low score.
score_threshold = 90
gdf_dissolved[gdf_dissolved['score'] < score_threshold] \
.sort_values('score') \
[['joinkey_exact', 'match', 'score']]
joinkey_exact | match | score | |
---|---|---|---|
8969 | newvalleygamanorwoodnuwaraeliyacentral | manelwalawalapanenuwaraeliyacentral | 74 |
4473 | injusternorwoodnuwaraeliyacentral | injustryambagamuwanuwaraeliyacentral | 75 |
8 | [unknown]ambangangamatalecentral | hunuketeambangangakoralematalecentral | 76 |
5527 | kawattikudatown&gravetstrincomaleeeastern | kavaddikkudatrincomaleetownandgravetstrincomal... | 77 |
9812 | panmurnorwoodnuwaraeliyacentral | panmurambagamuwanuwaraeliyacentral | 77 |
... | ... | ... | ... |
13575 | welipathayayakalthotaratnapurasabaragamuwa | welipathayayabalangodaratnapurasabaragamuwa | 89 |
20 | [unknown]manmunaisouth&eruvilpattubatticaloaea... | mankadumanmunaisouth&eruvilpattubatticaloaeastern | 89 |
5853 | kirimichchaikoralaipattunorthbatticaloaeastern | kirimichaikoralaipattunorth(vaharai)batticaloa... | 89 |
5131 | kankeyanodaisouthmanmunaipattubatticaloaeastern | kankeyanodaisouthmanmunaipattu(araipattai)batt... | 89 |
8715 | narangamuwalaggalamatalecentral | narangamuwamatalematalecentral | 89 |
550 rows × 3 columns
Save the high-score matches as `joinkey`.
gdf_dissolved['joinkey'] = np.where(
    gdf_dissolved['score'] >= score_threshold,
    gdf_dissolved['match'],
    '')
Use the new `joinkey` column, which contains the fuzzy matches from the `df_dissolved` table. Now we will get many more matching records.
merged = gdf_dissolved.merge(df_dissolved, on='joinkey')
print('Total Records of merged shapefile', len(merged))
Total Records of merged shapefile 13437
Save the datasets#
output_dir = 'output'
output_filename = 'srilanka.gpkg'
output_path = os.path.join(output_dir, output_filename)
gdf_dissolved.to_file(driver='GPKG', filename=output_path, layer='admin4polygons', encoding='utf-8')
merged.to_file(driver='GPKG', filename=output_path, layer='merged', encoding='utf-8')
print('Successfully written output file at {}'.format(output_path))
Successfully written output file at output/srilanka.gpkg
If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)