Bulk Geocoding Addresses#
Introduction#
Geocoding is the process of converting addresses to coordinates. It typically involves querying a web-based geocoding service that can understand and map the address to a real-world location. GeoPy is a Python library that offers an easy-to-use methods to use many popular geocoding services. In this tutorial, we will learn how to use GeoPy with Pandas and GeoPandas for bulk geocoding. We cover the following two methods for geocoding:
Geocoding Method 1: Pandas and GeoPy: Our recommended method that offers more flexibility and control over the whole process.
Geocoding Method 2: GeoPandas: A simpler method using GeoPandas’ built-in
geopandas.tools.geocode
function.
Overview of the Task#
We will be working with an Excel sheet containing addresses of Hurricane Evacuation Centers in New York City and geocode these addresses to create a map.
Input Layers:
Hurricane_Evacuation_Centers.xlsx
: A spreadsheet containing names and addresses of hurricane shelters inn NYC.
Output:
hurricane_evacuation_centers.zip
: A zipped shapefile with geocoded locations of the hurricane evacuation centers.
Data Credit:
Setup and Data Download#
The following blocks of code will install the required packages and download the datasets to your Colab environment.
%%capture
if 'google.colab' in str(get_ipython()):
!pip install leafmap mapclassify
import os
import re
import pandas as pd
import geopandas as gpd
import leafmap.foliumap as leafmap
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from zipfile import ZipFile
data_folder = 'data'
output_folder = 'output'
if not os.path.exists(data_folder):
os.mkdir(data_folder)
if not os.path.exists(output_folder):
os.mkdir(output_folder)
def download(url):
filename = os.path.join(data_folder, os.path.basename(url))
if not os.path.exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print('Downloaded ' + local)
data_url = 'https://github.com/spatialthoughts/geopython-tutorials/releases/download/data/'
download(data_url + 'Hurricane_Evacuation_Centers.xlsx')
Downloaded data/Hurricane_Evacuation_Centers.xlsx
Procedure#
Read the Hurricane_Evacuation_Centers.xlsx
file using Pandas. This step requires the openpyxl
package to be installed along with pandas
. The dataset contains 64 addresses of hurricane evacuation centers in New York City.
excel_file = 'Hurricane_Evacuation_Centers.xlsx'
excel_file_path = os.path.join(data_folder, excel_file)
address_df = pd.read_excel(excel_file_path, skiprows=[0])
address_df
CITY | EC_Name | ADDRESS | ZIP_CODE | BOROCODE | STATE | ACCESSIBLE | |
---|---|---|---|---|---|---|---|
0 | Flushing | J.H.S. 185 - Queens | 147-26 25 Drive | 11354 | 4 | NY | Y |
1 | Brooklyn | I.S. 258 - Brooklyn | 141 Macon Street | 11216 | 3 | NY | Y |
2 | Brooklyn | I.S. 88 - Brooklyn | 544 7 Avenue | 11215 | 3 | NY | Y |
3 | Bronx | Walton HS - X | 2780 Reservoir Avenue | 10468 | 2 | NY | Y |
4 | Brooklyn | P.S./I.S.30 Mary White Ovington - Brooklyn | 7002 4 Avenue | 11209 | 3 | NY | Y |
... | ... | ... | ... | ... | ... | ... | ... |
59 | Bronx | I.S. 201 - Bronx | 730 Bryant Avenue | 10474 | 2 | NY | Y |
60 | Bronx | P.S. 46 - Bronx | 2760 Briggs Avenue | 10458 | 2 | NY | Y |
61 | Long Island City | Aviation HS - Q | 45-30 36 Street | 11101 | 4 | NY | Y |
62 | Bronx | P.S. 102 - Bronx | 1827 Archer Street | 10460 | 2 | NY | Y |
63 | Bronx | Morris Heights Educational Complex - Bronx | 1780 Dr. Martin Luther King Jr. Blvd | 10453 | 2 | NY | Y |
64 rows × 7 columns
The Nominatim geocoder expects the addresses with certain format. We fix our source addresses by making the street numbers as ordinal numbers.
Examples
45-30 36 Street
→45-30 36th Street
.544 7 Avenue
→544 7th Avenue
def make_ordinal(match):
n = int(match.group(1))
if 11 <= (n % 100) <= 13:
suffix = 'th'
else:
suffix = ['th', 'st', 'nd', 'rd', 'th'][min(n % 10, 4)]
return str(n) + suffix + match.group(2)
def update_address(row):
old_address = row['ADDRESS']
pattern = r'(\d+)(\s+(?:Street|Avenue|Blvd|Drive))'
result = re.sub(pattern, make_ordinal, old_address)
return result
address_df['ADDRESS_FIXED'] = address_df.apply(update_address, axis=1)
address_df
CITY | EC_Name | ADDRESS | ZIP_CODE | BOROCODE | STATE | ACCESSIBLE | ADDRESS_FIXED | |
---|---|---|---|---|---|---|---|---|
0 | Flushing | J.H.S. 185 - Queens | 147-26 25 Drive | 11354 | 4 | NY | Y | 147-26 25th Drive |
1 | Brooklyn | I.S. 258 - Brooklyn | 141 Macon Street | 11216 | 3 | NY | Y | 141 Macon Street |
2 | Brooklyn | I.S. 88 - Brooklyn | 544 7 Avenue | 11215 | 3 | NY | Y | 544 7th Avenue |
3 | Bronx | Walton HS - X | 2780 Reservoir Avenue | 10468 | 2 | NY | Y | 2780 Reservoir Avenue |
4 | Brooklyn | P.S./I.S.30 Mary White Ovington - Brooklyn | 7002 4 Avenue | 11209 | 3 | NY | Y | 7002 4th Avenue |
... | ... | ... | ... | ... | ... | ... | ... | ... |
59 | Bronx | I.S. 201 - Bronx | 730 Bryant Avenue | 10474 | 2 | NY | Y | 730 Bryant Avenue |
60 | Bronx | P.S. 46 - Bronx | 2760 Briggs Avenue | 10458 | 2 | NY | Y | 2760 Briggs Avenue |
61 | Long Island City | Aviation HS - Q | 45-30 36 Street | 11101 | 4 | NY | Y | 45-30 36th Street |
62 | Bronx | P.S. 102 - Bronx | 1827 Archer Street | 10460 | 2 | NY | Y | 1827 Archer Street |
63 | Bronx | Morris Heights Educational Complex - Bronx | 1780 Dr. Martin Luther King Jr. Blvd | 10453 | 2 | NY | Y | 1780 Dr. Martin Luther King Jr. Blvd |
64 rows × 8 columns
We also create a new column containing the full address. The CITY
column in the data are actually boroughs of NYC and adding it causing many geocoding requests to fail. Instead, we use the city name ‘NYC’ for all addresses.
address_df['Full_Address'] = (
address_df['ADDRESS_FIXED'] + ',' +
'NYC' + ',' +
address_df['STATE']+ ',' +
address_df['ZIP_CODE'].astype(str))
address_df
CITY | EC_Name | ADDRESS | ZIP_CODE | BOROCODE | STATE | ACCESSIBLE | ADDRESS_FIXED | Full_Address | |
---|---|---|---|---|---|---|---|---|---|
0 | Flushing | J.H.S. 185 - Queens | 147-26 25 Drive | 11354 | 4 | NY | Y | 147-26 25th Drive | 147-26 25th Drive,NYC,NY,11354 |
1 | Brooklyn | I.S. 258 - Brooklyn | 141 Macon Street | 11216 | 3 | NY | Y | 141 Macon Street | 141 Macon Street,NYC,NY,11216 |
2 | Brooklyn | I.S. 88 - Brooklyn | 544 7 Avenue | 11215 | 3 | NY | Y | 544 7th Avenue | 544 7th Avenue,NYC,NY,11215 |
3 | Bronx | Walton HS - X | 2780 Reservoir Avenue | 10468 | 2 | NY | Y | 2780 Reservoir Avenue | 2780 Reservoir Avenue,NYC,NY,10468 |
4 | Brooklyn | P.S./I.S.30 Mary White Ovington - Brooklyn | 7002 4 Avenue | 11209 | 3 | NY | Y | 7002 4th Avenue | 7002 4th Avenue,NYC,NY,11209 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
59 | Bronx | I.S. 201 - Bronx | 730 Bryant Avenue | 10474 | 2 | NY | Y | 730 Bryant Avenue | 730 Bryant Avenue,NYC,NY,10474 |
60 | Bronx | P.S. 46 - Bronx | 2760 Briggs Avenue | 10458 | 2 | NY | Y | 2760 Briggs Avenue | 2760 Briggs Avenue,NYC,NY,10458 |
61 | Long Island City | Aviation HS - Q | 45-30 36 Street | 11101 | 4 | NY | Y | 45-30 36th Street | 45-30 36th Street,NYC,NY,11101 |
62 | Bronx | P.S. 102 - Bronx | 1827 Archer Street | 10460 | 2 | NY | Y | 1827 Archer Street | 1827 Archer Street,NYC,NY,10460 |
63 | Bronx | Morris Heights Educational Complex - Bronx | 1780 Dr. Martin Luther King Jr. Blvd | 10453 | 2 | NY | Y | 1780 Dr. Martin Luther King Jr. Blvd | 1780 Dr. Martin Luther King Jr. Blvd,NYC,NY,10453 |
64 rows × 9 columns
Geocoding Method 1: Pandas#
This is the preferred method that gives you more control over the geocoding process but involves a few more steps.
Now we use the Nominatim Geocoder from GeoPy. Nominatim geocoder is based on OpenStreetMap data.
The geocoder is free to use but requires rate limiting our queries to 1 query per second. We use the RateLimiter
provided by geopy
package to set the delay. Since geocoding many address with the rate limit can take some time - we use a progress bar provided by the tqdm
package to track the progress.
You can also install and run Nominatim on your computer without any limitations and can geocode large number of addresses easily. Here’s a good tutorial showing how to Run Nominatim using Docker.
from tqdm.notebook import tqdm
tqdm.pandas()
locator = Nominatim(user_agent='spatialthoughts', timeout=10)
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)
address_df_pd = address_df.copy()
address_df_pd['location'] = address_df_pd['Full_Address'].progress_apply(geocode)
address_df_pd
CITY | EC_Name | ADDRESS | ZIP_CODE | BOROCODE | STATE | ACCESSIBLE | ADDRESS_FIXED | Full_Address | location | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Flushing | J.H.S. 185 - Queens | 147-26 25 Drive | 11354 | 4 | NY | Y | 147-26 25th Drive | 147-26 25th Drive,NYC,NY,11354 | (Junior High School 185, 147-26, 25th Drive, L... |
1 | Brooklyn | I.S. 258 - Brooklyn | 141 Macon Street | 11216 | 3 | NY | Y | 141 Macon Street | 141 Macon Street,NYC,NY,11216 | (Junior High School 258, 141, Macon Street, Be... |
2 | Brooklyn | I.S. 88 - Brooklyn | 544 7 Avenue | 11215 | 3 | NY | Y | 544 7th Avenue | 544 7th Avenue,NYC,NY,11215 | (544, 7th Avenue, Greenwood Heights, Brooklyn,... |
3 | Bronx | Walton HS - X | 2780 Reservoir Avenue | 10468 | 2 | NY | Y | 2780 Reservoir Avenue | 2780 Reservoir Avenue,NYC,NY,10468 | (2780, Reservoir Avenue, Kingsbridge Heights, ... |
4 | Brooklyn | P.S./I.S.30 Mary White Ovington - Brooklyn | 7002 4 Avenue | 11209 | 3 | NY | Y | 7002 4th Avenue | 7002 4th Avenue,NYC,NY,11209 | (7002, 4th Avenue, Bay Ridge, Brooklyn, Kings ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
59 | Bronx | I.S. 201 - Bronx | 730 Bryant Avenue | 10474 | 2 | NY | Y | 730 Bryant Avenue | 730 Bryant Avenue,NYC,NY,10474 | (730, Bryant Avenue, Hunts Point, The Bronx, B... |
60 | Bronx | P.S. 46 - Bronx | 2760 Briggs Avenue | 10458 | 2 | NY | Y | 2760 Briggs Avenue | 2760 Briggs Avenue,NYC,NY,10458 | (2760, Briggs Avenue, Bedford Park, The Bronx,... |
61 | Long Island City | Aviation HS - Q | 45-30 36 Street | 11101 | 4 | NY | Y | 45-30 36th Street | 45-30 36th Street,NYC,NY,11101 | (45-30, 36th Street, Sunnyside, Queens, Queens... |
62 | Bronx | P.S. 102 - Bronx | 1827 Archer Street | 10460 | 2 | NY | Y | 1827 Archer Street | 1827 Archer Street,NYC,NY,10460 | (Public School 102, 1827, Archer Street, The B... |
63 | Bronx | Morris Heights Educational Complex - Bronx | 1780 Dr. Martin Luther King Jr. Blvd | 10453 | 2 | NY | Y | 1780 Dr. Martin Luther King Jr. Blvd | 1780 Dr. Martin Luther King Jr. Blvd,NYC,NY,10453 | (East 125th Street, Manhattan Community Board ... |
64 rows × 10 columns
The response of the geocoding request is a geopy.location.Location
object containing the address along with geocoded coordinates as latitude and longitude. We extract them and add them to separate columns.
address_df_pd['latitude'] = address_df_pd['location'].apply(lambda loc: loc.latitude if loc else None)
address_df_pd['longitude'] = address_df_pd['location'].apply(lambda loc: loc.longitude if loc else None)
address_df_pd
CITY | EC_Name | ADDRESS | ZIP_CODE | BOROCODE | STATE | ACCESSIBLE | ADDRESS_FIXED | Full_Address | location | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Flushing | J.H.S. 185 - Queens | 147-26 25 Drive | 11354 | 4 | NY | Y | 147-26 25th Drive | 147-26 25th Drive,NYC,NY,11354 | (Junior High School 185, 147-26, 25th Drive, L... | 40.774912 | -73.818619 |
1 | Brooklyn | I.S. 258 - Brooklyn | 141 Macon Street | 11216 | 3 | NY | Y | 141 Macon Street | 141 Macon Street,NYC,NY,11216 | (Junior High School 258, 141, Macon Street, Be... | 40.681918 | -73.945597 |
2 | Brooklyn | I.S. 88 - Brooklyn | 544 7 Avenue | 11215 | 3 | NY | Y | 544 7th Avenue | 544 7th Avenue,NYC,NY,11215 | (544, 7th Avenue, Greenwood Heights, Brooklyn,... | 40.660181 | -73.987930 |
3 | Bronx | Walton HS - X | 2780 Reservoir Avenue | 10468 | 2 | NY | Y | 2780 Reservoir Avenue | 2780 Reservoir Avenue,NYC,NY,10468 | (2780, Reservoir Avenue, Kingsbridge Heights, ... | 40.870261 | -73.896569 |
4 | Brooklyn | P.S./I.S.30 Mary White Ovington - Brooklyn | 7002 4 Avenue | 11209 | 3 | NY | Y | 7002 4th Avenue | 7002 4th Avenue,NYC,NY,11209 | (7002, 4th Avenue, Bay Ridge, Brooklyn, Kings ... | 40.633625 | -74.024038 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
59 | Bronx | I.S. 201 - Bronx | 730 Bryant Avenue | 10474 | 2 | NY | Y | 730 Bryant Avenue | 730 Bryant Avenue,NYC,NY,10474 | (730, Bryant Avenue, Hunts Point, The Bronx, B... | 40.816040 | -73.885264 |
60 | Bronx | P.S. 46 - Bronx | 2760 Briggs Avenue | 10458 | 2 | NY | Y | 2760 Briggs Avenue | 2760 Briggs Avenue,NYC,NY,10458 | (2760, Briggs Avenue, Bedford Park, The Bronx,... | 40.867729 | -73.890294 |
61 | Long Island City | Aviation HS - Q | 45-30 36 Street | 11101 | 4 | NY | Y | 45-30 36th Street | 45-30 36th Street,NYC,NY,11101 | (45-30, 36th Street, Sunnyside, Queens, Queens... | 40.743475 | -73.929464 |
62 | Bronx | P.S. 102 - Bronx | 1827 Archer Street | 10460 | 2 | NY | Y | 1827 Archer Street | 1827 Archer Street,NYC,NY,10460 | (Public School 102, 1827, Archer Street, The B... | 40.838106 | -73.865790 |
63 | Bronx | Morris Heights Educational Complex - Bronx | 1780 Dr. Martin Luther King Jr. Blvd | 10453 | 2 | NY | Y | 1780 Dr. Martin Luther King Jr. Blvd | 1780 Dr. Martin Luther King Jr. Blvd,NYC,NY,10453 | (East 125th Street, Manhattan Community Board ... | 40.802729 | -73.933493 |
64 rows × 12 columns
Some addresses might have failed to geocode. Let’s review them.
failed = address_df_pd[address_df_pd['location'].isna()]
failed
CITY | EC_Name | ADDRESS | ZIP_CODE | BOROCODE | STATE | ACCESSIBLE | ADDRESS_FIXED | Full_Address | location | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
54 | New York | P.S. 48 - Manhattan | 4360-78 Broadway | 10033 | 1 | NY | Y | 4360-78 Broadway | 4360-78 Broadway,NYC,NY,10033 | None | NaN | NaN |
We may discard these rows. But sometimes it is useful to manually geocode failed queries. You may use another service manually or obtain the coordinates from another source. The below block assigns the coordinates 40.85339,-73.93350 to the row with index 54.
address_df_pd.loc[54, ['latitude', 'longitude']] = (40.85339,-73.93350)
Select a subset of columns to keep in the final result and rename them.
address_df_pd = address_df_pd[['EC_Name', 'Full_Address', 'latitude', 'longitude']]
address_df_pd.rename(columns = {'EC_Name': 'Name', 'Full_Address': 'Address'}, inplace=True)
We create a GeoDataFrame from the latitude and longitude columns.
geometry = gpd.points_from_xy(address_df_pd.longitude, address_df_pd.latitude)
address_gdf = gpd.GeoDataFrame(address_df_pd, crs='EPSG:4326', geometry=geometry)
address_gdf
Name | Address | latitude | longitude | geometry | |
---|---|---|---|---|---|
0 | J.H.S. 185 - Queens | 147-26 25th Drive,NYC,NY,11354 | 40.774912 | -73.818619 | POINT (-73.81862 40.77491) |
1 | I.S. 258 - Brooklyn | 141 Macon Street,NYC,NY,11216 | 40.681918 | -73.945597 | POINT (-73.9456 40.68192) |
2 | I.S. 88 - Brooklyn | 544 7th Avenue,NYC,NY,11215 | 40.660181 | -73.987930 | POINT (-73.98793 40.66018) |
3 | Walton HS - X | 2780 Reservoir Avenue,NYC,NY,10468 | 40.870261 | -73.896569 | POINT (-73.89657 40.87026) |
4 | P.S./I.S.30 Mary White Ovington - Brooklyn | 7002 4th Avenue,NYC,NY,11209 | 40.633625 | -74.024038 | POINT (-74.02404 40.63362) |
... | ... | ... | ... | ... | ... |
59 | I.S. 201 - Bronx | 730 Bryant Avenue,NYC,NY,10474 | 40.816040 | -73.885264 | POINT (-73.88526 40.81604) |
60 | P.S. 46 - Bronx | 2760 Briggs Avenue,NYC,NY,10458 | 40.867729 | -73.890294 | POINT (-73.89029 40.86773) |
61 | Aviation HS - Q | 45-30 36th Street,NYC,NY,11101 | 40.743475 | -73.929464 | POINT (-73.92946 40.74347) |
62 | P.S. 102 - Bronx | 1827 Archer Street,NYC,NY,10460 | 40.838106 | -73.865790 | POINT (-73.86579 40.83811) |
63 | Morris Heights Educational Complex - Bronx | 1780 Dr. Martin Luther King Jr. Blvd,NYC,NY,10453 | 40.802729 | -73.933493 | POINT (-73.93349 40.80273) |
64 rows × 5 columns
Visualize the geocoded addresses on a map.
import folium
m = leafmap.Map(width=800, height=500)
address_gdf.explore(
m=m,
marker_type='marker',
marker_kwds={
'icon': folium.Icon(color='#fdbb84', icon='hurricane', prefix='fa')
}
)
m.zoom_to_gdf(address_gdf)
m
Save the results as a shapefile.
output_file = 'hurricane_evacuation_centers.shp'
output_path = os.path.join(output_folder, output_file)
address_gdf.to_file(filename=output_path)
We also zip the resulting shapefile along with all sidecar files so it can be downloaded easily.
output_zip_file = 'hurricane_evacuation_centers.zip'
output_zip_path = os.path.join(output_folder, output_zip_file)
sidecar_files = [
os.path.join(output_folder, file)
for file in os.listdir(output_folder)
if file.endswith(('shp', 'shx', 'dbf', 'prj'))
]
with ZipFile(output_zip_path, 'w') as zip_object:
for sidecar in sidecar_files:
zip_object.write(sidecar, os.path.basename(sidecar))
Geocoding Method 2: GeoPandas#
GeoPandas also provides a helper tool geopandas.tools.geocode for geocoding that allows using a GeoPy geocoder. This is a handy tool that gives you a GeoDataFrame but doesn’t offer the flexibility of the previous method. We demonstrate the process here for the sake of completeness.
geocoder_options = {
'user_agent': 'spatialthoughts',
'timeout': 10,
}
address_df_gpd = address_df.copy()
# GeoPandas uses the value of 1-second for rate limiting Nominatim queries
gdf = gpd.tools.geocode(
address_df_gpd['ADDRESS_FIXED'],
provider='Nominatim',
**geocoder_options)
gdf
We merge the resulting GeoDataFrame with the original DataFrame to retain some columns from the original dataset.
merged = gdf.join(address_df_gpd)
merged
We select the subset of columns and rename them.
merged = merged[['EC_Name', 'Full_Address', 'geometry']]
merged.rename(columns = {'EC_Name': 'Name', 'Full_Address': 'Address'}, inplace=True)
merged
If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)