Bulk Geocoding Addresses#

Introduction#

Geocoding is the process of converting addresses to coordinates. It typically involves querying a web-based geocoding service that can understand and map the address to a real-world location. GeoPy is a Python library that offers an easy-to-use methods to use many popular geocoding services. In this tutorial, we will learn how to use GeoPy with Pandas and GeoPandas for bulk geocoding. We cover the following two methods for geocoding:

  • Geocoding Method 1: Pandas and GeoPy: Our recommended method that offers more flexibility and control over the whole process.

  • Geocoding Method 2: GeoPandas: A simpler method using GeoPandas’ built-in geopandas.tools.geocode function.

Overview of the Task#

We will be working with an Excel sheet containing addresses of Hurricane Evacuation Centers in New York City and geocode these addresses to create a map.

Input Layers:

  • Hurricane_Evacuation_Centers.xlsx: A spreadsheet containing names and addresses of hurricane shelters inn NYC.

Output:

  • hurricane_evacuation_centers.zip: A zipped shapefile with geocoded locations of the hurricane evacuation centers.

Data Credit:

Watch Video Walkthrough https://github.com/spatialthoughts/geopython-tutorials/blob/main/images/yt_logo.png?raw=1

Setup and Data Download#

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install leafmap mapclassify
import os
import re
import pandas as pd
import geopandas as gpd
import leafmap.foliumap as leafmap
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from zipfile import ZipFile
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)
def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    if not os.path.exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

data_url = 'https://github.com/spatialthoughts/geopython-tutorials/releases/download/data/'

download(data_url + 'Hurricane_Evacuation_Centers.xlsx')
Downloaded data/Hurricane_Evacuation_Centers.xlsx

Procedure#

Read the Hurricane_Evacuation_Centers.xlsx file using Pandas. This step requires the openpyxl package to be installed along with pandas. The dataset contains 64 addresses of hurricane evacuation centers in New York City.

excel_file = 'Hurricane_Evacuation_Centers.xlsx'
excel_file_path = os.path.join(data_folder, excel_file)
address_df = pd.read_excel(excel_file_path, skiprows=[0])
address_df
CITY EC_Name ADDRESS ZIP_CODE BOROCODE STATE ACCESSIBLE
0 Flushing J.H.S. 185 - Queens 147-26 25 Drive 11354 4 NY Y
1 Brooklyn I.S. 258 - Brooklyn 141 Macon Street 11216 3 NY Y
2 Brooklyn I.S. 88 - Brooklyn 544 7 Avenue 11215 3 NY Y
3 Bronx Walton HS - X 2780 Reservoir Avenue 10468 2 NY Y
4 Brooklyn P.S./I.S.30 Mary White Ovington - Brooklyn 7002 4 Avenue 11209 3 NY Y
... ... ... ... ... ... ... ...
59 Bronx I.S. 201 - Bronx 730 Bryant Avenue 10474 2 NY Y
60 Bronx P.S. 46 - Bronx 2760 Briggs Avenue 10458 2 NY Y
61 Long Island City Aviation HS - Q 45-30 36 Street 11101 4 NY Y
62 Bronx P.S. 102 - Bronx 1827 Archer Street 10460 2 NY Y
63 Bronx Morris Heights Educational Complex - Bronx 1780 Dr. Martin Luther King Jr. Blvd 10453 2 NY Y

64 rows × 7 columns

The Nominatim geocoder expects the addresses with certain format. We fix our source addresses by making the street numbers as ordinal numbers.

Examples

  • 45-30 36 Street45-30 36th Street.

  • 544 7 Avenue544 7th Avenue

def make_ordinal(match):
    n = int(match.group(1))
    if 11 <= (n % 100) <= 13:
        suffix = 'th'
    else:
        suffix = ['th', 'st', 'nd', 'rd', 'th'][min(n % 10, 4)]
    return str(n) + suffix + match.group(2)

def update_address(row):
  old_address = row['ADDRESS']
  pattern = r'(\d+)(\s+(?:Street|Avenue|Blvd|Drive))'
  result = re.sub(pattern, make_ordinal, old_address)
  return result

address_df['ADDRESS_FIXED'] = address_df.apply(update_address, axis=1)
address_df
CITY EC_Name ADDRESS ZIP_CODE BOROCODE STATE ACCESSIBLE ADDRESS_FIXED
0 Flushing J.H.S. 185 - Queens 147-26 25 Drive 11354 4 NY Y 147-26 25th Drive
1 Brooklyn I.S. 258 - Brooklyn 141 Macon Street 11216 3 NY Y 141 Macon Street
2 Brooklyn I.S. 88 - Brooklyn 544 7 Avenue 11215 3 NY Y 544 7th Avenue
3 Bronx Walton HS - X 2780 Reservoir Avenue 10468 2 NY Y 2780 Reservoir Avenue
4 Brooklyn P.S./I.S.30 Mary White Ovington - Brooklyn 7002 4 Avenue 11209 3 NY Y 7002 4th Avenue
... ... ... ... ... ... ... ... ...
59 Bronx I.S. 201 - Bronx 730 Bryant Avenue 10474 2 NY Y 730 Bryant Avenue
60 Bronx P.S. 46 - Bronx 2760 Briggs Avenue 10458 2 NY Y 2760 Briggs Avenue
61 Long Island City Aviation HS - Q 45-30 36 Street 11101 4 NY Y 45-30 36th Street
62 Bronx P.S. 102 - Bronx 1827 Archer Street 10460 2 NY Y 1827 Archer Street
63 Bronx Morris Heights Educational Complex - Bronx 1780 Dr. Martin Luther King Jr. Blvd 10453 2 NY Y 1780 Dr. Martin Luther King Jr. Blvd

64 rows × 8 columns

We also create a new column containing the full address. The CITY column in the data are actually boroughs of NYC and adding it causing many geocoding requests to fail. Instead, we use the city name ‘NYC’ for all addresses.

address_df['Full_Address'] = (
    address_df['ADDRESS_FIXED'] + ',' +
    'NYC' + ',' +
    address_df['STATE']+ ',' +
    address_df['ZIP_CODE'].astype(str))
address_df
CITY EC_Name ADDRESS ZIP_CODE BOROCODE STATE ACCESSIBLE ADDRESS_FIXED Full_Address
0 Flushing J.H.S. 185 - Queens 147-26 25 Drive 11354 4 NY Y 147-26 25th Drive 147-26 25th Drive,NYC,NY,11354
1 Brooklyn I.S. 258 - Brooklyn 141 Macon Street 11216 3 NY Y 141 Macon Street 141 Macon Street,NYC,NY,11216
2 Brooklyn I.S. 88 - Brooklyn 544 7 Avenue 11215 3 NY Y 544 7th Avenue 544 7th Avenue,NYC,NY,11215
3 Bronx Walton HS - X 2780 Reservoir Avenue 10468 2 NY Y 2780 Reservoir Avenue 2780 Reservoir Avenue,NYC,NY,10468
4 Brooklyn P.S./I.S.30 Mary White Ovington - Brooklyn 7002 4 Avenue 11209 3 NY Y 7002 4th Avenue 7002 4th Avenue,NYC,NY,11209
... ... ... ... ... ... ... ... ... ...
59 Bronx I.S. 201 - Bronx 730 Bryant Avenue 10474 2 NY Y 730 Bryant Avenue 730 Bryant Avenue,NYC,NY,10474
60 Bronx P.S. 46 - Bronx 2760 Briggs Avenue 10458 2 NY Y 2760 Briggs Avenue 2760 Briggs Avenue,NYC,NY,10458
61 Long Island City Aviation HS - Q 45-30 36 Street 11101 4 NY Y 45-30 36th Street 45-30 36th Street,NYC,NY,11101
62 Bronx P.S. 102 - Bronx 1827 Archer Street 10460 2 NY Y 1827 Archer Street 1827 Archer Street,NYC,NY,10460
63 Bronx Morris Heights Educational Complex - Bronx 1780 Dr. Martin Luther King Jr. Blvd 10453 2 NY Y 1780 Dr. Martin Luther King Jr. Blvd 1780 Dr. Martin Luther King Jr. Blvd,NYC,NY,10453

64 rows × 9 columns

Geocoding Method 1: Pandas#

This is the preferred method that gives you more control over the geocoding process but involves a few more steps.

Now we use the Nominatim Geocoder from GeoPy. Nominatim geocoder is based on OpenStreetMap data.

The geocoder is free to use but requires rate limiting our queries to 1 query per second. We use the RateLimiter provided by geopy package to set the delay. Since geocoding many address with the rate limit can take some time - we use a progress bar provided by the tqdm package to track the progress.

You can also install and run Nominatim on your computer without any limitations and can geocode large number of addresses easily. Here’s a good tutorial showing how to Run Nominatim using Docker.

from tqdm.notebook import tqdm

tqdm.pandas()

locator = Nominatim(user_agent='spatialthoughts', timeout=10)
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

address_df_pd = address_df.copy()
address_df_pd['location'] = address_df_pd['Full_Address'].progress_apply(geocode)
address_df_pd
CITY EC_Name ADDRESS ZIP_CODE BOROCODE STATE ACCESSIBLE ADDRESS_FIXED Full_Address location
0 Flushing J.H.S. 185 - Queens 147-26 25 Drive 11354 4 NY Y 147-26 25th Drive 147-26 25th Drive,NYC,NY,11354 (Junior High School 185, 147-26, 25th Drive, L...
1 Brooklyn I.S. 258 - Brooklyn 141 Macon Street 11216 3 NY Y 141 Macon Street 141 Macon Street,NYC,NY,11216 (Junior High School 258, 141, Macon Street, Be...
2 Brooklyn I.S. 88 - Brooklyn 544 7 Avenue 11215 3 NY Y 544 7th Avenue 544 7th Avenue,NYC,NY,11215 (544, 7th Avenue, Greenwood Heights, Brooklyn,...
3 Bronx Walton HS - X 2780 Reservoir Avenue 10468 2 NY Y 2780 Reservoir Avenue 2780 Reservoir Avenue,NYC,NY,10468 (2780, Reservoir Avenue, Kingsbridge Heights, ...
4 Brooklyn P.S./I.S.30 Mary White Ovington - Brooklyn 7002 4 Avenue 11209 3 NY Y 7002 4th Avenue 7002 4th Avenue,NYC,NY,11209 (7002, 4th Avenue, Bay Ridge, Brooklyn, Kings ...
... ... ... ... ... ... ... ... ... ... ...
59 Bronx I.S. 201 - Bronx 730 Bryant Avenue 10474 2 NY Y 730 Bryant Avenue 730 Bryant Avenue,NYC,NY,10474 (730, Bryant Avenue, Hunts Point, The Bronx, B...
60 Bronx P.S. 46 - Bronx 2760 Briggs Avenue 10458 2 NY Y 2760 Briggs Avenue 2760 Briggs Avenue,NYC,NY,10458 (2760, Briggs Avenue, Bedford Park, The Bronx,...
61 Long Island City Aviation HS - Q 45-30 36 Street 11101 4 NY Y 45-30 36th Street 45-30 36th Street,NYC,NY,11101 (45-30, 36th Street, Sunnyside, Queens, Queens...
62 Bronx P.S. 102 - Bronx 1827 Archer Street 10460 2 NY Y 1827 Archer Street 1827 Archer Street,NYC,NY,10460 (Public School 102, 1827, Archer Street, The B...
63 Bronx Morris Heights Educational Complex - Bronx 1780 Dr. Martin Luther King Jr. Blvd 10453 2 NY Y 1780 Dr. Martin Luther King Jr. Blvd 1780 Dr. Martin Luther King Jr. Blvd,NYC,NY,10453 (East 125th Street, Manhattan Community Board ...

64 rows × 10 columns

The response of the geocoding request is a geopy.location.Location object containing the address along with geocoded coordinates as latitude and longitude. We extract them and add them to separate columns.

address_df_pd['latitude'] = address_df_pd['location'].apply(lambda loc: loc.latitude if loc else None)
address_df_pd['longitude'] = address_df_pd['location'].apply(lambda loc: loc.longitude if loc else None)
address_df_pd
CITY EC_Name ADDRESS ZIP_CODE BOROCODE STATE ACCESSIBLE ADDRESS_FIXED Full_Address location latitude longitude
0 Flushing J.H.S. 185 - Queens 147-26 25 Drive 11354 4 NY Y 147-26 25th Drive 147-26 25th Drive,NYC,NY,11354 (Junior High School 185, 147-26, 25th Drive, L... 40.774912 -73.818619
1 Brooklyn I.S. 258 - Brooklyn 141 Macon Street 11216 3 NY Y 141 Macon Street 141 Macon Street,NYC,NY,11216 (Junior High School 258, 141, Macon Street, Be... 40.681918 -73.945597
2 Brooklyn I.S. 88 - Brooklyn 544 7 Avenue 11215 3 NY Y 544 7th Avenue 544 7th Avenue,NYC,NY,11215 (544, 7th Avenue, Greenwood Heights, Brooklyn,... 40.660181 -73.987930
3 Bronx Walton HS - X 2780 Reservoir Avenue 10468 2 NY Y 2780 Reservoir Avenue 2780 Reservoir Avenue,NYC,NY,10468 (2780, Reservoir Avenue, Kingsbridge Heights, ... 40.870261 -73.896569
4 Brooklyn P.S./I.S.30 Mary White Ovington - Brooklyn 7002 4 Avenue 11209 3 NY Y 7002 4th Avenue 7002 4th Avenue,NYC,NY,11209 (7002, 4th Avenue, Bay Ridge, Brooklyn, Kings ... 40.633625 -74.024038
... ... ... ... ... ... ... ... ... ... ... ... ...
59 Bronx I.S. 201 - Bronx 730 Bryant Avenue 10474 2 NY Y 730 Bryant Avenue 730 Bryant Avenue,NYC,NY,10474 (730, Bryant Avenue, Hunts Point, The Bronx, B... 40.816040 -73.885264
60 Bronx P.S. 46 - Bronx 2760 Briggs Avenue 10458 2 NY Y 2760 Briggs Avenue 2760 Briggs Avenue,NYC,NY,10458 (2760, Briggs Avenue, Bedford Park, The Bronx,... 40.867729 -73.890294
61 Long Island City Aviation HS - Q 45-30 36 Street 11101 4 NY Y 45-30 36th Street 45-30 36th Street,NYC,NY,11101 (45-30, 36th Street, Sunnyside, Queens, Queens... 40.743475 -73.929464
62 Bronx P.S. 102 - Bronx 1827 Archer Street 10460 2 NY Y 1827 Archer Street 1827 Archer Street,NYC,NY,10460 (Public School 102, 1827, Archer Street, The B... 40.838106 -73.865790
63 Bronx Morris Heights Educational Complex - Bronx 1780 Dr. Martin Luther King Jr. Blvd 10453 2 NY Y 1780 Dr. Martin Luther King Jr. Blvd 1780 Dr. Martin Luther King Jr. Blvd,NYC,NY,10453 (East 125th Street, Manhattan Community Board ... 40.802729 -73.933493

64 rows × 12 columns

Some addresses might have failed to geocode. Let’s review them.

failed = address_df_pd[address_df_pd['location'].isna()]
failed
CITY EC_Name ADDRESS ZIP_CODE BOROCODE STATE ACCESSIBLE ADDRESS_FIXED Full_Address location latitude longitude
54 New York P.S. 48 - Manhattan 4360-78 Broadway 10033 1 NY Y 4360-78 Broadway 4360-78 Broadway,NYC,NY,10033 None NaN NaN

We may discard these rows. But sometimes it is useful to manually geocode failed queries. You may use another service manually or obtain the coordinates from another source. The below block assigns the coordinates 40.85339,-73.93350 to the row with index 54.

address_df_pd.loc[54, ['latitude', 'longitude']] = (40.85339,-73.93350)

Select a subset of columns to keep in the final result and rename them.

address_df_pd = address_df_pd[['EC_Name', 'Full_Address', 'latitude', 'longitude']]
address_df_pd.rename(columns = {'EC_Name': 'Name', 'Full_Address': 'Address'}, inplace=True)

We create a GeoDataFrame from the latitude and longitude columns.

geometry = gpd.points_from_xy(address_df_pd.longitude, address_df_pd.latitude)
address_gdf = gpd.GeoDataFrame(address_df_pd, crs='EPSG:4326', geometry=geometry)
address_gdf
Name Address latitude longitude geometry
0 J.H.S. 185 - Queens 147-26 25th Drive,NYC,NY,11354 40.774912 -73.818619 POINT (-73.81862 40.77491)
1 I.S. 258 - Brooklyn 141 Macon Street,NYC,NY,11216 40.681918 -73.945597 POINT (-73.9456 40.68192)
2 I.S. 88 - Brooklyn 544 7th Avenue,NYC,NY,11215 40.660181 -73.987930 POINT (-73.98793 40.66018)
3 Walton HS - X 2780 Reservoir Avenue,NYC,NY,10468 40.870261 -73.896569 POINT (-73.89657 40.87026)
4 P.S./I.S.30 Mary White Ovington - Brooklyn 7002 4th Avenue,NYC,NY,11209 40.633625 -74.024038 POINT (-74.02404 40.63362)
... ... ... ... ... ...
59 I.S. 201 - Bronx 730 Bryant Avenue,NYC,NY,10474 40.816040 -73.885264 POINT (-73.88526 40.81604)
60 P.S. 46 - Bronx 2760 Briggs Avenue,NYC,NY,10458 40.867729 -73.890294 POINT (-73.89029 40.86773)
61 Aviation HS - Q 45-30 36th Street,NYC,NY,11101 40.743475 -73.929464 POINT (-73.92946 40.74347)
62 P.S. 102 - Bronx 1827 Archer Street,NYC,NY,10460 40.838106 -73.865790 POINT (-73.86579 40.83811)
63 Morris Heights Educational Complex - Bronx 1780 Dr. Martin Luther King Jr. Blvd,NYC,NY,10453 40.802729 -73.933493 POINT (-73.93349 40.80273)

64 rows × 5 columns

Visualize the geocoded addresses on a map.

import folium
m = leafmap.Map(width=800, height=500)
address_gdf.explore(
    m=m,
    marker_type='marker',
    marker_kwds={
        'icon': folium.Icon(color='#fdbb84', icon='hurricane', prefix='fa')
    }
)
m.zoom_to_gdf(address_gdf)
m

Save the results as a shapefile.

output_file = 'hurricane_evacuation_centers.shp'
output_path = os.path.join(output_folder, output_file)

address_gdf.to_file(filename=output_path)

We also zip the resulting shapefile along with all sidecar files so it can be downloaded easily.

output_zip_file = 'hurricane_evacuation_centers.zip'
output_zip_path = os.path.join(output_folder, output_zip_file)

sidecar_files = [
    os.path.join(output_folder, file)
    for file in os.listdir(output_folder)
    if file.endswith(('shp', 'shx', 'dbf', 'prj'))
]

with ZipFile(output_zip_path, 'w') as zip_object:
    for sidecar in sidecar_files:
      zip_object.write(sidecar, os.path.basename(sidecar))

Geocoding Method 2: GeoPandas#

GeoPandas also provides a helper tool geopandas.tools.geocode for geocoding that allows using a GeoPy geocoder. This is a handy tool that gives you a GeoDataFrame but doesn’t offer the flexibility of the previous method. We demonstrate the process here for the sake of completeness.

geocoder_options = {
     'user_agent': 'spatialthoughts',
     'timeout': 10,
}

address_df_gpd = address_df.copy()

# GeoPandas uses the value of 1-second for rate limiting Nominatim queries
gdf = gpd.tools.geocode(
    address_df_gpd['ADDRESS_FIXED'],
    provider='Nominatim',
    **geocoder_options)

gdf

We merge the resulting GeoDataFrame with the original DataFrame to retain some columns from the original dataset.

merged = gdf.join(address_df_gpd)
merged

We select the subset of columns and rename them.

merged = merged[['EC_Name', 'Full_Address', 'geometry']]
merged.rename(columns = {'EC_Name': 'Name', 'Full_Address': 'Address'}, inplace=True)
merged

If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)