Extract a Shapefile Subset#
Introduction#
Many GIS processes involve extracting a subset from a database. A common pattern is to have data identifiers (like Parcel IDs) sent in a spreadsheet which needs to be queried and extracted from a master file. This tutorial shows how you can automate such a process using Pandas and GeoPandas.
Overview of the task#
This tutorial shows you how to use extract a subset from a shapefile using data contained in an Excel spreadsheet.
We will be working with a parcels data layer for the city of San Francisco, California. Given a list of parcel ids in a spreadsheet, we will extract those parcels and save it to another data layer.
Input Layers:
sf_parcels.zip
: A shapefile of parcels San Franciscoparcels_to_export.xlsx
: A spreadsheet containing list of parcels to export.
Output:
subset.zip
: A zipped shapefile containing a subset of parcels based on the spreadsheet.
Data Credit:
Parcels downloaded from DataSF Open Data Portal
Setup and Data Download#
The following blocks of code will install the required packages and download the datasets to your Colab environment.
import os
import pandas as pd
import geopandas as gpd
import zipfile
data_folder = 'data'
output_folder = 'output'
if not os.path.exists(data_folder):
os.mkdir(data_folder)
if not os.path.exists(output_folder):
os.mkdir(output_folder)
def download(url):
filename = os.path.join(data_folder, os.path.basename(url))
if not os.path.exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print('Downloaded ' + local)
data_url = 'https://github.com/spatialthoughts/geopython-tutorials/releases/download/data/'
download(data_url + 'sf_parcels.zip')
download(data_url + 'parcels_to_export.xlsx')
Downloaded data/sf_parcels.zip
Downloaded data/parcels_to_export.xlsx
Procedure#
We first unzip the sf_parcels.zip
archive and extract the shapefile contained inside. Then we can read it using GeoPandas.
GeoPandas can read zipped files directly using the
zip://
prefix as described in Reading and Writing Files section of the documentation.gpd.read_file('zip:///data/sf_parcels.zip')
. But it was much slower than unzipping and reading the shapefile.
parcels_filepath = os.path.join(data_folder, 'sf_parcels.zip')
We use Python’s built-in zipfile
module to extract the files in the data directory.
with zipfile.ZipFile(parcels_filepath) as zf:
zf.extractall(data_folder)
Once unzipped, we can read the parcels shapefile using GeoPandas.
parcels_shp = os.path.join(data_folder, 'sf_parcels.shp')
parcels_gdf = gpd.read_file(parcels_shp)
Preview the resulting GeoDataFrame. The parcel ids are contained in the mapblklot
column.
parcels_gdf.iloc[:5,:5]
mapblklot | blklot | block_num | lot_num | from_addre | |
---|---|---|---|---|---|
0 | 0001001 | 0001001 | 0001 | 001 | 0 |
1 | 0002001 | 0002001 | 0002 | 001 | 0 |
2 | 0004002 | 0004002 | 0004 | 002 | 160 |
3 | 0005001 | 0005001 | 0005 | 001 | 206 |
4 | 0006001 | 0006001 | 0006 | 001 | 350 |
Next, we read the Excel file containing the parcel ids that we need to export.
export_file_path = os.path.join(data_folder, 'parcels_to_export.xlsx')
Pandas can read Excel files directly using read_excel()
function. If you get an error, make sure to install the package openpyxl
which is used to read excel files.
export_df = pd.read_excel(export_file_path)
export_df
mapblklot | blklot | block_num | lot_num | |
---|---|---|---|---|
0 | 0478013 | 0478013 | 478 | 013 |
1 | 0478001 | 0478001 | 478 | 001 |
2 | 0478001B | 0478001B | 478 | 001B |
3 | 0478001C | 0478001C | 478 | 001C |
4 | 0478002A | 0478002A | 478 | 002A |
... | ... | ... | ... | ... |
84 | 0499036 | 0499037 | 499 | 037 |
85 | 0499036 | 0499038 | 499 | 038 |
86 | 0499036 | 0499039 | 499 | 039 |
87 | 0499036 | 0499040 | 499 | 040 |
88 | 0499036 | 0499041 | 499 | 041 |
89 rows × 4 columns
We need to export all parcels whose ids are given in the mapblklot
column. We extract that column and create a list.
id_list = export_df['blklot'].values
id_list
array(['0478013', '0478001', '0478001B', '0478001C', '0478002A',
'0478004', '0478005', '0478007', '0478008', '0478009', '0478010',
'0478010B', '0478011', '0478011A', '0478011B', '0478011C',
'0478011E', '0478014', '0478015', '0478015A', '0478016', '0478021',
'0478022', '0478023', '0478024', '0478025', '0478026', '0478027',
'0478028', '0478029', '0478030', '0478031', '0478032', '0478033',
'0478034', '0478035', '0478036', '0478037', '0478038', '0478039',
'0478040', '0478041', '0478042', '0478043', '0478044', '0478045',
'0478046', '0478047', '0478061', '0478062', '0478063', '0478064',
'0478065', '0478066', '0478067', '0499001', '0499001A', '0499001B',
'0499001C', '0499001F', '0499001H', '0499002', '0499002A',
'0499002B', '0499002D', '0499003', '0499004', '0499005', '0499006',
'0499007', '0499009', '0499013', '0499014', '0499015', '0499016',
'0499017', '0499018', '0499021', '0499022', '0499023', '0499024',
'0499025', '0499026', '0499036', '0499037', '0499038', '0499039',
'0499040', '0499041'], dtype=object)
Now we can use Pandas isin()
method to filter the GeoDataFrame where the blklot
column matches any ids from the id_list
.
subset_gdf = parcels_gdf[parcels_gdf['blklot'].isin(id_list)]
subset_gdf.iloc[:5, :5]
mapblklot | blklot | block_num | lot_num | from_addre | |
---|---|---|---|---|---|
21103 | 0478013 | 0478013 | 0478 | 013 | 2940 |
21119 | 0478001 | 0478001 | 0478 | 001 | 1101 |
21120 | 0478001B | 0478001B | 0478 | 001B | 2855 |
21121 | 0478001C | 0478001C | 0478 | 001C | 2845 |
21122 | 0478002A | 0478002A | 0478 | 002A | 2821 |
We have successfully selected the subset of parcels. We are ready to save the resulting GeoDataFrame as a shapefile. We define the output file path and save the subset_gdf
.
output_file = 'subset.shp'
output_path = os.path.join(output_folder, output_file)
subset_gdf.to_file(output_path)
For ease of data sharing, let’s zip all the shapefile parts into a single archive. We again use the zipfile
module and use the write()
method to add each sidecar file for the shapefile. The arcname
parameter is used to avoid creating a sub-folder inside the archive.
output_zip = 'subset.zip'
output_zip_path = os.path.join(output_folder, output_zip)
with zipfile.ZipFile(output_zip_path, 'w') as output_zf:
for ext in ['.shp', '.shx', '.prj', '.dbf']:
filename = 'subset' + ext
filepath = os.path.join(output_folder, filename)
output_zf.write(filepath, arcname=filename)
If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)