{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "N6CsIukeW1-7" }, "source": [ "# Extract a Shapefile Subset\n", "\n", "## Introduction\n", "\n", "Many GIS processes involve extracting a subset from a database. A common pattern is to have data identifiers (like Parcel IDs) sent in a spreadsheet, which then need to be queried and extracted from a master file. This tutorial shows how you can automate such a process using Pandas and GeoPandas.\n", "\n", "## Overview of the task\n", "\n", "This tutorial shows you how to extract a subset from a shapefile using data contained in an Excel spreadsheet.\n", "\n", "We will be working with a parcels data layer for the city of San Francisco, California. Given a list of parcel ids in a spreadsheet, we will extract those parcels and save them to another data layer.\n", "\n", "**Input Layers**:\n", "* `sf_parcels.zip`: A shapefile of parcels in San Francisco.\n", "* `parcels_to_export.xlsx`: A spreadsheet containing the list of parcels to export.\n", "\n", "**Output**:\n", "* `subset.zip`: A zipped shapefile containing a subset of parcels based on the spreadsheet.\n", "\n", "**Data Credit**:\n", "* Parcels downloaded from [DataSF Open Data Portal](https://datasf.org/opendata/)\n", "\n", "\n", "**Watch Video Walkthrough** \n" ] }, { "cell_type": "markdown", "metadata": { "id": "JepwzAj2U5L5" }, "source": [ "## Setup and Data Download\n", "\n", "The following blocks of code will install the required packages and download the datasets to your Colab environment." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "uQovPAjjU5L6" }, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import geopandas as gpd\n", "import zipfile" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "-Zndcd8KU5L6" }, "outputs": [], "source": [ "data_folder = 'data'\n", "output_folder = 'output'\n", "\n", "if not os.path.exists(data_folder):\n", "    os.mkdir(data_folder)\n", "if not os.path.exists(output_folder):\n", "    os.mkdir(output_folder)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "N9cAjPXSU5L6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloaded data/sf_parcels.zip\n", "Downloaded data/parcels_to_export.xlsx\n" ] } ], "source": [ "def download(url):\n", "    filename = os.path.join(data_folder, os.path.basename(url))\n", "    if not os.path.exists(filename):\n", "        from urllib.request import urlretrieve\n", "        local, _ = urlretrieve(url, filename)\n", "        print('Downloaded ' + local)\n", "\n", "data_url = 'https://github.com/spatialthoughts/geopython-tutorials/releases/download/data/'\n", "\n", "download(data_url + 'sf_parcels.zip')\n", "download(data_url + 'parcels_to_export.xlsx')" ] }, { "cell_type": "markdown", "metadata": { "id": "-D-U34cbYkrC" }, "source": [ "## Procedure" ] }, { "cell_type": "markdown", "metadata": { "id": "gE90KEg9Z9BU" }, "source": [ "We first unzip the `sf_parcels.zip` archive and extract the shapefile contained inside. Then we can read it using GeoPandas.\n", "\n", "> GeoPandas can read zipped files directly using the `zip://` prefix, e.g. `gpd.read_file('zip://data/sf_parcels.zip')`, as described in the [Reading and Writing Files](https://geopandas.org/en/stable/docs/user_guide/io.html) section of the documentation. However, this is much slower than unzipping and reading the shapefile."
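, "\n", "For reference, here is a minimal sketch of that direct-read approach (an illustrative example, assuming the relative `data` folder created above and that the `zip://` prefix accepts a relative path):\n", "\n", "```python\n", "# Alternative (slower for this dataset): read the zipped shapefile without extracting it first\n", "parcels_direct_gdf = gpd.read_file('zip://' + os.path.join(data_folder, 'sf_parcels.zip'))\n", "```"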
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "QWwnTyVyoFMr" }, "outputs": [], "source": [ "parcels_filepath = os.path.join(data_folder, 'sf_parcels.zip')" ] }, { "cell_type": "markdown", "metadata": { "id": "NWqvtTRPb27L" }, "source": [ "We use Python's built-in `zipfile` module to extract the files into the data directory." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "zIxrmIW0Y9By" }, "outputs": [], "source": [ "with zipfile.ZipFile(parcels_filepath) as zf:\n", "    zf.extractall(data_folder)" ] }, { "cell_type": "markdown", "metadata": { "id": "XoHkIBvzb-6z" }, "source": [ "Once unzipped, we can read the parcels shapefile using GeoPandas." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "0kv8x2JCoMFG" }, "outputs": [], "source": [ "parcels_shp = os.path.join(data_folder, 'sf_parcels.shp')\n", "parcels_gdf = gpd.read_file(parcels_shp)" ] }, { "cell_type": "markdown", "metadata": { "id": "X-la_03PcLth" }, "source": [ "Preview the resulting GeoDataFrame. The parcel ids are contained in the `mapblklot` and `blklot` columns." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "Kz-zLg_ucLAh" }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>mapblklot</th><th>blklot</th><th>block_num</th><th>lot_num</th><th>from_addre</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0001001</td><td>0001001</td><td>0001</td><td>001</td><td>0</td></tr>\n", "    <tr><th>1</th><td>0002001</td><td>0002001</td><td>0002</td><td>001</td><td>0</td></tr>\n", "    <tr><th>2</th><td>0004002</td><td>0004002</td><td>0004</td><td>002</td><td>160</td></tr>\n", "    <tr><th>3</th><td>0005001</td><td>0005001</td><td>0005</td><td>001</td><td>206</td></tr>\n", "    <tr><th>4</th><td>0006001</td><td>0006001</td><td>0006</td><td>001</td><td>350</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " mapblklot blklot block_num lot_num from_addre\n", "0 0001001 0001001 0001 001 0\n", "1 0002001 0002001 0002 001 0\n", "2 0004002 0004002 0004 002 160\n", "3 0005001 0005001 0005 001 206\n", "4 0006001 0006001 0006 001 350" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parcels_gdf.iloc[:5,:5]" ] }, { "cell_type": "markdown", "metadata": { "id": "IIszH40icQ_h" }, "source": [ "Next, we read the Excel file containing the parcel ids that we need to export." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "OwYzuk3QoQxl" }, "outputs": [], "source": [ "export_file_path = os.path.join(data_folder, 'parcels_to_export.xlsx')" ] }, { "cell_type": "markdown", "metadata": { "id": "X4kJD_XOciij" }, "source": [ "Pandas can read Excel files directly using the `read_excel()` function. If you get an error, make sure to install the `openpyxl` package, which is used to read Excel files." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "Pc2UHj9VprO9" }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>mapblklot</th><th>blklot</th><th>block_num</th><th>lot_num</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>0478013</td><td>0478013</td><td>478</td><td>013</td></tr>\n", "    <tr><th>1</th><td>0478001</td><td>0478001</td><td>478</td><td>001</td></tr>\n", "    <tr><th>2</th><td>0478001B</td><td>0478001B</td><td>478</td><td>001B</td></tr>\n", "    <tr><th>3</th><td>0478001C</td><td>0478001C</td><td>478</td><td>001C</td></tr>\n", "    <tr><th>4</th><td>0478002A</td><td>0478002A</td><td>478</td><td>002A</td></tr>\n", "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", "    <tr><th>84</th><td>0499036</td><td>0499037</td><td>499</td><td>037</td></tr>\n", "    <tr><th>85</th><td>0499036</td><td>0499038</td><td>499</td><td>038</td></tr>\n", "    <tr><th>86</th><td>0499036</td><td>0499039</td><td>499</td><td>039</td></tr>\n", "    <tr><th>87</th><td>0499036</td><td>0499040</td><td>499</td><td>040</td></tr>\n", "    <tr><th>88</th><td>0499036</td><td>0499041</td><td>499</td><td>041</td></tr>\n", "  </tbody>\n", "</table>\n", "<p>89 rows × 4 columns</p>\n", "</div>" ], "text/plain": [ " mapblklot blklot block_num lot_num\n", "0 0478013 0478013 478 013\n", "1 0478001 0478001 478 001\n", "2 0478001B 0478001B 478 001B\n", "3 0478001C 0478001C 478 001C\n", "4 0478002A 0478002A 478 002A\n", ".. ... ... ... ...\n", "84 0499036 0499037 499 037\n", "85 0499036 0499038 499 038\n", "86 0499036 0499039 499 039\n", "87 0499036 0499040 499 040\n", "88 0499036 0499041 499 041\n", "\n", "[89 rows x 4 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "export_df = pd.read_excel(export_file_path)\n", "export_df" ] }, { "cell_type": "markdown", "metadata": { "id": "5dRTB-ONctAL" }, "source": [ "We need to export all parcels whose ids are given in the `blklot` column. We extract the values of that column to create a list of ids." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "7jPpLxJvpxw8" }, "outputs": [ { "data": { "text/plain": [ "array(['0478013', '0478001', '0478001B', '0478001C', '0478002A',\n", " '0478004', '0478005', '0478007', '0478008', '0478009', '0478010',\n", " '0478010B', '0478011', '0478011A', '0478011B', '0478011C',\n", " '0478011E', '0478014', '0478015', '0478015A', '0478016', '0478021',\n", " '0478022', '0478023', '0478024', '0478025', '0478026', '0478027',\n", " '0478028', '0478029', '0478030', '0478031', '0478032', '0478033',\n", " '0478034', '0478035', '0478036', '0478037', '0478038', '0478039',\n", " '0478040', '0478041', '0478042', '0478043', '0478044', '0478045',\n", " '0478046', '0478047', '0478061', '0478062', '0478063', '0478064',\n", " '0478065', '0478066', '0478067', '0499001', '0499001A', '0499001B',\n", " '0499001C', '0499001F', '0499001H', '0499002', '0499002A',\n", " '0499002B', '0499002D', '0499003', '0499004', '0499005', '0499006',\n", " '0499007', '0499009', '0499013', '0499014', '0499015', '0499016',\n", " '0499017', '0499018', '0499021', '0499022', '0499023', '0499024',\n", " '0499025', '0499026', '0499036', '0499037', '0499038', '0499039',\n", " '0499040', '0499041'], dtype=object)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id_list = export_df['blklot'].values\n", "id_list" ] }, { "cell_type": "markdown", "metadata": { "id": "2sFfj8Y4dCGq" }, "source": [ "Now we can use the Pandas `isin()` method to filter the GeoDataFrame to rows where the `blklot` column matches any of the ids in `id_list`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "mjJ_I_AsqE4p" }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>mapblklot</th><th>blklot</th><th>block_num</th><th>lot_num</th><th>from_addre</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>21103</th><td>0478013</td><td>0478013</td><td>0478</td><td>013</td><td>2940</td></tr>\n", "    <tr><th>21119</th><td>0478001</td><td>0478001</td><td>0478</td><td>001</td><td>1101</td></tr>\n", "    <tr><th>21120</th><td>0478001B</td><td>0478001B</td><td>0478</td><td>001B</td><td>2855</td></tr>\n", "    <tr><th>21121</th><td>0478001C</td><td>0478001C</td><td>0478</td><td>001C</td><td>2845</td></tr>\n", "    <tr><th>21122</th><td>0478002A</td><td>0478002A</td><td>0478</td><td>002A</td><td>2821</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " mapblklot blklot block_num lot_num from_addre\n", "21103 0478013 0478013 0478 013 2940\n", "21119 0478001 0478001 0478 001 1101\n", "21120 0478001B 0478001B 0478 001B 2855\n", "21121 0478001C 0478001C 0478 001C 2845\n", "21122 0478002A 0478002A 0478 002A 2821" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subset_gdf = parcels_gdf[parcels_gdf['blklot'].isin(id_list)]\n", "subset_gdf.iloc[:5, :5]" ] }, { "cell_type": "markdown", "metadata": { "id": "rFuW7npBdWBb" }, "source": [ "We have successfully selected the subset of parcels. We are ready to save the resulting GeoDataFrame as a shapefile. We define the output file path and save the `subset_gdf`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "2adR9beyqSbw" }, "outputs": [], "source": [ "output_file = 'subset.shp'\n", "output_path = os.path.join(output_folder, output_file)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "kyHcRUOfqmgV" }, "outputs": [], "source": [ "subset_gdf.to_file(output_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "TWr7mRiQdvAa" }, "source": [ "For ease of data sharing, let's zip all the shapefile parts into a single archive. We again use the `zipfile` module and call the `write()` method to add each sidecar file of the shapefile. The `arcname` parameter is used to avoid creating a sub-folder inside the archive." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "AhD3TXNfqop0" }, "outputs": [], "source": [ "output_zip = 'subset.zip'\n", "output_zip_path = os.path.join(output_folder, output_zip)\n", "\n", "with zipfile.ZipFile(output_zip_path, 'w') as output_zf:\n", "    for ext in ['.shp', '.shx', '.prj', '.dbf']:\n", "        filename = 'subset' + ext\n", "        filepath = os.path.join(output_folder, filename)\n", "        output_zf.write(filepath, arcname=filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)\n", "\n", "\n", "" ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyNN8jPiJlLRGqVPDfy1xf1X", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 4 }