{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "N6CsIukeW1-7"
   },
   "source": [
    "# Extract a Shapefile Subset\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Many GIS processes involve extracting a subset from a database. A common pattern is to have data identifiers (like Parcel IDs) sent in a spreadsheet which needs to be queried and extracted from a master file. This tutorial shows how you can automate such a process using Pandas and GeoPandas.\n",
    "\n",
    "## Overview of the task\n",
    "\n",
    "This tutorial shows you how to use extract a subset from a shapefile using data contained in an Excel spreadsheet.\n",
    "\n",
    "We will be working with a parcels data layer for the city of San Francisco, California. Given a list of parcel ids in a spreadsheet, we will extract those parcels and save it to another data layer.\n",
    "\n",
    "**Input Layers**:\n",
    "* `sf_parcels.zip`: A shapefile of parcels San Francisco\n",
    "* `parcels_to_export.xlsx`: A spreadsheet containing list of parcels to export.\n",
    "\n",
    "**Output**:\n",
    "* `subset.zip`: A zipped shapefile containing a subset of parcels based on the spreadsheet.\n",
    "\n",
    "**Data Credit**:\n",
    "* Parcels downloaded from [DataSF Open Data Portal](https://datasf.org/opendata/)\n",
    "\n",
    "\n",
    "**Watch Video Walkthrough** <a href='https://www.youtube.com/watch?v=p7QzF8kQogo' target='_blank'><img src='../images/yt_logo.png'></a>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JepwzAj2U5L5"
   },
   "source": [
    "## Setup and Data Download\n",
    "\n",
    "The following blocks of code will install the required packages and download the datasets to your Colab environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "uQovPAjjU5L6"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "import geopandas as gpd\n",
    "import zipfile"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "-Zndcd8KU5L6"
   },
   "outputs": [],
   "source": [
    "data_folder = 'data'\n",
    "output_folder = 'output'\n",
    "\n",
    "if not os.path.exists(data_folder):\n",
    "    os.mkdir(data_folder)\n",
    "if not os.path.exists(output_folder):\n",
    "    os.mkdir(output_folder)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "id": "N9cAjPXSU5L6"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloaded data/sf_parcels.zip\n",
      "Downloaded data/parcels_to_export.xlsx\n"
     ]
    }
   ],
   "source": [
    "def download(url):\n",
    "    filename = os.path.join(data_folder, os.path.basename(url))\n",
    "    if not os.path.exists(filename):\n",
    "        from urllib.request import urlretrieve\n",
    "        local, _ = urlretrieve(url, filename)\n",
    "        print('Downloaded ' + local)\n",
    "\n",
    "data_url = 'https://github.com/spatialthoughts/geopython-tutorials/releases/download/data/'\n",
    "\n",
    "download(data_url + 'sf_parcels.zip')\n",
    "download(data_url + 'parcels_to_export.xlsx')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-D-U34cbYkrC"
   },
   "source": [
    "## Procedure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gE90KEg9Z9BU"
   },
   "source": [
    "We first unzip the `sf_parcels.zip` archive and extract the shapefile contained inside. Then we can read it using GeoPandas.\n",
    "\n",
    "> GeoPandas can read zipped files directly using the `zip://` prefix as described in [Reading and Writing Files](https://geopandas.org/en/stable/docs/user_guide/io.html) section of the documentation. `gpd.read_file('zip:///data/sf_parcels.zip')`. But it was much slower than unzipping and reading the shapefile."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "id": "QWwnTyVyoFMr"
   },
   "outputs": [],
   "source": [
    "parcels_filepath = os.path.join(data_folder, 'sf_parcels.zip')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NWqvtTRPb27L"
   },
   "source": [
    "We use Python's built-in `zipfile` module to extract the files in the data directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "id": "zIxrmIW0Y9By"
   },
   "outputs": [],
   "source": [
    "with zipfile.ZipFile(parcels_filepath) as zf:\n",
    "  zf.extractall(data_folder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XoHkIBvzb-6z"
   },
   "source": [
    "Once unzipped, we can read the parcels shapefile using GeoPandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "id": "0kv8x2JCoMFG"
   },
   "outputs": [],
   "source": [
    "parcels_shp = os.path.join(data_folder, 'sf_parcels.shp')\n",
    "parcels_gdf = gpd.read_file(parcels_shp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "X-la_03PcLth"
   },
   "source": [
    "Preview the resulting GeoDataFrame. The parcel ids are contained in the `mapblklot` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "id": "Kz-zLg_ucLAh"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mapblklot</th>\n",
       "      <th>blklot</th>\n",
       "      <th>block_num</th>\n",
       "      <th>lot_num</th>\n",
       "      <th>from_addre</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0001001</td>\n",
       "      <td>0001001</td>\n",
       "      <td>0001</td>\n",
       "      <td>001</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0002001</td>\n",
       "      <td>0002001</td>\n",
       "      <td>0002</td>\n",
       "      <td>001</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0004002</td>\n",
       "      <td>0004002</td>\n",
       "      <td>0004</td>\n",
       "      <td>002</td>\n",
       "      <td>160</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0005001</td>\n",
       "      <td>0005001</td>\n",
       "      <td>0005</td>\n",
       "      <td>001</td>\n",
       "      <td>206</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0006001</td>\n",
       "      <td>0006001</td>\n",
       "      <td>0006</td>\n",
       "      <td>001</td>\n",
       "      <td>350</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  mapblklot   blklot block_num lot_num from_addre\n",
       "0   0001001  0001001      0001     001          0\n",
       "1   0002001  0002001      0002     001          0\n",
       "2   0004002  0004002      0004     002        160\n",
       "3   0005001  0005001      0005     001        206\n",
       "4   0006001  0006001      0006     001        350"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parcels_gdf.iloc[:5,:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IIszH40icQ_h"
   },
   "source": [
    "Next, we read the Excel file containing the parcel ids that we need to export."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "id": "OwYzuk3QoQxl"
   },
   "outputs": [],
   "source": [
    "export_file_path = os.path.join(data_folder, 'parcels_to_export.xlsx')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "X4kJD_XOciij"
   },
   "source": [
    "Pandas can read Excel files directly using `read_excel()` function. If you get an error, make sure to install the package `openpyxl` which is used to read excel files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "id": "Pc2UHj9VprO9"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mapblklot</th>\n",
       "      <th>blklot</th>\n",
       "      <th>block_num</th>\n",
       "      <th>lot_num</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0478013</td>\n",
       "      <td>0478013</td>\n",
       "      <td>478</td>\n",
       "      <td>013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0478001</td>\n",
       "      <td>0478001</td>\n",
       "      <td>478</td>\n",
       "      <td>001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0478001B</td>\n",
       "      <td>0478001B</td>\n",
       "      <td>478</td>\n",
       "      <td>001B</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0478001C</td>\n",
       "      <td>0478001C</td>\n",
       "      <td>478</td>\n",
       "      <td>001C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0478002A</td>\n",
       "      <td>0478002A</td>\n",
       "      <td>478</td>\n",
       "      <td>002A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>84</th>\n",
       "      <td>0499036</td>\n",
       "      <td>0499037</td>\n",
       "      <td>499</td>\n",
       "      <td>037</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>85</th>\n",
       "      <td>0499036</td>\n",
       "      <td>0499038</td>\n",
       "      <td>499</td>\n",
       "      <td>038</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>86</th>\n",
       "      <td>0499036</td>\n",
       "      <td>0499039</td>\n",
       "      <td>499</td>\n",
       "      <td>039</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87</th>\n",
       "      <td>0499036</td>\n",
       "      <td>0499040</td>\n",
       "      <td>499</td>\n",
       "      <td>040</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>88</th>\n",
       "      <td>0499036</td>\n",
       "      <td>0499041</td>\n",
       "      <td>499</td>\n",
       "      <td>041</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>89 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   mapblklot    blklot  block_num lot_num\n",
       "0    0478013   0478013        478     013\n",
       "1    0478001   0478001        478     001\n",
       "2   0478001B  0478001B        478    001B\n",
       "3   0478001C  0478001C        478    001C\n",
       "4   0478002A  0478002A        478    002A\n",
       "..       ...       ...        ...     ...\n",
       "84   0499036   0499037        499     037\n",
       "85   0499036   0499038        499     038\n",
       "86   0499036   0499039        499     039\n",
       "87   0499036   0499040        499     040\n",
       "88   0499036   0499041        499     041\n",
       "\n",
       "[89 rows x 4 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "export_df = pd.read_excel(export_file_path)\n",
    "export_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5dRTB-ONctAL"
   },
   "source": [
    "We need to export all parcels whose ids are given in the `mapblklot` column. We extract that column and create a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "id": "7jPpLxJvpxw8"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['0478013', '0478001', '0478001B', '0478001C', '0478002A',\n",
       "       '0478004', '0478005', '0478007', '0478008', '0478009', '0478010',\n",
       "       '0478010B', '0478011', '0478011A', '0478011B', '0478011C',\n",
       "       '0478011E', '0478014', '0478015', '0478015A', '0478016', '0478021',\n",
       "       '0478022', '0478023', '0478024', '0478025', '0478026', '0478027',\n",
       "       '0478028', '0478029', '0478030', '0478031', '0478032', '0478033',\n",
       "       '0478034', '0478035', '0478036', '0478037', '0478038', '0478039',\n",
       "       '0478040', '0478041', '0478042', '0478043', '0478044', '0478045',\n",
       "       '0478046', '0478047', '0478061', '0478062', '0478063', '0478064',\n",
       "       '0478065', '0478066', '0478067', '0499001', '0499001A', '0499001B',\n",
       "       '0499001C', '0499001F', '0499001H', '0499002', '0499002A',\n",
       "       '0499002B', '0499002D', '0499003', '0499004', '0499005', '0499006',\n",
       "       '0499007', '0499009', '0499013', '0499014', '0499015', '0499016',\n",
       "       '0499017', '0499018', '0499021', '0499022', '0499023', '0499024',\n",
       "       '0499025', '0499026', '0499036', '0499037', '0499038', '0499039',\n",
       "       '0499040', '0499041'], dtype=object)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "id_list = export_df['blklot'].values\n",
    "id_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2sFfj8Y4dCGq"
   },
   "source": [
    "Now we can use Pandas `isin()` method to filter the GeoDataFrame where the `\n",
    "blklot` column matches any ids from the `id_list`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "id": "mjJ_I_AsqE4p"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mapblklot</th>\n",
       "      <th>blklot</th>\n",
       "      <th>block_num</th>\n",
       "      <th>lot_num</th>\n",
       "      <th>from_addre</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>21103</th>\n",
       "      <td>0478013</td>\n",
       "      <td>0478013</td>\n",
       "      <td>0478</td>\n",
       "      <td>013</td>\n",
       "      <td>2940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21119</th>\n",
       "      <td>0478001</td>\n",
       "      <td>0478001</td>\n",
       "      <td>0478</td>\n",
       "      <td>001</td>\n",
       "      <td>1101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21120</th>\n",
       "      <td>0478001B</td>\n",
       "      <td>0478001B</td>\n",
       "      <td>0478</td>\n",
       "      <td>001B</td>\n",
       "      <td>2855</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21121</th>\n",
       "      <td>0478001C</td>\n",
       "      <td>0478001C</td>\n",
       "      <td>0478</td>\n",
       "      <td>001C</td>\n",
       "      <td>2845</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21122</th>\n",
       "      <td>0478002A</td>\n",
       "      <td>0478002A</td>\n",
       "      <td>0478</td>\n",
       "      <td>002A</td>\n",
       "      <td>2821</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      mapblklot    blklot block_num lot_num from_addre\n",
       "21103   0478013   0478013      0478     013       2940\n",
       "21119   0478001   0478001      0478     001       1101\n",
       "21120  0478001B  0478001B      0478    001B       2855\n",
       "21121  0478001C  0478001C      0478    001C       2845\n",
       "21122  0478002A  0478002A      0478    002A       2821"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "subset_gdf = parcels_gdf[parcels_gdf['blklot'].isin(id_list)]\n",
    "subset_gdf.iloc[:5, :5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rFuW7npBdWBb"
   },
   "source": [
    "We have successfully selected the subset of parcels. We are ready to save the resulting GeoDataFrame as a shapefile. We define the output file path and save the `subset_gdf`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "id": "2adR9beyqSbw"
   },
   "outputs": [],
   "source": [
    "output_file = 'subset.shp'\n",
    "output_path = os.path.join(output_folder, output_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "id": "kyHcRUOfqmgV"
   },
   "outputs": [],
   "source": [
    "subset_gdf.to_file(output_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TWr7mRiQdvAa"
   },
   "source": [
    "For ease of data sharing, let's zip all the shapefile parts into a single archive. We again use the `zipfile` module and use the `write()` method to add each sidecar file for the shapefile. The `arcname` parameter is used to avoid creating a sub-folder inside the archive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "id": "AhD3TXNfqop0"
   },
   "outputs": [],
   "source": [
    "output_zip = 'subset.zip'\n",
    "output_zip_path = os.path.join(output_folder, output_zip)\n",
    "\n",
    "with zipfile.ZipFile(output_zip_path, 'w') as output_zf:\n",
    "  for ext in ['.shp', '.shx', '.prj', '.dbf']:\n",
    "    filename = 'subset' + ext\n",
    "    filepath = os.path.join(output_folder, filename)\n",
    "    output_zf.write(filepath, arcname=filename)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "\n",
    "If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)\n",
    "\n",
    "\n",
    "<script src='https://utteranc.es/client.js'\n",
    "    repo='spatialthoughts/geopython-tutorials'\n",
    "    issue-term='pathname'\n",
    "    theme='github-light'\n",
    "    crossorigin='anonymous'\n",
    "    async>\n",
    "</script>"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "authorship_tag": "ABX9TyNN8jPiJlLRGqVPDfy1xf1X",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}