
Extracting Time Series at Multiple Polygons with Xarray#

Introduction#

Zonal statistics summarize the values of a raster dataset within the zones of a vector dataset. We can extend this process over time to extract data for multiple time steps, which requires rasterizing each polygon in the input vector for each time step. regionmask is a helpful library for creating a 3D rasterized version of the zones.

This is a large computation that is made feasible by cloud-hosted nighttime lights data in the Cloud-Optimized GeoTIFF (COG) format and parallel computing on a local Dask cluster.

Overview of the Task#

We select all Admin-1 units of a country and calculate the sum of nighttime light pixel intensities for each of them over multiple years.

Input Layers:

  • ne_10m_admin_1_states_provinces.zip: Admin-1 (States/Provinces) shapefile from Natural Earth

  • 2015.tif, 2016.tif, … 2020.tif: Yearly global rasters of nighttime light data. These files were downloaded from Harvard Dataverse and converted to Cloud-Optimized GeoTIFFs using GDAL.

    Example command:

    gdalwarp -of COG 2021_HasMask/LongNTL_2021.tif 2021.tif \
      -te -180 -90 180 90 -dstnodata 0 \
      -co COMPRESS=DEFLATE -co PREDICTOR=2 -co NUM_THREADS=ALL_CPUS
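
    Here -te sets the output extent to the whole globe, -dstnodata 0 marks zero as nodata, and DEFLATE compression with PREDICTOR=2 keeps the resulting COG compact.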
    

Output Layers:

  • output.csv : A CSV file with the extracted sum of nighttime lights for each admin-1 region for each year.

Data Credit:

  • Made with Natural Earth. Free vector and raster map data @ naturalearthdata.com.

  • Chen, Zuoqi; Yu, Bailang; Yang, Chengshu; Zhou, Yuyu; Yao, Shenjun; Qian, Xingjian; Wang, Congxiao; Wu, Bin; Wu, Jianping; Liao, Lingxing; Shi, Kaifang, 2020, “The global NPP-VIIRS-like nighttime light data (Version 2) for 1992-2024”, https://doi.org/10.7910/DVN/YGIVCD, Harvard Dataverse, V8

Setup and Data Download#

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install rioxarray
    !pip install regionmask
import os
import glob
import pandas as pd
import geopandas as gpd
import numpy as np
import xarray as xr
import rioxarray as rxr
import matplotlib.pyplot as plt
from datetime import datetime
import dask
import regionmask
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

Set up a local Dask cluster. This distributes the computation across multiple workers on your computer.

from dask.distributed import Client, progress
client = Client()  # set up local cluster on the machine
client
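
Client() with no arguments picks the number of workers and memory limits based on your machine. If you need to size the cluster yourself, you can pass these explicitly; a minimal sketch (the values below are illustrative, not a recommendation):

client = Client(
    n_workers=4,           # number of worker processes
    threads_per_worker=2,  # threads within each worker
    memory_limit='4GB'     # memory cap per worker
)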

If you are running this notebook in Colab, you will need to create and use a proxy URL to see the dashboard running on the local server.

if 'google.colab' in str(get_ipython()):
    from google.colab import output
    port_to_expose = 8787  # This is the default port for Dask dashboard
    print(output.eval_js(f'google.colab.kernel.proxyPort({port_to_expose})'))

Next, define a small helper function that downloads a file into the data folder, skipping files that already exist.

def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    if not os.path.exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

admin1_zipfile = 'ne_10m_admin_1_states_provinces.zip'
admin1_url = 'https://naciscdn.org/naturalearth/10m/cultural/'

download(admin1_url + admin1_zipfile)
Downloaded data/ne_10m_admin_1_states_provinces.zip

Data Pre-Processing#

First we read the Admin-1 shapefile and filter it to a single country. The iso_a2 column contains the two-letter ISO country code. Here we select all provinces of Sri Lanka.

country_code = 'LK'
admin1_file_path = os.path.join(data_folder, admin1_zipfile)

admin1_df = gpd.read_file(admin1_file_path)

zones = admin1_df[admin1_df['iso_a2'] == country_code]
zones = zones[['adm1_code', 'name', 'iso_a2', 'geometry']].copy()
zones['id'] = zones.reset_index().index + 1
zones.head()
adm1_code name iso_a2 geometry id
1859 LKA-2454 Trikuṇāmalaya LK POLYGON ((80.92292 8.98892, 80.92319 8.97321, ... 1
1860 LKA-2458 Mulativ LK POLYGON ((80.62598 9.45475, 80.7444 9.35969, 8... 2
1861 LKA-2459 Yāpanaya LK MULTIPOLYGON (((80.44174 9.5769, 80.44402 9.57... 3
1862 LKA-2460 Kilinŏchchi LK MULTIPOLYGON (((80.29067 9.68081, 80.33855 9.6... 4
1863 LKA-2457 Mannārama LK MULTIPOLYGON (((79.91278 8.57201, 79.91651 8.5... 5

Plot the selected zones.

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(5,5)
zones.plot(
    ax=ax,
    facecolor='none',
    edgecolor='#969696')
ax.set_aspect('equal')
ax.set_axis_off()
plt.show()
[Figure: boundaries of the selected Admin-1 zones of Sri Lanka]

Next we read the yearly nighttime lights rasters, clip each one to the bounding box of the selected zones, and combine them into a single Xarray data cube.

start_year = 2015
end_year = 2020

# URL prefix for the cloud-hosted COGs (kept separate from the
# local data_folder used for downloads)
cog_url_prefix = 'https://github.com/spatialthoughts/geopython-tutorials/releases/download/data'

# Bounding box (minx, miny, maxx, maxy) of the selected zones,
# used to clip each global raster to the country's extent
bbox = zones.total_bounds

da_list = []

for year in range(start_year, end_year + 1):
    cog_url = f'{cog_url_prefix}/{year}.tif'
    da = rxr.open_rasterio(
        cog_url,
        chunks=True).rio.clip_box(*bbox)
    dt = pd.to_datetime(year, format='%Y')
    da = da.assign_coords(time=dt)
    da = da.expand_dims(dim='time')
    da_list.append(da)
ntl_datacube = xr.concat(da_list, dim='time').chunk('auto')
ntl_datacube
<xarray.DataArray (time: 6, band: 1, y: 870, x: 498)> Size: 10MB
dask.array<rechunk-merge, shape=(6, 1, 870, 498), dtype=float32, chunksize=(6, 1, 870, 498), chunktype=numpy.ndarray>
Coordinates:
  * time         (time) datetime64[ns] 48B 2015-01-01 2016-01-01 ... 2020-01-01
  * band         (band) int64 8B 1
  * y            (y) float64 7kB 9.828 9.823 9.819 9.814 ... 5.933 5.929 5.924
  * x            (x) float64 4kB 79.66 79.66 79.66 79.67 ... 81.88 81.88 81.89
    spatial_ref  int64 8B 0
Attributes:
    DataType:            Generic
    AREA_OR_POINT:       Area
    BandName:            Band_1
    RepresentationType:  ATHEMATIC
    scale_factor:        1.0
    add_offset:          0.0
    long_name:           Band_1
    _FillValue:          0.0

Since all the rasters have a single band of information, we select the first band and drop that dimension.

ntl_datacube = ntl_datacube.sel(band=1, drop=True)
ntl_datacube
<xarray.DataArray (time: 6, y: 870, x: 498)> Size: 10MB
dask.array<getitem, shape=(6, 870, 498), dtype=float32, chunksize=(6, 870, 498), chunktype=numpy.ndarray>
Coordinates:
  * time         (time) datetime64[ns] 48B 2015-01-01 2016-01-01 ... 2020-01-01
  * y            (y) float64 7kB 9.828 9.823 9.819 9.814 ... 5.933 5.929 5.924
  * x            (x) float64 4kB 79.66 79.66 79.66 79.67 ... 81.88 81.88 81.89
    spatial_ref  int64 8B 0
Attributes:
    DataType:            Generic
    AREA_OR_POINT:       Area
    BandName:            Band_1
    RepresentationType:  ATHEMATIC
    scale_factor:        1.0
    add_offset:          0.0
    long_name:           Band_1
    _FillValue:          0.0

Zonal Stats#

Now we will extract the sum of the raster pixel values for every Admin-1 region in the selected country.

First, we need to convert the GeoDataFrame to an Xarray object. We use the regionmask module, which rasterizes the GeoDataFrame into a 3D boolean mask (one layer per region) with the same spatial dimensions and coordinates as the given input Xarray dataset.
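
regionmask rasterizes the zones onto the raster's grid, so both layers must share a CRS. Both datasets here are already in EPSG:4326, but a quick sanity check costs little; a minimal sketch:

# Reproject the vector zones if their CRS differs from the raster's
if zones.crs != ntl_datacube.rio.crs:
    zones = zones.to_crs(ntl_datacube.rio.crs)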

# Create mask of multiple regions from shapefile
mask = regionmask.mask_3D_geopandas(
        zones,
        ntl_datacube.x,
        ntl_datacube.y,
        drop=True,
        numbers='id',
        overlap=True
    )
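
To verify the rasterization visually, you can plot the boolean mask of a single region; a quick sketch:

# Pixels inside the first polygon are True, everything else is False
mask.isel(region=0).plot()
plt.show()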

Apply the mask to the dataset.

ntl_datacube = ntl_datacube.where(mask).chunk('auto')
ntl_datacube
<xarray.DataArray (time: 6, y: 870, x: 498, region: 25)> Size: 260MB
dask.array<rechunk-merge, shape=(6, 870, 498, 25), dtype=float32, chunksize=(5, 747, 426, 21), chunktype=numpy.ndarray>
Coordinates:
  * time         (time) datetime64[ns] 48B 2015-01-01 2016-01-01 ... 2020-01-01
  * y            (y) float64 7kB 9.828 9.823 9.819 9.814 ... 5.933 5.929 5.924
  * x            (x) float64 4kB 79.66 79.66 79.66 79.67 ... 81.88 81.88 81.89
  * region       (region) int64 200B 1 2 3 4 5 6 7 8 ... 18 19 20 21 22 23 24 25
    spatial_ref  int64 8B 0
Attributes:
    DataType:            Generic
    AREA_OR_POINT:       Area
    BandName:            Band_1
    RepresentationType:  ATHEMATIC
    scale_factor:        1.0
    add_offset:          0.0
    long_name:           Band_1
    _FillValue:          0.0

Next we calculate zonal statistics on the masked DataArray. Here we want the sum of nighttime lights within each region for each year, so we group by time and sum all pixel values within each region.

grouped = ntl_datacube.groupby('time').sum(['x','y'])
grouped
<xarray.DataArray (time: 6, region: 25)> Size: 600B
dask.array<getitem, shape=(6, 25), dtype=float32, chunksize=(1, 21), chunktype=numpy.ndarray>
Coordinates:
  * time         (time) datetime64[ns] 48B 2015-01-01 2016-01-01 ... 2020-01-01
  * region       (region) int64 200B 1 2 3 4 5 6 7 8 ... 18 19 20 21 22 23 24 25
    spatial_ref  int64 8B 0
Attributes:
    DataType:            Generic
    AREA_OR_POINT:       Area
    BandName:            Band_1
    RepresentationType:  ATHEMATIC
    scale_factor:        1.0
    add_offset:          0.0
    long_name:           Band_1
    _FillValue:          0.0

Compute the result. Dask will now distribute the computation across all chunks using available workers.

%time grouped = grouped.compute()
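
Since progress was imported from dask.distributed, you can alternatively persist the task graph and watch a progress bar while the workers run; a sketch:

grouped = grouped.persist()  # start computing in the background
progress(grouped)            # show a progress bar in the notebook
grouped = grouped.compute()  # gather the finished result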

Convert the results to a Pandas DataFrame. We now have the total nighttime lights value for each region for each year.

stats = grouped.drop_vars('spatial_ref').to_dataframe('sum').reset_index()
stats.head()
time region sum
0 2015-01-01 1 449.778412
1 2015-01-01 2 49.645641
2 2015-01-01 3 891.009949
3 2015-01-01 4 100.430511
4 2015-01-01 5 27.500204

Add a year column and rename the region column to id so we can join the results back to the original shapefile.

stats['year'] = stats.time.dt.year
stats = stats.rename(columns={'region': 'id'})
stats.head()
time id sum year
0 2015-01-01 1 449.778412 2015
1 2015-01-01 2 49.645641 2015
2 2015-01-01 3 891.009949 2015
3 2015-01-01 4 100.430511 2015
4 2015-01-01 5 27.500204 2015

Perform a table join to merge the results with the original zones GeoDataFrame.

results = zones.merge(stats, on='id')
results.head()
adm1_code name iso_a2 geometry id time sum year
0 LKA-2454 Trikuṇāmalaya LK POLYGON ((80.92292 8.98892, 80.92319 8.97321, ... 1 2015-01-01 449.778412 2015
1 LKA-2454 Trikuṇāmalaya LK POLYGON ((80.92292 8.98892, 80.92319 8.97321, ... 1 2016-01-01 453.339569 2016
2 LKA-2454 Trikuṇāmalaya LK POLYGON ((80.92292 8.98892, 80.92319 8.97321, ... 1 2017-01-01 640.339172 2017
3 LKA-2454 Trikuṇāmalaya LK POLYGON ((80.92292 8.98892, 80.92319 8.97321, ... 1 2018-01-01 656.189758 2018
4 LKA-2454 Trikuṇāmalaya LK POLYGON ((80.92292 8.98892, 80.92319 8.97321, ... 1 2019-01-01 854.694946 2019

Finally, we save the result to disk.

output = results[['adm1_code', 'name', 'iso_a2', 'year', 'sum']]
output.head()
adm1_code name iso_a2 year sum
0 LKA-2454 Trikuṇāmalaya LK 2015 449.778412
1 LKA-2454 Trikuṇāmalaya LK 2016 453.339569
2 LKA-2454 Trikuṇāmalaya LK 2017 640.339172
3 LKA-2454 Trikuṇāmalaya LK 2018 656.189758
4 LKA-2454 Trikuṇāmalaya LK 2019 854.694946
output_file = 'output.csv'
output_path = os.path.join(output_folder, output_file)

output.to_csv(output_path, index=False)
print('Successfully written output file at {}'.format(output_path))
Successfully written output file at output/output.csv
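
If a wide table with one row per region and one column per year is more convenient for analysis, pandas can reshape the output; a sketch (requires a recent pandas version for the list-valued index):

wide = output.pivot(index=['adm1_code', 'name'], columns='year', values='sum')
wide.head()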
