Extracting Time Series at Multiple Points with Xarray

Extracting Time Series at Multiple Points with Xarray#

Introduction#

This notebook shows how to use XArray to efficiently extract a time-series at multiple coordinates. You will also learn how to use XArray to construct a datacube from individual cloud-hosted files and sample data from it with minimal data transfer.

Overview of the Task#

We will take 12 individual Cloud-Optimized GeoTIFF (COG) files representing Soil Moisture for each month of a year, create a XArray Dataset from it and extract the soil moisture values for each month at multiple different locations.

Input Layers:

soil_moisture_202301.tif, soil_moisture_202302.tif …: Cloud-Optimized GeoTIFF files representing average soil misture for each month. This files were exported from TerraClimate data hosted on Google Earth Engine using. See reference script

Output:

soil_moisture.csv: A CSV file containing extracted monthly soil moisture values at 3 different coordinates.

Data Credit:

Abatzoglou, J.T., S.Z. Dobrowski, S.A. Parks, K.C. Hegewisch, 2018, Terraclimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958-2015, Scientific Data 5:170191, doi:10.1038/sdata.2017.191

Setup and Data Download#

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install rioxarray

import datetime
import glob
import os
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
import xarray as xr
import rioxarray as rxr
import zipfile

data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

Data Pre-Processing#

In this example, we have a folder on Google Cloud Storage (GCS) has 12 Cloud-Optomized GeoTIFF files representing soil moisture for each month. The file URIs are as below

gs://spatialthoughts-public-data/terraclimate/soil_moisture_202301.tif
gs://spatialthoughts-public-data/terraclimate/soil_moisture_202302.tif
gs://spatialthoughts-public-data/terraclimate/soil_moisture_202303.tif
...

First we create an XArray Dataset from these individual files.

The files will be read using the GDAL library which allows reading cloud-hosted files directly. Files on Google Cloud Storage can be read using the /vsigs/ prefix. Since our files are public, we specify the GS_NO_SIGN_REQUEST=YES option. For accessing data on private bucket, see the authentication options specified in GDAL and Google Cloud Storage (GCS) .

os.environ['GS_NO_SIGN_REQUEST'] = 'YES'
prefix = '/vsigs/spatialthoughts-public-data/terraclimate/'

# Create a list of URIs with /vgigs/ prefix
urls = []
for month in range(1, 13):
    image_id = f'soil_moisture_2023{month:02d}.tif'
    path = prefix + image_id
    urls.append(path)

urls

def path_to_datetimeindex(filepath):
  filename = os.path.basename(filepath)
  pattern = r'soil_moisture_(\d+).tif'
  match = re.search(pattern, filepath)
  if match:
      doy_value = match.group(1)
      timestamp = datetime.datetime.strptime(doy_value, '%Y%m')
      return timestamp
  else:
    print('Could not extract DOY from filename', filename)


timestamps = []
filepaths = []

for filepath in urls:
  timestamp = path_to_datetimeindex(filepath)
  filepaths.append(filepath)
  timestamps.append(timestamp)

unique_timestamps = sorted(set(timestamps))

scenes = []

for timestamp in unique_timestamps:
  filename = f'soil_moisture_{timestamp.strftime('%Y%m')}.tif'
  filepath = prefix + filename
  print(f'Reading {filepath}')
  scene = rxr.open_rasterio(filepath, chunks={'x':512, 'y':512})
  scenes.append(scene)

time_var = xr.Variable('time', list(unique_timestamps))

time_series_scenes = xr.concat(scenes, dim=time_var)
time_series_scenes

<xarray.DataArray (time: 12, band: 1, y: 4321, x: 8641)> Size: 4GB
dask.array<concatenate, shape=(12, 1, 4321, 8641), dtype=float64, chunksize=(1, 1, 512, 512), chunktype=numpy.ndarray>
Coordinates:
  * time         (time) datetime64[ns] 96B 2023-01-01 2023-02-01 ... 2023-12-01
  * band         (band) int64 8B 1
  * y            (y) float64 35kB 89.98 89.94 89.9 ... -89.94 -89.98 -90.02
  * x            (x) float64 69kB -180.0 -179.9 -179.9 ... 179.9 180.0 180.0
    spatial_ref  int64 8B 0
Attributes:
    AREA_OR_POINT:  Area
    scale_factor:   1.0
    add_offset:     0.0
    long_name:      soil

Plotting the time-series#

We extract the time-series at a specific X,Y coordinate.

time_series = time_series_scenes.interp(x=80.449, y=18.728)

fig, ax = plt.subplots(1, 1)

fig.set_size_inches(15, 7)
time_series.plot.line(ax=ax, x='time', marker='o', linestyle='-', linewidth=1)
plt.show()

../_images/ca92041768a20f6a34b5261ede26f8c0a18ed9cf2ae086825ddded2d44ebe613.png

Extracting Time-Series at Multiple Locations#

locations = [
    ('Location 1', 80.449, 18.728),
    ('Location 2', 79.1488, 15.2797),
    ('Location 3', 74.656, 25.144)
]

df = pd.DataFrame(locations, columns=['Name', 'x', 'y'])
geometry = gpd.points_from_xy(df.x, df.y)
gdf = gpd.GeoDataFrame(df, crs='EPSG:4326', geometry=geometry)
gdf

	Name	x	y	geometry
0	Location 1	80.4490	18.7280	POINT (80.449 18.728)
1	Location 2	79.1488	15.2797	POINT (79.1488 15.2797)
2	Location 3	74.6560	25.1440	POINT (74.656 25.144)

x_coords = gdf.geometry.x.to_xarray()
y_coords = gdf.geometry.y.to_xarray()

sampled = time_series_scenes.sel(x=x_coords, y=y_coords, method='nearest')

Convert the sampled dataset to a Pandas Series.

sampled_df = sampled.to_series()
sampled_df.name = 'soil_moisture'
sampled_df = sampled_df.reset_index()
sampled_df.head()

	time	band	index	soil_moisture
0	2023-01-01	1	0	322.5
1	2023-01-01	1	1	33.5
2	2023-01-01	1	2	25.1
3	2023-02-01	1	0	264.4
4	2023-02-01	1	1	26.4

Add a formatted time column

sampled_df['formatted_time'] = sampled_df['time'].dt.strftime("%Y_%m")
sampled_df.head()

	time	band	index	soil_moisture	formatted_time
0	2023-01-01	1	0	322.5	2023_01
1	2023-01-01	1	1	33.5	2023_01
2	2023-01-01	1	2	25.1	2023_01
3	2023-02-01	1	0	264.4	2023_02
4	2023-02-01	1	1	26.4	2023_02

Convert this to a wide table, with each month’s value as a column.

wide_df = sampled_df.pivot(index='index', columns='formatted_time', values='soil_moisture')
wide_df = wide_df.reset_index()
wide_df.iloc[:, :10]

formatted_time	index	2023_01	2023_02	2023_03	2023_04	2023_05	2023_06	2023_07	2023_08	2023_09
0	0	322.5	264.4	223.0	204.8	138.1	104.7	498.9	462.5	498.9
1	1	33.5	26.4	21.8	18.7	16.3	14.5	125.6	51.1	67.2
2	2	25.1	21.2	18.4	16.3	14.6	13.7	34.6	27.5	37.7

Merge the extracted value with the original data.

merged = pd.merge(gdf, wide_df, left_index=True, right_index=True)
merged.iloc[:, :10]

	Name	x	y	geometry	index	2023_01	2023_02	2023_03	2023_04	2023_05
0	Location 1	80.4490	18.7280	POINT (80.449 18.728)	0	322.5	264.4	223.0	204.8	138.1
1	Location 2	79.1488	15.2797	POINT (79.1488 15.2797)	1	33.5	26.4	21.8	18.7	16.3
2	Location 3	74.6560	25.1440	POINT (74.656 25.144)	2	25.1	21.2	18.4	16.3	14.6

Finally, we save the sampled result to disk as a CSV.

output_filename = 'soil_moisture.csv'
output_path = os.path.join(output_folder, output_filename)
output_df = merged.drop(columns=['geometry', 'index'])
output_df.to_csv(output_path, index=False)

If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)