Sampling Raster Data with XArray#

Introduction#

Many scientific and environmental datasets come as gridded rasters. If you want to know the value of a variable at a single or multiple locations, you can use sampling techniques to extract the values. XArray has powerful indexing methods that allow us to extract values at multiple coordinates easily.

Overview of the Task#

In this tutorial, we will take a raster file of temperature anomalies and a CSV file with locations of all urban areas in the US. We will use Pandas and Xarray to find the temperature anomaly in all the urban areas and find the top 10 areas experiencing the highest anomaly. We will also use GeoPandas to save the results as a vector layer.

Input Layers:

  • t.anom.202207.tif: Raster grid of temprature anomaly for the month of July 2022 in the US.

  • 2021_Gaz_ua_national.zip: A CSV file with point locations representing urban areas in the US.

Output Layers:

  • tanomaly.gpkg : A GeoPackage containing a vector layer of point locations with anomaly values sampled from the raster.

Data Credit:

Setup and Data Download#

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install rioxarray
import os
import pandas as pd
import geopandas as gpd
import rioxarray as rxr
import matplotlib.pyplot as plt
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)
def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    if not os.path.exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

raster_file = 't.anom.202207.tif'
csv_file = '2021_Gaz_ua_national.zip'

files = [
    'https://ftp.cpc.ncep.noaa.gov/GIS/USDM_Products/temp/anom/monthly/' + raster_file,
    'https://www2.census.gov/geo/docs/maps-data/data/gazetteer/2021_Gazetteer/' + csv_file,
]

for file in files:
  download(file)
Downloaded data/t.anom.202207.tif
Downloaded data/2021_Gaz_ua_national.zip

Data Pre-Processing#

First we read the raster file using rioxarray

raster_filepath = os.path.join(data_folder, raster_file)
raster = rxr.open_rasterio(raster_filepath, mask_and_scale=True)
raster
<xarray.DataArray (band: 1, y: 32, x: 64)> Size: 8kB
[2048 values with dtype=float32]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 512B -130.0 -129.0 -128.0 ... -69.0 -68.0 -67.0
  * y            (y) float64 256B 52.0 51.0 50.0 49.0 ... 24.0 23.0 22.0 21.0
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_SOFTWARE:        GrADS version 2.0.2 
    TIFFTAG_XRESOLUTION:     0.171875
    TIFFTAG_YRESOLUTION:     0.265625
    TIFFTAG_RESOLUTIONUNIT:  2 (pixels/inch)
    AREA_OR_POINT:           Area

Let’s Look at some pixel values.

raster.values
array([[[-9.99e+08, -9.99e+08, -9.99e+08, ..., -9.99e+08, -9.99e+08,
         -9.99e+08],
        [-9.99e+08, -9.99e+08, -9.99e+08, ..., -9.99e+08, -9.99e+08,
         -9.99e+08],
        [-9.99e+08, -9.99e+08, -9.99e+08, ..., -9.99e+08, -9.99e+08,
         -9.99e+08],
        ...,
        [-9.99e+08, -9.99e+08, -9.99e+08, ..., -9.99e+08, -9.99e+08,
         -9.99e+08],
        [-9.99e+08, -9.99e+08, -9.99e+08, ..., -9.99e+08, -9.99e+08,
         -9.99e+08],
        [-9.99e+08, -9.99e+08, -9.99e+08, ..., -9.99e+08, -9.99e+08,
         -9.99e+08]]], dtype=float32)

You will notice that the raster has many pixels with value -9.99e+08. These are NoData values but they are not encoded as such. We will mask these values to get only the valid pixels which will set them to nan values.

nodata = -999000000
raster_masked = raster.where(raster != nodata)
raster_masked
<xarray.DataArray (band: 1, y: 32, x: 64)> Size: 8kB
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 512B -130.0 -129.0 -128.0 ... -69.0 -68.0 -67.0
  * y            (y) float64 256B 52.0 51.0 50.0 49.0 ... 24.0 23.0 22.0 21.0
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_SOFTWARE:        GrADS version 2.0.2 
    TIFFTAG_XRESOLUTION:     0.171875
    TIFFTAG_YRESOLUTION:     0.265625
    TIFFTAG_RESOLUTIONUNIT:  2 (pixels/inch)
    AREA_OR_POINT:           Area

The raster has only 1 band containing temperature anomaly values, so we select it.

tanomaly = raster_masked.sel(band=1)
tanomaly
<xarray.DataArray (y: 32, x: 64)> Size: 8kB
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
Coordinates:
    band         int64 8B 1
  * x            (x) float64 512B -130.0 -129.0 -128.0 ... -69.0 -68.0 -67.0
  * y            (y) float64 256B 52.0 51.0 50.0 49.0 ... 24.0 23.0 22.0 21.0
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_SOFTWARE:        GrADS version 2.0.2 
    TIFFTAG_XRESOLUTION:     0.171875
    TIFFTAG_YRESOLUTION:     0.265625
    TIFFTAG_RESOLUTIONUNIT:  2 (pixels/inch)
    AREA_OR_POINT:           Area

Let’s plot the data. It shows high temperature anomaly in the north-west US due to a record-breaking heatwave.

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)

tanomaly.plot.imshow(ax=ax,
    vmin=-3, vmax=3, add_labels=False, cmap='coolwarm')

ax.set_title('Temprature Anomaly for July 2022', fontsize = 14)

plt.show()
../_images/313d7efd7a7dc8b6d98d009fc8514cabb4a76b01065f38c51984ae3f3f1ead3c.png

Next, we read the Zipped CSV file containing the urban areas. This is a plain-text file with tab separated values. The column names also contain trailing whitespaces, so we remove them as well.

csv_filepath = os.path.join(data_folder, csv_file)
df  = pd.read_csv(csv_filepath, delimiter = '\t', compression='zip')
df.columns = df.columns.str.replace(' ', '')
df
GEOID NAME UATYPE ALAND AWATER ALAND_SQMI AWATER_SQMI INTPTLAT INTPTLONG
0 37 Abbeville, LA Urban Cluster C 29189598 298416 11.270 0.115 29.967156 -92.095966
1 64 Abbeville, SC Urban Cluster C 11271136 19786 4.352 0.008 34.179273 -82.379776
2 91 Abbotsford, WI Urban Cluster C 5426584 13221 2.095 0.005 44.948612 -90.315875
3 118 Aberdeen, MS Urban Cluster C 7416338 52820 2.863 0.020 33.824742 -88.554591
4 145 Aberdeen, SD Urban Cluster C 33032902 120864 12.754 0.047 45.463186 -98.471033
... ... ... ... ... ... ... ... ... ...
3596 98101 Zapata--Medina, TX Urban Cluster C 13451264 0 5.194 0.000 26.889081 -99.266192
3597 98182 Zephyrhills, FL Urbanized Area U 112593840 1615599 43.473 0.624 28.285373 -82.198969
3598 98209 Zimmerman, MN Urban Cluster C 24456008 2495147 9.443 0.963 45.455850 -93.606705
3599 98236 Zumbrota, MN Urban Cluster C 4829469 0 1.865 0.000 44.292793 -92.670931
3600 98263 Zuni Pueblo, NM Urban Cluster C 11876786 0 4.586 0.000 35.071066 -108.823725

3601 rows × 9 columns

Since we have the coordinates of each urban center in the INTPTLAT and INTPTLONG columns, we can convert this dataframe to a GeoDataFrame.

geometry = gpd.points_from_xy(df.INTPTLONG, df.INTPTLAT)
gdf = gpd.GeoDataFrame(df, crs='EPSG:4326', geometry=geometry)

Let’s visualize the points along with the raster. As the GeoDataFrame contains points outside the continental US, we limit our plot to the bounds of the tanomaly layer.

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)

tanomaly.plot.imshow(ax=ax,
    vmin=-3, vmax=3, add_labels=False, cmap='coolwarm')
gdf.plot(ax=ax, markersize=5)
ax.set_xlim([tanomaly.x.min(), tanomaly.x.max()])
ax.set_ylim([tanomaly.y.min(), tanomaly.y.max()])
ax.set_title('Urban Areas and Temperature Anomaly for July 2022', fontsize = 14)

plt.show()
../_images/4b22dfc2bcdc0e8b45de637f3ffd4cd01eb8a41dfbf2d6b1f5c01258d21d7853.png

Sampling Raster Values#

Now we will extract the value of the raster pixels at each urban area point. Xarray’s sel() method allows you to specify coordinates at multiple dimentions to extract the array value.

We have our coordinates in a GeoDataFrame, we can use Panda’s to_xarray() method to convert the X and Y coordinates to a DataArray. When we use this to select the pixels from the raster array, the resulting DataArray being reindexed by the GeoDataFrame’s index.This will allow us to merge the results easily to the original GeoDataFrame.

x_coords = gdf.geometry.x.to_xarray()
y_coords = gdf.geometry.y.to_xarray()

Now we sample the values at these coordinates. Note that since the raster data pixels will not be indexed at the exact X and Y coordinates, we use method=nearest to pick the closest pixel.

sampled = tanomaly.sel(x=x_coords, y=y_coords, method='nearest')

To convert the resulting DataArray to Pandas, we use to_series() method.

gdf['tanomaly'] = sampled.to_series()
gdf.iloc[:, -5:]
AWATER_SQMI INTPTLAT INTPTLONG geometry tanomaly
0 0.115 29.967156 -92.095966 POINT (-92.09597 29.96716) -0.272462
1 0.008 34.179273 -82.379776 POINT (-82.37978 34.17927) -0.704404
2 0.005 44.948612 -90.315875 POINT (-90.31588 44.94861) 0.130515
3 0.020 33.824742 -88.554591 POINT (-88.55459 33.82474) 0.018811
4 0.047 45.463186 -98.471033 POINT (-98.47103 45.46319) 1.311932
... ... ... ... ... ...
3596 0.000 26.889081 -99.266192 POINT (-99.26619 26.88908) -1.866690
3597 0.624 28.285373 -82.198969 POINT (-82.19897 28.28537) -0.064647
3598 0.963 45.455850 -93.606705 POINT (-93.6067 45.45585) 0.607703
3599 0.000 44.292793 -92.670931 POINT (-92.67093 44.29279) 0.007632
3600 0.000 35.071066 -108.823725 POINT (-108.82372 35.07107) -0.732679

3601 rows × 5 columns

Now that we have the anomaly at each point location, let’s sort and find out the Top10 locations with highest anomaly.

sorted_gdf = gdf.sort_values(by=['tanomaly'], ascending=False)
top10 = sorted_gdf.iloc[:10].reset_index()
top10[['NAME', 'tanomaly']]
NAME tanomaly
0 Osburn, ID Urban Cluster 5.214408
1 Kellogg, ID Urban Cluster 4.543093
2 Libby, MT Urban Cluster 4.543093
3 Bonners Ferry, ID Urban Cluster 4.487802
4 Orofino, ID Urban Cluster 4.411723
5 Grangeville, ID Urban Cluster 4.411723
6 La Grande, OR Urban Cluster 4.300000
7 Baker City, OR Urban Cluster 4.300000
8 Rathdrum, ID Urban Cluster 4.138633
9 Deer Park, WA Urban Cluster 4.138633

Let’s plot them on a map.

fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 7)

tanomaly.plot.imshow(ax=ax,
    vmin=-3, vmax=3, add_labels=False, cmap='coolwarm', add_colorbar=False)
top10.plot(ax=ax, markersize=25, color='yellow', marker='*')
ax.set_xlim([tanomaly.x.min(), tanomaly.x.max()])
ax.set_ylim([tanomaly.y.min(), tanomaly.y.max()])
ax.set_title('Top 10 Temperature Anomaly Locations for July 2022', fontsize = 14)

plt.tight_layout()
plt.show()
../_images/bbbe6298d47be4223ff6f95cc8f166d9bdbbb6b9a8bb740090c41b099fd96b73.png

Finally, we save the sampled result to disk as .gpkg.

output_filename = 'tanomaly.gpkg'
output_path = os.path.join(output_folder, output_filename)

gdf.to_file(driver='GPKG', filename=output_path)

If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)