Creating Spatial Data#

Introduction#

A common operation in spatial analysis is to take non-spatial data, such as CSV files, and create a spatial dataset from it using coordinate information contained in the file. While this is straightforward with Pandas and GeoPandas, you can run into memory limits when working with very large datasets that do not fit into your RAM. Dask provides a drop-in replacement for Pandas workflows that supports lazy computing (loading data incrementally into RAM) and parallel computing (using all cores of your CPU or a cluster of machines). In this tutorial, we will use the dask and dask-geopandas packages to efficiently process a large dataset.
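The idea of lazy, chunked processing can be sketched in plain Python: rather than loading an entire file into memory, we read and process a fixed number of rows at a time. This is a simplified illustration of the concept, not how Dask works internally; Dask additionally parallelizes the work across partitions.

```python
import csv
import io

def iter_chunks(fileobj, chunksize):
    """Yield lists of parsed rows, at most `chunksize` rows at a time."""
    reader = csv.reader(fileobj, delimiter='\t')
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# A tiny in-memory file standing in for a large tab-delimited dataset
data = io.StringIO('a\t1\nb\t2\nc\t3\nd\t4\ne\t5\n')

total = 0
for chunk in iter_chunks(data, chunksize=2):
    # Only `chunksize` rows are held in memory at any one time
    total += sum(int(row[1]) for row in chunk)
print(total)  # 15
```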

Overview of the Task#

GeoNames is a free and open database of geographic names of the world. It distributes a large database containing millions of records in a tab-delimited format. We will read the tab-delimited file of places with Dask, filter it to select all mountain features, create a GeoDataFrame, and export it as a GeoPackage file containing all mountains of the world.

Input Layers:

  • allCountries.zip: GeoNames Gazetteer extract of all countries.

Output Layers:

  • mountains.gpkg : A GeoPackage containing a vector layer of mountains of the world.

Data Credit:

  • GeoNames Gazetteer data © GeoNames, licensed under a Creative Commons Attribution 4.0 License.

Setup and Data Download#

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install dask_geopandas leafmap lonboard
import dask.dataframe as dd
import dask_geopandas as dg
import leafmap.deckgl as leafmap
import lonboard
import geopandas as gpd
import os
import zipfile
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)
def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    if not os.path.exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

filename = 'allCountries.zip'
data_url = 'https://download.geonames.org/export/dump/'

download(data_url + filename)
Downloaded data/allCountries.zip

Data Pre-Processing#

Extract the zip file.

zip_file_path = os.path.join(data_folder, filename)
with zipfile.ZipFile(zip_file_path) as f:
    f.extractall(data_folder)

The files do not contain a header row with column names, so we need to specify them when reading the data. The data format is described in detail on the Data Export page. We will also specify the data type of each column.

column_names = [
    'geonameid', 'name', 'asciiname', 'alternatenames',
    'latitude', 'longitude', 'feature class', 'feature code',
    'country code', 'cc2', 'admin1 code', 'admin2 code',
    'admin3 code', 'admin4 code', 'population', 'elevation',
    'dem', 'timezone', 'modification date'
]

dtypes = {
    'geonameid':int , 'name':object , 'asciiname':object,
    'alternatenames':object, 'latitude':float, 'longitude':float,
    'feature class':object, 'feature code':object, 'country code':object,
    'cc2':object, 'admin1 code':object, 'admin2 code':object,
    'admin3 code':object, 'admin4 code':object, 'population':int,
    'elevation':float, 'dem':int, 'timezone':object, 'modification date':object
}

We specify the separator as \t (tab) as an argument to the read_csv() method of the Dask DataFrame API.

txt_file_path = zip_file_path.replace('.zip', '.txt')
df = dd.read_csv(txt_file_path, sep='\t', names=column_names, dtype=dtypes)

We can preview the Dask DataFrame with the head() method.

df.head().iloc[:, :7]
geonameid name asciiname alternatenames latitude longitude feature class
0 2994701 Roc Meler Roc Meler Roc Mele,Roc Meler,Roc Mélé 42.58765 1.74180 T
1 3017832 Pic de les Abelletes Pic de les Abelletes Pic de la Font-Negre,Pic de la Font-Nègre,Pic ... 42.52535 1.73343 T
2 3017833 Estany de les Abelletes Estany de les Abelletes Estany de les Abelletes,Etang de Font-Negre,Ét... 42.52915 1.73362 H
3 3023203 Port Vieux de la Coume d’Ose Port Vieux de la Coume d'Ose Port Vieux de Coume d'Ose,Port Vieux de Coume ... 42.62568 1.61823 T
4 3029315 Port de la Cabanette Port de la Cabanette Port de la Cabanette,Porteille de la Cabanette 42.60000 1.73333 T

Since Dask does not read the entire dataset into memory, we need to call .compute() to obtain results. For example, to get the total number of records in the dataframe, we can do the following.

df.shape[0].compute()
13040598

The input data has a column feature class that categorizes each place into one of 9 feature classes. We can select all rows with the value T, which corresponds to the category mountain,hill,rock…

filtered = df[df['feature class']== 'T']
filtered.head().iloc[:, :7]
geonameid name asciiname alternatenames latitude longitude feature class
0 2994701 Roc Meler Roc Meler Roc Mele,Roc Meler,Roc Mélé 42.58765 1.74180 T
1 3017832 Pic de les Abelletes Pic de les Abelletes Pic de la Font-Negre,Pic de la Font-Nègre,Pic ... 42.52535 1.73343 T
3 3023203 Port Vieux de la Coume d’Ose Port Vieux de la Coume d'Ose Port Vieux de Coume d'Ose,Port Vieux de Coume ... 42.62568 1.61823 T
4 3029315 Port de la Cabanette Port de la Cabanette Port de la Cabanette,Porteille de la Cabanette 42.60000 1.73333 T
5 3034945 Roc de Port Dret Roc de Port Dret <NA> 42.60288 1.45736 T
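For reference, the nine feature classes, roughly summarized from the GeoNames export documentation, can be kept alongside the analysis as a small lookup table:

```python
# GeoNames feature classes, summarized from the GeoNames export
# documentation; 'T' is the class we filter on in this tutorial.
feature_classes = {
    'A': 'country, state, region',
    'H': 'stream, lake',
    'L': 'parks, area',
    'P': 'city, village',
    'R': 'road, railroad',
    'S': 'spot, building, farm',
    'T': 'mountain, hill, rock',
    'U': 'undersea',
    'V': 'forest, heath',
}
print(feature_classes['T'])  # mountain, hill, rock
```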

Creating Spatial Data#

GeoPandas has a convenient function points_from_xy() that creates a geometry column from X and Y coordinates, and dask-geopandas provides an equivalent for Dask DataFrames. We can use it to add a geometry column to the dataframe, specifying the coordinate columns and a CRS.

filtered['geometry'] = dg.points_from_xy(
    filtered, 'longitude', 'latitude', crs='EPSG:4326')

We now convert the Dask DataFrame to a Dask GeoDataFrame using the from_dask_dataframe function and then compute() it to turn the lazy Dask collection into its in-memory equivalent, i.e. a GeoDataFrame.

%%time
mountain_gdf = dg.from_dask_dataframe(filtered).compute()
CPU times: user 1min 20s, sys: 10.2 s, total: 1min 30s
Wall time: 1min 16s

The result is a GeoDataFrame containing over 1.7 million mountain features.

mountain_gdf.iloc[:, :7]
geonameid name asciiname alternatenames latitude longitude feature class
0 2994701 Roc Meler Roc Meler Roc Mele,Roc Meler,Roc Mélé 42.58765 1.74180 T
1 3017832 Pic de les Abelletes Pic de les Abelletes Pic de la Font-Negre,Pic de la Font-Nègre,Pic ... 42.52535 1.73343 T
3 3023203 Port Vieux de la Coume d’Ose Port Vieux de la Coume d'Ose Port Vieux de Coume d'Ose,Port Vieux de Coume ... 42.62568 1.61823 T
4 3029315 Port de la Cabanette Port de la Cabanette Port de la Cabanette,Porteille de la Cabanette 42.60000 1.73333 T
5 3034945 Roc de Port Dret Roc de Port Dret <NA> 42.60288 1.45736 T
... ... ... ... ... ... ... ...
515973 12185903 Bedoach Ridge Bedoach Ridge <NA> 12.17251 135.02423 T
516009 12185995 Ipil Hill Ipil Hill Ipil Hill,Ipil Seamount 16.83118 125.57572 T
516043 12186035 Taranaki Terrace Taranaki Terrace <NA> -39.65853 172.20361 T
516306 12358943 Morro São Pedro Morro Sao Pedro <NA> -28.49750 -28.49750 T
516593 12865271 Liverpool Bars Liverpool Bars <NA> 70.60000 -129.75000 T

1771446 rows × 7 columns

We use Leafmap with the Lonboard backend to visualize the results. This generates an interactive map.

m = leafmap.Map(height=600)
m.add_gdf(mountain_gdf,
          pickable=True,
          auto_highlight=True,
          get_fill_color='blue',
          zoom_to_layer=True,
          get_radius=1,
          radius_units='pixels'
)
m

We can write the resulting GeoDataFrame to a new GeoPackage file.

output_filename = 'mountains.gpkg'
output_path = os.path.join(output_folder, output_filename)

mountain_gdf.to_file(
    filename=output_path, layer='mountains', encoding='utf-8')
print('Successfully written output file at {}'.format(output_path))
Successfully written output file at output/mountains.gpkg
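Since a GeoPackage is an SQLite database, we can sanity-check the output without any GIS library by reading its gpkg_contents table, which the GeoPackage standard uses to register layers. A minimal sketch:

```python
import sqlite3

def list_gpkg_layers(path):
    """Return the layer (table) names registered in a GeoPackage file."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            'SELECT table_name FROM gpkg_contents').fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# e.g. list_gpkg_layers('output/mountains.gpkg') should include 'mountains'
```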

If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)