Creating Spatial Data#

Introduction#

A common operation in spatial analysis is to take non-spatial data, such as CSV files, and create a spatial dataset from it using coordinate information contained in the file. While this is straightforward with Pandas and GeoPandas, you can run into memory limits when working with very large datasets that do not fit into your RAM. Dask provides a drop-in replacement for Pandas workflows that supports lazy computing (loading data incrementally into RAM) and parallel computing (using all cores of your CPU or a cluster of machines). In this tutorial, we will use the dask and dask-geopandas packages to efficiently process a large dataset.
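The idea of lazy, chunked processing can be sketched in plain Python: rather than loading an entire file into memory, we read and process a fixed number of rows at a time. This is a simplified illustration of the concept, not how Dask works internally; Dask additionally parallelizes the work across partitions.

```python
import csv
import io

def iter_chunks(fileobj, chunksize):
    """Yield lists of parsed rows, at most `chunksize` rows at a time."""
    reader = csv.reader(fileobj, delimiter='\t')
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# A tiny in-memory file standing in for a large tab-delimited dataset
data = io.StringIO('a\t1\nb\t2\nc\t3\nd\t4\ne\t5\n')

total = 0
for chunk in iter_chunks(data, chunksize=2):
    # Only `chunksize` rows are held in memory at any one time
    total += sum(int(row[1]) for row in chunk)
print(total)  # 15
```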

Overview of the Task#

GeoNames is a free and open database of geographic names of the world. It distributes a large database containing millions of records in a tab-delimited format. We will read the tab-delimited file of places with Dask, filter it to select all mountain features, create a GeoDataFrame, and export it as a GeoPackage file containing all mountains of the world.

Input Layers:

  • allCountries.zip: GeoNames Gazetteer extract of all countries.

Output Layers:

  • mountains.gpkg : A GeoPackage containing a vector layer of mountains of the world.

Data Credit:

  • GeoNames Gazetteer data © GeoNames, licensed under a Creative Commons Attribution 4.0 License.

Setup and Data Download#

The following blocks of code will install the required packages and download the datasets to your Colab environment.

%%capture
if 'google.colab' in str(get_ipython()):
    !pip install dask_geopandas leafmap lonboard
import dask.dataframe as dd
import dask_geopandas as dg
import leafmap.deckgl as leafmap
import lonboard
import geopandas as gpd
import os
import zipfile
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)
def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    if not os.path.exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

filename = 'allCountries.zip'
data_url = 'https://download.geonames.org/export/dump/'

download(data_url + filename)
Downloaded data/allCountries.zip

Data Pre-Processing#

Extract the zip file.

zip_file_path = os.path.join(data_folder, filename)
with zipfile.ZipFile(zip_file_path) as f:
    f.extractall(data_folder)

The files do not contain a header row with column names, so we need to specify them when reading the data. The data format is described in detail on the Data Export page. We will also specify the data type of each column.

column_names = [
    'geonameid', 'name', 'asciiname', 'alternatenames',
    'latitude', 'longitude', 'feature class', 'feature code',
    'country code', 'cc2', 'admin1 code', 'admin2 code',
    'admin3 code', 'admin4 code', 'population', 'elevation',
    'dem', 'timezone', 'modification date'
]

dtypes = {
    'geonameid':int , 'name':object , 'asciiname':object,
    'alternatenames':object, 'latitude':float, 'longitude':float,
    'feature class':object, 'feature code':object, 'country code':object,
    'cc2':object, 'admin1 code':object, 'admin2 code':object,
    'admin3 code':object, 'admin4 code':object, 'population':int,
    'elevation':float, 'dem':int, 'timezone':object, 'modification date':object
}

We specify the separator as \t (tab) as an argument to the read_csv() method of the Dask DataFrame API.

txt_file_path = zip_file_path.replace('.zip', '.txt')
df = dd.read_csv(txt_file_path, sep='\t', names=column_names, dtype=dtypes)

We can preview the Dask DataFrame with the head() method.

df.head().iloc[:, :7]
geonameid name asciiname alternatenames latitude longitude feature class
0 2994701 Roc Meler Roc Meler Roc Mele,Roc Meler,Roc Mélé 42.58765 1.74180 T
1 3017832 Pic de les Abelletes Pic de les Abelletes Pic de la Font-Negre,Pic de la Font-Nègre,Pic ... 42.52535 1.73343 T
2 3017833 Estany de les Abelletes Estany de les Abelletes Estany de les Abelletes,Etang de Font-Negre,Ét... 42.52915 1.73362 H
3 3023203 Port Vieux de la Coume d’Ose Port Vieux de la Coume d'Ose Port Vieux de Coume d'Ose,Port Vieux de Coume ... 42.62568 1.61823 T
4 3029315 Port de la Cabanette Port de la Cabanette Port de la Cabanette,Porteille de la Cabanette 42.60000 1.73333 T

Since Dask does not read the entire dataset into memory, we need to call .compute() to obtain results. For example, to get the total number of records in the dataframe, we can do the following.

df.shape[0].compute()
13040598

The input data has a column feature class that categorizes each place into one of 9 feature classes. We can select all rows with the value T, which corresponds to the category mountain,hill,rock…

filtered = df[df['feature class']== 'T']
filtered.head().iloc[:, :7]
geonameid name asciiname alternatenames latitude longitude feature class
0 2994701 Roc Meler Roc Meler Roc Mele,Roc Meler,Roc Mélé 42.58765 1.74180 T
1 3017832 Pic de les Abelletes Pic de les Abelletes Pic de la Font-Negre,Pic de la Font-Nègre,Pic ... 42.52535 1.73343 T
3 3023203 Port Vieux de la Coume d’Ose Port Vieux de la Coume d'Ose Port Vieux de Coume d'Ose,Port Vieux de Coume ... 42.62568 1.61823 T
4 3029315 Port de la Cabanette Port de la Cabanette Port de la Cabanette,Porteille de la Cabanette 42.60000 1.73333 T
5 3034945 Roc de Port Dret Roc de Port Dret <NA> 42.60288 1.45736 T
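For reference, the nine feature classes, roughly summarized from the GeoNames export documentation, can be kept alongside the analysis as a small lookup table:

```python
# GeoNames feature classes, summarized from the GeoNames export
# documentation; 'T' is the class we filter on in this tutorial.
feature_classes = {
    'A': 'country, state, region',
    'H': 'stream, lake',
    'L': 'parks, area',
    'P': 'city, village',
    'R': 'road, railroad',
    'S': 'spot, building, farm',
    'T': 'mountain, hill, rock',
    'U': 'undersea',
    'V': 'forest, heath',
}
print(feature_classes['T'])  # mountain, hill, rock
```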

Creating Spatial Data#

GeoPandas has a convenient function points_from_xy() that creates a geometry column from X and Y coordinates, and dask-geopandas provides an equivalent for Dask DataFrames. We can use it to add a geometry column to the dataframe, specifying the coordinate columns and a CRS.

filtered['geometry'] = dg.points_from_xy(
    filtered, 'longitude', 'latitude', crs='EPSG:4326')

We now convert the Dask DataFrame to a Dask GeoDataFrame using the from_dask_dataframe function and then compute() it to turn the lazy Dask collection into its in-memory equivalent, i.e. a GeoDataFrame.

%%time
mountain_gdf = dg.from_dask_dataframe(filtered).compute()
CPU times: user 1min 20s, sys: 10.2 s, total: 1min 30s
Wall time: 1min 16s

The result is a GeoDataFrame containing over 1.7 million mountain features.

mountain_gdf.iloc[:, :7]
geonameid name asciiname alternatenames latitude longitude feature class
0 2994701 Roc Meler Roc Meler Roc Mele,Roc Meler,Roc Mélé 42.58765 1.74180 T
1 3017832 Pic de les Abelletes Pic de les Abelletes Pic de la Font-Negre,Pic de la Font-Nègre,Pic ... 42.52535 1.73343 T
3 3023203 Port Vieux de la Coume d’Ose Port Vieux de la Coume d'Ose Port Vieux de Coume d'Ose,Port Vieux de Coume ... 42.62568 1.61823 T
4 3029315 Port de la Cabanette Port de la Cabanette Port de la Cabanette,Porteille de la Cabanette 42.60000 1.73333 T
5 3034945 Roc de Port Dret Roc de Port Dret <NA> 42.60288 1.45736 T
... ... ... ... ... ... ... ...
515973 12185903 Bedoach Ridge Bedoach Ridge <NA> 12.17251 135.02423 T
516009 12185995 Ipil Hill Ipil Hill Ipil Hill,Ipil Seamount 16.83118 125.57572 T
516043 12186035 Taranaki Terrace Taranaki Terrace <NA> -39.65853 172.20361 T
516306 12358943 Morro São Pedro Morro Sao Pedro <NA> -28.49750 -28.49750 T
516593 12865271 Liverpool Bars Liverpool Bars <NA> 70.60000 -129.75000 T

1771446 rows × 7 columns

We use Leafmap with the Lonboard backend to visualize the results. This generates an interactive map.

m = leafmap.Map(height=600)
m.add_gdf(mountain_gdf,
          pickable=True,
          auto_highlight=True,
          get_fill_color='blue',
          zoom_to_layer=True,
          get_radius=1,
          radius_units='pixels'
)
m

We can write the resulting GeoDataFrame to a new GeoPackage file.

output_filename = 'mountains.gpkg'
output_path = os.path.join(output_folder, output_filename)

mountain_gdf.to_file(
    filename=output_path, layer='mountains', encoding='utf-8')
print('Successfully written output file at {}'.format(output_path))
Successfully written output file at output/mountains.gpkg
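Since a GeoPackage is an SQLite database, we can sanity-check the output without any GIS library by reading its gpkg_contents table, which the GeoPackage standard uses to register layers. A minimal sketch:

```python
import sqlite3

def list_gpkg_layers(path):
    """Return the layer (table) names registered in a GeoPackage file."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            'SELECT table_name FROM gpkg_contents').fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# e.g. list_gpkg_layers('output/mountains.gpkg') should include 'mountains'
```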

If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)