Creating Spatial Data#
Introduction#
A common operation in spatial analysis is to take non-spatial data, such as CSV files, and creating a spatial dataset from it using coordinate information contained in the file. While doing this is straightfroward with Pandas and GeoPandas, you run into memory limits when working with very large datasets that do not fit into your RAM. Dask provides a drop-in replacement for Pandas workflows that does lazy computing (load data incrementally into RAM) and parellel computing (use all cores of your CPU or a cluster of machines). In this tutorial, we will use dask
and dask-geopandas
packages to efficiently process a large dataset.
Overview of the Task#
GeoNames is a free and open database of geographic names of the world. It distributes a large database containing millions of records in a tab-delimited format. We will read a tab-delimited file of places in dask environment, filter it to select all mountain features, create a GeoDataFrame and export it as a GeoPackage file containing all mountains of the world.
Input Layers:
allCountries.zip
: GeoNames Gazetteer extract of all countries.
Output Layers:
mountains.gpkg
: A GeoPackage containing a vector layer of mountains of the world.
Data Credit:
Geonames. Retrieved 2025-02
Setup and Data Download#
The following blocks of code will install the required packages and download the datasets to your Colab environment.
%%capture
if 'google.colab' in str(get_ipython()):
!pip install dask_geopandas leafmap lonboard
import dask.dataframe as dd
import dask_geopandas as dg
import leafmap.deckgl as leafmap
import lonboard
import geopandas as gpd
import os
import zipfile
data_folder = 'data'
output_folder = 'output'
if not os.path.exists(data_folder):
os.mkdir(data_folder)
if not os.path.exists(output_folder):
os.mkdir(output_folder)
def download(url):
filename = os.path.join(data_folder, os.path.basename(url))
if not os.path.exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print('Downloaded ' + local)
filename = 'allCountries.zip'
data_url = 'https://download.geonames.org/export/dump/'
download(data_url + filename)
Downloaded data/allCountries.zip
Data Pre-Processing#
Extract the zip file.
zip_file_path = os.path.join(data_folder, filename)
with zipfile.ZipFile(zip_file_path) as f:
f.extractall(data_folder)
The files do not contain a header row with column names, so we need to specify them when reading the data. The data format is described in detail on the Data Export page. We will be also specifying the data type of the column names.
column_names = [
'geonameid', 'name', 'asciiname', 'alternatenames',
'latitude', 'longitude', 'feature class', 'feature code',
'country code', 'cc2', 'admin1 code', 'admin2 code',
'admin3 code', 'admin4 code', 'population', 'elevation',
'dem', 'timezone', 'modification date'
]
dtypes = {
'geonameid':int , 'name':object , 'asciiname':object,
'alternatenames':object, 'latitude':float, 'longitude':float,
'feature class':object, 'feature code':object, 'country code':object,
'cc2':object, 'admin1 code':object, 'admin2 code':object,
'admin3 code':object, 'admin4 code':object, 'population':int,
'elevation':float, 'dem':int, 'timezone':object, 'modification date':object
}
We specify the separator as \t (tab) as an argument to the read_csv() method in dask dataframe.
txt_file_path = zip_file_path.replace('.zip', '.txt')
df = dd.read_csv(txt_file_path, sep = '\t', names = column_names, dtype=dtypes)
We can preview the Dask Dataframe with the head()
function.
df.head().iloc[:, :7]
geonameid | name | asciiname | alternatenames | latitude | longitude | feature class | |
---|---|---|---|---|---|---|---|
0 | 2994701 | Roc Meler | Roc Meler | Roc Mele,Roc Meler,Roc Mélé | 42.58765 | 1.74180 | T |
1 | 3017832 | Pic de les Abelletes | Pic de les Abelletes | Pic de la Font-Negre,Pic de la Font-Nègre,Pic ... | 42.52535 | 1.73343 | T |
2 | 3017833 | Estany de les Abelletes | Estany de les Abelletes | Estany de les Abelletes,Etang de Font-Negre,Ét... | 42.52915 | 1.73362 | H |
3 | 3023203 | Port Vieux de la Coume d’Ose | Port Vieux de la Coume d'Ose | Port Vieux de Coume d'Ose,Port Vieux de Coume ... | 42.62568 | 1.61823 | T |
4 | 3029315 | Port de la Cabanette | Port de la Cabanette | Port de la Cabanette,Porteille de la Cabanette | 42.60000 | 1.73333 | T |
Since Dask does not read the entire data into memory, we need to use .compute()
to get the results. For example, if we wanted to know the total number of records in the dataframe, we can do the following.
df.shape[0].compute()
13040598
The input data as a column feature_class
categorizing the place into 9 feature classes. We can select all rows with the value T
with the category mountain,hill,rock…
filtered = df[df['feature class']== 'T']
filtered.head().iloc[:, :7]
geonameid | name | asciiname | alternatenames | latitude | longitude | feature class | |
---|---|---|---|---|---|---|---|
0 | 2994701 | Roc Meler | Roc Meler | Roc Mele,Roc Meler,Roc Mélé | 42.58765 | 1.74180 | T |
1 | 3017832 | Pic de les Abelletes | Pic de les Abelletes | Pic de la Font-Negre,Pic de la Font-Nègre,Pic ... | 42.52535 | 1.73343 | T |
3 | 3023203 | Port Vieux de la Coume d’Ose | Port Vieux de la Coume d'Ose | Port Vieux de Coume d'Ose,Port Vieux de Coume ... | 42.62568 | 1.61823 | T |
4 | 3029315 | Port de la Cabanette | Port de la Cabanette | Port de la Cabanette,Porteille de la Cabanette | 42.60000 | 1.73333 | T |
5 | 3034945 | Roc de Port Dret | Roc de Port Dret | <NA> | 42.60288 | 1.45736 | T |
Creating Spatial Data#
GeoPandas has a conveinent function points_from_xy()
that creates a Geometry column from X and Y coordinates. We can then take a dask dataframe and create a dask goDataFrame by specifying a CRS and the geometry column.
filtered['geometry'] = dg.points_from_xy(
filtered, 'longitude', 'latitude', crs='EPSG:4326')
We now convert the dask dataframe to dask geodatafeame by using the from_dask_dataframe
function and then compute()
it to convert the lazy Dask collection into its in-memory equivalent i.e. Geodataframe.
%%time
mountain_gdf = dg.from_dask_dataframe(filtered).compute()
CPU times: user 1min 20s, sys: 10.2 s, total: 1min 30s
Wall time: 1min 16s
The result is a GeoDataFrame containing over 1.7 million mountain features.
mountain_gdf.iloc[:, :7]
geonameid | name | asciiname | alternatenames | latitude | longitude | feature class | |
---|---|---|---|---|---|---|---|
0 | 2994701 | Roc Meler | Roc Meler | Roc Mele,Roc Meler,Roc Mélé | 42.58765 | 1.74180 | T |
1 | 3017832 | Pic de les Abelletes | Pic de les Abelletes | Pic de la Font-Negre,Pic de la Font-Nègre,Pic ... | 42.52535 | 1.73343 | T |
3 | 3023203 | Port Vieux de la Coume d’Ose | Port Vieux de la Coume d'Ose | Port Vieux de Coume d'Ose,Port Vieux de Coume ... | 42.62568 | 1.61823 | T |
4 | 3029315 | Port de la Cabanette | Port de la Cabanette | Port de la Cabanette,Porteille de la Cabanette | 42.60000 | 1.73333 | T |
5 | 3034945 | Roc de Port Dret | Roc de Port Dret | <NA> | 42.60288 | 1.45736 | T |
... | ... | ... | ... | ... | ... | ... | ... |
515973 | 12185903 | Bedoach Ridge | Bedoach Ridge | <NA> | 12.17251 | 135.02423 | T |
516009 | 12185995 | Ipil Hill | Ipil Hill | Ipil Hill,Ipil Seamount | 16.83118 | 125.57572 | T |
516043 | 12186035 | Taranaki Terrace | Taranaki Terrace | <NA> | -39.65853 | 172.20361 | T |
516306 | 12358943 | Morro São Pedro | Morro Sao Pedro | <NA> | -28.49750 | -28.49750 | T |
516593 | 12865271 | Liverpool Bars | Liverpool Bars | <NA> | 70.60000 | -129.75000 | T |
1771446 rows × 7 columns
We use Leafmap with the Lonboard backend to visualize the results. This generates an interactive map
m = leafmap.Map(height=600)
m.add_gdf(mountain_gdf,
pickable=True,
auto_highlight=True,
get_fill_color='blue',
zoom_to_layer=True,
get_radius=1,
radius_units='pixels'
)
m
We can write the resulting GeoDataFrame to a new GeoPackage file.
output_filename = 'mountains.gpkg'
output_path = os.path.join(output_folder, output_filename)
mountain_gdf.to_file(
filename=output_path, layer='mountains', encoding='utf-8')
print('Successfully written output file at {}'.format(output_path))
Successfully written output file at output/mountains.gpkg
If you want to give feedback or share your experience with this tutorial, please comment below. (requires GitHub account)