elapid's Geospatial Features¶
This notebook reviews common patterns for working with vector and raster data using elapid.
We'll review the core tools for extracting raster data from points and polygons, assigning sample weights based on sample density, and creating spatially explicit train/test data, while also demonstrating general patterns for working with the frequently used GeoDataFrame and GeoSeries classes from geopandas.
The sample data we'll use includes species occurrence records from the genus Ariolimax, a group of North American terrestrial slugs, as well as some climate and vegetation data.
Several of the datasets used here are hosted on a web server. These files can be downloaded locally with ela.download_sample_data(), but we'll work with them directly over the web to demonstrate that reading remote data is nearly identical to reading local data, thanks to the excellent libraries elapid is built on top of.
Packages, paths & preferences¶
# packages
import elapid as ela
import os
import warnings
from pprint import pprint
import numpy as np
import pandas as pd
import geopandas as gpd
import rasterio as rio
from rasterio import plot as rioplot
from sklearn import metrics
import matplotlib as mpl
import matplotlib.pyplot as plt
# paths
https = "https://earth-chris.github.io/images/research"
csv = f"{https}/ariolimax-ca.csv"
gpkg = f"{https}/ariolimax-ca.gpkg"
raster_names = [
"ca-cloudcover-mean.tif",
"ca-cloudcover-stdv.tif",
"ca-leafareaindex-mean.tif",
"ca-leafareaindex-stdv.tif",
"ca-surfacetemp-mean.tif",
"ca-surfacetemp-stdv.tif",
]
rasters = [f"{https}/{raster}" for raster in raster_names]
labels = ["CloudCoverMean", "CloudCoverStdv", "LAIMean", "LAIStdv", "SurfaceTempMean", "SurfaceTempStdv"]
# preferences
mpl.style.use('ggplot')
warnings.filterwarnings("ignore")
pair_colors = ['#FFCC02', '#00458C']
print(f"Notebook last run with elapid version {ela.__version__}")
Notebook last run with elapid version 1.0.1
Working with Point-format data¶
Species distribution modeling is the science and art of working with point-format species occurrence data: locations on the landscape where a species has been observed, identified, and recorded as present at an x/y coordinate. We then model the relative likelihood of observing that species in other locations based on the environmental conditions where it has been observed.
Occurrence data are typically recorded as latitude/longitude coordinates for a single location, also known as "point-format data." Environmental data are typically provided as image data with coordinate reference information, or "raster data." elapid provides a series of tools for working with both.
Reading vector files¶
Point-format data are most commonly distributed in formats like GeoPackage, GeoJSON, or Shapefile. You can read these into memory easily with geopandas.
gdf = gpd.read_file(gpkg)
print(gdf.head())
             species  year                        geometry
0  Ariolimax buttoni  2005  POINT (432271.994 4348048.850)
1  Ariolimax buttoni  2005  POINT (643606.559 4427302.366)
2  Ariolimax buttoni  2006  POINT (537588.166 4194012.740)
3  Ariolimax buttoni  2007  POINT (468489.994 4361538.275)
4  Ariolimax buttoni  2007  POINT (537466.797 4194096.065)
Reading x/y data¶
Not all datasets are packaged so conveniently. SDM data often come as CSV files with columns for latitude and longitude, for example. We can convert that information into a GeoDataFrame so we can natively perform geospatial operations.
The example CSV is a subset of this GBIF occurrence dataset, which contains data from four species of Banana Slug across California.
# read and preview the data
df = pd.read_csv(csv)
# we convert the lat/lon columns into a GeoSeries dataset, which is spatially aware
geometry = ela.xy_to_geoseries(
x = df['decimalLongitude'],
y = df['decimalLatitude']
)
# then merge the two together into a GeoDataFrame
ariolimax = gpd.GeoDataFrame(df[["species", "year"]], geometry=geometry)
# show the evolution
print("DataFrame")
print(df.head())
print("\nGeoSeries")
print(geometry.head())
print("\nGeoDataFrame")
print(ariolimax.head())
DataFrame
             species  year  decimalLatitude  decimalLongitude
0  Ariolimax buttoni  2022        37.897280       -122.709870
1  Ariolimax buttoni  2020        39.331897       -123.784692
2  Ariolimax buttoni  2022        37.934980       -122.636929
3  Ariolimax buttoni  2022        38.032908       -122.721855
4  Ariolimax buttoni  2022        38.042295       -122.876520

GeoSeries
0    POINT (-122.70987 37.89728)
1    POINT (-123.78469 39.33190)
2    POINT (-122.63693 37.93498)
3    POINT (-122.72186 38.03291)
4    POINT (-122.87652 38.04230)
dtype: geometry

GeoDataFrame
             species  year                     geometry
0  Ariolimax buttoni  2022  POINT (-122.70987 37.89728)
1  Ariolimax buttoni  2020  POINT (-123.78469 39.33190)
2  Ariolimax buttoni  2022  POINT (-122.63693 37.93498)
3  Ariolimax buttoni  2022  POINT (-122.72186 38.03291)
4  Ariolimax buttoni  2022  POINT (-122.87652 38.04230)
# plot the points geographically by species
fig, ax = plt.subplots(figsize=(5,5))
sp = ariolimax.plot(
column='species',
ax=ax,
legend=True,
legend_kwds={"bbox_to_anchor": (1, 0.5)}
)
Converting from arrays to coordinates¶
If you're working with coordinate data in memory, say as an array or a list, xy_to_geoseries works with pretty much any iterable type.
# a couple of dummy locations
lons = [-122.49, 151.0]
lats = [37.79, -33.87]
# it's smart to explicitly set the CRS/projection to ensure the points are placed correctly
# the CRS below assigns the points the lat/lon (WGS84) coordinate reference system
crs = "EPSG:4326"
random_locations = ela.xy_to_geoseries(lons, lats, crs=crs)
# these points now have lots of neat geographic methods and properties
xmin, ymin, xmax, ymax = random_locations.total_bounds
print(f"Bounding box: [{xmin}, {ymin}, {xmax}, {ymax}]")
# including getting a description of the coordinate reference system (crs)
print("")
pprint(random_locations.crs)
Bounding box: [-122.49, -33.87, 151.0, 37.79]

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
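Because a GeoSeries carries its CRS, reprojecting it is a one-liner with to_crs. A quick sketch using the same two dummy points, reprojected to web mercator (EPSG:3857, chosen here just for illustration):

```python
import geopandas as gpd

# the same two dummy locations, in lat/lon (EPSG:4326)
lons = [-122.49, 151.0]
lats = [37.79, -33.87]
points = gpd.GeoSeries(gpd.points_from_xy(lons, lats), crs="EPSG:4326")

# reproject to web mercator; coordinates are now in meters
mercator = points.to_crs("EPSG:3857")
print(mercator.crs)
```

to_crs accepts EPSG codes, WKT strings, or pyproj CRS objects, so the same call works for any target projection.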
Working with raster data¶
This demo includes six raster files, defined at the start of the notebook, corresponding to the mean and standard deviation of annual cloud cover, surface temperature, and vegetation growth across California.
Following a brief overview of the rasters, the sections below will review how to extract data from them using point and polygon vector data.
# first we'll just read and display the surface temperature data
fig, ax = plt.subplots(figsize=(6,6))
surfacetemp_raster = rasters[4]
with rio.open(surfacetemp_raster, 'r') as src:
profile = src.profile.copy()
# this plots the mean annual temperature data
# low temps are dark grey, high temps are light grey
rioplot.show(src, ax=ax, cmap="gray")
# overlay the ariolimax data
ariolimax.to_crs(src.crs).plot(column='species', ax=ax, legend=True)
print("profile metadata")
pprint(profile)
profile metadata
{'blockxsize': 512,
 'blockysize': 512,
 'compress': 'lzw',
 'count': 1,
 'crs': CRS.from_epsg(32610),
 'driver': 'GTiff',
 'dtype': 'float32',
 'height': 1040,
 'interleave': 'band',
 'nodata': -9999.0,
 'tiled': True,
 'transform': Affine(1000.0, 0.0, 380000.0,
       0.0, -1000.0, 4654000.0),
 'width': 938}
Drawing random pseudoabsence locations¶
In addition to species occurrence records (where y = 1), species distribution models often require a set of random pseudoabsence/background points (y = 0). These are a random geographic sampling of where you might expect to find a species across the target landscape.
The background point distribution is extremely important for presence-only models and should be selected with care. elapid provides a range of methods for sample selection.
For the uninitiated, these papers by Morgane Barbet-Massin et al. and Phillips et al. are excellent resources for considering how background point samples should be selected.
From a raster's extent¶
You can use elapid to create a uniform random geographic sample from the unmasked locations within a raster's extent:
# sampling
count = 10_000
pseudoabsence = ela.sample_raster(surfacetemp_raster, count)
# plotting
pseudoabsence.plot(markersize=0.5)
pseudoabsence.describe()
count                       10000
unique                       9879
top       POINT (1243500 3737500)
freq                            2
dtype: object
# some pixel centroids were sampled multiple times
# you can drop them if you'd like.
#pseudoabsence.drop_duplicates(inplace=True)
If you have a large raster that doesn't fit in memory, you can instead set ignore_mask=True to draw samples from the raster's full rectangular bounds. Many of the sample locations will land in nodata areas, but we can address that later.
# sampling
pseudoabsence_uniform = ela.sample_raster(
surfacetemp_raster,
count,
ignore_mask=True,
)
# plotting
pseudoabsence_uniform.plot(markersize=0.5)
pseudoabsence_uniform.describe()
count                                           10000
unique                                          10000
top       POINT (1036258.4043319515 4263597.03776096)
freq                                                1
dtype: object
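Once those rectangular-bounds samples are annotated with raster values, the points that landed on nodata pixels can be dropped with a simple mask. A minimal numpy sketch, assuming hypothetical extracted values that use the -9999.0 nodata sentinel from the raster profile above:

```python
import numpy as np

NODATA = -9999.0

# hypothetical values extracted at six background points
values = np.array([12.5, NODATA, 3.1, NODATA, 7.7, 0.2])

# keep only the points that fell on valid pixels
valid = values != NODATA
filtered = values[valid]
print(valid.sum())  # → 4
```

The same boolean mask can be used to subset the GeoSeries of point geometries so that points and values stay aligned.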