Geospatial Data Support

The following code snippets are guidelines for how to work with spatially-explicit datasets in elapid.

A note on point-format data types

Almost all of the elapid point sampling and indexing tools use geopandas.GeoDataFrame or geopandas.GeoSeries objects. The latter is the format of the geometry column within a GeoDataFrame.

import geopandas as gpd

gdf = gpd.read_file('/path/to/some/vector.shp')
print(type(gdf.geometry))

> <class 'geopandas.geoseries.GeoSeries'>

Several elapid routines return GeoSeries objects (like elapid.sample_raster()) or GeoDataFrame objects (like elapid.annotate()). elapid also includes tools for converting x/y data from list, numpy.ndarray, or pandas.DataFrame objects to GeoSeries.


Working with X/Y data

From CSVs

Sometimes you don't have a vector of point-format location data. The Java implementation of Maxent uses CSV files, for example. You can convert those using the elapid.xy_to_geoseries() function:

import elapid
import pandas as pd

csv_path = "/home/cba/ariolimax-californicus.csv"
df = pd.read_csv(csv_path)
presence = elapid.xy_to_geoseries(df.x, df.y, crs="EPSG:32610")

This assumes that the input CSV file has an x and a y column.
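
If the coordinate columns have different names, pass them in explicitly. A minimal sketch, assuming hypothetical longitude and latitude columns:

presence = elapid.xy_to_geoseries(df["longitude"], df["latitude"], crs="EPSG:32610")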

Make sure you specify the projection of your x/y data. The default assumption is lat/lon, which in many cases is not correct.

From arrays or lists

You can also convert arbitrary arrays of x/y data:

lons = [-122.49, 151.0]
lats = [37.79, -33.87]
locations = elapid.xy_to_geoseries(lons, lats)
print(locations)

> 0    POINT (-122.49000 37.79000)
> 1    POINT (151.00000 -33.87000)
> dtype: geometry

Drawing point location samples

In addition to species occurrence records (y = 1), Maxent requires a set of pseudo-absence/background points (y = 0). These are a random geographic sampling of where you might expect to find a species across the target landscape.

From a raster's extent

You can use elapid to create a random geographic sampling of points from unmasked locations within a raster's extent:

raster_path = "/home/slug/california-leaf-area-index.tif" # any rasterio-readable file
count = 10000 # the number of points to generate
pseudoabsence_points = elapid.sample_raster(raster_path, count)

If you have a large raster that doesn't fit in memory, you can also set ignore_mask=True to use the rectangular bounds of the raster to draw samples.

pseudoabsence_points = elapid.sample_raster(raster_path, count, ignore_mask=True)

From a polygon vector

Species occurrence records are often biased in their collection (collected near roads, in protected areas, etc.), so we typically need to be more precise in where we select pseudo-absence points. You could use a vector with a species range map to select records:

range_path = "/home/slug/ariolimax-californicus-range.shp"
pseudoabsence_points = elapid.sample_vector(range_path, count)

This currently computes the spatial union of all polygons within a vector, compressing the geometry into a single MultiPolygon object to sample from.

If you've already computed a polygon using geopandas, you can pass it instead to elapid.sample_geoseries(), which is what sample_vector() does under the hood.
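
For example, a minimal sketch that reads the range map with geopandas and passes its geometry column directly:

range_gdf = gpd.read_file(range_path)
pseudoabsence_points = elapid.sample_geoseries(range_gdf.geometry, count)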

From a bias raster

You could also pass a raster bias file, where the raster grid cells contain information on the probability of sampling an area:

# assume covariance between vertebrate and invertebrate banana slugs
bias_path = "/home/slug/proximity-to-ucsc.tif"
pseudoabsence_points = elapid.sample_bias_file(bias_path, count)

The grid cells can contain an arbitrary range of values. What matters is that the values scale linearly: higher values mean a location is more likely to be sampled. The probability of drawing a sample depends on two factors: the range of values provided and the frequency of those values across the dataset.

So, for a raster with values of 1 and 2, your sampling probability at locations with a value of 2 is twice that of locations with a value of 1. If these occur in equal frequency (i.e. half the data are 1 values, half are 2 values), then you'll likely draw twice as many samples from 2-valued areas. If 1 values are much more frequent than 2 values, the output distribution will shift towards 1-valued locations, but on a per-draw basis you're still more likely to sample 2-valued locations.

The above example prioritizes sampling frequency in the areas around UC Santa Cruz, home to all types of slug, based on the distance to the campus.
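
To make the per-draw versus aggregate distinction concrete, here's a toy illustration of value-weighted sampling using numpy (an analogy, not elapid's internal implementation):

import numpy as np

# a toy grid: three 1-valued cells and one 2-valued cell
values = np.array([1, 1, 1, 2])
weights = values / values.sum() # [0.2, 0.2, 0.2, 0.4]

# each 2-valued cell is twice as likely to be drawn as each 1-valued cell,
# but 1 values dominate the output because they're three times as frequent
draws = np.random.choice(values, size=10000, p=weights)
print((draws == 1).mean(), (draws == 2).mean()) # approximately 0.6, 0.4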


Point Annotation

Annotation refers to reading and storing raster values at the locations of a series of point occurrence records in a single GeoDataFrame table.

Once you have your species presence and pseudo-absence records, you can annotate these records with the covariate data from each location.

pseudoabsence_covariates = elapid.annotate(
    pseudoabsence_points,
    list_of_raster_paths,
    drop_na = True,
)

Since this function is geographically indexed, the point data and the raster data don't need to be in the same projection. elapid handles reprojection and sampling on the fly.

It also allows you to pass multiple raster files, which can be in different projections, extents, or grid sizes. This means you don't have to explicitly re-sample your raster data prior to analysis, which is always a chore.

raster_paths = [
    "/home/slug/california-leaf-area-index.tif", # 1-band vegetation data
    "/home/slug/global-cloud-cover.tif", # 3-band min, mean, max annual cloud cover
    "/home/slug/usa-mean-temperature.tif", # 1-band mean temperature
]

# this fake dataset has five raster bands total, so specify each band label
labels = [
    "LAI",
    "CLD-min",
    "CLD-mean",
    "CLD-max",
    "TMP-mean",
]

pseudoabsence_covariates = elapid.annotate(
    pseudoabsence_points,
    raster_paths,
    labels = labels,
    drop_na = True,
)

If you don't specify the labels, elapid will assign ['b1', 'b2', ..., 'bn'] labels to each column.
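
For example, a quick way to inspect the assigned column names (the printed output is illustrative, assuming the five-band raster list above):

unlabeled = elapid.annotate(pseudoabsence_points, raster_paths)
print(unlabeled.columns.tolist())

> ['b1', 'b2', 'b3', 'b4', 'b5', 'geometry']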

Setting drop_na=True requires that the raster datasets have nodata values assigned in their metadata. Point locations that overlap nodata values will be dropped from the output dataframe, which will have fewer rows than the input points.


Zonal statistics

In addition to the tools for working with Point data, elapid contains a routine for calculating zonal statistics from Polygon or MultiPolygon geometry types.

This routine reads an array of raster data covering the extent of a polygon, masks the areas outside the polygon, and computes summary statistics such as the mean, standard deviation, and mode of the array.

ecoregions = gpd.read_file('/path/to/california-ecoregions.shp')
zs = elapid.zonal_stats(
    ecoregions,
    raster_paths,
    labels = labels,
    mean = True,
    stdv = True,
    percentiles = [10, 90],
)

Which stats are reported is managed by a set of keywords (count=True, sum=True, skew=True). The all=True keyword is a shortcut to compute all of the stats. You'll still have to explicitly pass a list of percentiles, however, like this:

zs = elapid.zonal_stats(
    ecoregions,
    raster_paths,
    labels = labels,
    all = True,
    percentiles = [25, 50, 75],
)

What sets the elapid zonal stats method apart from other zonal stats packages is that it:

  • handles reprojection on the fly, meaning the input vector/raster data don't need to be reprojected a priori
  • handles multi-band input, computing summary stats on all bands (instead of requiring you to specify one band)
  • handles multi-raster input, reading inputs in serial but creating a single output GeoDataFrame.

Applying predictions to data

Once you've fit a model (we're skipping a step here, but see the Maxent overview for an example), you can apply it to a set of raster covariates to produce gridded habitat suitability maps.

elapid.apply_model_to_rasters(
    model,
    raster_paths,
    output_path,
    template_idx = 0,
    transform = "cloglog",
    nodata = -9999,
)

The list and band order of the rasters must match the order of the covariates used to train the model. The function reads each dataset on a block-wise basis, applies the model, and writes gridded predictions.

If the raster datasets are not consistent (different extents, resolutions, etc.), it will re-project the data on the fly, with the grid size, extent, and projection based on a 'template' raster. Use the template_idx keyword to specify the index of the raster file to use as the template (template_idx=0 sets the first raster as the template).

In the example above, it's important to set the template to the california-leaf-area-index.tif file because it has the smallest extent with data: the model will only read and apply to the USA and global datasets within the area covered by California. If you were to set the template to usa-mean-temperature.tif instead, it would still technically work, but there would be a large region of nodata values wherever covariate coverage is insufficient.

Applying other model predictions

The apply_model_to_rasters() function can be used to apply model predictions from any estimator with a model.predict() method. This includes the majority of sklearn model estimators.

If you wanted to train and apply a Random Forest model, you'd use a pattern like this:

import elapid
from sklearn.ensemble import RandomForestClassifier

x, y = elapid.load_sample_data()

model = RandomForestClassifier()
model.fit(x, y)

input_rasters = ['/path/to/raster1.tif', '/path/to/raster2.tif']
output_raster = 'rf-model-prediction-categorical.tif'
elapid.apply_model_to_rasters(
    model,
    input_rasters,
    output_raster,
)

These models contain an additional predict_proba() method for estimating continuous class prediction probabilities. To write these out, set predict_proba=True. To use this option, however, you also have to specify the number of output raster bands. Since the sample data is a 2-class model, the output prediction probabilities are 2-band outputs, so we set count=2.

output_raster = 'rf-model-prediction-probabilities.tif'
elapid.apply_model_to_rasters(
    model,
    input_rasters,
    output_raster,
    predict_proba = True,
    count = 2,
)