elapid.train_test_split
Methods for geographically splitting data into train/test sets.
BufferedLeaveOneOut
Bases: BaseCrossValidator
Leave-one-out cross-validation that excludes training points within a buffer distance of the left-out test point(s).
Source code in elapid/train_test_split.py
__init__(distance)
Buffered leave-one-out cross-validation strategy.
Drops points from the training data based on a buffered distance to the left-out test point(s). Implemented from Ploton et al. 2020, www.nature.com/articles/s41467-020-18321-y
Parameters:
Name | Type | Description | Default |
---|---|---|---|
distance | float | drop training data points within this distance of test data. | required |
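The buffering idea can be illustrated without elapid: for a left-out test point, keep only the training points that lie farther away than distance. The sketch below is a minimal standard-library illustration, not elapid's implementation. The helper name buffered_training_indices, the sample coordinates, and the choice to drop points lying exactly at the buffer distance are all assumptions.

```python
import math


def buffered_training_indices(coords, test_index, distance):
    """Return indices of points farther than `distance` from the test point.

    Hypothetical helper for illustration only; it is not part of elapid.
    """
    tx, ty = coords[test_index]
    keep = []
    for i, (x, y) in enumerate(coords):
        if i == test_index:
            continue  # the left-out test point never joins the training set
        # Assumption: points exactly at the buffer distance are dropped too.
        if math.hypot(x - tx, y - ty) > distance:
            keep.append(i)
    return keep


# Made-up projected coordinates in metres.
coords = [(0.0, 0.0), (300.0, 400.0), (3000.0, 0.0), (0.0, 5000.0)]

# Point 1 is 500 m from point 0, so a 1000 m buffer removes it from training.
train = buffered_training_indices(coords, test_index=0, distance=1000.0)
print(train)
```

With elapid itself, the corresponding cross-validator would be constructed as BufferedLeaveOneOut(distance=1000) for points stored in a metre-based CRS.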
get_n_splits(points, class_label=None, groups=None)
Return the number of splitting iterations in the cross-validator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
points | Vector | point-format GeoSeries or GeoDataFrame. | required |
class_label | str | column to specify presence locations (y==1). | None |
groups | str | column to group train/test splits by. | None |
Returns:
Type | Description |
---|---|
int | Splitting iteration count. |
split(points, class_label=None, groups=None)
Split point data into train/test folds and return their array indices.
Default behaviour is to perform leave-one-out cross-validation, meaning there will be as many train/test splits as there are samples.
To run leave-one-out splits for each y==1 sample, use the class_label parameter to define which column includes the class to leave out. To run a grouped leave-one-out, use the groups parameter to define which column includes unique IDs to group by.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
points | Vector | point-format GeoSeries or GeoDataFrame. | required |
class_label | str | column to specify presence locations (y==1). | None |
groups | str | column to group train/test splits by. | None |
Yields:
Type | Description |
---|---|
Tuple[ndarray, ndarray] | (train_idxs, test_idxs) the train/test splits for each fold. |
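The three splitting modes described for split (every sample, only y==1 samples, or one fold per group) can be sketched in plain Python. This is a rough illustration under stated assumptions rather than elapid's code: buffered_loo_folds is a hypothetical helper, plain lists stand in for GeoDataFrame columns, groups takes precedence when both options are given, and a training point is dropped if it lies within the buffer of any test point in the fold.

```python
import math


def buffered_loo_folds(coords, distance, labels=None, groups=None):
    """Yield (train_idxs, test_idxs) for buffered leave-one-out folds.

    Hypothetical helper: `labels` and `groups` are plain lists standing in
    for the class_label and groups columns.
    """
    n = len(coords)
    if groups is not None:
        # Grouped mode: one fold per unique group ID, in order of appearance.
        unique = list(dict.fromkeys(groups))
        test_sets = [[i for i in range(n) if groups[i] == g] for g in unique]
    elif labels is not None:
        # Class mode: one fold per y == 1 sample.
        test_sets = [[i] for i in range(n) if labels[i] == 1]
    else:
        # Default mode: one fold per sample.
        test_sets = [[i] for i in range(n)]

    for test in test_sets:
        held_out = set(test)
        # Keep points outside the fold that are beyond the buffer of every test point.
        train = [
            i
            for i in range(n)
            if i not in held_out
            and all(math.dist(coords[i], coords[t]) > distance for t in test)
        ]
        yield train, test


# Made-up coordinates (metres), presence labels, and group IDs.
coords = [(0, 0), (10, 0), (500, 0), (510, 0), (2000, 0)]
labels = [1, 0, 1, 0, 0]
groups = ["a", "a", "b", "b", "c"]

plain = list(buffered_loo_folds(coords, distance=100))
by_class = list(buffered_loo_folds(coords, distance=100, labels=labels))
by_group = list(buffered_loo_folds(coords, distance=100, groups=groups))
print(len(plain), len(by_class), len(by_group))
```

The fold counts in this sketch follow the description above: one per sample by default, one per y==1 sample in class mode, and one per unique ID in grouped mode.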
GeographicKFold
Bases: BaseCrossValidator
Compute geographically clustered train/test folds using KMeans clustering.
__init__(n_splits=4)
Cluster x/y points into separate cross-validation folds.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_splits | int | Number of geographic clusters to split the data into. | 4 |
get_n_splits()
Return the number of splitting iterations in the cross-validator.
Returns:
Type | Description |
---|---|
int | Splitting iteration count. |
split(points)
Split point data into geographically-clustered train/test folds and return their array indices.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
points | Vector | point-format GeoSeries or GeoDataFrame. | required |
Yields:
Type | Description |
---|---|
Tuple[ndarray, ndarray] | (train_idxs, test_idxs) the train/test splits for each geo fold. |
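To show how spatial clusters can become cross-validation folds, the sketch below clusters x/y coordinates and then holds out one cluster at a time. It is a simplified stand-in, not elapid's implementation: a tiny hand-written Lloyd's k-means with farthest-point initialisation replaces a library KMeans, and the function names and coordinates are hypothetical.

```python
import math


def kmeans_labels(coords, k, n_iter=20):
    """Tiny Lloyd's k-means; returns one cluster label per point."""
    # Farthest-point initialisation keeps this sketch deterministic.
    centroids = [coords[0]]
    while len(centroids) < k:
        centroids.append(
            max(coords, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(coords)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in coords
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(coords, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels


def geographic_folds(coords, n_splits):
    """Yield (train_idxs, test_idxs), holding out one spatial cluster per fold."""
    labels = kmeans_labels(coords, n_splits)
    for fold in range(n_splits):
        test = [i for i, lab in enumerate(labels) if lab == fold]
        train = [i for i, lab in enumerate(labels) if lab != fold]
        yield train, test


# Three well-separated pairs of made-up points.
coords = [(0, 0), (1, 1), (100, 0), (101, 1), (0, 100), (1, 101)]
folds = list(geographic_folds(coords, n_splits=3))
for train, test in folds:
    print(train, test)
```

Because each fold's test set is a whole spatial cluster, test points tend to be geographically separated from the training points, which is the motivation for clustering instead of splitting at random.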
checkerboard_split(points, grid_size, buffer=0, bounds=None)
Create train/test splits with a spatially-gridded checkerboard.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
points | Vector | point-format GeoSeries or GeoDataFrame | required |
grid_size | float | the height and width of each checkerboard square used to split the data. Should match the units of the points CRS (e.g. grid_size=1000 is a 1 km grid for UTM data) | required |
buffer | float | add an x/y buffer around the initial checkerboard bounds | 0 |
bounds | Tuple[float, float, float, float] | instead of deriving the checkerboard bounds from | None |
Returns:
Type | Description |
---|---|
Tuple[GeoDataFrame, GeoDataFrame] | (train_points, test_points) split using a checkerboard grid. |
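A checkerboard split can be pictured as assigning each point to a grid cell and alternating cells between train and test. The following is a minimal sketch of that idea, not elapid's implementation. The helper name checkerboard_assign is hypothetical, the grid origin is assumed to be the minimum x/y of the points minus buffer, which cell colour counts as "train" is an arbitrary choice, and the bounds option is left out.

```python
import math


def checkerboard_assign(coords, grid_size, buffer=0.0):
    """Split point indices into (train, test) by checkerboard cell parity.

    Hypothetical helper for illustration only; it is not part of elapid.
    """
    # Assumption: the grid starts at the lower-left corner of the points,
    # pushed outwards by `buffer`.
    xmin = min(x for x, _ in coords) - buffer
    ymin = min(y for _, y in coords) - buffer

    train, test = [], []
    for i, (x, y) in enumerate(coords):
        col = math.floor((x - xmin) / grid_size)
        row = math.floor((y - ymin) / grid_size)
        # Cells whose column + row is even go to train, odd cells to test.
        if (col + row) % 2 == 0:
            train.append(i)
        else:
            test.append(i)
    return train, test


# Made-up UTM-style coordinates in metres, split on a 1 km grid.
coords = [(0, 0), (1500, 0), (0, 1500), (1500, 1500), (2500, 500)]
train_idx, test_idx = checkerboard_assign(coords, grid_size=1000)
print(train_idx, test_idx)
```

In elapid, checkerboard_split(points, grid_size=1000) returns the train and test points themselves rather than index lists, as described in the Returns table above.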