Merged

Commits (49)
5b1273f
remove dataset from main
bw4sz May 27, 2025
18884f5
don't overwrite utilities
bw4sz May 10, 2025
dc6e957
checking out to fix style editor
bw4sz May 12, 2025
19b5920
seperate NMS for per-image tiles
bw4sz May 12, 2025
e34f2ef
working on global box class
bw4sz May 13, 2025
36f2a2c
split out dataset tests
bw4sz May 13, 2025
2e7a677
split out individual tests for each dataset strategy
bw4sz May 13, 2025
812488e
predict_tile tests pass
bw4sz May 14, 2025
fde2f48
revert training, needs to be seperate PR
bw4sz May 14, 2025
5c039dc
rebase main
bw4sz May 19, 2025
a7ed84f
confirm model loading once
bw4sz May 19, 2025
1324914
local tests pass
bw4sz May 19, 2025
006cea7
Style updates
bw4sz May 19, 2025
c9e8f90
docformatter
bw4sz May 19, 2025
0964cc6
update tests with logging error
bw4sz May 19, 2025
0429a53
sphinx error for deepforest datasets is now a submodule
bw4sz May 20, 2025
8d4d8f3
more explicit multi-processing tests
bw4sz May 20, 2025
ec0e843
multiprocessing GPU error
bw4sz May 20, 2025
41c3bfe
move the image height into __init__
bw4sz May 20, 2025
21f095b
more multiprocessing tests
bw4sz May 20, 2025
2c808d7
eval edits from Josh
bw4sz May 27, 2025
9006019
improve the documentation around inference
bw4sz May 28, 2025
90afbbd
add image on dataloader strategy
bw4sz May 28, 2025
7543e1c
fix image path
bw4sz May 28, 2025
97b2627
doc edits
bw4sz Jun 3, 2025
e5089df
predict tile paths argument
bw4sz Jun 3, 2025
ea6298a
Revert "predict tile paths argument"
bw4sz Jun 3, 2025
7e7f1e8
remove paths argument from predict_tile
bw4sz Jun 3, 2025
0bf9c72
style updates
bw4sz Jun 3, 2025
84be208
avoid edge cases when predict_tile passes an image
bw4sz Jun 3, 2025
c43dcc2
edge case for image path
bw4sz Jun 3, 2025
fd19e07
batch results error
bw4sz Jun 4, 2025
2c8d59e
fold then pad, not pad then fold
bw4sz Jun 4, 2025
4aac4e9
crop result is wrong indent
bw4sz Jun 4, 2025
abc2221
skip if there is an empty image within a list
bw4sz Jun 4, 2025
a957fd1
take validation metrics out of epoch interval
bw4sz Jun 5, 2025
c8e2119
use predict batch instead of predict dataloader
bw4sz Jun 5, 2025
64f1d34
postprocess dataloader
bw4sz Jun 5, 2025
1f93d4e
move to batch on eval
bw4sz Jun 5, 2025
e54f400
specifically invoke eval mode
bw4sz Jun 5, 2025
509ec5e
style edits
bw4sz Jun 5, 2025
b4757fe
add batch size arg to predict dataloader, related to https://github.c…
bw4sz Jun 9, 2025
0150519
went back to passing path to val_step instead of a seperate loader
bw4sz Jun 10, 2025
0f68ecb
add paths to tests for train boxdataset
bw4sz Jun 10, 2025
ea5672f
add a recreate model argument from the cropmodel
bw4sz Jun 10, 2025
2b0d480
reverted model behavior to allow blank model if desired
bw4sz Jun 10, 2025
90ac4de
style update
bw4sz Jun 10, 2025
768321a
don't return model on self.create_model
bw4sz Jun 10, 2025
f50384e
typo in cropmodel test
bw4sz Jun 10, 2025
9 changes: 1 addition & 8 deletions docs/source/deepforest.rst
@@ -8,6 +8,7 @@ Subpackages
:maxdepth: 4

deepforest.data
deepforest.datasets

Submodules
----------
@@ -28,14 +29,6 @@ deepforest.callbacks module
:undoc-members:
:show-inheritance:

deepforest.dataset module
-------------------------

.. automodule:: deepforest.dataset
:members:
:undoc-members:
:show-inheritance:

deepforest.evaluate module
--------------------------

66 changes: 13 additions & 53 deletions docs/user_guide/07_scaling.md
@@ -34,68 +34,28 @@ https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#troublesho

## Prediction

Often we have a large number of tiles we want to predict. DeepForest uses [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) to scale inference. This gives us access to powerful tools for scaling without any changes to user code. DeepForest automatically detects whether you are running on GPU or CPU. The parallelization strategy is to run each tile on a separate GPU; we cannot parallelize crops from within the same tile across GPUs inside of main.predict_tile(). If you set m.create_trainer(accelerator="gpu", devices=4) and run predict_tile, you will only use 1 GPU per tile. This is because we need access to all crops to create a mosaic of the predictions.
Often we have a large number of tiles we want to predict. DeepForest uses [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/) to scale inference. This gives us access to powerful tools for scaling without any changes to user code. DeepForest automatically detects whether you are running on GPU or CPU.

### Scaling prediction across multiple GPUs
There are three dataset strategies that *balance CPU memory, GPU memory, and GPU utilization* using batch sizes.

There are a few situations in which it is useful to replicate the DeepForest module across many separate Python processes. This is especially helpful when we have a series of non-interacting tasks, often called 'embarrassingly parallel' processes. In these cases, no DeepForest instance needs to communicate with another instance. Rather than coordinating GPUs with the associated annoyance of overhead and backend errors, we can just launch separate jobs and let them finish on their own. One helpful tool in Python is [Dask](https://www.dask.org/). Dask is a wonderful open-source tool for coordinating large-scale jobs. Dask can be run locally, across multiple machines, and with an arbitrary set of resources.
```python
prediction_single = m.predict_tile(path=path, patch_size=300, dataloader_strategy="single")
```
The `dataloader_strategy` parameter has three options, with a short usage sketch after the list:

### Example Dask and DeepForest integration using SLURM
* **single**: Loads the entire image into CPU memory and passes individual windows to GPU.

Imagine we have a list of images we want to predict using `deepforest.main.predict_tile()`. DeepForest does not allow multi-GPU inference within each tile, as it is too much of a headache to make sure the threads return the correct overlapping window. Instead, we can parallelize across tiles, such that each GPU takes a tile and performs an action. The general structure is to create a Dask client across multiple GPUs, submit each DeepForest `predict_tile()` instance, and monitor the results. In this example, we are using a SLURMCluster, a common job scheduler for large clusters. There are many similar ways to create a Dask client object that will be specific to a particular organization. The following arguments are specific to the University of Florida cluster, but will be largely similar to other SLURM naming conventions. We use the extra Dask package, `dask-jobqueue`, which helps format the call.
* **batch**: Loads the entire image into GPU memory and creates views of the image as batches. Requires the entire tile to fit into GPU memory. CPU parallelization is possible for loading images.

* **window**: Loads only the desired window of the image from the raster dataset. This is the most memory-efficient option, but it cannot parallelize across windows due to rasterio's Global Interpreter Lock (GIL), so workers must be set to 0.
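
As a rough usage sketch (assuming the same `m` model and `path` raster used in the example above), the remaining two strategies are called the same way; the `workers` setting for the window strategy follows the GIL note above:

```python
# Sketch only: "batch" keeps the whole tile in GPU memory, so it suits tiles
# that fit comfortably on the device.
prediction_batch = m.predict_tile(path=path, patch_size=300, dataloader_strategy="batch")

# "window" reads one window at a time from the raster; set workers to 0 because
# the windows cannot be read from multiple worker processes.
m.config["workers"] = 0
prediction_window = m.predict_tile(path=path, patch_size=300, dataloader_strategy="window")
```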

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(processes=1,
                       cores=10,
                       memory="40 GB",
                       walltime='24:00:00',
                       job_extra=extra_args,
                       extra=['--resources gpu=1'],
                       nanny=False,
                       scheduler_options={"dashboard_address": ":8787"},
                       local_directory="/orange/idtrees-collab/tmp/",
                       death_timeout=100)
print(cluster.job_script())
cluster.scale(10)

dask_client = Client(cluster)
```
## Data Loading

This job script gets a single GPU with 40 GB of memory and 10 CPUs. We then ask for 10 instances of this setup.
Now that we have a dask client, we can send our custom function.
DeepForest uses PyTorch's DataLoader for efficient data loading. One important parameter for scaling is `num_workers`, which controls parallel data loading using multiple CPU processes. This can be set through the model config:

```python
import os
from deepforest import main

def function_to_parallelize(tile):
    m = main.deepforest()
    m.load_model("weecology/deepforest-tree") # sub in the custom logic to load your own models
    boxes = m.predict_tile(raster_path=tile)
    # save the predictions using the tile pathname
    filename = "{}.csv".format(os.path.splitext(os.path.basename(tile))[0])
    filename = os.path.join(<savedir>, filename)
    boxes.to_csv(filename)

    return filename
```

```python
tiles = [<list of tiles to predict>]
futures = []
for tile in tiles:
    future = dask_client.submit(function_to_parallelize, tile)
    futures.append(future)
m.config["workers"] = 10
```
Setting workers to 0 runs without multiprocessing; workers > 1 runs with multiprocessing. Increase this value slowly, as IO constraints can lead to deadlocks among workers.

We can wait to see the futures as they complete! Dask also has a beautiful visualization tool using bokeh.

```python
for x in futures:
    completed_filename = x.result()
    print(completed_filename)
```
68 changes: 49 additions & 19 deletions docs/user_guide/12_evaluation.md
@@ -1,40 +1,69 @@
# Evaluation

We stress that evaluation data must be different from training data, as neural networks have millions of parameters and can easily memorize thousands of samples. Avoid random train-test splits, try to create test datasets that mimic downstream tasks. If you are predicting among temporal surveys or across imaging platforms, your train-test data should reflect these partitions. Random sampling is almost never the right choice, biological data often has high spatial, temporal or taxonomic correlation that makes it easier for your model to generalize, but will fail when pushed into new situations.
DeepForest allows users to assess model performance against ground-truth data.

DeepForest provides several evaluation metrics. There is no one-size-fits all evaluation approach, and the user needs to consider which evaluation metric best fits the task. There is significant information online about the evaluation of object detection networks. Our philosophy is to provide a user with a range of statistics and visualizations. Always visualize results and trust your judgment. Never be guided by a single metric.
## Summary

## Further Reading
1. Recall - the proportion of ground-truth objects correctly covered by predictions.
2. Precision - the proportion of predictions that overlap ground-truth.
3. Empty-frame accuracy - the proportion of ground-truth empty images that are correctly predicted to have no objects of interest.
4. IoU - Intersection-over-Union, a computer vision metric that assesses how tightly a bounding box prediction overlaps with its matched ground-truth.
5. mAP - Mean-Average-Precision, a computer vision metric that assesses the performance of the model incorporating precision, recall, and the average score of true positives. See below.

[MeanAveragePrecision in torchmetrics](https://medium.com/data-science-at-microsoft/how-to-smoothly-integrate-meanaverageprecision-into-your-training-loop-using-torchmetrics-7d6f2ce0a2b3)
## Evaluation code

[A general explanation of the mAP metric](https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173)
```python
from deepforest import main, get_data
m = main.deepforest()
m.load_model("weecology/deepforest-tree")
# Sample data
csv_file = get_data("OSBS_029.csv")
results = m.evaluate(csv_file, iou_threshold=0.4)
```

[Comparing Object Detection Models](https://www.comet.com/site/blog/compare-object-detection-models-from-torchvision/)
The returned results object is a dictionary containing the metrics, the predictions dataframe, and the ground-truth dataframe.
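
For example, the individual pieces can be pulled out by key (the key names here follow the dictionary shown later in this guide):

```python
# Summary statistics
print(results["box_recall"])
print(results["box_precision"])

# Matched predictions and the ground truth used for scoring
matches = results["results"]
ground_truth = results["ground_df"]
```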

## Evaluation Philosophy and Further Information

We stress that evaluation data must be different from training data, as neural networks have millions of parameters and can easily memorize thousands of samples. Avoid random train-test splits; try to create test datasets that mimic downstream tasks. If you are predicting among temporal surveys or across imaging platforms, your train-test data should reflect these partitions. Random sampling is almost never the right choice: biological data often has high spatial, temporal, or taxonomic correlation that makes it easier for your model to appear to generalize, but it will fail when pushed into new situations.

## Average Intersection over Union
DeepForest provides several evaluation metrics. There is no one-size-fits-all evaluation approach, and the user needs to consider which evaluation metric best fits the task. There is significant information online about the evaluation of object detection networks. Our philosophy is to provide the user with a range of statistics and visualizations. Always visualize results and trust your judgment. Never be guided by a single metric.

## Metrics

### Average Intersection over Union
DeepForest modules use torchmetrics' [IntersectionOverUnion](https://torchmetrics.readthedocs.io/en/stable/detection/intersection_over_union.html) metric. This calculates the average overlap between predictions and ground truth boxes. This can be considered a general indicator of model performance but is not sufficient on its own for model evaluation. There are lots of reasons predictions might overlap with ground truth; for example, consider a model that covers an entire image with boxes. This would have a high IoU but a low value for model utility.

## Mean-Average-Precision (mAP)
### Mean-Average-Precision (mAP)
mAP is the standard COCO evaluation metric and the most common for comparing computer vision models. It is useful as a summary statistic. However, it has several limitations for an ecological use case.

1. Not intuitive and difficult to translate to ecological applications. Read the sections above and visualize the mAP metric, which is essentially the area under the precision-recall curve at a range of IoU values.
2. The vast majority of biological applications use a fixed cutoff to determine an object of interest in an image. Perhaps in the future we will weight tree boxes by their confidence score, but currently we do things like, "All predictions > 0.4 score are considered positive detections". This does not connect well with the mAP metric.

## Precision and Recall at a set IoU threshold.
For information on how to calculate mAP, see the [torchmetrics documentation](https://torchmetrics.readthedocs.io/en/stable/detection/mean_average_precision.html) and further reading below.
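
As a rough sketch of the torchmetrics API, independent of DeepForest (the box coordinates below are made-up values):

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# One dict per image: boxes in xyxy format, plus scores and integer labels.
preds = [{
    "boxes": torch.tensor([[10.0, 10.0, 50.0, 50.0]]),
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "boxes": torch.tensor([[12.0, 11.0, 48.0, 52.0]]),
    "labels": torch.tensor([0]),
}]

metric = MeanAveragePrecision()
metric.update(preds, targets)
print(metric.compute()["map"])
```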

### Precision and Recall at a set IoU threshold
This was the original DeepForest metric, set to an IoU of 0.4. This means that all predictions that overlap a ground truth box at IoU > 0.4 are true positives. As opposed to the torchmetrics above, it is intuitive and matches downstream ecological tasks. The drawback is that it is slow, coarse, and does not fully reward the model for having high confidence scores on true positives.

There is an additional difference between ecological object detection tasks, like tree crown detection, and traditional computer vision benchmarks. Instead of a single ground truth, or a small set of easily differentiated ground truths, an image may contain 60 or 70 overlapping objects. How do you best assign each prediction to each ground truth?

DeepForest uses the [Hungarian matching algorithm](https://thinkautonomous.medium.com/computer-vision-for-tracking-8220759eee85) to assign predictions to ground truth based on maximum IoU overlap. This is slow compared to the methods above, so it isn't a good choice for running hundreds of times during model training; see config.validation.val_accuracy_interval for setting the frequency of the evaluation callback for this metric.

When there are no true positives, this metric is undefined.
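
To illustrate the idea (this is a toy sketch, not DeepForest's internal code), the assignment step can be written with SciPy's `linear_sum_assignment` on an IoU matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy IoU matrix: rows are ground-truth boxes, columns are predictions.
iou_matrix = np.array([
    [0.7, 0.1, 0.0],
    [0.2, 0.5, 0.3],
])

# Find the one-to-one assignment that maximizes total IoU.
gt_idx, pred_idx = linear_sum_assignment(iou_matrix, maximize=True)

# Keep only pairs above the IoU threshold (0.4 is the DeepForest default).
matches = [(int(g), int(p)) for g, p in zip(gt_idx, pred_idx) if iou_matrix[g, p] > 0.4]
print(matches)  # [(0, 0), (1, 1)]
```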

### Empty Frame Accuracy

DeepForest allows the user to pass empty frames to evaluation by setting xmin, ymin, xmax, ymax to 0. This is useful for evaluating models on data that has empty frames. The empty frame accuracy is the proportion of empty frames that contain no predictions. The 'label' column in this case is ignored, but must be one of the labels in the model to be included in the evaluation.
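
For illustration, an empty frame can be encoded as a single all-zero row in the annotations file (the file name is hypothetical; the columns follow the standard DeepForest annotation format):

```python
import pandas as pd

# One row per empty image, with all box coordinates set to 0.
empty_frame = pd.DataFrame([{
    "image_path": "empty_image.png",  # hypothetical image name
    "xmin": 0, "ymin": 0, "xmax": 0, "ymax": 0,
    "label": "Tree",  # ignored for empty frames, but must be a label the model knows
}])
empty_frame.to_csv("empty_annotations.csv", index=False)
```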

# Calculating Evaluation Metrics
### Further Reading

[MeanAveragePrecision in torchmetrics](https://medium.com/data-science-at-microsoft/how-to-smoothly-integrate-meanaverageprecision-into-your-training-loop-using-torchmetrics-7d6f2ce0a2b3)

[A general explanation of the mAP metric](https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173)

## Torchmetrics and loss scores
[Comparing Object Detection Models](https://www.comet.com/site/blog/compare-object-detection-models-from-torchvision/)

### Evaluation loss and mAP scores

These metrics are largely used to keep track of model performance during training. They are relatively fast and run automatically as part of the training loop.

@@ -75,7 +104,7 @@ This creates a dictionary of the average IoU ('iou') as well as 'iou' for each c

> **_Advanced tip:_** Users can set the frequency of PyTorch Lightning evaluation using kwargs passed to main.deepforest.create_trainer(). For example, [check_val_every_n_epoch](https://lightning.ai/docs/pytorch/stable/common/trainer.html#check-val-every-n-epoch).
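
For example, a minimal sketch, assuming the kwargs are forwarded to the underlying Lightning Trainer:

```python
# Run the validation loop, and therefore these metrics, every 5 epochs.
m.create_trainer(check_val_every_n_epoch=5)
```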

## Recall and Precision at a fixed IoU Score
### Recall and Precision at a fixed IoU Score
To get recall and precision at a set IoU evaluation score, specify an annotations file using the m.evaluate method.

```python
@@ -113,7 +142,7 @@ results["box_precision"]
0.781
```

### Worked example of calculating IoU and recall/precision values
## Worked example of calculating IoU and recall/precision values
To convert overlap among predicted and ground truth bounding boxes into measures of accuracy and precision, the most common approach is to compare the overlap using the intersection-over-union metric (IoU).
IoU is the area of overlap between the predicted box and the ground-truth box divided by the area of their combined (union) region.
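
A minimal sketch of that arithmetic for two axis-aligned boxes in (xmin, ymin, xmax, ymax) form:

```python
def box_iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```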

@@ -186,9 +215,9 @@ true_positive = sum(result["match"])
recall = true_positive / result.shape[0]
precision = true_positive / predictions.shape[0]
recall
0.819672131147541
0.81967
precision
0.5494505494505495
0.54945
```

This can be stated as follows: 81.97% of the ground truth boxes are correctly matched to a predicted box at an IoU threshold of 0.4, and 54.94% of predicted boxes match a ground truth box.
@@ -203,7 +232,7 @@ This is a dictionary with keys

```
result.keys()
dict_keys(['results', 'box_precision', 'box_recall', 'class_recall'])
dict_keys(['results', 'box_precision', 'box_recall', 'class_recall', 'predictions', 'ground_df'])
```

The added class_recall dataframe is mostly relevant for multi-class problems, in which the recall and precision per class are given.
@@ -214,7 +243,8 @@ result["class_recall"]
0 Tree 1.0 0.67033 61
```

### How to average evaluation metrics across images?
## How to average evaluation metrics across images?

One important decision was how to average precision and recall across multiple images. Two reasonable options are to pool all predictions and all ground truth and compute the statistic on the entire dataset, or to average the per-image statistics. The first strategy makes more sense for evaluation data that is relatively homogeneous across images. We prefer to take the average of per-image precision and recall. This helps balance the dataset if some images have many objects and others have few, such as when you are comparing multiple habitat types.
Users are welcome to calculate their own statistics directly from the results dataframe.
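
For example, a per-image recall could be sketched from the matched results, assuming the dataframe carries the `image_path` and boolean `match` columns shown in the snippets around this section:

```python
# Fraction of ground-truth boxes matched in each image, averaged across images.
per_image_recall = result["results"].groupby("image_path")["match"].mean()
print(per_image_recall.mean())
```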

@@ -228,7 +258,7 @@ result["results"].head()
34 34 4 0.595862 ... Tree OSBS_029.tif True
```

### Evaluating tiles too large for memory
## Evaluating tiles too large for memory

The evaluation method uses deepforest.predict_image for each of the paths supplied in the image_path column. This means that the entire image is passed for prediction. This will not work for large images. The deepforest.predict_tile method does a couple of things under the hood that need to be repeated for evaluation.
