Conversation

@bw4sz (Collaborator) commented May 14, 2025

Summary

This is a large PR aimed at creating flexible dataset classes and predict_tile dataloader strategies. The deepforest.dataset.TreeDataset is one of the oldest parts of the codebase. Over time we have added other dataset classes, like TileDataset and RasterDataset, but there is no unifying structure or organization. There is a reason the single-GPU predict_tile and TreeDataset logic lasted 4 years: changing the structure required touching nearly every file in the codebase. It was far easier to redesign the datasets now, knowing that we have an immediate need for this refactoring for 2.0. Leaving them unpicked in a half-finished state would have made it difficult for anyone else to contribute to the next steps.

Motivation

  • Make predict_tile faster and easier to use.
  • Provide greater clarity and organization among dataset classes.
  • Prepare datasets for differing geometry types for milestone 2.0.
  • Clean up unused imports and naming based on config argument changes in Hydra integration #1035.

Desired Dataset Functionality

Related

I tried #1047 and found that it didn't play with pytorch lightning, and was batch_size = 1.

Major improvements

  • Introduced a single PredictionDataset class that establishes a general structure for prediction and post-processing (see the sketch after this list)
  • Split out training and cropmodel datasets into separate classes
  • Renamed TileRaster to SingleImage and MemoryRaster to TiledRaster for greater clarity
  • Added a dataloader_strategy argument (should this more accurately be 'dataset_strategy'?) to main.predict_tile
  • Added a MultiImage approach that combines multiple images per batch to take advantage of multi-image GPU batches
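
A minimal sketch of the kind of base class this introduces. The method names, default values, and module layout below are assumptions for illustration, not the exact API merged in this PR:

```python
# Hedged sketch of a prediction dataset base class; method names and defaults
# are illustrative assumptions, not the merged DeepForest implementation.
from torch.utils.data import Dataset


class PredictionDataset(Dataset):
    """Shared structure for prediction datasets: yield preprocessed crops,
    then undo the cropping when assembling image-level predictions."""

    def __init__(self, patch_size=400, patch_overlap=0.05):
        self.patch_size = patch_size
        self.patch_overlap = patch_overlap

    def __len__(self):
        raise NotImplementedError

    def __getitem__(self, idx):
        # Subclasses such as SingleImage, TiledRaster, or MultiImage return a
        # crop plus enough metadata (window offsets) to map boxes back later.
        raise NotImplementedError

    def postprocess(self, predictions):
        # Translate per-crop boxes back into original image coordinates.
        raise NotImplementedError
```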

Minor improvements

  • Significant code cleanup, renaming, and deletion of unused imports

Co-pilot summary of code changes

This pull request introduces significant updates to the DeepForest project, including enhancements to prediction scaling, dataset handling, and configuration management. The changes focus on improving usability, performance, and modularity by refining documentation, restructuring datasets, and updating configuration files.

Enhancements to Prediction Scaling:

  • Updated the prediction documentation to introduce three new dataset strategies (single, batch, window) for balancing CPU/GPU memory and utilization during inference. These strategies are configurable via the dataloader_strategy parameter (see the example below).
  • Removed the Dask-based multi-GPU scaling example, simplifying the documentation and focusing on the new dataset strategies.
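
For reference, a hedged usage sketch: the dataloader_strategy values ("single", "batch", "window") come from this PR, but the weight-loading call and the exact name of the raster path argument vary between releases, so check the current documentation before copying.

```python
# Hedged usage sketch, not the authoritative API: verify argument names
# against the merged prediction docs.
from deepforest import main

model = main.deepforest()
model.load_model("weecology/deepforest-tree")  # recent releases; older versions used use_release()

predictions = model.predict_tile(
    "tile.tif",                   # path to a large raster
    patch_size=400,               # crop size used to window the tile
    patch_overlap=0.05,           # overlap between neighboring crops
    dataloader_strategy="batch",  # "single", "batch", or "window"
)
```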

Dataset Refactoring:

  • Removed the TreeDataset, TileDataset, and RasterDataset classes from src/deepforest/dataset.py and modularized the BoundingBoxDataset into a new file, src/deepforest/datasets/cropmodel.py. This improves code organization and clarity (see the import example below).
  • Updated the import path for TileDataset in the prediction example in docs/user_guide/16_prediction.md to reflect the new file structure.
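
For example, code importing BoundingBoxDataset from the old monolithic module would switch to the new location; the pre-refactor import path shown here is inferred from the old src/deepforest/dataset.py layout.

```python
# Before the refactor (single dataset.py module, path inferred from the old layout):
# from deepforest.dataset import BoundingBoxDataset

# After the refactor, datasets are organized into a package:
from deepforest.datasets.cropmodel import BoundingBoxDataset
```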

Configuration Updates:

  • Renamed the pin_images parameter to preload_images in src/deepforest/conf/config.yaml for better clarity and added a new predict.pin_memory configuration option to control memory pinning during prediction (see the sketch below).
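
A hedged sketch of overriding the renamed keys. It assumes the Hydra/OmegaConf config is exposed as model.config with attribute access and that preload_images sits in the train section; adjust to however the released config.yaml is structured.

```python
# Assumes model.config exposes the Hydra/OmegaConf config; section placement
# of preload_images (train) is an assumption for illustration.
from deepforest import main

model = main.deepforest()

# Renamed from pin_images in earlier configs.
model.config.train.preload_images = True

# New option controlling DataLoader memory pinning during prediction.
model.config.predict.pin_memory = False
```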

Code Simplification:

  • Removed unused imports from src/deepforest/callbacks.py, streamlining the file and reducing unnecessary dependencies.

Faster tests

  • The retinanet.check_model() function was performing a forward pass on model creation, which slowed tests down.

Refactored on_validation_epoch_end

Next steps

  • Local tests passing
  • Update documentation
  • Demonstrate multi-gpu scaling and functionality on 2 different datasets
  • Rebase main
  • Code review by 2 DeepForest contributors
  • Squash commits

Additional issues that need to be considered:

  • main.predict_batch is an anti-pattern that exists outside of the PyTorch Lightning trainer.predict workflow. I left it in since it does not interfere.
  • I did not make a separate training base class; this will need to be done to achieve 2.0 model integration, with all of its connectors among torchvision/transformers input types.

@bw4sz force-pushed the multi_gpu_predict_tiles branch from 95319eb to 86105b3 on May 19, 2025 20:29
@bw4sz (Collaborator, Author) commented May 20, 2025

Here are the profiling figures for the datasets. Multi-GPU, multi-processing is confirmed to be faster.

BOEM data

1 GPU

[profiling screenshot]

3 GPUs

[profiling screenshot]

NEON Data

  • Note: must be run on A100s to fit into memory!
Profiling Results Comparison:
============================================================================================================================================
+----------+-----------+------------+--------------+-----------------+----------------+
| Device   |   Workers | Strategy   |   Num Images |   Mean Time (s) |   Std Time (s) |
+==========+===========+============+==============+=================+================+
| cuda     |         0 | single     |            4 |           54.09 |              0 |
+----------+-----------+------------+--------------+-----------------+----------------+
| cuda     |         0 | batch      |            4 |           13.58 |              0 |
+----------+-----------+------------+--------------+-----------------+----------------+
| cuda     |         5 | batch      |            4 |           16.39 |              0 |
+----------+-----------+------------+--------------+-----------------+----------------+
| cuda     |         0 | single     |           10 |           54.92 |              0 |
+----------+-----------+------------+--------------+-----------------+----------------+
| cuda     |         0 | batch      |           10 |           27.33 |              0 |
+----------+-----------+------------+--------------+-----------------+----------------+
| cuda     |         5 | batch      |           10 |           19.98 |              0 |
+----------+-----------+------------+--------------+-----------------+----------------+

2 GPUs

[profiling screenshot]

@bw4sz (Collaborator, Author) commented May 21, 2025

Okay, @ethanwhite, @jveitchmichaelis, and @henrykironde, I've met the criteria I set out above. Clearly this is quite large and complex, but it is ready to be discussed.

@bw4sz changed the title from "[WIP] Dataset redesign for multi-gpu, multi-processing and multi-geometry" to "Dataset redesign for multi-gpu, multi-processing and multi-geometry" on May 22, 2025
@bw4sz linked an issue on May 22, 2025 that may be closed by this pull request
@bw4sz dismissed ethanwhite's stale review June 3, 2025 18:30

He is out on vacation; all pieces are resolved above, and if anything lingers we can address it when he returns.

@bw4sz (Collaborator, Author) commented Jun 4, 2025

This is not ready to be merged.

1. If paths is a list but dataloader_strategy is "single", it silently predicts only the first path. Complete (see the sketch below).

2. I'm seeing some window issues with the batch strategy compared to single. Complete locally; will test on BOEM data.

[screenshot] Getting way more predictions with batch and a couple of strange window issues.
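
An illustrative guard for the kind of check that resolves item 1; the function and argument names here are assumptions, not the merged code.

```python
# Illustrative guard, not the merged implementation: instead of silently
# predicting only the first element, reject a list of paths when the chosen
# strategy can handle a single image.
def validate_single_strategy_paths(paths, dataloader_strategy):
    if dataloader_strategy == "single" and isinstance(paths, (list, tuple)):
        if len(paths) > 1:
            raise ValueError(
                f"dataloader_strategy='single' accepts one path, got {len(paths)}; "
                "use the 'batch' strategy for multiple images."
            )
```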

@bw4sz (Collaborator, Author) commented Jun 9, 2025

@jveitchmichaelis and @henrykironde, this is ready to be merged. I have confirmed and corrected the edge cases.

@bw4sz (Collaborator, Author) commented Jun 10, 2025

This should be passing now. @jveitchmichaelis, let's get this merged; I'm worried we will start getting behind on other PRs and end up discouraging contributions if we let this massive thing hang. We can do follow-up PRs if needed. Once all tests pass, let's have one more look and be done here.

@jveitchmichaelis self-requested a review June 11, 2025 03:16

@jveitchmichaelis (Collaborator) left a comment:

LGTM unless @henrykironde has any further comment. Maybe just fix that mosiac typo.

class_recall: a pandas dataframe of class level recall and precision with class sizes
"""

# If all empty ground truth, return 0 recall and precision

Review comment:
"0 precision and undefined recall"?



print(f"{mosaic_df.shape[0]} predictions kept after non-max suppression")
def mosiac(predictions, iou_threshold=0.1):

Review comment:
Rename mosiac -> mosaic

@henrykironde merged commit af9458b into main on Jun 11, 2025 (11 checks passed)
@henrykironde deleted the multi_gpu_predict_tiles branch June 11, 2025 07:04