
fix import timeouts on increasing datasets by precomputing batch size #5231

Open · wants to merge 1 commit into develop
Conversation

@CamronStaley (Contributor) commented Dec 6, 2024

What changes are proposed in this pull request?

When importing a dataset whose samples vary greatly in size, the content-size dynamic batcher was causing MongoDB timeouts: it predicts each batch size from the previous batch of samples, and that prediction could not adapt quickly enough when sample sizes changed abruptly. Since in most situations we have the content sizes before doing any writes to the database, we can precompute these values and create the batch sizes upfront.
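For context, here is a minimal sketch of the precompute idea, assuming each sample can be serialized upfront to measure its content size (the names and thresholds are illustrative, not the PR's exact implementation):

from bson import json_util

def compute_batch_sizes(dicts, target_size=2**20, min_batch=1, max_batch=100000):
    # Group already-serialized dicts into batches whose total BSON-encoded
    # size stays near target_size, recording only the resulting batch lengths
    batch_sizes = []
    curr_size, curr_len = 0, 0
    for d in dicts:
        curr_size += len(json_util.dumps(d))
        curr_len += 1
        if (curr_size >= target_size and curr_len >= min_batch) or curr_len >= max_batch:
            batch_sizes.append(curr_len)
            curr_size, curr_len = 0, 0
    if curr_len > 0:
        batch_sizes.append(curr_len)
    return batch_sizes

Because the sizes are measured rather than predicted, no batch can blow past the target the way the dynamic batcher's estimates could.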

How is this patch tested? If it is not, please explain why.

  • Set default batcher to "size"
  • Downloaded the zip in the Jira ticket above
  • Ran the following command:
import fiftyone as fo

dns = '/Users/camronstaley/Downloads/Dr_Wright_Drives_Beta.zip'
ds = fo.Dataset.from_archive(dns, dataset_type=fo.types.FiftyOneDataset)

Output:

(TEAMS) (3.10.14) ➜  scripts python import_archive.py
/Users/camronstaley/Workplace/envs/TEAMS/lib/python3.10/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/Users/camronstaley/Workplace/envs/TEAMS/lib/python3.10/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
Assuming '/Users/camronstaley/Downloads/Dr_Wright_Drives_Beta' contains the extracted contents of '/Users/camronstaley/Downloads/Dr_Wright_Drives_Beta.zip'
Importing samples...
 100% |███████████████████████████████████████████████████████████████████████████████████████████| 3/3 [154.1ms elapsed, 0s remaining, 19.5 samples/s] 
Importing frames...
 100% |█████████████████████████████████████████████

batch sizes created:

[3670, 3666, 3663, 3653, 3653, 3670, 3666, 3663, 3653, 3653, 1968, 50, 50, 50, 50, 50, 50, 52, 50, 50, 50, 50, 50, 50, 50, 69, 50, 50, 86, 56, 51, 51, 51, 51, 53, 51, 64, 94, 56, 51, 50, 50, 50, 51, 50, 51, 50, 50, 50, 50, 50, 50, 50, 113, 63, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 65, 51, 51, 51, 51, 96, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 53, 57, 51, 51, 2203, 3031, 3023, 3022, 3022, 1015]

batch sizes created before these changes:

batch content size: 1048992
length of batch: 3666                                                                                                                                  
batch content size: 1050321
length of batch: 3660                                                                                                                                  
batch content size: 1050664
length of batch: 3653                                                                                                                                  
batch content size: 1048607
length of batch: 3653                                                                                                                                  
batch content size: 45736747
length of batch: 457

Above is just a small chunk of the batch sizes, but the idea is that the batch size ramped from 0 to ~3,660, which maintained the 1 MB target size until it reached the third video, which was far more densely populated with labels. This produced a single batch with a content size of 45,736,747 bytes (~45 MB), which often timed out.
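To see why the old behavior lagged, here is a simplified model of a backpressure batcher that rescales the next batch by target/observed (illustrative only; the real ContentSizeDynamicBatcher also bounds how fast the batch size can change):

target = 2**20  # 1 MB target per batch

curr_batch = 457
observed_bytes = 45736747  # the oversized batch reported in the output above
next_batch = max(1, int(curr_batch * target / observed_bytes))
print(next_batch)  # ~10 -- the correction arrives only after the 45 MB write

The oversized write has already been sent to MongoDB by the time the batcher reacts, which is exactly when the timeouts occurred.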

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • [x] No. You can skip the rest of this section.
  • [ ] Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new methods for managing dataset samples, including adding images and videos, and saving/loading dataset views.
    • Added a new ContentSizeBatcher for improved batch processing based on content size.
  • Enhancements

    • Enhanced sample management with improved validation and dynamic schema expansion.
    • Streamlined error handling for clearer messages and better edge case management.
  • Bug Fixes

    • Updated tests to ensure coverage for the new batching functionality while maintaining existing test integrity.
  • Documentation

    • Updated method documentation to reflect new functionalities and parameters.

@CamronStaley CamronStaley requested a review from swheaton December 6, 2024 14:01
@CamronStaley CamronStaley self-assigned this Dec 6, 2024
coderabbitai bot commented Dec 6, 2024

Walkthrough

The changes in this pull request introduce significant enhancements to the fiftyone/core/dataset.py, fiftyone/core/odm/database.py, and fiftyone/core/utils.py files. Key modifications include the addition of new methods for dataset and sample management, improvements to error handling, and optimizations in performance. A new batching class, ContentSizeBatcher, is introduced to streamline batch processing based on content size. Additionally, backpressure handling has been removed from document insertion and bulk write operations in the database module, simplifying those processes.

Changes

File Change Summary
fiftyone/core/dataset.py Added methods for managing datasets, including add_importer, merge_importer, add_images, add_labeled_images, add_videos, add_labeled_videos, save_view, load_saved_view, delete_saved_view, list_workspaces, save_workspace, load_workspace, delete_workspace, and delete_workspaces. Enhanced existing methods for sample management and error handling.
fiftyone/core/odm/database.py Removed backpressure handling from insert_documents and bulk_write functions, streamlining document insertion and bulk writing processes.
fiftyone/core/utils.py Introduced ContentSizeBatcher class with methods for dynamic batch size adjustment based on content size. Updated get_default_batcher to return ContentSizeBatcher.
tests/unittests/utils_tests.py Replaced ContentSizeDynamicBatcher with ContentSizeBatcher in tests. Added test_content_size_batcher to evaluate the new batching class under various conditions.

Possibly related PRs

Suggested labels

bug, enhancement

Suggested reviewers

  • brimoor
  • minhtuev
  • benjaminpkane

Poem

In the fields where datasets grow,
New methods sprout, as rivers flow.
With batches sized just right, oh so neat,
Managing samples is now a treat!
From views to workspaces, all in line,
A hop, a skip, our code will shine! 🐇✨



@coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1649f2f and 680656b.

📒 Files selected for processing (4)
  • fiftyone/core/dataset.py (0 hunks)
  • fiftyone/core/odm/database.py (0 hunks)
  • fiftyone/core/utils.py (2 hunks)
  • tests/unittests/utils_tests.py (3 hunks)
💤 Files with no reviewable changes (2)
  • fiftyone/core/odm/database.py
  • fiftyone/core/dataset.py
🔇 Additional comments (1)
tests/unittests/utils_tests.py (1)

85-127: Unit tests for ContentSizeBatcher provide comprehensive coverage

The added tests effectively validate the ContentSizeBatcher under various conditions, ensuring correct batching behavior with different target sizes and batch size constraints. This thorough testing enhances the reliability of the new batching strategy.

fiftyone/core/utils.py
        for obj in iterable:
            try:
                curr_batch_content_size += len(
                    json_util.dumps(self._make_dict(obj))

Contributor:

Couldn't this extra serialization potentially be very expensive?

Contributor Author (@CamronStaley):

We currently do this already in _upsert_samples_batch, so we are basically doubling how much we do it in that one method. In most other cases (such as bulk_write and insert_documents) it's the same total computation, since we are just moving it from post-computing for each batch to precomputing for the entire collection.

Contributor Author (@CamronStaley):

Though I think it might be possible to refactor _upsert_samples_batch so that we just pass the dicts as an iterable to begin with, instead of computing them twice.
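A rough sketch of that refactor, reusing compute_batch_sizes from the earlier sketch and treating make_dict and write_batch as hypothetical stand-ins (not actual FiftyOne APIs):

def upsert_samples(samples, make_dict, write_batch):
    # Serialize each sample to a dict exactly once, then size and write
    # the batches from those same dicts (no second json_util.dumps pass)
    dicts = [make_dict(s) for s in samples]
    start = 0
    for n in compute_batch_sizes(dicts):  # from the sketch above
        write_batch(dicts[start:start + n])
        start += n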

        )
        self.curr_batch = 0

    def _compute_batch_sizes(
Contributor:

This is not ideal because it requires iterating over the entire iterator before any writes start happening.

A large-scale add_samples() call could take an hour or more to complete. With this implementation, it will take an hour just to compute safe batch sizes, during which time there will be no progress bar or other indication to the user of what's happening, and then it will start adding samples to the dataset for the next hour.

We used dynamic batching with backpressure to try to find a compromise where writes will start immediately while also tuning the batch size to maximize throughput.

Perhaps there's a way to keep the greedy batching but add a safeguard to split unexpectedly large batches into multiple writes?

Contributor Author (@CamronStaley):

I think an alternative is that we can override the __next__ implementation for just this batcher and build batches as we go: instead of computing everything upfront, we just iterate and build a list up to the max size and then return it.

We probably want to refactor a bit as well, because currently we have two content-size batchers: one for backpressure, where we don't know the size of the items before batching (I think it's just a bunch of ops, not actually looking at the items), and another for when we have the items and can calculate their sizes.
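A minimal sketch of that lazy alternative, assuming the items are BSON-serializable dicts (illustrative names, not the PR's code):

from bson import json_util

class LazyContentSizeBatcher:
    # Builds each batch on demand, up to ~target_size serialized bytes, so
    # writes can start immediately instead of after a full precompute pass
    def __init__(self, iterable, target_size=2**20, max_batch=100000):
        self._iter = iter(iterable)
        self.target_size = target_size
        self.max_batch = max_batch

    def __iter__(self):
        return self

    def __next__(self):
        batch, size = [], 0
        while len(batch) < self.max_batch and size < self.target_size:
            try:
                obj = next(self._iter)
            except StopIteration:
                break
            batch.append(obj)
            size += len(json_util.dumps(obj))
        if not batch:
            raise StopIteration
        return batch

This keeps the upfront-measurement guarantee (a batch can exceed the target by at most one item) while restoring immediate progress feedback.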

Contributor:

> currently we have two content size batchers

What are you referring to? We currently have three batchers:
https://github.com/voxel51/fiftyone-teams/blob/318d3abb38b478f557180bb6aed4f25681cc7c64/fiftyone/core/utils.py#L1579-L1606

  • LatencyDynamicBatcher: targets a specific latency between calls to generate the next batch
  • ContentSizeDynamicBatcher: targets a specific content size in each batch
  • StaticBatcher: simple fixed batch size

Latency and content size are correlated heuristics, and both would encounter problems if there are wildly different Sample sizes in a collection.

@brimoor (Contributor) commented Dec 14, 2024:

I guess you're referring to latency and content.

The latency batcher is the default for OSS because it is effectively parameterized by the frame rate of the progress bars you want to see (and the assumption that the database is a big boi that will take as much content in 0.2 seconds as you can throw at it, for example).

The content batcher is the default for Teams when there's an API connection involved, because the assumption is that the API gateway has a hard limit on request size, so we want to be certain not to exceed, say, 1 MB.

https://github.com/voxel51/fiftyone-teams/blob/318d3abb38b478f557180bb6aed4f25681cc7c64/fiftyone/core/config.py#L171
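For reference, the batching strategy is chosen via FiftyOne's config; a minimal example (the default_batcher and batcher_target_size_bytes settings come from the config module linked above):

import fiftyone as fo

# "latency" is the OSS default; "size" targets bytes per batch; "static" is fixed
fo.config.default_batcher = "size"
fo.config.batcher_target_size_bytes = 2**20  # ~1 MB per batch

The same settings can also be supplied via environment variables (e.g., FIFTYONE_DEFAULT_BATCHER).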

@CamronStaley (Contributor Author) commented Dec 14, 2024:

The batcher in this PR, ContentSizeBatcher, is separate from ContentSizeDynamicBatcher because it doesn't use backpressure and has a different use case: one where we can view content sizes upfront, which (keep me honest, @swheaton) isn't always the case. That's what I meant by specifically two "content size batchers"; I know there are other batchers as well.

I am proposing a couple of things:

  1. Rename both of the content-size batchers so that it's clear one computes sizes before writing each batch and the other estimates based on the previous batch's results.
  2. Make the new batcher compute each batch size every time __next__ is called, so that we don't have to do it all upfront (this would require overriding the inherited __next__ method in its current state, so a refactor might be good).

Or, if I am wrong about there being two use cases (computing content size before running a batch vs. after), consolidate these into one batcher that just does number 2 above instead of precomputing the entire iterable.
