
fix import timeouts on increasing datasets by precomputing batch size #5231

Open · wants to merge 1 commit into develop
Conversation

@CamronStaley (Contributor) commented Dec 6, 2024

What changes are proposed in this pull request?

When importing a dataset whose samples vary greatly in size, the content-size dynamic batcher was causing MongoDB timeouts: it predicts each batch size from the previous batch of samples, and that prediction could not adapt quickly enough when sample sizes changed abruptly. Since in most situations we have the content sizes before doing any writes to the database, we can precompute these values and create the batch sizes upfront.
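For context, here is a minimal sketch of the precompute idea, assuming each sample can be serialized upfront to measure its content size (the names and thresholds are illustrative, not the PR's exact implementation):

from bson import json_util

def compute_batch_sizes(dicts, target_size=2**20, min_batch=1, max_batch=100000):
    # Group already-serialized dicts into batches whose total BSON-encoded
    # size stays near target_size, recording only the resulting batch lengths
    batch_sizes = []
    curr_size, curr_len = 0, 0
    for d in dicts:
        curr_size += len(json_util.dumps(d))
        curr_len += 1
        if (curr_size >= target_size and curr_len >= min_batch) or curr_len >= max_batch:
            batch_sizes.append(curr_len)
            curr_size, curr_len = 0, 0
    if curr_len > 0:
        batch_sizes.append(curr_len)
    return batch_sizes

Because the sizes are measured rather than predicted, no batch can blow past the target the way the dynamic batcher's estimates could.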

How is this patch tested? If it is not, please explain why.

  • Set default batcher to "size"
  • Downloaded the zip in the Jira ticket above
  • Ran the following command:
import fiftyone as fo

dns = '/Users/camronstaley/Downloads/Dr_Wright_Drives_Beta.zip'
ds = fo.Dataset.from_archive(dns, dataset_type=fo.types.FiftyOneDataset)

Output:

(TEAMS) (3.10.14) ➜  scripts python import_archive.py
/Users/camronstaley/Workplace/envs/TEAMS/lib/python3.10/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/Users/camronstaley/Workplace/envs/TEAMS/lib/python3.10/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
Assuming '/Users/camronstaley/Downloads/Dr_Wright_Drives_Beta' contains the extracted contents of '/Users/camronstaley/Downloads/Dr_Wright_Drives_Beta.zip'
Importing samples...
 100% |███████████████████████████████████████████████████████████████████████████████████████████| 3/3 [154.1ms elapsed, 0s remaining, 19.5 samples/s] 
Importing frames...
 100% |█████████████████████████████████████████████

batch sizes created:

[3670, 3666, 3663, 3653, 3653, 3670, 3666, 3663, 3653, 3653, 1968, 50, 50, 50, 50, 50, 50, 52, 50, 50, 50, 50, 50, 50, 50, 69, 50, 50, 86, 56, 51, 51, 51, 51, 53, 51, 64, 94, 56, 51, 50, 50, 50, 51, 50, 51, 50, 50, 50, 50, 50, 50, 50, 113, 63, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 65, 51, 51, 51, 51, 96, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 53, 57, 51, 51, 2203, 3031, 3023, 3022, 3022, 1015]

batch sizes created before these changes:

batch content size: 1048992
length of batch: 3666                                                                                                                                  
batch content size: 1050321
length of batch: 3660                                                                                                                                  
batch content size: 1050664
length of batch: 3653                                                                                                                                  
batch content size: 1048607
length of batch: 3653                                                                                                                                  
batch content size: 45736747
length of batch: 457

Above is just a small chunk of the batch sizes, but the idea is that the batch size ramped from 0 to ~3,660, which maintained the 1 MB target size until it reached the third video, which was far more densely populated with labels. This produced a single batch with a content size of 45,736,747 bytes (~45 MB), which often timed out.
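To see why the old behavior lagged, here is a simplified model of a backpressure batcher that rescales the next batch by target/observed (illustrative only; the real ContentSizeDynamicBatcher also bounds how fast the batch size can change):

target = 2**20  # 1 MB target per batch

curr_batch = 457
observed_bytes = 45736747  # the oversized batch reported in the output above
next_batch = max(1, int(curr_batch * target / observed_bytes))
print(next_batch)  # ~10 -- the correction arrives only after the 45 MB write

The oversized write has already been sent to MongoDB by the time the batcher reacts, which is exactly when the timeouts occurred.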

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • [x] No. You can skip the rest of this section.
  • [ ] Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new methods for managing dataset samples, including adding images and videos, and saving/loading dataset views.
    • Added a new ContentSizeBatcher for improved batch processing based on content size.
  • Enhancements

    • Enhanced sample management with improved validation and dynamic schema expansion.
    • Streamlined error handling for clearer messages and better edge case management.
  • Bug Fixes

    • Updated tests to ensure coverage for the new batching functionality while maintaining existing test integrity.
  • Documentation

    • Updated method documentation to reflect new functionalities and parameters.

@CamronStaley CamronStaley requested a review from swheaton December 6, 2024 14:01
@CamronStaley CamronStaley self-assigned this Dec 6, 2024
coderabbitai bot commented Dec 6, 2024

Walkthrough

The changes in this pull request introduce significant enhancements to the fiftyone/core/dataset.py, fiftyone/core/odm/database.py, and fiftyone/core/utils.py files. Key modifications include the addition of new methods for dataset and sample management, improvements to error handling, and optimizations in performance. A new batching class, ContentSizeBatcher, is introduced to streamline batch processing based on content size. Additionally, backpressure handling has been removed from document insertion and bulk write operations in the database module, simplifying those processes.

Changes

File Change Summary
fiftyone/core/dataset.py Added methods for managing datasets, including add_importer, merge_importer, add_images, add_labeled_images, add_videos, add_labeled_videos, save_view, load_saved_view, delete_saved_view, list_workspaces, save_workspace, load_workspace, delete_workspace, and delete_workspaces. Enhanced existing methods for sample management and error handling.
fiftyone/core/odm/database.py Removed backpressure handling from insert_documents and bulk_write functions, streamlining document insertion and bulk writing processes.
fiftyone/core/utils.py Introduced ContentSizeBatcher class with methods for dynamic batch size adjustment based on content size. Updated get_default_batcher to return ContentSizeBatcher.
tests/unittests/utils_tests.py Replaced ContentSizeDynamicBatcher with ContentSizeBatcher in tests. Added test_content_size_batcher to evaluate the new batching class under various conditions.

Possibly related PRs

Suggested labels

bug, enhancement

Suggested reviewers

  • brimoor
  • minhtuev
  • benjaminpkane

Poem

In the fields where datasets grow,
New methods sprout, as rivers flow.
With batches sized just right, oh so neat,
Managing samples is now a treat!
From views to workspaces, all in line,
A hop, a skip, our code will shine! 🐇✨



@coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1649f2f and 680656b.

📒 Files selected for processing (4)
  • fiftyone/core/dataset.py (0 hunks)
  • fiftyone/core/odm/database.py (0 hunks)
  • fiftyone/core/utils.py (2 hunks)
  • tests/unittests/utils_tests.py (3 hunks)
💤 Files with no reviewable changes (2)
  • fiftyone/core/odm/database.py
  • fiftyone/core/dataset.py
🔇 Additional comments (1)
tests/unittests/utils_tests.py (1)

85-127: Unit tests for ContentSizeBatcher provide comprehensive coverage

The added tests effectively validate the ContentSizeBatcher under various conditions, ensuring correct batching behavior with different target sizes and batch size constraints. This thorough testing enhances the reliability of the new batching strategy.

fiftyone/core/utils.py
        for obj in iterable:
            try:
                curr_batch_content_size += len(
                    json_util.dumps(self._make_dict(obj))

Contributor:

Couldn't this extra serialization potentially be very expensive?

Contributor Author (@CamronStaley):

We currently do this already in _upsert_samples_batch, so we are basically doubling how much we do it in that one method. In most other cases (such as bulk_write and insert_documents) it's the same total computation, since we are just moving it from post-computing for each batch to precomputing for the entire collection.

Contributor Author (@CamronStaley):

Though I think it might be possible to refactor _upsert_samples_batch so that we just pass the dicts as an iterable to begin with, instead of computing them twice.
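A rough sketch of that refactor, reusing compute_batch_sizes from the earlier sketch and treating make_dict and write_batch as hypothetical stand-ins (not actual FiftyOne APIs):

def upsert_samples(samples, make_dict, write_batch):
    # Serialize each sample to a dict exactly once, then size and write
    # the batches from those same dicts (no second json_util.dumps pass)
    dicts = [make_dict(s) for s in samples]
    start = 0
    for n in compute_batch_sizes(dicts):  # from the sketch above
        write_batch(dicts[start:start + n])
        start += n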

        )
        self.curr_batch = 0

    def _compute_batch_sizes(
Contributor:

This is not ideal because it requires iterating over the entire iterator before any writes start happening.

A large-scale add_samples() call could take an hour or more to complete. With this implementation, it will take an hour just to compute safe batch sizes, during which time there will be no progress bar or other indication to the user of what's happening, and then it will start adding samples to the dataset for the next hour.

We used dynamic batching with backpressure to try to find a compromise where writes will start immediately while also tuning the batch size to maximize throughput.

Perhaps there's a way to keep the greedy batching but add a safeguard to split unexpectedly large batches into multiple writes?

Contributor Author (@CamronStaley):

I think an alternative is that we can override the __next__ implementation for just this batcher and build batches as we go: instead of computing everything upfront, we just iterate and build a list up to the max size and then return it.

We probably want to refactor a bit as well, because currently we have two content-size batchers: one for backpressure, where we don't know the size of the items before batching (I think it's just a bunch of ops, not actually looking at the items), and another for when we have the items and can calculate their sizes.
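A minimal sketch of that lazy alternative, assuming the items are BSON-serializable dicts (illustrative names, not the PR's code):

from bson import json_util

class LazyContentSizeBatcher:
    # Builds each batch on demand, up to ~target_size serialized bytes, so
    # writes can start immediately instead of after a full precompute pass
    def __init__(self, iterable, target_size=2**20, max_batch=100000):
        self._iter = iter(iterable)
        self.target_size = target_size
        self.max_batch = max_batch

    def __iter__(self):
        return self

    def __next__(self):
        batch, size = [], 0
        while len(batch) < self.max_batch and size < self.target_size:
            try:
                obj = next(self._iter)
            except StopIteration:
                break
            batch.append(obj)
            size += len(json_util.dumps(obj))
        if not batch:
            raise StopIteration
        return batch

This keeps the upfront-measurement guarantee (a batch can exceed the target by at most one item) while restoring immediate progress feedback.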

Contributor:

> currently we have two content size batchers

What are you referring to? We currently have three batchers:
https://github.com/voxel51/fiftyone-teams/blob/318d3abb38b478f557180bb6aed4f25681cc7c64/fiftyone/core/utils.py#L1579-L1606

  • LatencyDynamicBatcher: targets a specific latency between calls to generate the next batch
  • ContentSizeDynamicBatcher: targets a specific content size in each batch
  • StaticBatcher: simple fixed batch size

Latency and content size are correlated heuristics, and both would encounter problems if there are wildly different Sample sizes in a collection.

@brimoor (Contributor) commented Dec 14, 2024:

I guess you're referring to latency and content.

The latency batcher is the default for OSS because it is effectively parameterized by the frame rate of the progress bars you want to see (and the assumption that the database is a big boi that will take as much content in 0.2 seconds as you can throw at it, for example).

The content batcher is the default for Teams when there's an API connection involved, because the assumption is that the API gateway has a hard limit on request size, so we want to be certain not to exceed, say, 1 MB.

https://github.com/voxel51/fiftyone-teams/blob/318d3abb38b478f557180bb6aed4f25681cc7c64/fiftyone/core/config.py#L171
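For reference, the batching strategy is chosen via FiftyOne's config; a minimal example (the default_batcher and batcher_target_size_bytes settings come from the config module linked above):

import fiftyone as fo

# "latency" is the OSS default; "size" targets bytes per batch; "static" is fixed
fo.config.default_batcher = "size"
fo.config.batcher_target_size_bytes = 2**20  # ~1 MB per batch

The same settings can also be supplied via environment variables (e.g., FIFTYONE_DEFAULT_BATCHER).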

@CamronStaley (Contributor Author) commented Dec 14, 2024:

The batcher in this PR, ContentSizeBatcher, is separate from ContentSizeDynamicBatcher because it doesn't use backpressure and has a different use case: one where we can view content sizes upfront, which (keep me honest, @swheaton) isn't always the case. That's what I meant by specifically two "content size batchers"; I know there are other batchers as well.

I am proposing a couple of things:

  1. Rename both of the content-size batchers so that it's clear one computes sizes before writing each batch and the other estimates based on the previous batch's results.
  2. Make the new batcher compute each batch size every time __next__ is called, so that we don't have to do it all upfront (this would require overriding the inherited __next__ method in its current state, so a refactor might be good).

Or, if I am wrong about there being two use cases (computing content size before running a batch vs. after), consolidate these into one batcher that just does number 2 above instead of precomputing the entire iterable.
