
Add support for materializing views #4773

Open · brimoor wants to merge 8 commits into develop from materialize

Conversation

@brimoor (Contributor) commented Sep 5, 2024

Adds a materialize() view stage that materializes the current view into a temporary database collection.

Apply this stage to an expensive view (e.g., an unindexed filtering operation on a large dataset) if you plan to perform multiple downstream operations on the view.

Example usage

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

view = dataset.filter_labels("ground_truth", F("label") == "cat")
materialized_view = view.materialize()

# `materialized_view` contains the same contents as `view`
print(view.count("ground_truth.detections"))
print(materialized_view.count("ground_truth.detections"))

# but it doesn't require aggregation to serve its contents!
fo.pprint(view._pipeline())
fo.pprint(materialized_view._pipeline())

# so downstream tasks like statistics are more efficient
print(view.count_values("ground_truth.detections.label"))
print(materialized_view.count_values("ground_truth.detections.label"))

# Materialized views are fully-compliant with the DatasetView interface, so
# you are free to chain additional view stages as usual
complex_view = materialized_view.shuffle().limit(10).to_patches("ground_truth")
print(complex_view.count_values("ground_truth.label"))

Like other generated views (to_patches(), to_frames(), and to_clips()), materialized views automatically sync changes back to their source dataset:

sample = materialized_view.first()

sample["foo"] = "bar"
sample.save()

assert "foo" in dataset.get_field_schema()
assert dataset[sample.id].foo == "bar"

materialized_view.set_values("has_cats", ["True"] * len(materialized_view))

print(materialized_view.count_values("has_cats"))
print(dataset.count_values("has_cats"))

materialized_view.set_field("has_cats", "False (teehee)").save()

print(materialized_view.count_values("has_cats"))
print(dataset.count_values("has_cats"))

materialized_view.exclude_fields(["foo", "has_cats"]).keep_fields()

assert not dataset.has_field("foo")
assert not materialized_view.has_field("foo")

materialized_view.limit(0).keep()

assert len(materialized_view) == 0
assert len(dataset) == 186

Summary by CodeRabbit

Release Notes: View Materialization

  • New Features

    • Added materialize() method to cache dataset views as temporary collections
    • Enables faster access to complex views by precomputing and caching results
  • Improvements

    • Materialized views support chaining of view stages
    • Can be saved as new views
    • Provides option to refresh materialized views when underlying dataset changes
  • Documentation

    • Updated user guide with comprehensive explanation of view materialization functionality
    • Added examples demonstrating view materialization process

@coderabbitai bot (Contributor) commented Sep 5, 2024

Walkthrough

This pull request introduces a new feature for materializing views in FiftyOne, allowing users to cache a view as a temporary database collection. The implementation spans multiple files, adding a materialize() method to the SampleCollection class, creating a new Materialize view stage, and introducing supporting classes and methods across the core library. The changes enable users to optimize performance for complex views by precomputing and caching their contents while maintaining the existing view interface.

Changes

| File | Change Summary |
| --- | --- |
| docs/source/user_guide/using_views.rst | Added documentation for view materialization |
| fiftyone/__public__.py | Added Materialize to public imports |
| fiftyone/core/materialize.py | New module with MaterializedSampleView and MaterializedView classes |
| fiftyone/core/stages.py | Added Materialize view stage class |
| fiftyone/core/collections.py | Added materialize() method and related modifications |
| fiftyone/core/dataset.py | Updated methods to support materialized views |
| Multiple core files | Added _bulk_write and _check_for_field_edits methods |

Sequence Diagram

sequenceDiagram
    participant User
    participant DatasetView
    participant Materialize
    participant MaterializedView
    
    User->>DatasetView: materialize()
    DatasetView->>Materialize: Create materialization stage
    Materialize->>MaterializedView: Generate temporary collection
    MaterializedView-->>User: Return materialized view

Suggested labels

feature, enhancement

Suggested reviewers

  • minhtuev
  • ritch

Poem

🐰 Hop, hop, materialize!
Views cached with rabbit-like speed
Temporary collections bloom
Performance now takes the lead
Data dancing, swift and free! 🚀



@brimoor force-pushed the materialize branch 2 times, most recently from d6da097 to 51e742e on September 17, 2024
@brimoor force-pushed the materialize branch 4 times, most recently from 75431ed to 0c63d19 on October 7, 2024
@brimoor marked this pull request as ready for review on January 4, 2025
@brimoor added the feature label on January 4, 2025
@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 1

🧹 Nitpick comments (27)
fiftyone/core/materialize.py (7)

177-180: Replace if-else block with a ternary operator for brevity
You can simplify lines 177-180 as recommended by the linter:

-        if self._stages:
-            ids = self.values("id")
-        else:
-            ids = None
+        ids = self.values("id") if self._stages else None


515-515: Use for field_name in schema instead of .keys()
When iterating over a dictionary, calling .keys() is unnecessary. Simply iterate over the dictionary.

-        for field_name in schema.keys():
+        for field_name in schema:


520-520: Use key in dict instead of key in dict.keys()
Same reasoning; accessing .keys() is redundant.

-            for field_name in src_schema.keys():
+            for field_name in src_schema:


537-537: Remove .keys() for dictionary iteration
Again, avoid the extra function call by directly iterating over the dictionary.

-            for field_name in frame_schema.keys():
+            for field_name in frame_schema:


542-542: Use direct membership test for dictionaries
Improve performance slightly and reduce verbosity.

-                for field_name in src_frame_schema.keys():
+                for field_name in src_frame_schema:


193-210: Consider specialized concurrency controls for save() method
Repeated "save()" calls from multiple threads may introduce concurrency issues, such as partial updates or data race conditions, if the underlying DB concurrency model is not explicitly configured. Evaluate employing explicit locking or transaction-based concurrency if needed.


372-455: Evaluate performance of _sync_source for large datasets
The _sync_source method runs multiple large pipelines (like $merge and $aggregate). For very large datasets, these might become bottlenecks or block other operations. Consider more granular or batch-based approaches if performance proves to be a concern in real-world usage.
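As a rough sketch of the batch-based alternative suggested here (the sync_ids callback and batch size are hypothetical, not part of the PR):

def sync_in_batches(ids, sync_ids, batch_size=10000):
    """Runs the sync pipeline over bounded slices of IDs so a single huge
    $merge/$aggregate pass doesn't block other operations."""
    for i in range(0, len(ids), batch_size):
        sync_ids(ids[i : i + batch_size])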

fiftyone/core/patches.py (2)

353-367: Streamline _check_for_field_edits method logic
The approach is correct, but you might unify the updates from "$set" and "$unset" to reduce iterations. Potentially combine these into a single pass to maintain performance on large operations.
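A minimal illustration of the single-pass idea, operating on raw update documents rather than the PR's actual op objects:

def edited_fields(update_docs):
    """Collects every field touched by $set or $unset in one pass over
    update documents like {"$set": {...}, "$unset": {...}}."""
    fields = set()
    for update in update_docs:
        for key in ("$set", "$unset"):
            fields.update(update.get(key, {}))
    return fields

print(edited_fields([{"$set": {"foo": 1}}, {"$unset": {"bar": ""}}]))  # {'foo', 'bar'} (order may vary)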


386-396: Investigate concurrency model for _bulk_write calls
If multiple threads/processes call _bulk_write concurrently, consider whether you need transaction support or locking to guard partial writes. The present approach may lead to race conditions in shared DB usage scenarios.

fiftyone/core/clips.py (2)

51-57: Validate concurrency for _save method usage
Similar to _PatchesView and _PatchView, the deferral logic for saving frames might need explicit concurrency coordination in massively parallel writes scenarios.


341-355: Improve readability of _check_for_field_edits
Your iteration over operations in _check_for_field_edits can be compressed into a more Pythonic approach. This could improve readability without altering functionality.

fiftyone/core/stages.py (2)

5534-5601: Provide an optional dataset name when materializing
By default, the code uses an auto-generated name. Offering the user an argument to specify a custom name might simplify debugging or re-initializing from a known dataset reference.
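A hedged sketch of what that API could look like; the name parameter is the reviewer's proposal, not something the PR currently exposes:

materialized_view = view.materialize()  # auto-generated name (current behavior)

# hypothetical: pin the temporary collection to a known name for debugging
materialized_view = view.materialize(name="cats-materialized")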



5572-5575: Incorporate a ternary operator for improved clarity
Ruff suggests merging lines 5572-5575 into a single ternary expression for a more succinct logic flow.

-        if last_state is not None:
-            name = last_state.pop("name", None)
-        else:
-            name = None
+        name = last_state.pop("name", None) if last_state is not None else None

fiftyone/core/collections.py (2)

188-192: Optimize bulk write operations

The bulk write operations could be optimized by combining the sample and frame operations into a single batch when possible.

- self.sample_collection._bulk_write(
-     self._sample_ops,
-     ids=self._batch_ids,
-     ordered=False,
- )
- self._sample_ops.clear()

- if self._frame_ops:
-     self.sample_collection._bulk_write(
-         self._frame_ops,
-         sample_ids=self._batch_ids,
-         frames=True,
-         ordered=False,
-     )
-     self._frame_ops.clear()

+ # Combine sample and frame operations when possible
+ if self._sample_ops or self._frame_ops:
+     self.sample_collection._bulk_write(
+         self._sample_ops + self._frame_ops,
+         ids=self._batch_ids,
+         frames=bool(self._frame_ops),
+         ordered=False,
+     )
+     self._sample_ops.clear()
+     self._frame_ops.clear()

Also applies to: 196-201


1903-1907: Simplify conditional logic using ternary operator

The code can be simplified using a ternary operator.

- if ids is None:
-     _ids = self.values("_id")
- else:
-     _ids = [ObjectId(_id) for _id in ids]
+ _ids = self.values("_id") if ids is None else [ObjectId(_id) for _id in ids]

tests/unittests/materialize_tests.py (1)

19-343: Consider splitting the large test into multiple smaller tests.
This method verifies a wide variety of behaviors in one go. Splitting it into several more focused tests (e.g., one for indexing behavior, one for field exclusion/dropping, one for frame operations, etc.) would make the code easier to maintain and debug in the future.
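A hedged sketch of one possible decomposition; the setup helper and assertions are illustrative placeholders, not the PR's actual tests:

import unittest

import fiftyone as fo

class MaterializeTests(unittest.TestCase):
    def _make_materialized_view(self):
        dataset = fo.Dataset()
        dataset.add_samples(
            [fo.Sample(filepath="image%d.jpg" % i) for i in range(10)]
        )
        return dataset, dataset.limit(5).materialize()

    def test_contents(self):
        _, view = self._make_materialized_view()
        self.assertEqual(len(view), 5)

    def test_field_sync(self):
        dataset, view = self._make_materialized_view()
        view.set_values("foo", ["bar"] * len(view))
        self.assertIn("foo", dataset.get_field_schema())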

fiftyone/core/dataset.py (2)

450-453: Consider future naming alignment
_is_materialized is a private property. If there's a chance users will want to check is_materialized themselves, it might be better to expose a public read-only property or explicitly mention usage in docs.
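One possible shape for such a public read-only property, sketched on a stand-in class rather than the PR's actual Dataset:

class _DatasetLike:
    def __init__(self, is_materialized=False):
        self._is_materialized = is_materialized

    @property
    def is_materialized(self):
        """Whether this dataset is a temporary materialized collection."""
        return self._is_materialized

print(_DatasetLike(True).is_materialized)  # True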


Line range hint 4901-4919: Allow optional copy-on-write for large datasets
When _clone is used on very large datasets, a full copy might be expensive. Consider an optional "lazy clone" approach or partial read of the original dataset if only partial fields are needed. This reduces duplication and processing time.

fiftyone/core/video.py (4)

14-14: Confirm necessity of both UpdateOne and UpdateMany
Including UpdateMany expands bulk operations. Ensure that only necessary frames are updated to avoid unexpected overhead or partial updates on large frame sets.


60-63: Clarify sample-level read/write semantics
The additional call to _save_sample for non-deferred saves might introduce confusion if _sync_source_sample was previously responsible for that behavior. Ensure the code path is well-documented to avoid user confusion on usage.


333-349: Refine logic for filtering sample-only fields
The _prune_sample_only_field_updates method discards sample-only fields from the ops. If future fields need partial merges (like partial arrays), reconsider whether removing them entirely is appropriate or if partial merges might be needed.


389-419: Ensure consistent handling of frame vs. sample-only fields
When sample_ops is None, _sync_source_sample merges partial updates of frames with the root dataset. If the user missed some fields, they might end up with incomplete data. Document how to properly pass sample_ops/frame_ops or consider raising a warning if critical fields are absent.

docs/source/user_guide/using_views.rst (5)

1009-1058: Great introduction and example of the materialize() feature!

The introduction clearly explains the purpose and benefits of materializing views. The code example effectively demonstrates basic usage with a realistic filtering/sorting scenario.

Consider adding a note about the performance implications and when materialization is most beneficial, e.g., "Materialization is particularly beneficial when:

  • The view involves complex filtering/sorting that would be expensive to recompute
  • The view will be used multiple times
  • The dataset is large"

1059-1064: Important persistence behavior is clearly documented.

The note effectively explains that materialized views are temporary and cleaned up automatically.

Consider adding a cross-reference to the "dataset-persistence" section since it's mentioned in the note:

-    Like :ref:`non-persistent datasets <dataset-persistence>`, the temporary
+    Like :ref:`non-persistent datasets <dataset-persistence>`, the temporary 
+    collections created by materializing views are automatically deleted when

1065-1089: Comprehensive coverage of additional functionality.

The section effectively demonstrates that materialized views:

  1. Support chaining additional view stages
  2. Can be saved as saved views

Consider adding a brief example showing how to verify that a view is materialized, e.g.:

# Verify that a view is materialized
print(view._is_materialized)  # True

1090-1095: Clear explanation of cache regeneration behavior.

The note effectively explains how saved materialized views handle cache regeneration.

Consider adding information about any performance implications of cache regeneration.


1127-1145: Clear explanation of view behavior consistency.

The section effectively explains that materialized views maintain consistent behavior with regular views for:

  1. Sample modifications
  2. Label tagging
  3. Saving changes back to the source dataset

Consider adding a note about any performance implications when modifying materialized views vs regular views.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fddb6a4 and 1e1264f.

📒 Files selected for processing (11)
  • docs/source/user_guide/using_views.rst (1 hunks)
  • fiftyone/__public__.py (1 hunks)
  • fiftyone/core/clips.py (4 hunks)
  • fiftyone/core/collections.py (7 hunks)
  • fiftyone/core/dataset.py (13 hunks)
  • fiftyone/core/materialize.py (1 hunks)
  • fiftyone/core/patches.py (3 hunks)
  • fiftyone/core/stages.py (3 hunks)
  • fiftyone/core/video.py (3 hunks)
  • fiftyone/core/view.py (2 hunks)
  • tests/unittests/materialize_tests.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/__public__.py

217-217: .core.stages.Materialize imported but unused

Remove unused import

(F401)

fiftyone/core/materialize.py

177-180: Use ternary operator ids = self.values("id") if self._stages else None instead of if-else-block

Replace if-else-block with ids = self.values("id") if self._stages else None

(SIM108)


515-515: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


520-520: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


537-537: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


542-542: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

fiftyone/core/stages.py

5572-5575: Use ternary operator name = last_state.pop("name", None) if last_state is not None else None instead of if-else-block

Replace if-else-block with name = last_state.pop("name", None) if last_state is not None else None

(SIM108)

fiftyone/core/collections.py

1903-1906: Use ternary operator _ids = self.values("_id") if ids is None else [ObjectId(_id) for _id in ids] instead of if-else-block

Replace if-else-block with _ids = self.values("_id") if ids is None else [ObjectId(_id) for _id in ids]

(SIM108)

🔇 Additional comments (17)
fiftyone/core/materialize.py (2)

38-48: Ensure consistent deferred save behavior in _save method
Your logic for deferring sample saving is correct. Verify that no external calls skip the deferred parameter, and ensure consistent usage across other method calls to avoid partial writes.


299-310: Validate error handling in _bulk_write
When writing large ops arrays, be sure to handle any partial failures, such as a subset of records failing to write. Confirm that your error-handling strategy (e.g., retriable writes) is robust enough, especially when operating on large, distributed setups.
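For reference, a hedged sketch of partial-failure handling with pymongo; coll and ops stand in for the PR's actual collection and operations:

from pymongo.errors import BulkWriteError

def bulk_write_with_reporting(coll, ops):
    try:
        return coll.bulk_write(ops, ordered=False)
    except BulkWriteError as e:
        # With ordered=False, successful writes are retained; surface the
        # individual failures so callers can retry or log them
        for err in e.details.get("writeErrors", []):
            print("op %d failed: %s" % (err["index"], err["errmsg"]))
        raise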

fiftyone/core/patches.py (1)

41-47: Check deferred save logic consistency
Your _save method consistently calls super()._save(deferred=True). Ensure that subsequent synchronization or indexing is not bypassed due to this deferred logic.

fiftyone/core/clips.py (1)

380-383: Force reindex or reload after _sync_source?
After performing partial label edits, it may be helpful to confirm usage of self._source_collection._dataset._reload_docs(ids=ids) in all relevant scenarios, so that future queries do not use stale indexes.

fiftyone/core/stages.py (1)

8699-8699: Confirm the addition to _STAGES matches any documented plugin requirements
Be sure to update documentation or plugin references for the newly introduced Materialize stage to prevent mismatches between code and official user guides.

fiftyone/core/collections.py (3)

6218-6244: LGTM: Well-implemented materialization method

The materialize() method is well-implemented with clear docstrings and examples. It provides a useful optimization for expensive views by caching them in a temporary database collection.


286-290: LGTM: Good addition of materialized view property

The _is_materialized property is properly added to track materialized views.


3219-3236: LGTM: Well-structured bulk write method

The _bulk_write method properly handles both sample and frame operations with appropriate parameters.

fiftyone/__public__.py (1)

217-217: LGTM: Proper export of Materialize stage

The Materialize stage is correctly exported in the public interface.


tests/unittests/materialize_tests.py (1)

344-374: Verify test coverage for materialization edge cases.
It appears these tests handle a variety of practical scenarios, but consider verifying corner cases such as empty frames, partial selects on large datasets, or passing unusual parameters.

fiftyone/core/view.py (2)

202-205: Ensure _is_materialized attribute is standardized.
This property simply proxies _is_materialized from the dataset. Confirm that _is_materialized is consistently set on any dataset object that might be used here to avoid attribute errors.

✅ Verification successful

_is_materialized implementation is properly standardized

Based on the code analysis:

  • The base implementation in Dataset class properly defines _is_materialized based on the sample collection name
  • The abstract base class in collections.py enforces implementation through NotImplementedError
  • The view implementation correctly delegates to the underlying dataset's _is_materialized

The attribute is consistently implemented across the codebase with clear patterns:

  • Dataset objects determine materialization based on collection names
  • Views delegate to their parent dataset
  • The collections interface enforces implementation in subclasses
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Script to check for all references to _is_materialized in the codebase
rg "_is_materialized"

Length of output: 430


Script:

#!/bin/bash
# Check the implementation of _is_materialized in dataset.py and collections.py
ast-grep --pattern 'def _is_materialized($_):
  $$$'

Length of output: 623


Script:

#!/bin/bash
# Check for classes that inherit from collections.py and implement _is_materialized
ast-grep --pattern 'class $_ ($$$collections.$$$):
  $$$'

Length of output: 59


Script:

#!/bin/bash
# Let's try a different pattern to find classes that inherit from SampleCollection
ast-grep --pattern 'class $_ (SampleCollection):
  $$$'

Length of output: 57


1327-1328: Confirm that the revised docstring reflects intended behavior.
The update clarifies that entire frames rather than just frame-level labels get removed. Ensure the code indeed removes the entire frame data rather than just the labels.

✅ Verification successful


The docstring accurately reflects the implementation's behavior

Based on the implementation in fiftyone/core/dataset.py, the _keep_frames method indeed removes entire frames that are not in the view from the underlying dataset. This is evidenced by:

  1. The method performs a DeleteMany operation on the frame collection for all frame numbers that are not in the view ("frame_number": {"$not": {"$in": fns}})
  2. It operates on the entire frame document, not just labels
  3. The frame documents are completely reset for the sample with only the kept frame numbers
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Script to search keep_frames method usage to confirm consistent references
rg -A 5 "keep_frames"

Length of output: 4305


Script:

#!/bin/bash
# Check the implementation of _keep_frames in dataset.py
ast-grep --pattern 'def _keep_frames($$$)'

Length of output: 44


Script:

#!/bin/bash
# Search for the implementation details of _keep_frames
rg -B 2 -A 10 "_keep_frames.*\{" 

Length of output: 34


Script:

#!/bin/bash
# Search for the implementation with a more specific pattern
rg -B 2 -A 15 "def _keep_frames"

Length of output: 2007


Script:

#!/bin/bash
# Get more context of the _keep_frames implementation
rg -B 2 -A 25 "def _keep_frames"

Length of output: 3218


Script:

#!/bin/bash
# Get the remaining implementation of _keep_frames
rg -A 35 "for sample_id, fns in d.items\(\)"

Length of output: 2285

fiftyone/core/dataset.py (2)

429-434: Ensure backward compatibility when introducing new view categories
Introducing _is_materialized into the _is_generated property might affect existing code relying on _is_generated=True only for patches/frames/clips. Validate whether older workflows or tools check for _is_generated=True and appropriately handle or skip materialized datasets.


Line range hint 3468-3491: Potential concurrency concerns with bulk writes
The _bulk_write method can write both sample and frame docs and then reload them. If multiple threads/processes call _bulk_write on large datasets simultaneously, concurrency issues might arise. Consider adding concurrency-safe checks or mention recommended usage patterns in docs.

fiftyone/core/video.py (2)

350-379: Review data synchronization order
In _bulk_write, the code syncs the frames dataset first, then prunes sample-only fields, then writes to the source collection. If a user merges data that contains both frame-level and sample-level changes, re-check if partial updates could be lost or overwritten out of sequence.


380-386: Validate partial updates on _save_sample
This method performs a direct bulk write to _frames_dataset followed by _sync_source_sample. Confirm that partially updated frames in _frames_dataset are properly reflected if the user modifies fields that exist in the root dataset only.

docs/source/user_guide/using_views.rst (1)

1096-1126: Excellent coverage of dataset change handling.

The section thoroughly explains and demonstrates:

  1. How materialized views handle dataset changes
  2. The two methods to refresh the view (reload() and re-materialization)
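For concreteness, a hedged sketch of those two refresh options, reusing the materialize() API from the PR description above:

materialized_view = view.materialize()

dataset.tag_samples("new")  # the source dataset changes

# Option 1: reload the materialized view in-place
materialized_view.reload()

# Option 2: re-materialize the original view from scratch
materialized_view = view.materialize()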

@@ -54,10 +54,13 @@ def _sample_id(self):
         return ObjectId(self._doc.sample_id)

     def _save(self, deferred=False):
-        sample_ops, frame_ops = super()._save(deferred=deferred)
+        sample_ops, frame_ops = super()._save(deferred=True)
@coderabbitai bot (Contributor) left a review comment

⚠️ Potential issue

Make the deferred argument more explicit
Hard-coding super()._save(deferred=True) means the deferred argument is now effectively ignored, which can break callers that relied on the default immediate save. If possible, explicitly highlight this in release notes or docstrings to warn users of the new behavior.
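A minimal sketch of how the new behavior might be surfaced in a docstring, using illustrative class names rather than the PR's actual code:

class _BaseSampleView:
    def _save(self, deferred=False):
        # the real implementation either writes immediately or returns ops
        return [], []

class _MaterializedSampleView(_BaseSampleView):
    def _save(self, deferred=False):
        """Saves pending edits.

        Note: edits are now always deferred to the view's bulk writer,
        regardless of ``deferred``; call ``view.save()`` to flush them.
        """
        return super()._save(deferred=True)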

Labels: feature (Work on a feature request)
Projects: None yet
2 participants