Add support for materializing views #4773
base: develop
Conversation
Walkthrough

This pull request introduces a new feature for materializing views in FiftyOne, allowing users to cache a view as a temporary database collection. The implementation spans multiple files.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant DatasetView
    participant Materialize
    participant MaterializedView
    User->>DatasetView: materialize()
    DatasetView->>Materialize: Create materialization stage
    Materialize->>MaterializedView: Generate temporary collection
    MaterializedView-->>User: Return materialized view
```
Actionable comments posted: 1
🧹 Nitpick comments (27)
fiftyone/core/materialize.py (7)
Lines 177-180: Replace the if-else block with a ternary operator for brevity

You can simplify lines 177-180 as recommended by the linter:

```diff
-        if self._stages:
-            ids = self.values("id")
-        else:
-            ids = None
+        ids = self.values("id") if self._stages else None
```

🧰 Tools
🪛 Ruff (0.8.2)
177-180: Use ternary operator `ids = self.values("id") if self._stages else None` instead of `if`-`else` block (SIM108)
Line 515: Use `for field_name in schema` instead of `.keys()`

When iterating over a dictionary, calling `.keys()` is unnecessary. Simply iterate over the dictionary:

```diff
-        for field_name in schema.keys():
+        for field_name in schema:
```

🧰 Tools
🪛 Ruff (0.8.2)
515-515: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
Line 520: Use `key in dict` instead of `key in dict.keys()`

Same reasoning; accessing `.keys()` is redundant:

```diff
-        for field_name in src_schema.keys():
+        for field_name in src_schema:
```

🧰 Tools
🪛 Ruff (0.8.2)
520-520: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
Line 537: Remove `.keys()` for dictionary iteration

Again, avoid the extra function call by directly iterating over the dictionary:

```diff
-        for field_name in frame_schema.keys():
+        for field_name in frame_schema:
```

🧰 Tools
🪛 Ruff (0.8.2)
537-537: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
Line 542: Use a direct membership test for dictionaries

Improves performance slightly and reduces verbosity:

```diff
-        for field_name in src_frame_schema.keys():
+        for field_name in src_frame_schema:
```

🧰 Tools
🪛 Ruff (0.8.2)
542-542: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
Lines 193-210: Consider specialized concurrency controls for the `save()` method

Repeated `save()` calls from multiple threads may introduce concurrency issues, such as partial updates or data race conditions, if the underlying DB concurrency model is not explicitly configured. Evaluate employing explicit locking or transaction-based concurrency if needed.
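The locking strategy suggested above can be sketched in plain Python. Note that `SafeSaver` and its fields are hypothetical stand-ins for illustration, not FiftyOne API:

```python
import threading

class SafeSaver:
    """Serializes save() calls so concurrent writers cannot interleave updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._doc = {}

    def save(self, updates):
        # Hold the lock for the full read-modify-write cycle, not just the
        # final write, so concurrent saves cannot produce lost updates
        with self._lock:
            merged = dict(self._doc)
            merged.update(updates)
            self._doc = merged
            return dict(self._doc)
```

A transaction-based alternative (e.g., multi-document transactions in the database layer) would replace the process-local lock with a session-scoped transaction, which is the more robust option for multi-process writers.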
Lines 372-455: Evaluate the performance of `_sync_source` for large datasets

The `_sync_source` method runs multiple large pipelines (like `$merge` and `$aggregate`). For very large datasets, these might become bottlenecks or block other operations. Consider more granular or batch-based approaches if performance proves to be a concern in real-world usage.

fiftyone/core/patches.py (2)
Lines 353-367: Streamline the `_check_for_field_edits` method logic

The approach is correct, but you might unify the updates from "$set" and "$unset" to reduce iterations. Potentially combine these into a single pass to maintain performance on large operations.
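The single-pass idea could look like the following sketch; `collect_field_edits` is a hypothetical name operating on pymongo-style update documents, not the actual FiftyOne method:

```python
def collect_field_edits(op):
    """Gather top-level edited field names from one update op in a single pass.

    Handles both "$set" and "$unset" clauses together instead of iterating
    over the op document twice.
    """
    edited = set()
    for clause in ("$set", "$unset"):
        for path in op.get(clause, {}):
            # "ground_truth.detections.label" -> "ground_truth"
            edited.add(path.split(".", 1)[0])
    return edited
```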
Lines 386-396: Investigate the concurrency model for `_bulk_write` calls

If multiple threads/processes call `_bulk_write` concurrently, consider whether you need transaction support or locking to guard partial writes. The present approach may lead to race conditions in shared DB usage scenarios.

fiftyone/core/clips.py (2)
Lines 51-57: Validate concurrency for `_save` method usage

Similar to `_PatchesView` and `_PatchView`, the deferral logic for saving frames might need explicit concurrency coordination in massively parallel write scenarios.
Lines 341-355: Improve the readability of `_check_for_field_edits`

Your iteration over operations in `_check_for_field_edits` can be compressed into a more Pythonic approach. This could improve readability without altering functionality.

fiftyone/core/stages.py (2)
Lines 5534-5601: Provide an optional dataset name when materializing

By default, the code uses an auto-generated name. Offering the user an argument to specify a custom name might simplify debugging or re-initializing from a known dataset reference.

🧰 Tools
🪛 Ruff (0.8.2)
5572-5575: Use ternary operator `name = last_state.pop("name", None) if last_state is not None else None` instead of `if`-`else` block (SIM108)
Lines 5572-5575: Incorporate a ternary operator for improved clarity

Ruff suggests merging lines 5572-5575 into a single ternary expression for a more succinct logic flow:

```diff
-        if last_state is not None:
-            name = last_state.pop("name", None)
-        else:
-            name = None
+        name = last_state.pop("name", None) if last_state is not None else None
```

🧰 Tools
🪛 Ruff (0.8.2)
5572-5575: Use ternary operator `name = last_state.pop("name", None) if last_state is not None else None` instead of `if`-`else` block (SIM108)
fiftyone/core/collections.py (2)

Lines 188-192: Optimize bulk write operations

The bulk write operations could be optimized by combining the sample and frame operations into a single batch when possible:

```diff
-        self.sample_collection._bulk_write(
-            self._sample_ops,
-            ids=self._batch_ids,
-            ordered=False,
-        )
-        self._sample_ops.clear()
-        if self._frame_ops:
-            self.sample_collection._bulk_write(
-                self._frame_ops,
-                sample_ids=self._batch_ids,
-                frames=True,
-                ordered=False,
-            )
-            self._frame_ops.clear()
+        # Combine sample and frame operations when possible
+        if self._sample_ops or self._frame_ops:
+            self.sample_collection._bulk_write(
+                self._sample_ops + self._frame_ops,
+                ids=self._batch_ids,
+                frames=bool(self._frame_ops),
+                ordered=False,
+            )
+            self._sample_ops.clear()
+            self._frame_ops.clear()
```

Also applies to: 196-201
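The combined-flush pattern in the diff above can be mimicked with a small stand-alone sketch; `BatchFlusher` and its write callback are hypothetical, not the actual FiftyOne classes:

```python
class BatchFlusher:
    """Accumulates sample and frame ops and flushes them in one combined bulk write."""

    def __init__(self, bulk_write):
        self._bulk_write = bulk_write  # callable standing in for _bulk_write()
        self._sample_ops = []
        self._frame_ops = []

    def add(self, op, frames=False):
        (self._frame_ops if frames else self._sample_ops).append(op)

    def flush(self):
        # Combine sample and frame operations into a single batch; skip the
        # write entirely when nothing is pending
        if self._sample_ops or self._frame_ops:
            self._bulk_write(
                self._sample_ops + self._frame_ops,
                frames=bool(self._frame_ops),
            )
            self._sample_ops.clear()
            self._frame_ops.clear()
```

One caveat of combining batches this way: the single write must be able to route sample ops and frame ops to the right collections, which is why the real implementation may prefer to keep the two writes separate.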
Lines 1903-1907: Simplify conditional logic using a ternary operator

The code can be simplified using a ternary operator:

```diff
-        if ids is None:
-            _ids = self.values("_id")
-        else:
-            _ids = [ObjectId(_id) for _id in ids]
+        _ids = self.values("_id") if ids is None else [ObjectId(_id) for _id in ids]
```

🧰 Tools
🪛 Ruff (0.8.2)
1903-1906: Use ternary operator `_ids = self.values("_id") if ids is None else [ObjectId(_id) for _id in ids]` instead of `if`-`else` block (SIM108)
tests/unittests/materialize_tests.py (1)

Lines 19-343: Consider splitting the large test into multiple smaller tests

This method verifies a wide variety of behaviors in one go. Splitting it into several more focused tests (e.g., one for indexing behavior, one for field exclusion/dropping, one for frame operations, etc.) would make the code easier to maintain and debug in the future.

fiftyone/core/dataset.py (2)
Lines 450-453: Consider future naming alignment

`_is_materialized` is a private property. If there's a chance users will want to check `is_materialized` themselves, it might be better to expose a public read-only property or explicitly mention usage in the docs.

Line range hint 4901-4919: Allow optional copy-on-write for large datasets

When `_clone` is used on very large datasets, a full copy might be expensive. Consider an optional "lazy clone" approach or a partial read of the original dataset if only partial fields are needed. This reduces duplication and processing time.

fiftyone/core/video.py (4)
Line 14: Confirm the necessity of both UpdateOne and UpdateMany

Including `UpdateMany` expands bulk operations. Ensure that only necessary frames are updated to avoid unexpected overhead or partial updates on large frame sets.

Lines 60-63: Clarify sample-level read/write semantics

The additional call to `_save_sample` for non-deferred saves might introduce confusion if `_sync_source_sample` was previously responsible for that behavior. Ensure the code path is well documented to avoid user confusion on usage.

Lines 333-349: Refine the logic for filtering sample-only fields

The `_prune_sample_only_field_updates` method discards sample-only fields from the ops. If future fields need partial merges (like partial arrays), reconsider whether removing them entirely is appropriate or if partial merges might be needed.

Lines 389-419: Ensure consistent handling of frame vs. sample-only fields

When `sample_ops` is None, `_sync_source_sample` merges partial updates of frames with the root dataset. If the user missed some fields, they might end up with incomplete data. Document how to properly pass `sample_ops`/`frame_ops`, or consider raising a warning if critical fields are absent.

docs/source/user_guide/using_views.rst (5)
Lines 1009-1058: Great introduction and example of the materialize() feature!

The introduction clearly explains the purpose and benefits of materializing views. The code example effectively demonstrates basic usage with a realistic filtering/sorting scenario.

Consider adding a note about the performance implications and when materialization is most beneficial, e.g., "Materialization is particularly beneficial when:

- The view involves complex filtering/sorting that would be expensive to recompute
- The view will be used multiple times
- The dataset is large"
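The "used multiple times" point can be illustrated with a plain-Python stand-in for an expensive view (no FiftyOne required; the filter and sort here are arbitrary placeholders for an unindexed query):

```python
import time

def expensive_view(samples):
    # Stand-in for an unindexed filter + sort over a large collection
    return sorted((s for s in samples if s % 7 == 0), reverse=True)

samples = range(1_000_000)

# Without materialization: the view is recomputed per downstream operation
t0 = time.perf_counter()
for _ in range(3):
    result = expensive_view(samples)
recompute_time = time.perf_counter() - t0

# With materialization: compute once, then reuse the cached result
t0 = time.perf_counter()
cached = expensive_view(samples)
for _ in range(3):
    result = cached
materialize_time = time.perf_counter() - t0
```

The cached variant pays the full cost once; every downstream operation after that is effectively free, which is the same tradeoff `materialize()` makes at the database level.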
Lines 1059-1064: Important persistence behavior is clearly documented

The note effectively explains that materialized views are temporary and cleaned up automatically.

Consider adding a cross-reference to the "dataset-persistence" section since it's mentioned in the note:

```diff
- Like :ref:`non-persistent datasets <dataset-persistence>`, the temporary
+ Like :ref:`non-persistent datasets <dataset-persistence>`, the temporary
+ collections created by materializing views are automatically deleted when
```
Lines 1065-1089: Comprehensive coverage of additional functionality

The section effectively demonstrates that materialized views:

- Support chaining additional view stages
- Can be saved as saved views

Consider adding a brief example showing how to verify that a view is materialized, e.g.:

```python
# Verify that a view is materialized
print(view._is_materialized)  # True
```
Lines 1090-1095: Clear explanation of cache regeneration behavior

The note effectively explains how saved materialized views handle cache regeneration.

Consider adding information about any performance implications of cache regeneration.
Lines 1127-1145: Clear explanation of view behavior consistency

The section effectively explains that materialized views maintain consistent behavior with regular views for:

- Sample modifications
- Label tagging
- Saving changes back to the source dataset

Consider adding a note about any performance implications when modifying materialized views vs. regular views.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)

- docs/source/user_guide/using_views.rst (1 hunks)
- fiftyone/__public__.py (1 hunks)
- fiftyone/core/clips.py (4 hunks)
- fiftyone/core/collections.py (7 hunks)
- fiftyone/core/dataset.py (13 hunks)
- fiftyone/core/materialize.py (1 hunks)
- fiftyone/core/patches.py (3 hunks)
- fiftyone/core/stages.py (3 hunks)
- fiftyone/core/video.py (3 hunks)
- fiftyone/core/view.py (2 hunks)
- tests/unittests/materialize_tests.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/__public__.py
- 217-217: `.core.stages.Materialize` imported but unused; remove unused import (F401)
fiftyone/core/materialize.py
- 177-180: Use ternary operator `ids = self.values("id") if self._stages else None` instead of `if`-`else` block (SIM108)
- 515-515: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
- 520-520: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
- 537-537: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
- 542-542: Use `key in dict` instead of `key in dict.keys()`; remove `.keys()` (SIM118)
fiftyone/core/stages.py
- 5572-5575: Use ternary operator `name = last_state.pop("name", None) if last_state is not None else None` instead of `if`-`else` block (SIM108)
fiftyone/core/collections.py
- 1903-1906: Use ternary operator `_ids = self.values("_id") if ids is None else [ObjectId(_id) for _id in ids]` instead of `if`-`else` block (SIM108)
🔇 Additional comments (17)
fiftyone/core/materialize.py (2)
Lines 38-48: Ensure consistent deferred save behavior in the `_save` method

Your logic for deferring sample saving is correct. Verify that no external calls skip the `deferred` parameter, and ensure consistent usage across other method calls to avoid partial writes.
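The deferred-save contract being verified can be reduced to an abstract sketch; `DeferredSaver` is hypothetical, and the real implementation accumulates MongoDB update ops rather than opaque values:

```python
class DeferredSaver:
    """Collects ops when deferred=True and flushes them in one bulk write."""

    def __init__(self):
        self._pending = []
        self.written = []  # stand-in for the database collection

    def save(self, op, deferred=False):
        if deferred:
            # The caller is responsible for an eventual flush(); skipping it
            # would silently drop the write
            self._pending.append(op)
        else:
            self.written.append(op)

    def flush(self):
        # One bulk write for everything that was deferred
        self.written.extend(self._pending)
        count = len(self._pending)
        self._pending.clear()
        return count
```

This makes the review concern concrete: any code path that sets `deferred=True` but never reaches `flush()` results in a partial write.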
Lines 299-310: Validate error handling in `_bulk_write`

When writing large ops arrays, be sure to handle any partial failures, such as a subset of records failing to write. Confirm that your error-handling strategy (e.g., retriable writes) is robust enough, especially when operating on large, distributed setups.

fiftyone/core/patches.py (1)
Lines 41-47: Check deferred save logic consistency

Your `_save` method consistently calls `super()._save(deferred=True)`. Ensure that subsequent synchronization or indexing is not bypassed due to this deferred logic.

fiftyone/core/clips.py (1)
Lines 380-383: Force a reindex or reload after `_sync_source`?

After performing partial label edits, it may be helpful to confirm usage of `self._source_collection._dataset._reload_docs(ids=ids)` in all relevant scenarios, so that future queries do not use stale indexes.

fiftyone/core/stages.py (1)
Line 8699: Confirm the addition to `_STAGES` matches any documented plugin requirements

Be sure to update documentation or plugin references for the newly introduced `Materialize` stage to prevent mismatches between code and official user guides.

fiftyone/core/collections.py (3)
Lines 6218-6244: LGTM: Well-implemented materialization method

The `materialize()` method is well-implemented with clear docstrings and examples. It provides a useful optimization for expensive views by caching them in a temporary database collection.

Lines 286-290: LGTM: Good addition of materialized view property

The `_is_materialized` property is properly added to track materialized views.

Lines 3219-3236: LGTM: Well-structured bulk write method

The `_bulk_write` method properly handles both sample and frame operations with appropriate parameters.

fiftyone/__public__.py (1)
Line 217: LGTM: Proper export of Materialize stage

The `Materialize` stage is correctly exported in the public interface.

🧰 Tools
🪛 Ruff (0.8.2)
217-217: `.core.stages.Materialize` imported but unused; remove unused import (F401)
tests/unittests/materialize_tests.py (1)
Lines 344-374: Verify test coverage for materialization edge cases

It appears these tests handle a variety of practical scenarios, but consider verifying corner cases such as empty frames, partial selects on large datasets, or passing unusual parameters.

fiftyone/core/view.py (2)
Lines 202-205: Ensure the `_is_materialized` attribute is standardized

This property simply proxies `_is_materialized` from the dataset. Confirm that `_is_materialized` is consistently set on any dataset object that might be used here to avoid attribute errors.

✅ Verification successful

The `_is_materialized` implementation is properly standardized. Based on the code analysis:

- The base implementation in the `Dataset` class properly defines `_is_materialized` based on the sample collection name
- The abstract base class in `collections.py` enforces implementation through `NotImplementedError`
- The view implementation correctly delegates to the underlying dataset's `_is_materialized`

The attribute is consistently implemented across the codebase with clear patterns:

- Dataset objects determine materialization based on collection names
- Views delegate to their parent dataset
- The collections interface enforces implementation in subclasses

🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Check for all references to _is_materialized in the codebase
rg "_is_materialized"
```

Length of output: 430

```shell
#!/bin/bash
# Check the implementation of _is_materialized in dataset.py and collections.py
ast-grep --pattern 'def _is_materialized($_): $$$'
```

Length of output: 623

```shell
#!/bin/bash
# Check for classes that inherit from collections.py and implement _is_materialized
ast-grep --pattern 'class $_ ($$$collections.$$$): $$$'
```

Length of output: 59

```shell
#!/bin/bash
# Try a different pattern to find classes that inherit from SampleCollection
ast-grep --pattern 'class $_ (SampleCollection): $$$'
```

Length of output: 57
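The delegation pattern the verification describes can be reduced to a minimal sketch. The class names and the collection-name convention below are illustrative only, not FiftyOne's actual implementation:

```python
class Dataset:
    """Determines materialization from its backing collection name."""

    def __init__(self, sample_collection_name):
        self._sample_collection_name = sample_collection_name

    @property
    def _is_materialized(self):
        # Illustrative convention: temporary collections get a special prefix
        return self._sample_collection_name.startswith("materialized.")

class DatasetView:
    """Views delegate the check to their parent dataset."""

    def __init__(self, dataset):
        self._dataset = dataset

    @property
    def _is_materialized(self):
        return self._dataset._is_materialized
```

Because the view never stores its own flag, there is a single source of truth and no risk of the two objects disagreeing about materialization state.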
Lines 1327-1328: Confirm that the revised docstring reflects intended behavior

The update clarifies that entire frames, rather than just frame-level labels, get removed. Ensure the code indeed removes the entire frame data rather than just the labels.

✅ Verification successful

The docstring accurately reflects the implementation's behavior. Based on the implementation in `fiftyone/core/dataset.py`, the `_keep_frames` method indeed removes entire frames that are not in the view from the underlying dataset. This is evidenced by:

- The method performs a `DeleteMany` operation on the frame collection for all frame numbers that are not in the view (`{"frame_number": {"$not": {"$in": fns}}}`)
- It operates on the entire frame document, not just labels
- The frame documents are completely reset for the sample with only the kept frame numbers

🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Search keep_frames method usage to confirm consistent references
rg -A 5 "keep_frames"
```

Length of output: 4305

```shell
#!/bin/bash
# Check the implementation of _keep_frames in dataset.py
ast-grep --pattern 'def _keep_frames($$$)'
```

Length of output: 44

```shell
#!/bin/bash
# Search for the implementation details of _keep_frames
rg -B 2 -A 10 "_keep_frames.*\{"
```

Length of output: 34

```shell
#!/bin/bash
# Search for the implementation with a more specific pattern
rg -B 2 -A 15 "def _keep_frames"
```

Length of output: 2007

```shell
#!/bin/bash
# Get more context of the _keep_frames implementation
rg -B 2 -A 25 "def _keep_frames"
```

Length of output: 3218

```shell
#!/bin/bash
# Get the remaining implementation of _keep_frames
rg -A 35 "for sample_id, fns in d.items\(\)"
```

Length of output: 2285
fiftyone/core/dataset.py (2)
Lines 429-434: Ensure backward compatibility when introducing new view categories

Introducing `_is_materialized` into the `_is_generated` property might affect existing code relying on `_is_generated=True` only for patches/frames/clips. Validate whether older workflows or tools check for `_is_generated=True` and appropriately handle or skip materialized datasets.

Line range hint 3468-3491: Potential concurrency concerns with bulk writes

The `_bulk_write` method can write both sample and frame docs and then reload them. If multiple threads/processes call `_bulk_write` on large datasets simultaneously, concurrency issues might arise. Consider adding concurrency-safe checks or mention recommended usage patterns in the docs.

fiftyone/core/video.py (2)
Lines 350-379: Review the data synchronization order

In `_bulk_write`, the code syncs the frames dataset first, then prunes sample-only fields, then writes to the source collection. If a user merges data that contains both frame-level and sample-level changes, re-check whether partial updates could be lost or overwritten out of sequence.

Lines 380-386: Validate partial updates in `_save_sample`

This method performs a direct bulk write to `_frames_dataset` followed by `_sync_source_sample`. Confirm that partially updated frames in `_frames_dataset` are properly reflected if the user modifies fields that exist in the root dataset only.

docs/source/user_guide/using_views.rst (1)
Lines 1096-1126: Excellent coverage of dataset change handling

The section thoroughly explains and demonstrates:

- How materialized views handle dataset changes
- The two methods to refresh the view (reload() and re-materialization)
```diff
@@ -54,10 +54,13 @@ def _sample_id(self):
         return ObjectId(self._doc.sample_id)

     def _save(self, deferred=False):
-        sample_ops, frame_ops = super()._save(deferred=deferred)
+        sample_ops, frame_ops = super()._save(deferred=True)
```
Make the deferred argument more explicit

Changing `super()._save(deferred=deferred)` to `super()._save(deferred=True)` can break calls that relied on a default immediate save. If possible, explicitly highlight this in release notes or docstrings to warn users of the new default.
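The risk described above is easy to demonstrate with a hypothetical before/after pair: the call site is unchanged, yet its behavior flips when only the default does:

```python
def save_v1(doc, deferred=False):
    """Old behavior: a bare save(doc) performs an immediate save."""
    return "deferred" if deferred else "saved"

def save_v2(doc, deferred=True):
    """New default: the identical call site now defers the save."""
    return "deferred" if deferred else "saved"
```

Callers that want the old behavior must now pass `deferred=False` explicitly, which is exactly the kind of silent change worth flagging in release notes.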
Adds a `materialize()` view stage that materializes the current view into a temporary database collection. Apply this stage to an expensive view (e.g., an unindexed filtering operation on a large dataset) if you plan to perform multiple downstream operations on the view.

Example usage

Like other generated views (`to_patches()`, `to_frames()`, and `to_clips()`), materialized views automatically sync changes back to their source dataset:

Summary by CodeRabbit
Release Notes: View Materialization

New Features
- `materialize()` method to cache dataset views as temporary collections

Improvements

Documentation