fix copied doc updates not insert #4729

swheaton · 2024-08-26T15:10:48Z

What changes are proposed in this pull request?

Another weird bug that doesn't really cause issues until upcoming changes are introduced.

If you run this code (which happens 3 times during a dataset clone):

doc_copy = doc.copy()
_id = bson.ObjectId()
doc_copy.id = _id

Then mongoengine sets _created to False, which means it treats the object as an existing document to update, not a new one.
When you call save() it emits an upsert call instead of an insert.
OK, not so bad, maybe a little weird but whatever ...
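
The insert-versus-upsert decision described above can be illustrated with a toy model. This is a minimal sketch, not mongoengine's code: `ToyCollection` and `ToyDocument` are hypothetical stand-ins that only mimic the `_created` flag behavior reported in this PR, and the integer IDs are purely illustrative.

```python
import itertools


class ToyCollection:
    """Hypothetical in-memory stand-in for a MongoDB collection."""

    def __init__(self):
        self.docs = {}
        self.ops = []  # records which operation each save used

    def insert_one(self, doc):
        self.ops.append("insert")
        self.docs[doc["_id"]] = dict(doc)

    def update_one(self, _id, fields, upsert=False):
        self.ops.append("upsert" if upsert else "update")
        if _id in self.docs or upsert:
            self.docs.setdefault(_id, {"_id": _id}).update(fields)


class ToyDocument:
    """Hypothetical document that mimics the `_created` flag."""

    _ids = itertools.count(1)

    def __init__(self, **data):
        self._data = dict(data)
        self._id = None
        self._created = True  # new documents start out as "to be inserted"

    @property
    def id(self):
        return self._id

    @id.setter
    def id(self, value):
        # Assumption: assigning an ID to a new document flips the flag,
        # mirroring the behavior described in this PR
        self._id = value
        if value is not None:
            self._created = False

    def save(self, collection):
        if self._created:
            if self._id is None:
                self._id = next(self._ids)
            collection.insert_one({"_id": self._id, **self._data})
            self._created = False
        else:
            collection.update_one(self._id, self._data, upsert=True)


coll = ToyCollection()

fresh = ToyDocument(name="a")
fresh.save(coll)  # no ID assigned beforehand, so this is an insert

copied = ToyDocument(name="b")
copied.id = 99  # assigning the ID up front flips _created to False
copied.save(coll)  # so this goes down the upsert path

print(coll.ops)  # ['insert', 'upsert']
```

In this toy the upsert still writes every field, which matches the "maybe a little weird but whatever" outcome; the trouble only starts once the update path writes a subset of fields.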

But in #4597 @brimoor proposes an optimization where only changed fields are serialized to get the document. In combo, this causes all kinds of strange behavior. It just so happens to work in that PR because doc._changed_fields is uninitialized (a code smell in mongoengine, they have a TODO to clean it up...) and so doc._delta() returns the whole doc.

But say we cleared changed fields because we didn't know about this strange requirement.

import bson
import fiftyone.core.odm as foo

run_doc = foo.RunDocument(config={"foo": "bar"})
run_doc.save()
doc_copy = run_doc.copy()
doc_copy.id = bson.ObjectId()

doc_copy._clear_changed_fields()
doc_copy.version = "51.51"
print(doc_copy._get_changed_fields())  # ["version"]
doc_copy.save(upsert=True)
doc_copy.reload()

# Oops, this fails: our config field is {} because it's not a changed field, so the update only wrote the "version" field
assert doc_copy.config == {"foo": "bar"}

doc_copy.delete()
run_doc.delete()
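
The same failure mode can be reproduced without a database. This is a minimal sketch, assuming a save that upserts only the tracked changed fields; `apply_upsert` and the dict-based `collection` are hypothetical stand-ins, not FiftyOne or mongoengine APIs.

```python
def apply_upsert(collection, _id, full_doc, changed_fields):
    """Upserts only the changed fields of `full_doc`, like a delta-based save."""
    delta = {name: full_doc[name] for name in changed_fields}
    stored = collection.setdefault(_id, {"_id": _id})
    stored.update(delta)
    return stored


collection = {}

# The copy holds its full content in memory, but only "version" is tracked
# as changed, so that is all the upsert writes for the brand-new ID
doc_copy = {"config": {"foo": "bar"}, "version": "51.51"}
stored = apply_upsert(collection, "new-id", doc_copy, changed_fields=["version"])

print(stored)  # {'_id': 'new-id', 'version': '51.51'}
```

Because nothing exists under the new ID yet, the upsert creates a document containing only the delta, and `config` never reaches storage.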

OK, but we don't clear _changed_fields, so we're fine? Nope, we're hanging by a thread due to another mongoengine quirk: _get_changed_fields() can return a non-empty list when embedded documents are edited.

import bson
import fiftyone as fo

ds = fo.Dataset()

# Pretending to clone the dataset doc
doc_copy = ds._doc.copy()
doc_copy.id = bson.ObjectId()
doc_copy.sample_collection_name=f"samples.{str(doc_copy.id)}"
doc_copy.name="blah"
doc_copy.slug="blah"

# Making an embedded document update
doc_copy.sample_fields[0].description = "blah"

# Wha? This assertion fails: changed fields is actually
#  ["sample_fields.0.description"] because simple fields aren't tracked
#  but embedded docs are
assert doc_copy._get_changed_fields() == []

ds.delete()
doc_copy.delete()

In the mongoengine code, this appears to be band-aided with the following, which is what @brimoor's optimization is trying to avoid:

        # Handles cases where not loaded from_son but has _id
        doc = self.to_mongo()

How is this patch tested? If it is not, please explain why.

Added test for copy_with_new_id.

Ensured that cloning a dataset uses INSERT rather than UPSERT by adding a print in Document._save():

import fiftyone as fo

ds = fo.Dataset()
ds.clone("blah")

$$ INSERT SON([('_id', ObjectId('66cc999093e25d6a662346e4')), ('name', 'blah'), ('slug', 'blah'), ('version', '0.24.1'), ('created_at', datetime.datetime(2024, 8, 26, 15, 4, 48, 474872)), ('sample_collection_name', 'samples.66cc999093e25d6a662346e4'), ('persistent', False), ('group_media_types', {}), ('tags', []), ('info', {}), ('app_config', SON([('grid_media_field', 'filepath'), ('media_fallback', False), ('media_fields', ['filepath']), ('modal_media_field', 'filepath'), ('plugins', {})])), ('classes', {}), ('default_classes', []), ('mask_targets', {}), ('default_mask_targets', {}), ('skeletons', {}), ('sample_fields', [SON([('name', 'id'), ('ftype', 'fiftyone.core.fields.ObjectIdField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_id'), ('description', None), ('info', None)]), SON([('name', 'filepath'), ('ftype', 'fiftyone.core.fields.StringField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'filepath'), ('description', None), ('info', None)]), SON([('name', 'tags'), ('ftype', 'fiftyone.core.fields.ListField'), ('embedded_doc_type', None), ('subfield', 'fiftyone.core.fields.StringField'), ('fields', []), ('db_field', 'tags'), ('description', None), ('info', None)]), SON([('name', 'metadata'), ('ftype', 'fiftyone.core.fields.EmbeddedDocumentField'), ('embedded_doc_type', 'fiftyone.core.metadata.Metadata'), ('subfield', None), ('fields', [SON([('name', 'size_bytes'), ('ftype', 'fiftyone.core.fields.IntField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'size_bytes'), ('description', None), ('info', None)]), SON([('name', 'mime_type'), ('ftype', 'fiftyone.core.fields.StringField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'mime_type'), ('description', None), ('info', None)])]), ('db_field', 'metadata'), ('description', None), ('info', None)]), SON([('name', '_media_type'), ('ftype', 'fiftyone.core.fields.StringField'), 
('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_media_type'), ('description', None), ('info', None)]), SON([('name', '_rand'), ('ftype', 'fiftyone.core.fields.FloatField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_rand'), ('description', None), ('info', None)]), SON([('name', '_dataset_id'), ('ftype', 'fiftyone.core.fields.ObjectIdField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_dataset_id'), ('description', None), ('info', None)])]), ('frame_fields', []), ('saved_views', []), ('workspaces', []), ('annotation_runs', {}), ('brain_methods', {}), ('evaluations', {}), ('runs', {})])

^^UPDATES {'$set': {'last_loaded_at': datetime.datetime(2024, 8, 26, 15, 4, 48, 625081)}}

Previously,

^^UPDATES {'$set': SON([('name', 'blah2'), ('slug', 'blah2'), ('version', '0.24.1'), ('created_at', datetime.datetime(2024, 8, 26, 15, 7, 4, 166390)), ('sample_collection_name', 'samples.66cc9a1861678cf7500272b4'), ('persistent', False), ('app_config', SON([('grid_media_field', 'filepath'), ('media_fallback', False), ('media_fields', ['filepath']), ('modal_media_field', 'filepath'), ('plugins', {})])), ('sample_fields', [SON([('name', 'id'), ('ftype', 'fiftyone.core.fields.ObjectIdField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_id'), ('description', None), ('info', None)]), SON([('name', 'filepath'), ('ftype', 'fiftyone.core.fields.StringField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'filepath'), ('description', None), ('info', None)]), SON([('name', 'tags'), ('ftype', 'fiftyone.core.fields.ListField'), ('embedded_doc_type', None), ('subfield', 'fiftyone.core.fields.StringField'), ('fields', []), ('db_field', 'tags'), ('description', None), ('info', None)]), SON([('name', 'metadata'), ('ftype', 'fiftyone.core.fields.EmbeddedDocumentField'), ('embedded_doc_type', 'fiftyone.core.metadata.Metadata'), ('subfield', None), ('fields', [SON([('name', 'size_bytes'), ('ftype', 'fiftyone.core.fields.IntField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'size_bytes'), ('description', None), ('info', None)]), SON([('name', 'mime_type'), ('ftype', 'fiftyone.core.fields.StringField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', 'mime_type'), ('description', None), ('info', None)])]), ('db_field', 'metadata'), ('description', None), ('info', None)]), SON([('name', '_media_type'), ('ftype', 'fiftyone.core.fields.StringField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_media_type'), ('description', None), ('info', None)]), SON([('name', '_rand'), ('ftype', 'fiftyone.core.fields.FloatField'), 
('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_rand'), ('description', None), ('info', None)]), SON([('name', '_dataset_id'), ('ftype', 'fiftyone.core.fields.ObjectIdField'), ('embedded_doc_type', None), ('subfield', None), ('fields', []), ('db_field', '_dataset_id'), ('description', None), ('info', None)])])]), '$unset': {'group_media_types': 1, 'tags': 1, 'info': 1, 'classes': 1, 'default_classes': 1, 'mask_targets': 1, 'default_mask_targets': 1, 'skeletons': 1, 'frame_fields': 1, 'saved_views': 1, 'workspaces': 1, 'annotation_runs': 1, 'brain_methods': 1, 'evaluations': 1, 'runs': 1}}

^^UPDATES {'$set': {'last_loaded_at': datetime.datetime(2024, 8, 26, 15, 7, 4, 291793)}}

Summary by CodeRabbit

New Features
- Introduced a method to duplicate documents with a new unique identifier, enhancing document management.
Bug Fixes
- Improved ID generation for copied documents, reducing potential errors and improving maintainability.
Tests
- Added new tests to ensure the functionality of copying documents with new IDs works as intended.

coderabbitai · 2024-08-26T15:10:56Z

Walkthrough

The changes involve significant modifications to the document handling functionalities in the FiftyOne library. A new method, copy, is introduced to the Document class for creating copies of documents with unique identifiers. This change affects the cloning process in dataset and view management, ensuring that copied documents are distinct and properly marked as newly created. Additionally, a new test class has been added to verify the correct behavior of this functionality.

Changes

| Files | Change Summary |
|---|---|
| `fiftyone/core/dataset.py` | Modified `_clone_dataset_or_view`, `_clone_extras`, and `_clone_run` to use `copy(new_id=True)`. |
| `fiftyone/core/odm/document.py` | Added `copy` method to the `Document` class for creating copies with new IDs. |
| `tests/unittests/odm_tests.py` | Introduced `DocumentTests` class with `test_doc_copy_with_new_id` to validate new functionality. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Document
    participant Dataset

    User->>Dataset: Clone Document
    Dataset->>Document: Call copy(new_id=True)
    Document->>Document: Create new document instance
    Document->>Document: Generate new ObjectId
    Document->>Document: Set _created attribute to True
    Document-->>Dataset: Return new document
    Dataset-->>User: Provide cloned document
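
The flow in the diagram can be sketched in plain Python. This is an illustrative toy, not the merged FiftyOne implementation: `ToyDocument` is a hypothetical class, `os.urandom(12).hex()` stands in for `bson.ObjectId()`, and resetting the ID to None when `new_id=False` is an assumption based on the discussion in this thread.

```python
import copy
import os


class ToyDocument:
    """Hypothetical stand-in for an ODM document; not FiftyOne's class."""

    def __init__(self, **data):
        self._data = data
        self.id = None
        self._created = True

    def copy(self, new_id=False):
        doc_copy = copy.deepcopy(self)
        # Either way, the copy has not been saved yet, so it should be inserted
        doc_copy._created = True
        if new_id:
            # Stand-in for bson.ObjectId(): 12 random bytes, hex-encoded
            doc_copy.id = os.urandom(12).hex()
        else:
            # Assumption: a plain copy drops the ID so one is generated on save
            doc_copy.id = None
        return doc_copy


original = ToyDocument(name="blah")
original.id = os.urandom(12).hex()
original._created = False  # pretend it was already saved

clone = original.copy(new_id=True)
plain = original.copy()
```

The point of `new_id=True` is that the caller learns the new ID before saving (for example, to derive a sample collection name) while the copy is still marked as a new document.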

🐇 In the land of code, where documents play,
A new ID hops in, brightening the day.
With each little copy, fresh and anew,
The documents dance, as if they all grew.
So let’s celebrate this change with delight,
For every new clone brings joy to our sight! 🐇✨

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 76cdadf and 581d3f9.

Files selected for processing (3)

fiftyone/core/dataset.py (3 hunks)
fiftyone/core/odm/document.py (2 hunks)
tests/unittests/odm_tests.py (2 hunks)

Additional comments not posted (6)

tests/unittests/odm_tests.py (1)

35-56: LGTM!

The test function is well-structured and covers the necessary test cases for the copy_with_new_id method.

The code changes are approved.

fiftyone/core/odm/document.py (1)

587-599: LGTM!

The method is well-implemented and follows the necessary steps to create a new document with a unique ID.

The code changes are approved.

fiftyone/core/dataset.py (4)

7776-7776: LGTM!

The code correctly uses copy_with_new_id() to create a new ID for the cloned dataset document.

The code changes are approved.

7780-7781: LGTM!

The code correctly assigns the newly cloned dataset document to the variable dataset_doc.

The code changes are approved.

8252-8253: LGTM!

The code correctly uses copy_with_new_id() to create a new ID for the cloned reference document.

The code changes are approved.

Line range hint 8257-8264: LGTM!

The code correctly uses copy_with_new_id() to create a new ID for the cloned run document and handles copying the GridFS files.

The code changes are approved.

benjaminpkane · 2024-08-27T17:54:28Z

Nice find. Odd and opaque behavior.

It makes sense to me. Trying to grok things a bit...we still like regular copy() for other use cases? And this only impacts documents (not embedded documents, e.g. labels), correct?

Thinking out loud, what about a doc.copy(new_id=False) signature?

brimoor · 2024-08-28T15:08:24Z

@swheaton +1 to both of @benjaminpkane's thoughts here:

Still trying to fully grok the implications of what you've found here. Are the other places where we use copy() okay?
I like folding this into a copy(new_id=True) syntax. Or, another option could be clone(), since when you dataset.clone() you're creating an identical but fully-independent copy of the dataset (with a new dataset ID).

But, I'm now wondering if we ever use copy() for the purposes of making edits to a doc that we intend to upsert in-place... 🤔

Okay, here's an interesting bit of code when dealing with embedded docs:

fiftyone/fiftyone/utils/eval/coco.py

Lines 770 to 781 in c9cfc05

    
    def _copy_labels(labels):
        if labels is None:
            return None

        field = labels._LABEL_LIST_FIELD
        _labels = labels.copy()

        # We need the IDs to stay the same
        for _label, label in zip(_labels[field], labels[field]):
            _label.id = label.id

        return _labels

The use case here is that we actually want a copy of the embedded docs with the same IDs because we intend to do in-memory computations on them but don't want the stuff we do to be tracked and persisted on the label objects in the database. Annnnd, this reminds me that apparently foo.EmbeddedDocument.copy() does not behave the same as foo.Document.copy(), it creates a new ID by default!

d = fo.Detection()
assert d.id == d.copy().id  # False!

And then there's fiftyone.core.document.Document.copy(), which does a third thing: it explicitly returns a document with id == None:

fiftyone/fiftyone/core/document.py

Lines 376 to 390 in c9cfc05

    
    def copy(self, fields=None, omit_fields=None):
        """Returns a deep copy of the document that has not been added to the
        database.

        Args:
            fields (None): an optional field or iterable of fields to which to
                restrict the copy. This can also be a dict mapping existing
                field names to new field names
            omit_fields (None): an optional field or iterable of fields to
                exclude from the copy

        Returns:
            a :class:`Document`
        """
        raise NotImplementedError("subclass must implement copy()")

swheaton · 2024-08-30T02:39:23Z

Somehow I missed the emails of these reviews coming in ...

Good ideas, I didn't love copy_with_new_id anyways.

foo.Document does behave that same way, copying resets the ID to None. It's just that in our clone_* implementations we want to know the ID before saving so that we can create sample collection name and set up references and stuff. So we don't just leave the ID as None which would cause an insert and new ID generated in the normal way.

But yes embedded document is different, not sure why exactly. But I don't see any further bugs due to this.

swheaton · 2024-08-30T14:18:43Z

changed to copy(new_id=False)

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 581d3f9 and 63a5cc4.

Files selected for processing (3)

fiftyone/core/dataset.py (3 hunks)
fiftyone/core/odm/document.py (2 hunks)
tests/unittests/odm_tests.py (2 hunks)

Files skipped from review as they are similar to previous changes (3)

fiftyone/core/dataset.py
fiftyone/core/odm/document.py
tests/unittests/odm_tests.py

brimoor

LGTM

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 63a5cc4 and 86aad06.

Files selected for processing (1)

fiftyone/core/odm/document.py (1 hunks)

Files skipped from review as they are similar to previous changes (1)

fiftyone/core/odm/document.py

benjaminpkane

Nice!

fix copied doc updates not insert

581d3f9

swheaton requested a review from brimoor August 26, 2024 15:10

swheaton mentioned this pull request Aug 26, 2024

Adding created_at, last_modified_at, and read-only fields #4597

Merged

coderabbitai bot reviewed Aug 26, 2024

swheaton mentioned this pull request Aug 26, 2024

Fields have created_at attribute #4730

Merged

7 tasks

pr comment: copy_with_new_id->copy(new_id=False)

63a5cc4

coderabbitai bot reviewed Aug 30, 2024

tweak

86aad06

brimoor approved these changes Sep 4, 2024

coderabbitai bot reviewed Sep 4, 2024

benjaminpkane approved these changes Sep 9, 2024

brimoor merged commit 8811fd9 into develop Sep 12, 2024
13 checks passed

brimoor deleted the fix/mongoengine-document-misc-bugs branch September 12, 2024 01:49

coderabbitai bot mentioned this pull request Oct 10, 2024

add support for atomic state transitions #4893

Merged

7 tasks

coderabbitai bot mentioned this pull request Dec 6, 2024

fix import timeouts on increasing datasets by precomputing batch size #5231

Open

5 tasks
