Revise, simplify, and fix to_partial#1496
Conversation
Deploying datachain with
|
| Latest commit: |
8f78f83
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://0622a7bf.datachain-2g6.pages.dev |
| Branch Preview URL: | https://revise-to-partial.datachain-2g6.pages.dev |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
207a0c7 to
5df932e
Compare
f91563c to
a7a4e67
Compare
b0e4e10 to
5434471
Compare
There was a problem hiding this comment.
Pull request overview
This PR refactors the to_partial method in SignalSchema to simplify and unify its behavior for both primitive and complex types. The key change is the introduction of deterministic fingerprint-based naming for partial models to prevent collisions and ensure stability across sessions.
Key Changes:
- Implements deterministic partial model naming using content-based fingerprints
- Refactors
to_partialto build partial types hierarchically rather than through serialization/deserialization - Extracts
type_to_strfunction to a shared utility module for reusability
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
tests/unit/test_signal_schema_partials.py |
Adds comprehensive tests for partial model creation, fingerprinting, and collision detection |
tests/unit/test_data_model.py |
Tests for the new compute_model_fingerprint function |
tests/func/test_signal_schema.py |
Functional test verifying partial collision handling during dataset reload |
tests/unit/lib/test_utils.py |
Tests for the extracted type_to_str utility function |
tests/unit/lib/test_signal_schema.py |
Removes old partial tests moved to dedicated file; fixes typo "Seince" → "Since" |
tests/func/test_datachain.py |
Updates test assertions to match new fingerprint-based partial naming scheme |
src/datachain/lib/utils.py |
Extracts type_to_str from SignalSchema for shared use across modules |
src/datachain/lib/signal_schema.py |
Refactors to_partial to use fingerprints; delegates type serialization to type_to_str |
src/datachain/lib/model_store.py |
Adds support for tracking base names separately from class names to support partial naming |
src/datachain/lib/data_model.py |
Implements deterministic fingerprint computation for model selections |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
5434471 to
8f78f83
Compare
| return str(PurePosixPath(new_base) / new_relative_path) | ||
|
|
||
|
|
||
| def type_to_str( # noqa: C901, PLR0911, PLR0912 |
There was a problem hiding this comment.
C: moved it from signal schema with minor modifications. Since it is now used also to calculate fingerprints (hashes) of the models.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@dreadatour any luck reviewing this? :) |
dreadatour
left a comment
There was a problem hiding this comment.
Quite a complex change inside the core, I went through all the changes and related code several times looking for any possible issues.
Looks good to me overall, great improvement!
One thing confuses me — is a little mess with "double" versioning — e.g. MyType_v1 VS MyType@v1. But I don't have good solution for this and this is out of the scope of this PR.
Also thank you for reordering tests (moving signal schema partials out).
Might be closing #1329
Simplifies
to_partialto behave the same for primitive and complex types.This the first step before reviewing also
selectbehavior (we need to make also producing partials instead of flattening)Note: Description and examples were AI generated and reviewed.
When does DataChain generate a partial model?
Partials are generated when DataChain needs a schema that includes only a subset of fields from a nested
DataModel(e.g. selecting or grouping byfile.pathbut not the entirefile). Internally this is driven bySignalSchema.to_partial().One concrete call site:
SignalSchema.group_by()constructs the grouped schema by doingorig_schema.to_partial(*partition_by)(sopartition_bydetermines whether partials are needed)."name")name: strstays as-is"file.path")file)file: FilePartial_<fingerprint>@v1containing onlypath"file")file: Filestays as-is"file.path","file.source", ...)file: File(no partial created)Example (group_by / partition_by context)
Assume your schema includes a nested model signal:
When you partition/group by a nested attribute, DataChain only needs that attribute in the grouping key, so it builds a partial schema:
If you instead partition/group by the whole nested object, or the nested selection ends up including every field, no partial model is generated.
Potentially breaking changes
SignalSchema.to_partial()may return the original nested model type when selection includes all fieldsFooPartial_<hashprefix>@v1ModelStore.register()stores models under both base/logical name and runtime__name____name__ModelStore.storekeys expecting a 1:1 mapping (more keys now)Fixes
FieldInfo.is_requiredtreated like a truthy attribute (v2 uses a method), leading to incorrect requiredness/default propagationfield_info.is_required(); defaults preserved for non-required fields in partialsCustomTypedeserialize could surface less clearlyValidationErrorwith a clearSignalSchemaErrormessagetype_to_str()with schema-specific warning type injectionOther changes
partial_fingerprintmetadata to serialized custom types (when present)CustomTypenow carriespartial_fingerprintand serialization excludesNonevaluesexclude_none=True)compute_model_fingerprint()added as a shared utility(model, selection)ModelStore.storeinstead of wiping itcreate_feature_model()andto_partial()docs expanded/standardized;ModelStorenaming rationale clarifiedBefore/after examples
main)FieldInfo.is_requiredwas treated like a truthy attribute (v2 exposes it as a method). This could cause defaults to be lost or fields to be treated as required incorrectly in generated partials.field_info.is_required()consistently; non-required fields keep their defaults in the partial model.class Foo(DataModel): a: int; b: int = 7schema.to_partial("foo.b")bmight become required / default mishandledbstays optional-with-default (default7)schema.to_partial("person.name", "person.age")person: PersonPartial...person: Person(original type reused)schema.to_partial("person.age")PersonPartial...(sometimes wrong defaults/requiredness)PersonPartial_...with correct field metadata(model, selection); collisions with a different fingerprint raise a clear error.FooPartial_<hashprefix>@v1for the same selection__name__, so resolution works via either identifier.ModelStore.get("Foo", 1)might fail if runtime class isFoo_v1onlyFooandFoo_v1resolve to version1type_to_str()with an injectablewarn_with, while schema still emitsSignalSchemaWarning.__name__:SignalSchemaWarningModelStore.store, causing order-dependent failures.ModelStore.storeto avoid polluting global state.