Revise, simplify, and fix to_partial #1496

Merged
shcheklein merged 2 commits into main from revise-to-partial on Dec 14, 2025

shcheklein (Contributor) commented Dec 7, 2025

Might be closing #1329

Simplifies to_partial to behave the same for primitive and complex types.

This is the first step before also reviewing select behavior (we need to make it produce partials as well, instead of flattening).

Note: Description and examples were AI generated and reviewed.

When does DataChain generate a partial model?

Partials are generated when DataChain needs a schema that includes only a subset of fields from a nested DataModel (e.g. selecting or grouping by file.path but not the entire file). Internally this is driven by SignalSchema.to_partial().

One concrete call site: SignalSchema.group_by() constructs the grouped schema by doing orig_schema.to_partial(*partition_by) (so partition_by determines whether partials are needed).

| Operation / selection | Partial generated? | Resulting type |
|---|---|---|
| Select/group by a top-level primitive (e.g. `"name"`) | No | `name: str` stays as-is |
| Select/group by a nested field (e.g. `"file.path"`) | Yes (for `file`) | `file: FilePartial_<fingerprint>@v1` containing only `path` |
| Select/group by the whole nested model (e.g. `"file"`) | No | `file: File` stays as-is |
| Select/group by all nested fields via leaves (e.g. `"file.path"`, `"file.source"`, ...) | No (reuses original) | `file: File` (no partial created) |
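
The rule in the table above can be sketched in plain Python. This is only an illustration of the decision, not DataChain's implementation: `needs_partial` and the three-field stand-in for `File` are hypothetical, and deeper paths such as `"file.a.b"` are simplified to "field `a` is selected".

```python
def needs_partial(signal: str, model_fields: set[str], selection: list[str]) -> bool:
    """Decide whether selecting `selection` requires a partial model for `signal`.

    Hypothetical helper: mirrors the table, not the real to_partial() logic.
    """
    selected: set[str] = set()
    for path in selection:
        head, _, rest = path.partition(".")
        if head != signal:
            continue  # this path refers to a different signal
        if not rest:
            return False  # whole nested model selected: keep the original type
        selected.add(rest.split(".", 1)[0])
    # A partial is needed only when some, but not all, fields were selected.
    return bool(selected) and not model_fields <= selected


# Assumed, shortened field set; the real File model has more fields.
file_fields = {"path", "source", "size"}

nested = needs_partial("file", file_fields, ["file.path"])
whole = needs_partial("file", file_fields, ["file"])
all_leaves = needs_partial("file", file_fields, ["file.path", "file.source", "file.size"])
primitive = needs_partial("file", file_fields, ["name"])
```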

Example (group_by / partition_by context)

Assume your schema includes a nested model signal:

from datachain.lib.file import File
from datachain.lib.signal_schema import SignalSchema

schema = SignalSchema({"name": str, "file": File})

When you partition/group by a nested attribute, DataChain only needs that attribute in the grouping key, so it builds a partial schema:

grouped_schema = schema.to_partial("file.path")
file_type = grouped_schema.values["file"]

# `file_type` is a generated partial model containing only the selected fields.

If you instead partition/group by the whole nested object, or the nested selection ends up including every field, no partial model is generated.

Potentially breaking changes

| Change | Before | After | Why it might break someone |
|---|---|---|---|
| `SignalSchema.to_partial()` may return the original nested model type when selection includes all fields | Even "full nested selection" could still produce a generated partial model type | Reuses the original model type when the nested selection covers all fields | If downstream code relied on "partial always means a new model type/name" (introspection, exact name matching, class identity checks) |
| Partial model naming scheme changed to fingerprint-based | Partial names were effectively ad-hoc/counter-ish | Partial names include a deterministic fingerprint prefix: `FooPartial_<hashprefix>@v1` | If code/tests asserted on exact partial class names, or expected names to be sequential |
| `ModelStore.register()` stores models under both base/logical name and runtime `__name__` | Typically registered under the runtime `__name__` | Registered under both logical base name and runtime name | If any code iterates `ModelStore.store` keys expecting a 1:1 mapping (more keys now) |
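
To make the fingerprint-based naming concrete, here is a minimal sketch of how a deterministic name could be derived from a (model, selection) pair. It is an assumption-laden illustration: `selection_fingerprint` and `partial_name` are hypothetical helpers, not the PR's `compute_model_fingerprint()`, and the choices of SHA-256, a 10-character prefix, and order-insensitive selection are made up for the example. The `@v1` version suffix is left out.

```python
import hashlib
import json


def selection_fingerprint(model_name: str, fields: dict[str, str], selection: list[str]) -> str:
    """Hash a canonical JSON description of the model and the selected fields."""
    payload = json.dumps(
        {
            "model": model_name,
            "fields": fields,
            # Assumption: selection order does not matter, so sort and deduplicate.
            "selection": sorted(set(selection)),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def partial_name(model_name: str, fields: dict[str, str], selection: list[str], prefix_len: int = 10) -> str:
    """Build a name of the form <Model>Partial_<hashprefix>."""
    fingerprint = selection_fingerprint(model_name, fields, selection)
    return f"{model_name}Partial_{fingerprint[:prefix_len]}"


# Field types are given as strings so the hash input is stable across processes.
file_fields = {"path": "str", "source": "str", "size": "int"}
name_a = partial_name("File", file_fields, ["path"])
name_b = partial_name("File", file_fields, ["path"])
name_c = partial_name("File", file_fields, ["source"])
```

Because the name depends only on the hashed content, the same selection yields the same name in any process, which is what sequential or counter-based names cannot guarantee.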

Fixes

| Fix | Before | After | User-visible effect |
|---|---|---|---|
| Pydantic v2 required/default handling in partial models | `FieldInfo.is_required` treated like a truthy attribute (v2 uses a method), leading to incorrect requiredness/default propagation | Always calls `field_info.is_required()`; defaults preserved for non-required fields in partials | Partial schemas/models match original required/default semantics |
| More robust custom type deserialization error reporting | Validation issues during `CustomType` deserialize could surface less clearly | Wraps `ValidationError` with a clear `SignalSchemaError` message | Easier debugging when schema metadata is malformed or incompatible |
| Type string conversion warning path covered/consistent | Type-to-string logic was duplicated inside schema | Centralized in `type_to_str()` with schema-specific warning type injection | More consistent type strings; warning behavior is testable and stable |
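
The requiredness bug described above comes down to a general Python pitfall: a bound method object is always truthy, whatever it would return when called. The sketch below reproduces that with a hypothetical stand-in class rather than Pydantic's real `FieldInfo`, so it runs without Pydantic installed.

```python
class FieldInfoStandIn:
    """Hypothetical stand-in that mimics a v2-style `is_required()` method."""

    def __init__(self, default=...):
        # Ellipsis is used here to mean "no default provided".
        self.default = default

    def is_required(self) -> bool:
        return self.default is ...


optional_field = FieldInfoStandIn(default=7)

# Buggy check: tests the method object itself, which is always truthy.
buggy_required = bool(optional_field.is_required)

# Correct check: calls the method and uses its result.
actual_required = optional_field.is_required()
```

With the buggy check, a field with default `7` looks required, which matches the "defaults lost or fields treated as required" symptom.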

Other changes

| Change | What changed | Impact |
|---|---|---|
| Added `partial_fingerprint` metadata to serialized custom types (when present) | `CustomType` now carries `partial_fingerprint` and serialization excludes `None` values | Helps diagnose/validate partial model identity; keeps payload tidy (`exclude_none=True`) |
| `compute_model_fingerprint()` added as a shared utility | New helper to deterministically hash (model, selection) | Enables stable partial naming and collision detection |
| Test suite stability improvements | Tests snapshot/restore `ModelStore.store` instead of wiping it | Eliminates order-dependent failures; makes unit/func tests more reliable |
| Docstrings/comments clarified | `create_feature_model()` and `to_partial()` docs expanded/standardized; `ModelStore` naming rationale clarified | Improves maintainability and makes behavior easier to understand |
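
The snapshot/restore approach to test isolation can be sketched as a small context manager. This is an assumption-based illustration, not the fixture used in the PR: `preserved_store` is a hypothetical name, and the store is assumed to be a plain dict mapping a name to a dict of versions.

```python
import contextlib


@contextlib.contextmanager
def preserved_store(store: dict):
    """Snapshot a registry dict on entry and restore it on exit."""
    # Copy the inner dicts too, so in-place edits inside the block are undone.
    snapshot = {name: dict(versions) for name, versions in store.items()}
    try:
        yield store
    finally:
        store.clear()
        store.update(snapshot)


# Stand-in for a global model registry (assumed shape: name -> {version: class}).
store = {"File": {1: object}}

with preserved_store(store):
    store["Tmp"] = {1: int}   # a test registers a temporary model
    store["File"][2] = str    # and adds a version to an existing one
```

Unlike wiping the registry, this leaves models registered at import time in place, so later tests do not depend on execution order.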

Before/after examples

| Area / Scenario | Before (on main) | After (this branch) | Example (before → after) |
|---|---|---|---|
| Pydantic v2 requiredness + defaults in partial models | Required/default detection could be wrong because `FieldInfo.is_required` was treated like a truthy attribute (v2 exposes it as a method). This could cause defaults to be lost or fields to be treated as required incorrectly in generated partials. | Uses `field_info.is_required()` consistently; non-required fields keep their defaults in the partial model. | Model: `class Foo(DataModel): a: int; b: int = 7`<br>Selection: `schema.to_partial("foo.b")`<br>Before: `b` might become required / default mishandled<br>After: `b` stays optional-with-default (default 7) |
| Selecting all nested fields | Even if the selection effectively included every field of a nested model, a new "partial" model could still be generated. | If the nested selection includes all fields (and only leaves), it reuses the original model type (no new partial type created). | `schema.to_partial("person.name", "person.age")`<br>Before: `person: PersonPartial...`<br>After: `person: Person` (original type reused) |
| Selecting only some nested fields | Generates a partial nested model (but naming/required/default handling could be shaky under v2). | Still generates a partial nested model, with correct required/default handling. | `schema.to_partial("person.age")`<br>Before: `PersonPartial...` (sometimes wrong defaults/requiredness)<br>After: `PersonPartial_...` with correct field metadata |
| Partial model naming stability across reloads / processes | Names were more ad-hoc (effectively counter-based), making collisions harder to reason about and potentially unstable across reloads. | Partial model names are derived from a deterministic fingerprint of (model, selection); collisions with a different fingerprint raise a clear error. | Same selection in another process/run:<br>Before: might produce a different partial name, or collide unpredictably<br>After: produces the same `FooPartial_<hashprefix>@v1` for the same selection |
| ModelStore lookup robustness for restored/generated models | Restored/generated versioned Python class names could complicate lookups by logical name + version. | Registers models under both the logical base name and the runtime `__name__`, so resolution works via either identifier. | Before: `ModelStore.get("Foo", 1)` might fail if runtime class is `Foo_v1` only<br>After: both `Foo` and `Foo_v1` resolve to version 1 |
| Type stringification warnings | Type-to-string logic lived in schema code; warning path coverage was awkward. | Centralized `type_to_str()` with an injectable `warn_with`, while schema still emits `SignalSchemaWarning`. | A type without `__name__`:<br>Before: warning emitted from schema helper<br>After: warning emitted via shared helper but still categorized as `SignalSchemaWarning` |
| Test isolation / order dependence | Some tests wiped global `ModelStore.store`, causing order-dependent failures. | Tests snapshot/restore `ModelStore.store` to avoid polluting global state. | Before: running tests in a different order could fail<br>After: order-independent runs |
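
The dual-name registration described for ModelStore can be illustrated with a toy registry. Everything here is hypothetical (`TinyStore`, its `register`/`get` signatures, and the internal dict layout); it only shows why registering under both the logical base name and the runtime `__name__` lets either identifier resolve, and why the store ends up with more keys than models.

```python
class TinyStore:
    """Toy registry keyed by name, then by version (not the real ModelStore)."""

    def __init__(self) -> None:
        self._store: dict[str, dict[int, type]] = {}

    def register(self, model: type, base_name: str, version: int) -> None:
        # Register under both names; the set collapses them when they are equal.
        for key in {base_name, model.__name__}:
            self._store.setdefault(key, {})[version] = model

    def get(self, name: str, version: int):
        return self._store.get(name, {}).get(version)

    def keys(self) -> list[str]:
        return sorted(self._store)


class Foo_v1:  # runtime class name differs from the logical name "Foo"
    pass


registry = TinyStore()
registry.register(Foo_v1, base_name="Foo", version=1)

by_logical_name = registry.get("Foo", 1)
by_runtime_name = registry.get("Foo_v1", 1)
```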

cloudflare-workers-and-pages bot commented Dec 7, 2025

Deploying datachain with Cloudflare Pages

Latest commit: 8f78f83
Status: ✅  Deploy successful!
Preview URL: https://0622a7bf.datachain-2g6.pages.dev
Branch Preview URL: https://revise-to-partial.datachain-2g6.pages.dev

codecov bot commented Dec 7, 2025

Codecov Report

❌ Patch coverage is 98.59155% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/lib/signal_schema.py | 97.24% | 1 Missing and 2 partials ⚠️ |

@shcheklein shcheklein force-pushed the revise-to-partial branch 3 times, most recently from 207a0c7 to 5df932e Compare December 10, 2025 06:37
@shcheklein shcheklein marked this pull request as draft December 10, 2025 16:58
@shcheklein shcheklein self-assigned this Dec 10, 2025
@shcheklein shcheklein force-pushed the revise-to-partial branch 2 times, most recently from f91563c to a7a4e67 Compare December 10, 2025 21:52
@shcheklein shcheklein force-pushed the revise-to-partial branch 8 times, most recently from b0e4e10 to 5434471 Compare December 12, 2025 00:45
@shcheklein shcheklein requested a review from Copilot December 12, 2025 02:06
@shcheklein shcheklein marked this pull request as ready for review December 12, 2025 02:06

Copilot AI left a comment

Pull request overview

This PR refactors the to_partial method in SignalSchema to simplify and unify its behavior for both primitive and complex types. The key change is the introduction of deterministic fingerprint-based naming for partial models to prevent collisions and ensure stability across sessions.

Key Changes:

  • Implements deterministic partial model naming using content-based fingerprints
  • Refactors to_partial to build partial types hierarchically rather than through serialization/deserialization
  • Extracts type_to_str function to a shared utility module for reusability

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

| File | Description |
|---|---|
| tests/unit/test_signal_schema_partials.py | Adds comprehensive tests for partial model creation, fingerprinting, and collision detection |
| tests/unit/test_data_model.py | Tests for the new `compute_model_fingerprint` function |
| tests/func/test_signal_schema.py | Functional test verifying partial collision handling during dataset reload |
| tests/unit/lib/test_utils.py | Tests for the extracted `type_to_str` utility function |
| tests/unit/lib/test_signal_schema.py | Removes old partial tests moved to dedicated file; fixes typo "Seince" → "Since" |
| tests/func/test_datachain.py | Updates test assertions to match new fingerprint-based partial naming scheme |
| src/datachain/lib/utils.py | Extracts `type_to_str` from `SignalSchema` for shared use across modules |
| src/datachain/lib/signal_schema.py | Refactors `to_partial` to use fingerprints; delegates type serialization to `type_to_str` |
| src/datachain/lib/model_store.py | Adds support for tracking base names separately from class names to support partial naming |
| src/datachain/lib/data_model.py | Implements deterministic fingerprint computation for model selections |

return str(PurePosixPath(new_base) / new_relative_path)


def type_to_str( # noqa: C901, PLR0911, PLR0912

C: moved it from signal schema with minor modifications, since it is now also used to calculate fingerprints (hashes) of the models.
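
As a rough illustration of a shared type-to-string helper with an injectable warning category, consider the sketch below. It is not the real `type_to_str()`: the function name, the `"Any"` fallback, and the `SchemaWarning` class are assumptions made for the example, and the real helper handles many more cases (generics, unions, and so on).

```python
import warnings


class SchemaWarning(RuntimeWarning):
    """Hypothetical caller-specific warning category."""


def simple_type_to_str(type_, warn_with=None) -> str:
    """Return the type's name, or fall back to "Any" with an optional warning."""
    name = getattr(type_, "__name__", None)
    if name is not None:
        return name
    if warn_with is not None:
        # The caller decides which warning category is emitted.
        warnings.warn(f"Unable to determine name of type {type_!r}", warn_with, stacklevel=2)
    return "Any"


class NoName:
    """Instances of this class have no __name__ attribute."""


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = simple_type_to_str(NoName(), warn_with=SchemaWarning)
```

Passing the category in keeps the helper free of schema-specific imports while the caller's warnings stay filterable by their own class.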

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

@shcheklein shcheklein changed the title from "Revise and simplify to_partial" to "Revise, simplify, and fix to_partial" Dec 12, 2025
@shcheklein

@dreadatour any luck reviewing this? :)

@dreadatour dreadatour left a comment

Quite a complex change inside the core, I went through all the changes and related code several times looking for any possible issues.

Looks good to me overall, great improvement!

One thing that confuses me is the slight mess with "double" versioning, e.g. MyType_v1 vs. MyType@v1. But I don't have a good solution for this, and it is out of scope for this PR.

Also thank you for reordering tests (moving signal schema partials out).

@shcheklein shcheklein merged commit 66f9590 into main Dec 14, 2025
63 of 65 checks passed
@shcheklein shcheklein deleted the revise-to-partial branch December 14, 2025 20:36