-
Notifications
You must be signed in to change notification settings - Fork 134
fix(delta): read qualified dataset name on no changes in delta #1326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer's GuideThis PR fixes delta datasets incorrectly reading from the default namespace when there are no changes by qualifying dataset names with project and namespace in both test and production code, and adds comprehensive tests to validate this behavior across various namespace/project combinations. Sequence diagram for reading a dataset with qualified namespace and projectsequenceDiagram
participant Caller
participant "read_dataset()"
Caller->>"read_dataset()": read_dataset(name, namespace, project, ...)
"read_dataset()"->>"Target Namespace": Lookup dataset in specified namespace/project
"read_dataset()"-->>Caller: Return dataset from correct namespace
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- Consider using pytest.mark.parametrize for the three dataset‐name/namespace cases in test_delta_returns_correct_dataset_on_no_changes to reduce manual loops and improve readability.
- The custom _get_short_ds_name helper duplicates naming logic—see if you can reuse or wrap an existing catalog/metastore method to keep dataset qualification consistent and avoid subtle drift.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider using pytest.mark.parametrize for the three dataset‐name/namespace cases in test_delta_returns_correct_dataset_on_no_changes to reduce manual loops and improve readability.
- The custom _get_short_ds_name helper duplicates naming logic—see if you can reuse or wrap an existing catalog/metastore method to keep dataset qualification consistent and avoid subtle drift.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1326 +/- ##
=======================================
Coverage 88.84% 88.84%
=======================================
Files 155 155
Lines 14240 14240
Branches 2025 2025
=======================================
Hits 12652 12652
Misses 1124 1124
Partials 464 464
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a critical bug in the delta save functionality where the system would incorrectly read from the default namespace instead of the target namespace when there are no changes on subsequent runs. This could lead to using wrong data when datasets with the same name exist in multiple namespaces.
Key Changes
- Fix delta save to properly pass namespace and project parameters when reading existing datasets on no-change scenarios
- Update test utilities to handle fully qualified dataset names and namespace/project context
- Add comprehensive test coverage for the no-change delta behavior across different namespace and project combinations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/datachain/lib/dc/datachain.py | Fixes the core bug by passing namespace and project parameters to read_dataset |
| tests/func/test_delta.py | Adds helper functions and comprehensive tests to verify the fix works across multiple namespace scenarios |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
dreadatour
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me 👍
| from datachain.lib.file import File, ImageFile | ||
|
|
||
|
|
||
| def _get_short_ds_name(catalog, name, project_name, namespace_name) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why _get_short_ds_name? IMO it should be something like _get_full_ds_name 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is trying to get the shortest name (check if namespace passed is default and project is default and doesn't use them)
Fixes the scenario in delta case:
Quite serious issue since it can lead to usage of some wrong data
Summary by Sourcery
Ensure delta save with no changes returns the correct dataset version from the intended namespace and project instead of defaulting to the default namespace
Bug Fixes:
Enhancements:
Tests: