-
Notifications
You must be signed in to change notification settings - Fork 134
Fixing delta updates #1194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing delta updates #1194
Conversation
Reviewer's GuideThis PR enhances delta update logic to correctly handle non-default project contexts by propagating namespace and project parameters through delta functions, updating dataset read and catalog queries, extending source info retrieval, amending DataChain.save, and adjusting tests accordingly. Sequence diagram for delta_retry_update with project contextsequenceDiagram
participant User as actor User
participant DataChain
participant Catalog
participant Metastore
participant Project
participant Dataset
User->>DataChain: call delta_retry_update(namespace_name, project_name, name, ...)
DataChain->>Catalog: session.catalog
DataChain->>Metastore: get_project(project_name, namespace_name)
Metastore-->>DataChain: Project
DataChain->>Catalog: get_dataset(name, project=Project)
Catalog-->>DataChain: Dataset
DataChain->>Catalog: get_dataset_dependencies(name, latest_version, project=Project, indirect=False)
Catalog-->>DataChain: dependencies
DataChain->>Catalog: get_dataset(source_ds_name, project=source_ds_project)
Catalog-->>DataChain: source Dataset
DataChain->>...: (proceeds with delta/retry logic)
Class diagram for updated delta update logic with project contextclassDiagram
class DataChain {
+save(namespace_name, project_name, name, ...)
}
class Catalog {
+get_dataset(name, project=None)
+get_dataset_dependencies(name, version, project=None, indirect=False)
metastore: Metastore
}
class Metastore {
+get_project(project_name, namespace_name)
}
class Project {
name
namespace: Namespace
}
class Namespace {
name
}
DataChain --> Catalog : uses
Catalog --> Metastore : uses
Metastore --> Project : returns
Project --> Namespace : has
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ilongin - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/datachain/delta.py:241` </location>
<code_context>
if source_ds_name is None:
return None, None, True
+ assert source_ds_project
assert source_ds_version
assert source_ds_latest_version
</code_context>
<issue_to_address>
Use of assert for runtime validation may be bypassed in optimized mode.
Asserts are skipped with Python optimizations. Raise an exception instead if source_ds_project is essential.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
assert source_ds_project
=======
if not source_ds_project:
raise ValueError("source_ds_project must be set")
>>>>>>> REPLACE
</suggested_fix>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| if source_ds_name is None: | ||
| return None, None, True | ||
|
|
||
| assert source_ds_project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): Use of assert for runtime validation may be bypassed in optimized mode.
Asserts are skipped with Python optimizations. Raise an exception instead if source_ds_project is essential.
| assert source_ds_project | |
| if not source_ds_project: | |
| raise ValueError("source_ds_project must be set") |
tests/func/test_delta.py
Outdated
| if project: | ||
| starting_ds_name = f"{project}.starting_ds" | ||
| else: | ||
| starting_ds_name = "local.local.starting_ds" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
Deploying datachain-documentation with
|
| Latest commit: |
248ea1e
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://6ccd68f1.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-1193-fix-delta-updat.datachain-documentation.pages.dev |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1194 +/- ##
=======================================
Coverage 88.72% 88.72%
=======================================
Files 152 152
Lines 13545 13549 +4
Branches 1885 1885
=======================================
+ Hits 12018 12022 +4
Misses 1086 1086
Partials 441 441
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Fix for delta updates when non default project is used for source dataset
Summary by Sourcery
Fix delta update logic to correctly handle non-default projects by passing explicit project and namespace context to dataset operations and update functions.
Bug Fixes:
read_datasetcalls in_get_delta_chain,_get_retry_chain,_get_source_info, anddelta_retry_updateto support datasets in non-default projects_get_source_infoDataChain.saveto forwardnamespace_nameandproject_nametodelta_retry_updateTests: