ilongin commented Jun 30, 2025

Fix for delta updates when a non-default project is used for the source dataset.

Summary by Sourcery

Fix delta update logic to correctly handle non-default projects by passing explicit project and namespace context to dataset operations and update functions.

Bug Fixes:

  • Pass project and namespace parameters to all read_dataset calls in _get_delta_chain, _get_retry_chain, _get_source_info, and delta_retry_update to support datasets in non-default projects
  • Include project context when retrieving dataset dependencies and latest versions in _get_source_info
  • Update DataChain.save to forward namespace_name and project_name to delta_retry_update

Tests:

  • Parametrize functional delta update test to run against both default and non-default projects
  • Adjust test helper to format dependency names with namespace and project
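
To make the failure mode concrete, here is a minimal sketch of the pattern the summary describes. It does not use DataChain: `STORE`, `read`, `delta_before`, and `delta_after` are hypothetical stand-ins, and the `dev` / `analytics` names are invented. The `local` defaults are an assumption inferred from the `local.local.starting_ds` name used in this PR's test.

```python
from typing import Optional

# Assumption: defaults inferred from "local.local.starting_ds" in the test.
DEFAULT_NAMESPACE = "local"
DEFAULT_PROJECT = "local"

# Toy store keyed by (namespace, project, dataset name).
STORE = {
    ("local", "local", "starting_ds"): [1, 2, 3],
    ("dev", "analytics", "starting_ds"): [10, 20],
}


def read(name: str, namespace: Optional[str] = None, project: Optional[str] = None) -> list:
    """Look a dataset up, falling back to the default namespace and project."""
    key = (namespace or DEFAULT_NAMESPACE, project or DEFAULT_PROJECT, name)
    return STORE[key]


def delta_before(name: str, namespace: Optional[str] = None, project: Optional[str] = None) -> list:
    # Bug pattern: the context is accepted but never forwarded,
    # so the read silently resolves against the default project.
    return read(name)


def delta_after(name: str, namespace: Optional[str] = None, project: Optional[str] = None) -> list:
    # Fix pattern: forward the context on every read.
    return read(name, namespace=namespace, project=project)
```

With these stand-ins, `delta_before("starting_ds", "dev", "analytics")` returns the rows of the default project's dataset, while `delta_after` with the same arguments returns the rows stored under `dev.analytics`.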

sourcery-ai bot commented Jun 30, 2025

Reviewer's Guide

This PR enhances delta update logic to correctly handle non-default project contexts by propagating namespace and project parameters through delta functions, updating dataset read and catalog queries, extending source info retrieval, amending DataChain.save, and adjusting tests accordingly.

Sequence diagram for delta_retry_update with project context

sequenceDiagram
    participant User as actor User
    participant DataChain
    participant Catalog
    participant Metastore
    participant Project
    participant Dataset

    User->>DataChain: call delta_retry_update(namespace_name, project_name, name, ...)
    DataChain->>Catalog: session.catalog
    DataChain->>Metastore: get_project(project_name, namespace_name)
    Metastore-->>DataChain: Project
    DataChain->>Catalog: get_dataset(name, project=Project)
    Catalog-->>DataChain: Dataset
    DataChain->>Catalog: get_dataset_dependencies(name, latest_version, project=Project, indirect=False)
    Catalog-->>DataChain: dependencies
    DataChain->>Catalog: get_dataset(source_ds_name, project=source_ds_project)
    Catalog-->>DataChain: source Dataset
    DataChain->>...: (proceeds with delta/retry logic)

Class diagram for updated delta update logic with project context

classDiagram
    class DataChain {
        +save(namespace_name, project_name, name, ...)
    }
    class Catalog {
        +get_dataset(name, project=None)
        +get_dataset_dependencies(name, version, project=None, indirect=False)
        metastore: Metastore
    }
    class Metastore {
        +get_project(project_name, namespace_name)
    }
    class Project {
        name
        namespace: Namespace
    }
    class Namespace {
        name
    }
    DataChain --> Catalog : uses
    Catalog --> Metastore : uses
    Metastore --> Project : returns
    Project --> Namespace : has
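
The sequence and class diagrams above can be approximated with in-memory stand-ins. This is a sketch under stated assumptions, not DataChain's implementation: the class and method names follow the diagrams, but `default_project`, `source_of`, the version strings, and the `dev` / `analytics` names are hypothetical, and `get_dataset` returns only a version string to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Namespace:
    name: str


@dataclass(frozen=True)
class Project:
    name: str
    namespace: Namespace


class Metastore:
    def get_project(self, project_name: str, namespace_name: str) -> Project:
        return Project(project_name, Namespace(namespace_name))


@dataclass
class Catalog:
    metastore: Metastore
    default_project: Project  # hypothetical fallback used when project is None
    datasets: dict = field(default_factory=dict)      # (Project, name) -> latest version
    dependencies: dict = field(default_factory=dict)  # (Project, name, version) -> [(Project, name)]

    def get_dataset(self, name: str, project: Optional[Project] = None) -> str:
        project = self.default_project if project is None else project
        return self.datasets[(project, name)]

    def get_dataset_dependencies(self, name, version, project=None, indirect=False) -> list:
        project = self.default_project if project is None else project
        return self.dependencies.get((project, name, version), [])


def source_of(catalog: Catalog, namespace_name: str, project_name: str, name: str):
    """Follow the diagram: resolve the project, then the dataset, then its source."""
    project = catalog.metastore.get_project(project_name, namespace_name)
    latest_version = catalog.get_dataset(name, project=project)
    deps = catalog.get_dataset_dependencies(name, latest_version, project=project, indirect=False)
    source_project, source_name = deps[0]
    # The source dataset is looked up in its own project, not the caller's.
    return source_project, source_name, catalog.get_dataset(source_name, project=source_project)


metastore = Metastore()
default = metastore.get_project("local", "local")
dev = metastore.get_project("analytics", "dev")
catalog = Catalog(metastore, default)
catalog.datasets[(dev, "starting_ds")] = "1.0.2"
catalog.datasets[(dev, "delta_ds")] = "1.0.0"
catalog.dependencies[(dev, "delta_ds", "1.0.0")] = [(dev, "starting_ds")]
```

In this toy setup both datasets live only in `dev.analytics`, so any lookup that omits `project=` falls back to `local.local` and raises `KeyError`, which mirrors why the project has to be carried through each call.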

File-Level Changes

Change: Propagate project context through delta function signatures
Details:
  • Add project or source_ds_project parameters to _get_delta_chain, _get_retry_chain, _get_source_info, and delta_retry_update
  • Update internal calls to pass the new project parameters
Files: src/datachain/delta.py

Change: Include namespace and project in dataset read and catalog queries
Details:
  • Enhance datachain.read_dataset invocations to accept namespace and project arguments
  • Modify catalog.get_dataset and catalog.get_dataset_dependencies calls to use project context
Files: src/datachain/delta.py

Change: Extend source info retrieval to return project context
Details:
  • Fetch the source dataset's project using catalog.metastore.get_project
  • Update _get_source_info to include source_ds_project in its return tuple
Files: src/datachain/delta.py

Change: Forward project context from DataChain.save to delta_retry_update
Details:
  • Add namespace_name and project_name parameters to the save method
  • Pass these parameters when invoking delta_retry_update
Files: src/datachain/lib/dc/datachain.py

Change: Adjust functional tests to cover non-default project scenarios
Details:
  • Parametrize test_delta_update_from_dataset over a project value
  • Compute starting_ds_name dynamically based on the project parameter
Files: tests/func/test_delta.py

sourcery-ai bot left a comment

Hey @ilongin - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/delta.py:241` </location>
<code_context>
     if source_ds_name is None:
         return None, None, True

+    assert source_ds_project
     assert source_ds_version
     assert source_ds_latest_version
</code_context>

<issue_to_address>
Use of assert for runtime validation may be bypassed in optimized mode.

Asserts are skipped with Python optimizations. Raise an exception instead if source_ds_project is essential.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    assert source_ds_project
=======
    if not source_ds_project:
        raise ValueError("source_ds_project must be set")
>>>>>>> REPLACE

</suggested_fix>


    if source_ds_name is None:
        return None, None, True

    assert source_ds_project
suggestion (bug_risk): Use of assert for runtime validation may be bypassed in optimized mode.

Asserts are skipped with Python optimizations. Raise an exception instead if source_ds_project is essential.

Suggested change
-    assert source_ds_project
+    if not source_ds_project:
+        raise ValueError("source_ds_project must be set")
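
The difference between the two forms can be checked directly. The sketch below runs each check in a child interpreter, with and without `-O`, and compares exit codes. `ASSERT_CHECK`, `RAISE_CHECK`, and `exit_code` are names made up for this illustration. The `-E` flag is added so that a `PYTHONOPTIMIZE` environment variable cannot affect the result.

```python
import subprocess
import sys

# The check as written in the PR, with the value forced to None.
ASSERT_CHECK = "source_ds_project = None\nassert source_ds_project\n"

# The check as suggested in the review.
RAISE_CHECK = (
    "source_ds_project = None\n"
    "if not source_ds_project:\n"
    "    raise ValueError('source_ds_project must be set')\n"
)


def exit_code(snippet: str, optimized: bool) -> int:
    """Run a snippet in a fresh interpreter and return its exit status."""
    flags = ["-O"] if optimized else []
    result = subprocess.run(
        [sys.executable, "-E", *flags, "-c", snippet],
        capture_output=True,
    )
    return result.returncode
```

Without `-O`, both snippets exit with a non-zero status. With `-O`, the assert snippet exits with 0 because the assert statement is compiled away, while the explicit `raise` still fails.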

Comment on lines 27 to 30
    if project:
        starting_ds_name = f"{project}.starting_ds"
    else:
        starting_ds_name = "local.local.starting_ds"
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation

Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that, avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests
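
One way to apply this to the flagged lines is to move the name construction into a helper and list the expected value next to each input. The following is a sketch only: `starting_ds_name_for` and `CASES` are hypothetical names, `dev.my-project` is an invented project value, and the `local.local` default is taken from the else branch of the flagged test code.

```python
from typing import Optional

# Assumption: the default location is "local.local", as in the flagged test code.
DEFAULT_PROJECT = "local.local"


def starting_ds_name_for(project: Optional[str], name: str = "starting_ds") -> str:
    """Hypothetical test helper: build the fully qualified dataset name."""
    if project:
        return f"{project}.{name}"
    return f"{DEFAULT_PROJECT}.{name}"


# Each case pairs the input with the expected name, so a test body can
# compare values directly instead of branching. With pytest, these tuples
# could be passed to pytest.mark.parametrize.
CASES = [
    (None, "local.local.starting_ds"),
    ("dev.my-project", "dev.my-project.starting_ds"),  # hypothetical project value
]
```

The conditional still exists, but it now sits in a small helper that can be checked on its own, and the test body reads as a straight-line comparison.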

cloudflare-workers-and-pages bot commented Jun 30, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 248ea1e
Status: ✅  Deploy successful!
Preview URL: https://6ccd68f1.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1193-fix-delta-updat.datachain-documentation.pages.dev

codecov bot commented Jun 30, 2025

Codecov Report

Attention: Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.

Project coverage is 88.72%. Comparing base (8e6b5b6) to head (248ea1e).
Report is 1 commit behind head on main.

Files with missing lines    Patch %    Lines
src/datachain/delta.py      92.30%     1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1194   +/-   ##
=======================================
  Coverage   88.72%   88.72%           
=======================================
  Files         152      152           
  Lines       13545    13549    +4     
  Branches     1885     1885           
=======================================
+ Hits        12018    12022    +4     
  Misses       1086     1086           
  Partials      441      441           
Flag         Coverage Δ
datachain    88.66% <92.30%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines             Coverage Δ
src/datachain/lib/dc/datachain.py    89.82% <ø> (ø)
src/datachain/delta.py               92.77% <92.30%> (+0.36%) ⬆️
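
The percentages in this report can be reproduced from the counts it lists. The sketch below assumes coverage is computed as hits divided by lines. The 12-of-13 patch figure is inferred from "92.30769% with 1 line missing" and is not stated directly in the report.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


# Patch coverage: 1 missing line at 92.30769% implies 12 of 13 lines covered.
patch = coverage_percent(12, 13)

# Project coverage from the Hits and Lines rows of the coverage diff.
base = coverage_percent(12018, 13545)
head = coverage_percent(12022, 13549)
```

Here `head - base` comes to roughly 0.003 percentage points, so both columns show 88.72% once truncated to two decimals.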

ilongin merged commit c18e0c9 into main Jun 30, 2025
35 checks passed
ilongin deleted the ilongin/1193-fix-delta-updates branch June 30, 2025 15:56