@shcheklein shcheklein commented Sep 9, 2025

Fixes the following scenario in the delta case:

  • No changes / second run
  • The output dataset name exists in the default namespace as well as in the target namespace
  • Delta silently reads the dataset from the default namespace instead of the proper one (in the target namespace)

This is a serious issue, since it can lead to the wrong data being used.
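
To make the scenario concrete, here is a minimal, self-contained sketch of the failure mode. It does not use DataChain itself: the `REGISTRY` dictionary, the `lookup` function, the two `save_no_changes_*` functions, and the `"local"` / `"dev"` / `"analytics"` names are all hypothetical stand-ins, assumed only for illustration.

```python
# Toy model of namespace-aware dataset lookup (not DataChain's real API).
# Datasets are keyed by (namespace, project, name); all names are hypothetical.
DEFAULT_NAMESPACE = "local"
DEFAULT_PROJECT = "local"

REGISTRY = {
    ("local", "local", "results"): "rows from the default namespace",
    ("dev", "analytics", "results"): "rows from the target namespace",
}


def lookup(name, namespace=None, project=None):
    """Return the dataset stored under the given namespace/project.

    Missing arguments fall back to the defaults, which is what makes
    forgetting to pass them dangerous.
    """
    key = (namespace or DEFAULT_NAMESPACE, project or DEFAULT_PROJECT, name)
    return REGISTRY[key]


def save_no_changes_buggy(name, namespace, project):
    # Bug: the target namespace/project are dropped, so the lookup
    # silently resolves against the defaults.
    return lookup(name)


def save_no_changes_fixed(name, namespace, project):
    # Fix: forward the target namespace/project to the lookup.
    return lookup(name, namespace=namespace, project=project)


buggy = save_no_changes_buggy("results", "dev", "analytics")
fixed = save_no_changes_fixed("results", "dev", "analytics")
print(buggy)
print(fixed)
```

No error is raised in the buggy variant because a dataset with the same name happens to exist under the defaults, which is why the problem goes unnoticed.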

Summary by Sourcery

Ensure delta save with no changes returns the correct dataset version from the intended namespace and project instead of defaulting to the default namespace

Bug Fixes:

  • Pass namespace and project parameters to read_dataset when handling no-change delta saves to avoid reading from the default namespace

Enhancements:

  • Update dependency retrieval to include project and namespace context and generate properly qualified dataset names

Tests:

  • Introduce helper _get_short_ds_name and update existing tests to use fully qualified names
  • Add test to verify no-change delta behavior across multiple namespace and project combinations

@shcheklein shcheklein requested review from a team and ilongin September 9, 2025 01:56

sourcery-ai bot commented Sep 9, 2025

Reviewer's Guide

This PR fixes delta datasets incorrectly reading from the default namespace when there are no changes by qualifying dataset names with project and namespace in both test and production code, and adds comprehensive tests to validate this behavior across various namespace/project combinations.

Sequence diagram for reading a dataset with qualified namespace and project

sequenceDiagram
participant Caller
participant "read_dataset()"
Caller->>"read_dataset()": read_dataset(name, namespace, project, ...)
"read_dataset()"->>"Target Namespace": Lookup dataset in specified namespace/project
"read_dataset()"-->>Caller: Return dataset from correct namespace

File-Level Changes

Change: Introduce dataset name qualification in tests
Details:
  • Added _get_short_ds_name helper to format names based on default/target namespace/project
  • Updated _get_dependencies to use the helper and pass project_name/namespace_name to dependency lookup
Files: tests/func/test_delta.py

Change: Extend delta tests for no-change scenarios
Details:
  • Changed existing tests to use qualified dataset names
  • Added test_delta_returns_correct_dataset_on_no_changes covering default and custom namespaces
Files: tests/func/test_delta.py

Change: Ensure correct namespace/project is used in save when no delta changes
Details:
  • Updated save method to pass namespace and project to read_dataset on no-change path
Files: src/datachain/lib/dc/datachain.py

Possibly linked issues

  • Initial DataChain Commit #1: PR fixes delta updates by ensuring correct dataset is read from specified namespace/project when no changes occur.
  • #0: The PR fixes a bug where delta incorrectly resolves dataset names to the default namespace instead of the intended qualified one, directly addressing a core problem the issue aims to prevent.

@shcheklein shcheklein self-assigned this Sep 9, 2025
@shcheklein shcheklein added the bug Something isn't working label Sep 9, 2025

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Consider using pytest.mark.parametrize for the three dataset‐name/namespace cases in test_delta_returns_correct_dataset_on_no_changes to reduce manual loops and improve readability.
  • The custom _get_short_ds_name helper duplicates naming logic—see if you can reuse or wrap an existing catalog/metastore method to keep dataset qualification consistent and avoid subtle drift.
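
As a rough sketch of the first suggestion, the cases could be expressed with `pytest.mark.parametrize`. The test body, the case values, the default names, and the `qualify` helper below are hypothetical placeholders, not the code from this PR; only `pytest.mark.parametrize` itself is a real pytest API.

```python
import pytest

# Hypothetical default names, assumed only for this sketch.
DEFAULT_NAMESPACE = "local"
DEFAULT_PROJECT = "local"


def qualify(name, namespace, project):
    """Hypothetical helper: build a 'namespace.project.name' string."""
    return f"{namespace}.{project}.{name}"


@pytest.mark.parametrize(
    "namespace, project",
    [
        (DEFAULT_NAMESPACE, DEFAULT_PROJECT),  # everything default
        ("dev", DEFAULT_PROJECT),              # custom namespace only
        ("dev", "analytics"),                  # custom namespace and project
    ],
)
def test_qualified_name_keeps_namespace_and_project(namespace, project):
    full_name = qualify("results", namespace, project)
    # Each case runs as its own test, so a failure names the exact combination.
    assert full_name.split(".") == [namespace, project, "results"]
```

Compared with a manual loop inside one test, each parameter set is reported separately, so one failing combination does not hide the others.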

codecov bot commented Sep 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.84%. Comparing base (91617c0) to head (cc2a36f).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1326   +/-   ##
=======================================
  Coverage   88.84%   88.84%           
=======================================
  Files         155      155           
  Lines       14240    14240           
  Branches     2025     2025           
=======================================
  Hits        12652    12652           
  Misses       1124     1124           
  Partials      464      464           
Flag         Coverage Δ
datachain    88.78% <100.00%> (ø)

Files with missing lines             Coverage Δ
src/datachain/lib/dc/datachain.py    91.14% <100.00%> (ø)

@shcheklein shcheklein requested a review from Copilot September 9, 2025 02:26

Copilot AI left a comment

Pull Request Overview

This PR fixes a critical bug in the delta save functionality where the system would incorrectly read from the default namespace instead of the target namespace when there are no changes on subsequent runs. This could lead to using wrong data when datasets with the same name exist in multiple namespaces.

Key Changes

  • Fix delta save to properly pass namespace and project parameters when reading existing datasets on no-change scenarios
  • Update test utilities to handle fully qualified dataset names and namespace/project context
  • Add comprehensive test coverage for the no-change delta behavior across different namespace and project combinations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • src/datachain/lib/dc/datachain.py: Fixes the core bug by passing namespace and project parameters to read_dataset
  • tests/func/test_delta.py: Adds helper functions and comprehensive tests to verify the fix works across multiple namespace scenarios

@dreadatour dreadatour left a comment

Looks good to me 👍

from datachain.lib.file import File, ImageFile


def _get_short_ds_name(catalog, name, project_name, namespace_name) -> str:

Why _get_short_ds_name? IMO it should be something like _get_full_ds_name 🤔

Author's reply: It tries to get the shortest name: it checks whether the namespace and the project passed in are both the defaults and, if so, omits them.
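
A minimal sketch of that idea, under stated assumptions: the function name, its parameters, the `"local"` default values, and the `namespace.project.name` format are illustrative guesses, not the actual `_get_short_ds_name` helper from `tests/func/test_delta.py` (which also takes a `catalog` argument).

```python
def shortest_ds_name(name, project_name, namespace_name,
                     default_project="local", default_namespace="local"):
    """Return the bare name when both project and namespace are the defaults,
    otherwise a fully qualified 'namespace.project.name' string.

    Hypothetical illustration; defaults and name format are assumptions.
    """
    if namespace_name == default_namespace and project_name == default_project:
        # Nothing to disambiguate: the short form is enough.
        return name
    return f"{namespace_name}.{project_name}.{name}"


print(shortest_ds_name("results", "local", "local"))
print(shortest_ds_name("results", "analytics", "dev"))
```

Read this way, "short" refers to the result being the shortest unambiguous form rather than an always-unqualified name.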

@shcheklein shcheklein merged commit 8355e28 into main Sep 10, 2025
61 of 63 checks passed
@shcheklein shcheklein deleted the fix-delta-no-changes-read branch September 10, 2025 03:07