
feat(datasets): accept pep version spec in read_dataset#1178

Merged
shcheklein merged 10 commits into main from version-specs-support
Jun 29, 2025
Conversation

@shcheklein
Contributor

@shcheklein shcheklein commented Jun 26, 2025

Add a way to specify a PEP 440 version spec in read_dataset to find and match the most recent compatible version.

Among other things:

  • I've dropped the fallback_to_studio flag as part of this PR (because I needed to change how we fetch and resolve dataset versions against Studio)
  • Added read_dataset(update=True) to always try to fetch the latest version from Studio that satisfies the version spec

TODO:

  • Add support for local dataset version resolving
  • Add more tests for update=True and remote dataset version resolving
  • Update docs
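The resolution described above can be sketched with `packaging` directly. This is a simplified stand-in, not the PR's actual code: the real read_dataset works against dataset version records and Studio lookups, and the helper name here is illustrative.

```python
from typing import Optional

from packaging.specifiers import InvalidSpecifier, SpecifierSet
from packaging.version import Version

def latest_compatible(versions: list[str], spec: str) -> Optional[str]:
    """Pick the highest version satisfying a PEP 440 specifier."""
    try:
        specifier = SpecifierSet(spec)
    except InvalidSpecifier as exc:
        raise ValueError(f"invalid version specifier: {spec!r}") from exc
    compatible = [v for v in versions if Version(v) in specifier]
    if not compatible:
        return None
    # Compare as versions, not strings, to get the true maximum
    return max(compatible, key=Version)

latest_compatible(["1.0.0", "1.2.0", "2.0.0"], ">=1.0,<2.0")  # → "1.2.0"
```

With update=True, the same selection would simply run over the version list freshly fetched from Studio rather than the locally cached one.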

@sourcery-ai

This comment was marked as outdated.

@shcheklein shcheklein requested a review from a team June 26, 2025 04:10
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Jun 26, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 105a2e4
Status: ✅  Deploy successful!
Preview URL: https://a22d7e21.datachain-documentation.pages.dev
Branch Preview URL: https://version-specs-support.datachain-documentation.pages.dev


@shcheklein shcheklein force-pushed the version-specs-support branch 3 times, most recently from 18913a3 to ba45bb0 on June 27, 2025 23:08
@shcheklein shcheklein marked this pull request as ready for review June 27, 2025 23:09
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey @shcheklein - I've reviewed your changes - here's some feedback:

  • In latest_compatible_version, using max(compatible_versions) directly may not pick the highest semantic version—use something like max(compatible_versions, key=lambda r: Version(r.version)) to compare with packaging.Version.
  • Consider catching packaging.InvalidVersion in latest_compatible_version so that any malformed version strings on the dataset side don’t cause an unexpected crash when parsing.
  • The DatasetVersionNotFoundError message references the original version argument even after it’s converted to a specifier—switch to using the actual specifier string (e.g. version_spec) in the error message to avoid confusion.
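The string-comparison pitfall in the first bullet is easy to demonstrate. In this pure-Python sketch, `version_key` is a simplified stand-in for `packaging.version.Version`:

```python
# Lexicographic max() compares character by character, so "1.9.0"
# beats "1.10.0" because '9' > '1' at the third character.
assert max(["1.9.0", "1.10.0"]) == "1.9.0"  # wrong semantically

def version_key(v: str) -> tuple[int, ...]:
    # Simplified stand-in for packaging.version.Version
    return tuple(int(part) for part in v.split("."))

# Comparing numeric tuples gives the true semantic maximum.
assert max(["1.9.0", "1.10.0"], key=version_key) == "1.10.0"
```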
## Individual Comments

### Comment 1
<location> `src/datachain/lib/dc/datasets.py:171` </location>
<code_context>
+        else:
+            version_spec = str(version)
+
+        from packaging.specifiers import InvalidSpecifier, SpecifierSet
+
         try:
</code_context>

<issue_to_address>
Importing inside the function may have performance implications.

Consider moving the import of `InvalidSpecifier` and `SpecifierSet` to the module level to prevent repeated imports if the function is called often.
</issue_to_address>
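The suggested change might look like this (an illustrative sketch; in the PR the parsing lives inside read_dataset, and the helper name here is hypothetical):

```python
# Module level: imported once when the module is first loaded,
# instead of on every call to the function.
from packaging.specifiers import InvalidSpecifier, SpecifierSet

def resolve_spec(version: str) -> SpecifierSet:
    # Illustrative helper; parse the spec once, surfacing a clear error
    try:
        return SpecifierSet(version)
    except InvalidSpecifier as exc:
        raise ValueError(f"invalid version specifier: {version!r}") from exc
```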

### Comment 2
<location> `tests/func/test_read_dataset_version_specifiers.py:50` </location>
<code_context>
+    assert result.to_values("dataset_version")[0] == "1.2.0"
+
+
+def test_read_dataset_version_specifiers_no_match(test_session):
+    """Test read_dataset with version specifiers that don't match any version."""
+    # Create a dataset with a single version
+    dataset_name = "test_no_match_specifiers"
+
+    (
+        dc.read_values(data=[1, 2], session=test_session)
+        .mutate(dataset_version="1.0.0")
+        .save(dataset_name, version="1.0.0")
+    )
+
+    # Test version specifier that doesn't match any existing version
</code_context>

<issue_to_address>
Consider testing the behavior when the dataset exists but has no versions at all.

Adding a test for datasets with no versions will help verify that the correct exception and error message are produced in this scenario.

Suggested implementation:

```python

def test_read_dataset_with_no_versions(test_session):
    """Test read_dataset when the dataset exists but has no versions."""
    dataset_name = "test_no_versions"

    # Create an empty dataset entry (no versions saved)
    # Assuming there is a way to register a dataset without saving a version.
    # If not, this may need to be adjusted to fit the actual API.
    dc.create_dataset(dataset_name, session=test_session)

    with pytest.raises(DatasetVersionNotFoundError) as exc_info:
        dc.read_dataset(dataset_name, version="*", session=test_session)

    assert (
        f"No dataset {dataset_name} version matching specifier *"
        in str(exc_info.value)
    )


def test_read_dataset_version_specifiers_exact_version(test_session):

```

- If `dc.create_dataset` does not exist, replace it with the appropriate method to register a dataset without any versions, or mock the dataset registration as needed for your codebase.
- Ensure that `DatasetVersionNotFoundError` is imported if not already present.
</issue_to_address>


@shcheklein shcheklein force-pushed the version-specs-support branch from 7abfd99 to 7eb29e2 on June 28, 2025 22:53
@codecov

codecov bot commented Jun 28, 2025

Codecov Report

Attention: Patch coverage is 97.56098% with 1 line in your changes missing coverage. Please review.

Project coverage is 88.72%. Comparing base (1692ee3) to head (105a2e4).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/datachain/catalog/catalog.py 92.85% 0 Missing and 1 partial ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1178      +/-   ##
==========================================
+ Coverage   88.71%   88.72%   +0.01%     
==========================================
  Files         152      152              
  Lines       13535    13545      +10     
  Branches     1879     1885       +6     
==========================================
+ Hits        12007    12018      +11     
  Misses       1086     1086              
+ Partials      442      441       -1     
Flag Coverage Δ
datachain 88.66% <97.56%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
src/datachain/dataset.py 86.99% <100.00%> (+0.26%) ⬆️
src/datachain/lib/dc/datasets.py 94.66% <100.00%> (+2.56%) ⬆️
src/datachain/lib/dc/listings.py 88.88% <100.00%> (ø)
src/datachain/lib/projects.py 100.00% <ø> (ø)
src/datachain/query/dataset.py 93.60% <ø> (-0.04%) ⬇️
src/datachain/remote/studio.py 80.80% <100.00%> (-0.42%) ⬇️
src/datachain/catalog/catalog.py 86.04% <92.85%> (+0.12%) ⬆️

@shcheklein shcheklein requested a review from ilongin June 28, 2025 23:29
Contributor

@dmpetrov dmpetrov left a comment


LGTM - but I checked only the user API, not the implementation.
Approving so as not to block anyone.


```py
chain = dc.read_dataset("my_cats", fallback_to_studio=False)
chain = dc.read_dataset("my_cats", version="1.0.0")
```

why don't we make this a part of the dataset name? like my_cats@1.0.0 or package_name>=1.0,<2.0

So, we can give up a whole parameter from almost every API call 🙂
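If that syntax were adopted, the parsing could be as simple as the following. This is a hypothetical helper, not part of this PR, and it handles only the `@` form of the suggestion:

```python
from typing import Optional

def split_name_and_spec(ref: str) -> tuple[str, Optional[str]]:
    """Split 'my_cats@>=1.0,<2.0' into ('my_cats', '>=1.0,<2.0').

    A bare name yields ('my_cats', None), so existing call sites
    without a version would keep working unchanged.
    """
    name, sep, spec = ref.partition("@")
    return name, (spec if sep else None)

split_name_and_spec("my_cats@1.0.0")  # → ("my_cats", "1.0.0")
```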

@shcheklein shcheklein requested a review from a team June 29, 2025 19:23
@shcheklein shcheklein merged commit 2b729de into main Jun 29, 2025
58 of 59 checks passed
@shcheklein shcheklein deleted the version-specs-support branch June 29, 2025 21:10
