
feat(datasets): accept pep version spec in read_dataset#1178

Merged
shcheklein merged 10 commits into main from version-specs-support
Jun 29, 2025
Conversation

@shcheklein
Contributor

@shcheklein shcheklein commented Jun 26, 2025

Add a way to specify a PEP 440 version spec in read_dataset to find and match the most recent compatible version.

Among other things:

  • I've dropped the fallback_to_studio flag as part of this PR (because I needed to change how we fetch and resolve dataset versions against Studio)
  • Added read_dataset(update=True) to always try to fetch the latest version from Studio that satisfies the version spec

TODO:

  • Add support for local dataset version resolving
  • Add more tests for update=True and remote dataset version resolving
  • Update docs
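The resolution described above can be sketched with `packaging` directly. This is a simplified stand-in, not the PR's actual code: the real read_dataset works against dataset version records and Studio lookups, and the helper name here is illustrative.

```python
from typing import Optional

from packaging.specifiers import InvalidSpecifier, SpecifierSet
from packaging.version import Version

def latest_compatible(versions: list[str], spec: str) -> Optional[str]:
    """Pick the highest version satisfying a PEP 440 specifier."""
    try:
        specifier = SpecifierSet(spec)
    except InvalidSpecifier as exc:
        raise ValueError(f"invalid version specifier: {spec!r}") from exc
    compatible = [v for v in versions if Version(v) in specifier]
    if not compatible:
        return None
    # Compare as versions, not strings, to get the true maximum
    return max(compatible, key=Version)

latest_compatible(["1.0.0", "1.2.0", "2.0.0"], ">=1.0,<2.0")  # → "1.2.0"
```

With update=True, the same selection would simply run over the version list freshly fetched from Studio rather than the locally cached one.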

@sourcery-ai

This comment was marked as outdated.

@shcheklein shcheklein requested a review from a team June 26, 2025 04:10
@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Jun 26, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 105a2e4
Status: ✅  Deploy successful!
Preview URL: https://a22d7e21.datachain-documentation.pages.dev
Branch Preview URL: https://version-specs-support.datachain-documentation.pages.dev


@shcheklein shcheklein force-pushed the version-specs-support branch 3 times, most recently from 18913a3 to ba45bb0 on June 27, 2025 23:08
@shcheklein shcheklein marked this pull request as ready for review June 27, 2025 23:09
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey @shcheklein - I've reviewed your changes - here's some feedback:

  • In latest_compatible_version, using max(compatible_versions) directly may not pick the highest semantic version—use something like max(compatible_versions, key=lambda r: Version(r.version)) to compare with packaging.Version.
  • Consider catching packaging.InvalidVersion in latest_compatible_version so that any malformed version strings on the dataset side don’t cause an unexpected crash when parsing.
  • The DatasetVersionNotFoundError message references the original version argument even after it’s converted to a specifier—switch to using the actual specifier string (e.g. version_spec) in the error message to avoid confusion.
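The string-comparison pitfall in the first bullet is easy to demonstrate. In this pure-Python sketch, `version_key` is a simplified stand-in for `packaging.version.Version`:

```python
# Lexicographic max() compares character by character, so "1.9.0"
# beats "1.10.0" because '9' > '1' at the third character.
assert max(["1.9.0", "1.10.0"]) == "1.9.0"  # wrong semantically

def version_key(v: str) -> tuple[int, ...]:
    # Simplified stand-in for packaging.version.Version
    return tuple(int(part) for part in v.split("."))

# Comparing numeric tuples gives the true semantic maximum.
assert max(["1.9.0", "1.10.0"], key=version_key) == "1.10.0"
```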
## Individual Comments

### Comment 1
<location> `src/datachain/lib/dc/datasets.py:171` </location>
<code_context>
+        else:
+            version_spec = str(version)
+
+        from packaging.specifiers import InvalidSpecifier, SpecifierSet
+
         try:
</code_context>

<issue_to_address>
Importing inside the function may have performance implications.

Consider moving the import of `InvalidSpecifier` and `SpecifierSet` to the module level to prevent repeated imports if the function is called often.
</issue_to_address>
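The suggested change might look like this (an illustrative sketch; in the PR the parsing lives inside read_dataset, and the helper name here is hypothetical):

```python
# Module level: imported once when the module is first loaded,
# instead of on every call to the function.
from packaging.specifiers import InvalidSpecifier, SpecifierSet

def resolve_spec(version: str) -> SpecifierSet:
    # Illustrative helper; parse the spec once, surfacing a clear error
    try:
        return SpecifierSet(version)
    except InvalidSpecifier as exc:
        raise ValueError(f"invalid version specifier: {version!r}") from exc
```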

### Comment 2
<location> `tests/func/test_read_dataset_version_specifiers.py:50` </location>
<code_context>
+    assert result.to_values("dataset_version")[0] == "1.2.0"
+
+
+def test_read_dataset_version_specifiers_no_match(test_session):
+    """Test read_dataset with version specifiers that don't match any version."""
+    # Create a dataset with a single version
+    dataset_name = "test_no_match_specifiers"
+
+    (
+        dc.read_values(data=[1, 2], session=test_session)
+        .mutate(dataset_version="1.0.0")
+        .save(dataset_name, version="1.0.0")
+    )
+
+    # Test version specifier that doesn't match any existing version
</code_context>

<issue_to_address>
Consider testing the behavior when the dataset exists but has no versions at all.

Adding a test for datasets with no versions will help verify that the correct exception and error message are produced in this scenario.

Suggested implementation:

```python

def test_read_dataset_with_no_versions(test_session):
    """Test read_dataset when the dataset exists but has no versions."""
    dataset_name = "test_no_versions"

    # Create an empty dataset entry (no versions saved)
    # Assuming there is a way to register a dataset without saving a version.
    # If not, this may need to be adjusted to fit the actual API.
    dc.create_dataset(dataset_name, session=test_session)

    with pytest.raises(DatasetVersionNotFoundError) as exc_info:
        dc.read_dataset(dataset_name, version="*", session=test_session)

    assert (
        f"No dataset {dataset_name} version matching specifier *"
        in str(exc_info.value)
    )


def test_read_dataset_version_specifiers_exact_version(test_session):

```

- If `dc.create_dataset` does not exist, replace it with the appropriate method to register a dataset without any versions, or mock the dataset registration as needed for your codebase.
- Ensure that `DatasetVersionNotFoundError` is imported if not already present.
</issue_to_address>


@shcheklein shcheklein force-pushed the version-specs-support branch from 7abfd99 to 7eb29e2 on June 28, 2025 22:53
@codecov

codecov bot commented Jun 28, 2025

Codecov Report

Attention: Patch coverage is 97.56098% with 1 line in your changes missing coverage. Please review.

Project coverage is 88.72%. Comparing base (1692ee3) to head (105a2e4).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/datachain/catalog/catalog.py 92.85% 0 Missing and 1 partial ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1178      +/-   ##
==========================================
+ Coverage   88.71%   88.72%   +0.01%     
==========================================
  Files         152      152              
  Lines       13535    13545      +10     
  Branches     1879     1885       +6     
==========================================
+ Hits        12007    12018      +11     
  Misses       1086     1086              
+ Partials      442      441       -1     
Flag Coverage Δ
datachain 88.66% <97.56%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
src/datachain/dataset.py 86.99% <100.00%> (+0.26%) ⬆️
src/datachain/lib/dc/datasets.py 94.66% <100.00%> (+2.56%) ⬆️
src/datachain/lib/dc/listings.py 88.88% <100.00%> (ø)
src/datachain/lib/projects.py 100.00% <ø> (ø)
src/datachain/query/dataset.py 93.60% <ø> (-0.04%) ⬇️
src/datachain/remote/studio.py 80.80% <100.00%> (-0.42%) ⬇️
src/datachain/catalog/catalog.py 86.04% <92.85%> (+0.12%) ⬆️

@shcheklein shcheklein requested a review from ilongin June 28, 2025 23:29
Contributor

@dmpetrov dmpetrov left a comment


LGTM - but I checked only the user API, not the implementation.
Approving so as not to block anyone.


```py
chain = dc.read_dataset("my_cats", fallback_to_studio=False)
chain = dc.read_dataset("my_cats", version="1.0.0")
```

why don't we make this a part of the dataset name? like my_cats@1.0.0 or package_name>=1.0,<2.0

So, we can give up a whole parameter from almost every API call 🙂
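If that syntax were adopted, the parsing could be as simple as the following. This is a hypothetical helper, not part of this PR, and it handles only the `@` form of the suggestion:

```python
from typing import Optional

def split_name_and_spec(ref: str) -> tuple[str, Optional[str]]:
    """Split 'my_cats@>=1.0,<2.0' into ('my_cats', '>=1.0,<2.0').

    A bare name yields ('my_cats', None), so existing call sites
    without a version would keep working unchanged.
    """
    name, sep, spec = ref.partition("@")
    return name, (spec if sep else None)

split_name_and_spec("my_cats@1.0.0")  # → ("my_cats", "1.0.0")
```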

@shcheklein shcheklein requested a review from a team June 29, 2025 19:23
@shcheklein shcheklein merged commit 2b729de into main Jun 29, 2025
58 of 59 checks passed
@shcheklein shcheklein deleted the version-specs-support branch June 29, 2025 21:10
