feat(datasets): accept pep version spec in read_dataset#1178
Merged
shcheklein merged 10 commits intomainfrom Jun 29, 2025
Merged
feat(datasets): accept pep version spec in read_dataset#1178shcheklein merged 10 commits intomainfrom
shcheklein merged 10 commits intomainfrom
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
Deploying datachain-documentation with
|
| Latest commit: |
105a2e4
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://a22d7e21.datachain-documentation.pages.dev |
| Branch Preview URL: | https://version-specs-support.datachain-documentation.pages.dev |
18913a3 to
ba45bb0
Compare
Contributor
There was a problem hiding this comment.
Hey @shcheklein - I've reviewed your changes - here's some feedback:
- In
latest_compatible_version, using max(compatible_versions) directly may not pick the highest semantic version—use something likemax(compatible_versions, key=lambda r: Version(r.version))to compare withpackaging.Version. - Consider catching
packaging.InvalidVersioninlatest_compatible_versionso that any malformed version strings on the dataset side don’t cause an unexpected crash when parsing. - The
DatasetVersionNotFoundErrormessage references the originalversionargument even after it’s converted to a specifier—switch to using the actual specifier string (e.g.version_spec) in the error message to avoid confusion.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `latest_compatible_version`, using max(compatible_versions) directly may not pick the highest semantic version—use something like `max(compatible_versions, key=lambda r: Version(r.version))` to compare with `packaging.Version`.
- Consider catching `packaging.InvalidVersion` in `latest_compatible_version` so that any malformed version strings on the dataset side don’t cause an unexpected crash when parsing.
- The `DatasetVersionNotFoundError` message references the original `version` argument even after it’s converted to a specifier—switch to using the actual specifier string (e.g. `version_spec`) in the error message to avoid confusion.
## Individual Comments
### Comment 1
<location> `src/datachain/lib/dc/datasets.py:171` </location>
<code_context>
+ else:
+ version_spec = str(version)
+
+ from packaging.specifiers import InvalidSpecifier, SpecifierSet
+
try:
</code_context>
<issue_to_address>
Importing inside the function may have performance implications.
Consider moving the import of `InvalidSpecifier` and `SpecifierSet` to the module level to prevent repeated imports if the function is called often.
</issue_to_address>
### Comment 2
<location> `tests/func/test_read_dataset_version_specifiers.py:50` </location>
<code_context>
+ assert result.to_values("dataset_version")[0] == "1.2.0"
+
+
+def test_read_dataset_version_specifiers_no_match(test_session):
+ """Test read_dataset with version specifiers that don't match any version."""
+ # Create a dataset with a single version
+ dataset_name = "test_no_match_specifiers"
+
+ (
+ dc.read_values(data=[1, 2], session=test_session)
+ .mutate(dataset_version="1.0.0")
+ .save(dataset_name, version="1.0.0")
+ )
+
+ # Test version specifier that doesn't match any existing version
</code_context>
<issue_to_address>
Consider testing the behavior when the dataset exists but has no versions at all.
Adding a test for datasets with no versions will help verify that the correct exception and error message are produced in this scenario.
Suggested implementation:
```python
def test_read_dataset_with_no_versions(test_session):
"""Test read_dataset when the dataset exists but has no versions."""
dataset_name = "test_no_versions"
# Create an empty dataset entry (no versions saved)
# Assuming there is a way to register a dataset without saving a version.
# If not, this may need to be adjusted to fit the actual API.
dc.create_dataset(dataset_name, session=test_session)
with pytest.raises(DatasetVersionNotFoundError) as exc_info:
dc.read_dataset(dataset_name, version="*", session=test_session)
assert (
f"No dataset {dataset_name} version matching specifier *"
in str(exc_info.value)
)
def test_read_dataset_version_specifiers_exact_version(test_session):
```
- If `dc.create_dataset` does not exist, replace it with the appropriate method to register a dataset without any versions, or mock the dataset registration as needed for your codebase.
- Ensure that `DatasetVersionNotFoundError` is imported if not already present.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
7abfd99 to
7eb29e2
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1178 +/- ##
==========================================
+ Coverage 88.71% 88.72% +0.01%
==========================================
Files 152 152
Lines 13535 13545 +10
Branches 1879 1885 +6
==========================================
+ Hits 12007 12018 +11
Misses 1086 1086
+ Partials 442 441 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
ilongin
requested changes
Jun 29, 2025
dmpetrov
approved these changes
Jun 29, 2025
Contributor
dmpetrov
left a comment
There was a problem hiding this comment.
LGTM - but I checked only users API, not implementation.
Approving to not to block anyone
|
|
||
| ```py | ||
| chain = dc.read_dataset("my_cats", fallback_to_studio=False) | ||
| chain = dc.read_dataset("my_cats", version="1.0.0") |
Contributor
There was a problem hiding this comment.
why don't we make this a part of dataset name? like my_cats@1.0.0 or package_name>=1.0,<2.0
So, we can give up a whole parameter from almost every API call 🙂
ilongin
approved these changes
Jun 29, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add a way to specify version spec in
read_datasetto find and match the most recent compatible version.Among other tings:
fallback_to_studioflag as part of this PR (because I needed to touch how we fetch and resolve dataset versions against Studio)read_dataset(update=True)to always try to get the latest version from Studio that satisfies the version specTODO:
update=Trueand remote dataset version resolving