Refactoring `Catalog.get_dataset()` #1249

ilongin · 2025-07-18T14:50:34Z

Refactoring Catalog.get_dataset() to accept project_name and namespace_name instead of an already instantiated Project. In more than one place we have the project and namespace names available but not the whole Project instance, so we had to fetch it from the DB first, which was basically +1 SQL query that could have been avoided.

Summary by Sourcery

Refactor dataset lookup to use namespace and project names instead of passing Project objects throughout the codebase

Enhancements:

Change Catalog.get_dataset signature to accept namespace_name and project_name parameters
Update Metastore.get_dataset interface and queries to use namespace and project names
Modify all internal callers and tests to pass namespace_name and project_name rather than Project instances

sourcery-ai · 2025-07-18T14:50:45Z

Reviewer's Guide

This PR refactors the dataset lookup API by changing Catalog.get_dataset (and its metastore counterpart) to accept namespace_name and project_name instead of a Project object, centralizes default resolution of namespace/project, and updates all callers and tests accordingly.

Sequence diagram for dataset lookup with new get_dataset signature

sequenceDiagram
    participant Caller
    participant Catalog
    participant Metastore
    Caller->>Catalog: get_dataset(name, namespace_name, project_name)
    Catalog->>Metastore: get_dataset(name, namespace_name, project_name)
    Metastore-->>Catalog: DatasetRecord
    Catalog-->>Caller: DatasetRecord

Class diagram for refactored Catalog.get_dataset and Metastore.get_dataset

classDiagram
    class Catalog {
        +get_dataset(name: str, namespace_name: Optional[str] = None, project_name: Optional[str] = None) DatasetRecord
    }
    class Metastore {
        +get_dataset(name: str, namespace_name: Optional[str] = None, project_name: Optional[str] = None, conn=None) DatasetRecord
    }
    Catalog --> Metastore : uses
    class Project {
        +name: str
        +namespace: Namespace
    }
    class Namespace {
        +name: str
    }
    class DatasetRecord {
        +name: str
        +project: Project
        +latest_version: str
        +has_version(version: str): bool
    }
    Metastore --> Project
    Project --> Namespace
    Catalog --> DatasetRecord
    Metastore --> DatasetRecord
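
The call chain in the two diagrams above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the in-memory `Metastore`, the simplified `DatasetRecord` fields, and the default names `default_ns` / `default_proj` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


class DatasetNotFoundError(Exception):
    """Stand-in for the project's exception of the same name."""


@dataclass(frozen=True)
class DatasetRecord:
    # Simplified stand-in: the real record carries a Project, versions, etc.
    name: str
    namespace_name: str
    project_name: str


class Metastore:
    # Hypothetical default names, chosen only for this illustration.
    default_namespace_name = "default_ns"
    default_project_name = "default_proj"

    def __init__(self, records):
        # Index records by (namespace, project, dataset name).
        self._records = {
            (r.namespace_name, r.project_name, r.name): r for r in records
        }

    def get_dataset(
        self, name: str, namespace_name: str, project_name: str
    ) -> DatasetRecord:
        key = (namespace_name, project_name, name)
        if key not in self._records:
            raise DatasetNotFoundError(
                f"Dataset {name} not found in namespace {namespace_name}"
                f" and project {project_name}"
            )
        return self._records[key]


class Catalog:
    def __init__(self, metastore: Metastore):
        self.metastore = metastore

    def get_dataset(
        self,
        name: str,
        namespace_name: Optional[str] = None,
        project_name: Optional[str] = None,
    ) -> DatasetRecord:
        # Fall back to the defaults, then delegate straight to the metastore.
        namespace_name = namespace_name or self.metastore.default_namespace_name
        project_name = project_name or self.metastore.default_project_name
        return self.metastore.get_dataset(name, namespace_name, project_name)


catalog = Catalog(Metastore([DatasetRecord("dogs", "default_ns", "default_proj")]))
record = catalog.get_dataset("dogs")
```

The caller never builds a `Project` object; it only passes names (or nothing, to get the defaults).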

File-Level Changes

Change	Details	Files
Refactor Catalog.get_dataset signature and internals	Replace the `project: Project` parameter with `namespace_name` and `project_name`; resolve default namespace/project inside the method for listing and default cases; remove the old try/except and delegate directly to metastore.get_dataset	`src/datachain/catalog/catalog.py`
Update Metastore.get_dataset interface and query logic	Extend signature to accept `namespace_name` and `project_name` instead of `project_id`; adjust SQL query to join namespaces and projects tables by name; modify dataset creation and version methods to call new signature	`src/datachain/data_storage/metastore.py`
Propagate new get_dataset signature across all callers	Replace calls passing `Project` objects with namespace/project name arguments; align helper modules (listing, query, delta, datachain core, CLI) to new API; simplify redundant code around dataset resolution	`src/datachain/query/dataset.py`, `src/datachain/listing.py`, `src/datachain/lib/dc/datasets.py`, `src/datachain/lib/dc/datachain.py`, `src/datachain/delta.py`, `src/datachain/cli/commands/datasets.py`
Adjust functional tests to reflect new signature	Remove passing `Project` to `get_dataset` in tests; verify namespace/project defaults by calling `get_dataset(name)` only	`tests/func/test_datasets.py`, `tests/func/test_pull.py`
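
To illustrate the "join namespaces and projects tables by name" row above, here is a minimal sketch of a single-query lookup. It uses raw `sqlite3` and a made-up, heavily simplified schema; the table columns and the `get_dataset_row` helper are assumptions for illustration and not the project's actual query code.

```python
import sqlite3

# Hypothetical, heavily simplified schema; the real tables have more columns.
SCHEMA = """
CREATE TABLE namespaces (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    name TEXT,
    namespace_id INTEGER REFERENCES namespaces(id)
);
CREATE TABLE datasets (
    id INTEGER PRIMARY KEY,
    name TEXT,
    project_id INTEGER REFERENCES projects(id)
);
"""

# One query resolves the dataset by all three names.
LOOKUP = """
SELECT d.id, d.name
FROM datasets AS d
JOIN projects AS p ON p.id = d.project_id
JOIN namespaces AS n ON n.id = p.namespace_id
WHERE d.name = ? AND p.name = ? AND n.name = ?
"""


def get_dataset_row(conn, name, namespace_name, project_name):
    """Fetch a dataset by names in one query, with no prior project lookup."""
    return conn.execute(LOOKUP, (name, project_name, namespace_name)).fetchone()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO namespaces (id, name) VALUES (1, 'dev')")
conn.execute("INSERT INTO projects (id, name, namespace_id) VALUES (1, 'animals', 1)")
conn.execute("INSERT INTO datasets (id, name, project_id) VALUES (1, 'dogs', 1)")

row = get_dataset_row(conn, "dogs", "dev", "animals")
```

Compared with first selecting the project row and then filtering datasets by `project_id`, the join saves one round trip per lookup, which is the "+1 SQL" cost mentioned in the PR description.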

Possibly linked issues

Initial DataChain Commit #1: The PR refactors Catalog.get_dataset to accept namespace_name and project_name, directly addressing the issue.

sourcery-ai

Hey @ilongin - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/data_storage/metastore.py:921` </location>
<code_context>
     ) -> DatasetRecord:
         """Creates new dataset."""
-        project_id = project_id or self.default_project.id
+        if not project_id:
+            project = self.default_project
+        else:
+            project = self.get_project_by_id(project_id)

         query = self._datasets_insert().values(
</code_context>

<issue_to_address>
Logic for determining project_id in create_dataset may be redundant.

Since project.id is always used in the insert, fetching the full project object when project_id is already provided may be unnecessary. Consider simplifying to avoid the extra lookup.
</issue_to_address>

sourcery-ai · 2025-07-18T14:52:03Z

src/datachain/data_storage/metastore.py

+        if not project_id:
+            project = self.default_project
+        else:
+            project = self.get_project_by_id(project_id)


nitpick: Logic for determining project_id in create_dataset may be redundant.

Since project.id is always used in the insert, fetching the full project object when project_id is already provided may be unnecessary. Consider simplifying to avoid the extra lookup.

sourcery-ai · 2025-07-18T14:52:03Z

tests/func/test_datasets.py

+    for r in catalog.get_dataset("dogs_custom_columns").get_version("1.0.0").preview:
        assert isinstance(r.get("file__last_modified"), str)


issue (code-quality): Avoid loops in tests. (no-loop-in-tests)

Explanation
Avoid complex code, like loops, in test functions.
Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To achieve that, avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests
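
A minimal sketch of the "move the complex logic into helpers" option, assuming a hypothetical helper `values_not_of_type` and made-up preview rows (the real test reads them from the dataset version's `preview`):

```python
def values_not_of_type(rows, key, expected_type):
    """Return the values stored under `key` that are not `expected_type`."""
    return [
        row.get(key) for row in rows if not isinstance(row.get(key), expected_type)
    ]


# Made-up preview rows standing in for a dataset version's preview.
preview = [
    {"file__last_modified": "2025-07-18T14:50:34Z"},
    {"file__last_modified": "2025-07-18T14:52:03Z"},
]

# The test body becomes a single assertion; on failure the offending
# values show up directly in the assertion message.
assert values_not_of_type(preview, "file__last_modified", str) == []
```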

sourcery-ai · 2025-07-18T14:52:03Z

src/datachain/data_storage/metastore.py

        ds = self._parse_dataset(self.db.execute(query, conn=conn))
        if not ds:
            raise DatasetNotFoundError(
-                f"Dataset {name} not found in project with id {project_id}"
+                f"Dataset {name} not found in namespace {namespace_name}"
+                f" and project {project_name}"
            )


issue (code-quality): We've found these issues:

Use named expression to simplify assignment and conditional (use-named-expression)

Lift code into else after jump in control flow (reintroduce-else)

Swap if/else branches (swap-if-else-branches)
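
One way the three suggestions above could combine, sketched against a made-up in-memory store instead of the real `_parse_dataset` / `db.execute` calls (the `_DATASETS` dict and the standalone `get_dataset` function are assumptions for illustration):

```python
class DatasetNotFoundError(Exception):
    """Stand-in for the project's exception of the same name."""


# Made-up in-memory store keyed by (namespace, project, dataset name).
_DATASETS = {("dev", "animals", "dogs"): {"name": "dogs"}}


def get_dataset(name, namespace_name, project_name):
    # The named expression assigns and tests in one step; the found case
    # comes first and the error moves into the else branch.
    if ds := _DATASETS.get((namespace_name, project_name, name)):
        return ds
    else:
        raise DatasetNotFoundError(
            f"Dataset {name} not found in namespace {namespace_name}"
            f" and project {project_name}"
        )
```

Whether this reads better than the original early `raise` is a matter of taste; behavior is the same either way.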

codecov · 2025-07-18T14:58:59Z

Codecov Report

❌ Patch coverage is 88.57143% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.73%. Comparing base (c8a23d7) to head (7d33924).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/datachain/catalog/catalog.py	83.33%	3 Missing ⚠️
src/datachain/cli/commands/datasets.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1249      +/-   ##
==========================================
+ Coverage   88.71%   88.73%   +0.01%     
==========================================
  Files         155      155              
  Lines       14130    14123       -7     
  Branches     1993     1994       +1     
==========================================
- Hits        12536    12532       -4     
+ Misses       1127     1124       -3     
  Partials      467      467

Flag	Coverage Δ
datachain	`88.67% <88.57%> (+0.01%)`	⬆️

Files with missing lines	Coverage Δ
src/datachain/data_storage/metastore.py	`93.86% <100.00%> (+0.05%)`	⬆️
src/datachain/delta.py	`92.68% <100.00%> (-0.09%)`	⬇️
src/datachain/lib/dc/datachain.py	`91.50% <100.00%> (ø)`
src/datachain/lib/dc/datasets.py	`95.12% <100.00%> (ø)`
src/datachain/listing.py	`85.32% <100.00%> (+1.25%)`	⬆️
src/datachain/query/dataset.py	`93.34% <ø> (-0.02%)`	⬇️
src/datachain/cli/commands/datasets.py	`71.95% <0.00%> (+0.86%)`	⬆️
src/datachain/catalog/catalog.py	`85.96% <83.33%> (-0.07%)`	⬇️

cloudflare-workers-and-pages · 2025-08-04T06:21:52Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`7d33924`
Status:	✅ Deploy successful!
Preview URL:	https://ecc011cf.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-1237-refactore-get-d.datachain-documentation.pages.dev

dmpetrov

Let's use fully qualified dataset names instead of 3 params: name, namespace, project.

dmpetrov · 2025-08-04T06:49:47Z

src/datachain/catalog/catalog.py

+        self,
+        name: str,
+        namespace_name: Optional[str] = None,
+        project_name: Optional[str] = None,


Let's use full project name instead of additional params - ns1.ns2.ds3

This is an internal API where we already have the name split into 3 parts. In the public API we use the fully qualified name. Honestly, I wouldn't change this atm, as it would again require some refactoring on my side, which would take time, and we already have this kind of signature in other internal methods, so not much would change. WDYT?

Let's approve to move fast. But we are accumulating tech debt by using different notations in internal and external APIs. Please create a follow-up issue.

Just an idea: we can store the full name in the dataset table as the dataset name. This would simplify everything, and we would no longer need to pass the namespace and project names everywhere.

We don't need to do that, as the dataset table already has a connection to the project table, which has a connection to the namespaces table, so it's normalized. Also, when a dataset is fetched from the table, the DatasetRecord object already has a Project object in it and a full_name property that returns the namespace, project and dataset name, so it's all ok.
We need an explicit namespace / project when we are fetching a dataset (as here) or saving a new one to the DB. The only question is whether we should have it split into 3 arguments everywhere, as now, or have just a dataset name which must, by convention, have the namespace and project embedded in it, e.g. namespace.project.dataset. This is what we already have in the public API but not in the internal ones.
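
For reference, the "name embedded by convention" option could look roughly like this. It is a minimal sketch: `split_full_name` is a hypothetical helper, the default names are placeholders, and treating a bare dataset name as "use both defaults" is an assumption rather than documented behavior.

```python
# Hypothetical defaults, used only for this illustration.
DEFAULT_NAMESPACE = "default_ns"
DEFAULT_PROJECT = "default_proj"


def split_full_name(full_name: str) -> tuple[str, str, str]:
    """Split `namespace.project.dataset` into its three parts.

    A bare dataset name falls back to both defaults; anything else that is
    not exactly three dot-separated, non-empty parts is rejected.
    """
    parts = full_name.split(".")
    if len(parts) == 1 and parts[0]:
        return DEFAULT_NAMESPACE, DEFAULT_PROJECT, parts[0]
    if len(parts) == 3 and all(parts):
        return parts[0], parts[1], parts[2]
    raise ValueError(f"Invalid dataset name: {full_name!r}")
```

With this shape a caller cannot set a namespace without also naming a project, since a two-part name is rejected.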

dmpetrov

LG

dreadatour

Looks good to me overall 👍

dreadatour · 2025-08-20T12:46:11Z

src/datachain/catalog/catalog.py

+        namespace_name = namespace_name or self.metastore.default_namespace_name
+        project_name = project_name or self.metastore.default_project_name


Just wondering whether it makes sense for project_name to be set while namespace_name is empty (the default one), and vice versa (namespace_name is set, but project_name is the default one).

I assume yes, it makes sense in some cases, but it may also lead to some weird behavior. Taking the full name (ns1.ns2.ds3, as @dmpetrov suggests below) may solve this issue (it would not be possible to set only the namespace name without setting the project name). Or maybe I am wrong and this is a nice feature to have. Just something to keep in mind and think about, maybe later.

Note: to be honest I am still confused with namespace and project — which one goes first 😥

It def makes sense to have just the project name set, as then the default namespace will be used, but you are right about the reverse: if someone sets just the namespace name, the default project will be used, which can be a little bit weird. I'm not 100% sure whether that's ok TBH, but for now we allow all combinations.

It's namespace.project.dataset :D That's the hierarchy in the DB as well.
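
The four combinations discussed in this thread, spelled out. A minimal sketch: `resolve` is a hypothetical standalone function mirroring the two `or` fallbacks from the diff, and the default names are placeholders.

```python
# Hypothetical defaults, used only for this illustration.
DEFAULT_NAMESPACE = "default_ns"
DEFAULT_PROJECT = "default_proj"


def resolve(namespace_name=None, project_name=None):
    """Mirror the `x = x or default` fallback from the diff above."""
    return (
        namespace_name or DEFAULT_NAMESPACE,
        project_name or DEFAULT_PROJECT,
    )


combos = {
    "neither": resolve(),
    "project only": resolve(project_name="animals"),
    "namespace only": resolve(namespace_name="dev"),
    "both": resolve("dev", "animals"),
}
```

The "namespace only" case is the questionable one: it pairs a custom namespace with the default project name, which may not exist in that namespace.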

fixing tests

60e6717

ilongin linked an issue Jul 18, 2025 that may be closed by this pull request

Refactor Catalog.get_dataset() to accept project_name and namespace_name instead of Project #1237

Closed

sourcery-ai bot reviewed Jul 18, 2025

Merge branch 'main' into ilongin/1237-refactore-get-dataset

8060b62

dmpetrov requested changes Aug 4, 2025

ilongin requested a review from dmpetrov August 4, 2025 09:45

shcheklein approved these changes Aug 4, 2025

dmpetrov approved these changes Aug 4, 2025

Merge branch 'main' into ilongin/1237-refactore-get-dataset

d0bf1ed

dreadatour approved these changes Aug 20, 2025

ilongin added 2 commits August 21, 2025 10:22

Merge branch 'main' into ilongin/1237-refactore-get-dataset

34ba589

fixing get_dataset_dependencies to use namespace name and project name

7d33924

ilongin merged commit aec36fc into main Aug 21, 2025
37 of 38 checks passed

ilongin deleted the ilongin/1237-refactore-get-dataset branch August 21, 2025 13:11
