Adding ability to set namespace and project using env variables#1186
Adding ability to set namespace and project using env variables#1186
Conversation
Reviewer's GuideThis PR centralizes dataset name resolution in a new Catalog.get_full_dataset_name method to incorporate DATACHAIN_NAMESPACE and DATACHAIN_PROJECT environment variables and replaces ad-hoc parsing across save, read, delete, and CLI commands, with corresponding updates to documentation and tests. Sequence diagram for dataset name resolution with environment variablessequenceDiagram
participant User
participant CLI_or_API as CLI/API/Library
participant Catalog
participant Metastore
User->>CLI_or_API: Call save/read/delete_dataset(name, ...)
CLI_or_API->>Catalog: get_full_dataset_name(name, project_name, namespace_name)
Catalog->>parse_dataset_name: parse_dataset_name(name)
Catalog->>Env: Read DATACHAIN_NAMESPACE and DATACHAIN_PROJECT
Catalog->>Metastore: Get default_namespace_name, default_project_name
Catalog-->>CLI_or_API: (namespace, project, name)
CLI_or_API->>Catalog: get_dataset(name, project)
Catalog->>Metastore: get_project(project_name, namespace_name)
Metastore-->>Catalog: Project
Catalog-->>CLI_or_API: DatasetRecord
Class diagram for Catalog and dataset name resolutionclassDiagram
class Catalog {
+get_full_dataset_name(name: str, project_name: Optional[str], namespace_name: Optional[str]) tuple[str, str, str]
+get_dataset(name: str, project: Optional[Project]) DatasetRecord
}
Catalog --> Metastore : uses
class Metastore {
+default_namespace_name
+default_project_name
+get_project(project_name, namespace_name)
+is_local_dataset(namespace_name)
}
class parse_dataset_name {
<<function>>
}
Catalog ..> parse_dataset_name : calls
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey @ilongin - I've reviewed your changes - here's some feedback:
- Please add a clear docstring to get_full_dataset_name outlining parameter precedence (parsed name, args, env vars, defaults) and what each returned tuple element represents.
- Consider extracting direct os.environ lookups into a configuration helper or injecting the env values at initialization to improve testability and separation of concerns.
- In test_save_all_ways_to_set_project the use_settings parameter is unused; either implement branching on that flag to cover both code paths or remove it to avoid confusion.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Please add a clear docstring to get_full_dataset_name outlining parameter precedence (parsed name, args, env vars, defaults) and what each returned tuple element represents.
- Consider extracting direct os.environ lookups into a configuration helper or injecting the env values at initialization to improve testability and separation of concerns.
- In test_save_all_ways_to_set_project the use_settings parameter is unused; either implement branching on that flag to cover both code paths or remove it to avoid confusion.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Deploying datachain-documentation with
|
| Latest commit: |
ff182b8
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://e2e64851.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-1185-get-project-fro.datachain-documentation.pages.dev |
docs/guide/env.md
Outdated
|
|
||
| ### Namespaces and projects | ||
| - `DATACHAIN_NAMESPACE` – Namespace name to use as default. | ||
| - `DATACHAIN_PROJECT` – Project name to use as default. |
There was a problem hiding this comment.
Can project name include namespace?
There was a problem hiding this comment.
yes. I'd expect there is a single variable, not two.
There was a problem hiding this comment.
So you would like for example this: DATACHAIN_PROJECT=dev.analytics?
There was a problem hiding this comment.
yes, but we still need a way to specify only NAMESPACE
any ideas for a good var name?
(I'm fine to keep both vars and make sure that DATACHAIN_PROJECT can accept dev.analytics`)
There was a problem hiding this comment.
Yea, I agree that we need to keep NAMESPACE. I will do it as you propose, project can include namespace and that's it.
There was a problem hiding this comment.
Second example is strange to me. DATACHAIN_PROJECT env clearly states it's about projects but then we set dev namespace with it and no project? I'm more in favor of what Ivan suggested above, examples:
DATACHAIN_PROJECT=dev.analytics -> dev namespace and analytics project
DATACHAIN_PROJECT=analytics -> analytics project and default namespace
DATACHAIN_NAMESPACE=dev -> dev namespace and default project
DATACHAIN_NAMESPACE=dev & DATACHAIN_PROJECT=analytics -> dev namespace and analytics project
There was a problem hiding this comment.
sorry. I meant DATACHAIN_NAMESPACE. modifying...
There was a problem hiding this comment.
on a separate note...
- does
DATACHAIN_prefix too long? Why notDC_? - why don't we operate in datachain config instead of env variables. it's more common practice. User might work on several projects with different namespaces (or other env) at the same time.
There was a problem hiding this comment.
- We already use
DATACHAIN_prefix so I wanted to be consistent. We can do alias for all of those variables toDC_as well if we want, but I would do it in a separate issue - Main motivation for this was to enable Studio users to set default values for namespace / project. Locally user can only have
local.localproject and cannot create another one so dataset config doesn't really helps.
There was a problem hiding this comment.
Re env vs config. It's a general question. It feels like we are too heavy on env while in general config is preferred option.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1186 +/- ##
==========================================
+ Coverage 88.69% 88.71% +0.01%
==========================================
Files 152 152
Lines 13531 13535 +4
Branches 1875 1879 +4
==========================================
+ Hits 12001 12007 +6
+ Misses 1088 1086 -2
Partials 442 442
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Hey @ilongin - I've reviewed your changes - here's some feedback:
- Add a detailed docstring to Catalog.get_full_dataset_name that explains its parameters, return values, and the exact resolution precedence (parsed name, settings, env variables, defaults).
- There’s a duplicate entry in the test_save_all_ways_to_set_project parameter list; deduplicate or clarify that scenario to keep the matrix maintainable.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Add a detailed docstring to Catalog.get_full_dataset_name that explains its parameters, return values, and the exact resolution precedence (parsed name, settings, env variables, defaults).
- There’s a duplicate entry in the test_save_all_ways_to_set_project parameter list; deduplicate or clarify that scenario to keep the matrix maintainable.
## Individual Comments
### Comment 1
<location> `tests/unit/lib/test_datachain.py:3455` </location>
<code_context>
+ "env_namespace,env_project,"
+ "result_ds_namespace,result_ds_project"
+ ),
+ [
+ ("n3", "p3", "n2", "p2", "n1", "p1", "n3", "p3"),
+ ("", "", "n2", "p2", "n1", "p1", "n2", "p2"),
+ ("", "", "", "", "n1", "p1", "n1", "p1"),
+ ("", "", "", "", "n5", "n1.p1", "n1", "p1"),
+ ("", "", "", "", "", "n1.p1", "n1", "p1"),
+ ("", "", "", "", "", "n1.p1", "n1", "p1"),
+ ("n3", "p3", "n2", "p2", "", "", "n3", "p3"),
+ ("n3", "p3", "", "", "", "", "n3", "p3"),
+ ("n3", "p3", "", "", "n1", "p1", "n3", "p3"),
+ ("", "", "", "", "", "", "", ""),
+ ],
+)
</code_context>
<issue_to_address>
Consider adding test cases for invalid or malformed environment variable values.
Adding tests for malformed or unexpected values (such as extra dots, leading/trailing dots, whitespace, or special characters) will help verify error handling and prevent subtle bugs.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
[
("n3", "p3", "n2", "p2", "n1", "p1", "n3", "p3"),
("", "", "n2", "p2", "n1", "p1", "n2", "p2"),
("", "", "", "", "n1", "p1", "n1", "p1"),
("", "", "", "", "n5", "n1.p1", "n1", "p1"),
("", "", "", "", "", "n1.p1", "n1", "p1"),
("", "", "", "", "", "n1.p1", "n1", "p1"),
("n3", "p3", "n2", "p2", "", "", "n3", "p3"),
("n3", "p3", "", "", "", "", "n3", "p3"),
("n3", "p3", "", "", "n1", "p1", "n3", "p3"),
("", "", "", "", "", "", "", ""),
],
)
=======
[
("n3", "p3", "n2", "p2", "n1", "p1", "n3", "p3"),
("", "", "n2", "p2", "n1", "p1", "n2", "p2"),
("", "", "", "", "n1", "p1", "n1", "p1"),
("", "", "", "", "n5", "n1.p1", "n1", "p1"),
("", "", "", "", "", "n1.p1", "n1", "p1"),
("", "", "", "", "", "n1.p1", "n1", "p1"),
("n3", "p3", "n2", "p2", "", "", "n3", "p3"),
("n3", "p3", "", "", "", "", "n3", "p3"),
("n3", "p3", "", "", "n1", "p1", "n3", "p3"),
("", "", "", "", "", "", "", ""),
# Malformed/invalid env var values
("", "", "", "", " n1 ", " p1 ", "n1", "p1"), # leading/trailing whitespace
("", "", "", "", ".n1", "p1.", "n1", "p1"), # leading/trailing dots
("", "", "", "", "n1..n2", "p1..p2", "n1..n2", "p1..p2"), # extra dots
("", "", "", "", "n1$", "p1#", "n1$", "p1#"), # special characters
("", "", "", "", "n1.p1", "p1.n1", "n1", "p1"), # swapped/invalid format
("", "", "", "", "n1..", "..p1", "n1..", "..p1"), # trailing/leading double dots
],
)
>>>>>>> REPLACE
</suggested_fix>
### Comment 2
<location> `docs/guide/namespaces.md:161` </location>
<code_context>
ds = dc.read_dataset("local.local.metrics")
-ds.show()
-```
+ds.sho
</code_context>
<issue_to_address>
Typo: 'ds.sho' should be 'ds.show()'.
Please correct 'ds.sho' to 'ds.show()' for consistency and to ensure the example works as intended.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| [ | ||
| ("n3", "p3", "n2", "p2", "n1", "p1", "n3", "p3"), | ||
| ("", "", "n2", "p2", "n1", "p1", "n2", "p2"), | ||
| ("", "", "", "", "n1", "p1", "n1", "p1"), | ||
| ("", "", "", "", "n5", "n1.p1", "n1", "p1"), | ||
| ("", "", "", "", "", "n1.p1", "n1", "p1"), | ||
| ("", "", "", "", "", "n1.p1", "n1", "p1"), | ||
| ("n3", "p3", "n2", "p2", "", "", "n3", "p3"), | ||
| ("n3", "p3", "", "", "", "", "n3", "p3"), | ||
| ("n3", "p3", "", "", "n1", "p1", "n3", "p3"), | ||
| ("", "", "", "", "", "", "", ""), | ||
| ], | ||
| ) |
There was a problem hiding this comment.
suggestion (testing): Consider adding test cases for invalid or malformed environment variable values.
Adding tests for malformed or unexpected values (such as extra dots, leading/trailing dots, whitespace, or special characters) will help verify error handling and prevent subtle bugs.
| [ | |
| ("n3", "p3", "n2", "p2", "n1", "p1", "n3", "p3"), | |
| ("", "", "n2", "p2", "n1", "p1", "n2", "p2"), | |
| ("", "", "", "", "n1", "p1", "n1", "p1"), | |
| ("", "", "", "", "n5", "n1.p1", "n1", "p1"), | |
| ("", "", "", "", "", "n1.p1", "n1", "p1"), | |
| ("", "", "", "", "", "n1.p1", "n1", "p1"), | |
| ("n3", "p3", "n2", "p2", "", "", "n3", "p3"), | |
| ("n3", "p3", "", "", "", "", "n3", "p3"), | |
| ("n3", "p3", "", "", "n1", "p1", "n3", "p3"), | |
| ("", "", "", "", "", "", "", ""), | |
| ], | |
| ) | |
| [ | |
| ("n3", "p3", "n2", "p2", "n1", "p1", "n3", "p3"), | |
| ("", "", "n2", "p2", "n1", "p1", "n2", "p2"), | |
| ("", "", "", "", "n1", "p1", "n1", "p1"), | |
| ("", "", "", "", "n5", "n1.p1", "n1", "p1"), | |
| ("", "", "", "", "", "n1.p1", "n1", "p1"), | |
| ("", "", "", "", "", "n1.p1", "n1", "p1"), | |
| ("n3", "p3", "n2", "p2", "", "", "n3", "p3"), | |
| ("n3", "p3", "", "", "", "", "n3", "p3"), | |
| ("n3", "p3", "", "", "n1", "p1", "n3", "p3"), | |
| ("", "", "", "", "", "", "", ""), | |
| # Malformed/invalid env var values | |
| ("", "", "", "", " n1 ", " p1 ", "n1", "p1"), # leading/trailing whitespace | |
| ("", "", "", "", ".n1", "p1.", "n1", "p1"), # leading/trailing dots | |
| ("", "", "", "", "n1..n2", "p1..p2", "n1..n2", "p1..p2"), # extra dots | |
| ("", "", "", "", "n1$", "p1#", "n1$", "p1#"), # special characters | |
| ("", "", "", "", "n1.p1", "p1.n1", "n1", "p1"), # swapped/invalid format | |
| ("", "", "", "", "n1..", "..p1", "n1..", "..p1"), # trailing/leading double dots | |
| ], | |
| ) |
docs/guide/namespaces.md
Outdated
| ds = dc.read_dataset("local.local.metrics") | ||
| ds.show() | ||
| ``` | ||
| ds.sho |
There was a problem hiding this comment.
issue (typo): Typo: 'ds.sho' should be 'ds.show()'.
Please correct 'ds.sho' to 'ds.show()' for consistency and to ensure the example works as intended.
| if namespace and project: | ||
| return f"{namespace}.{project}.{name}" |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if not result_ds_namespace and not result_ds_project: | ||
| # special case when nothing is defined - we set default ones | ||
| result_ds_namespace = metastore.default_namespace_name | ||
| result_ds_project = metastore.default_project_name |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if namespace and project: | ||
| return f"{namespace}.{project}.{name}" | ||
| return name |
There was a problem hiding this comment.
suggestion (code-quality): We've found these issues:
- Lift code into else after jump in control flow (
reintroduce-else) - Replace if statement with if expression (
assign-if-exp)
| if namespace and project: | |
| return f"{namespace}.{project}.{name}" | |
| return name | |
| return f"{namespace}.{project}.{name}" if namespace and project else name |
…ive/datachain into ilongin/1185-get-project-from-env
|
@ilongin it feels like we introduce 2 variables to public API without a need. We can handle this using a single namespace params. Let's try this instead. |
|
@dmpetrov I think we need then a good idea on the name. If you have a good suggestion please share. (keep in mind please - we need to release it today). |
The idea: I cannot imaging cases when user needs to set only project. Also, we should avoid using such vague names asa project especially because we already have project in experiments. I'm pretty sure we won't find additional space in UI to see two params - it will be a single box for both. So, why we push them on users in the API? |
Setting Namespace and Project via Environment Variables
In addition to using
.settings(), you can configure the namespace and project using environment variables:DATACHAIN_NAMESPACEsets the namespace.DATACHAIN_PROJECTsets the project name, or both the namespace and project using the formatnamespace.project.Examples
How Namespace and Project Are Resolved
When determining which namespace and project to use, Datachain applies the following precedence:
If the dataset name includes both the namespace and project, these values take highest precedence.
Values provided via
.settings()or passed directly toread_dataset()or similar methods.Namespace and project set using environment variables:
export DATACHAIN_PROJECT=dev.analyticsSummary by Sourcery
Enable setting default namespace and project via environment variables and centralize dataset name resolution through Catalog.get_full_dataset_name
New Features:
Enhancements:
Documentation:
Tests:
Summary by Sourcery
Enable environment variable support for namespace and project defaults and centralize dataset name resolution through Catalog.get_full_dataset_name, updating dataset operations, CLI commands, documentation, and tests accordingly
New Features:
Enhancements:
Documentation:
Tests: