Conversation
Deploying datachain with
|
| Latest commit: |
3f8b982
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://a9aceeef.datachain-2g6.pages.dev |
| Branch Preview URL: | https://ilongin-1511-new-job-referen.datachain-2g6.pages.dev |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
run_group_id and rerun_from_job_id
amritghimire
left a comment
There was a problem hiding this comment.
LGTM apart from the discussions about the naming. I have no strong opinion about any.
|
@ilongin Please check the test cases too |
There was a problem hiding this comment.
Pull request overview
This PR introduces dedicated fields for tracking job rerun relationships to replace the generic parent_job_id field. Two new fields are added: rerun_from_job_id (references the immediate parent job in a rerun chain) and run_group_id (groups all jobs that belong to the same rerun lineage).
Key changes:
- Database schema adds
rerun_from_job_idandrun_group_idcolumns with indexes to the jobs table - Validation logic ensures both fields are provided together when creating rerun jobs
- The
get_ancestor_job_ids()function now filters byrun_group_idfor safety when traversing rerun chains
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/datachain/data_storage/metastore.py | Adds new columns and indexes to jobs table; implements validation for rerun fields; updates get_ancestor_job_ids() to use rerun_from_job_id and filter by run_group_id |
| src/datachain/job.py | Adds rerun_from_job_id and run_group_id fields to the Job dataclass and updates the parse method |
| src/datachain/query/session.py | Updates get_or_create_job() to pass rerun_from_job_id and run_group_id when continuing from a previous job |
| src/datachain/lib/dc/datachain.py | Updates checkpoint resolution to use rerun_from_job_id instead of parent_job_id |
| tests/unit/test_job_management.py | Updates test assertion to verify rerun_from_job_id instead of parent_job_id |
| tests/unit/lib/test_checkpoints.py | Updates test to pass both rerun_from_job_id and run_group_id when creating a rerun job |
| tests/func/test_metastore.py | Updates test_get_ancestor_job_ids to use new rerun fields instead of parent_job_id |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """ | ||
| Creates a new job. | ||
| Returns the job id. | ||
| """ |
There was a problem hiding this comment.
The docstring should be expanded to document the new parameters rerun_from_job_id and run_group_id, especially their relationship and the validation requirements. Consider adding documentation that explains:
rerun_from_job_id: The ID of the job that this job is rerunning from (None for initial jobs)run_group_id: The ID that groups all jobs in a rerun chain (automatically set to the job's own ID for initial jobs, must be provided when rerun_from_job_id is set)- The validation requirement that both must be provided together or neither
| job2 = session2.get_or_create_job() | ||
|
|
||
| assert job2.parent_job_id == job1.id | ||
| assert job2.rerun_from_job_id == job1.id |
There was a problem hiding this comment.
The test verifies that rerun_from_job_id is set correctly but doesn't verify that run_group_id is also set properly. Since these two fields must be provided together (as enforced by the validation in create_job()), the test should also assert that job2.run_group_id == job1.run_group_id (or more specifically, that job2.run_group_id == job1.id since the first job in a chain sets its own ID as the group ID).
| assert run_group_id is not None, ( | ||
| "run_group_id must be provided when rerun_from_job_id is set" | ||
| ) | ||
| else: | ||
| # First job: run_group_id should not be provided (we set it here) | ||
| assert run_group_id is None, ( | ||
| "run_group_id should not be provided when rerun_from_job_id is not set" | ||
| ) |
There was a problem hiding this comment.
The validation uses assertions which will be removed in optimized Python code (when running with -O flag). For production code, consider using explicit if statements with ValueError or AssertionError exceptions instead. This ensures the validation always runs regardless of Python optimization settings, which is important for maintaining data integrity in the jobs table.
| assert run_group_id is not None, ( | |
| "run_group_id must be provided when rerun_from_job_id is set" | |
| ) | |
| else: | |
| # First job: run_group_id should not be provided (we set it here) | |
| assert run_group_id is None, ( | |
| "run_group_id should not be provided when rerun_from_job_id is not set" | |
| ) | |
| if run_group_id is None: | |
| raise ValueError( | |
| "run_group_id must be provided when rerun_from_job_id is set" | |
| ) | |
| else: | |
| # First job: run_group_id should not be provided (we set it here) | |
| if run_group_id is not None: | |
| raise ValueError( | |
| "run_group_id should not be provided when rerun_from_job_id is not set" | |
| ) |
| python_version = f"{sys.version_info.major}.{sys.version_info.minor}" | ||
|
|
||
| # try to find the parent job | ||
| # try to find the parent job for checkpoint/rerun chain |
There was a problem hiding this comment.
The comment says "try to find the parent job for checkpoint/rerun chain" but the code now uses rerun_from_job_id instead of parent_job_id. Consider updating the comment to "try to find the previous job in the checkpoint/rerun chain" to better reflect the new terminology and avoid confusion.
| # try to find the parent job for checkpoint/rerun chain | |
| # try to find the previous job in the checkpoint/rerun chain |
| & ( | ||
| self._jobs.c.run_group_id | ||
| == cast( | ||
| ancestors_cte.c.run_group_id, self._jobs.c.run_group_id.type | ||
| ) | ||
| ), # Safety: only traverse within same run group |
There was a problem hiding this comment.
The safety check that filters by run_group_id could have issues with NULL values from legacy jobs created before this PR. In SQL, NULL == NULL evaluates to NULL (not TRUE), which means the recursive traversal might not work correctly for jobs with NULL run_group_id values.
Consider adding a check at the beginning of the function to handle the case where the starting job has a NULL run_group_id, or document that this function requires jobs to have non-NULL run_group_id values. For example, you could either:
- Return an empty list if the initial job's run_group_id is NULL
- Skip the run_group_id filter if the initial job's run_group_id is NULL (though this would reduce safety)
- Use a coalesce or IS NOT DISTINCT FROM comparison that handles NULLs properly
| & ( | |
| self._jobs.c.run_group_id | |
| == cast( | |
| ancestors_cte.c.run_group_id, self._jobs.c.run_group_id.type | |
| ) | |
| ), # Safety: only traverse within same run group | |
| & self._jobs.c.run_group_id.isnot_distinct_from( | |
| cast( | |
| ancestors_cte.c.run_group_id, | |
| self._jobs.c.run_group_id.type, | |
| ) | |
| ), # Safety: only traverse within same run group (handles NULLs) |
Decouple job rerun tracking from generic
parent_job_idby adding:rerun_from_job_id: references the parent job when rerunningrun_group_id: groups all jobs in a rerun chain (set to first job's ID)Changes:
get_ancestor_job_ids()to filter byrun_group_idfor safetySession.get_or_create_job()to pass both fields when rerunning