Skip to content

Add dedicated fields for job reruns: run_group_id and rerun_from_job_id#1512

Merged
ilongin merged 8 commits intomainfrom
ilongin/1511-new-job-reference-field-for-checkpoints
Dec 31, 2025
Merged

Add dedicated fields for job reruns: run_group_id and rerun_from_job_id#1512
ilongin merged 8 commits intomainfrom
ilongin/1511-new-job-reference-field-for-checkpoints

Conversation

@ilongin
Copy link
Contributor

@ilongin ilongin commented Dec 15, 2025

Decouple job rerun tracking from generic parent_job_id by adding:

  • rerun_from_job_id: references the parent job when rerunning
  • run_group_id: groups all jobs in a rerun chain (set to first job's ID)

Changes:

  • Add columns to jobs table with indexes
  • Add validation: both fields must be provided together or neither
  • Update get_ancestor_job_ids() to filter by run_group_id for safety
  • Update Session.get_or_create_job() to pass both fields when rerunning

@ilongin ilongin linked an issue Dec 15, 2025 that may be closed by this pull request
@ilongin ilongin marked this pull request as draft December 15, 2025 02:56
@cloudflare-workers-and-pages
Copy link

cloudflare-workers-and-pages bot commented Dec 15, 2025

Deploying datachain with  Cloudflare Pages  Cloudflare Pages

Latest commit: 3f8b982
Status: ✅  Deploy successful!
Preview URL: https://a9aceeef.datachain-2g6.pages.dev
Branch Preview URL: https://ilongin-1511-new-job-referen.datachain-2g6.pages.dev

View logs

@codecov
Copy link

codecov bot commented Dec 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@ilongin ilongin changed the title New field for job reference with checkpoints Add dedicated fields for job reruns: run_group_id and rerun_from_job_id Dec 17, 2025
@ilongin ilongin marked this pull request as ready for review December 17, 2025 14:31
Copy link
Contributor

@amritghimire amritghimire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart from the discussions about the naming. I have no strong opinion about any.

@amritghimire
Copy link
Contributor

@ilongin Please check the test cases too

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces dedicated fields for tracking job rerun relationships to replace the generic parent_job_id field. Two new fields are added: rerun_from_job_id (references the immediate parent job in a rerun chain) and run_group_id (groups all jobs that belong to the same rerun lineage).

Key changes:

  • Database schema adds rerun_from_job_id and run_group_id columns with indexes to the jobs table
  • Validation logic ensures both fields are provided together when creating rerun jobs
  • The get_ancestor_job_ids() function now filters by run_group_id for safety when traversing rerun chains

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/datachain/data_storage/metastore.py Adds new columns and indexes to jobs table; implements validation for rerun fields; updates get_ancestor_job_ids() to use rerun_from_job_id and filter by run_group_id
src/datachain/job.py Adds rerun_from_job_id and run_group_id fields to the Job dataclass and updates the parse method
src/datachain/query/session.py Updates get_or_create_job() to pass rerun_from_job_id and run_group_id when continuing from a previous job
src/datachain/lib/dc/datachain.py Updates checkpoint resolution to use rerun_from_job_id instead of parent_job_id
tests/unit/test_job_management.py Updates test assertion to verify rerun_from_job_id instead of parent_job_id
tests/unit/lib/test_checkpoints.py Updates test to pass both rerun_from_job_id and run_group_id when creating a rerun job
tests/func/test_metastore.py Updates test_get_ancestor_job_ids to use new rerun fields instead of parent_job_id

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1778 to 1781
"""
Creates a new job.
Returns the job id.
"""
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring should be expanded to document the new parameters rerun_from_job_id and run_group_id, especially their relationship and the validation requirements. Consider adding documentation that explains:

  • rerun_from_job_id: The ID of the job that this job is rerunning from (None for initial jobs)
  • run_group_id: The ID that groups all jobs in a rerun chain (automatically set to the job's own ID for initial jobs, must be provided when rerun_from_job_id is set)
  • The validation requirement that both must be provided together or neither

Copilot uses AI. Check for mistakes.
job2 = session2.get_or_create_job()

assert job2.parent_job_id == job1.id
assert job2.rerun_from_job_id == job1.id
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test verifies that rerun_from_job_id is set correctly but doesn't verify that run_group_id is also set properly. Since these two fields must be provided together (as enforced by the validation in create_job()), the test should also assert that job2.run_group_id == job1.run_group_id (or more specifically, that job2.run_group_id == job1.id since the first job in a chain sets its own ID as the group ID).

Copilot uses AI. Check for mistakes.
Comment on lines +1787 to +1794
assert run_group_id is not None, (
"run_group_id must be provided when rerun_from_job_id is set"
)
else:
# First job: run_group_id should not be provided (we set it here)
assert run_group_id is None, (
"run_group_id should not be provided when rerun_from_job_id is not set"
)
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation uses assertions which will be removed in optimized Python code (when running with -O flag). For production code, consider using explicit if statements with ValueError or AssertionError exceptions instead. This ensures the validation always runs regardless of Python optimization settings, which is important for maintaining data integrity in the jobs table.

Suggested change
assert run_group_id is not None, (
"run_group_id must be provided when rerun_from_job_id is set"
)
else:
# First job: run_group_id should not be provided (we set it here)
assert run_group_id is None, (
"run_group_id should not be provided when rerun_from_job_id is not set"
)
if run_group_id is None:
raise ValueError(
"run_group_id must be provided when rerun_from_job_id is set"
)
else:
# First job: run_group_id should not be provided (we set it here)
if run_group_id is not None:
raise ValueError(
"run_group_id should not be provided when rerun_from_job_id is not set"
)

Copilot uses AI. Check for mistakes.
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"

# try to find the parent job
# try to find the parent job for checkpoint/rerun chain
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "try to find the parent job for checkpoint/rerun chain" but the code now uses rerun_from_job_id instead of parent_job_id. Consider updating the comment to "try to find the previous job in the checkpoint/rerun chain" to better reflect the new terminology and avoid confusion.

Suggested change
# try to find the parent job for checkpoint/rerun chain
# try to find the previous job in the checkpoint/rerun chain

Copilot uses AI. Check for mistakes.
Comment on lines +2122 to +2127
& (
self._jobs.c.run_group_id
== cast(
ancestors_cte.c.run_group_id, self._jobs.c.run_group_id.type
)
), # Safety: only traverse within same run group
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safety check that filters by run_group_id could have issues with NULL values from legacy jobs created before this PR. In SQL, NULL == NULL evaluates to NULL (not TRUE), which means the recursive traversal might not work correctly for jobs with NULL run_group_id values.

Consider adding a check at the beginning of the function to handle the case where the starting job has a NULL run_group_id, or document that this function requires jobs to have non-NULL run_group_id values. For example, you could either:

  1. Return an empty list if the initial job's run_group_id is NULL
  2. Skip the run_group_id filter if the initial job's run_group_id is NULL (though this would reduce safety)
  3. Use a coalesce or IS NOT DISTINCT FROM comparison that handles NULLs properly
Suggested change
& (
self._jobs.c.run_group_id
== cast(
ancestors_cte.c.run_group_id, self._jobs.c.run_group_id.type
)
), # Safety: only traverse within same run group
& self._jobs.c.run_group_id.isnot_distinct_from(
cast(
ancestors_cte.c.run_group_id,
self._jobs.c.run_group_id.type,
)
), # Safety: only traverse within same run group (handles NULLs)

Copilot uses AI. Check for mistakes.
@ilongin ilongin merged commit 2fc30a4 into main Dec 31, 2025
36 checks passed
@ilongin ilongin deleted the ilongin/1511-new-job-reference-field-for-checkpoints branch December 31, 2025 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid using parent_job_id for reference to previous job in checkpoints

3 participants