Add dedicated fields for job reruns: `run_group_id` and `rerun_from_job_id` by ilongin · Pull Request #1512 · datachain-ai/datachain

ilongin · 2025-12-15T02:56:44Z

Decouple job rerun tracking from generic parent_job_id by adding:

rerun_from_job_id: references the parent job when rerunning
run_group_id: groups all jobs in a rerun chain (set to first job's ID)

Changes:

Add columns to jobs table with indexes
Add validation: both fields must be provided together or neither
Update get_ancestor_job_ids() to filter by run_group_id for safety
Update Session.get_or_create_job() to pass both fields when rerunning
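
To make the relationship between the two fields concrete, here is a minimal sketch of how a rerun chain might be populated. `JobRecord` and `new_job` are hypothetical stand-ins invented for illustration, not the actual datachain classes; the only behavior taken from the description above is that the first job's ID becomes the group ID and every rerun carries both fields.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical stand-in for a row in the jobs table."""

    id: str
    rerun_from_job_id: str | None = None
    run_group_id: str | None = None


def new_job(previous: JobRecord | None = None) -> JobRecord:
    """Create the first job of a chain, or a rerun of `previous`."""
    job_id = str(uuid.uuid4())
    if previous is None:
        # First job in a chain: the group ID is the job's own ID.
        return JobRecord(id=job_id, run_group_id=job_id)
    # Rerun: point at the previous job and inherit its group.
    return JobRecord(
        id=job_id,
        rerun_from_job_id=previous.id,
        run_group_id=previous.run_group_id,
    )


first = new_job()
second = new_job(first)
third = new_job(second)
```

In this sketch `third.rerun_from_job_id` is `second.id`, while all three jobs share `first.id` as their `run_group_id`.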

cloudflare-workers-and-pages · 2025-12-15T02:58:07Z

Deploying datachain with Cloudflare Pages

Latest commit:	`3f8b982`
Status:	✅ Deploy successful!
Preview URL:	https://a9aceeef.datachain-2g6.pages.dev
Branch Preview URL:	https://ilongin-1511-new-job-referen.datachain-2g6.pages.dev


codecov · 2025-12-15T11:22:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.



amritghimire

LGTM apart from the discussions about the naming. I have no strong opinion about any.

amritghimire · 2025-12-18T10:52:53Z

@ilongin Please check the test cases too

Copilot

Pull request overview

This PR introduces dedicated fields for tracking job rerun relationships to replace the generic parent_job_id field. Two new fields are added: rerun_from_job_id (references the immediate parent job in a rerun chain) and run_group_id (groups all jobs that belong to the same rerun lineage).

Key changes:

Database schema adds rerun_from_job_id and run_group_id columns with indexes to the jobs table
Validation logic ensures both fields are provided together when creating rerun jobs
The get_ancestor_job_ids() function now filters by run_group_id for safety when traversing rerun chains
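
The ancestor traversal described above can be sketched as a recursive CTE. This is an illustrative approximation using the standard-library `sqlite3` module rather than the SQLAlchemy code in `metastore.py`; the table layout, index names, sample rows, the `ancestor_ids` helper, and the nearest-first ordering are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (
        id TEXT PRIMARY KEY,
        rerun_from_job_id TEXT,
        run_group_id TEXT
    );
    -- Hypothetical index names; the PR only says both columns are indexed.
    CREATE INDEX idx_jobs_rerun_from ON jobs (rerun_from_job_id);
    CREATE INDEX idx_jobs_run_group ON jobs (run_group_id);
    INSERT INTO jobs VALUES
        ('a', NULL, 'a'),  -- first job: group ID is its own ID
        ('b', 'a',  'a'),  -- rerun of a
        ('c', 'b',  'a'),  -- rerun of b
        ('x', NULL, 'x');  -- unrelated chain
    """
)

ANCESTORS_SQL = """
WITH RECURSIVE ancestors(id, rerun_from_job_id, run_group_id, depth) AS (
    SELECT id, rerun_from_job_id, run_group_id, 0 FROM jobs WHERE id = ?
    UNION ALL
    SELECT j.id, j.rerun_from_job_id, j.run_group_id, a.depth + 1
    FROM jobs AS j
    JOIN ancestors AS a
      ON j.id = a.rerun_from_job_id
     AND j.run_group_id = a.run_group_id  -- stay within the same run group
)
SELECT id FROM ancestors WHERE depth > 0 ORDER BY depth
"""


def ancestor_ids(job_id):
    """Return ancestor job IDs, nearest ancestor first."""
    return [row[0] for row in conn.execute(ANCESTORS_SQL, (job_id,))]
```

The extra `run_group_id` condition means a corrupted `rerun_from_job_id` that points into another chain would stop the walk instead of leaking jobs from a different group.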

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| src/datachain/data_storage/metastore.py | Adds new columns and indexes to jobs table; implements validation for rerun fields; updates `get_ancestor_job_ids()` to use `rerun_from_job_id` and filter by `run_group_id` |
| src/datachain/job.py | Adds `rerun_from_job_id` and `run_group_id` fields to the Job dataclass and updates the parse method |
| src/datachain/query/session.py | Updates `get_or_create_job()` to pass `rerun_from_job_id` and `run_group_id` when continuing from a previous job |
| src/datachain/lib/dc/datachain.py | Updates checkpoint resolution to use `rerun_from_job_id` instead of `parent_job_id` |
| tests/unit/test_job_management.py | Updates test assertion to verify `rerun_from_job_id` instead of `parent_job_id` |
| tests/unit/lib/test_checkpoints.py | Updates test to pass both `rerun_from_job_id` and `run_group_id` when creating a rerun job |
| tests/func/test_metastore.py | Updates `test_get_ancestor_job_ids` to use new rerun fields instead of `parent_job_id` |


Copilot · 2025-12-18T10:58:29Z

src/datachain/data_storage/metastore.py

        """
        Creates a new job.
        Returns the job id.
        """


The docstring should be expanded to document the new parameters rerun_from_job_id and run_group_id, especially their relationship and the validation requirements. Consider adding documentation that explains:

rerun_from_job_id: The ID of the job that this job is rerunning from (None for initial jobs)

run_group_id: The ID that groups all jobs in a rerun chain (automatically set to the job's own ID for initial jobs, must be provided when rerun_from_job_id is set)

The validation requirement that both must be provided together or neither

Copilot · 2025-12-18T10:58:30Z

tests/unit/test_job_management.py

    job2 = session2.get_or_create_job()

-    assert job2.parent_job_id == job1.id
+    assert job2.rerun_from_job_id == job1.id


The test verifies that rerun_from_job_id is set correctly but doesn't verify that run_group_id is also set properly. Since these two fields must be provided together (as enforced by the validation in create_job()), the test should also assert that job2.run_group_id == job1.run_group_id (or more specifically, that job2.run_group_id == job1.id since the first job in a chain sets its own ID as the group ID).
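
A sketch of what the stronger check suggested above could look like. The `assert_rerun_link` helper and the `SimpleNamespace` objects are hypothetical stand-ins for the real `Job` instances returned by `get_or_create_job()`; in the actual test the two extra assertions would simply follow the existing one.

```python
from types import SimpleNamespace


def assert_rerun_link(first_job, rerun_job):
    """Check both rerun fields on a job created as a rerun of `first_job`."""
    assert rerun_job.rerun_from_job_id == first_job.id
    assert rerun_job.run_group_id == first_job.run_group_id
    if first_job.rerun_from_job_id is None:
        # The first job in a chain uses its own ID as the group ID.
        assert rerun_job.run_group_id == first_job.id


# Stand-in objects shaped like the jobs in the test above.
job1 = SimpleNamespace(id="job-1", rerun_from_job_id=None, run_group_id="job-1")
job2 = SimpleNamespace(id="job-2", rerun_from_job_id="job-1", run_group_id="job-1")
assert_rerun_link(job1, job2)
```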

Copilot · 2025-12-18T10:58:30Z

src/datachain/data_storage/metastore.py

+            assert run_group_id is not None, (
+                "run_group_id must be provided when rerun_from_job_id is set"
+            )
+        else:
+            # First job: run_group_id should not be provided (we set it here)
+            assert run_group_id is None, (
+                "run_group_id should not be provided when rerun_from_job_id is not set"
+            )


The validation uses assertions which will be removed in optimized Python code (when running with -O flag). For production code, consider using explicit if statements with ValueError or AssertionError exceptions instead. This ensures the validation always runs regardless of Python optimization settings, which is important for maintaining data integrity in the jobs table.

Suggested change

-            assert run_group_id is not None, (
-                "run_group_id must be provided when rerun_from_job_id is set"
-            )
-        else:
-            # First job: run_group_id should not be provided (we set it here)
-            assert run_group_id is None, (
-                "run_group_id should not be provided when rerun_from_job_id is not set"
-            )
+            if run_group_id is None:
+                raise ValueError(
+                    "run_group_id must be provided when rerun_from_job_id is set"
+                )
+        else:
+            # First job: run_group_id should not be provided (we set it here)
+            if run_group_id is not None:
+                raise ValueError(
+                    "run_group_id should not be provided when rerun_from_job_id is not set"
+                )
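
For reference, the validation can be isolated into a small function that behaves the same with or without `python -O`, since it does not rely on `assert`. This is only a sketch: `resolve_run_group_id` is a hypothetical helper, not a function in `metastore.py`, and it assumes the caller already knows the new job's ID.

```python
from __future__ import annotations


def resolve_run_group_id(
    job_id: str,
    rerun_from_job_id: str | None,
    run_group_id: str | None,
) -> str:
    """Validate the rerun fields and return the group ID to store."""
    if rerun_from_job_id is not None:
        # Rerun: the caller must say which group the job belongs to.
        if run_group_id is None:
            raise ValueError(
                "run_group_id must be provided when rerun_from_job_id is set"
            )
        return run_group_id
    # First job: run_group_id should not be provided (it is set here).
    if run_group_id is not None:
        raise ValueError(
            "run_group_id should not be provided when rerun_from_job_id is not set"
        )
    return job_id
```

Raising `ValueError` also lets callers catch the misuse explicitly, which is harder to justify with `AssertionError`.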

Copilot · 2025-12-18T10:58:30Z

src/datachain/query/session.py

            python_version = f"{sys.version_info.major}.{sys.version_info.minor}"

-            # try to find the parent job
+            # try to find the parent job for checkpoint/rerun chain


The comment says "try to find the parent job for checkpoint/rerun chain" but the code now uses rerun_from_job_id instead of parent_job_id. Consider updating the comment to "try to find the previous job in the checkpoint/rerun chain" to better reflect the new terminology and avoid confusion.

Suggested change

-            # try to find the parent job for checkpoint/rerun chain
+            # try to find the previous job in the checkpoint/rerun chain

Copilot · 2025-12-18T10:58:31Z

src/datachain/data_storage/metastore.py

+                    & (
+                        self._jobs.c.run_group_id
+                        == cast(
+                            ancestors_cte.c.run_group_id, self._jobs.c.run_group_id.type
+                        )
+                    ),  # Safety: only traverse within same run group


The safety check that filters by run_group_id could have issues with NULL values from legacy jobs created before this PR. In SQL, NULL = NULL evaluates to NULL (not TRUE), so the join condition never matches for jobs with NULL run_group_id values and the recursive traversal would stop without returning their ancestors.

Consider adding a check at the beginning of the function to handle the case where the starting job has a NULL run_group_id, or document that this function requires jobs to have non-NULL run_group_id values. For example, you could either:

Return an empty list if the initial job's run_group_id is NULL

Skip the run_group_id filter if the initial job's run_group_id is NULL (though this would reduce safety)

Use a coalesce or IS NOT DISTINCT FROM comparison that handles NULLs properly

Suggested change

-                    & (
-                        self._jobs.c.run_group_id
-                        == cast(
-                            ancestors_cte.c.run_group_id, self._jobs.c.run_group_id.type
-                        )
-                    ),  # Safety: only traverse within same run group
+                    & self._jobs.c.run_group_id.isnot_distinct_from(
+                        cast(
+                            ancestors_cte.c.run_group_id,
+                            self._jobs.c.run_group_id.type,
+                        )
+                    ),  # Safety: only traverse within same run group (handles NULLs)
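
The NULL behavior is easy to reproduce outside the metastore. The sketch below uses the standard-library `sqlite3` module and SQLite's `IS` operator as the NULL-safe comparison; the rows are hypothetical jobs whose `run_group_id` is NULL but whose rerun link is set, and `parent_ids` is a helper written only for this demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Plain equality against NULL yields NULL; IS yields a real boolean.
plain_eq, null_safe_eq = conn.execute(
    "SELECT NULL = NULL, NULL IS NULL"
).fetchone()

conn.executescript(
    """
    CREATE TABLE jobs (
        id TEXT PRIMARY KEY,
        rerun_from_job_id TEXT,
        run_group_id TEXT
    );
    -- Hypothetical rows with no run_group_id.
    INSERT INTO jobs VALUES ('old1', NULL, NULL), ('old2', 'old1', NULL);
    """
)


def parent_ids(job_id, null_safe):
    """Look up the direct parent using either '=' or the NULL-safe 'IS'."""
    op = "IS" if null_safe else "="
    sql = f"""
        SELECT p.id
        FROM jobs AS c
        JOIN jobs AS p
          ON p.id = c.rerun_from_job_id
         AND p.run_group_id {op} c.run_group_id
        WHERE c.id = ?
    """
    return [row[0] for row in conn.execute(sql, (job_id,))]
```

With plain `=` the parent of `old2` is not found, while the NULL-safe comparison finds `old1`. Whether such rows should be traversed at all is the design question raised in the options above.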


introducing new field for job reference with checkpoints

c369f4e

ilongin linked an issue Dec 15, 2025 that may be closed by this pull request

Avoid using parent_job_id for reference to previous job in checkpoints #1511

Closed

ilongin marked this pull request as draft December 15, 2025 02:56

fixing test

54615cb

ilongin added 2 commits December 17, 2025 10:32

Merge branch 'main' into ilongin/1511-new-job-reference-field-for-che…

54076e7

…ckpoints

adding run_group_id

9826083

ilongin changed the title ~~New field for job reference with checkpoints~~ Add dedicated fields for job reruns: run_group_id and rerun_from_job_id Dec 17, 2025

ilongin marked this pull request as ready for review December 17, 2025 14:31

This was referenced Dec 18, 2025

Use run_group_id + hash for UDF input table name #1521

Closed

UDF Checkpoints #1422

Open

amritghimire approved these changes Dec 18, 2025


amritghimire requested a review from Copilot December 18, 2025 10:52

Copilot started reviewing on behalf of amritghimire December 18, 2025 10:52

Copilot AI reviewed Dec 18, 2025


shcheklein approved these changes Dec 18, 2025


ivan and others added 4 commits December 22, 2025 23:31

Merge branch 'main' into ilongin/1511-new-job-reference-field-for-che…

2d94f54

…ckpoints

Merge branch 'main' into ilongin/1511-new-job-reference-field-for-che…

fc60f82

…ckpoints

fixing typing issue

a2dead6

Merge branch 'main' into ilongin/1511-new-job-reference-field-for-che…

3f8b982

…ckpoints

ilongin merged commit 2fc30a4 into main Dec 31, 2025
36 checks passed

ilongin deleted the ilongin/1511-new-job-reference-field-for-checkpoints branch December 31, 2025 10:52

ilongin mentioned this pull request Jan 2, 2026

Bump local DB schema version #1532

Closed

ilongin merged 8 commits into main from ilongin/1511-new-job-reference-field-for-checkpoints

3 participants
