Added CheckpointEvent model to track checkpoint events #1575

Merged
ilongin merged 2 commits into ilongin/1392-udf-checkpoints from ilongin/1574-checkpoint-events
Feb 4, 2026

Conversation

ilongin (Contributor) commented Feb 3, 2026

WIP

cloudflare-workers-and-pages bot commented Feb 3, 2026

Deploying datachain with Cloudflare Pages

Latest commit: 7d375da
Status: ✅  Deploy successful!
Preview URL: https://06e6b5aa.datachain-2g6.pages.dev
Branch Preview URL: https://ilongin-1574-checkpoint-even.datachain-2g6.pages.dev

ilongin changed the base branch from main to ilongin/1392-udf-checkpoints February 3, 2026 14:26
ilongin linked an issue Feb 3, 2026 that may be closed by this pull request
ilongin marked this pull request as draft February 3, 2026 14:27
codecov bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 97.32143% with 3 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/datachain/data_storage/metastore.py | 92.30% | 1 Missing and 2 partials ⚠️

ilongin marked this pull request as ready for review February 4, 2026 23:03
ilongin merged commit 0f61ea7 into ilongin/1392-udf-checkpoints Feb 4, 2026
33 checks passed
ilongin deleted the ilongin/1574-checkpoint-events branch February 4, 2026 23:04
ilongin added a commit that referenced this pull request Feb 15, 2026
* using session instead of catalog in udfstep

* refactoring job creation in datachain

* implementing first phase of UDF checkpoints

* refactoring

* changing udf table names

* adding checkpoint tests and fixing cleaning udf tables in test

* added udf checkpoint continue from partial results

* added udf generator logic and tests

* fixing logic

* fixing issues and tests

* refactoring tests

* refactoring

* refactoring

* refactoring

* refactoring udf table ownership logic

* refactoring

* refactoring tests

* fixing cast of recursive sql

* using has_table instead of checking metadata

* fixing tests

* fixing cleaning table and partition by

* fixing test

* fixing aggregator

* fixing hash collision

* refactoring and removing processed table

* fixing tests

* fixing tests

* returning

* updated coverage

* removed coverage sysmon

* refactoring checkpoint cleaning

* Remove cleanup_checkpoints functionality for separate PR

* fixing tests

* fixing tests

* added udf checkpoint docs

* refactoring

* fixing tests

* fix creating processed table even in reset mode

* added tests

* refactoring processed tracking for generators

* refactoring tests

* refactoring create_table method

* fix re-run when UDF output changes

* Update src/datachain/cli/commands/misc.py

Co-authored-by: Vladimir Rudnykh <dreadatour@gmail.com>

* fixing docs and some code parts

* refactoring

* returning sysmon

* renaming create_checkpoint method

* simplified logic

* removing batch_callback

* refactoring

* removing tracking_field

* fixing ancestor job id find

* refactor remove_checkpoint to accept only id

* removed comment

* refactoring creating table

* refactoring

* updated docs by removing parent verb

* adding staging suffix for table atomicity when doing copy

* break parent connection when reset flag is present

* fixing docs

* fixing docs and other small fixes

* fixing docs and other small fixes

* fixing comments

* discarding changes with garbage collecting method of cli

* moving list_tables function to tests util

* unifying prepare_row functions

* adding hash_input and hash_output as default args in apply method of UDFStep

* renaming sys_id to sys__processed_id

* removed not needed quote_schema from sqlite in removing tables for test

* fixing issue with incomplete inputs in generator

* added docs

* reorganizing tests

* var renaming

* added regression test for subtract

* make hash_callable not fail if unexpected callable is input

* disable checkpoints in threading / multiprocess

* added custom migration function for checkpoints

* renaming checkpointstable and removing not needed migration function

* fixing non-deterministic tests for CH

* fixing bug with continuing udf processing

* fixing test

* fixing docs

* removed not needed comments

* removed not needed flag

* removed not needed env var

* renamed env var

* reduced number of parallel

* added envs to env docs

* moved function to check concurrency for checkpoints from session to utils

* removed comment

* moving check if checkpoint is enabled because of concurrency from metastore to higher level code

* removed partial constraint

* removing test

* refactoring test

* returning old checkpoints table name

* refactoring input table name hash

* using group id for input table name in udf

* using pid and thread ownership to determine if checkpoints are enabled or not

* fixing test

* refactoring tests

* refactoring tests

* removing not needed conditions

* refactoring

* fixing comment

* refactoring

* fixing race condition

* added safe_copy_table

* refactoring copy_table methods

* continuing UDF if parent partial table is not found

* added try/catch of missing table

* refactor transaction context usage

* optimized query

* added thread lock

* updated docs with hashing limitations

* renaming function

* removed unrelated lint exception

* refactoring checkpoint tests

* fixing env vars and verbose comments

* adding runtime error

* refactoring

* removing name and job_aware to hash method of DataChain

* refactoring

* refactoring

* refactoring

* added logs

* fixing env vars

* refactoring tests

* removing not needed monkeypatch

* added more tests

* closing sqlite connections in test

* moving get_table to db specific implementation

* return get_table to db_engine

* added job_id to hash

* improved logging

* Added `CheckpointEvent` model to track checkpoint events (#1575)

* added new checkpoint event model

* added tests

* added prints

* added print only when it is second job

* removed not used var

* removed print

* fixing reading files on udf continue

* UDF checkpoint visibility (#1576)

* added add_udf method

* refactoring

* fixing udf stats

* refactoring checkpoint events

* fixing lint

* adding missing tests and fixing issues

---------

Co-authored-by: Vladimir Rudnykh <dreadatour@gmail.com>
Co-authored-by: ivan <ilongin@iterative.ai>

Development

Successfully merging this pull request may close these issues.

Add CheckpointEvent table for checkpoint debugging and visibility

2 participants