@garrett361 (Owner) commented Jun 27, 2025

Changes the main_process_first context manager that wraps dataset creation to local_main_process_first, so that the local main process on each node builds the dataset first. This is more appropriate for multi-node use cases and also works for single-node runs.
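To make the difference concrete, here is a minimal sketch of which processes enter the guarded block before the others under each context manager. It is plain Python rather than the PR's actual code: the helper name `ranks_that_go_first` is hypothetical, and it assumes ranks are laid out contiguously per node, with the semantics being those of the Hugging Face Accelerate context managers of the same names.

```python
def ranks_that_go_first(num_nodes: int, procs_per_node: int, local: bool) -> list[int]:
    """Return the global ranks that run the guarded block before the rest.

    Assumption: ranks are contiguous per node, so the process with
    local rank 0 on node n has global rank n * procs_per_node.
    """
    if local:
        # local_main_process_first: local rank 0 on every node goes first.
        return [node * procs_per_node for node in range(num_nodes)]
    # main_process_first: only global rank 0 goes first.
    return [0]


# Two nodes with four processes each.
print(ranks_that_go_first(2, 4, local=False))  # [0]
print(ranks_that_go_first(2, 4, local=True))   # [0, 4]
```

With a per-node cache, the second behaviour means every node has one process populating its own cache before its peers read from it. On a single node both variants pick out the same process.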

| # | PR | Title |
|---|----|-------|
| 1 | #15 | padding-free |
| 2 | #16 | clean_checkpoints_at_end |
| 3 | #17 | final_lr_ratio |
| 4 | #18 | add_seed_and_date_to_run_name |
| 5 | #19 | additional_model_arguments |
| 6 | #20 | sync_each_batch=True grad acc |
| 7 | #21 | no grad acc averaging for sum losses |
| 8 | #22 | extra reporting |
| 9 | #23 (this PR) | local_main_process_first when building dataset |

@garrett361 changed the title from "[9/9] WIP: local_main_process_first when building dataset" to "[9/9] local_main_process_first when building dataset" on Jun 27, 2025
@garrett361 force-pushed the padding-free-squashing-8 branch from 62edcb7 to 73b0c43 on June 27, 2025 19:23
@garrett361 force-pushed the padding-free-squashing-9 branch from d21578f to 3463f5f on June 27, 2025 19:23
@garrett361 force-pushed the padding-free-squashing-8 branch from 73b0c43 to f0227b6 on June 27, 2025 20:22
@garrett361 force-pushed the padding-free-squashing-9 branch from 3463f5f to 2dc198a on June 27, 2025 20:22
@garrett361 force-pushed the padding-free-squashing-8 branch from f0227b6 to 11b51fd on June 27, 2025 20:48
@garrett361 force-pushed the padding-free-squashing-9 branch from 2dc198a to 16b7b19 on June 27, 2025 20:48
@garrett361 force-pushed the padding-free-squashing-8 branch from 11b51fd to 1dee77e on June 27, 2025 20:54
@garrett361 force-pushed the padding-free-squashing-9 branch from 16b7b19 to 62cefe3 on June 27, 2025 20:54
@garrett361 force-pushed the padding-free-squashing-8 branch from 1dee77e to f0df295 on June 27, 2025 21:17
@garrett361 force-pushed the padding-free-squashing-9 branch from 62cefe3 to eb01382 on June 27, 2025 21:17
@garrett361 force-pushed the padding-free-squashing-8 branch from f0df295 to 4bbf9f0 on June 27, 2025 21:19
@garrett361 force-pushed the padding-free-squashing-9 branch from eb01382 to b1d9ee9 on June 27, 2025 21:19
@garrett361 force-pushed the padding-free-squashing-8 branch from 4bbf9f0 to 0772599 on June 28, 2025 01:41
@garrett361 force-pushed the padding-free-squashing-9 branch from b1d9ee9 to 95014d7 on June 28, 2025 01:41
@garrett361 force-pushed the padding-free-squashing-8 branch from 0772599 to a673cd1 on June 28, 2025 01:50
@garrett361 force-pushed the padding-free-squashing-9 branch from 95014d7 to b7aaa46 on June 28, 2025 01:50
prev-branch: padding-free-squashing-8
@garrett361 force-pushed the padding-free-squashing-8 branch from a673cd1 to ef0da35 on June 28, 2025 02:23
@garrett361 force-pushed the padding-free-squashing-9 branch from b7aaa46 to 8fe00b0 on June 28, 2025 02:23
@garrett361 marked this pull request as draft on July 1, 2025 16:13
@garrett361 (Owner, Author)

Converted to draft over concerns that local_main_process_first can cause errors in multi-node scenarios where all nodes share the same HF cache. In that case there appear to be race conditions in which multiple processes (one local main process per node) attempt to delete the same cache files, leading to errors.
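The suspected failure mode can be illustrated with a toy example. This is not the Hugging Face datasets code path, only a sketch of the general pattern: two processes, standing in for one local main process per node, each try to remove the same file from a shared directory. The file name and the `delete_cache_file` helper are hypothetical, and the two calls run sequentially here so the outcome is deterministic, whereas a real race would be concurrent.

```python
import os
import tempfile


def delete_cache_file(path: str) -> str:
    """Hypothetical cleanup step: remove a cache file and report the outcome."""
    try:
        os.remove(path)
        return "deleted"
    except FileNotFoundError:
        # Another process already removed the file. Without this handler
        # the second caller would crash with FileNotFoundError.
        return "already gone"


with tempfile.TemporaryDirectory() as shared_cache:
    path = os.path.join(shared_cache, "tmp_cache_file.arrow")
    open(path, "w").close()  # create an empty stand-in cache file
    # One cleanup call per node, all pointed at the same shared cache.
    results = [delete_cache_file(path) for _node in range(2)]

print(results)  # ['deleted', 'already gone']
```

Under main_process_first only one process would run the cleanup, so the second call would never happen. Under local_main_process_first with a shared cache, every node contributes one such caller.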

@garrett361 closed this on Jul 24, 2025