feat: extra SFT reporting #799
Merged
prev-branch: padding-free-squashing-7
hamishivi approved these changes on Jul 21, 2025
garrett361 added a commit to garrett361/open-instruct that referenced this pull request on Jul 23, 2025
* Update oe-eval.sh to set a default timeout of 48h. (allenai#789) * Updated configs to support changes. (allenai#790) * Add benchmark scripts (allenai#786) * Added scripts to run benchmarks. * Removed install script. * Added install script back. * Add remap verifier (allenai#773) * first pass remap verifier * make judge json parsing a little more robust * typoooooooo * typoooooooo * fix logic... * clean logging naming up * Ran the linter. (allenai#792) * fix the URL for code api setup (allenai#791) Co-authored-by: Michael Noukhovitch <[email protected]> * Add nltk setup to uv dockerfile (allenai#785) * add punk tokenizer * fix up command * Switches the actors to use the Ray queue. (allenai#784) * Made changes. * Switched to use ray.util.queue.Queue instead of a custom RayQueue class. * Now, only handles new version. * Updated benchmark_generators.py and test_grpo_fast.py. * CLeaned up code from Claude. * training_step defaults to None. * Added an info dataclass to replace the tuple. * Removes assumption that queries_prompt_Q and inference_results_Q are in sync by moving queries_prompt_Q to be a map. * CLeaned up benchmark * Added code to split batch sizes. * Removed benchmark scripts, which are now in a separate PR. * Now, we create all Ray queues in main, and pass them in as appropriate. * Removed changes * Test changes. * Linter passes * Added tests. * Now, we index with the dataset indices. * Checks and tests pass. * Ran linter * Added benchmark scripts back. Whoops. * Set new default value for num_samples * Updates the benchmark script (allenai#795) * Set new default value for num_samples * Now run N batches at once * different batch size * Fix pack length * Fix pack length * Fix wasted compute % (was accidentally multiplying by 100), and fix num rollouts (was referencing the wrong variable). * Now, we save benchmark results to CSV. * Now show a percentage for time spent generating. * Updated benchmark saving code. * Fixed syntax error. 
* Fixed benchmark * Fixed timing code. * Removed changes to vllm_utils3.py. * Now, we actually write the data to disk> * Bigger batch * Modified benchmark * Undid changes to benchmark script. * Temp change * Undid changes to benchmark script. * install nginx in uv (allenai#793) it was only being installed in regular Dockerfile Co-authored-by: Michael Noukhovitch <[email protected]> Co-authored-by: Saurabh Shah <[email protected]> * allow passing local models, bubble up dataset cache errors (allenai#797) Co-authored-by: Michael Noukhovitch <[email protected]> * binary reward for code (allenai#798) * binary reward for code * style * binary code reward flag -> pass rate reward threshold * Now, we run individual prompts through the queue. (allenai#796) * Now, we run individual prompts through the queue. * Fixed issues. * Ran linter * Fixed linter errors. * COde lints. * Test passes. * Ran linter. * Ensures that we send single prompts as requests. * Now, code lints. * Cleaned up code. * Fixes test. * Linter passes. * Cleaned test up. * Removed redundant comments. * Adds flashinfer dep. (allenai#800) * Adds flashinfer dep. * Now, open_instruct builds even on mac. * Updated install instructions to add flash-infer. * Now, we set flashinfer as the default attention backend. * Added flashinfer to the base dockerfile. * Ran linter. * Removed extra changes to mason.py. * Undid changes to uv.lock. * Updated requirements.txt * Updated flash-attn version. 
--------- Co-authored-by: Hamish Ivison <[email protected]> * new beaker names (allenai#803) * Remove Unused DPO Function (allenai#794) * delete function Signed-off-by: Yu Chin Fabian Lim <[email protected]> * Update open_instruct/dataset_transformation.py --------- Signed-off-by: Yu Chin Fabian Lim <[email protected]> Co-authored-by: Hamish Ivison <[email protected]> * extra reporting (allenai#799) prev-branch: padding-free-squashing-7 Co-authored-by: Hamish Ivison <[email protected]> * Revert "Now, we run individual prompts through the queue. (allenai#796)" (allenai#804) This reverts commit 541058c. * Fix misnamed variables. (allenai#808) * Fix misnamed variables. * Ran linter. * Fix broken syntax. (allenai#809) Co-authored-by: Hamish Ivison <[email protected]> * Add new olmo chat templates, and improve data mixing/tokenization (allenai#765) Adds new olmo-core-compatible chat templates. Includes: * New olmo template with support for function-calling. Includes a basic hard-coded system prompt, and appends "You do not have access to any functions" to any SFT examples that do not include functions. * Thinker version of the above template, has <think> included in the generation prompt * R1-style thinker template These 3 templates mirror our current Tulu templates Also includes some necessary changes to the --add_bos logic, to handle the new chat template which does not have a bos token. 
Includes a few other QoL fixes: * Fixes a bug in the olmocore tokenization script re: label mask * Logs dataset-level statistics during data mixing and tokenization * Supports easy upsampling during data mixing * Fixes from last PR (allenai#810) * fix up my (jacob's) slightly broken pr --------- Co-authored-by: jacob-morrison <[email protected]> * Delete run_repro.sh (allenai#813) * Fix disk space error on image creation (allenai#814) * remove moar things * create on pr * dont create on pr * use upstream stats --------- Signed-off-by: Yu Chin Fabian Lim <[email protected]> Co-authored-by: Finbarr Timbers <[email protected]> Co-authored-by: Hamish Ivison <[email protected]> Co-authored-by: Michael <[email protected]> Co-authored-by: Michael Noukhovitch <[email protected]> Co-authored-by: Saurabh Shah <[email protected]> Co-authored-by: Yu Chin Fabian Lim <[email protected]> Co-authored-by: Jacob Morrison <[email protected]>
This was referenced Jul 24, 2025
This PR adds additional reporting and a `--verbose` flag for additional prints during SFT.

The main changes are for SFT with `sum` loss. This PR now logs three different losses:

These are all useful for different cases. E.g. 3 is useful because it's more comparable between runs on different datasets and/or with different global batch sizes. Reporting for `mean` loss is unchanged.

Additional quantities related to token counts and memory statistics are also reported.
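
To illustrate the comparability point, here is a minimal sketch (not the code from this PR) of why a summed loss is hard to compare across batch sizes while a token-normalized one is not. The function name `summarize_sum_loss` and the dictionary keys are hypothetical, and the assumption that one of the logged losses is a per-token average is mine, not something the description confirms.

```python
import math


def summarize_sum_loss(per_token_losses):
    """Return the summed loss, the token count, and a per-token average.

    Hypothetical helper for illustration only; the names and the choice of
    normalization are assumptions, not the quantities logged by the PR.
    """
    n_tokens = len(per_token_losses)
    if n_tokens == 0:
        raise ValueError("need at least one token to normalize the loss")
    total = math.fsum(per_token_losses)
    return {
        "sum_loss": total,
        "num_tokens": n_tokens,
        "per_token_loss": total / n_tokens,
    }


# Two made-up batches with identical per-token loss but different sizes,
# standing in for runs with different global batch sizes.
small = summarize_sum_loss([0.5] * 100)
large = summarize_sum_loss([0.5] * 400)

# The summed losses differ by the ratio of token counts,
# while the per-token losses are directly comparable.
print(small["sum_loss"], large["sum_loss"])
print(small["per_token_loss"], large["per_token_loss"])
```

With a `sum` reduction the raw value grows with the number of tokens in the batch, so a normalized quantity is what makes curves from different runs line up.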