Add new olmo chat templates, and improve data mixing/tokenization #765

jacob-morrison · 2025-07-09T17:29:51Z

Adds new olmo-core-compatible chat templates. Includes:

New olmo template with support for function-calling. Includes a basic hard-coded system prompt, and appends "You do not have access to any functions" to any SFT examples that do not include functions.
Thinker version of the above template, has <think> included in the generation prompt
R1-style thinker template
These 3 templates mirror our current Tulu templates

Also includes some necessary changes to the --add_bos logic, to handle the new chat template which does not have a bos token.

Includes a few other QoL fixes:

Fixes a bug in the olmocore tokenization script re: label mask
Logs dataset-level statistics during data mixing and tokenization
Supports easy upsampling during data mixing

hamishivi

Could you also add maybe some docs or write something in the PR describing the chat templates with some details? e.g. how function works, system message, thinker vs non-thinker etc.

open_instruct/dataset_transformation.py

jacob-morrison · 2025-07-09T17:44:25Z

Could you also add maybe some docs or write something in the PR describing the chat templates with some details? e.g. how function works, system message, thinker vs non-thinker etc.

Yeah can do, I'll add it to the PR for now

open_instruct/dataset_transformation.py

soldni

General comments:

We should strip whitespaces from jinja templates.
It is weird to have system prompt backed in
I am not sure why we have different templates for thinking and tool use. One template plz.
Please include unit tests to verify that template does what you want to do.
Please include function to bake template into a tokenizer, and then run unit tests after that.
for templates, have a way to produce all special tokens needed.

open_instruct/dataset_transformation.py

hamishivi · 2025-07-09T19:57:21Z

@soldni

It is weird to have system prompt backed in

this is what existing e.g. qwen3 instruct models do. Doesn't it make sense to have this so that off-the-shelf generation behaves nicely, esp if we train with system prompts?

I am not sure why we have different templates for thinking and tool use. One template plz.

Lots of discussion for this on slack. The thinking prompt needs a different generation prompt (with <think>), merging the tool use and thinking templates beyond this is reasonable imo. Qwen relies on extra args during the chat template formatting to toggle this... we could swap to this but I don't have super strong feelings. It feels a bit weird.

Please include function to bake template into a tokenizer

This already happens in our scripts when training happens

open_instruct/dataset_transformation.py

hamishivi · 2025-07-14T17:18:12Z

Have we come to a decision on these @jacob-morrison ?

jacob-morrison · 2025-07-14T17:49:58Z

Have we come to a decision on these @jacob-morrison ?

yeah, my preference is we just have these 3 chat templates for now. it's simplest imo to treat these as drop-in replacements for our current templates, and so far they seem to be working fine in sft and dpo. if we just stick with these, anything else you think I should add to the pr?

open_instruct/dataset_transformation.py

open_instruct/model_utils.py

tyler-romero

LGTM!

* Write out label_mask instead of labels in olmocore tokenization script * Write out dataset token statistics during data mixing * Handle upsampling during data mixing

hamishivi · 2025-07-21T23:28:06Z

open_instruct/grpo_fast.py

+        if "olmo" in chat_template_name:
+            # New chat template has no bos token, and two eos tokens: <|im_end|> and <|endoftext|>
+            model_to_save.generation_config = GenerationConfig(
+                temperature=None,


should we set a default temp / top_p? And maybe its good to put this GenerationConfig as a constant in a util somewhere so we only have to change it in one place for all scripts?

yes to the constant -- I added a function to handle that.

default temp/top_p, probably? at least for release. not sure what good values would be though (esp. because this is for both thinking and not-thinking), so I'd say for now let's leave it as is and continue to use whatever settings we use during eval/etc

Sure, sounds good!

open_instruct/dataset_transformation.py

* Update oe-eval.sh to set a default timeout of 48h. (allenai#789) * Updated configs to support changes. (allenai#790) * Add benchmark scripts (allenai#786) * Added scripts to run benchmarks. * Removed install script. * Added install script back. * Add remap verifier (allenai#773) * first pass remap verifier * make judge json parsing a little more robust * typoooooooo * typoooooooo * fix logic... * clean logging naming up * Ran the linter. (allenai#792) * fix the URL for code api setup (allenai#791) Co-authored-by: Michael Noukhovitch <[email protected]> * Add nltk setup to uv dockerfile (allenai#785) * add punk tokenizer * fix up command * Switches the actors to use the Ray queue. (allenai#784) * Made changes. * Switched to use ray.util.queue.Queue instead of a custom RayQueue class. * Now, only handles new version. * Updated benchmark_generators.py and test_grpo_fast.py. * CLeaned up code from Claude. * training_step defaults to None. * Added an info dataclass to replace the tuple. * Removes assumption that queries_prompt_Q and inference_results_Q are in sync by moving queries_prompt_Q to be a map. * CLeaned up benchmark * Added code to split batch sizes. * Removed benchmark scripts, which are now in a separate PR. * Now, we create all Ray queues in main, and pass them in as appropriate. * Removed changes * Test changes. * Linter passes * Added tests. * Now, we index with the dataset indices. * Checks and tests pass. * Ran linter * Added benchmark scripts back. Whoops. * Set new default value for num_samples * Updates the benchmark script (allenai#795) * Set new default value for num_samples * Now run N batches at once * different batch size * Fix pack length * Fix pack length * Fix wasted compute % (was accidentally multiplying by 100), and fix num rollouts (was referencing the wrong variable). * Now, we save benchmark results to CSV. * Now show a percentage for time spent generating. * Updated benchmark saving code. * Fixed syntax error. * Fixed benchmark * Fixed timing code. * Removed changes to vllm_utils3.py. * Now, we actually write the data to disk> * Bigger batch * Modified benchmark * Undid changes to benchmark script. * Temp change * Undid changes to benchmark script. * install nginx in uv (allenai#793) it was only being installed in regular Dockerfile Co-authored-by: Michael Noukhovitch <[email protected]> Co-authored-by: Saurabh Shah <[email protected]> * allow passing local models, bubble up dataset cache errors (allenai#797) Co-authored-by: Michael Noukhovitch <[email protected]> * binary reward for code (allenai#798) * binary reward for code * style * binary code reward flag -> pass rate reward threshold * Now, we run individual prompts through the queue. (allenai#796) * Now, we run individual prompts through the queue. * Fixed issues. * Ran linter * Fixed linter errors. * COde lints. * Test passes. * Ran linter. * Ensures that we send single prompts as requests. * Now, code lints. * Cleaned up code. * Fixes test. * Linter passes. * Cleaned test up. * Removed redundant comments. * Adds flashinfer dep. (allenai#800) * Adds flashinfer dep. * Now, open_instruct builds even on mac. * Updated install instructions to add flash-infer. * Now, we set flashinfer as the default attention backend. * Added flashinfer to the base dockerfile. * Ran linter. * Removed extra changes to mason.py. * Undid changes to uv.lock. * Updated requirements.txt * Updated flash-attn version. --------- Co-authored-by: Hamish Ivison <[email protected]> * new beaker names (allenai#803) * Remove Unused DPO Function (allenai#794) * delete function Signed-off-by: Yu Chin Fabian Lim <[email protected]> * Update open_instruct/dataset_transformation.py --------- Signed-off-by: Yu Chin Fabian Lim <[email protected]> Co-authored-by: Hamish Ivison <[email protected]> * extra reporting (allenai#799) prev-branch: padding-free-squashing-7 Co-authored-by: Hamish Ivison <[email protected]> * Revert "Now, we run individual prompts through the queue. (allenai#796)" (allenai#804) This reverts commit 541058c. * Fix misnamed variables. (allenai#808) * Fix misnamed variables. * Ran linter. * Fix broken syntax. (allenai#809) Co-authored-by: Hamish Ivison <[email protected]> * Add new olmo chat templates, and improve data mixing/tokenization (allenai#765) Adds new olmo-core-compatible chat templates. Includes: * New olmo template with support for function-calling. Includes a basic hard-coded system prompt, and appends "You do not have access to any functions" to any SFT examples that do not include functions. * Thinker version of the above template, has <think> included in the generation prompt * R1-style thinker template These 3 templates mirror our current Tulu templates Also includes some necessary changes to the --add_bos logic, to handle the new chat template which does not have a bos token. Includes a few other QoL fixes: * Fixes a bug in the olmocore tokenization script re: label mask * Logs dataset-level statistics during data mixing and tokenization * Supports easy upsampling during data mixing * Fixes from last PR (allenai#810) * fix up my (jacob's) slightly broken pr --------- Co-authored-by: jacob-morrison <[email protected]> * Delete run_repro.sh (allenai#813) * Fix disk space error on image creation (allenai#814) * remove moar things * create on pr * dont create on pr * use upstream stats --------- Signed-off-by: Yu Chin Fabian Lim <[email protected]> Co-authored-by: Finbarr Timbers <[email protected]> Co-authored-by: Hamish Ivison <[email protected]> Co-authored-by: Michael <[email protected]> Co-authored-by: Michael Noukhovitch <[email protected]> Co-authored-by: Saurabh Shah <[email protected]> Co-authored-by: Yu Chin Fabian Lim <[email protected]> Co-authored-by: Jacob Morrison <[email protected]>

add new olmo chat templates

de1db41

jacob-morrison requested review from hamishivi and tyler-romero July 9, 2025 17:29

hamishivi reviewed Jul 9, 2025

View reviewed changes

open_instruct/dataset_transformation.py Outdated Show resolved Hide resolved

open_instruct/dataset_transformation.py Outdated Show resolved Hide resolved

natolambert reviewed Jul 9, 2025

View reviewed changes

open_instruct/dataset_transformation.py Show resolved Hide resolved

mnoukhov reviewed Jul 9, 2025

View reviewed changes

open_instruct/dataset_transformation.py Outdated Show resolved Hide resolved

soldni reviewed Jul 9, 2025

View reviewed changes

natolambert reviewed Jul 9, 2025

View reviewed changes

open_instruct/dataset_transformation.py Outdated Show resolved Hide resolved

update chat templates

78ac527

mnoukhov reviewed Jul 9, 2025

View reviewed changes

open_instruct/dataset_transformation.py Outdated Show resolved Hide resolved

jacob-morrison added 2 commits July 9, 2025 16:54

update

017285f

fix

2888331

jacob-morrison added 6 commits July 15, 2025 11:12

update where we save the generation config

905ae3f

update

a02f092

clean up

09dbb9c

clean

4eb8cc9

remove jinja white space

69d7fbd

Merge branch 'main' into olmo3-chat-templates

227eb31

tyler-romero reviewed Jul 16, 2025

View reviewed changes

open_instruct/dataset_transformation.py Show resolved Hide resolved

tyler-romero reviewed Jul 16, 2025

View reviewed changes

open_instruct/dataset_transformation.py Outdated Show resolved Hide resolved

tyler-romero reviewed Jul 16, 2025

View reviewed changes

open_instruct/model_utils.py Outdated Show resolved Hide resolved

jacob-morrison added 3 commits July 15, 2025 21:43

update

185aa4f

update

ed59b9a

update

5bc36e4

tyler-romero approved these changes Jul 21, 2025

View reviewed changes

Merge branch 'main' into olmo3-chat-templates

1b1c5f0

jacob-morrison and others added 2 commits July 21, 2025 16:11

update hard coded chat template

e1356af

Fix olmocore tokenization script label bug + a few QoL fixes (#749)

e8cc9f6

* Write out label_mask instead of labels in olmocore tokenization script * Write out dataset token statistics during data mixing * Handle upsampling during data mixing

jacob-morrison changed the title ~~Add new olmo chat templates~~ Add new olmo chat templates, and improve data mixing/tokenization Jul 21, 2025

hamishivi reviewed Jul 21, 2025

View reviewed changes

jacob-morrison added 2 commits July 21, 2025 16:31

fix chat template

d32ba77

update generation config bits

05b6bdb

hamishivi approved these changes Jul 21, 2025

View reviewed changes

jacob-morrison merged commit de8a14f into main Jul 21, 2025
3 checks passed

jacob-morrison deleted the olmo3-chat-templates branch July 21, 2025 23:45

hamishivi mentioned this pull request Jul 22, 2025

Fixes from last PR #810

Merged

This was referenced Jul 26, 2025

remove_non_tensor_columns garrett361/open-instruct#40

Merged

remove_non_tensor_columns #831

Closed

Add new olmo chat templates, and improve data mixing/tokenization #765

Add new olmo chat templates, and improve data mixing/tokenization #765

Uh oh!

Conversation

jacob-morrison commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hamishivi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jacob-morrison commented Jul 9, 2025

Uh oh!

Uh oh!

Uh oh!

soldni left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

hamishivi commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

hamishivi commented Jul 14, 2025

Uh oh!

jacob-morrison commented Jul 14, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tyler-romero left a comment

Choose a reason for hiding this comment

Uh oh!

hamishivi Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

jacob-morrison Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

hamishivi Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

jacob-morrison commented Jul 9, 2025 •

edited

Loading

hamishivi commented Jul 9, 2025 •

edited

Loading