Config for running data ablations #464

Merged · 67 commits from olmo7-ablations into main · Mar 21, 2024

Commits
50a7704  Makes R2 work, and adds an ablation config (dirkgr, Feb 23, 2024)
ae538ce  Merge branch 'main' into olmo7-ablations (dirkgr, Feb 23, 2024)
404ea30  Script for running ablations on LUMI (dirkgr, Feb 27, 2024)
005c406  It's no longer just s3. (dirkgr, Feb 27, 2024)
25a0f4f  Merge branch 'olmo7-ablations' of https://github.com/allenai/LLM into… (dirkgr, Feb 27, 2024)
399d33c  We now think the 1T checkpoint is better. (dirkgr, Feb 27, 2024)
6d993f3  Try the `spawn` start method (dirkgr, Feb 27, 2024)
a67053f  Revert "Try the `spawn` start method" (dirkgr, Feb 27, 2024)
80fba0c  Set start method right away (dirkgr, Feb 27, 2024)
3834c62  Merge branch 'main' into olmo7-ablations (dirkgr, Feb 28, 2024)
0cc7f20  Different seed, so we don't train on the same data twice (dirkgr, Feb 28, 2024)
d765e88  Merge remote-tracking branch 'origin/mmlu-downstream' into olmo7-abla… (dirkgr, Feb 28, 2024)
cdb6ad9  Config for MosaicML (dirkgr, Feb 28, 2024)
305d1a8  Mcli has changed its format (dirkgr, Feb 28, 2024)
e2d0631  Config tweaks (dirkgr, Feb 28, 2024)
251e89a  More mcli changes (dirkgr, Feb 28, 2024)
08c1fcb  Merge branch 'main' into olmo7-ablations (OyvindTafjord, Feb 28, 2024)
f979478  Update downstream tasks (OyvindTafjord, Feb 28, 2024)
0074545  We also changed our formats. (dirkgr, Feb 28, 2024)
c1d664b  It's not my day today. (dirkgr, Feb 28, 2024)
08df810  Changelog (dirkgr, Feb 28, 2024)
0b7c26e  isort (dirkgr, Feb 28, 2024)
1961c25  Huggingface offline datasets (dirkgr, Feb 29, 2024)
9e927f1  Not sure what's going on with openbookqa. (dirkgr, Feb 29, 2024)
c3c1a28  Don't use compile (dirkgr, Feb 29, 2024)
7c2b1dd  Back to microbatch of 2 (dirkgr, Feb 29, 2024)
5083517  Same settings as we did for OLMo 7B (dirkgr, Feb 29, 2024)
b744d5e  Turn off compile (dirkgr, Feb 29, 2024)
38f8817  Lots of checkpointing (dirkgr, Feb 29, 2024)
d849d40  Old version of torch (dirkgr, Feb 29, 2024)
d7b2e59  Revert "Old version of torch" (dirkgr, Feb 29, 2024)
c773863  mbsz 3 (dirkgr, Feb 29, 2024)
75aacd8  More GPUs, bigger batch (dirkgr, Mar 1, 2024)
a02ae9c  Set a group name (dirkgr, Mar 1, 2024)
23951d1  Save and eval more often (dirkgr, Mar 2, 2024)
3d71dd5  Dolma 1.7 config (dirkgr, Mar 2, 2024)
ead9ac8  Run less GPUs for longer (dirkgr, Mar 2, 2024)
20b6514  Configure remote save folders (dirkgr, Mar 2, 2024)
85492da  Give better names to the configs. Also run the baseline somewhere else. (dirkgr, Mar 2, 2024)
a4c0f09  Warm the cache for starter checkpoints (dirkgr, Mar 2, 2024)
5e3bda7  Revert "Warm the cache for starter checkpoints" (dirkgr, Mar 4, 2024)
c077d03  LLM is now OLMo (dirkgr, Mar 4, 2024)
17891f1  Adds ability to show all logs (dirkgr, Mar 4, 2024)
ce70c8b  Uses ability to show all logs (dirkgr, Mar 4, 2024)
d5ca6e4  Merge branch 'main' into olmo7-ablations (dirkgr, Mar 6, 2024)
2be09c9  Merge remote-tracking branch 'origin/main' into olmo7-ablations (dirkgr, Mar 6, 2024)
835dfcf  It's called `all_ranks` now. (dirkgr, Mar 6, 2024)
c21f6b9  New MMLU evals (dirkgr, Mar 6, 2024)
5411dc9  Config for dedupedocs (dirkgr, Mar 6, 2024)
f4ebb62  Disable the MMLU var evals (dirkgr, Mar 6, 2024)
1a32bec  Bring back the vars (dirkgr, Mar 6, 2024)
aad1e82  Fix uninitialized prompts bug (OyvindTafjord, Mar 6, 2024)
b8ad5b8  No more checkpointing (dirkgr, Mar 6, 2024)
0ada799  More speed (dirkgr, Mar 6, 2024)
608dfe5  Compile is still broken (dirkgr, Mar 6, 2024)
d434011  Let's try SHARD_GRAD_OP (dirkgr, Mar 7, 2024)
86e9a3f  Config for dedupeparas (dirkgr, Mar 8, 2024)
6524f87  New cluster who dis? (dirkgr, Mar 8, 2024)
f640740  Missing budget (dirkgr, Mar 8, 2024)
a0cb855  Warm HF Cache in Beaker (dirkgr, Mar 8, 2024)
963aa0b  Adds a script to continue the baseline run on Beaker (dirkgr, Mar 12, 2024)
66dc953  8 nodes (dirkgr, Mar 12, 2024)
87fc58d  2xnonweb (dirkgr, Mar 12, 2024)
6ff85b9  Refheavy (dirkgr, Mar 12, 2024)
cd5c196  Final2 config (dirkgr, Mar 18, 2024)
1fadaf8  Indentation to make comparisons work (dirkgr, Mar 18, 2024)
51303ea  Merge branch 'main' into olmo7-ablations (dirkgr, Mar 21, 2024)
CHANGELOG.md (3 additions, 1 deletion)
@@ -35,11 +35,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added the option to directly pass input embeddings to `OLMo` and `OLMoForCausalLM`.
- Added support for Python 3.8.
- Added code to throw an error if `output_attentions` is set to `True` in forward call to `OLMoForCausalLM`. This functionality hasn't been implemented yet.
- Fixed running with data loading workers on LUMI
- Correct scheme displayed in error messages that come from R2
- Fixed running with multiple data loading workers in LUMI
- Minor bug fix: uninitialized prompts variable

### Added
- Added `output_hidden_states` argument and associated functionality to `OLMo` and `OLMoForCausalLM` to return model intermediate hidden states.
- Ability to read from R2 like we read from S3
- Added MMLU downstream evaluation tasks, with prompt variations.
- Added support for PyTorch v2.2.
- Added ability to show logs from all ranks
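The R2 entries above come out of this PR (commits 50a7704 and 005c406). Cloudflare R2 speaks the S3 API, so reading from it amounts to pointing an S3 client at an R2 endpoint. A minimal sketch, not the repo's actual code; the endpoint, env var names, bucket, and key are placeholders:

```python
import os

import boto3

# Placeholder endpoint: R2 endpoints have the form
# https://<account-id>.r2.cloudflarestorage.com
s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",
    aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
)
# Reads like any S3 object; only the endpoint differs from AWS.
data = s3.get_object(Bucket="my-bucket", Key="checkpoints/model.pt")["Body"].read()
```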
configs/mcli/olmo7-ablation-baseline.yaml (new file, 47 additions)
name: olmo7-ablation-baseline  # can't have "_" or "." here
image: mosaicml/pytorch:2.1.2_cu121-python3.10-ubuntu20.04
compute:
  gpus: 64
  cluster: r7z2
  gpu_type: a100_40gb
integrations:
  - integration_type: git_repo
    git_repo: allenai/OLMo
    git_branch: olmo7-ablations
    #git_commit: d765e8819f5b0be204c96b0b519de2372b0da729
    pip_install: -e .[train]
    ssh_clone: true
command: |-
  pip freeze
  mkdir -p /root/.cache/torch/

  export OMP_NUM_THREADS=8
  export LOG_FILTER_TYPE=all_ranks
  #export OLMO_NO_SSL=1

  # warm up huggingface cache
  pushd /root/.cache
  curl "https://storage.googleapis.com/dirkgr-public/huggingface_cache.tar.gz" | tar -xzf -
  popd
  export HF_DATASETS_OFFLINE=1

  cd OLMo

  torchrun \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nnodes $NUM_NODES \
    --node_rank $NODE_RANK \
    --nproc_per_node 8 \
    scripts/train.py configs/olmo7-ablation-baseline.yaml \
    --run_name=olmo7-ablation-baseline \
    --wandb.name=baseline \
    --model.flash_attention=true \
    --fsdp.wrapping_strategy=by_block_and_size \
    --fsdp.sharding_strategy=FULL_SHARD \
    --save_folder=runs/ \
    --activation_checkpointing=whole_layer \
    --device_train_microbatch_size=3 \
    --global_train_batch_size=6144 \
    --wandb.group=baseline3 \
    --remote_save_folder=s3://ai2-llm/checkpoints/olmo7-ablation/baseline3
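A note on the launch arithmetic above (an illustration, not part of the config): mcli provisions the 64 GPUs as 8 nodes of 8 processes each, and with a device microbatch of 3, the global batch of 6144 implies gradient accumulation. Run configs like this one are typically submitted with `mcli run -f configs/mcli/olmo7-ablation-baseline.yaml`.

```python
# Sketch of the batch math implied by the flags above (assumption:
# the trainer derives gradient accumulation from these three values).
world_size = 64        # compute.gpus (8 nodes x 8 GPUs)
micro_batch = 3        # --device_train_microbatch_size
global_batch = 6144    # --global_train_batch_size

accumulation_steps = global_batch // (world_size * micro_batch)
print(accumulation_steps)  # 32 forward/backward passes per optimizer step
```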
configs/mcli/olmo7-ablation-dedupedocs.yaml (new file, 46 additions)
name: olmo7-ablation-dedupedocs  # can't have "_" or "." here
image: mosaicml/pytorch:2.1.2_cu121-python3.10-ubuntu20.04
compute:
  gpus: 64
  cluster: r14z3p2
  gpu_type: h100_80gb
integrations:
  - integration_type: git_repo
    git_repo: allenai/OLMo
    git_branch: olmo7-ablations
    #git_commit: d765e8819f5b0be204c96b0b519de2372b0da729
    pip_install: -e .[train]
    ssh_clone: true
command: |-
  pip freeze
  mkdir -p /root/.cache/torch/

  export OMP_NUM_THREADS=8
  export LOG_FILTER_TYPE=all_ranks
  #export OLMO_NO_SSL=1

  # warm up huggingface cache
  pushd /root/.cache
  curl "https://storage.googleapis.com/dirkgr-public/huggingface_cache.tar.gz" | tar -xzf -
  popd
  export HF_DATASETS_OFFLINE=1

  cd OLMo

  torchrun \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nnodes $NUM_NODES \
    --node_rank $NODE_RANK \
    --nproc_per_node 8 \
    scripts/train.py configs/olmo7-ablation-dedupedocs.yaml \
    --run_name=olmo7-ablation-dedupedocs \
    --wandb.name=dedupedocs \
    --model.flash_attention=true \
    --fsdp.wrapping_strategy=by_block_and_size \
    --fsdp.sharding_strategy=SHARD_GRAD_OP \
    --save_folder=runs/ \
    --device_train_microbatch_size=3 \
    --global_train_batch_size=6144 \
    --wandb.group=dedupedocs \
    --remote_save_folder=s3://ai2-llm/checkpoints/olmo7-ablation/dedupedocs
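Relative to the baseline config, this run swaps FULL_SHARD for SHARD_GRAD_OP (commit d434011) and drops activation checkpointing, which the 80 GB H100s make affordable. A sketch of what that strategy name corresponds to in PyTorch FSDP, assuming OLMo forwards it to torch verbatim:

```python
# Hedged sketch: look up the sharding strategy the flag names.
from torch.distributed.fsdp import ShardingStrategy

strategy = ShardingStrategy["SHARD_GRAD_OP"]
# SHARD_GRAD_OP shards gradients and optimizer state but keeps a full
# copy of the parameters on every rank (ZeRO-2-like). FULL_SHARD also
# shards the parameters (ZeRO-3-like): less memory, more communication.
```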
configs/mcli/olmo7-ablation-dolma17.yaml (new file, 47 additions)
name: olmo7-ablation-dolma17  # can't have "_" or "." here
image: mosaicml/pytorch:2.1.2_cu121-python3.10-ubuntu20.04
compute:
  gpus: 128
  cluster: r12z3
  gpu_type: a100_40gb
integrations:
  - integration_type: git_repo
    git_repo: allenai/OLMo
    git_branch: olmo7-ablations
    #git_commit: d765e8819f5b0be204c96b0b519de2372b0da729
    pip_install: -e .[train]
    ssh_clone: true
command: |-
  pip freeze
  mkdir -p /root/.cache/torch/

  export OMP_NUM_THREADS=8
  export LOG_FILTER_TYPE=all_ranks
  #export OLMO_NO_SSL=1

  # warm up huggingface cache
  pushd /root/.cache
  curl "https://storage.googleapis.com/dirkgr-public/huggingface_cache.tar.gz" | tar -xzf -
  popd
  export HF_DATASETS_OFFLINE=1

  cd OLMo

  torchrun \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nnodes $NUM_NODES \
    --node_rank $NODE_RANK \
    --nproc_per_node 8 \
    scripts/train.py configs/olmo7-ablation-dolma17.yaml \
    --run_name=olmo7-ablation-dolma17 \
    --wandb.name=dolma17 \
    --model.flash_attention=true \
    --fsdp.wrapping_strategy=by_block_and_size \
    --fsdp.sharding_strategy=FULL_SHARD \
    --save_folder=runs/ \
    --activation_checkpointing=whole_layer \
    --device_train_microbatch_size=3 \
    --global_train_batch_size=6144 \
    --wandb.group=dolma17 \
    --remote_save_folder=s3://ai2-llm/checkpoints/olmo7-ablation/dolma17
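All three configs warm the Hugging Face cache from a public tarball and then set HF_DATASETS_OFFLINE=1, so downstream-eval datasets resolve locally instead of hitting the Hub from every rank. A sketch of the effect on the Python side; the dataset name is illustrative (commit 9e927f1 suggests openbookqa is among the evals):

```python
import os

# Must be set before `datasets` is imported; the library reads it at import time.
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

# With /root/.cache pre-populated from the tarball, this loads from disk;
# without the warm-up, offline mode fails fast instead of downloading.
ds = load_dataset("openbookqa", "main")
```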