Amberish runs #629

Merged: 214 commits from epwalsh/amberish into main, Jul 25, 2024
Commits (214)
9cd5127
Adds a config that runs the whole thing in fp32
dirkgr May 13, 2024
fa403d8
Let's try running this on Jupiter
dirkgr May 13, 2024
e19ca62
Scripts should be executable
dirkgr May 13, 2024
6cd26f7
There is no fp32 flash.
dirkgr May 13, 2024
87b3f0f
Needs more sharding
dirkgr May 13, 2024
18f0fee
change workspace
AkshitaB May 13, 2024
a251fec
run on pluto
AkshitaB May 13, 2024
947b7d8
eval frequently
AkshitaB May 13, 2024
ed77a49
smaller microbatch
AkshitaB May 13, 2024
67adc77
fp32 llamaish from scratch
AkshitaB May 14, 2024
29cd968
training workspace
AkshitaB May 14, 2024
0ba4121
use my secrets
AkshitaB May 14, 2024
03d6ee6
run2
AkshitaB May 14, 2024
84516cc
use my secrets for run2
AkshitaB May 14, 2024
a915d67
log grad norm each step
AkshitaB May 14, 2024
742e572
run on pluto
AkshitaB May 14, 2024
02c2258
no loading
AkshitaB May 14, 2024
66a84ab
turn off flash attn
AkshitaB May 14, 2024
32deea2
change sharding
AkshitaB May 14, 2024
a2fcaf3
configs, launch scripts and llama2 tokenizer for training on pile
drschwenk May 17, 2024
dc0ae92
changed run priority
drschwenk May 17, 2024
d6f79ad
change pad_token_id to match eos
drschwenk May 17, 2024
7842c2c
removed checkpoint load
drschwenk May 17, 2024
0b01c2d
turned off perplexity eval
drschwenk May 17, 2024
bf332dd
added load_path to resume run
drschwenk May 17, 2024
767ba16
specifying RDMA interface to use for NCCL
drschwenk May 17, 2024
92c0e0c
config for llm-360 amber data repro
drschwenk May 28, 2024
5ce8b70
changed beaker secrets aws key/secret pointing to
drschwenk May 28, 2024
92695ba
change to pluto
drschwenk May 28, 2024
070dfac
up node count, back to jupiter
drschwenk May 28, 2024
effc02a
making job preemptible
drschwenk May 28, 2024
2e67908
initial config changes
drschwenk May 29, 2024
1b1382c
change adamw eps to match llm-360
drschwenk May 29, 2024
cac419e
additional config changes
drschwenk May 29, 2024
a160434
change run name
drschwenk May 29, 2024
2a80afb
turn off qkv clipping
drschwenk May 29, 2024
61c87d3
fix init_cutoff param
drschwenk May 29, 2024
ea5061d
fixing some issues/ matching a few more llm-360 hyperparams
drschwenk May 31, 2024
ed15b17
removing overrides in the submission script
drschwenk May 31, 2024
3568d0e
restart from last checkpoint
drschwenk Jun 2, 2024
6c71900
fix load_path
drschwenk Jun 2, 2024
9cfdb6f
specify load_path
drschwenk Jun 2, 2024
63142d4
update changes
epwalsh Jun 12, 2024
6ae2ff6
Merge branch 'main' into epwalsh/amberish-stability
epwalsh Jun 12, 2024
037a51f
revert some other changes
epwalsh Jun 12, 2024
2d3e068
prepare for launch
epwalsh Jun 12, 2024
bc5d9e7
add w&B tag
epwalsh Jun 12, 2024
96b7ee5
change to urgent
epwalsh Jun 12, 2024
d77fca6
fixes
epwalsh Jun 12, 2024
b624e3e
try without activation checkpointing
epwalsh Jun 12, 2024
c4795a7
switch tokenizer
epwalsh Jun 12, 2024
db8c632
renames
epwalsh Jun 12, 2024
aefe6ec
make executable
epwalsh Jun 12, 2024
273636d
download HF cache
epwalsh Jun 12, 2024
a4b11e7
Add amberish7 configs/scripts
epwalsh Jun 12, 2024
324a9e6
load checkpoint
epwalsh Jun 13, 2024
a607ab6
no load
epwalsh Jun 13, 2024
1bd18c4
run llm-360-amber-bz640
epwalsh Jun 13, 2024
ad60997
keep warmup still 2k steps
epwalsh Jun 13, 2024
82be01e
update HF cache
epwalsh Jun 13, 2024
995bb31
restart amberish7-baseline
epwalsh Jun 14, 2024
686b644
run on more GPUs
epwalsh Jun 14, 2024
a89711b
try to debug slow speed
epwalsh Jun 14, 2024
09541e2
fix debug env var
epwalsh Jun 14, 2024
fe19e1a
run with weka filesystem
epwalsh Jun 14, 2024
0c9df12
turn on W&B again
epwalsh Jun 14, 2024
d83ba2f
try increasing micro batch size again
epwalsh Jun 14, 2024
7860b89
revert to 4
epwalsh Jun 14, 2024
341cf85
increase number of GPUs for llm-360-amber-baseline
epwalsh Jun 14, 2024
395b323
fixes
epwalsh Jun 14, 2024
30b9a79
fix
epwalsh Jun 14, 2024
b50b8d9
fix?
epwalsh Jun 14, 2024
295499a
save more often
epwalsh Jun 14, 2024
7efcca9
fix
epwalsh Jun 14, 2024
d3887f8
start over
epwalsh Jun 14, 2024
8c03d7b
increase ephemeral save interval
epwalsh Jun 14, 2024
db0eaba
restart from step 500
epwalsh Jun 14, 2024
5ed3408
ensure ranks know they are using shared fs
epwalsh Jun 14, 2024
bb4f034
try a different way
epwalsh Jun 15, 2024
6488ae9
load from latest
epwalsh Jun 15, 2024
4bb2e37
run in debug
epwalsh Jun 15, 2024
526a346
restart for real
epwalsh Jun 16, 2024
847f29b
add weight to RMS norm, add evals, use weka
epwalsh Jun 17, 2024
40b7cf0
rename config
epwalsh Jun 17, 2024
5f5e956
Make FSDP mixed precision optional
epwalsh Jun 17, 2024
16b1645
turn off FSDP mixed precision
epwalsh Jun 17, 2024
4c35bec
add activation checkpointing
epwalsh Jun 17, 2024
958a061
run with different seed
epwalsh Jun 17, 2024
6276645
lol, fix
epwalsh Jun 17, 2024
8545310
increase warmup
epwalsh Jun 18, 2024
dece9f0
run with longer seq length
epwalsh Jun 20, 2024
9b3c1c4
Add official Amberish configs
epwalsh Jun 21, 2024
4de8195
Merge branch 'main' into epwalsh/amberish
epwalsh Jun 21, 2024
5f930e7
add ppl evaluations
epwalsh Jun 21, 2024
21dd46d
fix
epwalsh Jun 21, 2024
1d34551
propagate preemption
epwalsh Jun 21, 2024
867b4ed
add load path
epwalsh Jun 21, 2024
9795374
update evals
epwalsh Jun 24, 2024
abc8341
tweak NCCL settings
epwalsh Jun 24, 2024
67c351b
go to 512 GPUs
epwalsh Jun 24, 2024
60114cb
set timeout to 20
epwalsh Jun 24, 2024
d735d9f
no timeout
epwalsh Jun 24, 2024
437abe0
try different HCA setting
epwalsh Jun 24, 2024
404de8b
try hybrid sharding
epwalsh Jun 24, 2024
ebad588
less replicas
epwalsh Jun 24, 2024
1e04e7d
increase process group timeout
epwalsh Jun 24, 2024
23aa627
decrease number of nodes
epwalsh Jun 25, 2024
7d4d3b0
try `NCCL_IB_GID_INDEX`
epwalsh Jun 25, 2024
55aa293
back to 8 nodes
epwalsh Jun 25, 2024
224062d
prepare for 512 GPUs again
epwalsh Jun 25, 2024
4b3e5f2
switch workspace
epwalsh Jun 25, 2024
3605e54
revert back to old workspace
epwalsh Jun 25, 2024
67a8bfb
switch back to new workspace
epwalsh Jun 25, 2024
a9d978c
Fail fast if we fall back to ethernet
epwalsh Jun 26, 2024
fd13efd
comment out unnecessary var
epwalsh Jun 26, 2024
c90a7ce
clean up
epwalsh Jun 29, 2024
31e58a9
longer start timeout
epwalsh Jul 1, 2024
59b0294
Add configs and launch scripts for Amberish1B
epwalsh Jul 2, 2024
238dd04
Merge branch 'main' into epwalsh/amberish
epwalsh Jul 5, 2024
1b89c52
test run for 1B
epwalsh Jul 5, 2024
e8c4f9d
try mbz of 4
epwalsh Jul 5, 2024
17506bd
try mbz of 8
epwalsh Jul 5, 2024
01bdbd7
back to mbz of 4
epwalsh Jul 5, 2024
05cf130
Add scripts for WD on embeddings
epwalsh Jul 5, 2024
94469ad
start new epoch
epwalsh Jul 5, 2024
363a316
Fix bug with epochs
epwalsh Jul 5, 2024
8730d46
restart
epwalsh Jul 5, 2024
c215d66
fix
epwalsh Jul 5, 2024
f7824bb
go down to 8 nodes with 7B
epwalsh Jul 8, 2024
1434330
set load path for 1Bs
epwalsh Jul 8, 2024
553f323
Merge branch 'main' into epwalsh/amberish
epwalsh Jul 9, 2024
5af5c6b
Add scripts for 1B with selective updates
epwalsh Jul 9, 2024
2776aab
Add unsharding scripts
epwalsh Jul 10, 2024
c81bfe3
debug
epwalsh Jul 10, 2024
d672805
revert
epwalsh Jul 10, 2024
f99bc17
copy to s3
epwalsh Jul 10, 2024
6df4dbb
auto upload to S3
epwalsh Jul 10, 2024
f595d06
run on 16 nodes
epwalsh Jul 11, 2024
7935cdb
pin flash attention
epwalsh Jul 12, 2024
737e113
run selective updates without WD on embeddings
epwalsh Jul 12, 2024
1fc4c21
Add options for emb init std and emb LN
epwalsh Jul 12, 2024
cced92c
Add scripts for `amberish1-emb-ln`
epwalsh Jul 12, 2024
fb35549
use 8 nodes
epwalsh Jul 12, 2024
e2ce48f
use default std dev for embeddingds
epwalsh Jul 12, 2024
c8db9f1
try with just emb init 1
epwalsh Jul 12, 2024
e9c4cfc
fix name
epwalsh Jul 12, 2024
b3f1aeb
load
epwalsh Jul 13, 2024
2cc3580
add script with z-loss
epwalsh Jul 13, 2024
ea86a79
oops, fix
epwalsh Jul 13, 2024
5118123
try 2 replicas
epwalsh Jul 13, 2024
b7a0161
turn off hybrid
epwalsh Jul 13, 2024
c7bc32e
Add 70B config and scripts
epwalsh Jul 15, 2024
279fbae
Fix `eos_token_id` in configs
epwalsh Jul 16, 2024
6186799
update 70B config
epwalsh Jul 16, 2024
a3916e4
update
epwalsh Jul 16, 2024
bb2bf4e
Add load path
epwalsh Jul 17, 2024
4263354
prep for run
epwalsh Jul 17, 2024
f1eb9bc
adjust # of nodes
epwalsh Jul 17, 2024
dba79f7
no evals
epwalsh Jul 17, 2024
b241565
match mitchish70 batch size and num nodes
epwalsh Jul 17, 2024
fe5073b
rename
epwalsh Jul 17, 2024
2832e71
tweak settings again
epwalsh Jul 17, 2024
b1191ef
prepare 7B
epwalsh Jul 17, 2024
43dc713
prep for 256 GPUs
epwalsh Jul 17, 2024
0eb2e04
more hybrid replicas
epwalsh Jul 17, 2024
279450f
try 16 replicas
epwalsh Jul 17, 2024
22fc958
disable evaluators for now
epwalsh Jul 17, 2024
52f4b79
try no hybrid
epwalsh Jul 17, 2024
9e9454c
run on 512
epwalsh Jul 17, 2024
4ade8a2
2 replicas
epwalsh Jul 17, 2024
22b81f1
update
epwalsh Jul 17, 2024
0316e60
13B config
dirkgr Jul 17, 2024
7f5e812
Merge branch 'epwalsh/amberish' of https://github.com/allenai/LLM int…
dirkgr Jul 17, 2024
e3f0bf1
big run
epwalsh Jul 17, 2024
25b44dc
4 replicas
epwalsh Jul 17, 2024
a3cf646
reset 7B config
epwalsh Jul 17, 2024
b43447d
prepare to run on 256 GPUs
epwalsh Jul 17, 2024
7a79e17
turn off ephemeral checkpoints
epwalsh Jul 17, 2024
fb5372e
move scripts to dedicated folder
epwalsh Jul 18, 2024
5e1c7ed
fix merge conflicts
epwalsh Jul 18, 2024
cc86571
Prep Amberish config with Chameleon fixes
epwalsh Jul 18, 2024
33150cf
clean up
epwalsh Jul 18, 2024
4abe22a
Merge branch 'main' into epwalsh/amberish
epwalsh Jul 18, 2024
198fb5c
clean up
epwalsh Jul 18, 2024
185cd1e
Merge branch 'main' into epwalsh/amberish
epwalsh Jul 18, 2024
5ba9d27
clean up
epwalsh Jul 18, 2024
6e20f3e
exclude norm reordering
epwalsh Jul 18, 2024
fb811d9
tweaks
epwalsh Jul 19, 2024
b35cd52
update
epwalsh Jul 19, 2024
3a4bfce
Merge branch 'main' into epwalsh/amberish
epwalsh Jul 19, 2024
075b6e7
clean up
epwalsh Jul 19, 2024
f68e29a
Add 8k context length scripts
epwalsh Jul 19, 2024
525b315
Merge branch 'main' into epwalsh/amberish
epwalsh Jul 19, 2024
12bf4cd
try another chameleon
epwalsh Jul 19, 2024
ea724d4
run 8k with document masking
epwalsh Jul 20, 2024
01f2bf6
Add chameleon fixes to 8k run
epwalsh Jul 21, 2024
5d53b6c
prep 4k context length w/ doc masking
epwalsh Jul 22, 2024
71e9bf9
rename scripts
epwalsh Jul 23, 2024
24d50c3
fix
epwalsh Jul 23, 2024
2d0b8b0
prep for 8k chameleon without doc masking
epwalsh Jul 23, 2024
95c477d
use doc masking in ppl evals too
epwalsh Jul 23, 2024
64ba2a8
fix
epwalsh Jul 23, 2024
b4e5e91
Make RoPE theta configurable, try a run w/ 500000
epwalsh Jul 24, 2024
b4acf5b
CHANGELOG
epwalsh Jul 24, 2024
00dfa78
add config with new data, new tokenizer
epwalsh Jul 24, 2024
4ff8598
prep peteish run
epwalsh Jul 24, 2024
d709d2d
remove ppl sets for now
epwalsh Jul 24, 2024
ba67ac0
clean up unshard script
epwalsh Jul 24, 2024
27fb675
fix typo
epwalsh Jul 24, 2024
ce930dd
Update olmo/config.py
epwalsh Jul 24, 2024
ac6b650
back to 16 nodes for now
epwalsh Jul 24, 2024
1e32421
Add peteish 7B config
epwalsh Jul 25, 2024
accc3ee
back to 4k
epwalsh Jul 25, 2024
8b7afd0
final tweaks
epwalsh Jul 25, 2024
Files changed
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added support for document masking via flash-attn during training with `--data.generate_doc_lengths`.
- Added config options for `model.norm_after`, `model.scale_emb_init`, and `auxiliary_loss_multiplier` (used with zloss).
- Added scripts for running experiments on qk_norm, norm reordering, and zloss.
- Added `model.rope_theta` configuration option.
- Added `model.embedding_layer_norm` configuration option for adding a LN to the embeddings.
- Added `model.emb_init_std` configuration option to override the standard deviation used to initialize the embeddings.

### Changed

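The three new options above are wired into the embedding/RoPE code in olmo/model.py further down. As a quick reference, here is a minimal standalone sketch (not the OLMo API itself) of how the new `model.emb_init_std` override interacts with the existing init functions; the `"mitchell"` and `"full_megatron"` names come from the `InitFnType` values visible in that diff, the fallback branch name and the numbers in the usage line are illustrative assumptions.

```python
import math
from typing import Optional


def embedding_init_std(
    init_fn: str,
    init_std: float,
    d_model: int,
    emb_init_std: Optional[float] = None,
    scale_emb_init: bool = False,
) -> float:
    if init_fn == "mitchell":
        # default is 1/sqrt(d_model) unless explicitly overridden
        return emb_init_std or 1.0 / math.sqrt(d_model)
    if init_fn == "full_megatron":
        if emb_init_std is not None:
            return emb_init_std  # explicit override wins
        if scale_emb_init:
            return init_std * math.sqrt(d_model)  # pre-existing scale_emb_init behavior
        return init_std
    # other init functions: plain init_std unless overridden
    return emb_init_std or init_std


# e.g. the "try with just emb init 1" experiment in the commit list
print(embedding_init_std("full_megatron", 0.02, 2048, emb_init_std=1.0))  # 1.0
```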
1,297 changes: 1,297 additions & 0 deletions configs/amberish1-weka.yaml

Large diffs are not rendered by default.

1,293 changes: 1,293 additions & 0 deletions configs/amberish13-weka.yaml

Large diffs are not rendered by default.

1,293 changes: 1,293 additions & 0 deletions configs/amberish7-weka.yaml

Large diffs are not rendered by default.

1,294 changes: 1,294 additions & 0 deletions configs/amberish70-weka.yaml

Large diffs are not rendered by default.

1,383 changes: 1,383 additions & 0 deletions configs/peteish1-weka.yaml

Large diffs are not rendered by default.

1,382 changes: 1,382 additions & 0 deletions configs/peteish7-weka.yaml

Large diffs are not rendered by default.

26 changes: 22 additions & 4 deletions olmo/config.py
@@ -315,6 +315,11 @@ class ModelConfig(BaseConfig):
apply RoPE at the precision of the input.
"""

rope_theta: int = 10_000
"""
The theta setting for RoPE.
"""

flash_attention: bool = False
"""
If ``True``, use ``FlashAttention``.
@@ -346,6 +351,11 @@ class ModelConfig(BaseConfig):
The dropout probability for embeddings.
"""

embedding_layer_norm: bool = False
"""
Apply layer norm directly to the embeddings.
"""

layer_norm_type: LayerNormType = LayerNormType.default
"""
The layernorm implementation to use.
@@ -449,7 +459,13 @@ class ModelConfig(BaseConfig):

scale_emb_init: bool = False
"""
-If ``True``, embeddings are scaled up by ``sqrt(d_model)`` during initialization. To be used with `full_megatron` init.
+If ``True``, embeddings are scaled up by ``sqrt(d_model)`` during initialization.
+Currently this is only used with `full_megatron` init when ``emb_init_std`` is unset.
"""

emb_init_std: Optional[float] = None
"""
Override the standard deviation to use when initializing the embedding weights.
"""

norm_after: bool = False
@@ -791,7 +807,7 @@ class FSDPConfig(BaseConfig):
FSDP instance.
"""

-precision: FSDPPrecision = FSDPPrecision.pure
+precision: Optional[FSDPPrecision] = FSDPPrecision.pure

hybrid_sharding_num_model_replicas: Optional[int] = None
"""
@@ -1213,9 +1229,11 @@ def autocast_precision(self) -> torch.dtype:
raise ValueError(f"Unexpected precision type '{self.precision}'")

@property
-def fsdp_precision(self) -> MixedPrecision:
+def fsdp_precision(self) -> Optional[MixedPrecision]:
if self.fsdp is not None:
-if self.fsdp.precision == FSDPPrecision.pure:
+if self.fsdp.precision is None:
+return None
+elif self.fsdp.precision == FSDPPrecision.pure:
return MixedPrecision(
param_dtype=self.autocast_precision,
reduce_dtype=self.autocast_precision,
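A hedged sketch of the precision resolution shown above: leaving `fsdp.precision` unset (`None`) now disables FSDP's MixedPrecision policy entirely, which is what the fp32 runs in this PR rely on, while `pure` keeps everything in the autocast dtype. The `buffer_dtype` in the `pure` branch is an assumption, since the diff is truncated before that line; the string values stand in for the `FSDPPrecision` enum.

```python
from typing import Optional

import torch
from torch.distributed.fsdp import MixedPrecision


def resolve_fsdp_precision(
    precision: Optional[str], autocast_dtype: torch.dtype
) -> Optional[MixedPrecision]:
    if precision is None:
        # run FSDP without a mixed-precision policy (full fp32 in these runs)
        return None
    if precision == "pure":
        return MixedPrecision(
            param_dtype=autocast_dtype,
            reduce_dtype=autocast_dtype,
            buffer_dtype=autocast_dtype,
        )
    raise ValueError(f"Unexpected FSDP precision '{precision}'")
```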
22 changes: 16 additions & 6 deletions olmo/model.py
@@ -277,7 +277,9 @@ def get_rotary_embedding(self, seq_len: int, device: torch.device) -> Tuple[torc

with torch.autocast(device.type, enabled=False):
dim = self.config.d_model // self.config.n_heads
-inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device, dtype=torch.float) / dim))
+inv_freq = 1.0 / (
+self.config.rope_theta ** (torch.arange(0, dim, 2, device=device, dtype=torch.float) / dim)
+)
seq = torch.arange(seq_len, device=device, dtype=torch.float)
freqs = einsum("i , j -> i j", seq, inv_freq)
positions = torch.cat((freqs, freqs), dim=-1)
@@ -535,7 +537,6 @@ def _scaled_dot_product_attention(
if max_doc_len is not None and cu_doc_lens is not None:
assert self.flash_attn_varlen_func is not None, "flash-attn is required for document masking"
assert attn_mask is None, "attn-mask is currently not supported with document masking"
-assert self.training, "document masking is only supported for training, not inference"
B, T, D = q.size(0), q.size(2), q.size(3)
r = self.flash_attn_varlen_func(
q.transpose(1, 2).view(B * T, -1, D),
@@ -1121,6 +1122,9 @@ def __init__(self, config: ModelConfig, init_params: bool = True):
)
}
)
if config.embedding_layer_norm:
self.transformer.update({"emb_norm": LayerNorm.build(config)})

# When `init_device="meta"` FSDP will call `reset_parameters()` to initialize weights.
if init_params and self.config.init_device != "meta":
self.reset_parameters()
@@ -1157,14 +1161,16 @@ def reset_parameters(self):
# Note: We may potentially want to multiply the std by a factor of sqrt(d) in case of `scale_logits`
# and `weight_tying`. However, we are currently not using either, and may need to rethink the init logic
# if/when we do want it.
-wte_std = self.config.init_std
+wte_std = self.config.emb_init_std or self.config.init_std
wte_cutoff_factor = self.config.init_cutoff_factor
elif self.config.init_fn == InitFnType.mitchell:
-wte_std = 1.0 / math.sqrt(self.config.d_model)
+wte_std = self.config.emb_init_std or 1.0 / math.sqrt(self.config.d_model)
wte_cutoff_factor = self.config.init_cutoff_factor or 3.0
elif self.config.init_fn == InitFnType.full_megatron:
wte_std = self.config.init_std
-if self.config.scale_emb_init:
+if self.config.emb_init_std is not None:
+wte_std = self.config.emb_init_std
+elif self.config.scale_emb_init:
wte_std *= math.sqrt(self.config.d_model)
wte_cutoff_factor = self.config.init_cutoff_factor or 3.0
else:
Expand Down Expand Up @@ -1294,6 +1300,10 @@ def forward(
# shape: (batch_size, seq_len, d_model)
x = self.transformer.wte(input_ids) if input_embeddings is None else input_embeddings # type: ignore

# Apply embedding layer norm.
if self.config.embedding_layer_norm:
x = self.transformer.emb_norm(x)

if not (self.config.alibi or self.config.rope):
# Get positional embeddings.
# shape: (1, seq_len)
@@ -1302,7 +1312,7 @@
pos_emb = self.transformer.wpe(pos) # type: ignore
x = pos_emb + x

-# Add input + positional embeddings and apply dropout.
+# Apply dropout.
# shape: (batch_size, seq_len, d_model)
x = self.transformer.emb_drop(x) # type: ignore

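A standalone illustration of the RoPE change above: the inverse frequencies now use the configurable `rope_theta` instead of the hard-coded 10000. The `d_model`/`n_heads` values below are made up for the example, not taken from any of the configs in this PR.

```python
import torch


def rope_inv_freq(d_model: int, n_heads: int, rope_theta: float) -> torch.Tensor:
    dim = d_model // n_heads
    return 1.0 / (rope_theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))


base = rope_inv_freq(2048, 16, 10_000)        # previous fixed base
stretched = rope_inv_freq(2048, 16, 500_000)  # value tried in the rtheta run below
# A larger theta shrinks the highest-index frequencies, i.e. lengthens the RoPE
# wavelengths, which is the usual lever when pushing to longer contexts (8k here).
print(base[-1].item(), stretched[-1].item())
```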
40 changes: 40 additions & 0 deletions scripts/beaker/amberish/amberish1-8k-cham-launch.sh
@@ -0,0 +1,40 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=16

gantry run \
--workspace ai2/OLMo-pretraining-stability \
--task-name amberish1-8k-cham \
--description "Amberish 1B with 8k context length and chameleon fixes" \
--priority urgent \
--preemptible \
--beaker-image petew/olmo-torch23-gantry \
--cluster ai2/jupiter-cirrascale-2 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
--weka oe-training-default:/weka/oe-training-default \
--propagate-failure \
--propagate-preemption \
--synchronized-start-timeout 90m \
--no-python \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env OLMO_TASK=model \
--env R2_PROFILE=R2 \
--env S3_PROFILE=S3 \
--env WEKA_PROFILE=WEKA \
--env-secret AWS_CONFIG=PETEW_AWS_CONFIG \
--env-secret AWS_CREDENTIALS=PETEW_AWS_CREDENTIALS \
--env-secret R2_ENDPOINT_URL=R2_ENDPOINT_URL \
--env-secret WEKA_ENDPOINT_URL=WEKA_ENDPOINT_URL \
--env-secret WANDB_API_KEY=PETEW_WANDB_API_KEY \
--shared-memory 10GiB \
--yes \
--timeout=-1 \
-- /bin/bash -c "scripts/beaker/amberish/amberish1-8k-cham.sh \$BEAKER_LEADER_REPLICA_HOSTNAME ${NUM_NODES} \$BEAKER_REPLICA_RANK"
64 changes: 64 additions & 0 deletions scripts/beaker/amberish/amberish1-8k-cham.sh
@@ -0,0 +1,64 @@
#!/usr/bin/env bash

set -exuo pipefail
IFS=$'\n\t'

BEAKER_LEADER_REPLICA_HOSTNAME=$1
shift

NUM_NODES=$1
shift

BEAKER_REPLICA_RANK=$1
shift

# Setup Python environment.
conda shell.bash activate base

# Install flash-attn
#conda install -y -c nvidia cuda-python
pip install packaging ninja
export FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE
pip install flash-attn==2.5.9.post1 --no-build-isolation
# pip install awscli
pip install '.[train]'
pip freeze

# Move AWS credentials from env to relevant files
mkdir -p ~/.aws
printenv AWS_CONFIG > ~/.aws/config
printenv AWS_CREDENTIALS > ~/.aws/credentials

# Force processes to synchronize at init_process_group
export TORCH_DIST_INIT_BARRIER=1

# Tell OLMo all ranks share the same filesystem for checkpoints.
export OLMO_SHARED_FS=1

export NCCL_DEBUG=INFO
export NCCL_IB_HCA="^=mlx5_bond_0"
export NCCL_SOCKET_IFNAME=ib
# export NCCL_IB_GID_INDEX=0

torchrun \
--nnodes "${NUM_NODES}:${NUM_NODES}" \
--nproc-per-node 8 \
--rdzv_id 12347 \
--rdzv_backend static \
--rdzv_endpoint "${BEAKER_LEADER_REPLICA_HOSTNAME}:29400" \
--node_rank "${BEAKER_REPLICA_RANK}" \
--rdzv_conf 'read_timeout=420' \
scripts/train.py \
configs/amberish1-weka.yaml \
--run_name="${GANTRY_TASK_NAME}" \
--model.max_sequence_length=8192 \
--device_train_microbatch_size=2 \
--global_train_batch_size=512 \
--fused_loss=true \
--softmax_auxiliary_loss=true \
--auxiliary_loss_multiplier=1e-5 \
--model.attention_layer_norm=true \
--model.norm_after=true \
--save_overwrite

# '--load_path=${path.last_checkpoint:${save_folder}}' \
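The dotted flags passed to scripts/train.py above override keys in configs/amberish1-weka.yaml. A hedged sketch of that override mechanism using OmegaConf directly — the config files use OmegaConf-style interpolation (e.g. `${path.last_checkpoint:${save_folder}}`), but the exact code path in scripts/train.py may differ:

```python
from omegaconf import OmegaConf

base = OmegaConf.load("configs/amberish1-weka.yaml")
overrides = OmegaConf.from_dotlist([
    "model.max_sequence_length=8192",
    "model.attention_layer_norm=true",
    "model.norm_after=true",
    "global_train_batch_size=512",
])
cfg = OmegaConf.merge(base, overrides)  # overrides win over the YAML defaults
print(cfg.model.max_sequence_length)    # 8192
```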
40 changes: 40 additions & 0 deletions scripts/beaker/amberish/amberish1-8k-doc-mask-cham-launch.sh
@@ -0,0 +1,40 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=16

gantry run \
--workspace ai2/OLMo-pretraining-stability \
--task-name amberish1-8k-doc-mask-cham \
--description "Amberish 1B with 8k context length, doc masking, and chameleon fixes" \
--priority urgent \
--preemptible \
--beaker-image petew/olmo-torch23-gantry \
--cluster ai2/jupiter-cirrascale-2 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
--weka oe-training-default:/weka/oe-training-default \
--propagate-failure \
--propagate-preemption \
--synchronized-start-timeout 90m \
--no-python \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env OLMO_TASK=model \
--env R2_PROFILE=R2 \
--env S3_PROFILE=S3 \
--env WEKA_PROFILE=WEKA \
--env-secret AWS_CONFIG=PETEW_AWS_CONFIG \
--env-secret AWS_CREDENTIALS=PETEW_AWS_CREDENTIALS \
--env-secret R2_ENDPOINT_URL=R2_ENDPOINT_URL \
--env-secret WEKA_ENDPOINT_URL=WEKA_ENDPOINT_URL \
--env-secret WANDB_API_KEY=PETEW_WANDB_API_KEY \
--shared-memory 10GiB \
--yes \
--timeout=-1 \
-- /bin/bash -c "scripts/beaker/amberish/amberish1-8k-doc-mask-cham.sh \$BEAKER_LEADER_REPLICA_HOSTNAME ${NUM_NODES} \$BEAKER_REPLICA_RANK"
@@ -0,0 +1,40 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=16

gantry run \
--workspace ai2/OLMo-pretraining-stability \
--task-name amberish1-8k-doc-mask-cham-rtheta \
--description "Amberish 1B with 8k context length, doc masking, and chameleon fixes" \
--priority urgent \
--preemptible \
--beaker-image petew/olmo-torch23-gantry \
--cluster ai2/jupiter-cirrascale-2 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
--weka oe-training-default:/weka/oe-training-default \
--propagate-failure \
--propagate-preemption \
--synchronized-start-timeout 90m \
--no-python \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env OLMO_TASK=model \
--env R2_PROFILE=R2 \
--env S3_PROFILE=S3 \
--env WEKA_PROFILE=WEKA \
--env-secret AWS_CONFIG=PETEW_AWS_CONFIG \
--env-secret AWS_CREDENTIALS=PETEW_AWS_CREDENTIALS \
--env-secret R2_ENDPOINT_URL=R2_ENDPOINT_URL \
--env-secret WEKA_ENDPOINT_URL=WEKA_ENDPOINT_URL \
--env-secret WANDB_API_KEY=PETEW_WANDB_API_KEY \
--shared-memory 10GiB \
--yes \
--timeout=-1 \
-- /bin/bash -c "scripts/beaker/amberish/amberish1-8k-doc-mask-cham-rtheta.sh \$BEAKER_LEADER_REPLICA_HOSTNAME ${NUM_NODES} \$BEAKER_REPLICA_RANK"
66 changes: 66 additions & 0 deletions scripts/beaker/amberish/amberish1-8k-doc-mask-cham-rtheta.sh
@@ -0,0 +1,66 @@
#!/usr/bin/env bash

set -exuo pipefail
IFS=$'\n\t'

BEAKER_LEADER_REPLICA_HOSTNAME=$1
shift

NUM_NODES=$1
shift

BEAKER_REPLICA_RANK=$1
shift

# Setup Python environment.
conda shell.bash activate base

# Install flash-attn
#conda install -y -c nvidia cuda-python
pip install packaging ninja
export FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE
pip install flash-attn==2.5.9.post1 --no-build-isolation
# pip install awscli
pip install '.[train]'
pip freeze

# Move AWS credentials from env to relevant files
mkdir -p ~/.aws
printenv AWS_CONFIG > ~/.aws/config
printenv AWS_CREDENTIALS > ~/.aws/credentials

# Force processes to synchronize at init_process_group
export TORCH_DIST_INIT_BARRIER=1

# Tell OLMo all ranks share the same filesystem for checkpoints.
export OLMO_SHARED_FS=1

export NCCL_DEBUG=INFO
export NCCL_IB_HCA="^=mlx5_bond_0"
export NCCL_SOCKET_IFNAME=ib
# export NCCL_IB_GID_INDEX=0

torchrun \
--nnodes "${NUM_NODES}:${NUM_NODES}" \
--nproc-per-node 8 \
--rdzv_id 12347 \
--rdzv_backend static \
--rdzv_endpoint "${BEAKER_LEADER_REPLICA_HOSTNAME}:29400" \
--node_rank "${BEAKER_REPLICA_RANK}" \
--rdzv_conf 'read_timeout=420' \
scripts/train.py \
configs/amberish1-weka.yaml \
--run_name="${GANTRY_TASK_NAME}" \
--model.max_sequence_length=8192 \
--device_train_microbatch_size=2 \
--global_train_batch_size=512 \
--fused_loss=true \
--data.generate_doc_lengths=true \
--softmax_auxiliary_loss=true \
--auxiliary_loss_multiplier=1e-5 \
--model.attention_layer_norm=true \
--model.norm_after=true \
--model.rope_theta=500000 \
--save_overwrite

# '--load_path=${path.last_checkpoint:${save_folder}}' \
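The `--data.generate_doc_lengths=true` flag above feeds per-document lengths into flash-attn's varlen kernel (see the `_scaled_dot_product_attention` diff in olmo/model.py). A hedged sketch of the cumulative-lengths tensor involved; the function and variable names are illustrative, not the exact OLMo data-loader fields:

```python
import torch


def cu_doc_lens_from_lengths(doc_lens: torch.Tensor) -> tuple[torch.Tensor, int]:
    """doc_lens: 1D int tensor of per-document token counts in one packed sequence."""
    cu = torch.zeros(doc_lens.numel() + 1, dtype=torch.int32)
    cu[1:] = torch.cumsum(doc_lens, dim=0)
    return cu, int(doc_lens.max())


lengths = torch.tensor([1500, 2596, 4096])  # three docs packed into one 8192-token row
cu_doc_lens, max_doc_len = cu_doc_lens_from_lengths(lengths)
print(cu_doc_lens.tolist(), max_doc_len)  # [0, 1500, 4096, 8192] 4096
# These correspond to the cu_doc_lens / max_doc_len arguments passed to
# flash_attn_varlen_func, so attention never crosses a document boundary.
```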