19cde92
fix timing (#31)
stas00 Jul 28, 2021
a0bccfe
Update gpt2_tokenization.py
huu4ontocord Jul 29, 2021
ac227a1
Revert "Update gpt2_tokenization.py"
Jul 29, 2021
2a7ee91
use pp engine even for pp=1 (#6) (#34)
stas00 Jul 30, 2021
e7bc518
Revert "use pp engine even for pp=1 (#6) (#34)"
stas00 Jul 31, 2021
2a64b19
Revert "Revert "use pp engine even for pp=1 (#6) (#34)""
stas00 Jul 31, 2021
c3399a7
Create README.md
stas00 Aug 3, 2021
3caf203
Faster preprocessing (#18)
thomasw21 Aug 4, 2021
a7856ca
add a section on how we use deepspeed with Meg
stas00 Aug 4, 2021
0366587
fix the deepspeed example
stas00 Aug 4, 2021
b3aa039
add .bs to the version to help check we are on the right repo/branch
stas00 Aug 4, 2021
e7ac5fd
fix attn_mask (#50)
stas00 Aug 5, 2021
358eac6
chore: update gitignore (#45)
jaketae Aug 5, 2021
b6a2c9e
Group tensorboard metrics (#39)
VictorSanh Aug 5, 2021
e0c6236
rm `(s)` that slipped through
VictorSanh Aug 5, 2021
72f4080
Update requirements.txt (#46)
jaketae Aug 5, 2021
84f8d51
Add LRU cache, add faster tokenization (#37)
huu4ontocord Aug 5, 2021
5521f38
Update README.md (#51)
lintangsutawika Aug 5, 2021
c43e207
chore: add deepspeed as comment
jaketae Aug 5, 2021
6e5f752
Fix pretrain_gpt_single_node example script to have only one occurenc…
thomasw21 Aug 6, 2021
190565d
better comment on TB writer (`is_last_rank`)
VictorSanh Aug 6, 2021
50cb9da
Add GLU variants (#47)
jaketae Aug 8, 2021
febe21d
[microsoft/Megatron-DeepSpeed sync] Commits including 2021-08-09 (#58)
stas00 Aug 10, 2021
b11b2be
use HuggingFace Datasets as source to build Megatron data files (#48)
adammoody Aug 11, 2021
3c6460d
Add test suite (#64)
stas00 Aug 13, 2021
29f0150
fix arg help (#65)
stas00 Aug 13, 2021
128013d
add testing and contribute info
stas00 Aug 15, 2021
60e82e3
fix header
stas00 Aug 15, 2021
ccab405
fix: doc_idx offset when merging indexed dataset files (#66)
adammoody Aug 16, 2021
3343c77
shuffle index list with numpy, scatter list, use file for large lists…
adammoody Aug 17, 2021
8f754d3
fix: exclusive scan computing pointers list (#68)
adammoody Aug 18, 2021
3ab2b3d
- Recompute bin/idx using microsoft/Megatron-DeepSpeed (Not changes)
thomasw21 Aug 18, 2021
34d1076
Add openwebtext1000.jsonl to .gitignore
thomasw21 Aug 18, 2021
7642903
[testing] fixes for pt-1.10 (#71)
stas00 Aug 21, 2021
07a2ba5
Expose GLU activations as arguments (#69)
jaketae Aug 22, 2021
2013106
fix circular import (#72)
stas00 Aug 22, 2021
ed786b5
[codecarbon] integration (#15)
stas00 Aug 25, 2021
6a48314
Check cardon directory is not None (#74)
thomasw21 Aug 26, 2021
2f27e55
[CI] start workflow (#75)
stas00 Aug 26, 2021
2595bd9
[CI] wip (#76)
stas00 Aug 26, 2021
aa319b1
distributed merge of per-rank Megatron data files (#55)
adammoody Aug 26, 2021
937135a
fix test; skip broken test (#79)
stas00 Aug 26, 2021
6d656ec
Add step to download dataset before running the preprocess_data_dist …
adammoody Aug 27, 2021
8cb2d18
dynamically use as many 3d dimensions as possible (#83)
stas00 Aug 27, 2021
d755647
add missing dependencies (#88)
stas00 Sep 1, 2021
41fdd46
[CI] setting up a CI with EC2 backend (#78)
stas00 Sep 1, 2021
6be85ca
[requirements] fix format (#94)
stas00 Sep 10, 2021
c7d8e94
[WIP] [codecarbon] sorting out CC warnings + logger preamble (#80)
stas00 Sep 13, 2021
30e0b8d
Floating-point ops counting and reloading (#40)
TevenLeScao Sep 15, 2021
b8b4797
added comment
TevenLeScao Sep 15, 2021
5162fd2
check whether python3-config is available (#98)
stas00 Sep 15, 2021
55e7332
Prefix lm (#52)
thomasw21 Sep 16, 2021
e28f84c
simplify the CI trigger (#102)
stas00 Sep 16, 2021
f822ef0
Fix model tests (#103)
thomasw21 Sep 16, 2021
709f1af
[tensor comparisons] support pt-1.8, add torch_assert_close (#106)
stas00 Sep 16, 2021
39c3d70
Checkpoint conversion tools (#14) (#109)
stas00 Sep 20, 2021
7dd3a6b
add direct meg-ds to hf format script (#110)
stas00 Sep 20, 2021
846cc32
add direct meg-ds to hf format script (part2) (#111)
stas00 Sep 20, 2021
2495bd8
training with dummy data to verify sampling (#36)
lintangsutawika Sep 21, 2021
d168c1b
update merge_preprocessed_data to use distributed merge (#82)
adammoody Sep 21, 2021
74b8166
make scripts executable
stas00 Sep 21, 2021
1ac6a70
add shebang
stas00 Sep 21, 2021
87f0598
ALiBi Implementation (#101)
ofirpress Sep 24, 2021
8eb0029
[tests] flush std streams (#120)
stas00 Sep 29, 2021
0ec0257
chore: update `.gitignore`
jaketae Sep 30, 2021
a0e6b68
[Feature] Implement sample-ids-to-text extractor (#116)
wade3han Oct 1, 2021
202fd3e
[testing] ensure no lock file is dropped (#122)
stas00 Oct 1, 2021
c146dce
Save tokenizer in conversion script (#128)
jaketae Oct 7, 2021
3586830
fix: only trigger ci on .py file changes (#131)
jaketae Oct 8, 2021
a319a6c
Curriculum learning support (#132)
conglongli Oct 10, 2021
97bdf31
[CL] fix default placement (#133)
stas00 Oct 10, 2021
63539b1
Fix deepspeed prefix-lm (#107)
thomasw21 Oct 10, 2021
5f3c08b
[codecarbon] switch to master (#135)
stas00 Oct 11, 2021
bbe4dea
run on pull_request branch (#141)
stas00 Oct 18, 2021
34140e7
print number of params only on rank 0 (#140)
stas00 Oct 18, 2021
04c6da3
Configure code style formatters (#130)
jaketae Oct 19, 2021
0f7a2bc
[Feature] Porting bitsandbytes to meg-deepspeed (#144)
wade3han Oct 19, 2021
c1b09d4
backward compatibility for new chkpt keys (#147)
stas00 Oct 20, 2021
ce20a7d
fused softmax layer bug fix sync (#151)
stas00 Oct 21, 2021
959a876
Fix curriculum learning support (#134)
conglongli Oct 22, 2021
3e76195
disable codecarbon as it's very unstable (#152)
stas00 Oct 22, 2021
7813714
Fix glu activation (#148)
thomasw21 Oct 22, 2021
4fc9ab5
[Logging] Improve logging mechanism (#154)
thomasw21 Oct 22, 2021
cbbfd7a
Bump minimum version for torch (#156)
thomasw21 Oct 25, 2021
85e3c1f
[tests] fix requirements (#158)
stas00 Oct 25, 2021
1821201
don't save latest_checkpointed_iteration.txt w/ deepspeed (#159)
stas00 Oct 26, 2021
6a9d73b
[testing] fix bnb test skipping (#160)
stas00 Oct 26, 2021
087a7e1
remove useless log line (#161)
stas00 Oct 26, 2021
a55c007
Fix curriculum learning doc (#162)
conglongli Oct 26, 2021
224d7c1
[checkpoint] only one latest file (#164)
stas00 Oct 27, 2021
10b4d42
[CI] fix ci / update packages (#170)
stas00 Oct 29, 2021
0f72501
Update main.yml (#172)
stas00 Oct 29, 2021
7364280
Fix prefix lm offsets (#167)
thomasw21 Oct 29, 2021
54bb7a3
Adding language specific validation sets for Multilingual model train…
hadyelsahar Nov 3, 2021
ed812bd
Fixed merge oversight in tensorboard logs
TevenLeScao Nov 3, 2021
afb3778
simplifying tests
TevenLeScao Nov 3, 2021
2a967d5
Fixed TP > 1 issue with new validation scheme
TevenLeScao Nov 4, 2021
04e2856
Alternative fix to TP > 1 (#178)
thomasw21 Nov 4, 2021
11a2a36
[CI] improvements (#185)
stas00 Nov 9, 2021
5b34a6b
[PrefixLM] Figuring out why prefix lm is doing poorly on short contex…
thomasw21 Nov 10, 2021
0635ea2
[BNB] integrate `StableEmbeding` into `VocabParallelEmbedding` logic …
stas00 Nov 10, 2021
1f678bc
Full seqlen eval for CL+PP (#187)
conglongli Nov 13, 2021
0f425a2
Support skip iteration flag (#177)
jaketae Nov 17, 2021
c73e784
add layernorm in Embedding (#191)
stas00 Nov 18, 2021
d13cbeb
removed regular package for megatron model (#192)
stas00 Nov 19, 2021
b124614
Add eval-only arg (#188)
SaulLu Nov 19, 2021
20e9afc
Delete unnecessary brackets (#197)
SaulLu Nov 19, 2021
28dd4e7
[CI] fix which tests get run (#199)
stas00 Nov 20, 2021
7dd85a6
add missing space (#200)
SaulLu Nov 22, 2021
767eccb
elastic launcher compatible init_process_group (#201)
stas00 Nov 23, 2021
d443ec7
[WIP] dealing with multi-process noise (#193)
stas00 Nov 23, 2021
d1713be
param size printing revamp (#202)
stas00 Nov 23, 2021
f2a5402
[test] `--partition-activations` (#184)
stas00 Nov 24, 2021
a100a75
chore: add tmp directory to `.gitignore` (#205)
jaketae Nov 25, 2021
a390485
Reweighting strat for prefix lm (#190)
thomasw21 Nov 26, 2021
cd06baf
Checking we use fused kernels to compute scaled masked softmax on pre…
thomasw21 Nov 26, 2021
cafe8cc
Revert "Checking we use fused kernels to compute scaled masked softma…
thomasw21 Nov 27, 2021
8e928e9
Fix consumed_valid_samples counting for several valid dataloaders
TevenLeScao Dec 1, 2021
8532df6
[TB] add throughput graphs (#210)
stas00 Dec 7, 2021
30436f9
replay layer_norm_cuda_kernel.cu fixes (#216)
stas00 Dec 10, 2021
94421bf
improve build (#207)
stas00 Dec 13, 2021
96dbac8
[logging] synced print (#217)
stas00 Dec 16, 2021
3cbdb38
fix tflops calculation (#223)
stas00 Jan 5, 2022
5ccf8b6
tflops for CL (#224)
stas00 Jan 5, 2022
b7f8a62
Revert "tflops for CL (#224)" (#225)
stas00 Jan 6, 2022
2b26dca
Fix alibi (#222)
thomasw21 Jan 6, 2022
90138b1
save args to txt file (#218)
bhavitvyamalik Jan 13, 2022
f36a0ff
implement missing --no-load-optim support for deepspeed path (#231)
stas00 Jan 15, 2022
a1b688e
fix tests (#232)
stas00 Jan 15, 2022
9ad0d97
[ds report] less noise (#215)
stas00 Jan 21, 2022
8c2e1da
enable new_style for add_scalar function for faster data format (#237)
abodacs Jan 21, 2022
e4fc19c
fix add_scalar for pt<1.9 (#240)
stas00 Jan 24, 2022
f0a57f0
Fix throughput unit (#241)
janEbert Jan 28, 2022
5812e4e
[TB] log restarts (#234)
stas00 Jan 29, 2022
d0a047d
Alibi Tensor Parallel Fix (#244)
DanielHesslow Feb 1, 2022
c3e4230
implement kill switch (#245)
stas00 Feb 5, 2022
0cbd399
--abort-on-unmet-fused-kernel-constraints (#247)
stas00 Feb 8, 2022
24d72c6
[apex FusedAdam] crash workaround (#249)
stas00 Feb 18, 2022
14e1e3b
Replace approximate formula with exact one for throughput (#251)
deepakn94 Feb 22, 2022
ee49e63
Fix preprocess_data_many_cores to use dtype
thomasw21 Feb 25, 2022
77fcc4e
Fix preprocess_data_many_cores to use dtype
thomasw21 Feb 25, 2022
a5b28a7
Use padded vocab size in preprocessing scripts (#253)
thomasw21 Feb 25, 2022
2d10187
Try to read the data path arguments directly from a file (#254)
thomasw21 Feb 26, 2022
3b227b8
[sync] bf16 (#250)
stas00 Feb 28, 2022
65e96a2
make partition_method configurable (#256)
stas00 Feb 28, 2022
8f5a517
add `pad-vocab-size-to` argument and tests (#255)
SaulLu Mar 1, 2022
4b3a447
deploy elastic error handler (#258)
stas00 Mar 1, 2022
e6598b7
sync the whole Meg-LM fused_kernels sub-tree (#260)
stas00 Mar 7, 2022
7a67ce2
allocate embed norm only on pp0 (#261)
stas00 Mar 7, 2022
59f9a3f
switch to MixedFusedLayerNorm (#262)
stas00 Mar 9, 2022
decebdc
preprocessing from arrow file to load an HF dataset
TevenLeScao Mar 11, 2022
eab76a6
Sorry, last change was meant to a PR. This reverts commit d0fcf4170de…
TevenLeScao Mar 11, 2022
543992e
[kill switch] correct sys.exit (#266)
stas00 Mar 18, 2022
f2e1f03
disable samples-per-dataset, steps-per-dataset, tokens-per-dataset (#…
stas00 Mar 18, 2022
43aea86
[kill switch] fix test (#268)
stas00 Mar 18, 2022
315e21f
[tensorboard] add rename and remove event tools (#269)
stas00 Mar 18, 2022
d8ba7a2
`torch.testing.assert_equal` didn't make it (#273)
stas00 Mar 25, 2022
c16a81e
add stop alarm instructions
stas00 Apr 1, 2022
cf81e3d
add start-fast doc (#278)
stas00 Apr 13, 2022
9214b77
tweak the doc
stas00 Apr 13, 2022
fcdb527
Create CODEOWNERS
TevenLeScao Apr 25, 2022
d236376
Update CODEOWNERS
TevenLeScao Apr 26, 2022
fa8e9b9
Update CODEOWNERS
TevenLeScao Apr 26, 2022
20a0201
Update CODEOWNERS
TevenLeScao Apr 26, 2022
4e16e4a
Fix mixed fused layer norm to mimick nn.LayerNorm for torch>1.11 (#281)
thomasw21 May 3, 2022
56333df
[valid] deadlock workaround (#282)
stas00 May 31, 2022
4cf7e64
Fix tflops glu computation (#283)
Muennighoff Jun 5, 2022
dd53f9d
Fix DS init (#285)
Quentin-Anthony Jun 22, 2022
c384a45
Mlm adaptation (#287)
Jun 27, 2022
0f29406
Fixed MLM dataset arguments(#290)
thomasw21 Jun 27, 2022
7cf6469
Eval harness (#212)
DanielHesslow Jun 28, 2022
3d26047
Merge MLM too fast 2 (#294)
thomasw21 Jun 30, 2022
5d05153
MTF dataset and packing (#293)
thomasw21 Jul 2, 2022
d59ed79
CI fixes (#302)
stas00 Jul 4, 2022
22a31f0
sync layer norms (#272)
stas00 Jul 4, 2022
3a5b327
MTF train script (#295)
thomasw21 Jul 5, 2022
464b45f
Add support for weighted train (#299)
thomasw21 Jul 6, 2022
70e221c
Combine Specs (#304)
Muennighoff Jul 7, 2022
75a2c91
Add bias a weight we need to sync as well (#307)
thomasw21 Jul 7, 2022
5302902
Fix causal attention mask (#306)
thomasw21 Jul 7, 2022
cf56a8b
Create README.md
stas00 Jul 10, 2022
d677a6d
not yet working script
stas00 Jul 10, 2022
fcd61be
hardcode the dtype depending on the model
stas00 Jul 10, 2022
3aa84d6
change the mp based on the world_size
Jul 10, 2022
3b58343
remove hardcoded world_size
stas00 Jul 10, 2022
b694a4f
add bigscience/bigscience-small-testing
stas00 Jul 10, 2022
f2ed3a1
Merge branch 'bloom-inference' of https://github.com/bigscience-works…
Jul 10, 2022
0079ff7
fixes
stas00 Jul 10, 2022
3dfc089
add zero-inference script
stas00 Jul 10, 2022
bf44780
fixes
stas00 Jul 11, 2022
afe9027
fix
stas00 Jul 11, 2022
8b9fe59
working script
stas00 Jul 12, 2022
09544e4
renames
stas00 Jul 12, 2022
f2a5520
fixes
stas00 Jul 12, 2022
0394c07
fix for offline use
stas00 Jul 13, 2022
0d8a99d
add benchmark
stas00 Jul 13, 2022
c5d82a9
add benchmark
stas00 Jul 13, 2022
dd78ac3
update
stas00 Jul 13, 2022
f3548a2
cleanup
stas00 Jul 13, 2022
3c0cc4e
update
stas00 Jul 13, 2022
d7cbbe1
msecs
stas00 Jul 13, 2022
c1bac35
cleanup
stas00 Jul 13, 2022
be61e59
improve
stas00 Jul 13, 2022
62b141f
fix benchmark, add warmup
stas00 Jul 13, 2022
541590e
update
stas00 Jul 13, 2022
aad51bc
fix; thanks Michael Wyatt
stas00 Jul 13, 2022
fd10718
clarify
stas00 Jul 13, 2022
32dd5ca
Merge branch 'bloom-inference' of https://github.com/bigscience-works…
Jul 13, 2022
1127770
add bloom batch-inference script
Jul 13, 2022
595d746
removed the names :-)
Jul 13, 2022
ac9b849
fold the bs functionality from the other script
stas00 Jul 13, 2022
d725153
fix
stas00 Jul 13, 2022
b145858
restore do_sample
stas00 Jul 13, 2022
bd8971f
dump generate args
stas00 Jul 13, 2022
4ceaa57
fix
stas00 Jul 14, 2022
5eee60c
fix
stas00 Jul 14, 2022
70ed13d
support any batchsize
stas00 Jul 14, 2022
256860d
div by bs
stas00 Jul 14, 2022
5fdd563
mul by bs
stas00 Jul 14, 2022
fb5c95c
add cpu_offload; sync scripts
stas00 Jul 14, 2022
fdb42c2
wip
stas00 Jul 14, 2022
d7e661e
improvements
stas00 Jul 15, 2022
71b8675
fixes
stas00 Jul 15, 2022
781993b
fixes
stas00 Jul 15, 2022
6fa6129
add accelerate script
stas00 Jul 15, 2022
29b6b30
fix
stas00 Jul 15, 2022
cc69be6
wip
stas00 Jul 16, 2022
e447bab
wip
stas00 Jul 16, 2022
c2175e4
stats
stas00 Jul 18, 2022
ce0c975
add OnDevice and remove zero-inference (#316)
jeffra Jul 19, 2022
f12a3d0
wip
stas00 Jul 19, 2022
ac3d7cb
rework generate + benchmark
stas00 Jul 19, 2022
fd28fc0
figure out the memory map dynamically
stas00 Jul 19, 2022
0e6562e
bug fix
stas00 Jul 19, 2022
7768629
fix ds-zero-inference wrt device
stas00 Jul 19, 2022
97e6b53
bug fix
stas00 Jul 20, 2022
092a1fd
update
stas00 Jul 20, 2022
118b4ab
update
stas00 Jul 22, 2022
49fe618
add server scripts
Aug 4, 2022
b83b39d
fix bug
Aug 4, 2022
031c0ee
new code
Aug 6, 2022
450eb1f
working code
Aug 7, 2022
fc5d383
fix bug
Aug 7, 2022
39bab5c
update readme
Aug 7, 2022
303a6cf
increase batch size for HF accelerate
Aug 7, 2022
69d5cf2
increase batch size
Aug 7, 2022
d816f59
support dynamic batch size with deepspeed
Aug 7, 2022
73a79d2
drop num tokens
Aug 7, 2022
89dfe23
drop return type
Aug 7, 2022
03abce7
oom
Aug 8, 2022
145 changes: 145 additions & 0 deletions .github/workflows/ci.md
@@ -0,0 +1,145 @@
# CI setup

The CI is set up with GitHub Actions using the on-demand EC2 backend.

This setup currently uses a 4-GPU p3.8xlarge instance, which allows testing tp=2, pp=2.

**Unfortunately this only works for PRs created from non-forked branches**


## The workflow file

The workflow file is at `.github/workflows/main.yml`


```
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0dfaabfa78a779fbc
          ec2-instance-type: p3.8xlarge
          subnet-id: subnet-3502b45e
          security-group-id: sg-e8f46d9d
```

- `ec2-image-id` is the AMI, which has to be created in, or copied to, the `aws-region` the script requests.
- `subnet-id` comes from: https://console.aws.amazon.com/vpc/home?region=us-east-1#subnets:
- `security-group-id` comes from: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#SecurityGroups:


The workflow was later made fault-tolerant by trying to start the EC2 instance in 3 different availability zones, to cope with situations where EC2 reports it doesn't have the resources to start the desired instance.
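The fallback logic amounts to "try each zone until one works". A minimal sketch of that idea, where `try_start` is a hypothetical stand-in for the real "start EC2 runner" step (here hardcoded to succeed only for the last subnet, to demonstrate the fallback):

```shell
# Try each availability zone's subnet in turn; keep the first success.
subnets="subnet-b7533b96 subnet-a396b2ad subnet-df0f6180"

# Hypothetical stand-in for the real EC2 start step; succeeds only for
# the last subnet so the fallback path is exercised.
try_start() {
  [ "$1" = "subnet-df0f6180" ]
}

started=""
for subnet in $subnets; do
  if try_start "$subnet"; then
    started="$subnet"
    break
  fi
done
echo "runner started in: $started"
```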



## Connect to instance

To pre-install things, connect to the instance manually and install whatever is needed:

1. choose and start an EC2 instance
2. connect to it as `ubuntu`, then `sudo su`, since the runner runs as `root` (I couldn't find a way around this):
```
ssh -l ubuntu -i "~/.ssh/bigscience-aim.pem" [email protected]
```

Once installed, stop the instance.

Then create a new AMI (see below) and update the script using the new AMI.


## Prepare the machine

Steps used to set up fixed software (which won't be installed at test time):

- install cuda:
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu-installation

### install fixed packages

- `torch 1.9.0/cu-11.1`

```
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
```

- all kinds of prerequisites
```
pip install transformers
wget https://raw.githubusercontent.com/microsoft/DeepSpeed/master/requirements/requirements.txt -O requirements-ds.txt
pip install -r requirements-ds.txt
wget https://raw.githubusercontent.com/bigscience-workshop/Megatron-DeepSpeed/main/requirements.txt -O requirements-ms.txt
pip install -r requirements-ms.txt

```

- apex - needs a hack to deal with mismatching minor CUDA versions (and it takes forever to build), so the following patch was used:

XXX: this no longer works - had to manually patch pytorch to avoid the mismatch failure

```
--- a/setup.py
+++ b/setup.py
@@ -99,6 +99,7 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
     print(raw_output + "from " + cuda_dir + "/bin\n")

     if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
+        return
         raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
                            "not match the version used to compile Pytorch binaries. " +
                            "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +

```
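The check the patch short-circuits just compares major.minor CUDA versions. A plain-shell sketch of the same comparison, with the version strings hardcoded as an example of a mismatch (the real check reads them from `nvcc` and torch):

```shell
# Example values only; apex derives these from nvcc and torch at build time.
bare_metal="11.1"    # CUDA toolkit installed on the machine
torch_binary="11.3"  # CUDA version torch was compiled with

bare_major=${bare_metal%%.*};    bare_minor=${bare_metal#*.}
torch_major=${torch_binary%%.*}; torch_minor=${torch_binary#*.}

if [ "$bare_major" != "$torch_major" ] || [ "$bare_minor" != "$torch_minor" ]; then
  echo "CUDA mismatch: $bare_metal (nvcc) vs $torch_binary (torch)"
fi
```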

Install it (the repo was cloned from `git clone https://github.com/NVIDIA/apex`):

```
cd code/apex
# I copied this script from my setup
./build.sh
```


## make a new AMI image

Once the needed things are installed (and every time anything new is installed), a new AMI must be created (this is like an .iso image snapshot):

1. go to https://us-east-1.console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:
2. choose the instance to create a new image from
3. Actions -> Image and Templates -> Create Image

Make sure it's created in the correct region (the same one used in the script) - or copy it to the right region afterwards.
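The same console steps can also be scripted with the AWS CLI; a rough sketch, where the instance id, image ids, and image names are placeholders:

```shell
# Create an AMI from the (updated) instance.
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "ci-runner-ami-$(date +%Y%m%d)" \
  --region us-east-1

# If the AMI lives in another region, copy it into the one the workflow uses.
aws ec2 copy-image \
  --source-image-id ami-0123456789abcdef0 \
  --source-region us-east-2 \
  --region us-east-1 \
  --name "ci-runner-ami-copy"
```

Both commands print the new AMI id, which then goes into `ec2-image-id`.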

The image can be created while the instance being snapshotted is still running.

Just don't forget to turn the instance off once you have validated that the new image works.

Finally, once created, the workflow needs to be updated to use the new AMI id (key `ec2-image-id`) in `.github/workflows/main.yml`.


## Stop instance alarm

It looks like occasionally the instance doesn't stop and keeps running.

I added a stop alarm that automatically stops the instance after 1h of utilization below 10%, following the exact instructions from:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html
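For reference, a CLI equivalent of those console steps might look like this (instance id is a placeholder; the alarm stops the instance when average CPU utilization stays below 10% for an hour):

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name "stop-idle-ci-runner" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 3600 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:stop
```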


## Guides

Setup guide: https://github.com/machulav/ec2-github-runner

Launching an EC2 instance:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html?icmpid=docs_ec2_console

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html

- All available instances: https://aws.amazon.com/ec2/instance-types/
211 changes: 211 additions & 0 deletions .github/workflows/main.yml
@@ -0,0 +1,211 @@
name: Run all tests
on:
  # enable to manually trigger the tests
  workflow_dispatch:
  pull_request:
    paths:
      - "**.py"

jobs:

  # GPU sizes and types that we could use:
  # g4dn.12xlarge 4x 16GB T4   (CC 7.5) (low availability)
  # p3.8xlarge    4x 16GB V100 (CC 7.0) (very low availability)

  # Unfit:
  # g3.16xlarge 4x 8GB Tesla M60 (CC 5.2) (not supported by cuda-11)
  # p2.8xlarge  8x 12GB K80     (CC 3.7, not supported by cuda-11)

  start-runner:
    name: Start self-hosted EC2 runner
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      # don't use the following subnets as p3.8xlarge is not supported there:
      # - subnet-06576a4b # us-east-1d
      # - subnet-859322b4 # us-east-1e
      # - subnet-47cfad21 # us-east-1b
      - name: Try to start EC2 runner (a)
        id: try-us-east-1a
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: g4dn.12xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-b7533b96 # us-east-1c
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]
      - name: Try to start EC2 runner (b)
        id: try-us-east-1b
        if: steps.try-us-east-1a.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: g4dn.12xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-a396b2ad # us-east-1f
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]
      - name: Try to start EC2 runner (c)
        id: try-us-east-1c
        if: steps.try-us-east-1b.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: g4dn.12xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-df0f6180 # us-east-1a
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]

      - name: Try to start EC2 runner (a-2)
        id: try-us-east-1a-2
        if: steps.try-us-east-1c.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: p3.8xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-b7533b96 # us-east-1c
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]
      - name: Try to start EC2 runner (b-2)
        id: try-us-east-1b-2
        if: steps.try-us-east-1a-2.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: p3.8xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-a396b2ad # us-east-1f
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]
      - name: Try to start EC2 runner (c-2)
        id: try-us-east-1c-2
        if: steps.try-us-east-1b-2.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: p3.8xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-df0f6180 # us-east-1a
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]

      - name: See if any of 3 sub-regions had the resource
        id: start-ec2-runner
        run: |
          if [ "${{ steps.try-us-east-1a.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1a.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1a.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-1b.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1b.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1b.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-1c.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1c.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1c.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-1a-2.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1a-2.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1a-2.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-1b-2.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1b-2.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1b-2.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-1c-2.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1c-2.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1c-2.outputs.ec2-instance-id }}"
          fi


  do-the-job:
    name: Do the job on the runner
    needs: start-runner # required to start the main job when the runner is ready
    # need to figure out how to cancel the previous build if a new push was made while the old test is still running
    # concurrency: # cancel previous build on a new push
    #   group: ${{ github.ref }} # https://docs.github.com/en/actions/reference/context-and-expression-syntax-for-github-actions#github-context
    #   cancel-in-progress: true
    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
    steps:
      - name: NVIDIA-SMI
        run: nvidia-smi

      - name: Checkout
        uses: actions/checkout@v2

      - name: Install Dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest-timeout

      - name: Run tests
        run: pytest --timeout=600 tests

  stop-runner:
    name: Stop self-hosted EC2 runner
    needs:
      - start-runner # required to get output from the start-runner job
      - do-the-job # required to wait when the main job is done
    runs-on: ubuntu-latest
    if: ${{ always() }} # required to stop the runner even if an error happened in the previous jobs
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Stop EC2 runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}