Add nightly pipeline for MI100 to run convergence and batch size test similar to V100. #6611
Merged
Changes from 24 commits
Commits
26 commits
45eac22
Partial updating of ROCM reduction code.
jessebenson a4624f6
Update reduction_all.cu
jessebenson 441ac35
Add reduce template parameters.
jessebenson 3424767
miopen common
jessebenson 75894ec
Reuse CUDA's reduction_functions.cc
jessebenson 01b48e2
Reduction ops.
jessebenson 172016d
Update remaining reduction ops to use MIOpen. double datatype is not…
jessebenson 1f29c56
Disable a couple more unsupported tests.
jessebenson ddd1017
Code formatting.
jessebenson 30e5ea2
Delete ROCM-specific reduction code that is identical to CUDA reducti…
jessebenson 74bea5a
Fix scratch buffer early free.
jessebenson 32d02ab
Fix merge conflict.
jessebenson 06c7b71
first attempt nightly amd ci pipeline
241d488
try fix bad yaml file
5462356
try again with corrected model directory
4a725b1
add convergence test as well
8d9b000
update reference loss for amd mi100
04e0f8d
include mi100 test results csv
da9b78d
merge jesseb/rocm-reduction to enable deterministic compute
a0bf453
update the mi100 convergence test reference values
843ad79
update batch sizes for mi100 32g
730ba21
fix gpu sku for run_convergence_test.py
9a252db
merge wiht master
900269a
undo unrelated changes to master
3dfabe7
pr comments
d68462b
pr comment
11 changes: 11 additions & 0 deletions
orttraining/tools/ci_test/results/bert_base.convergence.baseline.mi100.csv
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| step,total_loss,mlm_loss,nsp_loss | ||
| 0,11.217,10.5178,0.699256 | ||
| 5,9.67644,7.52047,2.15598 | ||
| 10,8.31964,7.54136,0.778281 | ||
| 15,8.22823,7.54625,0.681978 | ||
| 20,8.17299,7.49675,0.676236 | ||
| 25,8.2415,7.5356,0.705902 | ||
| 30,8.0874,7.39312,0.694279 | ||
| 35,7.99095,7.25612,0.734829 | ||
| 40,7.92988,7.25608,0.673804 | ||
| 45,7.94762,7.27291,0.674713 |
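A baseline file like this is typically consumed by comparing freshly measured losses against the recorded values step by step. The following is a minimal sketch of that idea, not the logic of `run_convergence_test.py` itself: the function names `load_losses` and `find_mismatches` and the relative tolerance are assumptions made for illustration, and only the first three baseline rows are reproduced.

```python
import csv
import io
import math

# First three rows of the MI100 baseline shown above.
BASELINE_CSV = """step,total_loss,mlm_loss,nsp_loss
0,11.217,10.5178,0.699256
5,9.67644,7.52047,2.15598
10,8.31964,7.54136,0.778281
"""


def load_losses(text):
    """Parse CSV text into {step: {loss_name: value}}."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        int(row["step"]): {k: float(v) for k, v in row.items() if k != "step"}
        for row in reader
    }


def find_mismatches(baseline, observed, rel_tol=1e-4):
    """Return (step, loss_name, expected, actual) for every deviation.

    rel_tol is a hypothetical tolerance; the real test may use a
    different comparison rule.
    """
    mismatches = []
    for step, expected in baseline.items():
        actual = observed.get(step)
        if actual is None:
            # The observed run did not report this step at all.
            mismatches.append((step, "missing", None, None))
            continue
        for name, want in expected.items():
            got = actual.get(name)
            if got is None or not math.isclose(got, want, rel_tol=rel_tol):
                mismatches.append((step, name, want, got))
    return mismatches
```

Calling `find_mismatches(load_losses(BASELINE_CSV), observed)` with losses parsed from a new run yields an empty list when every recorded step matches within the tolerance. A tight tolerance only makes sense when the computation is deterministic, which is what the merged reduction changes in this PR are meant to enable.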
68 changes: 68 additions & 0 deletions
tools/ci_build/github/azure-pipelines/orttraining-linux-gpu-amd-e2e-test-ci-pipeline.yml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| trigger: none | ||
| | ||
| name: 'orttraining_amd_nightly_$(Date:yyyyMMdd)_$(Rev:r)' | ||
| pool: 'AMD-GPU' | ||
| | ||
| jobs: | ||
| - job: Onnxruntime_Linux_GPU_AMD_Training_E2E_Test | ||
| | ||
| timeoutInMinutes: 60 | ||
| | ||
| steps: | ||
| - checkout: self | ||
| clean: true | ||
| submodules: recursive | ||
| | ||
| - script: |- | ||
| echo "##vso[task.prependpath]/home/ciagent/conda/bin/" | ||
| echo "##vso[task.prependpath]/home/ciagent/pkg/openmpi-4.0.5/bin/" | ||
| echo '##vso[task.setvariable variable=LD_LIBRARY_PATH]/home/ciagent/pkg/openmpi-4.0.5/lib/' | ||
| eval "$('/home/ciagent/conda/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" | ||
| echo "Selecting GPU based on HIP_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES" | ||
| displayName: 'Initialize environment' | ||
| | ||
| # update these if the E2E test data changes | ||
| - script: |- | ||
| python orttraining/tools/ci_test/download_azure_blob_archive.py \ | ||
| --azure_blob_url https://onnxruntimetestdata.blob.core.windows.net/training/onnxruntime_training_data.zip?snapshot=2020-06-15T23:17:35.8314853Z \ | ||
| --target_dir training_e2e_test_data \ | ||
| --archive_sha256_digest B01C169B6550D1A0A6F1B4E2F34AE2A8714B52DBB70AC04DA85D371F691BDFF9 | ||
| displayName: 'Download onnxruntime_training_data.zip data' | ||
| | ||
| - script: |- | ||
| python tools/ci_build/build.py \ | ||
| --config RelWithDebInfo \ | ||
| --enable_training \ | ||
| --mpi_home /home/ciagent/pkg/openmpi-4.0.5 \ | ||
| --use_rocm \ | ||
| --rocm_home /opt/rocm \ | ||
| --nccl_home /opt/rocm \ | ||
| --update \ | ||
| --build_dir ./build \ | ||
| --build \ | ||
| --parallel 8 \ | ||
| --build_wheel \ | ||
| --skip_tests | ||
| displayName: 'Build onnxruntime' | ||
| | ||
| - script: |- | ||
| cd ./build/RelWithDebInfo &&\ | ||
| ../../tools/ci_build/github/pai/pai_test_launcher.sh | ||
| displayName: 'Run unit tests' | ||
| | ||
| - script: |- | ||
| python orttraining/tools/ci_test/run_batch_size_test.py \ | ||
| --binary_dir build/RelWithDebInfo \ | ||
| --model_root training_e2e_test_data/models \ | ||
| --gpu_sku MI100_32G | ||
| displayName: 'Run batch size test' | ||
| condition: succeededOrFailed() # ensure all tests are run | ||
| | ||
| - script: |- | ||
| python orttraining/tools/ci_test/run_convergence_test.py \ | ||
| --binary_dir build/RelWithDebInfo \ | ||
| --model_root training_e2e_test_data/models \ | ||
| --training_data_root training_e2e_test_data/data \ | ||
| --gpu_sku MI100_32G | ||
| displayName: 'Run convergence test' | ||
| condition: succeededOrFailed() # ensure all tests are run |
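The download step in the pipeline passes `--archive_sha256_digest` so that the test data archive can be checked before use. As a rough illustration of what such a check involves, here is a minimal sketch using only the standard library. It is not the implementation of `download_azure_blob_archive.py`; the helper names `sha256_hex` and `digest_matches` are hypothetical.

```python
import hashlib


def sha256_hex(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(path, expected_hex):
    """Compare digests case-insensitively.

    The pipeline supplies the expected digest in uppercase, while
    hashlib returns lowercase, so both sides are normalized.
    """
    return sha256_hex(path).lower() == expected_hex.strip().lower()
```

Pinning both the blob snapshot in the URL and the digest means that any change to the test data has to be made deliberately by editing the pipeline file, which is consistent with the "update these if the E2E test data changes" comment above the step.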