Enable float8 CI on sm89 #587

jainapurva · 2024-08-01T19:56:32Z

The current CI/CD pipeline skips float8 tests because they are not compatible with the A10G GPUs. This PR adds a new CI job that runs the float8 tests on NVIDIA L4 Tensor Core GPUs.
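
For context, the A10G is a compute capability 8.6 (sm86) GPU and the L4 is 8.9 (sm89). Below is a minimal sketch of the kind of gate that would skip such tests on an A10G, assuming the float8 tests require sm89 or newer. The helper names `fp8_supported` and `current_capability` are hypothetical and not taken from this repository.

```python
from typing import Optional, Tuple


def fp8_supported(capability: Optional[Tuple[int, int]]) -> bool:
    """Return True when the (major, minor) compute capability is sm89 or newer.

    Assumption: sm89 is the minimum architecture the float8 tests need.
    """
    return capability is not None and capability >= (8, 9)


def current_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of the current CUDA device, or None."""
    try:
        import torch  # imported lazily so the sketch also loads without torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    for name, cap in [("A10G", (8, 6)), ("L4", (8, 9)), ("H100", (9, 0))]:
        print(name, cap, "runs float8 tests:", fp8_supported(cap))
```

A test could then be decorated with `pytest.mark.skipif(not fp8_supported(current_capability()), reason="requires sm89+")`, which would explain why the same test is skipped on one runner and executed on another.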

Fixes Issue: #575

Test Plan: Compare the logs of the Regression Test and Float8 Test jobs to confirm that tests skipped in Regression Test now run in the new Float8 Test CI job.

Sample of test cases that were skipped in the CUDA Nightly run of Regression Test and now run in Float8 Test:

test/float8/test_numerics_integration.py::TestFloat8NumericsIntegrationTest::test_encoder_fw_bw

test/float8/test_inference_flows.py::TestHPTrainToFP8LinearInference::test_dynamic_fp8_mlp

test/float8/test_inference_flows.py::TestHPTrainToFP8LinearInference::test_static_fp8_mlp
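
The log comparison described in the test plan could be scripted roughly as follows. This is a sketch that assumes both jobs ran `pytest -v`, so each test appears on a line of the form `<node id> PASSED|SKIPPED|FAILED`. The function names are hypothetical, and the sample log lines in the `__main__` block are made up for illustration, not copied from CI.

```python
import re

# Matches a pytest node id followed by its outcome, anywhere in a log line,
# so CI timestamp prefixes in front of the node id are tolerated.
OUTCOME_LINE = re.compile(r"(\S+::\S+)\s+(PASSED|SKIPPED|FAILED)\b")


def outcomes(log_text: str) -> dict:
    """Map each test node id found in a verbose pytest log to its outcome."""
    result = {}
    for line in log_text.splitlines():
        match = OUTCOME_LINE.search(line)
        if match:
            result[match.group(1)] = match.group(2)
    return result


def newly_running(regression_log: str, float8_log: str) -> list:
    """Tests skipped in the regression log that passed in the float8 log."""
    regression = outcomes(regression_log)
    float8 = outcomes(float8_log)
    return sorted(
        test
        for test, status in regression.items()
        if status == "SKIPPED" and float8.get(test) == "PASSED"
    )


if __name__ == "__main__":
    # Illustrative log lines only.
    regression_log = (
        "test/float8/test_inference_flows.py::TestHPTrainToFP8LinearInference"
        "::test_static_fp8_mlp SKIPPED [ 50%]\n"
    )
    float8_log = (
        "test/float8/test_inference_flows.py::TestHPTrainToFP8LinearInference"
        "::test_static_fp8_mlp PASSED [ 50%]\n"
    )
    print(newly_running(regression_log, float8_log))
```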

Screenshot of logs of CUDA Nightly

Screenshot of logs of Float8 test

For the full list of passed tests: https://gist.github.com/jainapurva/a5ddd17c809219151485de8d1708d078

pytorch-bot · 2024-08-01T19:56:35Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/587

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 01f0ba1 with merge base 013cce3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

msaroufim · 2024-08-05T18:54:20Z

So, a question here:

Are these g6 machines cheaper than g5? Because it sounds like the count goes from 12 to 4. If it is cheaper we can just use this machine by default instead and you won't need an extra job.
Also, if the plan is to launch an extra job for fp8, then the new job should only run fp8 tests; it'll make your tests run a lot faster.

EDIT: Ok, that's a lot of test failures. Mind opening up a tracker with them so we can assign them to the right owners?

Sparsity issues will be fixed by @jcaip
Affine quantized tensor by @jerryzh168

.github/workflows/custom_test.yml

msaroufim · 2024-08-06T04:34:17Z

I noticed the fp8 test is not being triggered here so instead what I did was

Copy pasted your updated regression test file
Deleted all strategies except the one that needs to run on the g6 machines
Updated the test to only run fp8

So you can go ahead and make the updates here from this PR #603. I won't land 603.

For GitHub Actions, because you can't iterate fast when you need to wait on CI, I'd recommend copy-pasting as much as possible so you minimize the surface area for typos, or running tiny experiments where you temporarily delete all jobs in your branch except the one you're interested in iterating on. It makes iterating far more pleasant.

Also, regarding pricing, if we compare g6.4x prices vs g5.12x prices (https://aws.amazon.com/ec2/instance-types/g6/ and https://aws.amazon.com/ec2/instance-types/g5/):

g5.12x: $5.672 per hour
g6.4x: $1.323 per hour

So you could reduce our CI costs by about 4x just by changing the default machine type, but we can do this in a future PR.
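
The "about 4x" figure follows directly from the two hourly prices quoted above. A quick sketch of the arithmetic is below. It assumes a job takes the same wall-clock time on either machine, which this thread does not establish. The constant and function names are illustrative.

```python
# Hourly prices as quoted in the comment above (USD per instance-hour).
G5_12X_USD_PER_HOUR = 5.672
G6_4X_USD_PER_HOUR = 1.323


def cost_ratio(old_price: float, new_price: float) -> float:
    """How many times more expensive the old machine is per hour."""
    return old_price / new_price


def hourly_saving(old_price: float, new_price: float) -> float:
    """Fraction of the hourly cost saved by switching to the new machine."""
    return 1 - new_price / old_price


ratio = cost_ratio(G5_12X_USD_PER_HOUR, G6_4X_USD_PER_HOUR)
saving = hourly_saving(G5_12X_USD_PER_HOUR, G6_4X_USD_PER_HOUR)

if __name__ == "__main__":
    print(f"g5.12x costs {ratio:.2f}x as much per hour as g6.4x")
    print(f"hourly saving from switching: {saving:.1%}")
```

The ratio comes out slightly above 4, so "about 4x" is a fair summary as long as the jobs do not run noticeably longer on the smaller instance.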

.github/workflows/float8_test.yml

test/float8/test_base.py

fat finger

test/float8/test_numerics_integration.py

.github/workflows/float8_test.yml

updates

87efbf9

facebook-github-bot added the CLA Signed label Aug 1, 2024

Updates

bd89a74

jainapurva marked this pull request as ready for review August 2, 2024 00:29

Updated the gpu name

5531492

jainapurva added 8 commits August 5, 2024 15:38

Trying to test only fp8

c842f83

Updates

5ca399d

Updates

a27985a

Updates

6aff39c

Updates

d4531e0

Updates

c14fde0

Trying custom test

e197597

updates

e2f1dfa

msaroufim reviewed Aug 5, 2024

.github/workflows/custom_test.yml (outdated, resolved)

jainapurva added 5 commits August 5, 2024 16:33

Updates

1b4f4c1

Added torch nightly

2145c57

Added torch nightly

d495e16

updates

4341fd6

updates'

19607f0

jainapurva added 2 commits August 6, 2024 09:32

Seperate tests

76867c2

updated names

487f669

msaroufim reviewed Aug 6, 2024

.github/workflows/float8_test.yml (outdated, resolved)

updated names

f930591

jainapurva mentioned this pull request Aug 6, 2024

Update the CI/CD to use AWS G6 instance #611

Open

9 tasks

msaroufim previously approved these changes Aug 6, 2024

msaroufim reviewed Aug 6, 2024

test/float8/test_base.py (resolved)

jainapurva added 3 commits August 6, 2024 16:43

remove H100 check

23b950f

remove H100 check

77e42e8

Failing test case

cf9f173

vkuzo reviewed Aug 7, 2024

test/float8/test_numerics_integration.py (outdated, resolved)

msaroufim reviewed Aug 7, 2024

.github/workflows/float8_test.yml (outdated, resolved)

jainapurva added 2 commits August 7, 2024 10:04

Test on torch latest version

521da10

unpin torch

0d948fb

jainapurva force-pushed the float8-on-sm89 branch 2 times, most recently from 26371be to 0d948fb Compare August 7, 2024 17:26

jainapurva added 5 commits August 7, 2024 10:30

Updated comments

3c96f49

Failing test fixes

ba7a82c

updated name

a40bc10

Updates testing

13023d4

Fixes for failing tests

01f0ba1

msaroufim approved these changes Aug 7, 2024

jainapurva merged commit 1cfe69e into main Aug 7, 2024
14 checks passed
