
Skip Tests for GPUs Not Supporting bf16 #159

Merged
4 commits merged into linkedin:main on Aug 29, 2024

Conversation

@austin362667 (Contributor) commented on Aug 29, 2024

Summary

Closes #87

Skipped tests for bfloat16 on GPUs with compute capability below Ampere architecture (sm_80).
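The gate is a simple compute-capability comparison; here is a minimal sketch of the logic (the capability tuple is passed in for illustration only — the actual tests query `torch.cuda.get_device_capability()` on the live device):

```python
def supports_bfloat16(capability):
    """bfloat16 kernels require Ampere (sm_80) or newer.

    `capability` is a (major, minor) tuple in the format returned by
    torch.cuda.get_device_capability(), e.g. (7, 5) for a T4.
    """
    return capability >= (8, 0)


print(supports_bfloat16((7, 5)))  # T4 (Turing, sm_75): bf16 tests are skipped
print(supports_bfloat16((8, 9)))  # L4 (Ada, sm_89): bf16 tests run
```

Tuple comparison does the right thing here: `(7, 5) >= (8, 0)` is false because the major version is compared first.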

Testing Done

  • Hardware Type: NVIDIA T4 (should skip most cases)
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence
⚡ main ~/Liger-Kernel make all
python -m pytest --disable-warnings test/ --ignore=test/convergence
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
flake8 .; flake8_status=$?; \
isort .; isort_status=$?; \
black .; black_status=$?; \
if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \
        exit 1; \
fi
=================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... =================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... Skipped 1 files
All done! ✨ 🍰 ✨
58 files left unchanged.
collected 163 items                                                                                                                                        

test/transformers/test_auto_model.py .                                                                                                               [  0%]
test/transformers/test_cross_entropy.py ssssssssssssssssssssssssssssssssssssssssssssssssssssssssss                                                   [ 36%]
collected 28 items                                                                                                                                         

test/convergence/test_mini_models.py .....s.....s....                                                                                    [ 43%]
test/transformers/test_geglu.py .s....ssss                                                                                                             [ 48%]
test/transformers/test_monkey_patch.py .....                                                                                                         [ 51%]
test/transformers/test_rms_norm.py ........ssssssss...............ssssssss........                                                                  [ 80%]
test/transformers/test_rope.py ......ssssss                                                                                                          [ 88%]
test/transformers/test_swiglu.py ....ssss.s....ssss                                                                                                    [ 98%]
test/transformers/test_trainer_integration.py .                                                                                                      [ 98%]
test/triton/test_triton_monkey_patch.py ..                                                                                                           [100%]

======================================================== 71 passed, 92 skipped in 136.69s (0:02:16) ========================================================
.s.s.s                                                                                                  [ 50%]
test/convergence/test_mini_models_no_logits.py .s.s.s.s.s.s.s                                                                                        [100%]

======================================================== 14 passed, 14 skipped in 353.27s (0:05:53) ========================================================
  • Hardware Type: NVIDIA L4 (should skip only a few cases)
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence
⚡ main ~/Liger-Kernel make all
python -m pytest --disable-warnings test/ --ignore=test/convergence
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
flake8 .; flake8_status=$?; \
isort .; isort_status=$?; \
black .; black_status=$?; \
if [ $flake8_status -ne 0 ] || [ $isort_status -ne 0 ] || [ $black_status -ne 0 ]; then \
        exit 1; \
fi
=================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... =================================================================== test session starts ====================================================================
platform linux -- Python 3.10.10, pytest-8.3.2, pluggy-1.5.0
rootdir: /teamspace/studios/this_studio/Liger-Kernel
plugins: anyio-4.4.0
collecting ... Skipped 1 files
All done! ✨ 🍰 ✨
58 files left unchanged.
collected 163 items                                                                                                                                        

test/transformers/test_auto_model.py .                                                                                                               [  0%]
collected 28 items                                                                                                                                         

test/convergence/test_mini_models.py ........................................................ss                                                   [ 36%]
test/transformers/test_fused_linear_cross_entropy.py ...............                                                                                    [ 43%]
test/transformers/test_geglu.py .........                                                                                                             [ 48%]
test/transformers/test_monkey_patch.py .....                                                                                                         [ 51%]
test/transformers/test_rms_norm.py .................................................                                                                  [ 80%]
test/transformers/test_rope.py ............                                                                                                          [ 88%]
test/transformers/test_swiglu.py ..................                                                                                                    [ 98%]
test/transformers/test_trainer_integration.py .                                                                                                      [ 98%]
test/triton/test_triton_monkey_patch.py ..                                                                                                           [100%]

======================================================== 161 passed, 2 skipped in 90.45s (0:01:30) =========================================================
.......                                                                                                  [ 50%]
test/convergence/test_mini_models_no_logits.py ..............                                                                                        [100%]

============================================================== 28 passed in 290.65s (0:04:50) ==============================================================

Additional Context

For your reference, here's a list of NVIDIA architecture names and the compute capabilities they support:

[Screenshot: table of NVIDIA GPU architectures and their compute capabilities]
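In lieu of the screenshot, the commonly cited mapping (summarized from NVIDIA's public CUDA documentation; only the mainstream recent architectures are listed) can be captured in code:

```python
# Compute capability (major, minor) per NVIDIA architecture generation.
ARCH_COMPUTE_CAPABILITY = {
    "Pascal": (6, 0),
    "Volta": (7, 0),
    "Turing": (7, 5),
    "Ampere": (8, 0),
    "Ada Lovelace": (8, 9),
    "Hopper": (9, 0),
}

# Architectures whose GPUs support bfloat16 (sm_80 and newer).
bf16_archs = [a for a, cc in ARCH_COMPUTE_CAPABILITY.items() if cc >= (8, 0)]
print(bf16_archs)  # ['Ampere', 'Ada Lovelace', 'Hopper']
```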

@ByronHsu (Collaborator) commented:
import pytest
import torch

from liger_kernel.transformers.cross_entropy import LigerCrossEntropyLoss


def supports_bfloat16():
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 0)  # Ampere and newer


@pytest.mark.parametrize(
    "B, T, V",
    [
        (2, 4096, 32000),  # llama2, mistral
        (1, 4096, 128256),  # llama3
        # weird shapes
        (3, 423, 32000),
    ],
)
@pytest.mark.parametrize(
    "scalar, dtype, atol, rtol",
    [
        pytest.param(
            0.1, torch.bfloat16, 1e-8, 5e-2,
            marks=pytest.mark.skipif(
                not supports_bfloat16(),
                reason="bfloat16 not supported on this GPU",
            ),
        ),
        pytest.param(
            1.0, torch.bfloat16, 1e-8, 5e-2,
            marks=pytest.mark.skipif(
                not supports_bfloat16(),
                reason="bfloat16 not supported on this GPU",
            ),
        ),
        pytest.param(
            10.0, torch.bfloat16, 1e-7, 5e-2,
            marks=pytest.mark.skipif(
                not supports_bfloat16(),
                reason="bfloat16 not supported on this GPU",
            ),
        ),
        (0.1, torch.float32, 1e-8, 1e-6),
        (1.0, torch.float32, 1e-8, 1e-6),
        (10.0, torch.float32, 1e-8, 1e-6),
    ],
)
def test_correctness(B, T, V, scalar, dtype, atol, rtol):
    if not torch.cuda.is_available():
        pytest.skip("CUDA not available")

    liger_ce = LigerCrossEntropyLoss()
    _test_correctness_once(liger_ce, B, T, V, scalar, dtype, atol, rtol)


# Leading underscore so pytest does not collect this helper as a test.
def _test_correctness_once(liger_ce, B, T, V, scalar, dtype, atol, rtol):
    # Compare the Liger kernel against PyTorch's reference cross entropy.
    logits = torch.randn(B * T, V, device="cuda", dtype=dtype) * scalar
    labels = torch.randint(0, V, (B * T,), device="cuda")

    loss = liger_ce(logits, labels)
    expected_loss = torch.nn.functional.cross_entropy(logits.float(), labels)
    torch.testing.assert_close(loss, expected_loss.to(dtype), atol=atol, rtol=rtol)
We can do this but simplify it a bit. (P.S. thanks, Claude)
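One way to simplify, in the spirit of the comment above, is to define the `skipif` marker once and reuse it across the bf16 parameter sets instead of repeating it on every `pytest.param`. A sketch (the capability is hard-coded here as an assumption so the idea is visible without a GPU; real code would call `torch.cuda.get_device_capability()`):

```python
import pytest

# Assumption for illustration: pretend the current GPU reports sm_75 (a T4).
GPU_CAPABILITY = (7, 5)


def supports_bfloat16():
    # Real code would query torch.cuda.get_device_capability() after
    # checking torch.cuda.is_available().
    return GPU_CAPABILITY >= (8, 0)


# Define the skip condition once instead of repeating it per parameter set.
requires_bf16 = pytest.mark.skipif(
    not supports_bfloat16(), reason="bfloat16 not supported on this GPU"
)

# Usage in a parametrize list:
#     pytest.param(0.1, torch.bfloat16, 1e-8, 5e-2, marks=requires_bf16)
```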

@austin362667 changed the title from "Add compute capability marker to skip tests run on old GPU arch" to "Skipped tests for bf16" on Aug 29, 2024
@austin362667 changed the title from "Skipped tests for bf16" to "Skip Tests for GPUs Not Supporting bf16" on Aug 29, 2024
@austin362667 (Contributor, Author) commented:

Sure, updated. Thanks for reviewing. That’s neat — much appreciated!

@ByronHsu (Collaborator) left a comment:

lgtm. cc @lancerts @helloworld1 to do a 2nd pass

@lancerts (Collaborator) left a comment:

lgtm

@lancerts lancerts merged commit cbc4f85 into linkedin:main Aug 29, 2024
2 checks passed
Development

Successfully merging this pull request may close these issues.

Add pytest filtering based on GPU types
3 participants