Fix various test errors in the single GPU case by githubnemo · Pull Request #3031 · huggingface/peft

githubnemo · 2026-02-09T17:29:24Z

This addresses some of the errors reported by running the tests on a single GPU machine.

I will list the error messages and a short explanation of the fix.

FAILED tests/test_common_gpu.py::PeftGPUCommonTests::test_lora_gptq_quantization_from_pretrained_safetensors - NameError: name 'BACKEND' is not defined

The test used GPTQModel without being marked as requiring it, which led to this error. The fix is to mark the test with requires_gptqmodel.
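To illustrate what such a marker does, here is a minimal sketch of a skip decorator that checks whether an optional package is importable. The helper name `requires_package` and the example test class are hypothetical and are not PEFT's actual `requires_gptqmodel` implementation; the sketch uses the standard library's `unittest` only so that it runs on its own.

```python
import importlib.util
import unittest


def requires_package(name: str):
    """Skip the decorated test unless `name` is importable (hypothetical helper)."""
    available = importlib.util.find_spec(name) is not None
    return unittest.skipUnless(available, f"test requires {name}")


class ExampleTests(unittest.TestCase):
    @requires_package("gptqmodel")
    def test_gptq_quantization(self):
        # Only reached when the optional package is installed, so the
        # import cannot fail the test on machines without it.
        import gptqmodel

        self.assertTrue(hasattr(gptqmodel, "__name__"))
```

With a guard like this, a machine without the optional dependency reports the test as skipped instead of failing with an import-related error.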

FAILED tests/test_custom_models.py::TestPeftCustomModel::test_only_params_are_updated[Embedding + transformers Conv1D 1 trainable_tokens-EmbConv1D-TrainableTokensConfig-config_kwargs180] - AssertionError: assert not True
FAILED tests/test_custom_models.py::TestPeftCustomModel::test_disable_adapters_with_merging[Embedding + transformers Conv1D 1 trainable_tokens-EmbConv1D-TrainableTokensConfig-config_kwargs180] - AssertionError: assert not True

These tests fail because the gradient of the trainable tokens delta is sometimes 0, but only when training on CUDA; CPU is fine.

This is a weird one and I'm not sure if this is a good fix or not. I encountered this error on two machines (1xL40S and 4xA10G) and I was not able to pinpoint this to something particular in the environment, i.e. PEFT version (tested v0.17 to main), transformers version (tested 4.5{5,6,7}, 5.0), CUDA version (tested 12.6, 12.8) or torch version (tested 2.7, 2.8, 2.9, 2.10). I also set LD_LIBRARY_PATH= before running pytest to exclude cuDNN libraries that come preinstalled on the EC2 instance.

Removing the ReLU in EmbConv1DModel as well as boosting the Conv1D weights will fix the error. Replacing the ReLU with Threshold(0, 0) has the same behavior. It depends on the seed, i.e. if the initialization of Conv1D is favorable the bug will not trigger.

I tried to pin it on index_copy, but index_copy by itself is not the problem. Maybe we will just have to live with this?
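For intuition only: a ReLU passes no gradient for inputs at or below zero, so a delta that sits behind it can receive an exactly-zero gradient when the weights are initialized unfavorably. The pure-Python sketch below shows that mechanism with a hypothetical one-weight model. It is not the EmbConv1DModel from the tests, and it does not explain why only CUDA is affected.

```python
def delta_gradient(embedding: float, delta: float, weight: float) -> float:
    """Return d(loss)/d(delta) for loss = relu(weight * (embedding + delta)).

    Hypothetical scalar stand-in for an embedding followed by a linear
    layer and a ReLU.
    """
    pre_activation = weight * (embedding + delta)
    if pre_activation <= 0.0:
        # The ReLU is inactive here, so no gradient reaches delta.
        return 0.0
    # The ReLU is active, so the chain rule gives the weight itself.
    return weight


# Same input, two different "initializations" of the weight.
print(delta_gradient(embedding=1.0, delta=0.0, weight=0.5))
print(delta_gradient(embedding=1.0, delta=0.0, weight=-0.5))
```

In this toy setting, the first call sees an active ReLU and a non-zero gradient, while the second sees an inactive ReLU and a zero gradient, which is the kind of seed dependence described above.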

FAILED tests/test_common_gpu.py::PeftGPUCommonTests::test_dora_ephemeral_gpu_offload_multigpu - RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_mm)

This is caused by a bug introduced in #2960: ephemeral_gpu_offload is not passed to the variant and is therefore never used.

FAILED tests/test_gpu_examples.py::PeftBnbGPUExampleTests::test_seq2seq_lm_training_single_gpu - AttributeError: 'T5ForConditionalGeneration' object has no attribute 'hf_device_map'

This is caused by transformers@315dcbe45cee1489a32fc228a80502b0a150936c which disables accelerate hooks if the device map only contains one device. I confirmed that just specifying one value moves the model to that device even without accelerate hook invocation. I also tested having two devices (cpu + cuda:0) and in that case a device map is present. Therefore this only needs an added hasattr check to be compatible with transformers v5.
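The kind of guard described above can be sketched as follows. The two stand-in classes and the helper `get_device_map` are hypothetical and only serve to make the example runnable; the only names taken from the description are the `hf_device_map` attribute and the `hasattr` check.

```python
class ModelWithMap:
    """Stand-in for a model dispatched across several devices."""

    hf_device_map = {"encoder": "cpu", "decoder": "cuda:0"}


class ModelWithoutMap:
    """Stand-in for a model placed on a single device, with no map attached."""


def get_device_map(model):
    """Return the model's device map if it has one, else None."""
    if hasattr(model, "hf_device_map"):
        return model.hf_device_map
    return None


print(get_device_map(ModelWithMap()))
print(get_device_map(ModelWithoutMap()))
```

Code that previously read `model.hf_device_map` unconditionally would go through a check like this first and treat a missing attribute as the single-device case.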


HuggingFaceDocBuilderDev · 2026-02-09T17:35:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BenjaminBossan

Thanks for working on all those errors and digging into the solution.

In huggingface#3031, a fix to one of the custom models (using embedding + Conv1D) was introduced to resolve an error in the test_disable_adapters test when run on GPU. However, this very fix resulted in the test failing in some settings on CPU. This PR makes specific changes for CPU to avoid these failures. With this PR, the mentioned tests pass locally both on CPU and GPU. Note, however, that other tests involving this custom model still fail on GPU, irrespective of the changes in huggingface#3031. Fixing these tests would probably require carefully choosing the right tolerances used in the tests, with little to no benefit for actual PEFT use. Therefore, I consider it low priority to investigate these tests.


githubnemo requested a review from BenjaminBossan February 9, 2026 17:29

githubnemo force-pushed the ci/fix-gpu-errors branch from 7b066d9 to 376e2ac Compare February 9, 2026 17:31

BenjaminBossan approved these changes Feb 12, 2026


githubnemo merged commit e282205 into huggingface:main Feb 12, 2026
2 of 10 checks passed

BenjaminBossan mentioned this pull request Feb 17, 2026

FIX Issue with disable adapter test #3045

Merged
