Fix various test errors in the single GPU case #3031

Merged: githubnemo merged 1 commit into huggingface:main from githubnemo:ci/fix-gpu-errors on Feb 12, 2026

@githubnemo (Collaborator) commented Feb 9, 2026

This addresses some of the errors reported by running the tests
on a single GPU machine.

I will list the error messages and a short explanation of the fix.

> `FAILED tests/test_common_gpu.py::PeftGPUCommonTests::test_lora_gptq_quantization_from_pretrained_safetensors - NameError: name 'BACKEND' is not defined`

The test used GPTQModel without being marked as requiring it, which led to this error. This is fixed
by marking the test with `requires_gptqmodel`.
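
As an illustration of this kind of marker, here is a minimal sketch of a skip decorator built only from the standard library. It is not PEFT's actual `requires_gptqmodel`; the helper `requires_package` and the import name `gptqmodel` are assumptions made for the example.

```python
import importlib.util
import unittest


def requires_package(module_name):
    """Return a decorator that skips the test unless `module_name` is importable."""
    is_available = importlib.util.find_spec(module_name) is not None
    return unittest.skipUnless(is_available, f"test requires {module_name}")


# Hypothetical stand-in for the marker; "gptqmodel" as the import name is an assumption.
requires_gptqmodel = requires_package("gptqmodel")


@requires_gptqmodel
def test_lora_gptq_quantization():
    # Only runs when the optional dependency is present, so a missing
    # import can no longer surface as a NameError inside the test body.
    import gptqmodel  # noqa: F401
```

With a marker like this, a machine without the optional package reports the test as skipped instead of failed.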

> `FAILED tests/test_custom_models.py::TestPeftCustomModel::test_only_params_are_updated[Embedding + transformers Conv1D 1 trainable_tokens-EmbConv1D-TrainableTokensConfig-config_kwargs180] - AssertionError: assert not True`
> `FAILED tests/test_custom_models.py::TestPeftCustomModel::test_disable_adapters_with_merging[Embedding + transformers Conv1D 1 trainable_tokens-EmbConv1D-TrainableTokensConfig-config_kwargs180] - AssertionError: assert not True`

These tests fail because the gradient of the trainable tokens delta is sometimes 0, but only when training on CUDA;
CPU is fine.

This is a weird one and I'm not sure if this is a good fix or not. I encountered this error on two machines
(1xL40S and 4xA10G) and was not able to trace it to anything particular in the environment, i.e.
PEFT version (tested v0.17 to main), transformers version (tested 4.5{5,6,7}, 5.0), CUDA version (tested 12.6, 12.8)
or torch version (tested 2.7, 2.8, 2.9, 2.10). I also set `LD_LIBRARY_PATH=` before running pytest to exclude
cuDNN libraries that come preinstalled on the EC2 instance.

Removing the ReLU in `EmbConv1DModel` as well as boosting the Conv1D weights will fix the error. Replacing
the ReLU with `Threshold(0, 0)` has the same behavior. It depends on the seed, i.e. if the initialization of
`Conv1D` is favorable the bug will not trigger.

I tried to pin it on `index_copy`, but `index_copy` by itself is not the problem. Maybe we will just
have to live with this?
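
To illustrate why the ReLU and the initialization can matter for whether a gradient reaches the delta, the following dependency-free sketch computes the gradient that flows back through a ReLU for a single scalar input. It does not reproduce or explain the CUDA-only failure; the function name and the weight values are made up for illustration.

```python
def grad_through_relu(x, weights):
    """d/dx of sum(relu(w * x) for w in weights): only active units contribute."""
    return sum(w for w in weights if w * x > 0)


x = 1.0  # stands in for one embedding value touched by the trainable delta

unfavorable = [-0.02, -0.01, -0.03]       # every unit is inactive for this x
boosted = [w + 0.1 for w in unfavorable]  # shifted so that the units become active

grad_unfavorable = grad_through_relu(x, unfavorable)  # no active unit
grad_boosted = grad_through_relu(x, boosted)          # all units active
grad_without_relu = sum(unfavorable)                  # identity activation instead of ReLU
```

In this toy setting the gradient vanishes only for the unfavorable initialization with the ReLU in place, which matches the observation that the failure depends on the seed.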

> `FAILED tests/test_common_gpu.py::PeftGPUCommonTests::test_dora_ephemeral_gpu_offload_multigpu - RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_mm)`

This is caused by a bug introduced in huggingface#2960 - `ephemeral_gpu_offload` is not passed to the variant and therefore
never utilized.
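
The bug pattern can be sketched as follows. All class and method names here are hypothetical stand-ins, not PEFT's real classes; the sketch only shows how an option that is accepted by a layer but not forwarded to its variant silently falls back to the default.

```python
class HypotheticalVariant:
    """Stand-in for an adapter variant; not PEFT's real class."""

    @staticmethod
    def init(layer, adapter_name, **kwargs):
        # The variant can only honor the option if the layer forwards it.
        layer.offload_enabled[adapter_name] = kwargs.get("ephemeral_gpu_offload", False)


class HypotheticalLayer:
    def __init__(self):
        self.offload_enabled = {}

    def update_layer_buggy(self, adapter_name, ephemeral_gpu_offload=False):
        # Bug pattern: the argument is accepted but never handed to the variant.
        HypotheticalVariant.init(self, adapter_name)

    def update_layer_fixed(self, adapter_name, ephemeral_gpu_offload=False):
        # Fix pattern: forward the option explicitly.
        HypotheticalVariant.init(self, adapter_name, ephemeral_gpu_offload=ephemeral_gpu_offload)


layer = HypotheticalLayer()
layer.update_layer_buggy("buggy", ephemeral_gpu_offload=True)
layer.update_layer_fixed("fixed", ephemeral_gpu_offload=True)
```

After these calls, `layer.offload_enabled` records the option as disabled for the `"buggy"` adapter and enabled for the `"fixed"` one, even though both callers requested it.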

> `FAILED tests/test_gpu_examples.py::PeftBnbGPUExampleTests::test_seq2seq_lm_training_single_gpu - AttributeError: 'T5ForConditionalGeneration' object has no attribute 'hf_device_map'`

This is caused by transformers@315dcbe45cee1489a32fc228a80502b0a150936c, which disables accelerate hooks if the
device map only contains one device. I confirmed that specifying just one value moves the model to that device even
without accelerate hook invocation. I also tested having two devices (cpu + cuda:0), and in that case a device map is
present. Therefore, this only needs an added `hasattr` check to be compatible with transformers v5.
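
A minimal sketch of such a guard, assuming the caller can handle a missing device map. The helper `get_device_map` and the example device map contents are hypothetical; `SimpleNamespace` objects stand in for real models.

```python
from types import SimpleNamespace


def get_device_map(model):
    """Return the model's `hf_device_map` if it has one, else None."""
    if hasattr(model, "hf_device_map"):
        return model.hf_device_map
    return None


# Stand-ins for loaded models; the attribute is absent in the single-device case.
single_device_model = SimpleNamespace()
multi_device_model = SimpleNamespace(hf_device_map={"encoder": "cpu", "decoder": "cuda:0"})

single_map = get_device_map(single_device_model)  # None, no AttributeError
multi_map = get_device_map(multi_device_model)    # the dict set above
```

Code that previously read `model.hf_device_map` directly would go through a check like this and treat `None` as "the model lives on a single device".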

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member) left a comment

Thanks for working on all those errors and digging into the solution.

@githubnemo githubnemo merged commit e282205 into huggingface:main Feb 12, 2026
2 of 10 checks passed
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Feb 17, 2026
In huggingface#3031, a fix to one of the custom models (using embedding + Conv1D)
was introduced to resolve an error in the test_disable_adapters test
when run on GPU. However, this very fix resulted in the test failing in
some settings on CPU. This PR makes specific changes for CPU to avoid
these failures.

With this PR, the mentioned tests pass locally both on CPU and GPU.
Note, however, that other tests involving this custom model still fail
on GPU, irrespective of the changes in huggingface#3031. Fixing these tests would
probably require carefully choosing the right tolerances used in the
tests, with little to no benefit for actual PEFT use. Therefore, I
consider it low priority to investigate these tests.