Fix various test errors in the single GPU case#3031
Merged
This addresses some of the errors reported by running the tests
on a single GPU machine.
I will list the error messages and a short explanation of the fix.
> `FAILED tests/test_common_gpu.py::PeftGPUCommonTests::test_lora_gptq_quantization_from_pretrained_safetensors - NameError: name 'BACKEND' is not defined`
The test was using GPTQModel without being marked as requiring it, which led to this error. This is fixed
by marking the test with `requires_gptqmodel`.
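As a rough illustration of what such a marker does, here is a minimal sketch of a skip decorator built on the standard library. The helper names `is_package_available` and `requires_package` are hypothetical and only for illustration; the actual `requires_gptqmodel` helper in PEFT may be implemented differently.

```python
import importlib.util
import unittest


def is_package_available(name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def requires_package(name: str):
    """Hypothetical decorator: skip the decorated test unless `name` is installed."""
    return unittest.skipUnless(is_package_available(name), f"test requires {name}")


class ExampleTests(unittest.TestCase):
    @requires_package("gptqmodel")
    def test_needs_gptqmodel(self):
        # Only reached when the package is installed; otherwise the test is
        # reported as skipped instead of failing with an unrelated error.
        import gptqmodel

        self.assertIsNotNone(gptqmodel)
```

With a marker like this, a machine without the optional dependency reports the test as skipped rather than failing partway through.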
> `FAILED tests/test_custom_models.py::TestPeftCustomModel::test_only_params_are_updated[Embedding + transformers Conv1D 1 trainable_tokens-EmbConv1D-TrainableTokensConfig-config_kwargs180] - AssertionError: assert not True`
> `FAILED tests/test_custom_models.py::TestPeftCustomModel::test_disable_adapters_with_merging[Embedding + transformers Conv1D 1 trainable_tokens-EmbConv1D-TrainableTokensConfig-config_kwargs180] - AssertionError: assert not True`
These tests fail because the gradients of the trainable tokens delta are sometimes 0, but only when training on CUDA;
CPU is fine.
This is a weird one and I'm not sure whether this is a good fix. I encountered this error on two machines
(1xL40S and 4xA10G) and could not trace it to anything particular in the environment, i.e.
PEFT version (tested v0.17 to main), transformers version (tested 4.5{5,6,7}, 5.0), CUDA version (tested 12.6, 12.8)
or torch version (tested 2.7, 2.8, 2.9, 2.10). I also set `LD_LIBRARY_PATH=` before running pytest to exclude
cuDNN libraries that come preinstalled on the EC2 instance.
Removing the ReLU in `EmbConv1DModel` as well as boosting the Conv1D weights will fix the error. Replacing
the ReLU with `Threshold(0, 0)` has the same behavior. It depends on the seed, i.e. if the initialization of
`Conv1D` is favorable the bug will not trigger.
I tried pinning it on `index_copy`, but `index_copy` by itself is not the problem. Maybe we will just
have to live with this?
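To make the ReLU and seed observations concrete, here is a toy scalar sketch of one mechanism that is consistent with them: when the pre-activation is not positive, the ReLU passes no gradient back to its input. This is an assumption for illustration only. It is not a confirmed diagnosis, it does not explain why the failure shows up only on CUDA, and the function and parameter names are hypothetical.

```python
def grad_wrt_delta(delta: float, weight: float, bias: float = 0.0) -> float:
    """Gradient of y = relu(weight * delta + bias) with respect to delta.

    Chain rule: dy/d(delta) = relu'(z) * weight, with z = weight * delta + bias.
    relu'(z) is 1 for z > 0 and 0 otherwise, so an inactive unit yields 0.
    """
    z = weight * delta + bias
    return weight if z > 0.0 else 0.0


# "Favorable" initialization: the unit is active, so the gradient is non-zero.
favorable = grad_wrt_delta(delta=0.5, weight=0.8)

# "Unfavorable" initialization: the unit is inactive, so the gradient is zero.
unfavorable = grad_wrt_delta(delta=0.5, weight=-0.8)
```

In this toy setting, dropping the ReLU would make the gradient equal to `weight` regardless of sign, which matches the observation that removing the ReLU avoids the zero gradients.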
> `FAILED tests/test_common_gpu.py::PeftGPUCommonTests::test_dora_ephemeral_gpu_offload_multigpu - RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_mm)`
This is caused by a bug introduced in huggingface#2960 - `ephemeral_gpu_offload` is not passed to the variant and therefore
never utilized.
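The general shape of this kind of bug can be sketched as follows. All class names here are hypothetical stand-ins, not PEFT's actual classes; only the flag name `ephemeral_gpu_offload` comes from the description above.

```python
class Variant:
    """Hypothetical stand-in for a layer variant that can use the offload flag."""

    def __init__(self, ephemeral_gpu_offload: bool = False):
        self.ephemeral_gpu_offload = ephemeral_gpu_offload


class BuggyLayer:
    """Accepts the flag but builds the variant without it, so it is never used."""

    def __init__(self, ephemeral_gpu_offload: bool = False):
        self.ephemeral_gpu_offload = ephemeral_gpu_offload
        self.variant = Variant()  # flag silently dropped here


class FixedLayer:
    """Forwards the flag so the variant actually sees the requested setting."""

    def __init__(self, ephemeral_gpu_offload: bool = False):
        self.ephemeral_gpu_offload = ephemeral_gpu_offload
        self.variant = Variant(ephemeral_gpu_offload=ephemeral_gpu_offload)
```

In the buggy shape, the variant always runs with its default, which in the real code would leave some tensors on the CPU while others are on the GPU.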
> `FAILED tests/test_gpu_examples.py::PeftBnbGPUExampleTests::test_seq2seq_lm_training_single_gpu - AttributeError: 'T5ForConditionalGeneration' object has no attribute 'hf_device_map'`
This is caused by transformers@315dcbe45cee1489a32fc228a80502b0a150936c which disables accelerate hooks if the
device map only contains one device. I confirmed that just specifying one value moves the model to that device even
without accelerate hook invocation. I also tested having two devices (cpu + cuda:0) and in that case a device map is
present. Therefore this only needs an added `hasattr` check to be compatible with transformers v5.
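A minimal sketch of such a guard, using plain stand-in objects instead of real models. The helper name `get_device_map` is hypothetical, and what the test does with the result is not shown in this description.

```python
from types import SimpleNamespace


def get_device_map(model):
    """Return the model's `hf_device_map` if it has one, else None."""
    if hasattr(model, "hf_device_map"):
        return model.hf_device_map
    return None


# Stand-in for a model loaded onto a single device: no `hf_device_map` attribute.
single_device_model = SimpleNamespace()

# Stand-in for a model spread over two devices: the attribute is present.
# The keys are made up for illustration.
two_device_model = SimpleNamespace(hf_device_map={"encoder": "cpu", "decoder": "cuda:0"})

single_map = get_device_map(single_device_model)
two_map = get_device_map(two_device_model)
```

Any assertion on the device map can then be made conditional on the result not being `None`.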
BenjaminBossan
approved these changes
Feb 12, 2026
Member
BenjaminBossan
left a comment
Thanks for working on all those errors and digging into the solution.
BenjaminBossan
added a commit
to BenjaminBossan/peft
that referenced
this pull request
Feb 17, 2026
In huggingface#3031, a fix to one of the custom models (using embedding + Conv1D) was introduced to resolve an error in the test_disable_adapters test when run on GPU. However, this very fix resulted in the test failing in some settings on CPU. This PR makes specific changes for CPU to avoid these failures. With this PR, the mentioned tests pass locally both on CPU and GPU. Note, however, that other tests involving this custom model still fail on GPU, irrespective of the changes in huggingface#3031. Fixing these tests would probably require carefully choosing the right tolerances used in the tests, with little to no benefit for actual PEFT use. Therefore, I consider it low priority to investigate these tests.