DTensor memory regression with TP and long sequence length present in transformers>=4.54,<4.56 #1343

@terrykong

Description

This commit (#1115) broke this nightly test:

tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh

The issue appeared as an OOM, and I narrowed it down to the transformers version. I believe this regression is the same one identified in the issue below, where the KV cache was suddenly treated as trainable:

huggingface/transformers#39795

The memory pressure on either DTensor path (v1, v2) is exacerbated by higher TP (this test used TP=8) and long sequence lengths. In some settings I saw 4x more memory being used.
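
To make the symptom concrete, here is a hedged repro sketch under stated assumptions: a CUDA device, and `sshleifer/tiny-gpt2` as a tiny public stand-in for the Mistral-NeMo checkpoint (no FSDP2/TP here, which only amplifies the effect). It compares peak training-step memory with and without the KV cache:

```python
# Minimal sketch of the symptom, not the NeMo test itself. Assumptions:
# a CUDA device; "sshleifer/tiny-gpt2" standing in for
# Mistral-NeMo-Instruct-2407 (the failing test ran it under FSDP2 + TP8).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2").cuda().train()
ids = torch.randint(0, model.config.vocab_size, (1, 512), device="cuda")

for use_cache in (False, True):
    torch.cuda.reset_peak_memory_stats()
    out = model(input_ids=ids, labels=ids, use_cache=use_cache)
    out.loss.backward()
    model.zero_grad(set_to_none=True)
    # On affected versions (>=4.54,<4.56) the cached key/value tensors are
    # reportedly kept in the autograd graph, so the use_cache=True pass
    # should show a noticeably higher peak; longer sequences amplify it.
    print(f"use_cache={use_cache}: "
          f"{torch.cuda.max_memory_allocated() / 2**20:.1f} MiB peak")
```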

I noticed that manually upgrading to 4.56 brings memory back to normal, but Automodel is not ready to upgrade yet, so RL has to disable this test for now.
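
Purely as an illustration of the version gate (the real test is a shell script, and `test_dpo_long_seq` is a hypothetical name), a pytest-style skip over the affected range could look like:

```python
# Hypothetical pytest equivalent of disabling the shell-script test;
# skips only on the affected transformers range.
import pytest
import transformers
from packaging.version import Version

AFFECTED = Version("4.54") <= Version(transformers.__version__) < Version("4.56")

@pytest.mark.skipif(
    AFFECTED,
    reason="DTensor TP memory regression; see huggingface/transformers#39795",
)
def test_dpo_long_seq():
    ...
```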
