DTensor memory regression with TP and long sequence length present in transformers>=4.54,<4.56 #1343

@terrykong

Description

This commit (#1115) broke this nightly test:

tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh

The issue appeared as an OOM, and I narrowed it down to the transformers version. I believe this regression is the same one identified in the issue below, where the KV cache was suddenly treated as trainable:

huggingface/transformers#39795

The memory pressure on either DTensor path (v1, v2) is exacerbated by higher TP (this test used TP=8) and long sequence lengths. In some settings I saw 4x more memory being used.
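
To make the symptom concrete, here is a hedged repro sketch under stated assumptions: a CUDA device, and `sshleifer/tiny-gpt2` as a tiny public stand-in for the Mistral-NeMo checkpoint (no FSDP2/TP here, which only amplifies the effect). It compares peak training-step memory with and without the KV cache:

```python
# Minimal sketch of the symptom, not the NeMo test itself. Assumptions:
# a CUDA device; "sshleifer/tiny-gpt2" standing in for
# Mistral-NeMo-Instruct-2407 (the failing test ran it under FSDP2 + TP8).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2").cuda().train()
ids = torch.randint(0, model.config.vocab_size, (1, 512), device="cuda")

for use_cache in (False, True):
    torch.cuda.reset_peak_memory_stats()
    out = model(input_ids=ids, labels=ids, use_cache=use_cache)
    out.loss.backward()
    model.zero_grad(set_to_none=True)
    # On affected versions (>=4.54,<4.56) the cached key/value tensors are
    # reportedly kept in the autograd graph, so the use_cache=True pass
    # should show a noticeably higher peak; longer sequences amplify it.
    print(f"use_cache={use_cache}: "
          f"{torch.cuda.max_memory_allocated() / 2**20:.1f} MiB peak")
```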

I noticed that manually upgrading to 4.56 brings memory back to normal, but Automodel is not ready to upgrade yet, so RL has to disable this test for now.
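
Purely as an illustration of the version gate (the real test is a shell script, and `test_dpo_long_seq` is a hypothetical name), a pytest-style skip over the affected range could look like:

```python
# Hypothetical pytest equivalent of disabling the shell-script test;
# skips only on the affected transformers range.
import pytest
import transformers
from packaging.version import Version

AFFECTED = Version("4.54") <= Version(transformers.__version__) < Version("4.56")

@pytest.mark.skipif(
    AFFECTED,
    reason="DTensor TP memory regression; see huggingface/transformers#39795",
)
def test_dpo_long_seq():
    ...
```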
