Fix: Move RoPE tensors to right devices (#2862)
Conversation
OK, your solution is fine, but moving to CPU will make things slower. We might have to replicate cos / sin on each GPU and make a tuple indexer.
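A per-device cache along those lines could look roughly like this. This is a hypothetical sketch of the "replicate cos/sin on each GPU, index by tuple" idea; the class name, constructor arguments, and methods are illustrative and not Unsloth's actual API:

```python
import torch

class RoPECache:
    """Sketch: replicate the RoPE cos/sin tensors once per device and
    index them as a tuple, so layers on a later pipeline stage never
    trigger a cross-GPU copy at runtime."""

    def __init__(self, cos: torch.Tensor, sin: torch.Tensor, devices):
        self.devices = tuple(devices)
        # One (cos, sin) replica per device, materialized up front.
        # This trades a little extra memory for zero copies per step.
        self.per_device = tuple(
            (cos.to(d), sin.to(d)) for d in self.devices
        )

    def __getitem__(self, device: torch.device):
        # Tuple indexer: look up the replica living on `device`.
        return self.per_device[self.devices.index(device)]
```

This is the memory/latency trade-off discussed above: replication costs one copy of cos/sin per GPU but removes per-layer transfers from the forward pass.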
I.e.: or something
I'm only moving what's needed. And for cos, sin, we're moving them directly between GPUs; there are no explicit CPU calls on our side. The reason I didn't replicate across GPUs was that I wanted to be lean on memory.
I think it's best to use a tuple and not a dict |
unsloth/models/cohere.py
next_decoder_cache = []
for idx, decoder_layer in enumerate(self.model.layers):
    decoder_device = decoder_layer.self_attn.q_proj.weight.device
    hidden_states, out_weight, position_ids = move_to_device(
Wait, we shouldn't need to move these tensors, right?
We're mostly using PP (pipeline parallelism), which means some layers are on GPU 0 and some on GPU 1. The inputs and these tensors don't move to the second GPU automatically; it has to be done explicitly.
Also, we do have a check inside move_to_device to ensure that we aren't moving tensors unnecessarily.
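That guard could be sketched like this. This is an assumption about the shape of the helper based on the call site in the diff above (it returns several tensors at once); the real move_to_device in the PR may differ:

```python
import torch

def move_to_device(target_device: torch.device, *tensors):
    """Sketch of the guard described above: only copy a tensor when it
    is not already on the target device, so repeated calls for layers
    that live on the same GPU are effectively free."""
    moved = []
    for t in tensors:
        if t is not None and t.device != target_device:
            # Direct device-to-device copy; no explicit CPU hop.
            t = t.to(target_device)
        moved.append(t)
    return tuple(moved)
```

With this check, calling the helper on every decoder layer is cheap: only the layers sitting on a different pipeline stage actually pay for a transfer.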
unsloth/models/llama.py
This means we can pass in a row of Q, but we need to
remember K and V, which are called the KV cache.
"""
if position_ids is not None:
This will be handled separately in another PR.
Force-pushed from 606c451 to 324b392.
Closing this, as this is handled in #2919.
Multi-GPU inference fails because some tensors are explicitly placed on GPU 0, and RoPE values sometimes end up on the wrong GPU. This PR fixes that.