Behavior of `with_sharding_constraint` #23144

luyug · 2024-08-20T14:06:06Z

Hello,

I am training a Llama model on TPU v4. I found that applying `lax.with_sharding_constraint` to the final hidden states, just before the output projection, changes the final computed loss. A code snippet (inside `jit`) looks like this:

outputs = self.model(
        input_ids,
        attention_mask=attention_mask,
)

hidden_states = outputs[0]
hidden_states = lax.with_sharding_constraint(hidden_states, PS('data', None, None)) # this line
lm_logits = self.lm_head(hidden_states)

# ... after some data processing

loss = optax.softmax_cross_entropy_with_integer_labels(lm_logits, target_ids)
loss = loss * loss_mask / loss_mask.sum()
loss = loss.sum()

I expected that `with_sharding_constraint` would never change the computed value (up to numerical precision), but with it the loss goes from 1.x to 0.4x. Is my understanding of `with_sharding_constraint` wrong, or is there some edge case I don't know about?
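
To help narrow this down, here is a minimal standalone sketch of the check described above: compute the same masked loss with and without the constraint and compare. Everything in it is an assumption made for illustration and not the actual training code: the 1-D mesh with a single `data` axis, the random `hidden_states` and weight matrix `w` standing in for the model output and `lm_head`, the toy shapes, and the use of `jax.nn.log_softmax` in place of `optax.softmax_cross_entropy_with_integer_labels`. It also passes an explicit `NamedSharding` rather than a bare `PS(...)`, so it does not depend on a surrounding mesh context.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as PS

# Assumption: a 1-D mesh over all visible devices with one axis named 'data'.
mesh = Mesh(np.array(jax.devices()), axis_names=('data',))

def masked_loss(hidden_states, w, target_ids, loss_mask, constrain):
    if constrain:
        # Same partition spec as in the post, wrapped in an explicit NamedSharding.
        hidden_states = jax.lax.with_sharding_constraint(
            hidden_states, NamedSharding(mesh, PS('data', None, None)))
    logits = hidden_states @ w  # stand-in for self.lm_head
    # Stand-in for optax.softmax_cross_entropy_with_integer_labels.
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    nll = -jnp.take_along_axis(log_probs, target_ids[..., None], axis=-1)[..., 0]
    # Same masking and normalization as in the snippet above.
    return (nll * loss_mask / loss_mask.sum()).sum()

loss_fn = jax.jit(masked_loss, static_argnames='constrain')

# Toy shapes; the batch size is a multiple of the device count so it can be sharded.
batch, seq, hidden, vocab = 2 * len(jax.devices()), 8, 16, 32
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
hidden_states = jax.random.normal(k1, (batch, seq, hidden))
w = jax.random.normal(k2, (hidden, vocab))
target_ids = jax.random.randint(k3, (batch, seq), 0, vocab)
loss_mask = (jax.random.uniform(k4, (batch, seq)) > 0.3).astype(jnp.float32)
loss_mask = loss_mask.at[:, 0].set(1.0)  # guarantee a non-zero mask sum

loss_plain = loss_fn(hidden_states, w, target_ids, loss_mask, constrain=False)
loss_constrained = loss_fn(hidden_states, w, target_ids, loss_mask, constrain=True)
print(float(loss_plain), float(loss_constrained))
```

If the two printed values agree on the same hardware, the constraint by itself is behaving as expected in this reduced setting, and the difference in the full model more likely comes from how the constraint interacts with the rest of the program (for example the mesh, the other shardings, or the data processing between the logits and the loss).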
