Replies: 1 comment
Seems to be a bug:

```python
import math

import jax
import jax.numpy as jnp
from jax.experimental.pallas.ops.gpu.attention import mha, mha_reference

key = jax.random.key(0)
base = jax.random.normal(key, (3, 4092, 24, 32))
q, k, v = jnp.split(base, 3, axis=0)  # three arrays of shape (1, 4092, 24, 32)
for i in range(10):
    h0 = mha(q, k, v, None, sm_scale=1 / math.sqrt(q.shape[-1]))
    h1 = mha_reference(q, k, v, None, sm_scale=1 / math.sqrt(q.shape[-1]))
    print(jnp.mean(jnp.abs(h0 - h1)))
```

Each iteration prints the mean absolute difference between `mha` and `mha_reference`, and the values are clearly nonzero.
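To check whether the divisibility of `seq_len` by `block_q` really is the trigger, the same comparison can be rerun with `seq_len = 4096`, a multiple of 128 (assumed here to be the default `block_q`). This is a minimal sketch under that assumption; if divisibility is the cause, the two implementations should agree to within floating-point tolerance:

```python
import math

import jax
import jax.numpy as jnp
from jax.experimental.pallas.ops.gpu.attention import mha, mha_reference

key = jax.random.key(0)
# Same setup as above, but with seq_len = 4096 (divisible by the assumed
# default block_q = 128) instead of 4092.
base = jax.random.normal(key, (3, 4096, 24, 32))
q, k, v = jnp.split(base, 3, axis=0)

h0 = mha(q, k, v, None, sm_scale=1 / math.sqrt(q.shape[-1]))
h1 = mha_reference(q, k, v, None, sm_scale=1 / math.sqrt(q.shape[-1]))
# Expected to be near zero if the mismatch only occurs for
# sequence lengths that are not a multiple of block_q.
print(jnp.mean(jnp.abs(h0 - h1)))
```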
Incorrect results of `pallas.ops.gpu.attention.mha` when `seq_len` is not divisible by `block_q`. Is this expected behavior, or is it a bug? Example code
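As an interim workaround (a sketch, not taken from the thread; it assumes `mha` exposes `block_q`/`block_k` keyword arguments and that `mha_reference` accepts the same positional arguments), a hypothetical wrapper could route sequence lengths that are not a multiple of the block size to the reference implementation:

```python
import math

from jax.experimental.pallas.ops.gpu.attention import mha, mha_reference


def attention(q, k, v, segment_ids=None, block_q=128, block_k=128):
    """Use the Pallas kernel only when seq_len divides evenly into the blocks."""
    seq_len = q.shape[1]  # inputs are (batch, seq_len, num_heads, head_dim)
    sm_scale = 1 / math.sqrt(q.shape[-1])
    if seq_len % block_q or seq_len % block_k:
        # Slower, but avoids the incorrect results reported above.
        return mha_reference(q, k, v, segment_ids, sm_scale=sm_scale)
    return mha(q, k, v, segment_ids, sm_scale=sm_scale,
               block_q=block_q, block_k=block_k)
```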