
int8 Woq raise Codegen Error with --compile_prefill #144

Open
yanbing-j opened this issue Mar 22, 2024 · 4 comments
yanbing-j commented Mar 22, 2024

Hi Maintainers @yanboliang @Chillee ,

I encounter a codegen error when using --compile_prefill with int8 WOQ. Although generation still runs, the error output could confuse users. Could you please fix this?

Thanks!

$ numactl -C 0-55 -m 0 python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --device cpu --compile_prefill
Using device=cpu
Loading model ...
Using int8 weight-only quantization!
Time to load model: 0.71 seconds
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0] Error in codegen for ComputedBuffer(name='buf4', layout=AliasedLayout('cpu', torch.float32, size=[1, s0, 32, 64, 1], stride=[4096*s0, 4096, 128, 2, 1]), data=Pointwise(
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   'cpu',
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   torch.float32,
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   def inner_fn(index):
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       _, i1, i2, i3, _ = index
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp0 = ops.load(buf3, 2 * i3 + 128 * i2 + 6144 * i1)
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp1 = ops.load(arg101_1, 2 * i3 + 128 * i2)
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp2 = tmp0 * tmp1
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp3 = ops.to_dtype(tmp2, torch.float32, src_dtype=torch.bfloat16)
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp4 = ops.load(arg487_1, i1)
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp5 = ops.load(arg98_1, 2 * i3 + 128 * (tmp4))
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp6 = ops.to_dtype(tmp5, torch.float32, src_dtype=torch.bfloat16)
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp7 = tmp3 * tmp6
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp8 = ops.load(buf3, 1 + 2 * i3 + 128 * i2 + 6144 * i1)
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp9 = ops.load(arg101_1, 1 + 2 * i3 + 128 * i2)
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp10 = tmp8 * tmp9
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp11 = ops.to_dtype(tmp10, torch.float32, src_dtype=torch.bfloat16)
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp12 = ops.load(arg487_1, i1)
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp13 = ops.load(arg98_1, 1 + 2 * i3 + 128 * (tmp12))
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp14 = ops.to_dtype(tmp13, torch.float32, src_dtype=torch.bfloat16)
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp15 = tmp11 * tmp14
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp16 = tmp7 - tmp15
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       return tmp16
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   ,
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   ranges=[1, s0, 32, 64, 1],
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   origin_node=None,
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   origins={cat}
C0321 20:03:49.438000 139932447565632 torch/_inductor/scheduler.py:781] [0/0] ))
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0] Error in codegen for ComputedBuffer(name='buf7', layout=AliasedLayout('cpu', torch.float32, size=[1, s0, 8, 64, 1], stride=[1024*s0, 1024, 128, 2, 1]), data=Pointwise(
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   'cpu',
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   torch.float32,
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   def inner_fn(index):
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       _, i1, i2, i3, _ = index
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp0 = ops.load(buf3, 4096 + 2 * i3 + 128 * i2 + 6144 * i1)
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp1 = ops.load(arg101_1, 4096 + 2 * i3 + 128 * i2)
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp2 = tmp0 * tmp1
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp3 = ops.to_dtype(tmp2, torch.float32, src_dtype=torch.bfloat16)
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp4 = ops.load(arg487_1, i1)
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp5 = ops.load(arg98_1, 2 * i3 + 128 * (tmp4))
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp6 = ops.to_dtype(tmp5, torch.float32, src_dtype=torch.bfloat16)
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp7 = tmp3 * tmp6
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp8 = ops.load(buf3, 4097 + 2 * i3 + 128 * i2 + 6144 * i1)
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp9 = ops.load(arg101_1, 4097 + 2 * i3 + 128 * i2)
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp10 = tmp8 * tmp9
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp11 = ops.to_dtype(tmp10, torch.float32, src_dtype=torch.bfloat16)
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp12 = ops.load(arg487_1, i1)
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp13 = ops.load(arg98_1, 1 + 2 * i3 + 128 * (tmp12))
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp14 = ops.to_dtype(tmp13, torch.float32, src_dtype=torch.bfloat16)
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp15 = tmp11 * tmp14
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       tmp16 = tmp7 - tmp15
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]       return tmp16
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   ,
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   ranges=[1, s0, 8, 64, 1],
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   origin_node=None,
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0]   origins={cat_1}
C0321 20:03:49.582000 139932447565632 torch/_inductor/scheduler.py:781] [0/0] ))
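For context on what "int8 weight-only quantization" refers to above: weights are stored as int8 with a per-output-channel scale and dequantized to float at matmul time, while activations stay in float. The following is a minimal numpy sketch of that idea, not gpt-fast's actual implementation; the function names are made up for illustration.

```python
import numpy as np

def quantize_int8_per_channel(w):
    # Symmetric per-output-channel quantization: one scale per row,
    # chosen so the largest weight maps to +/-127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def woq_linear(x, q, scales):
    # Weight-only quantization: dequantize the int8 weights on the fly
    # and run the matmul in float; activations are never quantized.
    return x @ (q.astype(np.float32) * scales).T

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal((4, 16)).astype(np.float32)

q, s = quantize_int8_per_channel(w)
y_ref = x @ w.T
y_q = woq_linear(x, q, s)
# The quantized result should be close to the float reference.
err = np.max(np.abs(y_q - y_ref))
```

The 4x memory saving comes from storing `q` (int8) instead of `w` (float32); the extra dequantize is what the compiled kernels above are fusing into the surrounding pointwise ops.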

rapplovin commented May 8, 2024

@yanbing-j, has this problem been resolved? We got the same error during AOT codegen. If it isn't fixed yet, does anyone else have any insight? Thanks!

C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] Error in codegen for ComputedBuffer(name='buf364', layout=FixedLayout('cpu', torch.float32, size=[s0, 284], stride=[284, 1]), data=Pointwise(
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] 'cpu',
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] torch.float32,
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] def inner_fn(index):
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] i0, i1 = index
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] tmp0 = ops.load(L__self___entity_to_embedding_dense_extractor_indices, i1)
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] tmp1 = ops.load(arg172_1, (tmp0) + 284 * i0)
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] tmp2 = ops.load(L__self___entity_to_embedding_dense_transformer_means, i1)
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] tmp3 = tmp1 - tmp2
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] tmp4 = ops.load(L__self___entity_to_embedding_dense_transformer_stddevs_reciprocal, i1)
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] tmp5 = tmp3 * tmp4
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] tmp6 = ops.constant(-10.0, torch.float32)
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] tmp7 = ops.maximum(tmp5, tmp6)
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] tmp8 = ops.constant(10.0, torch.float32)
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] tmp9 = ops.minimum(tmp7, tmp8)
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] return tmp9
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] ,
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] ranges=[s0, 284],
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] origin_node=clamp_max,
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] origins={clamp_max, sub_104, clamp_min, mul_18, index}
C0508 23:32:07.937000 133265142647680 torch/_inductor/scheduler.py:781] ))

C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] Error in codegen for ComputedBuffer(name='buf366', layout=FixedLayout('cpu', torch.float32, size=[s0, 20], stride=[20, 1]), data=Pointwise(
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] 'cpu',
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] torch.float32,
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] def inner_fn(index):
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] i0, i1 = index
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] tmp0 = ops.load(_tensor_constant0, i1)
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] tmp1 = ops.load(buf365, (tmp0) + 284 * i0)
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] tmp2 = ops.sigmoid(tmp1)
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] tmp3 = ops.load(buf364, (tmp0) + 284 * i0)
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] tmp4 = tmp2 * tmp3
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] return tmp4
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] ,
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] ranges=[s0, 20],
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] origin_node=view_52,
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] origins={index_1, mul_19, sigmoid}
C0508 23:32:08.088000 133265142647680 torch/_inductor/scheduler.py:781] ))

C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] Error in codegen for ComputedBuffer(name='buf456', layout=FixedLayout('cpu', torch.float32, size=[s0, 20], stride=[20, 1]), data=Pointwise(
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] 'cpu',
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] torch.float32,
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] def inner_fn(index):
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] i0, i1 = index
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] tmp0 = ops.load(_tensor_constant6, i1)
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] tmp1 = ops.load(buf365, (tmp0) + 284 * i0)
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] tmp2 = ops.sigmoid(tmp1)
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] tmp3 = ops.load(buf364, (tmp0) + 284 * i0)
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] tmp4 = tmp2 * tmp3
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] return tmp4
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] ,
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] ranges=[s0, 20],
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] origin_node=view_70,
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] origins={index_7, mul_19, sigmoid}
C0508 23:32:09.214000 133265142647680 torch/_inductor/scheduler.py:781] ))

C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] Error in codegen for ComputedBuffer(name='buf546', layout=FixedLayout('cpu', torch.float32, size=[s0, 20], stride=[20, 1]), data=Pointwise(
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] 'cpu',
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] torch.float32,
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] def inner_fn(index):
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] i0, i1 = index
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] tmp0 = ops.load(_tensor_constant12, i1)
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] tmp1 = ops.load(buf365, (tmp0) + 284 * i0)
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] tmp2 = ops.sigmoid(tmp1)
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] tmp3 = ops.load(buf364, (tmp0) + 284 * i0)
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] tmp4 = tmp2 * tmp3
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] return tmp4
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] ,
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] ranges=[s0, 20],
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] origin_node=view_88,
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] origins={index_13, mul_19, sigmoid}
C0508 23:32:09.904000 133265142647680 torch/_inductor/scheduler.py:781] ))

C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] Error in codegen for ComputedBuffer(name='buf675', layout=AliasedLayout('cpu', torch.float32, size=[s0, 2211], stride=[3235, 1]), data=Pointwise(
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] 'cpu',
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] torch.float32,
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] def inner_fn(index):
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] i0, i1 = index
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] tmp0 = ops.load(L__self___entity_to_embedding_dot_product_module_indices_0, i1)
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] tmp1 = ops.load(L__self___entity_to_embedding_dot_product_module_indices_1, i1)
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] tmp2 = ops.load(buf665, (tmp1) + 67 * (tmp0) + 4489 * i0)
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] tmp3 = ops.constant(1, torch.float32)
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] tmp4 = tmp2 * tmp3
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] return tmp4
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] ,
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] ranges=[s0, 2211],
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] origin_node=index_16,
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] origins={mul_81, index_16}
C0508 23:32:11.876000 133265142647680 torch/_inductor/scheduler.py:781] ))
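A common thread in these failing buffers is the indirect load, e.g. `ops.load(arg172_1, (tmp0) + 284 * i0)`, where the index itself comes from another tensor. A minimal numpy sketch of the pattern behind `buf364` (tensor-valued column indices feeding a normalize-and-clamp), with all names and values hypothetical:

```python
import numpy as np

# Hypothetical stand-in for arg172_1: shape [s0, 284] with s0 = 2.
x = np.arange(2 * 284, dtype=np.float32).reshape(2, 284)

# Stand-ins for the indices / means / reciprocal-stddev constants
# loaded inside inner_fn (tmp0, tmp2, tmp4 in the log).
indices = np.array([3, 0, 283])
means = np.zeros(3, dtype=np.float32)
stddev_recip = np.full(3, 0.01, dtype=np.float32)

# x[:, indices] is the indirect load; the rest mirrors the pointwise
# chain: subtract mean, scale, then clamp to [-10, 10].
out = np.clip((x[:, indices] - means) * stddev_recip, -10.0, 10.0)
```

This is only a shape-level reproduction of the IR, not a repro of the codegen failure itself; the failure is in how Inductor's CPU backend lowers the indirect index, not in the math.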

@yanbing-j (Author)

@rapplovin Not fixed. The maintainers have not replied yet.

@yanboliang (Contributor)

Hey, this is an Inductor cudagraph bug, which we are working on fixing. Meanwhile, there is a workaround to mitigate this.

@yanboliang (Contributor)

Oh, sorry, this is a CPU tensor, so this is probably a different issue. I'll have a look soon.
