
post-op shift operations #2600

Open
ArielPom opened this issue Feb 5, 2025 · 4 comments
ArielPom commented Feb 5, 2025

Summary

I did not find a post-op for shift left/right in oneDNN.
Why is that?

Current usage

The workarounds I am using now are:

  1. Apply the post-op outside of the oneDNN kernel, in my own code.

  2. Convert the shift value to a float multiplier and add a post-op with algorithm::binary_mul.
     For example, x >> 15 turns into x * 2^(-15),
     and x << 15 turns into x * 2^15.
     But this solution redirects the kernel to the reference implementation instead of an avx512/vnni/some other optimized implementation.
     Is this standard behaviour?

Thanks.

ArielPom added the enhancement label Feb 5, 2025
yehudaorel commented

Hi @ArielPom,

Currently, oneDNN does not provide a dedicated bitwise shift left/right API, but, as you mentioned, it can be achieved mathematically using the Binary primitive.

As mentioned in #1679, although multiplying by a power of 2 is mathematically equivalent to performing a bit-shift, the optimized kernels in oneDNN are not selected based solely on arithmetic equivalence. Instead, they are dispatched based on strict matching of the expected op patterns (data types, memory layouts, broadcasting, ...).

If you could provide a verbose log with ONEDNN_VERBOSE=dispatch and a snippet of your oneDNN code, that would be great!


ArielPom commented Feb 5, 2025

Thanks for the quick answer.

I cannot provide a log after compiling with ONEDNN_VERBOSE=dispatch,
but I guess the un-optimized kernel is chosen because of the post-op data shape I give it.

Can you clarify: let's say the output size is [4,8,16],
what are the possible post-op shapes that would lead to an optimized kernel?

yehudaorel commented

> Thanks for the quick answer.
>
> I cannot provide a log after compiling with ONEDNN_VERBOSE=dispatch, but I guess the un-optimized kernel is chosen because of the post-op data shape I give it.

No need to recompile for verbose mode; simply set the environment variable before running:

  • export ONEDNN_VERBOSE=dispatch
    ./test.o

    or

  • ONEDNN_VERBOSE=dispatch ./test.o

> Can you clarify: let's say the output size is [4,8,16], what are the possible post-op shapes that would lead to an optimized kernel?

In your case, since you are shifting (scaling) by a single factor, I believe you should be fine using per-tensor broadcast ({1,1,1}).
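
For illustration, a minimal sketch of the post-op route with a {1,1,1} second input (assuming the oneDNN v3.x C++ API; the main primitive the attribute attaches to is omitted here):

    // build a binary_mul post-op whose second input broadcasts per-tensor
    post_ops po;
    memory::desc multiplier_md({1, 1, 1}, memory::data_type::f32,
                               memory::format_tag::abc);
    po.append_binary(algorithm::binary_mul, multiplier_md);

    primitive_attr attr;
    attr.set_post_ops(po);
    // pass `attr` when creating the primitive descriptor of the main kernel,
    // then supply the multiplier memory at execution time under:
    //   DNNL_ARG_ATTR_MULTIPLE_POST_OP(0) | DNNL_ARG_SRC_1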

Also, perhaps try breaking the computation down and using a standalone Binary primitive instead of a post-op; something like this should work:

    // assumes in scope: dnnl::engine engine, dnnl::stream stream,
    // using tag = memory::format_tag; using dt = memory::data_type;
    memory::dim N = 4, C = 8, W = 16;
    memory::dims tensor_dims = {N, C, W};

    float shift_scaler = std::pow(2.0f, -15.0f); // x >> 15 becomes x * 2^(-15)
    memory::dims multiplier_dims = {1, 1, 1};    // per-tensor broadcast
    std::vector<float> multiplier_data = {shift_scaler};
    std::vector<float> src_data(N * C * W, 1.0f); // example input
    std::vector<float> dst_data(N * C * W);

    auto src_md = memory::desc(tensor_dims, dt::f32, tag::abc);
    auto dst_md = memory::desc(tensor_dims, dt::f32, tag::abc);
    auto multiplier_md = memory::desc(multiplier_dims, dt::f32, tag::abc);

    auto src_mem = memory(src_md, engine, src_data.data());
    auto dst_mem = memory(dst_md, engine, dst_data.data());
    auto multiplier_mem = memory(multiplier_md, engine, multiplier_data.data());

    primitive_attr attr;
    auto binary_pd = binary::primitive_desc(engine, algorithm::binary_mul,
                                            src_md, multiplier_md, dst_md, attr);
    auto binary_prim = binary(binary_pd);

    binary_prim.execute(stream, {{DNNL_ARG_SRC_0, src_mem},
                                 {DNNL_ARG_SRC_1, multiplier_mem},
                                 {DNNL_ARG_DST, dst_mem}});
    stream.wait();

Hope this helps!

yehudaorel self-assigned this Feb 5, 2025
mgouicem (Contributor) commented

@ArielPom If the shift factor is constant, common to the whole output tensor, and known at primitive creation time, you can try using the eltwise post-op with the eltwise_linear algorithm.

Elementwise post-ops generally reduce your chances of falling back to the reference implementation compared to binary post-ops.
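
For illustration, a minimal sketch of that suggestion (assuming the oneDNN v3.x C++ API, where eltwise_linear computes alpha * x + beta):

    // constant right shift by 15: alpha = 2^(-15), beta = 0
    post_ops po;
    po.append_eltwise(algorithm::eltwise_linear, std::pow(2.0f, -15.0f), 0.0f);

    primitive_attr attr;
    attr.set_post_ops(po);
    // pass `attr` to the primitive descriptor of the main kernel; unlike the
    // binary post-op, no extra memory argument is needed at execution time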
