v3.2

Released by @harrymao2022 on 23 Jun 18:31

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable Processor (formerly Sapphire Rapids).
    • Improved performance for future Intel Xeon Scalable Processor (code-named Sierra Forest). The functionality is disabled by default and can be enabled via CPU dispatcher control (see the sketch after this list).
    • Improved fp32 inner product performance for processors with Intel AVX-512 instructions support.
    • Improved bf16 and int8 matmul performance with runtime dimensions for processors with Intel AMX instructions support.
  • Intel Graphics Products:

    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc Graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Reduced creation time for matmul, inner product, and RNN primitives.
  • AArch64-based Processors:

    • Improved convolution performance with post-ops on processors with SVE support.
    • Improved fp32 and fp16 depthwise convolution performance with Arm Compute Library (ACL).
    • Improved fp32 deconvolution performance with ACL when the math mode is set to bf16 or any.
  • IBM Z Platform:

    • Improved int8 matmul, inner product, and RNN performance for s390 z15 systems.
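
A minimal sketch of opting in to the Sierra Forest optimizations through the CPU dispatcher control. The assumption here, based on the dispatcher control documentation, is that the relevant ISA setting is `AVX2_VNNI_2`; verify against the docs for your build:

```cpp
#include <oneapi/dnnl/dnnl.hpp>

int main() {
    // Allow dispatching to the AVX2_VNNI_2 code path, which is off by
    // default in this release. Must run before the first primitive is
    // created; the effective ISA is fixed for the process afterwards.
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx2_vnni_2);

    // Equivalent environment-variable form, with no code changes:
    //   ONEDNN_MAX_CPU_ISA=AVX2_VNNI_2 ./app
    return 0;
}
```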

Functionality

  • [experimental] Introduced Graph Compiler backend for Graph API. Graph Compiler improves performance of composite operations like multi-head attention (MHA), multi-layer perceptron (MLP), and convolution residual blocks for processors with Intel AVX-512 and Intel AMX instructions support.

  • Extended Graph API with boolean data type, select, and pow operations (see the Select sketch after this list).

  • Introduced support for binary and eltwise post-ops in softmax primitives (see the post-ops sketch after this list).

  • Introduced reference SYCL implementations of batch normalization, layer normalization, local response normalization (LRN), binary, softmax, eltwise, pooling, PReLU, shuffle, and resampling primitives. These implementations address functional gaps on NVIDIA and AMD GPUs where support is missing in native libraries.

  • Intel Graphics Products:

    • Introduced mixed precision support for binary primitives.
  • NVIDIA GPUs:

    • Introduced bfloat16 support for deconvolution and softmax primitives.
  • AMD GPUs:

    • Introduced support for inner product, convolution, deconvolution, batch normalization, and reorder primitives.
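
To illustrate the new Graph API operations, the snippet below builds a single-op graph around select using the new boolean data type. This is a sketch only: the enum spellings `op::kind::Select` and `data_type::boolean` are assumptions following the Graph API naming convention and the wording of this note, so verify them against dnnl_graph.hpp.

```cpp
#include <oneapi/dnnl/dnnl_graph.hpp>

using namespace dnnl::graph;

int main() {
    const logical_tensor::dims shape = {2, 3};
    const auto strided = logical_tensor::layout_type::strided;

    // Select picks src_0 where cond is true and src_1 elsewhere.
    logical_tensor cond {0, logical_tensor::data_type::boolean, shape, strided};
    logical_tensor src_0 {1, logical_tensor::data_type::f32, shape, strided};
    logical_tensor src_1 {2, logical_tensor::data_type::f32, shape, strided};
    logical_tensor dst {3, logical_tensor::data_type::f32, shape, strided};

    op select {0, op::kind::Select, {cond, src_0, src_1}, {dst}, "select"};

    graph g {dnnl::engine::kind::cpu};
    g.add_op(select);
    g.finalize();

    // Partitions are then compiled and executed as usual.
    auto partitions = g.get_partitions();
    return 0;
}
```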

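The new softmax post-ops plug into the standard attribute mechanism, so existing post-op code carries over. A minimal sketch with an illustrative linear eltwise followed by a binary multiply; the shapes and post-op choices are arbitrary:

```cpp
#include <oneapi/dnnl/dnnl.hpp>

using namespace dnnl;

int main() {
    engine eng {engine::kind::cpu, 0};

    memory::desc src_md {{16, 1000}, memory::data_type::f32,
            memory::format_tag::ab};
    memory::desc dst_md = src_md;

    // Post-ops applied to the softmax result: first y = 2 * x, then an
    // elementwise multiply with a second tensor.
    post_ops po;
    po.append_eltwise(algorithm::eltwise_linear, /*alpha=*/2.f, /*beta=*/0.f);
    po.append_binary(algorithm::binary_mul, dst_md);

    primitive_attr attr;
    attr.set_post_ops(po);

    auto pd = softmax_forward::primitive_desc(eng,
            prop_kind::forward_inference, algorithm::softmax_accurate,
            src_md, dst_md, /*axis=*/1, attr);
    auto softmax = softmax_forward(pd);

    // At execution time the binary source tensor is passed via
    // DNNL_ARG_ATTR_MULTIPLE_POST_OP(1) | DNNL_ARG_SRC_1.
    return 0;
}
```
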
Usability

  • Extended verbose mode with additional capabilities, including information about implementation dispatching decisions and reasons for primitive creation errors (see the sketch after this list).
  • Reduced stack consumption to less than 20 KB across implementations.
  • [experimental] Introduced profiling API for SYCL and OpenCL applications.
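
A sketch of turning on the extended verbose output from code rather than from the shell. The flag tokens ("dispatch" for dispatching decisions, "error" for creation failures) are assumptions taken from the wording of this note; consult the verbose-mode documentation for the exact values:

```cpp
#include <cstdlib>  // POSIX setenv

int main() {
    // Equivalent to launching with ONEDNN_VERBOSE=dispatch,error.
    // Must be set before the first oneDNN call that reads the
    // environment, i.e. before any primitive is created.
    setenv("ONEDNN_VERBOSE", "dispatch,error", /*overwrite=*/1);

    // ... create engines and primitives; dispatching decisions and
    // primitive creation errors are now traced on stdout.
    return 0;
}
```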

Validation

  • Introduced fast performance validation mode (--mode=F) in benchdnn. Testing speed is improved by initializing oneDNN objects in parallel and avoiding use of host memory when benchmarking GPU primitives.
  • Reduced benchdnn memory consumption in performance validation mode.
  • Introduced smoke test set for benchdnn. This test set provides basic validation for all primitives.

Known Limitations

  • fp32 matmul with bfloat16 binary post-op may produce incorrect results on processors with Intel AVX2 and Intel DL Boost support.
  • fp32 convolution forward propagation with strides has performance regression on processors with Intel AVX-512 instructions support.
  • Resampling primitive with binary post-op may produce incorrect results on CPUs.
  • Extensive use of the RNN primitive on Intel GPUs with default primitive cache settings may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
  • Convolution and deconvolution primitives on Intel Arc GPUs on Windows may cause memory corruption under heavy repeated use.
  • bfloat16 matmul primitive may crash on Intel Arc GPUs on Windows.
  • Pooling, resampling, PReLU, batch normalization, and layer normalization may sporadically produce incorrect results on Intel Arc GPUs on Windows.
  • oneDNN Graph partitions containing ConvTransposeBackwardWeights or int8 matmul operations may produce incorrect results on Intel Processor Graphics on Windows.
  • bfloat16 matmul primitive has a performance regression with shapes 14x128:128x200:14x200 and 200x128:128x200:200x200 on the Intel Data Center GPU Max Series.
  • oneDNN primitives may crash or produce incorrect results with tensors exceeding 4 GB in size on Intel GPUs.
  • Softmax primitive with an NHWC memory format may produce incorrect results on the Intel Data Center GPU Max Series.
  • Inner product weight gradient may produce incorrect results on Intel Processor Graphics on Windows.

Thanks to the Contributors

This release contains contributions from the project core team as well as Abdelrauf @quickwritereader, Alexey Vishnyakov @SweetVishnya, Annop Wongwathanarat @annop-w, Anthony Roberts @anthony-linaro, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Ilya Lavrenov @ilya-lavrenov, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, RambabuSwargam @RambabuSwargam, Sai Teja @saiteja13427, Taiju Tsuiki @tzik. We would also like to thank everyone who asked questions and reported issues.