This repository was archived by the owner on Dec 22, 2021. It is now read-only.
Rounding Average instructions #126
Merged
Introduction
Rounding Average of two integer inputs, defined as avg(a, b) := (a + b + 1) >> 1, is a common operation in fixed-point numerical algorithms such as video and audio codecs and image filtering. Implementing Rounding Average directly in SIMD instruction sets following the formula (a + b + 1) >> 1 is tricky: the sum a + b + 1 can overflow the datatype of the inputs, even though the final result always fits into that datatype. To avoid the expensive work-around of computing a + b + 1 in higher precision (e.g. extending inputs from 8-bit elements to 16-bit elements for the computation), all common SIMD instruction sets provide some form of Rounding Average instruction.
This PR introduces two new WebAssembly instructions for Rounding Average operations, i8x16.avgr_u and i16x8.avgr_u, which operate on vectors of unsigned 8-bit and unsigned 16-bit integers, respectively. These instructions match the form of the Rounding Average operation that is universally supported across x86, ARM, and POWER.
[October 31 update] Applications
Below are examples of optimized libraries using close equivalents of the proposed i8x16.avgr_u and i16x8.avgr_u instructions:
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
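As a reference point when reading the lowerings, the lane-wise semantics given in the Introduction can be sketched in scalar C. This is only an illustrative sketch, not text from the proposal: the helper names (avgr_u8, avgr_u16, i8x16_avgr_u_ref, i16x8_avgr_u_ref, avgr_ref_self_check) are hypothetical, and the sum is deliberately formed in a wider type, which is exactly the work-around the native instructions make unnecessary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for one unsigned 8-bit lane:
   (a + b + 1) >> 1, with the sum held in a wider type so it cannot overflow. */
static inline uint8_t avgr_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b + 1u;   /* at most 511 */
    return (uint8_t)(sum >> 1);                      /* always fits in 8 bits */
}

/* Same definition for one unsigned 16-bit lane. */
static inline uint16_t avgr_u16(uint16_t a, uint16_t b) {
    uint32_t sum = (uint32_t)a + (uint32_t)b + 1u;   /* at most 131071 */
    return (uint16_t)(sum >> 1);                     /* always fits in 16 bits */
}

/* Lane-wise reference for a 16 x 8-bit vector (assumed layout: plain arrays). */
static inline void i8x16_avgr_u_ref(const uint8_t a[16], const uint8_t b[16],
                                    uint8_t y[16]) {
    for (size_t i = 0; i < 16; i++) {
        y[i] = avgr_u8(a[i], b[i]);
    }
}

/* Lane-wise reference for an 8 x 16-bit vector. */
static inline void i16x8_avgr_u_ref(const uint16_t a[8], const uint16_t b[8],
                                    uint16_t y[8]) {
    for (size_t i = 0; i < 8; i++) {
        y[i] = avgr_u16(a[i], b[i]);
    }
}

/* Small usage example: the extreme inputs still produce in-range results. */
static inline void avgr_ref_self_check(void) {
    assert(avgr_u8(255, 255) == 255);         /* (255 + 255 + 1) >> 1 */
    assert(avgr_u8(1, 2) == 2);               /* ties round up */
    assert(avgr_u16(65535, 65534) == 65535);  /* (65535 + 65534 + 1) >> 1 */
}
```

A native instruction from the lists in this section is expected to produce the same lane values as these reference loops without the widening step.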
x86/x86-64 processors with AVX instruction set
y = i8x16.avgr_u(a, b) is lowered to VPAVGB xmm_y, xmm_a, xmm_b
y = i16x8.avgr_u(a, b) is lowered to VPAVGW xmm_y, xmm_a, xmm_b
x86/x86-64 processors with SSE2 instruction set
a = i8x16.avgr_u(a, b) is lowered to PAVGB xmm_a, xmm_b
y = i8x16.avgr_u(a, b) is lowered to MOVDQA xmm_y, xmm_a + PAVGB xmm_y, xmm_b
a = i16x8.avgr_u(a, b) is lowered to PAVGW xmm_a, xmm_b
y = i16x8.avgr_u(a, b) is lowered to MOVDQA xmm_y, xmm_a + PAVGW xmm_y, xmm_b
ARM64 processors
y = i8x16.avgr_u(a, b) is lowered to URHADD Vy.16B, Va.16B, Vb.16B
y = i16x8.avgr_u(a, b) is lowered to URHADD Vy.8H, Va.8H, Vb.8H
ARMv7 processors with NEON instruction set
y = i8x16.avgr_u(a, b) is lowered to VRHADD.U8 Qy, Qa, Qb
y = i16x8.avgr_u(a, b) is lowered to VRHADD.U16 Qy, Qa, Qb
POWER processors with VMX (Altivec) instruction set
y = i8x16.avgr_u(a, b) is lowered to VAVGUB VRy, VRa, VRb
y = i16x8.avgr_u(a, b) is lowered to VAVGUH VRy, VRa, VRb
MIPS processors with MSA instruction set
y = i8x16.avgr_u(a, b) is lowered to AVER_U.B Wy, Wa, Wb
y = i16x8.avgr_u(a, b) is lowered to AVER_U.H Wy, Wa, Wb
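Every mapping listed here relies on a native rounding-average instruction. Purely as an illustration, and not something this PR proposes, a code generator for a target without such an instruction could avoid the widening work-around by using the identity (a + b + 1) >> 1 == (a | b) - ((a ^ b) >> 1), whose intermediate values never exceed the lane width. The sketch below checks that identity exhaustively for 8-bit lanes; the function names (avgr_u8_narrow, avgr_u8_wide, avgr_u8_identity_holds) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding average using only lane-width intermediates:
   a + b == 2 * (a | b) - (a ^ b), so rounding the half-sum up gives
   (a | b) - ((a ^ b) >> 1). */
static inline uint8_t avgr_u8_narrow(uint8_t a, uint8_t b) {
    return (uint8_t)((a | b) - ((a ^ b) >> 1));
}

/* Widened reference following the definition (a + b + 1) >> 1. */
static inline uint8_t avgr_u8_wide(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b + 1u;
    return (uint8_t)(sum >> 1);
}

/* Returns 1 if both formulations agree on all 65536 input pairs, else 0. */
static inline int avgr_u8_identity_holds(void) {
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            if (avgr_u8_narrow((uint8_t)a, (uint8_t)b) !=
                avgr_u8_wide((uint8_t)a, (uint8_t)b)) {
                return 0;
            }
        }
    }
    return 1;
}
```

The same identity carries over to 16-bit lanes; whether it is cheaper than widening on a given target is an assumption that would need to be measured.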