Skip to content

Nail down the semantics for min/max and min_precise/ max_precise #133

@valadaptive

Description

@valadaptive

Different architectures have different semantics for floating-point minimum and maximum, particularly around signed zero and NaN. We currently provide the min/max operations (which are assumed to map to the "fastest" operations on the target platform), and precise_min/precise_max (which our tests currently say should behave like ARM's FPMinNum/FPMaxNum).

There are a variety of minimum/maximum semantics to choose from, not all of which are available on all platforms:

IEEE 754-2019

The IEEE 754-2019 standard defines two sets of min/max operations: minimum/maximum and minimumNumber/maximumNumber.

The first set, minimum/maximum, returns NaN if either operand is NaN. The latter, minimumNumber/maximumNumber, returns NaN only if both operands are NaN. If only one operand is NaN, it returns the other operand.

Both operations treat negative zero as less than zero.

minimumNumber/maximumNumber seem like the most "correct" operations, but they have the downside of not being implemented on all platforms.

x86

x86 provides the minps/maxps/minpd/maxpd instructions. As far as I can tell, they map to the following pseudocode:

fn minps(a: f32, b: f32) -> f32 {
  if a < b { a } else { b }
}

fn maxps(a: f32, b: f32) -> f32 {
  if a > b { a } else { b }
}

This implies the following semantics:

  • If either operand is NaN, it returns the second operand b (since the comparison will always be false).
  • If both operands are zero, it returns the second operand b regardless of either operand's sign (since negative and positive zero compare as equal).
  • Otherwise, it returns the smallest operand as expected.

WebAssembly

WebAssembly provides the fmin/fmax and fpmin/fpmax abstract operations.

The first set, fmin/fmax, implement IEEE 754-2019's minimum and maximum operations exactly. That is, they return NaN if either operand is NaN, and treat negative zero as less than zero.

The second set, fpmin/fpmax, are similar to Intel's instructions except with the operands flipped:

fn fpmin(a: f32, b: f32) -> f32 {
  if b < a { b } else { a }
}

fn fpmax(a: f32, b: f32) -> f32 {
  if b > a { b } else { a }
}

That is, if either operand is NaN or both operands are zero, they return the first operand. #23 updated our WebAssembly min_precise/max_precise implementations to flip the operand order and match Intel's behavior.

AArch64 w/ NEON

NEON actually implements both sets of IEEE 754-2019 operations. In the list of abstract operations, there are FPMin/FPMax and FPMinNum/FPMaxNum. The former implements IEEE 754's minimum/maximum and the latter implements minimumNumber/maximumNumber.

Rust

Rust exposes min/max operations that choose the non-NaN operand if possible, but are non-deterministic if both inputs are zero. For instance, the f32::min documentation states:

In particular, if the inputs compare equal (such as for the case of +0.0 and -0.0), either input may be returned non-deterministically.

In practice, on x86 this compiles down to a minps plus a cmpunordps to select the non-NaN operand. On AArch64, it compiles down to an fminnm (the FPMinNum abstract operation described above).

fearless_simd?

Now we need to decide which semantics to expose.

Ideally, we would just expose IEEE 754-2019's operations. That's what the current tests assume we do, at least. Unfortunately, neither x86 nor WebAssembly implement minimumNumber/maximumNumber semantics. In fact, x86 doesn't even implement minimum/maximum semantics!

As #23 notes, there is value in min/max operations that can "massage" a NaN into some other value. For example, if you're keeping a running tally of "lowest/highest value seen so far", it's helpful to be able to ignore NaNs.

We could implement the following operations:

  • "Logical" minimum/maximum. These behave like if a < b { a } else { b }/if a > b { a } else { b }. The asymmetry seems a bit ugly, but most min/max use cases have some inherent asymmetry anyway (the "running tally", for instance, or the Vello use case that prompted Fix WASM pmax and pmin such that vello_cpu tests pass #23). These are optimal on x86, but compile to a compare/select on AArch64.

  • "Precise" minimum/maximum. These are the IEEE 754 minimumNumber/maximumNumber operations. These are optimal on AArch64, but will require quite a bit of polyfilling on x86, especially with regards to signed zero. These will also be difficult in the fallback implementation, since Rust doesn't expose the operations we need. WebAssembly doesn't expose these either.

  • "Precise with NaN" minimum/maximum. These are the IEEE 754 minimum/maximum operations. These are optimal on AArch64, but will require quite a bit of polyfilling on x86, especially with regards to signed zero. These will also be difficult in the fallback implementation, since Rust doesn't expose the operations we need.

  • "Rust-y" minimum/maximum. These match Rust's semantics of returning whichever operand is non-NaN if the other is NaN, but makes no guarantees about signed zero. These are optimal on AArch64, but require an additional cmpunord and blendv on x86. These will be very easy to implement in the fallback implementation.

  • "Relaxed" minimum/maximum. These behave like the "logical" "precise", or "Rust-y" operations, depending on whichever is fastest on the target architecture. That means that if the second operand is non-NaN, the result is guaranteed to never be NaN. These are optimal on x86, requiring only a minps/maxps operation, but won't let us use fmin/fmax on AArch64. I'm not sure if that's a big deal; we'll probably need to benchmark fmin/fmax versus fminnm/fmaxnm.

  • "Fast" minimum/maximum. These behave like the "logical", "precise with NaN", or "Rust-y" operations, depending on whichever is fastest on the target architecture. These are optimal on both x86 and AArch64, but NaN values can easily pollute the result.

Implementing all of those would be tedious and result in quite an unwieldy API surface. We therefore need to decide which subset is most useful. It's unfortunate that each fully-deterministic operation (logical, precise, and precise with NaN) is slow on at least one target architecture.

I think that the best set of operations to expose is "logical" and "relaxed". The "logical" min/max is deterministic, and the "relaxed" min/max is fast(er). This doesn't give us any operations that map to fmin/fmax on AArch64, but I don't know if fminnm/fmaxnm are actually slower.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions