Different results on different GPUs #28

Closed
jaeminoh opened this issue Mar 20, 2024 · 4 comments

Comments

@jaeminoh

Hi f0uriest,

I encountered an issue where interpolation results vary across different machines.

I used a 1d interpolator with the monotonic method and extrap=True.

Test machines: CPU, RTX Titan, RTX 4090.
Reference machine: CPU with double precision (x64).

The table below presents the relative $L^1$ error, abs(a - b).sum() / abs(b).sum():

| precision | CPU | RTX Titan | RTX 4090 |
| --- | --- | --- | --- |
| x32 | 5.87719e-08 | 5.89367e-08 | 1.78212e-04 |
| x64 | reference | 4.16375e-17 | 4.16375e-17 |
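For reference, the error metric in the table can be computed with a minimal NumPy sketch (`a` is the test-machine result, `b` the x64 CPU reference; the data below is just an illustration, not the interpax output):

```python
import numpy as np

def rel_l1_error(a, b):
    """Relative L1 error of result a against reference b:
    abs(a - b).sum() / abs(b).sum()."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.abs(a - b).sum() / np.abs(b).sum()

# Illustration: casting a float64 reference down to float32 and back
# gives an error around 1e-8, the same scale as the x32 CPU entry above.
b = np.linspace(0.0, 1.0, 1000) ** 2
a = b.astype(np.float32).astype(np.float64)
print(rel_l1_error(a, b))
```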

Since I used the same (xq, xp, yp), the errors in each row should agree.

However, as you can see, interpolation on RTX 4090 with single precision produced quite an inaccurate result.

Do you have any ideas on this?

@f0uriest
Owner

f0uriest commented Mar 22, 2024

Do you know if this is specific to interpax? It's likely it's a more general JAX issue (or really a CUDA/XLA issue) that things get compiled differently for different hardware, see google/jax#20371 and google/jax#10674 (comment)

Also, is the error uniformly bad for all points being interpolated, or is it localized in some way?

@jaeminoh
Author

Hi! Thank you for the reply.

I believe it's a general JAX issue, since I could not find any machine-specific implementation in interpax.
But I don't know where to start to fix it 😅

Here I attach two images showing the relative pointwise error abs(a - b) / abs(b).

4090_x32

This is for 4090 with single precision,

4090_x64

and this is for 4090 with double precision.
Numbers on the axes are just indices.

Along the left vertical edge of the figures, xq is monotonically increasing (from 0 to 1).
So I would say that the error is uniformly bad.
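For completeness, the pointwise metric used in the figures looks like this (NumPy sketch with made-up data; note it assumes the reference b is nonzero everywhere):

```python
import numpy as np

def pointwise_rel_error(a, b):
    """Relative pointwise error abs(a - b) / abs(b), as plotted in the figures."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.abs(a - b) / np.abs(b)

b = np.linspace(0.1, 1.0, 10)   # reference values, kept away from zero
a = b.astype(np.float32)        # stand-in for a single-precision result
err = pointwise_rel_error(a, b)
print(err.max())
```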

In fact, my query points xq were loaded from Excel files using pandas.read_excel, so that could have been a cause.
So I switched the query points to numpy.linspace(0, 1, 1000) and observed the same issue again.
On the Titan the relative $L^1$ error was $\approx 10^{-6}$, but on the 4090 it was $\approx 10^{-2}$, with CPU x64 arithmetic as the baseline.

@f0uriest
Owner

Can you share some code/data that reproduces the issue? I don't have access to either of those GPUs, but I can try some others and see if it's a more general issue.

@jaeminoh
Author

jaeminoh commented Apr 4, 2024

Hi, I think I found the cause.

I ran the test with NVIDIA_TF32_OVERRIDE=0 and got the correct result:

overpotential_rtx_4090

It might be related to JAX's default use of TF32; see:
google/jax#7010 (comment)
patrick-kidger/diffrax#213 (comment)
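For anyone hitting the same thing, a minimal sketch of the workaround (the echo just confirms the variable is set; in practice you would launch the JAX script from this shell):

```shell
# Disable TF32 so float32 matmuls on Ampere+ GPUs keep full float32 precision.
# CUDA/cuBLAS reads this variable when the backend initializes.
export NVIDIA_TF32_OVERRIDE=0
echo "NVIDIA_TF32_OVERRIDE=$NVIDIA_TF32_OVERRIDE"
```

I believe JAX also exposes a per-process equivalent, `jax.config.update("jax_default_matmul_precision", "float32")`, per the discussion in the linked issues.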

@jaeminoh jaeminoh closed this as completed Apr 4, 2024