Two different graphs have the same hash value #8353

Open
yitongh opened this issue Nov 4, 2024 · 1 comment

Comments

yitongh (Contributor) commented Nov 4, 2024

🐛 Bug

I ran into correctness issues while using Dynamo + OpenXLA. After investigating, I found that Torch-XLA currently computes the same graph hash for two different graphs. The two graphs have the same inputs, the same outputs, and the same sequence of operations; the only difference is which inputs feed the operators' operands. Torch-XLA treats them as the identical computation graph, which can lead to incorrect results.

To Reproduce

I simplified the computation graph I encountered and constructed the following test example:

import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

torch.manual_seed(1024)

xla_device = xm.xla_device()

def test1(t0, t1, t2):
    # Same op sequence and output shapes as test2; only the inputs feeding
    # the mul and the add are swapped.
    return t0.T, t1.T, t2.T, (t0 * t1) + t2

def test2(t0, t1, t2):
    return t0.T, t1.T, t2.T, (t0 * t2) + t1

t0 = torch.randn([12, 12], dtype=torch.float32)
t1 = torch.randn([12, 12], dtype=torch.float32)
t2 = torch.randn([12, 12], dtype=torch.float32)

cpu_out1 = test1(t0, t1, t2)
xla_out1 = test1(t0.to(xla_device), t1.to(xla_device), t2.to(xla_device))
xm.mark_step()

cpu_out2 = test2(t0, t1, t2)
xla_out2 = test2(t0.to(xla_device), t1.to(xla_device), t2.to(xla_device))
xm.mark_step()

print("test1 allclose xla vs cpu: ", torch.allclose(xla_out1[-1].cpu(), cpu_out1[-1]))
print("test2 allclose xla vs cpu: ", torch.allclose(xla_out2[-1].cpu(), cpu_out2[-1]))

print("Compile count: ", met.metric_data('CompileTime')[0])

The above test1 and test2 produce the same hash, so Torch-XLA only compiles the first graph; the second graph reuses the executable compiled for the first one without recompiling, which leads to incorrect results.
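For what it's worth, the collision can also be observed directly. The snippet below assumes the private binding torch_xla._XLAC._get_graph_hash that the Dynamo bridge uses is available in your build; since it is a private API, its name and behavior may differ between releases:

import torch
import torch_xla
import torch_xla.core.xla_model as xm

xla_device = xm.xla_device()
t0 = torch.randn([12, 12], dtype=torch.float32).to(xla_device)
t1 = torch.randn([12, 12], dtype=torch.float32).to(xla_device)
t2 = torch.randn([12, 12], dtype=torch.float32).to(xla_device)

out1 = (t0.T, t1.T, t2.T, (t0 * t1) + t2)
hash1 = torch_xla._XLAC._get_graph_hash(list(out1))
xm.mark_step()

out2 = (t0.T, t1.T, t2.T, (t0 * t2) + t1)
hash2 = torch_xla._XLAC._get_graph_hash(list(out2))
xm.mark_step()

# True if the issue reproduces: two different graphs get the same hash.
print("hash collision:", hash1 == hash2)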

Additional context

The root cause is that the current Torch-XLA hash calculation does not take into account where each operator's operands come from within the computation graph. I believe the fix is to adjust the hash calculation that runs after the PostOrder traversal of the graph is obtained. One possible approach is to walk the PostOrder sequence and fold each operator's operand indices (i.e., the positions of its operands within that sequence) into the hash; a sketch of this idea is below. I'm not sure whether there are better methods.
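To make this concrete, here is a small, self-contained sketch in plain Python (Node, post_order, and graph_hash are hypothetical stand-ins, not the actual torch_xla/torch::lazy code). Hashing only per-node information over the post-order makes the two repro graphs collide, while also folding each operand's post-order position into the hash tells them apart:

import hashlib

class Node:
    """Toy IR node: an op name plus a list of operand nodes."""
    def __init__(self, op, operands=()):
        self.op = op
        self.operands = list(operands)

def post_order(roots):
    """Return all nodes reachable from `roots`, in post-order."""
    visited, order = set(), []
    def visit(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        for operand in node.operands:
            visit(operand)
        order.append(node)
    for root in roots:
        visit(root)
    return order

def graph_hash(roots, include_operand_indices):
    order = post_order(roots)
    index = {id(node): i for i, node in enumerate(order)}
    h = hashlib.sha256()
    for node in order:
        h.update(node.op.encode())
        if include_operand_indices:
            # Proposed fix: also hash which post-order positions the operands
            # come from, so the graph topology is part of the hash.
            for operand in node.operands:
                h.update(index[id(operand)].to_bytes(4, "little"))
    return h.hexdigest()

# Inputs are indistinguishable device_data nodes, as in the lazy IR.
d0, d1, d2 = Node("device_data"), Node("device_data"), Node("device_data")

def build(mul_rhs, add_rhs):
    # Mirrors the repro: three transposes plus (d0 * mul_rhs) + add_rhs.
    return [Node("transpose", [d0]), Node("transpose", [d1]),
            Node("transpose", [d2]),
            Node("add", [Node("mul", [d0, mul_rhs]), add_rhs])]

g1 = build(d1, d2)  # (t0 * t1) + t2
g2 = build(d2, d1)  # (t0 * t2) + t1

# Hashing only per-node info over the post-order collides ...
print(graph_hash(g1, False) == graph_hash(g2, False))  # True
# ... while folding in operand positions distinguishes the two graphs.
print(graph_hash(g1, True) == graph_hash(g2, True))    # False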

JackCaoG (Collaborator) commented Nov 7, 2024

This seems like a bug we need to fix... let me take a look..
