Two different graphs have the same hash value #8353

Open
yitongh opened this issue Nov 4, 2024 · 1 comment

Comments

yitongh (Contributor) commented Nov 4, 2024

🐛 Bug

I ran into correctness issues while using Dynamo + OpenXLA. After investigating, I found that Torch-XLA currently computes the same graph hash for two different graphs. The two graphs have the same inputs, the same outputs, and the same sequence of operations; the only difference is which inputs feed the operators' operands. Torch-XLA treats them as the identical computation graph, which can lead to incorrect results.

To Reproduce

I simplified the computation graph I encountered and constructed the following test example:

import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

torch.manual_seed(1024)

xla_device = xm.xla_device()

def test1(t0, t1, t2):
    # Same op sequence and output shapes as test2; only the inputs feeding
    # the mul and the add are swapped.
    return t0.T, t1.T, t2.T, (t0 * t1) + t2

def test2(t0, t1, t2):
    return t0.T, t1.T, t2.T, (t0 * t2) + t1

t0 = torch.randn([12, 12], dtype=torch.float32)
t1 = torch.randn([12, 12], dtype=torch.float32)
t2 = torch.randn([12, 12], dtype=torch.float32)

cpu_out1 = test1(t0, t1, t2)
xla_out1 = test1(t0.to(xla_device), t1.to(xla_device), t2.to(xla_device))
xm.mark_step()

cpu_out2 = test2(t0, t1, t2)
xla_out2 = test2(t0.to(xla_device), t1.to(xla_device), t2.to(xla_device))
xm.mark_step()

print("test1 allclose xla vs cpu: ", torch.allclose(xla_out1[-1].cpu(), cpu_out1[-1]))
print("test2 allclose xla vs cpu: ", torch.allclose(xla_out2[-1].cpu(), cpu_out2[-1]))

print("Compile count: ", met.metric_data('CompileTime')[0])

The above test1 and test2 produce the same hash, so Torch-XLA only compiles the first graph; the second graph reuses the executable compiled for the first one without recompiling, which leads to incorrect results.
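For what it's worth, the collision can also be observed directly. The snippet below assumes the private binding torch_xla._XLAC._get_graph_hash that the Dynamo bridge uses is available in your build; since it is a private API, its name and behavior may differ between releases:

import torch
import torch_xla
import torch_xla.core.xla_model as xm

xla_device = xm.xla_device()
t0 = torch.randn([12, 12], dtype=torch.float32).to(xla_device)
t1 = torch.randn([12, 12], dtype=torch.float32).to(xla_device)
t2 = torch.randn([12, 12], dtype=torch.float32).to(xla_device)

out1 = (t0.T, t1.T, t2.T, (t0 * t1) + t2)
hash1 = torch_xla._XLAC._get_graph_hash(list(out1))
xm.mark_step()

out2 = (t0.T, t1.T, t2.T, (t0 * t2) + t1)
hash2 = torch_xla._XLAC._get_graph_hash(list(out2))
xm.mark_step()

# True if the issue reproduces: two different graphs get the same hash.
print("hash collision:", hash1 == hash2)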

Additional context

The root cause is that the current Torch-XLA hash calculation does not take into account where each operator's operands come from within the computation graph. I believe the fix is to adjust the hash calculation that runs after the PostOrder traversal of the graph is obtained. One possible approach is to walk the PostOrder sequence and fold each operator's operand indices (i.e., the positions of its operands within that sequence) into the hash; a sketch of this idea is below. I'm not sure whether there are better methods.
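To make this concrete, here is a small, self-contained sketch in plain Python (Node, post_order, and graph_hash are hypothetical stand-ins, not the actual torch_xla/torch::lazy code). Hashing only per-node information over the post-order makes the two repro graphs collide, while also folding each operand's post-order position into the hash tells them apart:

import hashlib

class Node:
    """Toy IR node: an op name plus a list of operand nodes."""
    def __init__(self, op, operands=()):
        self.op = op
        self.operands = list(operands)

def post_order(roots):
    """Return all nodes reachable from `roots`, in post-order."""
    visited, order = set(), []
    def visit(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        for operand in node.operands:
            visit(operand)
        order.append(node)
    for root in roots:
        visit(root)
    return order

def graph_hash(roots, include_operand_indices):
    order = post_order(roots)
    index = {id(node): i for i, node in enumerate(order)}
    h = hashlib.sha256()
    for node in order:
        h.update(node.op.encode())
        if include_operand_indices:
            # Proposed fix: also hash which post-order positions the operands
            # come from, so the graph topology is part of the hash.
            for operand in node.operands:
                h.update(index[id(operand)].to_bytes(4, "little"))
    return h.hexdigest()

# Inputs are indistinguishable device_data nodes, as in the lazy IR.
d0, d1, d2 = Node("device_data"), Node("device_data"), Node("device_data")

def build(mul_rhs, add_rhs):
    # Mirrors the repro: three transposes plus (d0 * mul_rhs) + add_rhs.
    return [Node("transpose", [d0]), Node("transpose", [d1]),
            Node("transpose", [d2]),
            Node("add", [Node("mul", [d0, mul_rhs]), add_rhs])]

g1 = build(d1, d2)  # (t0 * t1) + t2
g2 = build(d2, d1)  # (t0 * t2) + t1

# Hashing only per-node info over the post-order collides ...
print(graph_hash(g1, False) == graph_hash(g2, False))  # True
# ... while folding in operand positions distinguishes the two graphs.
print(graph_hash(g1, True) == graph_hash(g2, True))    # False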

JackCaoG (Collaborator) commented Nov 7, 2024

This seems like a bug we need to fix... let me take a look..
