add eager mode check for NaN and Inf #1015

Closed
wants to merge 1 commit into from


frank-wei
Contributor

Summary:
This diff improves the debugging tools.
While debugging the IG_CTR MC proposal, we noticed that some new snapshots produced NaN in their results, and we want to find the root cause.
Running with `--run-accuracy-check` runs the generate merge + load merge through pybind, but it does not check the eager-mode run. This diff adds that check. The random inputs are created the same way as in load merge. Results are attached in P1494263214.
```
CUDA_VISIBLE_DEVICES=5 TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=/data/local/models/581303767/85/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --run-accuracy-check  --debug_operator_range="1397,1397" --generate_sample_inputs=False  --min_acc_module_size=0  --disable-multiple-batch-run  2>&1 | tee aot.log
```
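
To illustrate the idea behind the eager-mode check, here is a minimal sketch in plain Python. It is not the code in this diff: `make_random_inputs`, `count_non_finite`, and `accuracy_check` are hypothetical names, and flat lists of floats stand in for tensors. With PyTorch tensors, `torch.isnan` and `torch.isinf` would play the role of `math.isnan` and `math.isinf`.

```python
import math
import random


def make_random_inputs(n, seed=0):
    """Build a reproducible list of random floats (stand-in for random input tensors)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def count_non_finite(values):
    """Return (nan_count, inf_count) for a flat list of floats."""
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    return nan_count, inf_count


def accuracy_check(eager_fn, lowered_fn, inputs, rel_tol=1e-5, abs_tol=1e-8):
    """Run both paths on the same inputs, count NaN/Inf in each, and compare outputs."""
    eager_out = eager_fn(inputs)
    lowered_out = lowered_fn(inputs)
    report = {
        "eager_non_finite": count_non_finite(eager_out),
        "lowered_non_finite": count_non_finite(lowered_out),
    }
    # NaN never compares close to anything, so a NaN on either side fails the match.
    report["match"] = len(eager_out) == len(lowered_out) and all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(eager_out, lowered_out)
    )
    return report
```

Checking the eager output separately shows whether a NaN already exists before lowering or is introduced by the lowered path.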

To enable per-layer printing, add `--dispatch-print`; it prints each layer's output and checks whether it contains NaN or Inf.
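
The per-layer check can be pictured with the following plain-Python sketch. It is an assumption-laden illustration, not the tool's implementation: `find_non_finite` and `run_with_layer_print` are hypothetical helpers, each layer is modeled as a `(name, function)` pair, and nested lists of floats stand in for tensors.

```python
import math


def find_non_finite(value):
    """Return 'NaN', 'Inf', or None for a float or a nested list/tuple of floats."""
    if isinstance(value, (list, tuple)):
        for item in value:
            kind = find_non_finite(item)
            if kind is not None:
                return kind
        return None
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Inf"
    return None


def run_with_layer_print(layers, x, log=print):
    """Run (name, fn) layers in order and log each output with a NaN/Inf status.

    Returns the final output and the name of the first layer whose output
    contained NaN or Inf (None if every output was finite).
    """
    first_bad = None
    for name, fn in layers:
        x = fn(x)
        kind = find_non_finite(x)
        status = "ok" if kind is None else f"contains {kind}"
        log(f"{name}: {x} [{status}]")
        if kind is not None and first_bad is None:
            first_bad = name
    return x, first_bad
```

Reporting the first offending layer matters because a non-finite value tends to propagate: every later layer may also print NaN, but only the first one points at the root cause.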

Reviewed By: hl475, chenyang78

Differential Revision: D60150435
@facebook-github-bot added the CLA Signed label Jul 25, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D60150435

@facebook-github-bot
Contributor

This pull request has been merged in bbb9b84.

Labels
CLA Signed, fb-exported, Merged
2 participants