add eager mode check for NaN and Inf #1015

frank-wei · 2024-07-25T19:35:01Z

Summary:
This diff includes some debug tool improvements.
While debugging the IG_CTR MC proposal, we noticed that some new snapshots produced NaN in their results, and we want to find the root cause.
Previously, `--run-accuracy-check` ran the generate merge + load merge through pybind but did not check the eager mode run. This diff adds that eager mode check. The random inputs are created the same way as in load merge. Results are attached in P1494263214.

CUDA_VISIBLE_DEVICES=5 TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt-split-dwarf   mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=/data/local/models/581303767/85/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR" --run-accuracy-check  --debug_operator_range="1397,1397" --generate_sample_inputs=False  --min_acc_module_size=0  --disable-multiple-batch-run  2>&1 | tee aot.log

To enable per-layer printing, add `--dispatch-print`; it prints each layer's output and checks whether that output contains NaN or Inf.
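For readers without access to the internal tool, the per-layer check can be illustrated with a minimal sketch. This is not the implementation behind `--dispatch-print`, which is not shown in this PR; it assumes plain PyTorch forward hooks in eager mode, and the names `find_non_finite_layers` and `Log` are hypothetical, introduced only for illustration.

```python
import torch
import torch.nn as nn


def find_non_finite_layers(model: nn.Module, example_input: torch.Tensor) -> list[str]:
    """Run `model` eagerly and return the names of leaf modules whose
    output contains NaN or Inf, in execution order."""
    bad_layers: list[str] = []
    handles = []

    def make_hook(name: str):
        def hook(module, inputs, output):
            # This sketch only inspects plain tensor outputs.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad_layers.append(name)
        return hook

    for name, module in model.named_modules():
        # Hook leaf modules only, so each layer is reported once.
        if len(list(module.children())) == 0:
            handles.append(module.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            model(example_input)
    finally:
        # Always remove the hooks so the model is left unchanged.
        for handle in handles:
            handle.remove()
    return bad_layers


class Log(nn.Module):
    """Toy layer (hypothetical) that yields -Inf for 0 and NaN for negatives."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.log(x)


model = nn.Sequential(nn.Identity(), Log(), nn.Identity())
bad = find_non_finite_layers(model, torch.tensor([1.0, 0.0, -1.0]))
print(bad)
```

The first name in the returned list is the layer where the non-finite value first appears; later entries are usually layers that only propagate it, which is the distinction that matters when hunting for a root cause.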

Reviewed By: hl475, chenyang78

Differential Revision: D60150435

facebook-github-bot · 2024-07-25T19:35:33Z

This pull request was exported from Phabricator. Differential Revision: D60150435

facebook-github-bot · 2024-07-25T21:51:11Z

This pull request has been merged in bbb9b84.

facebook-github-bot added the CLA Signed label Jul 25, 2024

facebook-github-bot added the fb-exported label Jul 25, 2024

facebook-github-bot closed this in bbb9b84 Jul 25, 2024

facebook-github-bot added the Merged label Jul 25, 2024
