fix(jit): propagate -DNDEBUG to host-side cflags #3278
Conversation
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Code Review
This pull request ensures that the -DNDEBUG flag is propagated to host compiler flags (cflags) during JIT compilation when not in debug mode. It also adds unit tests to verify that the flag is correctly included in release builds and excluded in debug builds. I have no feedback to provide.
/bot run
@aleozlx, CI passed, should we merge this item then?
Added the 0.6.11 label for the post1 target.
## 📌 Description
`gen_jit_spec` adds `-DNDEBUG` only to `extra_cuda_cflags` (consumed by
`nvcc` for `.cu` files), not to `extra_cflags` (consumed by `g++` for
host-side `.cpp`). Several host-only translation units are part of
MoE/GEMM JIT specs — most notably
`csrc/nv_internal/cpp/common/logger.cpp` — and they end up compiled
without `NDEBUG` while the rest of the module is a release build.
For the TensorRT-LLM logger this matters because of:
```cpp
// csrc/nv_internal/include/tensorrt_llm/common/logger.h
#ifndef NDEBUG
Level const DEFAULT_LOG_LEVEL = DEBUG;
#else
Level const DEFAULT_LOG_LEVEL = INFO;
#endif
```
With `NDEBUG` missing on the host side, every prebuilt
`flashinfer-jit-cache` wheel ships with `Logger::level_ = DEBUG (10)`.
On Hopper this turns each MoE forward pass into a stream of
`[TensorRT-LLM][DEBUG] ... sm90_generic_mixed_moe_gemm_kernelLauncher
...` lines from the OSS CUTLASS kernel dispatcher. Verified by reading
the data-section initializer of `Logger::Logger()` in the released
`flashinfer-jit-cache==0.6.10+cu130`
`fused_moe_{90,100,103,120,trtllm_sm100}.so` — all five start `Logger`
with `DEFAULT_LOG_LEVEL=10` and `level_=10`, even though the same wheels
carry no `.debug_*` sections (i.e. they are otherwise release-built).
The fix is one line: also append `-DNDEBUG` to the host `cflags` when
not in debug mode. The `flashinfer-jit-cache` wheel build picks this up
automatically and the prebuilt logger flips back to `INFO`.
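For illustration, the shape of the change can be sketched as follows. This is a minimal, self-contained model of how `gen_jit_spec` assembles its flag lists, not the actual `flashinfer/jit/core.py` code; all flag values besides `-DNDEBUG` are illustrative.

```python
def assemble_flags(debug: bool) -> tuple[list[str], list[str]]:
    """Hypothetical model of gen_jit_spec's flag assembly (names illustrative)."""
    cflags = ["-O3", "-std=c++17"]                     # host g++ flags for .cpp TUs
    cuda_cflags = ["-O3", "--expt-relaxed-constexpr"]  # nvcc flags for .cu TUs
    if debug:
        cuda_cflags += ["-g", "-G"]     # illustrative debug flags
    else:
        cuda_cflags.append("-DNDEBUG")  # previously the only place NDEBUG landed
        cflags.append("-DNDEBUG")       # the fix: define it for host TUs too
    return cflags, cuda_cflags
```

With this in place, host-only translation units such as `logger.cpp` see the same `NDEBUG` definition as the device code, so the `#ifndef NDEBUG` branch in `logger.h` selects `INFO` in release builds.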
## 🔍 Related Issues
Initially this bug was observed during integration of FI v0.6.10 into
vLLM: [[CI/Build] Bump flashinfer to v0.6.10
#41711](vllm-project/vllm#41711).
A CI job log shows the failure caused by this issue:
[buildkite/ci/pr/distributed-tests-2-gpus-h100](https://buildkite.com/vllm/ci/builds/64532#019df966-e67d-4c27-af0e-76b00bc496e5).
Surfaced while debugging a downstream CI step that produced a 2.9 GB log
dominated by TRT-LLM debug prints from `fused_moe_90.so`. No FlashInfer
issue tracking this yet — happy to file one alongside this PR if useful.
## 🚀 Pull Request Checklist
### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit`.
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.
## 🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`pytest tests/test_jit_cpp_ext.py`).
Two regression tests added in `tests/test_jit_cpp_ext.py`, mirroring the
existing `test_debug_jit_uses_sccache_compatible_nvcc_device_debug_flag`
style:
```
pytest tests/test_jit_cpp_ext.py -v
```
```
test_release_jit_propagates_ndebug_to_host_cflags PASSED
test_debug_jit_does_not_propagate_ndebug PASSED
```
The first asserts that a release build
(`FLASHINFER_JIT_DEBUG`/`FLASHINFER_JIT_VERBOSE` unset) puts `-DNDEBUG`
in **both** `spec.extra_cflags` and `spec.extra_cuda_cflags`. The second
locks in symmetry: with `FLASHINFER_JIT_DEBUG=1` neither list contains
`-DNDEBUG`. Without the fix, the first test fails on `assert "-DNDEBUG"
in spec.extra_cflags`.
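The assertion logic of the two tests can be condensed into a self-contained sketch. The real tests build an actual spec via `gen_jit_spec`; here a stub stands in for it so the release/debug symmetry is runnable on its own, with the env-var name taken from the description.

```python
import os

def fake_gen_jit_spec() -> dict:
    """Stand-in for gen_jit_spec: returns only the two flag lists under test."""
    debug = os.environ.get("FLASHINFER_JIT_DEBUG") == "1"
    flags = {"extra_cflags": [], "extra_cuda_cflags": []}
    if not debug:
        flags["extra_cuda_cflags"].append("-DNDEBUG")  # pre-existing behavior
        flags["extra_cflags"].append("-DNDEBUG")       # behavior added by this PR
    return flags

# Release build: -DNDEBUG must appear in BOTH lists.
os.environ.pop("FLASHINFER_JIT_DEBUG", None)
spec = fake_gen_jit_spec()
assert "-DNDEBUG" in spec["extra_cflags"]
assert "-DNDEBUG" in spec["extra_cuda_cflags"]

# Debug build: -DNDEBUG must appear in NEITHER list.
os.environ["FLASHINFER_JIT_DEBUG"] = "1"
spec = fake_gen_jit_spec()
assert "-DNDEBUG" not in spec["extra_cflags"]
assert "-DNDEBUG" not in spec["extra_cuda_cflags"]
```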
## Reviewer Notes
Single-line behavior change in `flashinfer/jit/core.py`. No effect on
debug builds. Prebuilt wheels rebuilt from this commit will pick up the
change automatically — no schema/version bump needed.