misc: Label APIs for Logging #2153
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough: Adds an API-exposure decorator import (`flashinfer_api`) across many public modules.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
📜 Recent review details — Configuration used: CodeRabbit UI | Review profile: CHILL | Plan: Pro

📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (1)
Summary of Changes

Hello @bkryu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the API logging capabilities within the FlashInfer library. By systematically applying the `@flashinfer_api` decorator, a large number of existing public APIs are now covered by the logging system.

Highlights
/bot run
Code Review
This pull request adds the @flashinfer_api decorator to a large number of public APIs to enable comprehensive logging. The changes are mostly mechanical and look good. I've found a couple of places where the decorator could also be added for completeness. Please see my comments for details.
Diff context:

```
@@ -65,6 +67,7 @@ def __init__(
        pin_memory=True,
    )
```
Thanks for adding the decorator to the plan and run methods. For consistency with other wrapper classes in this PR (e.g., MultiLevelCascadeAttentionWrapper), could you also add the @flashinfer_api decorator to the __init__ method of the BatchAttention class? It appears to be a public API that should be logged.
`@flashinfer_api`
No, this is fine; `get_seq_lens` is not a documented API.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
flashinfer/activation.py (1)
197-214: Fix docstring parameter list for `silu_and_mul_scaled_nvfp4_experts_quantize`

The docstring documents an `sf_vec_size` parameter that does not exist in the function signature; callers can't actually pass it. Either add `sf_vec_size` to the API (threaded through to the underlying kernel) or drop it from the docstring so docs match behavior.

flashinfer/decode.py (1)
1745-1869: Fix dtype check for `lse` in `BatchDecodeMlaWithPagedKVCacheWrapper.run`

Here `lse` is allocated as `torch.float32` when created internally, and the docstring describes it as a float32 tensor, but the validation path for user-provided `lse` uses `q_nope.dtype`:

```python
check_shape_dtype_device(
    lse,
    (q_nope.size(0), q_nope.size(1)),
    q_nope.dtype,
    q_nope.device,
    "lse",
)
```

If `q_nope` is fp16/fp8 (typical), a correctly-typed float32 `lse` will fail this check. You likely want to enforce float32 instead. You can fix this by changing the expected dtype:

```diff
 check_shape_dtype_device(
     lse,
     (q_nope.size(0), q_nope.size(1)),
-    q_nope.dtype,
+    torch.float32,
     q_nope.device,
     "lse",
 )
```

This keeps the behavior consistent between internally-allocated and user-provided `lse`.

flashinfer/norm.py (1)
254-281: Layernorm signature change is likely a breaking public API change

`layernorm` now has the signature:

```python
def layernorm(input: torch.Tensor, gemma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
```

Previously, callers may have been using a different (likely simpler) signature. With both `gemma` and `beta` now required positional arguments and no overload/compat shim, existing user code invoking `flashinfer.norm.layernorm(...)` will break unless all external call sites were updated in lockstep.

If this function is part of the public Python API (as the decorating suggests), consider:

- Providing backward-compatible defaults (e.g., allow `gemma`/`beta` to be optional and synthesize them when omitted), or
- Introducing a new, Gemma-specific entry point and keeping the old `layernorm` signature until a deprecation cycle completes.

At minimum, this should be clearly called out as a breaking change in release notes.
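One possible shape for such a compat shim, sketched with NumPy standing in for the fused kernel (the optional-parameter defaults here are an assumption for illustration, not FlashInfer's actual code):

```python
import numpy as np

# Illustrative compat shim: gemma and beta become optional and identity
# parameters are synthesized when omitted, so pre-change call sites that
# passed only `input` keep working.
def layernorm(x, gemma=None, beta=None, eps=1e-6):
    hidden = x.shape[-1]
    if gemma is None:
        gemma = np.ones(hidden, dtype=x.dtype)   # identity scale
    if beta is None:
        beta = np.zeros(hidden, dtype=x.dtype)   # identity shift
    # Reference layernorm computation standing in for the fused kernel.
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gemma + beta
```

With this shape, `layernorm(x)` and `layernorm(x, ones, zeros)` are equivalent, which is what keeps old call sites working through a deprecation cycle.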
🧹 Nitpick comments (1)
flashinfer/fp8_quantization.py (1)
146-181: Align shape constraint with `alignment` argument (or document that only 32 is supported)

`mxfp8_quantize` exposes an `alignment` parameter but hard-codes

```python
sf_vec_size = 32

assert input.shape[-1] % sf_vec_size == 0
```

regardless of the `alignment` value actually passed through to the kernel. For clarity and future-proofing, either:

- Tie the assertion to `alignment`:

```diff
-sf_vec_size = 32
-
-assert input.shape[-1] % sf_vec_size == 0
+assert input.shape[-1] % alignment == 0
```

or

- Explicitly document that only `alignment=32` is supported and ignore non-32 values.

Right now the signature suggests a tunable `alignment` while the precondition effectively fixes it to 32.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- flashinfer/activation.py (5 hunks)
- flashinfer/attention.py (3 hunks)
- flashinfer/cascade.py (15 hunks)
- flashinfer/decode.py (2 hunks)
- flashinfer/fp4_quantization.py (12 hunks)
- flashinfer/fp8_quantization.py (3 hunks)
- flashinfer/gemm/routergemm_dsv3.py (2 hunks)
- flashinfer/norm.py (6 hunks)
- flashinfer/page.py (4 hunks)
- flashinfer/pod.py (7 hunks)
- flashinfer/prefill.py (1 hunk)
- flashinfer/quantization.py (3 hunks)
- flashinfer/rope.py (14 hunks)
- flashinfer/sampling.py (13 hunks)
- flashinfer/sparse.py (7 hunks)
- flashinfer/xqa.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (15)
flashinfer/pod.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/activation.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/prefill.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/decode.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/fp8_quantization.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/sparse.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/xqa.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/attention.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/page.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/quantization.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/sampling.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/cascade.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/fp4_quantization.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/rope.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/gemm/routergemm_dsv3.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (34)
flashinfer/xqa.py (1)
22-31: xqa/xqa_mla decoration matches the intended API-logging pattern

Importing `flashinfer_api` and decorating the public `xqa` and `xqa_mla` entrypoints keeps signatures/behavior intact while integrating them with the API logging system. Since the heavy lifting still happens inside the JIT modules and custom ops, this is a clean way to extend observability without altering core execution.

Looks good to me.
Also applies to: 147-333, 419-530
flashinfer/page.py (1)
22-30: Paged KV/page helper decoration is consistent and non-invasive

Bringing `get_batch_indices_positions`, `append_paged_mla_kv_cache`, and `append_paged_kv_cache` under `flashinfer_api` while leaving the lower-level custom-op kernels untouched is a good balance: you gain API-level logging on the user-facing helpers without disturbing the JIT/CUDA integration.

No issues from a correctness or API standpoint.
Also applies to: 158-212, 240-288, 290-417
flashinfer/sparse.py (1)
22-38: Sparse attention wrappers are now consistently instrumented via `flashinfer_api`

Adding `flashinfer_api` to the `__init__`, `plan`, and `run` methods of both sparse wrappers brings them in line with the rest of the wrapper APIs (prefill/batch decode, etc.) that are already logged. The decorator is applied only at the Python wrapper layer, so the underlying CUDA/JIT modules and custom ops are unaffected.

The alias methods (`begin_forward`, `run_return_lse`, etc.) will naturally route through the decorated implementations, which is also desirable for consistent logging.

No functional issues spotted in these additions.
Also applies to: 111-171, 208-362, 517-702, 742-793, 830-885, 1107-1278
flashinfer/quantization.py (1)
22-25: Quantization helpers correctly wrapped for API logging

Decorating `packbits` and `segment_packbits` with `flashinfer_api` keeps them as thin wrappers over the underlying custom ops while exposing their calls to the unified logging mechanism. There's no change in behavior or JIT/custom-op wiring, and no recursion hazards since the wrappers still call `_packbits` / `get_quantization_module().segment_packbits`.

This looks good.
Also applies to: 46-79, 81-139
flashinfer/pod.py (1)
24-42: POD wrappers are now logged at the same granularity as other attention wrappers

Using `flashinfer_api` on the `__init__`, `plan`, and `run` methods of both POD wrapper classes brings them in line with the other prefill/decode wrappers that are already instrumented. The decorations are applied only at the Python wrapper layer; all JIT modules, custom ops, and CUDA-graph constraints remain untouched.

Given the env-gated logger, this extends observability without altering runtime behavior in the common case.
Also applies to: 123-235, 264-435, 438-620, 731-787, 799-823, 1019-1197
flashinfer/prefill.py (1)
3557-3635: `fmha_v2_prefill_deepseek` needs clarification on `lse` allocation contract

The `@flashinfer_api` decoration is consistent with other DeepSeek/TRT-LLM entrypoints and should be semantics-preserving.

However, the `lse` parameter handling needs clarification: when `return_lse=True` and `lse=None`, the function forwards `None` to `module.run(...)` and returns `(out, None)`. This means callers expecting a valid `lse` tensor must preallocate it, despite the docstring marking `lse` as optional.

Please confirm:

- Does `get_trtllm_fmha_v2_module().run()` gracefully handle `None` for the `lse` parameter when the backend needs to compute it?
- If not, align this function with the pattern used in `trtllm_ragged_attention_deepseek` to allocate `lse` when needed, or update the docstring and add an assertion requiring `lse` to be provided when `return_lse=True`.

flashinfer/attention.py (1)
23-23: BatchAttention logging decoration looks safe and non-intrusive

Importing `flashinfer_api` and decorating `BatchAttention.plan` / `BatchAttention.run` keeps signatures and control flow unchanged, and respects the zero-overhead path when logging is disabled. This is a good fit for expanding API logging over core attention entry points.

Also applies to: 69-88, 137-201
flashinfer/gemm/routergemm_dsv3.py (1)
89-136: Public DeepSeek-V3 router GEMM wrapper is consistent with existing design

The new `mm_M1_16_K7168_N256` wrapper cleanly layers on top of the cached module op, reusing `_mm_M1_16_K7168_N256_shape_checks` and only adding API logging. No behavioral or signature regressions are apparent.

flashinfer/decode.py (1)
316-346: Decorator usage across decode APIs is consistent and low risk

Applying `@flashinfer_api` to the main decode entry points (`single_decode_with_kv_cache*`, `BatchDecodeWithPagedKVCacheWrapper.{__init__,plan,run}`, and the trtllm/xqa helpers) preserves signatures and control flow while giving you opt-in logging. Given `_API_LOG_LEVEL == 0` is a true passthrough, this should not affect hot-path performance in the default configuration.

Also applies to: 393-583, 652-783, 816-1110, 1170-1390, 2071-2345, 2351-2370, 2552-2570, 2718-2764
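The zero-overhead claim can be illustrated with a minimal sketch of an env-gated decorator (hypothetical body; names mirror this PR but this is not FlashInfer's actual `api_logging.py`):

```python
import functools
import os

# Hypothetical env-gated logging decorator. FLASHINFER_LOGLEVEL and
# flashinfer_api mirror the PR's names; the implementation is illustrative.
_API_LOG_LEVEL = int(os.environ.get("FLASHINFER_LOGLEVEL", "0"))

def flashinfer_api(fn):
    # True passthrough when logging is disabled: the original function
    # object is returned unchanged, so the hot path pays zero overhead.
    if _API_LOG_LEVEL == 0:
        return fn

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print(f"[flashinfer] {fn.__qualname__} called")
        return fn(*args, **kwargs)

    return wrapper

@flashinfer_api
def rmsnorm(x):
    return x  # stand-in for the real kernel dispatch
```

Because the gate is evaluated once at decoration time, decorated APIs are bit-identical to undecorated ones in the default (level 0) configuration.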
flashinfer/fp8_quantization.py (1)
183-208: FP8 dequantization wrapper is a straightforward, safe delegation
`mxfp8_dequantize_host` cleanly forwards to the cached sm100 module and matches the documented contract (MxFP8 input + scale tensor → fp32 output). Decorating it with `@flashinfer_api` is appropriate and doesn't alter behavior.

flashinfer/norm.py (1)
32-69: RMSNorm and Gemma RMSNorm wrappers are well-factored around the JIT module

The `rmsnorm`, `fused_add_rmsnorm`, `gemma_rmsnorm`, and `gemma_fused_add_rmsnorm` functions expose clear Python APIs that delegate to the JIT-backed `get_norm_module()` while keeping the custom-op registrations intact. Adding `@flashinfer_api` on these entry points is consistent with the rest of the PR and doesn't change behavior.

Also applies to: 95-129, 142-179, 205-241
flashinfer/sampling.py (1)
533-589: Sampling API wrappers are well-structured and match the underlying kernels

The new `@flashinfer_api`-decorated sampling helpers (`softmax`, the various `sampling_from_*` and `top_*` functions, and `chain_speculative_sampling`) provide a clean, validated Python surface over the JIT-backed sampling module:

- Shapes and dtypes of `indices` / per-batch thresholds are checked via `_check_indices_dtype` and `_check_tensor_param`.
- The "top_k_first" vs "joint" logic in `top_k_top_p_sampling_from_logits/probs` is clear and uses the appropriate low-level entry point.
- Logging is added only at the Python API layer; the custom-op registrations themselves remain unchanged.
Overall this is a solid expansion of the public sampling API with minimal risk to existing behavior.
Also applies to: 592-655, 658-727, 730-823, 827-921, 924-1013, 1017-1148, 1151-1275, 1278-1338, 1345-1404, 1411-1465, 1469-1587
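For reference, the "top_k_first" ordering described above can be sketched in NumPy (illustrative only; FlashInfer's fused GPU kernels implement this differently):

```python
import numpy as np

# Illustrative "top_k_first" filtering: restrict to the top-k entries,
# then apply top-p (nucleus) truncation within them, then renormalize.
def top_k_top_p_filter(probs, top_k, top_p):
    order = np.argsort(probs)[::-1]           # indices, highest prob first
    keep = order[:top_k]                      # top-k first
    p = probs[keep]
    cdf = np.cumsum(p / p.sum())              # renormalized cumulative mass
    cutoff = np.searchsorted(cdf, top_p) + 1  # smallest prefix covering top_p
    keep = keep[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()          # final sampling distribution
```

The "joint" variant would instead apply both truncations against the original distribution in one pass; the two orderings can select different support sets, which is why the wrappers expose the choice explicitly.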
flashinfer/cascade.py (1)
34-90: Cascade merge helpers and wrappers gain logging without altering behavior

Adding `@flashinfer_api` to `merge_state*`, `MultiLevelCascadeAttentionWrapper`, and the shared-prefix batch decode/prefill wrappers preserves their existing semantics:

- Merge helpers still delegate entirely to the JIT-backed cascade module; only Python-level calls are logged.
- Wrapper `__init__`/`plan`/`run` methods maintain their data-flow into underlying prefill/decode wrappers, and the merged outputs are unchanged.
- The casting of `s`/`s_other` to `torch.float32` is consistent with the documented expectations for those tensors.

The additional logging hooks are therefore low-risk and consistent with the rest of the PR.
Also applies to: 101-159, 162-207, 294-365, 395-512, 516-551, 639-646, 668-728, 730-797, 890-909, 930-984, 986-1078
flashinfer/fp4_quantization.py (8)
24-24: LGTM! Import addition is correct and follows the pattern used across other modules in this PR.

628-691: LGTM! The `@flashinfer_api` decorator is correctly applied to the public API wrapper function. The function logic remains unchanged.

694-724: LGTM! Decorator correctly applied.

Note: The `device_arch` computation at line 716 uses `major * 10 + minor`, which differs from the string concatenation pattern `f"{major}{minor}"` used elsewhere (e.g., line 677). For current architectures, both yield identical results, but consider unifying the approach for consistency in a future cleanup.

727-767: LGTM! Decorator correctly applied to the dequantization function.

770-802: LGTM! Decorators correctly applied to both shuffle functions.

815-873: LGTM! Decorator correctly applied to `nvfp4_quantize`.

876-913: LGTM! Decorators correctly applied to both `mxfp4_quantize` and `mxfp4_dequantize`.

916-1000: LGTM! Decorators correctly applied to the remaining quantization functions (`mxfp4_dequantize_host`, `nvfp4_batched_quantize`, `scaled_fp4_grouped_quantize`).

flashinfer/rope.py (13)
22-22: LGTM! Import addition is correct and follows the pattern used across other modules in this PR.

417-502: LGTM! Decorator correctly applied to `apply_rope_inplace`. The in-place RoPE operations are now properly instrumented for API logging.

505-561: LGTM! Decorator correctly applied.

564-670: LGTM! Decorator correctly applied to the Llama 3.1 RoPE in-place variant.

673-749: LGTM! Decorator correctly applied.

752-860: LGTM! Decorator correctly applied to `apply_rope` (non-inplace variant).

863-929: LGTM! Decorator correctly applied.

932-1052: LGTM! Decorator correctly applied.

1055-1140: LGTM! Decorator correctly applied.

1143-1257: LGTM! Decorators correctly applied to both `apply_rope_with_cos_sin_cache` and `apply_rope_with_cos_sin_cache_inplace`.
1260-1294: Verify: double logging when `mla_rope_quantize_fp8` calls `rope_quantize_fp8`

Both `mla_rope_quantize_fp8` and `rope_quantize_fp8` are decorated with `@flashinfer_api`. When logging is enabled (`FLASHINFER_LOGLEVEL > 0`), calling `mla_rope_quantize_fp8` will produce two sets of log entries (one for each function).

If this is intentional for full call-stack tracing, this is fine. If you prefer avoiding duplicate logs, consider either:

- Removing the decorator from `mla_rope_quantize_fp8` (since it's just a thin wrapper), or
- Keeping it as-is if you want visibility into both entry points.
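The double-logging behavior is easy to reproduce with a toy always-on version of the decorator (illustrative only; the function bodies are stand-ins, not FlashInfer's kernels):

```python
import functools

calls = []  # records every decorated entry point that fires

def flashinfer_api(fn):
    # Simplified always-on stand-in for the real decorator.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        calls.append(fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

@flashinfer_api
def rope_quantize_fp8(x):
    return x + 1  # stand-in for the real kernel

@flashinfer_api
def mla_rope_quantize_fp8(x):
    # Thin wrapper delegating to the decorated inner API.
    return rope_quantize_fp8(x)

mla_rope_quantize_fp8(0)
# A single user call records two entries: the outer wrapper, then the inner API.
```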
1297-1434: LGTM! Decorator correctly applied to `rope_quantize_fp8`. The comprehensive docstring and input validation logic are well-structured.

1437-1671: LGTM! Decorator correctly applied to `rope_quantize_fp8_append_paged_kv_cache`. This is a complex fused operation with thorough input validation and architecture detection.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/decode.py (1)
1745-1760: Fix `lse` dtype check in MLA `run` when caller provides preallocated tensor

Inside `BatchDecodeMlaWithPagedKVCacheWrapper.run`, `lse` is allocated with `dtype=torch.float32` when `return_lse=True` and `lse is None`, but the validation path for user-provided `lse` calls:

```python
check_shape_dtype_device(
    lse,
    (q_nope.size(0), q_nope.size(1)),
    q_nope.dtype,  # <- expects query dtype
    q_nope.device,
    "lse",
)
```

This will incorrectly reject a correctly shaped float32 `lse` tensor and only accept `lse` with the same dtype as `q_nope`. That's inconsistent with both how you allocate `lse` here and with `BatchDecodeWithPagedKVCacheWrapper.run`, which checks against `torch.float32`.

Recommend changing the expected dtype to `torch.float32`:

```diff
 check_shape_dtype_device(
     lse,
     (q_nope.size(0), q_nope.size(1)),
-    q_nope.dtype,
+    torch.float32,
     q_nope.device,
     "lse",
 )
```

Also applies to: 1813-1846
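A minimal mock makes the failure mode concrete (a stub `Tensor` class stands in for `torch.Tensor`, and this validator is a hypothetical sketch; the real `check_shape_dtype_device` may differ):

```python
from dataclasses import dataclass

# Stub tensor carrying only the attributes the validator inspects.
@dataclass
class Tensor:
    shape: tuple
    dtype: str
    device: str

def check_shape_dtype_device(t, shape, dtype, device, name):
    # Sketch of a shape/dtype/device validator.
    if t.shape != shape:
        raise ValueError(f"{name}: expected shape {shape}, got {t.shape}")
    if t.dtype != dtype:
        raise ValueError(f"{name}: expected dtype {dtype}, got {t.dtype}")
    if t.device != device:
        raise ValueError(f"{name}: expected device {device}, got {t.device}")

q_nope = Tensor((8, 16), "float16", "cuda:0")
lse = Tensor((8, 16), "float32", "cuda:0")  # float32 per the documented contract

# Validating against q_nope.dtype wrongly rejects the contract-correct lse:
try:
    check_shape_dtype_device(lse, (8, 16), q_nope.dtype, q_nope.device, "lse")
    rejected = False
except ValueError:
    rejected = True

# Validating against float32 accepts it, matching the internal allocation path:
check_shape_dtype_device(lse, (8, 16), "float32", q_nope.device, "lse")
```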
🧹 Nitpick comments (1)
flashinfer/attention.py (1)
23-24: `flashinfer_api` decoration for BatchAttention looks correct

Importing `flashinfer_api` and decorating `BatchAttention.__init__`, `plan`, and `run` aligns with the logging design, preserves signatures/behavior, and benefits from the zero-overhead path when `FLASHINFER_LOGLEVEL=0`. One minor note: logs for these methods will appear as bare `__init__`, `plan`, `run` (without the `BatchAttention.` prefix) because the decorator only special-cases class names containing `Wrapper`; if you want fully qualified names here as well, you'd need to extend that heuristic in `flashinfer/api_logging.py`.

Also applies to: 44-45, 70-71, 138-139
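If fully qualified names are wanted, one possible approach (a sketch, not the existing `api_logging.py` heuristic) is to lean on `__qualname__`, which already carries the defining class:

```python
# Derive a fully qualified log name without any class-name heuristic:
# for a method defined in a class, __qualname__ is "ClassName.method".
def log_name(fn):
    return getattr(fn, "__qualname__", fn.__name__)

class BatchAttention:
    def plan(self):
        pass

    def run(self):
        pass

print(log_name(BatchAttention.plan))
```

This yields `BatchAttention.plan` rather than a bare `plan`, for any class, with no dependence on a `Wrapper` naming convention.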
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- flashinfer/attention.py (4 hunks)
- flashinfer/decode.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
flashinfer/attention.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
flashinfer/decode.py (1)
flashinfer/api_logging.py (1)
flashinfer_api(464-565)
🔇 Additional comments (1)
flashinfer/decode.py (1)
1504-1513: Decorating deprecated MLA wrapper ctor/plan is consistent and non-breaking

Adding `@flashinfer_api` to `BatchDecodeMlaWithPagedKVCacheWrapper.__init__` and `.plan` cleanly brings this (deprecated but still public) wrapper under the same logging scheme as other `*Wrapper` classes. Signatures and control flow are unchanged, and the decorator's zero-overhead path when logging is disabled preserves performance in the default configuration.

Also applies to: 1619-1635
@yzh119 can you take a look? The PR does mechanical changes, decorating APIs with `@flashinfer_api`.
[CANCELING] Pipeline #39432378: canceled
📌 Description
FlashInfer's API Logging feature was enabled in #2108, but only decorated a small number of APIs with
`@flashinfer_api`. The current PR adds the decorator to a large number of existing APIs for comprehensive coverage.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features

- Expanded public API: many attention, cascade, pod, page, sparse, quantization, rope, sampling, decode and GEMM entry points are now exposed.
- Added high-level FP4/FP8 quantization utilities, new sampling functions (speculative, top-k/top-p, renorm), and RoPE transforms with precomputed-cache & FP8 paths.

Changes

- `layernorm` now accepts additional parameters (`gemma`, `beta`, `eps`).

Documentation

- Updated logging/API doc link in README.