[KCP] add KCP.md; fix fp32 precision in M matrix chain; cleanup CP tests by zhiyuan1i · Pull Request #740 · fla-org/flash-linear-attention

zhiyuan1i · 2026-02-09T08:26:45Z

Summary by CodeRabbit

Documentation
- Added comprehensive KCP (Kimi Context Parallel) reference covering architecture, data flow, gating, numeric stability, and optimizations.
Improvements
- Switched intermediate computations to consistent fp32 accumulation to improve numerical stability.
- Strengthened cross-rank synchronization for context-parallel flows.
Tests
- Expanded multi-GPU CP test coverage, added gate-aware scenarios, updated orchestration and verification to CP-aware comparisons, and standardized assertion handling.

coderabbitai · 2026-02-09T08:27:09Z

Walkthrough

Adds a KCP design doc and updates CP internals and tests: FP32 intermediate accumulation and explicit cross-rank all_gather in chunk_delta_h, plus distributed, gate-aware CP tests and revised test harness/assertions.

Changes

Cohort / File(s)	Summary
KCP Documentation `fla/ops/cp/KCP.md`	New comprehensive reference for Kimi Context Parallel (KCP): architecture, CP dataflow, gating (GDN/KDA), WY/M matrix math, cross-rank merging, and memory/precision notes.
CP Core Implementation `fla/ops/cp/chunk_delta_h.py`	Switch intermediate linear-algebra to fp32 accumulators; replace dtype aliasing with explicit float32 for state buffers; add `all_gather_into_tensor` sync points after pre-processing; adapt merged forward/backward kernel calls to use gathered fp32 state.
CP Conv Test `tests/context_parallel/test_cp_conv.py`	Removed local `assert_close` and inline torchrun harness; import centralized `assert_close`; adjust input scaling and replace detailed diff helper with ratio-based tolerance.
CP GDN Tests `tests/context_parallel/test_cp_gdn.py`	Rewrote tests for multi-GPU CP: spawn/distributed setup, explicit chunk partitioning, inter-rank state passing, CP-aware forward/backward verification, expanded CP scenarios, and dtype updates (more bfloat16).
CP KDA Tests `tests/context_parallel/test_cp_kda.py`	Added gating parameters to test runners (`use_gate_in_kernel`, `safe_gate`, `lower_bound`); broadcast gate constants (`A_log`, `dt_bias`); implement gate-aware reference and CP paths; introduce `GATE_KWARGS`.

Sequence Diagram(s)

sequenceDiagram
    participant RankA as Rank A
    participant RankB as Rank B
    participant AllGather as All-Gather Sync
    participant PreProc as Pre-process Kernel
    participant MainKernel as Main Kernel

    RankA->>PreProc: compute local m, hm (fp32)
    RankB->>PreProc: compute local m, hm (fp32)

    PreProc->>AllGather: all_gather_into_tensor(ag_hm, ag_dhm)
    AllGather->>RankA: ag_hm, ag_dhm
    AllGather->>RankB: ag_hm, ag_dhm

    RankA->>MainKernel: main_kernel(ag_hm, fp32 state, gated inputs)
    RankB->>MainKernel: main_kernel(ag_hm, fp32 state, gated inputs + inter-rank decay)

    MainKernel->>RankA: outputs, grads
    MainKernel->>RankB: outputs, grads

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Possibly related PRs

[CP] fuse fwd/bwd kernels and fix IMA in long context #733: Related merged forward/backward pre-process kernel changes and dispatch that align with the all_gather and merged-preproc updates in chunk_delta_h.py.
[KDA] Support fused forget gate #662: Implements KDA gating primitives and use_gate_in_kernel support that underpin the gate-aware CP KDA test paths.
[KDA] Speed up chunk_kda by introducing lowerbound gate #703: Adds safe_gate / lower_bound and expanded KDA gating signatures matching the gating options exercised by the new CP KDA tests.

Suggested reviewers

yzhangcs
Nathancgy

Poem

🐇 I hopped through ranks with buffers held tight,
fp32 for the sums so the numbers stay bright,
gates gently swing as states come to play,
all-gathers align each parallel day,
a rabbit's small cheer for the CP display.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the three main changes: adding KCP.md documentation, fixing fp32 precision in M matrix chain computations, and cleaning up context-parallel tests.
Docstring Coverage	✅ Passed	Docstring coverage is 84.85% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch lzy/kcp

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments

fla/ops/cp/KCP.md (1)

11-14: Add language identifiers to fenced code blocks.

markdownlint MD040 flags these blocks; add a language (e.g., text) to silence the warning and improve readability.

✅ Suggested change (example)

-```
+```text
 S_t = decay(g_t) * S_{t-1} + beta_t * k_t (x) (v_t - S_{t-1} @ k_t)
 o_t = q_t^T @ S_t


Apply similarly to the other fenced blocks listed above.
</details>


Also applies to: 21-23, 66-76, 84-90, 94-99, 104-108, 116-122, 126-131, 136-140, 255-258, 265-267

</blockquote></details>

</blockquote></details>

<!-- tips_start -->

---

Thanks for using [CodeRabbit](https://coderabbit.ai?utm_source=oss&utm_medium=github&utm_campaign=fla-org/flash-linear-attention&utm_content=740)! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

<details>
<summary>❤️ Share</summary>

- [X](https://twitter.com/intent/tweet?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A&url=https%3A//coderabbit.ai)
- [Mastodon](https://mastodon.social/share?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A%20https%3A%2F%2Fcoderabbit.ai)
- [Reddit](https://www.reddit.com/submit?title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&text=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code.%20Check%20it%20out%3A%20https%3A//coderabbit.ai)
- [LinkedIn](https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fcoderabbit.ai&mini=true&title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&summary=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code)

</details>

<sub>Comment `@coderabbitai help` to get the list of available commands and usage tips.</sub>

<!-- tips_end -->

<!-- internal state start -->


<!-- DwQgtGAEAqAWCWBnSTIEMB26CuAXA9mAOYCmGJATmriQCaQDG+Ats2bgFyQAOFk+AIwBWJBrngA3EsgEBPRvlqU0AgfFwA6NPEgQAfACgjoCEejqANiS4BtANIBhAAoBddLXqOnG5rQDckABm8AAeQdwAzABMPBSiSPD4WPBYALKQzNQUoYyw2hgBDFaY2NyQzpA0iLiIBgBy2MwClFwA7AAsAAwGAKoASgAyXLC4uNyIHAD0k0TqsNgCGkzMk4EWaGD4FESr64iwYBYpJGgUYNQ0GOJJk9zYFhaTHd09iC2QAF4IstiYAIzwAwAZXw2AoDBIkAEVAwDFgXAsH1kkwA1gxuAZoKdSLgoTC4VxMilgbhqNgJvxuGQDH0SBJ4CQAO6UCkGAYqEgWVkOOLUOjoTiQKKdKIANjAIolAE5oJ0ABwcMUcdoAVgAWkZ9MZwFAyPR8IEcARiGRlDR6Ms2FcuLx+MJROIpDJ5EwlFRVOotDotSYoHBUKhMEbCKRyFRzQpWOwuFRGZBEI1MhR5HIFG6VGpNNpdGBDNrTAY1mhJvhxpN0ZMvD5aBwDAAiBsGADETcgAEEAJImsN8+gJ1ineQG3KYUi1AxtjzINCR3gkWBkRCSSGpU4o2j4RlYDcMRrsSBV3yQJSk+BHDBEA/OLh2eDMHQOJI0EK4pynNAPTlBLaQADiABE6kgAAKX9e0gf9OVJSA+nuEgAEp0AwTx/zbEDb3vCCoJnNtRnYRIMHgjQYAXY98F3K1cVdeBgmkSpSKYOJIDiXcKDiWFIX2U5+VTACgMwFC2wAGmPaQGGyZpkCpM4iD5SA8mQ89L0CH8BHwXBYD/QCQMQBgP1OHhKDABc0HoWSaEQgSD1QkDpLAWg70gcyEJE0FcHPOiNMhCpTjhdQHTBSErNoagZzWTd0HE/BEGQGEUWQRk5nQB5iGoBc+G4C5KAwRARKspgpAoZAVIoRlTloSYBDQBgUTKih6DnMBeHI6QlwveNSTHJD6DYbZIQsfBZgYYiO1xHc9yuZAvMgAB1ABNZiSDnN4rmoAjYhIMBljuUlrgwET0mA3AYSXPbEMyY6ciYHLjuwMR1pK79SvKyrqtq8qeDS3KNoYBIkgUHL4HTPbip/KKYrAOKRxSFIiDy5DKXEe8PjWpJkCOFFIW2uIYoAfVgTpIEmSASBCTLkPxwnHrYZgtnkEhAmCX6yAYWRiLgSFxso5LEHwWJ8HpJRp2SDAdrAS5eb4TJuG4WHKhUKwno2pqKBamK5asoksExihyC5ETSe4dYUmQcy5ZSGgiHDB6fz47rrLQzKNO+4KSFPLkUCuIy4WwDAUVEvT5Cs5z6Aln9msys2khGsa3e0D3puq3BfgsNNIXCuNHrtqy7BslIoXUzSSrq+grKqmqS+PeAWJBkSUiKbAHPaudceaiEYqCX37rRkStcgHW9ZdhGQ8qRctmQYCUR2ABHIhEPJNBSBjxh7nvDA+QSpKg0N9Z172/hDWm9ILuyduRPBxBIcwf3qjk3rZgveHQ4YlhlsQSmiZJsmBI/0skfgFGe0RJvAoPSdqaBpyLUCJQFm6cfy+QQDQMQgURIhWghnESZtH72wwHuU+H4OoqDPOoeQ+cKjXWfDUDQRgjAtnbBYGg1s0aVD5tNJQRR3wgwPl/bgWwIzhwWEcBgJMrjqAZOOKAdQkgkFoa2VcGAaLSFxAAMTPJCNs68LCyA+JQIwAxjjIDhKOOgXAADUUQ5SdEmBKesjYDAQDAEYIsJYywVh9n7XGSgGFoHxhobgshawNjrM2VsnZuxmn5P2JMQ5DRGIvNITU5QFKkC4B2L2usCG4N6vAYRc5fpLn+gpWgSlIClDQfyQIqtmClLapeaapZlDIUQIATAJ+DZAfgQ2guBZBUimnza6S5qjsC0TUuW0QwCZiCANNazd8AWxAoESIUREKPQtpQNgDk5LbTwKjHKxEgRUl+kzD8Wje5ZByMwe44gjY5N2VNPIuI5z0lBIgEZ0kSobLIo81WtA7o1EgIlDSnsdqGIgbiAg9FISkj6mNbpVIQIkA0EQDQIkBC40ZBoLpPSEKQAwBFPS1QC5AoaTCWgfTKgWA0OFag0QoQMy2FC0iJLdnn2SXLaaeS/pYCdkXH81Sykb3tnEdYEYPIGQ/KQaEM4hnjGoVABwyTrCQDSfGOkyhU7FxeuXd69UMiUFIA1OIKs1a1P7tlTk31porE2UQJIBCBU0Fxfi1+eA6Lkjlos2l1UKL3GoOPECaLmC43gBoAgR1KXUtwNESyCNA2hvwOGql0yo3LMQibGgpluHLUoGAuppEnmJHJMeOFkIeUiSefhZuLEuWQAGh3SpLBFpUnAgIQIfxRSMDBbDOVSTjFcFpLTCQBDhzTX/NAOaTgACiEyIH8hLcleAED7YJgENUdQeB1qAs0jvHJ6gWHglgEm/ANKYj50uQw+ARtITkEZJMHRqtkAw3EB+ABdzl27ShWPIqIFYDMDrgo59Fhca3xoKg39YHcBERIoGRmDopKCJyWAQI1U5YgchE+xdRxAHrSXEQPegVkBJBGSbA04r7mcwZmgK5xbsWfSBXEg1LDwi0tWekzRUJsCwaKj2hVfb2weH5Du36uITm43MulYNVx8C4zDnwPSDxpyBEYYwSgp5uVGrbq1VDNBxgIqRSi9ARB8bVIALzJSA+JygkmCAya/cBcDkANBOfgnlIztBf2QDM6JyzFBrPSdk8Bdzf7HPOcQhCxAshYSwFVgonRntGEbMXY6tDkUH2xWvqDPgK63jT2wPuAeX55Nch44qrgPRuDlJ6vq/kjVNPq3apq+qr0K4fQzuS+uWxeHhkZZCHzc6v3C1LrQIQ5JcTZZILl/L5rU4pAkORN93qthNyICM4CiLkXlogaaxexmmP3xILjQIjJaC4wEMd3GBXU5WXde1dzuIpHkEWsnXWj6sAVB5URexvb4lcFSCkMGT5VYWCQwNOMVlMoeDlg6zyfMX14YBUlT1UQZ1vFLgwH1Ir1r5TS1fP2CglspF2XpjbIiEzZHahhj8wGP0oGKksp+9FqBkS3NUXk1TLuxVECQZc9tmWOv+kjwhVRPs0LoW2BhZoCLkrYaIdYTCcrcMNnw/kAiBBCJEeIcQCSvsPZkWyAxI54k1kgKYv4ERrF/G6AYSda6Lr8ldJCOI9ImQk0ZnwrgAAJeARBYC2OCfYgsVQajlifKTXArd3yfkeEH3G6JY9JAkH4gJfuQnti7KGCJfZEyDm4Qx7XUAB0CznaRAa8n5Kcmkp3WE+8tuqdj3WoKCN1DIAXkvGCS11gQnoM3ztCmEdApnEwlHc78BWHxOhrA00g+QEKjRW5QDXfQPulIEZnI7yE/EO1DlTvC3IBPPHfkDlGboBiqp9aC4LDSWIuL3mi1B3F8hPhJiBAD0UF9pMPupNRAbv+mr8i/sNImd3kthPkUgPIq5Wd4ABBXVQ4lE2k4QlF5cGddIPwH9gU8BjxQooQg4ggk4fxgIQgxMBoqpU4UCrB6BUxLdOg01J9SJp83hk5uBr8BN6ABoiAH5LwGDSh7Y7wusIxa8KBw8ihop04qkpk0ANAN0PYl0ZwlAqRkIWZZARJGQEBFZrpggiAwQ5Y7JGQtgdZa1BoODiIAB5QqE5ESaaPPZAJcZgG5QIeQKfOAvIXWVqbAu/AWDWEWRhdjAQ/eC/SvKyaaXGXGLWYIjXZMOjWAPKYbUbTA6CSnLDFlZdQFOEE/EBfeaoWQRWCFQfVGYfUOUfRpCEahUXUJCXeXaXUidhOXN9YcJXQQlXDKBDYRfCLXCRSAZRLufeOIe/Y3AAAyUENF8PrxEOAnXjYBEjiECAsOyDygIAsBMz+E2giEmPmMWOWPgj6M9kgD6KD0QBDy9hfAjyoCj0mBjzj2ukT38T6NkUgHkUUUJVUUVg0Q/G0V0X13IEMUVWN1N3NzAD+CiHaCMBtyRnAgd0WmdzjAZhKkFFSDoHgEaBTwD0cQMD2IOMoWOJOU5DOKUVj24DE1oAwCT0CUbDF3T1NG6yzwHAiOHCsMSVpEZFVkdWmgqDtnoL0jdwsGWyYwth+TukfzJiET3QoTDyakjysFTmAmcBWVB3tnPXEDAF/CcB6EWggSSG7RImvRd1Zz+UCjIgonYHIxQFsKsEoiJwQGUAPSUM9gbm5PcRRDE17E8Wwlxjf0VjWQyX1nKCcABkoQyEwEXhIEomQImzyw4k+kELEXVPAQvmYgyx7VpCNmqhqx3xeQBnyW8h9PXjYjWikH700lkLjjUVoGAR0wmVkDABlW9IBT0IzlrEgF0EgCBFDNgQjM1wIg1ljLik3iBSEz3VvkEMmD1E9gcnbmoQbKgGUS2Erkyhii4BOFSKhjqzeEfX+VLwIXtICDxQwCQ2rkJW7MWghB5wSOpzkgbWqW3zpF3zjL9nHMbIACE3oZytsuArZTIGRJp+ZI45Jqo0sbz/Y1JfYQpsg6J6QZxyFQ8Xwe1J1v4FDYDCUEF/JkFn8+Z64LBG5OYkBLpoCIwg85YwL904RMUsKJIYD7YX8iL5TL1VZ25akNBEBMotxJiO8UMq1EgstYYrBFTlSoEYEOJEAe00ljpFB+T6A0EZwM95d+AsA4oJQRIVIHhNweIUxVZTICVN86lYcHh/zEAAgUgQUnU4w0KMLDIzgCBMYsBp4RIUQRIJBMFUU44GcNwQJaBIs0B7wGBEBEIt14w8hel3ARshl6BHo7JzKyBUtopkBoB7YPcGd/wAV0ogoZYhEFZEUvtHwMBCoIxZ9jl94IVnBzg6osZX5TgkAkh6zGzaRoF2IIRmJfZpL/yJRZxXVip7hU43IDLg4qAHJDTRlwEnUdyKgh1shMBNADAJzqy38sAfNCM8AQUZhuqPz/lfzIqdL7YcrxESZcsX1ulDN8h9yGY+LaqfLB9OQ0AQgR8x9MBargIh8qpUdZxSreZCIEymQ9D+tCV1DvcwQicUDOTuTwtSg+CMgrlL1FYKhdIyBSrooQJnAogRJnB2gEanA5QSZcBhpvKkphrZA5ZdCKAuT4wAFpBgEWzwyrALxnYnKS0J48VSQ1d5BB12U+ZW1k121tjMgMAHClEvLBKrg+Su8DD2DPDu9bC+FDcuoIUoa1cKN3Z+RfD1oj9AhBtvTIYzr+E5q8BdL3CpBn5S1Uyi11yrtT9BCRi3hy9L9KBtjkM5s+BhwZwIQ+aX0dF6ApCSEggxCixJDxBisvsYLyYlB4KxtsB/IeTYR0KlAQaL0r1qzIbszEgJ4cswzardwahUVQRkIc84cMBgzZwrAwgHJICcKpdgFOL+okhODSaz45TMB5B9gxbE7YEvL7ZTJArcK+YprSkFCstGKdz7r+QqLlzcakoC7sKyKuCmCvtC8dbPZwC+5NCPo2CclyKS8SBZJWZCLotfZ8jR5CUtgEDWcki4gkzfp2ofK+zwU4D2AIjeELZDESrxBpb8yeBZBp9NZQbB6Lwe1ytKsaN4VBiqMGFioPbWaO0IUWbj1cA2b843gzDU4Khp8vrNCKimMpirAxA8Rj03Qfoa0CplAl5bjbjxdGE30IUZcOFkG6iyZld9Qmi1cl7WjxFEkpFxa6ITwHQR9Ca8MyQcZFcqGGiKo7g6GWjRE2iFkuji7GA9gVzXI+APTkN24wsHlkplMvJq4eBmj2wnAOx2Y6C4DaZfkrBml4MhHwj5Ab7Pyej8hIV4x3Lr1bGtaiNSE2MCFp9F6GBUEsCuJekGcB6aKtN2o96FwD795mQmJodaASj9FPiWGfi/g/gVRrEogOhgTbcwTFBHcryXdoT3dIAvcfckSHEjA0SRSjjMoTiJScTqg8SLsQpiSU8yTwlKT4xs8aTYlFV2ihL+b+QsFLxwcjIwqdzjQMdczes5I3H4s+YKhc40I9jmCyUakDtnJJMLtptgE0BoFHTQN7ZQcrNAKEYynbHGFpc4C8adZlaGK0Atw5Z/DKADZ146H2pnJCruJmrdp1oeVXsiVNIpijrIQxjtApAY1DURLu4sBgJ7TEIPme1JwA6nI+RVaRV+R9LNb7YBlSRJouA2xcY2D7YulTtF0poBt0AmJJLwIy5VLaB1KKkfxnJXm7kAhLxkX/lEAUjNJmhcBmRwrQr8ALK4XHUh10K6JwcjUR4BW8s98lo9Q5Z/pyRFm+RlnLtoKXw9QPrcQM4mNtlHVTrqrWyeU+rLx15lxXTv82J2AamZwfLuXeXaWxW6JgIaXwImWm7AiFxuVKAPl0Bqzpm6qsAfKnnLmXm/4pcRow6MLTYZlLwpUos6JHpEB1m5WtWEYdnfM9mSzPY1gk65YemtrfguFZXu8sA1IgUfmarG96B3svov6KtwJxLPYxEnbdkYxr4mrSWqh0D/krL+5bL7K6VSQnL8AAgvIzaXnoRMH1KmMTkdKQ3bS6JroHI9oCEikSkIVxIThHVaWnWq4uNhl5AVDwrZXNmDsUgVmXCZsoq39IQJBkBlEPw3hoWWDBaOCAYNCfqa8EY154Fjb94OqMDeDVYpBKJyVtBA6q5F48U11PKe0gRvd8MeHhxBH1dp8pqwwq9QXkAw8VWCivzF45Ib0+XNphVwIg3mEVqO5lzpAKqoAprqn1wfEY9TnKBbrr5lCtguTgMiaLDcTxiSARJoARJYqIIRJyaiBKbf6eOIDUrjXNXxOD2lnj3Ls1mNnnIhOlKU307aB4JxrGzqO496O5hgMe7gI8a2OlwdFOOqnuPeP+ORJ/whOyARPYBvosUqRUEkBJOWIXUtnZP5X5PVmbGlO+QVPQnTt1PNOoBOmQW5bfLuIzI5IdUiA76bpRq/w2xoBJ1cY7AZo2w+hfwgQlYBksL9wEGkgX2pLSPCXqgBKaFU8iHJdmFSGqjZdOEpdeG+DGj1GTGGH887i3ZYBFBOHYPIRwmuABiGYfWaOQpbMqmGOKAmO/YuALYWP8aTtTOlVFud7w9uOuBWdeOFurh+O9vcAbPDu7OKbHOEQsKbALYXBUES1XO43papPPOlU1JR9PMOjb2ZO3hD2FXpsuBXvU4zMb2uRxO42AuaB/v8A3ugfPugvdn1OuBI1IAAAfSAXXd73XTYhsrTvPYKsQ6VGD7hyELdNyax7Ng5tgI50OmdyjhZn73z09yH0fLTqAMHhNl7qHiwFngw4L1NxH5NFHtH6RRJOEjSfr3DQb0pat80EbwY8b3T3ErdAzy5jAIz1jlbomk7jb3GLbjqCgXb+LA7w3iCLX4T52C76oK7q4G7sT+79z0QZ7pnwHj7kHkSbzmgX7xnguaHl3t4RT9np3974Hv3nn+HoC/niBwX9HszTHrYyAHH7492xtAnrh57YnuYUn6acn98Snlkan8O6wbn93o9jAE9vWJ37ntnw9ivia5NkL8PqZSP1H3XJhvmQuS2hDxDeRuWNsLRlhiVpBCMYCXgvhQjQ0PD2l120Rl5j09jUhuA8Ej53FvmWmjIRQGiemPhiMfRuCeLd15Mnm24+46BR4tRdsTRN4igPRA3XHsxCIP4KURJqUP4FJ0EiMcEnfLJt3QQz3b3X3IJTULmHzC6gEYdtPACGApJpMow1oOMnGCiQ55UwDud0JmC9A5hDAvoSMPeHDxAx34n/ZkCt2hS4gtQBgDAVEBIACAGAAgFUCqFaAqg0AooCIHKDbQMApQooBgICRIAqg/gtAbgaoBVAkAogUoCIJwKlCUspQaAkgTqEgBygSAnQZ/q0AiD38mB0QEgFKClACDaAnQeUAwEpa0B2g6OWga0FECdABA7QP4O0EhDECMBlodQMGjJTGtISdAXGMOR9BSCW4+2WPAuBqjvwByRAoAQYAADeWnOsEgCcB9B7ypeTGLQEfDQDcATgaKOaDrCI9YewQuuvcFoARD/8YQpIbgVd7BCkAphSgNkAEwYAchyGPIQ2TrAORaAsEDAP+HIhAhLoF4RAAqlEAogcht0HjsEOqG1DzA7kEgK0JqgdCL2QkboUDFqGQRdI2QbgHtEGHtDkhFQyAHWHPBRCOwMUcVo0IoA5CGwowyoXsFwBzDaQCYQBjkJsBacGyQQhslcKWEIEaodQWxtsMmHiRL0+8OYXWF2HXClhIGckMMLywfDrhdYHeAGT2jbC5h8YFEJeipD0B5U6TPoBmHUBGN5If/Q4GqlIItNSE6He5uQQ0DvDzhVwusPoxIDbC6oCiC8LiM+FLCtg3uQnBYDmH3C2A2woWM8JmEERgk1wgAL7/DIAlwz4XWFuEoh6RRIrgHWD6FqEvB7QrkZUO+GIBfhXQikYCMFLAjWRwojmJUEsBBR0cr7TkLXRaZE0jSGkOIJCD7hWEuApkbkoeDErkQJobzJIPJVCAeolk2DApCLDuL+lLoYQIxPnE1Z3IGcRQE4KSMvDcESmuAMUuUy/CzNyRvIwkcSNOABjIxAIjzhgFK5CjchfvPEZUKpEdJaR4owUdsM1xWA2RVwzkXiJ5EAj+RuY4UfUM8pNDLwj4MwqQHjH4jpRsoyUUsKBF7xlRSwqsZAXai4MqApAWnNIPaAaA5QKoAAKQJUckmkVAAmEZi7p2AOjDJrlmrjF4cYfXAmqgEsQaAtBY4nEa2IJHpMYxusWGI2IzHtIaRdIh4cKJ3A9iEuhYhsuyK043dgh+wsIaKJTF1hAgnQNAFEEtytABAnQBgBEEoFShWgtAaBDILlACBaBaAUQFKFUAKDWgrQUUJwPYSiBWgyGVoEIM6CtBuBtAOUKKECBUCVQLAkgHKClCRiqh5EW8XWLwYkAhK2UD8I0L5A5DSxSw4ppBXDwHNTi5xfEpcSTysT0x7E9SB+E6LV4pcOQpGkJM/HiM0YM0OYN2JrEyiuAQJDkVyLrAcTDiXE8UtiV4kEkiS/iQSfKIICkgLAYktDjkLib7jAgsknKPJI0iKTycd4rgHEzxHFiARmkjEtxIqZ6TaOAkrgGxMqEmTRJtk5SZAABLWTQp9k2AI5NhhhSASbk9Sc4lLD7E3E8wDxF4lJC+JDJAU6ScFLMmhSchooSKeJLkkKTqJSkyyW5K06PjHxkgiABtBCLVZPBbQ9+C4KAEYCqMNmTKAezQynlBCEgwIcsLBRvh82uEWkI4OiEsAsBj4X2LgCSGdA6pnU8ARHl6kfpnBCMfQEAA= -->

<!-- internal state end -->

gemini-code-assist · 2026-02-09T08:27:19Z

Summary of Changes

Hello @zhiyuan1i, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive documentation for the Kimi Context Parallel (KCP) feature, clarifying its architecture and operational differences between GDN and KDA. A critical precision fix ensures the stability of the M matrix chain multiplication by enforcing float32 arithmetic. Additionally, the context parallel tests for convolution, GDN, and KDA have been significantly refactored and enhanced, improving their robustness, clarity, and coverage, particularly for KDA's gate handling and L2 normalization.

Highlights

New KCP Documentation: A new KCP.md file has been added, providing comprehensive documentation for the Kimi Context Parallel (KCP) feature. This includes details on its core recurrence, differences in gate handling between GDN and KDA, CP data flow, pre-process forward/backward stages, and code flow examples.
FP32 Precision for M Matrix Chain: The fla/ops/cp/chunk_delta_h.py file has been modified to explicitly cast intermediate M matrix chain multiplications to float32. This critical change prevents accumulated precision loss, especially when operating with bf16 data types, ensuring numerical stability in both forward and backward pre-processing kernels and the merge kernel.
Cleanup and Refinement of CP Tests: Context Parallel tests for convolution, GDN, and KDA have undergone significant refactoring. This includes centralizing the assert_close utility, configuring logging for better output, adjusting test input scales, updating assertion parameters, and removing direct torchrun execution blocks. Test configurations now default to bfloat16 and feature larger sequence lengths, with the KDA tests specifically enhanced to use naive_recurrent_kda as a more robust ground truth reference, including explicit handling of gate computation and L2 normalization.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

fla/ops/cp/KCP.md
- Added new documentation file detailing KCP architecture, recurrence, gate handling, data flow, and precision notes.
fla/ops/cp/chunk_delta_h.py
- Modified pre_process_fwd_kernel_merged to cast b_m_i and b_m to tl.float32 during M matrix chain multiplication.
- Modified merge_fwd_bwd_kernel to cast b_ag_m and b_h to tl.float32 during the merge operation.
- Modified pre_process_bwd_kernel_merged to cast b_m_i and b_m to tl.float32 during M matrix chain multiplication.
- Removed global DTYPE = torch.float32 variable.
- Explicitly set hm, initial_state, dhm, and dht to torch.float32 during tensor initialization.
- Added blank lines for improved readability in chunk_gated_delta_rule_bwd_dhu_pre_process.
tests/context_parallel/test_cp_conv.py
- Imported logging and assert_close from fla.utils.
- Configured logging.basicConfig in the main script and init_distributed.
- Removed local assert_close helper function.
- Multiplied x_global by 100 in run_cp_conv_test_worker to increase input scale.
- Updated assert_close calls to use ratio=0.001 instead of atol=1e-3.
- Removed the if __name__ == "__main__": block for torchrun execution.
tests/context_parallel/test_cp_gdn.py
- Updated docstring to provide detailed implementation hierarchy and relationships for GDN.
- Imported logging and assert_close from fla.utils.
- Configured logging.basicConfig in the main script and init_distributed.
- Removed local assert_close helper function.
- Modified global data generation to be performed on rank 0 and then broadcast.
- Updated assert_close calls to use ratio=2e-3 and warning=False.
- Changed default dtype to torch.bfloat16 in run_cp_test_with_spawn.
- Increased T and lengths values in test cases (e.g., T=10240, lengths=[3000, 4000, 3240]).
- Added test_cp8_single_sequence.
- Removed the if __name__ == "__main__": block for torchrun execution.
tests/context_parallel/test_cp_kda.py
- Updated docstring with extensive details on KDA implementation hierarchy, gate functions, and test architecture.
- Imported logging, assert_close, naive_kda_lowerbound_gate, and naive_recurrent_kda.
- Configured logging.basicConfig in the main script and init_distributed.
- Removed local assert_close helper function.
- Modified run_cp_kda_test_worker to accept use_gate_in_kernel, safe_gate, and lower_bound parameters.
- Refactored global data generation to be done on rank 0 and broadcast, including A_log_global and dt_bias_global.
- Revamped reference run logic to use naive_recurrent_kda for ground truth, explicitly handling L2 normalization and gate computation for each sequence.
- Updated chunk_kda call in CP Forward to pass use_qk_l2norm_in_kernel, use_gate_in_kernel, safe_gate, lower_bound, A_log, and dt_bias.
- Updated assert_close calls to use ratio=5e-2 and warning=False.
- Changed default dtype to torch.bfloat16 in run_cp_test_with_spawn.
- Modified run_cp_test_with_spawn to pass new gate-related arguments to the worker.
- Defined GATE_KWARGS and applied them to all test scenarios.
- Increased T and lengths values in test cases (e.g., T=10240, lengths=[3000, 4000, 3240]).
- Added test_cp8_single_sequence.
- Removed the if __name__ == "__main__": block for torchrun execution.

Activity

A new markdown file KCP.md was added to document the Kimi Context Parallel (KCP) feature, including its core recurrence, gate handling for GDN and KDA, CP architecture, data flow, pre-process stages, and code flow examples.
The precision of the M matrix chain multiplication in chunk_delta_h.py was explicitly set to float32 to prevent precision loss, affecting both forward and backward kernels.
The DTYPE global variable was removed from chunk_delta_h.py, and tensor initializations were updated to use torch.float32 directly.
Testing infrastructure was improved by centralizing the assert_close utility and configuring logging in test_cp_conv.py and test_cp_gdn.py.
Test cases in test_cp_conv.py were adjusted with larger input scales and updated assertion parameters.
The if __name__ == "__main__": blocks for direct torchrun execution were removed from test_cp_conv.py and test_cp_gdn.py, streamlining test execution via pytest.
The test_cp_gdn.py and test_cp_kda.py files received updated docstrings for better clarity on implementation details and test architecture.
Test data generation and dtype were updated to bfloat16 and larger sequence lengths in test_cp_gdn.py and test_cp_kda.py.
The KDA context parallel tests in test_cp_kda.py were significantly enhanced to use naive_recurrent_kda as a more robust reference, with explicit handling of L2 normalization and gate computation, and new test parameters for gate configuration.
A new test case test_cp8_single_sequence was added for KDA.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces comprehensive documentation for Kimi Context Parallel (KCP), fixes a critical precision issue in the M matrix chain computation, and performs a major cleanup and enhancement of the context parallel tests. The new KCP.md file provides an excellent overview of the architecture. The precision fix in chunk_delta_h.py to enforce fp32 accumulation is crucial for numerical stability. The test refactoring in test_cp_conv.py, test_cp_gdn.py, and test_cp_kda.py significantly improves their robustness, coverage, and maintainability by using a centralized assert_close utility, more rigorous reference implementations, and better test configurations. Overall, this is a high-quality contribution that improves correctness, documentation, and testing. I have one minor suggestion for a typo in the documentation.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

fla/ops/cp/chunk_delta_h.py (1)
268-272: ⚠️ Potential issue | 🟠 Major

fp32 precision fix not applied to the non-merged stage2 kernel, which is used in benchmarks.

The merged kernels (pre_process_fwd_kernel_merged at line 508 and pre_process_bwd_kernel_merged at line 1010) correctly cast b_m_i and b_m to tl.float32 for the M matrix chain multiply. However, the non-merged pre_process_fwd_bwd_kernel_stage2 still uses b_w.dtype:
b_m = tl.dot(b_m_i.to(b_w.dtype), b_m.to(b_w.dtype))
While the production code paths use the merged kernels (lines 1050, 1120), the non-merged stage2 kernel is explicitly called from benchmarks/cp/benchmark_chunk_delta_h_kernels.py (lines 142, 174, 304) for performance and correctness benchmarking. This inconsistency means benchmarks will measure the non-fixed version while production uses the fixed version, leading to inconsistent precision characteristics.
Proposed fix
-        b_m = tl.dot(b_m_i.to(b_w.dtype), b_m.to(b_w.dtype))
+        b_m = tl.dot(b_m_i.to(tl.float32), b_m.to(tl.float32))

🤖 Fix all issues with AI agents

In `@fla/ops/cp/KCP.md`:
- Line 5: Fix the typo and sentence punctuation in the CP description: change
"introduce" to "introduced" in the sentence "CP was first introduce in PR
`#691`...", split the clause after the PR link into a new sentence by replacing
the comma before "Special thanks to" with a period, and capitalize "Special" so
it reads "Special thanks to [mdy666]...".

🧹 Nitpick comments (7)

tests/context_parallel/test_cp_conv.py (2)

113-113: Scaling input by 100 — intentional but worth a comment.

x_global is multiplied by 100, which increases the dynamic range and stress-tests numerical precision. A brief inline comment explaining why (e.g., "amplify values to stress-test conv precision under CP splitting") would help future readers.

201-204: Ratio of 0.001 is quite tight — verify this passes reliably in CI.

Using ratio=0.001 for all four checks (output, dx, dw, db) is a strict tolerance. The assert_close helper in fla/utils.py will warn (instead of fail) in CI when the ratio is under 0.01, but outside CI this will hard-assert. If convolution tests are flaky, consider whether gradient checks (dw, db) need a slightly relaxed ratio.
fla/ops/cp/KCP.md (1)
11-14: Add language identifiers to pseudocode fenced blocks.

Markdownlint flags multiple bare fenced code blocks (MD040). For pseudocode/math blocks, use a language identifier like text or math to silence the lint warnings and improve rendering in some markdown processors. This applies to all ~11 unlabeled code blocks in this file.
Example fix
-```
+```text
 S_t = decay(g_t) * S_{t-1} + beta_t * k_t (x) (v_t - S_{t-1} @ k_t)
 o_t = q_t^T @ S_t
-```
+```
tests/context_parallel/test_cp_gdn.py (1)

354-365: CP8 test requires 8 GPUs — likely won't run in most CI environments.

This test with T=65536 and 8 GPUs is a comprehensive stress test but will be skipped in most CI setups. Consider documenting this as a manual/nightly-only test, or adding a marker (e.g., @pytest.mark.slow).

tests/context_parallel/test_cp_kda.py (3)

216-236: Imports inside worker function are fine for spawn, but triton and math could be top-level.

import triton and import math are placed inside the worker function body. While this works correctly with spawn (fresh process re-imports), these are stable stdlib/dependency imports that could live at the module level for clarity. This is a minor style point only.

427-442: Ratio of 5e-2 is relatively relaxed — consider documenting the expected tolerance.

The KDA tests use ratio=5e-2 compared to GDN's 2e-3 and conv's 0.001. This is 25-50× more relaxed. Given the complexity of KDA (per-dim gating, L2 normalization, gate computation), a wider tolerance may be justified, but a brief comment explaining why would help future maintainers understand this isn't just a "make-the-test-pass" number.

549-561: CP8 test with T=65536 is a heavy test — same observation as GDN CP8.

Requires 8 GPUs and processes 65K tokens. Consider adding a @pytest.mark.slow marker.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

[KCP] add KCP.md; fix fp32 precision in M matrix chain; cleanup CP tests

2ebcb55

gemini-code-assist bot reviewed Feb 9, 2026

View reviewed changes

Comment thread fla/ops/cp/KCP.md Outdated

coderabbitai bot reviewed Feb 9, 2026

View reviewed changes

Comment thread fla/ops/cp/KCP.md Outdated

Update fla/ops/cp/KCP.md

8e09173

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

zhiyuan1i merged commit 99db89b into main Feb 9, 2026
4 checks passed

zhiyuan1i deleted the lzy/kcp branch February 9, 2026 16:04

coderabbitai bot mentioned this pull request Feb 11, 2026

[KDA] Refractor interface #744

Merged

coderabbitai bot mentioned this pull request Mar 10, 2026

[KDA][GDN] Support transpose_state_layout for [V,K] state memory layout #776

Merged

This was referenced Apr 4, 2026

Update README for cp #811

Merged

[CP] Enable KCP for DPLR #822

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[KCP] add KCP.md; fix fp32 precision in M matrix chain; cleanup CP tests#740

[KCP] add KCP.md; fix fp32 precision in M matrix chain; cleanup CP tests#740
zhiyuan1i merged 2 commits intomainfrom
lzy/kcp

zhiyuan1i commented Feb 9, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 9, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot commented Feb 9, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zhiyuan1i commented Feb 9, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

gemini-code-assist bot commented Feb 9, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zhiyuan1i commented Feb 9, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 9, 2026 •

edited

Loading