
[KCP] add KCP.md; fix fp32 precision in M matrix chain; cleanup CP tests#740

Merged
zhiyuan1i merged 2 commits into main from lzy/kcp
Feb 9, 2026
Conversation

Collaborator

zhiyuan1i commented Feb 9, 2026

Summary by CodeRabbit

  • Documentation

    • Added comprehensive KCP (Kimi Context Parallel) reference covering architecture, data flow, gating, numeric stability, and optimizations.
  • Improvements

    • Switched intermediate computations to consistent fp32 accumulation to improve numerical stability.
    • Strengthened cross-rank synchronization for context-parallel flows.
  • Tests

    • Expanded multi-GPU CP test coverage, added gate-aware scenarios, updated orchestration and verification to CP-aware comparisons, and standardized assertion handling.

Contributor

coderabbitai bot commented Feb 9, 2026

Walkthrough

Adds a KCP design doc and updates CP internals and tests: FP32 intermediate accumulation and explicit cross-rank all_gather in chunk_delta_h, plus distributed, gate-aware CP tests and revised test harness/assertions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **KCP Documentation**<br>`fla/ops/cp/KCP.md` | New comprehensive reference for Kimi Context Parallel (KCP): architecture, CP dataflow, gating (GDN/KDA), WY/M matrix math, cross-rank merging, and memory/precision notes. |
| **CP Core Implementation**<br>`fla/ops/cp/chunk_delta_h.py` | Switch intermediate linear algebra to fp32 accumulators; replace dtype aliasing with explicit float32 for state buffers; add `all_gather_into_tensor` sync points after pre-processing; adapt merged forward/backward kernel calls to use the gathered fp32 state. |
| **CP Conv Test**<br>`tests/context_parallel/test_cp_conv.py` | Removed local `assert_close` and inline torchrun harness; import centralized `assert_close`; adjust input scaling and replace the detailed diff helper with a ratio-based tolerance. |
| **CP GDN Tests**<br>`tests/context_parallel/test_cp_gdn.py` | Rewrote tests for multi-GPU CP: spawn/distributed setup, explicit chunk partitioning, inter-rank state passing, CP-aware forward/backward verification, expanded CP scenarios, and dtype updates (more bfloat16). |
| **CP KDA Tests**<br>`tests/context_parallel/test_cp_kda.py` | Added gating parameters to test runners (`use_gate_in_kernel`, `safe_gate`, `lower_bound`); broadcast gate constants (`A_log`, `dt_bias`); implement gate-aware reference and CP paths; introduce `GATE_KWARGS`. |
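To see why the fp32 fix in `chunk_delta_h.py` matters, the effect can be reproduced outside Triton. The NumPy sketch below (illustrative sizes and names, not the kernel's actual buffers) chains many small matrix products the way the M matrix chain does, once with the running product rounded to half precision after every step and once with an fp32 accumulator:

```python
import numpy as np

rng = np.random.default_rng(0)

# A chain of K small matrix products, mimicking the M matrix chain in
# chunk_delta_h. Shapes and names are illustrative, not the kernel's.
K, N = 64, 32
mats = [(rng.standard_normal((N, N)) / np.sqrt(N)).astype(np.float16) for _ in range(K)]

# Reference: the same (fp16-quantized) inputs, accumulated in float64.
ref = np.eye(N, dtype=np.float64)
for m in mats:
    ref = m.astype(np.float64) @ ref

# Low-precision chain: the running product is rounded back to fp16 after
# every step, analogous to keeping intermediates in a 16-bit dtype.
lo = np.eye(N, dtype=np.float16)
for m in mats:
    lo = (m @ lo).astype(np.float16)

# fp32 accumulation: same inputs, but the running product stays in
# float32, which is the behavior the PR enforces in the kernels.
hi = np.eye(N, dtype=np.float32)
for m in mats:
    hi = m.astype(np.float32) @ hi

err_lo = np.abs(lo.astype(np.float64) - ref).max()
err_hi = np.abs(hi.astype(np.float64) - ref).max()
# err_hi should be far smaller than err_lo: per-step rounding compounds
# across the chain, while the fp32 accumulator only pays once per input.
```

The same compounding argument applies to the Triton kernels, where the chain length grows with the number of chunks per rank.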

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant RankA as Rank A
    participant RankB as Rank B
    participant AllGather as All-Gather Sync
    participant PreProc as Pre-process Kernel
    participant MainKernel as Main Kernel

    RankA->>PreProc: compute local m, hm (fp32)
    RankB->>PreProc: compute local m, hm (fp32)

    PreProc->>AllGather: all_gather_into_tensor(ag_hm, ag_dhm)
    AllGather->>RankA: ag_hm, ag_dhm
    AllGather->>RankB: ag_hm, ag_dhm

    RankA->>MainKernel: main_kernel(ag_hm, fp32 state, gated inputs)
    RankB->>MainKernel: main_kernel(ag_hm, fp32 state, gated inputs + inter-rank decay)

    MainKernel->>RankA: outputs, grads
    MainKernel->>RankB: outputs, grads
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Suggested reviewers

  • yzhangcs
  • Nathancgy

Poem

🐇 I hopped through ranks with buffers held tight,
fp32 for the sums so the numbers stay bright,
gates gently swing as states come to play,
all-gathers align each parallel day,
a rabbit's small cheer for the CP display.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately summarizes the three main changes: adding KCP.md documentation, fixing fp32 precision in M matrix chain computations, and cleaning up context-parallel tests. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 84.85%, above the required threshold of 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
fla/ops/cp/KCP.md (1)

11-14: Add language identifiers to fenced code blocks.

markdownlint MD040 flags these blocks; add a language (e.g., text) to silence the warning and improve readability.

✅ Suggested change (example)

````diff
-```
+```text
 S_t = decay(g_t) * S_{t-1} + beta_t * k_t (x) (v_t - S_{t-1} @ k_t)
 o_t = q_t^T @ S_t
````

Apply similarly to the other fenced blocks listed above.


Also applies to: 21-23, 66-76, 84-90, 94-99, 104-108, 116-122, 126-131, 136-140, 255-258, 265-267




@gemini-code-assist
Contributor

Summary of Changes

Hello @zhiyuan1i, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request introduces comprehensive documentation for the Kimi Context Parallel (KCP) feature, clarifying its architecture and operational differences between GDN and KDA. A critical precision fix ensures the stability of the M matrix chain multiplication by enforcing float32 arithmetic. Additionally, the context parallel tests for convolution, GDN, and KDA have been significantly refactored and enhanced, improving their robustness, clarity, and coverage, particularly for KDA's gate handling and L2 normalization.

Highlights

  • New KCP Documentation: A new KCP.md file has been added, providing comprehensive documentation for the Kimi Context Parallel (KCP) feature. This includes details on its core recurrence, differences in gate handling between GDN and KDA, CP data flow, pre-process forward/backward stages, and code flow examples.
  • FP32 Precision for M Matrix Chain: The fla/ops/cp/chunk_delta_h.py file has been modified to explicitly cast intermediate M matrix chain multiplications to float32. This critical change prevents accumulated precision loss, especially when operating with bf16 data types, ensuring numerical stability in both forward and backward pre-processing kernels and the merge kernel.
  • Cleanup and Refinement of CP Tests: Context Parallel tests for convolution, GDN, and KDA have undergone significant refactoring. This includes centralizing the assert_close utility, configuring logging for better output, adjusting test input scales, updating assertion parameters, and removing direct torchrun execution blocks. Test configurations now default to bfloat16 and feature larger sequence lengths, with the KDA tests specifically enhanced to use naive_recurrent_kda as a more robust ground truth reference, including explicit handling of gate computation and L2 normalization.
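The core recurrence these reference paths exercise is the gated delta rule quoted in KCP.md (`S_t = decay(g_t) * S_{t-1} + beta_t * k_t (x) (v_t - S_{t-1} @ k_t)`). A minimal NumPy sketch (illustrative shapes and names, not the repo's `naive_recurrent_kda` API) shows the invariant the CP tests rely on: splitting the sequence into chunks and passing the final state forward reproduces the full-sequence output.

```python
import numpy as np

def gated_delta_rule(q, k, v, g, beta, S0=None):
    """Sequential sketch of the gated delta rule from KCP.md.

    Illustrative shapes: q, k are (T, Dk); v is (T, Dv); g and beta are
    (T,), with g[t] already the per-step scalar decay in (0, 1].
    The state S has shape (Dk, Dv).
    """
    T, Dk = k.shape
    Dv = v.shape[1]
    S = np.zeros((Dk, Dv)) if S0 is None else S0.copy()
    o = np.zeros((T, Dv))
    for t in range(T):
        # S_t = decay(g_t) * S_{t-1} + beta_t * k_t (x) (v_t - S_{t-1}^T k_t)
        S = g[t] * S + beta[t] * np.outer(k[t], v[t] - S.T @ k[t])
        o[t] = S.T @ q[t]  # o_t = q_t^T @ S_t
    return o, S

rng = np.random.default_rng(1)
T, Dk, Dv = 8, 4, 6
q, k = rng.standard_normal((T, Dk)), rng.standard_normal((T, Dk))
v = rng.standard_normal((T, Dv))
g = rng.uniform(0.8, 1.0, T)      # scalar per-step decay (GDN-style)
beta = rng.uniform(0.0, 1.0, T)

# Full-sequence run vs. two CP "ranks" chaining the state: identical.
o_full, _ = gated_delta_rule(q, k, v, g, beta)
o1, S_mid = gated_delta_rule(q[:4], k[:4], v[:4], g[:4], beta[:4])
o2, _ = gated_delta_rule(q[4:], k[4:], v[4:], g[4:], beta[4:], S0=S_mid)
assert np.allclose(o_full, np.concatenate([o1, o2]))
```

KDA replaces the scalar decay with per-dimension gating and adds L2 normalization of q/k, but the chunk-and-chain-state structure is the same.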


Changelog
  • fla/ops/cp/KCP.md
    • Added new documentation file detailing KCP architecture, recurrence, gate handling, data flow, and precision notes.
  • fla/ops/cp/chunk_delta_h.py
    • Modified pre_process_fwd_kernel_merged to cast b_m_i and b_m to tl.float32 during M matrix chain multiplication.
    • Modified merge_fwd_bwd_kernel to cast b_ag_m and b_h to tl.float32 during the merge operation.
    • Modified pre_process_bwd_kernel_merged to cast b_m_i and b_m to tl.float32 during M matrix chain multiplication.
    • Removed global DTYPE = torch.float32 variable.
    • Explicitly set hm, initial_state, dhm, and dht to torch.float32 during tensor initialization.
    • Added blank lines for improved readability in chunk_gated_delta_rule_bwd_dhu_pre_process.
  • tests/context_parallel/test_cp_conv.py
    • Imported logging and assert_close from fla.utils.
    • Configured logging.basicConfig in the main script and init_distributed.
    • Removed local assert_close helper function.
    • Multiplied x_global by 100 in run_cp_conv_test_worker to increase input scale.
    • Updated assert_close calls to use ratio=0.001 instead of atol=1e-3.
    • Removed the if __name__ == "__main__": block for torchrun execution.
  • tests/context_parallel/test_cp_gdn.py
    • Updated docstring to provide detailed implementation hierarchy and relationships for GDN.
    • Imported logging and assert_close from fla.utils.
    • Configured logging.basicConfig in the main script and init_distributed.
    • Removed local assert_close helper function.
    • Modified global data generation to be performed on rank 0 and then broadcast.
    • Updated assert_close calls to use ratio=2e-3 and warning=False.
    • Changed default dtype to torch.bfloat16 in run_cp_test_with_spawn.
    • Increased T and lengths values in test cases (e.g., T=10240, lengths=[3000, 4000, 3240]).
    • Added test_cp8_single_sequence.
    • Removed the if __name__ == "__main__": block for torchrun execution.
  • tests/context_parallel/test_cp_kda.py
    • Updated docstring with extensive details on KDA implementation hierarchy, gate functions, and test architecture.
    • Imported logging, assert_close, naive_kda_lowerbound_gate, and naive_recurrent_kda.
    • Configured logging.basicConfig in the main script and init_distributed.
    • Removed local assert_close helper function.
    • Modified run_cp_kda_test_worker to accept use_gate_in_kernel, safe_gate, and lower_bound parameters.
    • Refactored global data generation to be done on rank 0 and broadcast, including A_log_global and dt_bias_global.
    • Revamped reference run logic to use naive_recurrent_kda for ground truth, explicitly handling L2 normalization and gate computation for each sequence.
    • Updated chunk_kda call in CP Forward to pass use_qk_l2norm_in_kernel, use_gate_in_kernel, safe_gate, lower_bound, A_log, and dt_bias.
    • Updated assert_close calls to use ratio=5e-2 and warning=False.
    • Changed default dtype to torch.bfloat16 in run_cp_test_with_spawn.
    • Modified run_cp_test_with_spawn to pass new gate-related arguments to the worker.
    • Defined GATE_KWARGS and applied them to all test scenarios.
    • Increased T and lengths values in test cases (e.g., T=10240, lengths=[3000, 4000, 3240]).
    • Added test_cp8_single_sequence.
    • Removed the if __name__ == "__main__": block for torchrun execution.
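The explicit chunk partitioning mentioned in the changelog above can be sketched with a small helper (hypothetical; the tests' actual partitioning may differ, e.g. for variable-length sequences):

```python
def partition_for_cp(total_len, world_size):
    """Contiguous per-rank (start, end) chunks over a global sequence.

    Illustrative of explicit chunk partitioning in the CP tests; this is
    not the repo's actual helper. Earlier ranks absorb the remainder so
    chunk sizes differ by at most one token.
    """
    base, rem = divmod(total_len, world_size)
    bounds, start = [], 0
    for r in range(world_size):
        size = base + (1 if r < rem else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

# Example: T=10240 tokens over 4 ranks -> four equal chunks of 2560.
chunks = partition_for_cp(10240, 4)
print(chunks)  # [(0, 2560), (2560, 5120), (5120, 7680), (7680, 10240)]
```

Each rank then runs the kernel on its slice and forwards its final recurrent state to the next rank (or, in the merged scheme, all-gathers the per-rank states).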
Activity
  • A new markdown file KCP.md was added to document the Kimi Context Parallel (KCP) feature, including its core recurrence, gate handling for GDN and KDA, CP architecture, data flow, pre-process stages, and code flow examples.
  • The precision of the M matrix chain multiplication in chunk_delta_h.py was explicitly set to float32 to prevent precision loss, affecting both forward and backward kernels.
  • The DTYPE global variable was removed from chunk_delta_h.py, and tensor initializations were updated to use torch.float32 directly.
  • Testing infrastructure was improved by centralizing the assert_close utility and configuring logging in test_cp_conv.py and test_cp_gdn.py.
  • Test cases in test_cp_conv.py were adjusted with larger input scales and updated assertion parameters.
  • The if __name__ == "__main__": blocks for direct torchrun execution were removed from test_cp_conv.py and test_cp_gdn.py, streamlining test execution via pytest.
  • The test_cp_gdn.py and test_cp_kda.py files received updated docstrings for better clarity on implementation details and test architecture.
  • Test data generation and dtype were updated to bfloat16 and larger sequence lengths in test_cp_gdn.py and test_cp_kda.py.
  • The KDA context parallel tests in test_cp_kda.py were significantly enhanced to use naive_recurrent_kda as a more robust reference, with explicit handling of L2 normalization and gate computation, and new test parameters for gate configuration.
  • A new test case test_cp8_single_sequence was added for KDA.

gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive documentation for Kimi Context Parallel (KCP), fixes a critical precision issue in the M matrix chain computation, and performs a major cleanup and enhancement of the context parallel tests. The new KCP.md file provides an excellent overview of the architecture. The precision fix in chunk_delta_h.py to enforce fp32 accumulation is crucial for numerical stability. The test refactoring in test_cp_conv.py, test_cp_gdn.py, and test_cp_kda.py significantly improves their robustness, coverage, and maintainability by using a centralized assert_close utility, more rigorous reference implementations, and better test configurations. Overall, this is a high-quality contribution that improves correctness, documentation, and testing. I have one minor suggestion for a typo in the documentation.

Comment thread fla/ops/cp/KCP.md Outdated

coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
fla/ops/cp/chunk_delta_h.py (1)

268-272: ⚠️ Potential issue | 🟠 Major

fp32 precision fix not applied to the non-merged stage2 kernel, which is used in benchmarks.

The merged kernels (pre_process_fwd_kernel_merged at line 508 and pre_process_bwd_kernel_merged at line 1010) correctly cast b_m_i and b_m to tl.float32 for the M matrix chain multiply. However, the non-merged pre_process_fwd_bwd_kernel_stage2 still uses b_w.dtype:

```python
b_m = tl.dot(b_m_i.to(b_w.dtype), b_m.to(b_w.dtype))
```

While the production code paths use the merged kernels (lines 1050, 1120), the non-merged stage2 kernel is explicitly called from benchmarks/cp/benchmark_chunk_delta_h_kernels.py (lines 142, 174, 304) for performance and correctness benchmarking. This inconsistency means benchmarks will measure the non-fixed version while production uses the fixed version, leading to inconsistent precision characteristics.

Proposed fix

```diff
-        b_m = tl.dot(b_m_i.to(b_w.dtype), b_m.to(b_w.dtype))
+        b_m = tl.dot(b_m_i.to(tl.float32), b_m.to(tl.float32))
```
🤖 Fix all issues with AI agents
In `@fla/ops/cp/KCP.md`:
- Line 5: Fix the typo and sentence punctuation in the CP description: change
"introduce" to "introduced" in the sentence "CP was first introduce in PR
`#691`...", split the clause after the PR link into a new sentence by replacing
the comma before "Special thanks to" with a period, and capitalize "Special" so
it reads "Special thanks to [mdy666]...".
🧹 Nitpick comments (7)
tests/context_parallel/test_cp_conv.py (2)

113-113: Scaling input by 100 — intentional but worth a comment.

x_global is multiplied by 100, which increases the dynamic range and stress-tests numerical precision. A brief inline comment explaining why (e.g., "amplify values to stress-test conv precision under CP splitting") would help future readers.


201-204: Ratio of 0.001 is quite tight — verify this passes reliably in CI.

Using ratio=0.001 for all four checks (output, dx, dw, db) is a strict tolerance. The assert_close helper in fla/utils.py will warn (instead of fail) in CI when the ratio is under 0.01, but outside CI this will hard-assert. If convolution tests are flaky, consider whether gradient checks (dw, db) need a slightly relaxed ratio.
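For readers unfamiliar with ratio-based tolerances, here is a minimal sketch of the idea (illustrative only; `fla.utils.assert_close` has its own signature and CI-dependent warning behavior):

```python
import numpy as np

def assert_close_ratio(ref, out, ratio):
    """Hypothetical ratio-based check in the spirit of fla's assert_close:
    the aggregate error must stay below `ratio` of the reference
    magnitude, rather than a fixed per-element atol."""
    err = np.abs(ref - out).sum()
    scale = np.abs(ref).sum() + 1e-12  # guard against all-zero references
    rel = err / scale
    assert rel < ratio, f"relative error {rel:.2e} exceeds ratio {ratio:.0e}"
    return rel

ref = np.linspace(-1.0, 1.0, 1000)
out = ref + 1e-6 * np.random.default_rng(0).standard_normal(1000)
rel = assert_close_ratio(ref, out, ratio=0.001)
```

Unlike a per-element `atol`, this form scales naturally with the magnitude of the data, which is why amplifying the test inputs (the `* 100` above) is a meaningful stress test rather than a free pass.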

fla/ops/cp/KCP.md (1)

11-14: Add language identifiers to pseudocode fenced blocks.

Markdownlint flags multiple bare fenced code blocks (MD040). For pseudocode/math blocks, use a language identifier like text or math to silence the lint warnings and improve rendering in some markdown processors. This applies to all ~11 unlabeled code blocks in this file.

Example fix

````diff
-```
+```text
 S_t = decay(g_t) * S_{t-1} + beta_t * k_t (x) (v_t - S_{t-1} @ k_t)
 o_t = q_t^T @ S_t
-```
+```
````
tests/context_parallel/test_cp_gdn.py (1)

354-365: CP8 test requires 8 GPUs — likely won't run in most CI environments.

This test with T=65536 and 8 GPUs is a comprehensive stress test but will be skipped in most CI setups. Consider documenting this as a manual/nightly-only test, or adding a marker (e.g., @pytest.mark.slow).

tests/context_parallel/test_cp_kda.py (3)

216-236: Imports inside worker function are fine for spawn, but triton and math could be top-level.

import triton and import math are placed inside the worker function body. While this works correctly with spawn (fresh process re-imports), these are stable stdlib/dependency imports that could live at the module level for clarity. This is a minor style point only.


427-442: Ratio of 5e-2 is relatively relaxed — consider documenting the expected tolerance.

The KDA tests use ratio=5e-2 compared to GDN's 2e-3 and conv's 0.001. This is 25-50× more relaxed. Given the complexity of KDA (per-dim gating, L2 normalization, gate computation), a wider tolerance may be justified, but a brief comment explaining why would help future maintainers understand this isn't just a "make-the-test-pass" number.


549-561: CP8 test with T=65536 is a heavy test — same observation as GDN CP8.

Requires 8 GPUs and processes 65K tokens. Consider adding a @pytest.mark.slow marker.

Comment thread fla/ops/cp/KCP.md Outdated
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
zhiyuan1i merged commit 99db89b into main Feb 9, 2026
4 checks passed
zhiyuan1i deleted the lzy/kcp branch February 9, 2026 16:04
coderabbitai bot mentioned this pull request Feb 11, 2026
This was referenced Apr 4, 2026
