feat: TQ3 compressed ring all-reduce for TP bandwidth reduction by influenist · Pull Request #181 · turboderp-org/exllamav3

influenist · 2026-04-01T09:10:32Z

Summary

Adds optional TQ3 compression to the Native TP backend's ring all-reduce, reducing inter-GPU data transfer by ~6.4x (fp16 → 2.5 bits/value).

Independent of PR #180 — this is a communication optimization, not a quantization format.

How it works

Each ring iteration:

Compress: fp16 → TQ3 (10 bytes per 32 values) before writing to shared memory
Decompress+accumulate: fused read+decode+add from predecessor's slot
Broadcast: forward accumulated result (uncompressed)

When it helps

Interconnect	Bandwidth	TQ3 benefit
PCIe Gen4 x16	32 GB/s	Significant — ~6x less data
InfiniBand HDR	25 GB/s	Large — network is bottleneck
Ethernet 100G	12.5 GB/s	Essential
NVLink (A100+)	600+ GB/s	Minimal — compute dominates

Files (7 files, +694 lines)

`parallel/tq3_compress.cuh` — thread-level compress/decompress/decompress+add primitives
`parallel/tq3_all_reduce.cu/.cuh` — compressed ring kernel using existing ParallelContext
`bindings.cpp` — `ext.tq3_all_reduce()` binding
`model_tp_backend.py` — `TPBackendNative.tq3_compress` opt-in flag
`tests/test_tq3_allreduce.py` — 7 simulation tests (no multi-GPU needed)
`tests/bench_tq3_allreduce.py` — bandwidth break-even analysis

Opt-in

backend.tq3_compress = True  # Enable compressed communication

Test plan

`pytest tests/test_tq3_allreduce.py -v` — compression quality simulation
`python tests/bench_tq3_allreduce.py` — break-even bandwidth analysis
Multi-GPU integration test (2+ GPUs)

🤖 Generated with Claude Code

Adds compressed communication to the Native TP backend's ring all-reduce, reducing inter-GPU data transfer by ~6.4x. Algorithm: 1. Each rank compresses local fp16 data to TQ3 (10 bytes per 32 values) 2. Writes compressed data to shared memory ring buffer 3. Ring reduce: decompress from predecessor + accumulate locally 4. Ring broadcast: forward accumulated result 5. Result written back in-place New files: - tq3_compress.cuh: Thread-level TQ3 compress/decompress/decompress+add - tq3_all_reduce.cu: Standalone compressed ring kernel using ParallelContext - tq3_all_reduce.cuh: Header declarations Integration: - TPBackendNative.tq3_compress flag (opt-in) - ext.tq3_all_reduce() binding Impact: PCIe setups (32 GB/s) and multi-machine (InfiniBand) see significant throughput improvement. NVLink setups see minimal change since interconnect is rarely the bottleneck. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

influenist · 2026-04-01T19:21:56Z

Closing — CUDA kernels were never compiled or tested on real hardware. Will revisit after the KV cache PR (#182) is reviewed.

influenist closed this Apr 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: TQ3 compressed ring all-reduce for TP bandwidth reduction#181

feat: TQ3 compressed ring all-reduce for TP bandwidth reduction#181
influenist wants to merge 1 commit intoturboderp-org:masterfrom
influenist:feature/tq3-compressed-allreduce

influenist commented Apr 1, 2026

Uh oh!

influenist commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

influenist commented Apr 1, 2026

Summary

How it works

When it helps

Files (7 files, +694 lines)

Opt-in

Test plan

Uh oh!

influenist commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant