
## Code Analysis

## TODOs

- finish the implementation
- test in eager mode
- make the implementation work with CUDAGraph
  - use a static buffer to track the prefix cache length (see the CUDAGraph sketch after this list)
  - fix a bug so that `paged_attention_v2` works with CUDAGraph
- optimize the implementation further
  - write a relay fusion kernel with Triton (the fusion step is sketched after this list)
  - modify the paged attention kernel to return the log-sum-exp of the attention logits
  - use the native flash attention kernel to support MQA/GQA
- benchmark standalone relay attention (teaser)
  - script for latency, memory usage, and profiling in eager & CUDAGraph modes (a minimal timing helper is sketched below)
  - run the benchmark & profiling, plot figures
- benchmark for non-interactive applications (exp group 1)
  - throughput & latency for a synthetic workload, plot figures
  - throughput & latency for a real workload (ShareGPT dataset), plot figures
- benchmark for interactive applications (exp group 2)
  - throughput, latency to the first token, and latency to subsequent tokens on the ShareGPT dataset
- check whether the tokenizer behavior needs to change (e.g. avoid prepending the BOS token; see the sketch below)
- adaptations for the cases where window attention is used and the sequence length exceeds the window size
- adaptations to support ALiBi
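
For reference, the relay fusion step combines attention computed over the shared prefix with attention computed over the per-request suffix, rescaling each partial output by its share of the total softmax mass. A minimal PyTorch sketch of the math (the function and argument names here are illustrative, not the repo's API):

```python
import torch


def relay_fuse(out_prefix: torch.Tensor,
               lse_prefix: torch.Tensor,
               out_suffix: torch.Tensor,
               lse_suffix: torch.Tensor) -> torch.Tensor:
    """Fuse attention computed separately over the shared prefix and
    the per-request suffix.

    out_*: [num_tokens, num_heads, head_dim] partial attention outputs
    lse_*: [num_tokens, num_heads] log-sum-exp of the attention logits
    """
    # Softmax over the union of keys equals a convex combination of the
    # two partial outputs, weighted by exp(lse_i - lse_total).
    lse_total = torch.logaddexp(lse_prefix, lse_suffix)          # [T, H]
    w_prefix = torch.exp(lse_prefix - lse_total).unsqueeze(-1)   # [T, H, 1]
    w_suffix = torch.exp(lse_suffix - lse_total).unsqueeze(-1)
    return w_prefix * out_prefix + w_suffix * out_suffix
```

This is also why the paged attention kernel needs to return its log-sum-exp: without `lse_prefix` and `lse_suffix`, the two partial outputs cannot be rescaled consistently.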
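
On the CUDAGraph item: graph capture records fixed device addresses, so any value that changes between replays (such as the prefix cache length) must live in a buffer allocated once before capture and updated in place, never reallocated. A sketch of the pattern (the buffer name is hypothetical):

```python
import torch

# Allocated once, before graph capture; the captured kernels read from
# this address on every replay.
prefix_len_buf = torch.zeros(1, dtype=torch.int32, device="cuda")


def set_prefix_len(n: int) -> None:
    # fill_ writes in place, keeping the pointer the captured graph recorded;
    # assigning a new tensor here would silently break replays.
    prefix_len_buf.fill_(n)
```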
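
For the standalone benchmark, one way the latency measurement could be done is with CUDA events, which time GPU work without forcing a sync inside the loop. A sketch under that assumption (the helper name and iteration counts are made up here):

```python
import torch


def time_cuda(fn, warmup: int = 10, iters: int = 50) -> float:
    """Return the mean latency of fn() in milliseconds using CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


# Peak memory for the same run can be read with:
#   torch.cuda.reset_peak_memory_stats(); fn(); torch.cuda.max_memory_allocated()
```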
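
On the tokenizer question: if the shared prefix is tokenized once with its BOS token, the per-request suffix has to be tokenized without special tokens so a second BOS is not inserted between prefix and suffix. With a Hugging Face tokenizer this looks like the following (the model name is only an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# The shared prefix keeps its BOS token ...
prefix_ids = tok("You are a helpful assistant.").input_ids
# ... but the per-request suffix must not get a second one.
suffix_ids = tok("What is relay attention?", add_special_tokens=False).input_ids
```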

## Troubleshooting

## Useful links

## Chat Templates