[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts

rayleizhu/vllm-ra

About

A vLLM fork with RelayAttention implemented. See the paper for details: RelayAttention for Efficient Large Language Model Serving with Long System Prompts. A sketch of the core idea follows the list below.

  • Forked from vLLM v0.2.6.
  • Used to produce all tables and figures in the paper.
  • Includes not only the implementation of RelayAttention itself, but also the scripts for data collection and plotting.
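
The snippet below is a minimal, self-contained sketch of the relay-fusion idea in plain PyTorch: attention over the shared system-prompt KV is computed once for the whole batch, attention over each request's own KV is computed separately, and the two partial results are merged with their softmax normalizers so the result equals full attention over the concatenated sequence. The tensor shapes and the helper names partial_attention and relay_fusion are illustrative only; the actual implementation in this repo is integrated into vLLM's attention layers and is not reproduced here.

import torch

def partial_attention(q, k, v):
    # Attention over one KV segment. Besides the output, return the
    # log-sum-exp of the scores, which is needed to merge segments later.
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale    # [..., q_len, kv_len]
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)      # [..., q_len, 1]
    out = torch.matmul(torch.softmax(scores, dim=-1), v)     # [..., q_len, head_dim]
    return out, lse

def relay_fusion(out_sys, lse_sys, out_ctx, lse_ctx):
    # Merge the two partial results; alpha is the total softmax weight that
    # the system-prompt tokens would receive in full attention.
    alpha = torch.sigmoid(lse_sys - lse_ctx)
    return alpha * out_sys + (1.0 - alpha) * out_ctx

# Toy sizes: 4 decoding requests, a 512-token shared system prompt,
# 64 request-specific tokens each, head dimension 128.
q = torch.randn(4, 1, 128)                                   # one query per request
k_sys, v_sys = torch.randn(512, 128), torch.randn(512, 128)  # shared prefix, stored once
k_ctx, v_ctx = torch.randn(4, 64, 128), torch.randn(4, 64, 128)

out_sys, lse_sys = partial_attention(q, k_sys, v_sys)  # one batched pass over the shared KV
out_ctx, lse_ctx = partial_attention(q, k_ctx, v_ctx)  # per-request pass over private KV
out = relay_fusion(out_sys, lse_sys, out_ctx, lse_ctx)

# Sanity check: the fused result matches full attention over the concatenation.
k_full = torch.cat([k_sys.expand(4, -1, -1), k_ctx], dim=1)
v_full = torch.cat([v_sys.expand(4, -1, -1), v_ctx], dim=1)
ref, _ = partial_attention(q, k_full, v_full)
print(torch.allclose(out, ref, atol=1e-4))  # expected: True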

How to use

  1. Follow the vLLM documentation to install from source. See also _scripts/install.sh.
  2. Check the scripts in this repo to reproduce the experiments and collect data. If you are using a Slurm cluster, check the _cluster directory instead.

You can use examples/relay_inference.py as the entry point for exploring this project. See Figure 9 in the paper for the big picture.
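
For orientation, the snippet below shows roughly what offline generation looks like with the vLLM v0.2.6 Python API that this fork is built on (LLM and SamplingParams are upstream vLLM classes). The model name and prompts are placeholders, and this baseline simply prepends the shared system prompt to every request; how the fork actually separates the shared system prompt for RelayAttention is demonstrated in examples/relay_inference.py, not in this sketch.

from vllm import LLM, SamplingParams

# Placeholder model and prompts; any model supported by vLLM v0.2.6 works.
system_prompt = "You are a helpful assistant. " * 50   # a long shared system prompt
requests = ["What is RelayAttention?", "Summarize the paper in one sentence."]

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Baseline usage: the shared system prompt is prepended to every request,
# so its KV cache is recomputed and stored per request.
outputs = llm.generate([system_prompt + r for r in requests], sampling_params)
for output in outputs:
    print(output.outputs[0].text)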

Citation

If you use this repo for your research, please cite our paper:

@misc{zhu2024relayattention,
      title={RelayAttention for Efficient Large Language Model Serving with Long System Prompts}, 
      author={Lei Zhu and Xinjiang Wang and Wayne Zhang and Rynson W. H. Lau},
      year={2024},
      eprint={2402.14808},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
