## About

A vLLM fork with RelayAttention implemented. See the paper for details: RelayAttention for Efficient Large Language Model Serving with Long System Prompts

- forked from vLLM v0.2.6
- used to produce all tables and figures in the paper
- includes not only the implementation of the idea, but also the scripts for data collection and plotting

## How to use

1. Follow the vLLM documentation to install from source. See also `_scripts/install.sh`.
2. Check the scripts here to reproduce the experiments and collect data. If you are using a Slurm cluster, check the `_cluster` directory instead.

You can use `examples/relay_inference.py` as the entry point for exploring this project. See Figure 9 in the paper for the big picture.
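For a quick sense of the workload that `examples/relay_inference.py` targets, here is a minimal sketch of serving many requests that share one long system prompt, using the standard vLLM offline API (`LLM` / `SamplingParams`). The model name and prompts below are placeholders, and the fork-specific arguments for enabling RelayAttention are not shown; see `examples/relay_inference.py` for the actual interface.

```python
# Sketch of the shared-long-system-prompt setting that RelayAttention optimizes.
# Uses only the standard vLLM offline API; the fork's switches for enabling
# RelayAttention are omitted here -- consult examples/relay_inference.py.
from vllm import LLM, SamplingParams

# Stand-in for a long system prompt shared by every request.
SYSTEM_PROMPT = "You are a helpful assistant. " * 100

user_queries = [
    "Summarize the plot of Hamlet in two sentences.",
    "What is the capital of France?",
]

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # placeholder model choice
params = SamplingParams(temperature=0.8, max_tokens=128)

# Every request carries the same long prefix; RelayAttention's goal is to
# compute attention over this shared prefix once per batch rather than
# once per request.
prompts = [SYSTEM_PROMPT + q for q in user_queries]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```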

## Citation

If you use this repo for your research, please cite our paper:

```bibtex
@misc{zhu2024relayattention,
      title={RelayAttention for Efficient Large Language Model Serving with Long System Prompts},
      author={Lei Zhu and Xinjiang Wang and Wayne Zhang and Rynson W. H. Lau},
      year={2024},
      eprint={2402.14808},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```