We investigate a critical yet underexplored issue: the pipeline inter-stage bubble problem introduced by sampling operations. To address this challenge, we propose DynaPipe, a novel runtime dynamic layer redistribution scheme. By adaptively adjusting the computational load across pipeline stages, DynaPipe ensures more balanced task distribution, effectively aligning the pipeline and mitigating inter-stage imbalances. Compared with state-of-the-art pipeline inference frameworks, DynaPipe achieves notable performance gains and significantly improves overall efficiency.
pip install --verbose -e .
# To enable prefix caching, add "--enable-prefix-caching"
# To enable pipeline parallelism, add "--pp $PP_DEGREE"
python -m gllm.entrypoints.api_server --port $PORT --model-path $MODEL_PATH --enable-adjust-ayers
python benchmarks/benchmark_serving.py --backend $BACKEND --model $MODEL \
--dataset-name $DATASET_NAME --dataset-path $DATASET_PATH \
--num-prompts $NUM_PROMPTS --port $PORT --trust-remote-code \
--request-rate $REQUEST_RATE
This project builds upon the foundational work of gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling.
