WIP: sync dAttention with upstream vLLM #5

Draft
wants to merge 1 commit into main from rebase/dAttention

Conversation

@chakpongchung (Collaborator) commented on Sep 20, 2024

Note that this is a migration; the pre-migration code was not exercised by the upstream CI.

The log from running examples/offline_inference.py on a merlin A100 node is shown below:

(base) mlxlaba6xf79os66a1941d-20240724235406-lm19g2-hvu8dk-worker:vllm# python examples/offline_inference.py 
INFO 10-09 05:47:06 llm_engine.py:223] Initializing an LLM engine (v0.6.1.post2) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
...
...
...
Processed prompts: 100%|████████████████████████████████████████████| 4/4 [00:00<00:00, 10.15it/s, est. speed input: 65.96 toks/s, output: 162.35 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Joel, my dad is my friend and we are in a relationship. I am'
Prompt: 'The president of the United States is', Generated text: ' speaking out against the release of some State Department documents which show the Russians were involved'
Prompt: 'The capital of France is', Generated text: ' known as the “Proud French capital”. What is this city'
Prompt: 'The future of AI is', Generated text: ' literally in danger of being taken by any other company.\nAgreed. '
(base) mlxlaba6xf79os66a1941d-20240724235406-lm19g2-hvu8dk-worker:vllm# git branch
  main
* rebase/dAttention
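
For reference, the example exercised above follows roughly this shape. This is a minimal sketch using the stock vLLM Python API: the prompts and the Prompt/Generated-text print format are taken from the log above, while the sampling parameters are assumptions and may differ from the actual examples/offline_inference.py.

from vllm import LLM, SamplingParams

# Prompts matching the ones visible in the log above.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Assumed sampling settings; the example script's values may differ.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# facebook/opt-125m matches the model named in the engine-init log line.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

Each returned RequestOutput carries the original prompt plus its completions; outputs[0].text picks the single completion generated per prompt here, which is exactly what the "Prompt: ..., Generated text: ..." lines in the log print.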

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of the following:

  • Add the ready label to the PR.
  • Enable auto-merge.

🚀

@chakpongchung force-pushed the rebase/dAttention branch 3 times, most recently from d1651c9 to fdd7dc6 on September 23, 2024 at 21:38
Jeffwan pushed a commit that referenced this pull request on Nov 17, 2024