WIP: sync dAttention with upstream vLLM #5

Draft
wants to merge 1 commit into main from rebase/dAttention

Conversation

@chakpongchung (Collaborator) commented on Sep 20, 2024

Note that this is a migration; the pre-migration code was not exercised by the upstream CI.

The log from running examples/offline_inference.py on a merlin A100 node is shown below:

(base) mlxlaba6xf79os66a1941d-20240724235406-lm19g2-hvu8dk-worker:vllm# python examples/offline_inference.py 
INFO 10-09 05:47:06 llm_engine.py:223] Initializing an LLM engine (v0.6.1.post2) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
...
...
...
Processed prompts: 100%|████████████████████████████████████████████| 4/4 [00:00<00:00, 10.15it/s, est. speed input: 65.96 toks/s, output: 162.35 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Joel, my dad is my friend and we are in a relationship. I am'
Prompt: 'The president of the United States is', Generated text: ' speaking out against the release of some State Department documents which show the Russians were involved'
Prompt: 'The capital of France is', Generated text: ' known as the “Proud French capital”. What is this city'
Prompt: 'The future of AI is', Generated text: ' literally in danger of being taken by any other company.\nAgreed. '
(base) mlxlaba6xf79os66a1941d-20240724235406-lm19g2-hvu8dk-worker:vllm# git branch
  main
* rebase/dAttention
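
For reference, the example exercised above follows roughly this shape. This is a minimal sketch using the stock vLLM Python API: the prompts and the Prompt/Generated-text print format are taken from the log above, while the sampling parameters are assumptions and may differ from the actual examples/offline_inference.py.

from vllm import LLM, SamplingParams

# Prompts matching the ones visible in the log above.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Assumed sampling settings; the example script's values may differ.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# facebook/opt-125m matches the model named in the engine-init log line.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

Each returned RequestOutput carries the original prompt plus its completions; outputs[0].text picks the single completion generated per prompt here, which is exactly what the "Prompt: ..., Generated text: ..." lines in the log print.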

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of the following:

  • Add the ready label to the PR.
  • Enable auto-merge.

🚀

@chakpongchung force-pushed the rebase/dAttention branch 3 times, most recently from d1651c9 to fdd7dc6 on September 23, 2024 at 21:38
Jeffwan pushed a commit that referenced this pull request on Nov 17, 2024