
CodeGen Converter #229

Closed
michaelfeil wants to merge 36 commits into vllm-project:main from michaelfeil:gpt_j_convert

Conversation

@michaelfeil
Contributor

@michaelfeil michaelfeil commented Jun 24, 2023

This PR aims to integrate CodeGen. Work in progress, not ready.

@zhuohan123
Member

zhuohan123 commented Jun 26, 2023

Thank you for your contribution! Please let us know when this PR is ready for review!

@zhuohan123
Member

@michaelfeil Are you still working on the CodeGen model support? Do you need any help from our side?

@michaelfeil
Contributor Author

I'll try to get back to you in the coming days!

@michaelfeil michaelfeil marked this pull request as draft August 3, 2023 21:39
@zhuohan123 zhuohan123 added the new-model (Requests to new models) label Sep 12, 2023
@WoosukKwon
Collaborator

Closed as this PR is too old.

@WoosukKwon WoosukKwon closed this Dec 12, 2023
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Sep 11, 2024
This fixes a very silly issue where mismatched values of the `warmup_mode` flag could cause graph recompilations and, eventually, memory leaks.
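As a hedged illustration of that failure mode (the cache and function names here are invented for the sketch, not actual vLLM internals), the bug class looks like a compilation cache whose key includes a flag that callers pass inconsistently:
```
# Illustrative sketch only; names are hypothetical, not vLLM internals.
_graph_cache = {}

def compile_graph(batch_shape):
    # Stand-in for an expensive device-graph compilation.
    return f"graph<{batch_shape}>"

def run_model(batch_shape, warmup_mode=False):
    # The flag leaks into the cache key, so mismatched values for the
    # same workload miss the cache, recompile, and keep stale graphs
    # alive: repeated recompilations and, eventually, a memory leak.
    key = (batch_shape, warmup_mode)
    if key not in _graph_cache:
        _graph_cache[key] = compile_graph(batch_shape)
    return _graph_cache[key]

# Two compiled graphs for the same shape, differing only in the flag:
run_model((4, 128), warmup_mode=True)
run_model((4, 128), warmup_mode=False)
assert len(_graph_cache) == 2
```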
amy-why-3459 pushed a commit to amy-why-3459/vllm that referenced this pull request Sep 15, 2025
This PR adds pooling support for vllm-ascend.

Tested with `bge-base-en-v1.5` via `encode`:
```
from vllm import LLM

# Sample prompts.
prompts = [
  "Hello, my name is",
  "The president of the United States is",
  "The capital of France is",
  "The future of AI is",
]
# Create an LLM.
model = LLM(model="./bge-base-en-v1.5", enforce_eager=True)
# Generate embedding. The output is a list of EmbeddingRequestOutputs.
outputs = model.encode(prompts)
# Print the outputs.
for output in outputs:
    print(output.outputs.embedding)  # list of 768 floats
```

Tested via `embed`:
```
from vllm import LLM

llm = LLM(model="./bge-base-en-v1.5", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

Related: vllm-project/vllm-ascend#200

## Known issue
The accuracy is not correct yet, since this feature relies on `enc-dec` support. That will be added in a follow-up PR by @MengqingCao.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
iwooook pushed a commit to moreh-dev/vllm that referenced this pull request Nov 29, 2025
…uired (vllm-project#232)

Extend vLLM v0 by adding automatic compat-sampling fallbacks to device sampling. This offers substantial performance benefits when not all requests require advanced features like structured outputs.

`always_compat_sampling` should now never need to be set in production. Whenever something otherwise unsupported is detected, such as:

- structured outputs
- non-greedy sampling for models that don't support it on device
- non-uniform top-k/top-p on the host, or with a model that doesn't support it

the system switches to host compat sampling. This also means that we never override the temperature, p, or k to force them to be uniform across the batch.

Also contains a fix for vllm-project#229 (it will fall back to compat sampling if non-uniform sampling is not supported on the device, or if device sampling is disabled).
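As a minimal sketch of the fallback decision described above (the function, the `Request` fields, and the capability flags here are hypothetical illustrations, not the actual vLLM API):
```
from dataclasses import dataclass

# Hypothetical per-request sampling state, for illustration only.
@dataclass
class Request:
    temperature: float
    top_k: int
    top_p: float
    structured_output: bool = False

def needs_host_compat_sampling(batch, device_supports_nongreedy,
                               device_supports_nonuniform):
    # Structured outputs always force the host fallback.
    if any(r.structured_output for r in batch):
        return True
    # Non-greedy sampling on a device that cannot do it.
    if any(r.temperature > 0 for r in batch) and not device_supports_nongreedy:
        return True
    # Non-uniform top-k/top-p across the batch falls back instead of
    # overriding the values to force uniformity.
    uniform = len({(r.top_k, r.top_p) for r in batch}) == 1
    return not uniform and not device_supports_nonuniform
```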
