Conversation
```
// Output will follow the provided regex pattern
string regex = 5;
// Output will be exactly one of the specified choices
StringChoices choice = 6;
```
Unfortunately you cannot have repeated fields directly within oneofs :(
```
enum ResponseFormat {
  // Plain text, no constraints
  TEXT = 0;
  // Valid JSON
  JSON = 1;
}

message StringChoices {
  repeated string choices = 1;
}

// Mutually-exclusive guided decoding options
oneof guided {
  // Output will be in the specified format
  ResponseFormat format = 3;
  // Output will follow the provided JSON schema
  string json_schema = 4;
  // Output will follow the provided regex pattern
  string regex = 5;
  // Output will be exactly one of the specified choices
  StringChoices choice = 6;
  // Output will follow the provided context free grammar
  string grammar = 7;
}
```
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
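As a rough sketch of what the `oneof` buys the server, here is how a handler might enforce the same mutual exclusivity by hand. The field names mirror the proto above, but the dict-based `which_guided` helper is purely illustrative, not generated protobuf code (a real `oneof` enforces this automatically when fields are set).

```python
# Guided-decoding fields from the proto's "guided" oneof.
GUIDED_FIELDS = ("format", "json_schema", "regex", "choice", "grammar")

def which_guided(params: dict):
    """Return the single guided-decoding field that is set, or None.

    Raises ValueError if more than one is set, mimicking the
    exclusivity a protobuf oneof provides for free.
    """
    set_fields = [f for f in GUIDED_FIELDS if params.get(f) is not None]
    if len(set_fields) > 1:
        raise ValueError(
            f"guided options are mutually exclusive, got: {set_fields}")
    return set_fields[0] if set_fields else None
```

With generated protobuf messages the equivalent check is `request.WhichOneof("guided")`.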
```
if outlines_decoding.global_thread_pool is None:
    outlines_decoding.global_thread_pool = (
        concurrent.futures.ThreadPoolExecutor(max_workers=2))
```
I haven't looked much at logits processors; why does this require its own thread pool?
It's the same code as here.
If I'm not mistaken, only the construction of the logits processor happens in another thread. But if the logits processor is cached, I'm not sure what the benefit is of having another thread build the object.
Yes that's right. The code is just the same as that in the http API. It's dispatched to a threadpool to avoid blocking the asyncio event loop, but I think it could be made more efficient since we only care about this in the case that the LP is not already cached. In any case we can fix that as a follow-on since we need to fix that related concurrency bug anyhow.
```
self.config = await self.engine.get_model_config()
self.tokenizer_group = await self.engine.get_tokenizer_group()
# self.tokenizer_group = await self.engine.get_tokenizer_group()
self.tokenizer_group = self.engine.engine.tokenizer
```
I've seen versions of the code where the get_tokenizer_group function exists and others where it doesn't. What's happening with this function?
@maxdebayser that's from this upstream PR vllm-project/vllm#3512
It didn't get merged in a timely manner and is now buried in conflicts :(
maxdebayser left a comment:
Since the bug reported in issue https://github.ibm.com/ai-foundation/fmaas-inference-server/issues/718 is not caused by the code in this PR, I think we can merge it and fix the problem in another PR.
Fixes issue:
```
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/vllm/lib64/python3.9/site-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
File "/opt/vllm/lib64/python3.9/site-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
File "/opt/vllm/lib64/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
executor_class = cls._get_executor_cls(engine_config)
File "/opt/vllm/lib64/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 670, in _get_executor_cls
from vllm.executor.sendnn_executor import SENDNNExecutorAsync
File "/opt/vllm/lib64/python3.9/site-packages/vllm/executor/sendnn_executor.py", line 6, in <module>
from vllm.sequence import ExecuteModelRequest, SamplerOutput
ImportError: cannot import name 'SamplerOutput' from 'vllm.sequence' (/opt/vllm/lib64/python3.9/site-packages/vllm/sequence.py)
```
reported by @Yannick-Schnider1 and @HTChang
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
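The traceback above is a symptom of a class moving between modules across library versions (`SamplerOutput` is no longer importable from `vllm.sequence`). One generic way to tolerate such moves is a compatibility import that tries candidate module paths in order. The sketch below demonstrates the pattern with stdlib modules only, since vLLM's exact module layout varies by version and is not assumed here.

```python
import importlib

def import_from_first(attr: str, *module_paths: str):
    """Try each module path in order; return the first module attribute
    named `attr` that is found. Raise ImportError if none match."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Module path does not exist in this version; try the next.
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr} not found in any of {module_paths}")

# Illustrative usage with stdlib names: the first path is bogus,
# so the helper falls through to `collections`.
OrderedDict = import_from_first("OrderedDict", "no.such.module", "collections")
```

In the vLLM case the same helper would be given the old and new module paths for `SamplerOutput` so the executor imports cleanly on either version.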
Within the existing `decoding` request parameter section: