The example fails in "Serve with vLLM" documentation #1326

Open
MuggleJinx opened this issue Dec 7, 2024 · 3 comments
Describe the issue as clearly as possible:

The documentation shows examples of structured output using a JSON schema and a regex:

curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "What is the capital of France?",
        "schema": {"type": "string", "maxLength": 5}
        }'
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "What is Pi? Give me the first 8 digits: ",
        "regex": "(-)?(0|[1-9][0-9]*)(\\.[0-9]+)?([eE][+-][0-9]+)?"
        }'

Steps/code to reproduce the bug:

curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "What is the capital of France?",
        "schema": {"type": "string", "maxLength": 5}
        }'
{"text":["What is the capital of France?\", \""]}

curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "What is Pi? Give me the first 8 digits: ",
        "regex": "(-)?(0|[1-9][0-9]*)(\\.[0-9]+)?([eE][+-][0-9]+)?"
        }'
{"text":["What is Pi? Give me the first 8 digits: 3.14159265358979"]}

Expected result:

curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "What is the capital of France?",
        "schema": {"type": "string", "maxLength": 5}
        }'
{"text":["What is the capital of France?\", \"Paris"]}

curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "What is Pi? Give me the first 8 digits: ",
        "regex": "(-)?(0|[1-9][0-9]*)(\\.[0-9]+)?([eE][+-][0-9]+)?"
        }'
{"text":["What is Pi? Give me the first 8 digits: 3.14159265"]}

Error message:

No response

Outlines/Python version information:

Version information

``` python -c "from outlines import _version; print(_version.version)" python -c "import sys; print('Python', sys.version)" pip freeze 0.0.46 Python 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0] accelerate==0.34.2 agentops==0.3.19 aiofiles==24.1.0 aiohappyeyeballs==2.4.4 aiohttp==3.11.9 aiosignal==1.3.1 alabaster==1.0.0 annotated-types==0.7.0 anthropic==0.29.2 antlr4-python3-runtime==4.9.3 anyio==4.6.2.post1 apify_client==1.8.1 apify_shared==1.1.2 arrow==1.3.0 arxiv==2.1.3 arxiv2text==0.1.14 asgiref==3.8.1 asknews==0.7.53 asttokens==3.0.0 attrs==24.2.0 av==14.0.0 azure-core==1.32.0 azure-storage-blob==12.24.0 babel==2.16.0 backoff==2.2.1 beautifulsoup4==4.12.3 bibtexparser==1.4.2 botocore==1.35.74 Brotli @ file:///home/conda/feedstock_root/build_artifacts/brotli-split_1695989787169/work cachetools==5.5.0 camel-ai==0.2.11 certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1718025014955/work/certifi cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1696001721842/work chardet==5.2.0 charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1698833585322/work click==8.1.7 cloudpickle==3.1.0 cohere==5.13.2 colorama==0.4.6 coloredlogs==15.0.1 comm==0.2.2 compressed-tensors==0.8.0 contourpy==1.3.1 cryptography==42.0.6 cssselect==1.2.0 curl_cffi==0.6.2 cycler==0.12.1 dataclasses-json==0.6.7 datacommons==1.4.3 datacommons-pandas==0.0.3 datasets==2.21.0 debugpy==1.8.9 decorator==5.1.1 Deprecated==1.2.15 diffusers==0.31.0 dill==0.3.8 discord.py==2.4.0 diskcache==5.6.3 distro==1.9.0 docker==7.1.0 docstring-parser==0.15 docutils==0.21.2 docx2txt==0.8 duckduckgo_search==6.3.7 effdet==0.4.1 einops==0.8.0 emoji==2.14.0 et_xmlfile==2.0.0 eval_type_backport==0.2.0 executing==2.1.0 fake-useragent==1.5.1 fastapi==0.115.6 fastavro==1.9.7 feedfinder2==0.0.4 feedparser==6.0.11 ffmpeg-python==0.2.0 filelock @ file:///home/conda/feedstock_root/build_artifacts/filelock_1719088281970/work filetype==1.2.0 firecrawl-py==1.6.3 flatbuffers==24.3.25 fonttools==4.55.1 free_proxy==1.1.3 frozenlist==1.5.0 fsspec==2024.6.1 future==1.0.0 geojson==2.5.0 gguf==0.10.0 gmpy2 @ file:///home/conda/feedstock_root/build_artifacts/gmpy2_1715527293187/work google-ai-generativelanguage==0.6.4 google-api-core==2.23.0 google-api-python-client==2.154.0 google-auth==2.36.0 google-auth-httplib2==0.2.0 google-cloud-core==2.4.1 google-cloud-storage==2.18.2 google-cloud-vision==3.8.1 google-crc32c==1.6.0 google-generativeai==0.6.0 google-resumable-media==2.7.2 googleapis-common-protos==1.66.0 googlemaps==4.10.0 grpcio==1.67.1 grpcio-status==1.62.3 grpcio-tools==1.62.3 h11==0.14.0 h2 @ file:///home/conda/feedstock_root/build_artifacts/h2_1634280454336/work hpack==4.0.0 httpcore==1.0.7 httplib2==0.22.0 httptools==0.6.4 httpx==0.27.2 httpx-sse==0.4.0 huggingface-hub==0.26.3 humanfriendly==10.0 hyperframe @ file:///home/conda/feedstock_root/build_artifacts/hyperframe_1619110129307/work idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1713279365350/work imageio==2.36.1 imagesize==1.4.1 importlib_metadata==8.4.0 interegular==0.3.3 iopath==0.1.10 ipykernel==6.29.5 ipython==8.30.0 isodate==0.7.2 jaraco.context==6.0.1 jedi==0.19.2 jieba3k==0.35.1 Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1715127149914/work jiter==0.8.0 jmespath==1.0.1 joblib==1.4.2 jsonpath-python==1.0.6 jsonschema==4.23.0 jsonschema-path==0.3.3 jsonschema-specifications==2023.12.1 jupyter_client==8.6.3 jupyter_core==5.7.2 kiwisolver==1.4.7 langdetect==1.0.9 
lark==1.2.2 layoutparser==0.3.4 lazy-object-proxy==1.10.0 litellm==1.53.4 llvmlite==0.43.0 lm-format-enforcer==0.10.9 lxml==5.3.0 Markdown==3.7 MarkupSafe @ file:///home/conda/feedstock_root/build_artifacts/markupsafe_1706899920239/work marshmallow==3.23.1 matplotlib==3.9.3 matplotlib-inline==0.1.7 milvus-lite==2.4.10 mistral_common==1.5.1 mistralai==1.2.5 more-itertools==10.5.0 mpmath @ file:///home/conda/feedstock_root/build_artifacts/mpmath_1678228039184/work msgpack==1.1.0 msgspec==0.18.6 multidict==6.1.0 multiprocess==0.70.16 mypy-extensions==1.0.0 nebula3-python==3.8.2 neo4j==5.27.0 nest-asyncio==1.6.0 networkx @ file:///home/conda/feedstock_root/build_artifacts/networkx_1712540363324/work newspaper3k==0.2.8 nltk==3.8.1 notion-client==2.2.1 numba==0.60.0 numpy==1.26.4 nvidia-cublas-cu12==12.4.5.8 nvidia-cuda-cupti-cu12==12.4.127 nvidia-cuda-nvrtc-cu12==12.4.127 nvidia-cuda-runtime-cu12==12.4.127 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.2.1.3 nvidia-curand-cu12==10.3.5.147 nvidia-cusolver-cu12==11.6.1.9 nvidia-cusparse-cu12==12.3.1.170 nvidia-ml-py==12.560.30 nvidia-nccl-cu12==2.21.5 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu12==12.4.127 oauthlib==3.2.2 olefile==0.47 omegaconf==2.3.0 onnx==1.17.0 onnxruntime==1.20.1 openai==1.56.2 openapi-schema-validator==0.6.2 openapi-spec-validator==0.7.1 opencv-python==4.10.0.84 opencv-python-headless==4.10.0.84 openpyxl==3.1.5 opentelemetry-api==1.27.0 opentelemetry-exporter-otlp-proto-common==1.27.0 opentelemetry-exporter-otlp-proto-http==1.27.0 opentelemetry-proto==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 orjson==3.10.12 outcome==1.3.0.post0 outlines==0.0.46 packaging==23.2 pandas==2.2.3 pandoc==2.4 parameterized==0.9.0 parso==0.8.4 partial-json-parser==0.2.1.1.post4 pathable==0.4.3 pathlib==1.0.1 pdf2image==1.17.0 pdfminer.six==20231228 pdfplumber==0.11.4 pexpect==4.9.0 pikepdf==9.4.2 pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1718833743537/work pillow_heif==0.21.0 platformdirs==4.3.6 plumbum==1.9.0 ply==3.11 portalocker==2.10.1 prance==23.6.21.0 praw==7.8.1 prawcore==2.4.0 primp==0.8.1 prometheus-fastapi-instrumentator==7.0.0 prometheus_client==0.21.1 prompt_toolkit==3.0.48 propcache==0.2.1 proto-plus==1.25.0 protobuf==4.25.5 psutil==5.9.8 ptyprocess==0.7.0 pure_eval==0.2.3 py-cpuinfo==9.0.0 pyairports==2.1.1 pyarrow==18.1.0 pyasn1==0.6.1 pyasn1_modules==0.4.1 pycocotools==2.0.8 pycountry==24.6.1 pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1711811537435/work pydantic==2.9.2 pydantic_core==2.23.4 pydub==0.25.1 PyGithub==2.5.0 Pygments==2.18.0 PyJWT==2.10.1 pymilvus==2.5.0 PyMuPDF==1.24.14 PyNaCl==1.5.0 pyowm==3.3.0 pypandoc==1.14 pyparsing==3.2.0 pypdf==5.1.0 PyPDF2==3.0.1 pypdfium2==4.30.0 PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work pyTelegramBotAPI==4.24.0 pytesseract==0.3.13 python-dateutil==2.9.0.post0 python-docx==1.1.2 python-dotenv==1.0.1 python-iso639==2024.10.22 python-magic==0.4.27 python-multipart==0.0.19 python-oxmsg==0.0.1 python-pptx==0.6.23 pytz==2024.2 PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1695373450623/work pyzmq==26.2.0 qdrant-client==1.12.1 rank-bm25==0.2.2 RapidFuzz==3.10.1 ray==2.40.0 redis==5.2.0 referencing==0.35.1 regex==2024.11.6 reka-api==3.2.0 requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1717057054362/work requests-file==2.1.0 requests-oauthlib==1.3.1 requests-toolbelt==1.0.0 rfc3339-validator==0.1.4 rpds-py==0.22.3 rsa==4.9 
ruamel.yaml==0.18.6 ruamel.yaml.clib==0.2.12 safetensors==0.4.5 scholarly==1.7.11 scikit-learn==1.5.2 scipy==1.14.1 selenium==4.27.1 sentence-transformers==3.3.1 sentencepiece==0.2.0 setuptools==75.6.0 sgmllib3k==1.0.0 six==1.16.0 slack_bolt==1.21.2 slack_sdk==3.33.4 sniffio==1.3.1 snowballstemmer==2.2.0 sortedcontainers==2.4.0 soundfile==0.12.1 soupsieve==2.6 Sphinx==8.1.3 sphinx-rtd-theme==3.0.2 sphinxcontrib-applehelp==2.0.0 sphinxcontrib-devhelp==2.0.0 sphinxcontrib-htmlhelp==2.1.0 sphinxcontrib-jquery==4.1 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==2.0.0 sphinxcontrib-serializinghtml==2.0.0 stack-data==0.6.3 starlette==0.41.3 stem==1.8.2 sympy==1.13.1 tabulate==0.9.0 tavily-python==0.5.0 termcolor==2.5.0 textblob==0.18.0.post0 threadpoolctl==3.5.0 tiktoken==0.7.0 timm==1.0.12 tinysegmenter==0.3 tldextract==5.1.3 tokenizers==0.20.3 torch==2.5.1 torchaudio==2.3.1 torchvision==0.20.1 tornado==6.4.2 tqdm==4.67.1 traitlets==5.14.3 transformers==4.46.3 trio==0.27.0 trio-websocket==0.11.1 triton==3.1.0 types-python-dateutil==2.9.0.20241003 types-requests==2.32.0.20241016 typing-inspect==0.9.0 typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1717802530399/work tzdata==2024.2 ujson==5.10.0 unstructured==0.14.10 unstructured-client==0.28.1 unstructured-inference==0.7.36 unstructured.pytesseract==0.3.13 update-checker==0.18.0 uritemplate==4.1.1 urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1719391292974/work uvicorn==0.32.1 uvloop==0.21.0 vllm==0.6.4.post1 watchfiles==1.0.0 wcwidth==0.2.13 websocket-client==1.8.0 websockets==14.1 wheel==0.43.0 wikipedia==1.4.0 wolframalpha==5.1.3 wrapt==1.17.0 wsproto==1.2.0 xformers==0.0.28.post3 xlrd==2.0.1 XlsxWriter==3.2.0 xmltodict==0.14.2 xxhash==3.5.0 yarl==1.18.3 yt-dlp==2024.12.3 zipp==3.21.0 zstandard==0.22.0 ```

Context for the issue:

I am trying to integrate structured output via vLLM into the Camel framework, but the results are not as expected.
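
For reference, this is roughly how the integration calls the endpoint from Python, a minimal sketch using `requests` that mirrors the curl examples above (the helper name and structure are just for illustration, not Camel's actual code):

```python
import requests

# Endpoint started by `python -m outlines.serve.serve`; adjust host/port as needed.
GENERATE_URL = "http://127.0.0.1:8000/generate"


def generate_structured(prompt: str, schema: dict, timeout: float = 60.0) -> str:
    """Hypothetical helper: send a prompt plus a JSON schema to /generate.

    The payload uses the same "prompt" and "schema" keys as the curl examples
    in the documentation; the server responds with {"text": ["<prompt + completion>"]}.
    """
    payload = {"prompt": prompt, "schema": schema}
    response = requests.post(GENERATE_URL, json=payload, timeout=timeout)
    response.raise_for_status()
    return response.json()["text"][0]


if __name__ == "__main__":
    print(generate_structured(
        "What is the capital of France?",
        {"type": "string", "maxLength": 5},
    ))
```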

MuggleJinx commented Dec 7, 2024

Screenshots

[two screenshots attached]


Command used to serve with vLLM:

python -m outlines.serve.serve --model="microsoft/Phi-3-mini-4k-instruct"
INFO 12-07 17:04:02 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
WARNING 12-07 17:04:02 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 12-07 17:04:02 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='microsoft/Phi-3-mini-4k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-mini-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=microsoft/Phi-3-mini-4k-instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 12-07 17:04:03 selector.py:135] Using Flash Attention backend.
INFO 12-07 17:04:06 model_runner.py:1072] Starting to load model microsoft/Phi-3-mini-4k-instruct...
INFO 12-07 17:04:06 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.88s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.48s/it]

INFO 12-07 17:04:10 model_runner.py:1077] Loading model weights took 7.1183 GB
INFO 12-07 17:04:11 worker.py:232] Memory profiling results: total_gpu_memory=23.59GiB initial_memory_usage=7.45GiB peak_torch_memory=7.44GiB memory_usage_post_profile=7.49GiB non_torch_memory=0.36GiB kv_cache_size=13.43GiB gpu_memory_utilization=0.90
INFO 12-07 17:04:11 gpu_executor.py:113] # GPU blocks: 2291, # CPU blocks: 682
INFO 12-07 17:04:11 gpu_executor.py:117] Maximum concurrency for 4096 tokens per request: 8.95x
INFO 12-07 17:04:12 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-07 17:04:12 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 12-07 17:04:26 model_runner.py:1518] Graph capturing finished in 13 secs, took 0.23 GiB
INFO:     Started server process [2645720]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Compiling FSM index for all state transitions: 100%|█████████████████████████████████████████████████████████| 33/33 [00:00<00:00, 173.95it/s]
INFO 12-07 17:05:14 async_llm_engine.py:208] Added request 4730f9a0fa7542afa7d35c3d6304c11c.
INFO 12-07 17:05:14 metrics.py:449] Avg prompt throughput: 0.1 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 12-07 17:05:14 async_llm_engine.py:176] Finished request 4730f9a0fa7542afa7d35c3d6304c11c.
INFO:     127.0.0.1:33796 - "POST /generate HTTP/1.1" 200 OK
Compiling FSM index for all state transitions: 100%|███████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 126.92it/s]
INFO 12-07 17:11:00 async_llm_engine.py:208] Added request 70540119d552477cb85357901116a88c.
INFO 12-07 17:11:00 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 12-07 17:11:00 async_llm_engine.py:176] Finished request 70540119d552477cb85357901116a88c.
INFO:     127.0.0.1:39640 - "POST /generate HTTP/1.1" 200 OK
INFO 12-07 17:11:06 async_llm_engine.py:208] Added request 858b27d5ef0a41cdb526f563605a5207.
INFO 12-07 17:11:06 metrics.py:449] Avg prompt throughput: 2.1 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 12-07 17:11:06 async_llm_engine.py:176] Finished request 858b27d5ef0a41cdb526f563605a5207.
INFO:     127.0.0.1:57404 - "POST /generate HTTP/1.1" 200 OK

Maybe it is not a bug, and the model is simply not capable enough?
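
One thing I might try (just a guess on my side, not something the docs confirm): `{"type": "string", "maxLength": 5}` is also satisfied by an empty string, so tightening the schema with a `minLength` could force the model to emit at least one character. A quick sketch of that variant, reusing the same endpoint as above:

```python
import requests

# Hypothetical tweak: require at least one character so an empty completion
# cannot satisfy the schema. minLength is standard JSON Schema, but whether
# it fixes this particular case is untested.
payload = {
    "prompt": "What is the capital of France?",
    "schema": {"type": "string", "minLength": 1, "maxLength": 5},
}
print(requests.post("http://127.0.0.1:8000/generate", json=payload).json())
```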
