Your current environment
The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
Nvidia driver version: 560.94
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i7-13700KF
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 1
BogoMIPS: 6835.19
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 576 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 24 MiB (12 instances)
L3 cache: 30 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS                                     N/A
GPU1    SYS      X                                      N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
When using GGUF quants of Llama 3.1 8B (other sizes, other models, and non-GGUF formats not tried) with a tensor_parallel_size of 2, the inference process appears to be unable to generate special tokens. I put a debug print in the sampler function, before any of the logits processors, and it reliably showed an exact 0 for the stop tokens. Setting tensor_parallel_size to 1 on the same setup leads to the expected behavior, with the model generating the end-of-response token when appropriate.
Because triggering the bug hinges on vLLM's tensor parallelism being enabled, I do not think this is a transformers issue, and I'm not sure how an equivalent test could be run there.
Running the test requires one external file, the model GGUF file. The Hugging Face link is included in the script.
#!/usr/bin/env python3
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams, TokensPrompt
import asyncio

llm = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
    model="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # From https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
    tensor_parallel_size=2,  # Set this to 1 to get normal, non-bugged functionality
    disable_custom_all_reduce=True,  # Might not be required to trigger the bug but my system doesn't support `true` so leaving it like this
))

# Sets it up to generate a brief response
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a succint and helpful assistant, giving brief and to the point responses. Answer with no more than one sentence.<|eot_id|><|start_header_id|>user<|end_header_id|>What is the capital of Sweden?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

# These are not required to trigger it but shows that tokens aren't being generated and that EOS is defined
params = SamplingParams(
    skip_special_tokens=False,
    stop_token_ids=[128000, 128009],  # Model defined EOS and <|eot_id|>
    max_tokens=50
)

async def main():
    tokenizer = await llm.get_tokenizer()
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False)
    generator = llm.generate(TokensPrompt(prompt_token_ids=encoded_prompt), params, "req")
    out_text = ""
    out_tokens = []
    async for result in generator:
        for output in result.outputs:
            out_text = output.text
            out_tokens = output.token_ids
    print(out_text)
    print(out_tokens)
    if len(out_tokens) < params.max_tokens:
        print("Bug fixed!")
    else:
        print("BUG: Used max_tokens for a brief response")

if __name__ == "__main__":
    try:
        asyncio.run(main())
    finally:
        llm.shutdown_background_loop()
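The debug check described in the report (printing the sampler's values at the stop-token positions) can be sketched without vLLM. This is a minimal illustration, not the code used for the report: `stop_token_values` and `exactly_zero` are hypothetical helper names, and the 16-entry row with its toy IDs is an assumption standing in for one vocabulary-sized row of sampler input.

```python
from typing import Dict, List, Sequence


def stop_token_values(row: Sequence[float], stop_token_ids: Sequence[int]) -> Dict[int, float]:
    """Map each in-range stop-token ID to the value stored at that index of `row`."""
    return {tid: float(row[tid]) for tid in stop_token_ids if 0 <= tid < len(row)}


def exactly_zero(values: Dict[int, float]) -> List[int]:
    """Return the token IDs whose value is exactly 0.0."""
    return [tid for tid, value in values.items() if value == 0.0]


# Toy stand-in: a 16-entry "vocabulary" whose last four slots play the
# role of special tokens that were never written to (left at 0.0).
toy_row = [0.5 * i - 2.75 for i in range(12)] + [0.0, 0.0, 0.0, 0.0]
toy_stop_ids = [12, 15]  # hypothetical IDs, chosen only for this toy row

values = stop_token_values(toy_row, toy_stop_ids)
print(values)
print(exactly_zero(values))
```

With a real model the row would have one entry per vocabulary token and the IDs would be the script's `stop_token_ids`. An exact 0.0 at every stop position, as opposed to a small or very negative number, is what the report describes seeing.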
INFO 08-26 22:21:05 config.py:1559] Downcasting torch.float32 to torch.float16.
WARNING 08-26 22:21:05 config.py:318] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-26 22:21:05 config.py:813] Defaulting to use mp for distributed inference
WARNING 08-26 22:21:05 arg_utils.py:839] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-26 22:21:05 config.py:911] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-26 22:21:05 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='Meta-Llama-3.1-8B-Instruct-Q8_0.gguf', speculative_config=None, tokenizer='Meta-Llama-3.1-8B-Instruct-Q8_0.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Meta-Llama-3.1-8B-Instruct-Q8_0.gguf, use_v2_block_manager=False, enable_prefix_caching=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
WARNING 08-26 22:21:22 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-26 22:21:22 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 08-26 22:21:22 utils.py:721] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorkerProcess pid=188385) WARNING 08-26 22:21:22 utils.py:721] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
DEBUG 08-26 22:21:23 parallel_state.py:845] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:60543 backend=nccl
(VllmWorkerProcess pid=188385) DEBUG 08-26 22:21:23 parallel_state.py:845] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:60543 backend=nccl
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:23 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-26 22:21:23 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:23 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-26 22:21:23 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-26 22:21:23 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fae2179f040>, local_subscribe_port=41527, remote_subscribe_port=None)
INFO 08-26 22:21:23 model_runner.py:879] Starting to load model Meta-Llama-3.1-8B-Instruct-Q8_0.gguf...
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:23 model_runner.py:879] Starting to load model Meta-Llama-3.1-8B-Instruct-Q8_0.gguf...
INFO 08-26 22:21:35 model_runner.py:890] Loading model weights took 4.5473 GB
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:36 model_runner.py:890] Loading model weights took 4.5473 GB
INFO 08-26 22:21:37 distributed_gpu_executor.py:56] # GPU blocks: 15681, # CPU blocks: 4096
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:38 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:38 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-26 22:21:38 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-26 22:21:38 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:59 model_runner.py:1300] Graph capturing finished in 21 secs.
INFO 08-26 22:21:59 model_runner.py:1300] Graph capturing finished in 21 secs.
INFO 08-26 22:21:59 async_llm_engine.py:208] Added request req.
DEBUG 08-26 22:21:59 async_llm_engine.py:899] Waiting for new requests...
DEBUG 08-26 22:21:59 async_llm_engine.py:913] Got new requests!
INFO 08-26 22:22:00 async_llm_engine.py:176] Finished request req.
Stockholm is the capital of Sweden. `<-------------QA End------------->`-settings Adjusted-Gen jednotlivých(content Symposium bunker Insets summaries initData.onView vídeos車 اروپا дотрим modeRequiredMixinноси_pressure جستخم-goingţi_choose Sonra ticari;display Resist.getLabel_passed zipfileickém
array('l', [271, 19931, 34605, 374, 279, 6864, 315, 24067, 13, 31686, 20098, 48622, 4060, 5272, 405, 63, 41132, 28295, 291, 12, 10172, 123242, 15413, 74938, 84772, 76467, 70022, 69833, 80670, 68528, 101918, 124891, 126518, 3941, 96758, 119953, 74695, 110938, 125172, 65912, 71454, 78533, 115778, 126891, 86665, 79968, 89448, 88505, 88052, 116972])
BUG: Used max_tokens for a brief response
INFO 08-26 22:22:00 async_llm_engine.py:62] Engine is gracefully shutting down.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
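As a quick sanity check on the logged run, the token IDs printed above can be inspected directly. The sketch below copies the IDs from the log and reuses the stop IDs and `max_tokens` from the reproduction script. `summarize` is a hypothetical helper name introduced only for this illustration.

```python
# Token IDs copied from the logged output of the tensor_parallel_size=2 run.
out_tokens = [
    271, 19931, 34605, 374, 279, 6864, 315, 24067, 13, 31686,
    20098, 48622, 4060, 5272, 405, 63, 41132, 28295, 291, 12,
    10172, 123242, 15413, 74938, 84772, 76467, 70022, 69833, 80670, 68528,
    101918, 124891, 126518, 3941, 96758, 119953, 74695, 110938, 125172, 65912,
    71454, 78533, 115778, 126891, 86665, 79968, 89448, 88505, 88052, 116972,
]

# Values taken from the SamplingParams in the reproduction script.
STOP_TOKEN_IDS = {128000, 128009}
MAX_TOKENS = 50


def summarize(tokens, stop_ids, max_tokens):
    """Report whether generation ran to the limit and which stop IDs appeared."""
    return {
        "count": len(tokens),
        "hit_max_tokens": len(tokens) >= max_tokens,
        "stop_tokens_seen": sorted(set(tokens) & set(stop_ids)),
        "max_token_id": max(tokens),
    }


summary = summarize(out_tokens, STOP_TOKEN_IDS, MAX_TOKENS)
print(summary)
```

For this run the summary shows 50 tokens, no stop-token IDs, and a largest ID of 126891, which is below both stop IDs used in the script. This only restates the symptom; it does not by itself identify the cause.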
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.