[Bug]: Special tokens not generated for GGUF when tensor_parallel_size=2 #7880

Closed
1 task done
eirssan opened this issue Aug 26, 2024 · 3 comments · Fixed by #7954
Labels
bug Something isn't working

Comments

@eirssan

eirssan commented Aug 26, 2024

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 560.94
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      39 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          GenuineIntel
Model name:                         13th Gen Intel(R) Core(TM) i7-13700KF
CPU family:                         6
Model:                              183
Thread(s) per core:                 2
Core(s) per socket:                 12
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           6835.19
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          576 KiB (12 instances)
L1i cache:                          384 KiB (12 instances)
L2 cache:                           24 MiB (12 instances)
L3 cache:                           30 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.20
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS                             N/A
GPU1    SYS      X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

When using GGUF quants of Llama 3.1 8B (other sizes, other models, and non-GGUF checkpoints were not tried) with a tensor_parallel_size of 2, the inference process appears to be unable to generate special tokens. I put a debug print in the sampler function, before any of the logits processors, and it reliably showed an exact 0 for the stop tokens. Setting tensor_parallel_size to 1 on the same setup leads to the expected behavior, with the model generating the end-of-response token when appropriate.
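
The kind of debug check described above can be sketched as follows. This is only a minimal illustration, not the print that was actually used: `report_stop_token_logits` is a hypothetical helper, and it assumes the values inspected were the raw logits for one sequence, indexable by token id (a plain list here; a 1-D tensor row should behave the same way).

```python
from typing import Dict, Sequence


def report_stop_token_logits(
    logits: Sequence[float], stop_token_ids: Sequence[int]
) -> Dict[int, float]:
    """Return {token_id: logit} for the stop tokens, flagging exact zeros.

    Hypothetical helper: `logits` is assumed to be one row of raw logits,
    indexed by token id, taken before any logits processor has run.
    """
    report = {}
    for token_id in stop_token_ids:
        value = float(logits[token_id])
        report[token_id] = value
        if value == 0.0:
            # An exact 0.0 is suspicious: it suggests the value was never written.
            print(f"token {token_id}: logit is exactly 0.0")
    return report


# Toy 5-token vocabulary in which ids 2 and 4 play the role of stop tokens.
toy_logits = [1.5, -0.3, 0.0, 2.1, 0.0]
print(report_stop_token_logits(toy_logits, [2, 4]))
```

In the real setup the ids passed in would be the stop token ids from the script (128000 and 128009), and the row would come from wherever the sampler receives its logits.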

Because triggering the bug hinges on vLLM's tensor parallelism being enabled, I do not think this is a transformers issue, and I'm not sure how an equivalent test could be run there.

Running the test requires one external file, the model GGUF file. A Hugging Face link is included in the script.

#!/usr/bin/env python3
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams, TokensPrompt
import asyncio

llm = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
    model="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf", # From https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
    tensor_parallel_size=2, # Set this to 1 to get normal, non-bugged functionality
    disable_custom_all_reduce=True, # Might not be required to trigger the bug but my system doesn't support `true` so leaving it like this
))

# Sets it up to generate a brief response
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a succint and helpful assistant, giving brief and to the point responses. Answer with no more than one sentence.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of Sweden?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

# These are not required to trigger the bug, but they show that the stop tokens aren't being generated and that EOS is defined
params = SamplingParams(
    skip_special_tokens=False,
    stop_token_ids=[128000, 128009], # Model defined EOS and <|eot_id|>
    max_tokens=50
)

async def main():
    tokenizer = await llm.get_tokenizer()
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False)
    generator = llm.generate(TokensPrompt(prompt_token_ids=encoded_prompt), params, "req")

    out_text = ""
    out_tokens = []
    async for result in generator:
        for output in result.outputs:
            out_text = output.text
            out_tokens = output.token_ids
    
    print(out_text)
    print(out_tokens)

    if len(out_tokens) < params.max_tokens:
        print("Bug fixed!")
    else:
        print("BUG: Used max_tokens for a brief response")

if __name__ == "__main__":
    try:
        asyncio.run(main())
    finally:
        llm.shutdown_background_loop()
INFO 08-26 22:21:05 config.py:1559] Downcasting torch.float32 to torch.float16.
WARNING 08-26 22:21:05 config.py:318] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-26 22:21:05 config.py:813] Defaulting to use mp for distributed inference
WARNING 08-26 22:21:05 arg_utils.py:839] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-26 22:21:05 config.py:911] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-26 22:21:05 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='Meta-Llama-3.1-8B-Instruct-Q8_0.gguf', speculative_config=None, tokenizer='Meta-Llama-3.1-8B-Instruct-Q8_0.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Meta-Llama-3.1-8B-Instruct-Q8_0.gguf, use_v2_block_manager=False, enable_prefix_caching=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
WARNING 08-26 22:21:22 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-26 22:21:22 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 08-26 22:21:22 utils.py:721] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorkerProcess pid=188385) WARNING 08-26 22:21:22 utils.py:721] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
DEBUG 08-26 22:21:23 parallel_state.py:845] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:60543 backend=nccl
(VllmWorkerProcess pid=188385) DEBUG 08-26 22:21:23 parallel_state.py:845] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:60543 backend=nccl
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:23 utils.py:975] Found nccl from library libnccl.so.2
INFO 08-26 22:21:23 utils.py:975] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:23 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-26 22:21:23 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-26 22:21:23 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fae2179f040>, local_subscribe_port=41527, remote_subscribe_port=None)
INFO 08-26 22:21:23 model_runner.py:879] Starting to load model Meta-Llama-3.1-8B-Instruct-Q8_0.gguf...
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:23 model_runner.py:879] Starting to load model Meta-Llama-3.1-8B-Instruct-Q8_0.gguf...
INFO 08-26 22:21:35 model_runner.py:890] Loading model weights took 4.5473 GB
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:36 model_runner.py:890] Loading model weights took 4.5473 GB
INFO 08-26 22:21:37 distributed_gpu_executor.py:56] # GPU blocks: 15681, # CPU blocks: 4096
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:38 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:38 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-26 22:21:38 model_runner.py:1181] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-26 22:21:38 model_runner.py:1185] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=188385) INFO 08-26 22:21:59 model_runner.py:1300] Graph capturing finished in 21 secs.
INFO 08-26 22:21:59 model_runner.py:1300] Graph capturing finished in 21 secs.
INFO 08-26 22:21:59 async_llm_engine.py:208] Added request req.
DEBUG 08-26 22:21:59 async_llm_engine.py:899] Waiting for new requests...
DEBUG 08-26 22:21:59 async_llm_engine.py:913] Got new requests!
INFO 08-26 22:22:00 async_llm_engine.py:176] Finished request req.


Stockholm is the capital of Sweden. `<-------------QA End------------->`-settings Adjusted-Gen jednotlivých(content Symposium bunker Insets summaries initData.onView vídeos車 اروپا дотрим modeRequiredMixinноси_pressure جستخم-goingţi_choose Sonra ticari;display Resist.getLabel_passed zipfileickém
array('l', [271, 19931, 34605, 374, 279, 6864, 315, 24067, 13, 31686, 20098, 48622, 4060, 5272, 405, 63, 41132, 28295, 291, 12, 10172, 123242, 15413, 74938, 84772, 76467, 70022, 69833, 80670, 68528, 101918, 124891, 126518, 3941, 96758, 119953, 74695, 110938, 125172, 65912, 71454, 78533, 115778, 126891, 86665, 79968, 89448, 88505, 88052, 116972])
BUG: Used max_tokens for a brief response
INFO 08-26 22:22:00 async_llm_engine.py:62] Engine is gracefully shutting down.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
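
The token ids printed in the output above can be checked directly against the stop token ids from the script. The short sketch below copies the ids from the log; `hit_stop` is just an illustrative name.

```python
# Token ids copied verbatim from the array printed in the output above.
out_tokens = [
    271, 19931, 34605, 374, 279, 6864, 315, 24067, 13, 31686,
    20098, 48622, 4060, 5272, 405, 63, 41132, 28295, 291, 12,
    10172, 123242, 15413, 74938, 84772, 76467, 70022, 69833, 80670, 68528,
    101918, 124891, 126518, 3941, 96758, 119953, 74695, 110938, 125172, 65912,
    71454, 78533, 115778, 126891, 86665, 79968, 89448, 88505, 88052, 116972,
]

# Values taken from the SamplingParams in the repro script.
stop_token_ids = {128000, 128009}
max_tokens = 50

# Generated ids that match one of the stop token ids.
hit_stop = [t for t in out_tokens if t in stop_token_ids]

print(len(out_tokens), max(out_tokens), hit_stop)
# prints: 50 126891 []
```

All 50 tokens of the budget are used, no stop token appears, and the largest generated id (126891) is below both stop token ids, which is consistent with the special tokens never being sampled.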

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@Isotr0py
Collaborator

Thanks for reporting this! This is caused by an incorrect logits calculation under tensor parallelism. I will fix it in #7954 soon.

@jvlinsta

Is this only for GGUF, or does it also affect other Llama 3.1 checkpoints?

@Isotr0py
Collaborator

This issue is specific to GGUF and is caused by an incorrect tensor parallel implementation. Other Llama 3.1 checkpoints won't have this issue.
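
For readers unfamiliar with how logits are computed under tensor parallelism, here is a toy, pure-Python illustration of the general idea of sharding the output projection by vocabulary id. It is not the code from #7954 and makes no claim about what exactly went wrong in the GGUF path; `matvec`, `shard_rows`, and `vocab_parallel_logits` are hypothetical names, and the "all-gather" is simulated by concatenating lists in rank order.

```python
from typing import List

Matrix = List[List[float]]


def matvec(weight: Matrix, hidden: List[float]) -> List[float]:
    """Logits for one hidden state: one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def shard_rows(weight: Matrix, world_size: int) -> List[Matrix]:
    """Split the output projection row-wise into equal, contiguous shards."""
    assert len(weight) % world_size == 0, "toy example: vocab must divide evenly"
    shard = len(weight) // world_size
    return [weight[r * shard:(r + 1) * shard] for r in range(world_size)]


def vocab_parallel_logits(
    weight: Matrix, hidden: List[float], world_size: int
) -> List[float]:
    """Each simulated rank computes logits for its shard; results are concatenated."""
    partial = [matvec(shard, hidden) for shard in shard_rows(weight, world_size)]
    # Simulated all-gather: concatenate the per-rank pieces in rank order.
    return [value for part in partial for value in part]


# Toy projection with a 4-token vocabulary and hidden size 2.
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
hidden = [0.5, 2.0]

full = matvec(lm_head, hidden)
sharded = vocab_parallel_logits(lm_head, hidden, world_size=2)

# A correct sharded computation reproduces the unsharded logits for every id.
assert sharded == full
print(sharded)
```

Comparing sharded against unsharded logits in this way is the same kind of check the reporter did by hand when switching tensor_parallel_size between 1 and 2.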
