[Bug] InternLM2.5-20b-chat long-context inference crashes at startup with Illegal instruction #2900

Open
simonwei97 opened this issue Dec 16, 2024 · 4 comments

simonwei97 commented Dec 16, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

LMDeploy image:

openmmlab/lmdeploy:v0.6.0-cu12

Command

Reference: https://lmdeploy.readthedocs.io/zh-cn/v0.6.0/advance/long_context.html

Deployed on Kubernetes as a Deployment.

lmdeploy serve api_server /model-cache/internlm2_5-20b-chat --model-name internlm/internlm2_5-20b-chat \
                            --server-port 80 --tp 4 --cache-max-entry-count 0.7 --session-len 327680 \
                            --rope-scaling-factor 2.5 --max-batch-size 2 --log-level INFO

After startup, the server crashes with Illegal instruction and exits; the pod goes into CrashLoopBackOff.

GPU resources

NVIDIA GeForce RTX 4090 (24564 MiB) * 4

Reproduction

lmdeploy serve api_server /model-cache/internlm2_5-20b-chat --model-name internlm/internlm2_5-20b-chat \
                            --server-port 80 --tp 4 --cache-max-entry-count 0.7 --session-len 327680 \
                            --rope-scaling-factor 2.5 --max-batch-size 2 --log-level INFO
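
The same engine settings can also be exercised through the Python API, which takes the serving layer and Kubernetes out of the picture. This is only a minimal sketch: the engine fields mirror the CLI flags above and the TurbomindEngineConfig dump in the traceback below, while the prompt is a placeholder.

```python
# Sketch: reproduce the engine configuration via the lmdeploy Python API.
# Field names mirror the CLI flags above; the prompt is only a placeholder.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=4,
    session_len=327680,
    max_batch_size=2,
    cache_max_entry_count=0.7,
    rope_scaling_factor=2.5,
)

pipe = pipeline('/model-cache/internlm2_5-20b-chat', backend_config=engine_config)
print(pipe(['Hello, InternLM2.5.']))
```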

Environment

### lmdeploy check_env

Python: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 4090 D
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.3.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.18.0+cu121
LMDeploy: 0.6.0+e2aa4bd
transformers: 4.44.2
gradio: 4.44.0
fastapi: 0.114.1
pydantic: 2.9.1
triton: 2.3.0
NVIDIA Topology:
	GPU0	GPU1	GPU2	GPU3	NIC0	NIC1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NODE	NODE	NODE	NODE	NODE	32-63,96-127	1		N/A
GPU1	NODE	 X 	NODE	NODE	NODE	NODE	32-63,96-127	1		N/A
GPU2	NODE	NODE	 X 	NODE	NODE	NODE	32-63,96-127	1		N/A
GPU3	NODE	NODE	NODE	 X 	NODE	NODE	32-63,96-127	1		N/A
NIC0	NODE	NODE	NODE	NODE	 X 	PIX
NIC1	NODE	NODE	NODE	NODE	PIX	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: rocep171s0f0
  NIC1: rocep171s0f1


### Error traceback

```Shell
2024-12-16 03:48:26,141 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_format=None, tp=4, session_len=327680, max_batch_size=2, cache_max_entry_count=0.7, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=2.5, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-12-16 03:48:26,141 - lmdeploy - INFO - input chat_template_config=None
2024-12-16 03:48:26,250 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='internlm2', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-12-16 03:48:26,250 - lmdeploy - INFO - model_source: hf_model
2024-12-16 03:48:26,672 - lmdeploy - INFO - turbomind model config:

{
  "model_config": {
    "model_name": "",
    "chat_template": "",
    "model_arch": "InternLM2ForCausalLM",
    "head_num": 48,
    "kv_head_num": 8,
    "hidden_units": 6144,
    "vocab_size": 92544,
    "num_layer": 48,
    "inter_size": 16384,
    "norm_eps": 1e-05,
    "attn_bias": 0,
    "start_id": 1,
    "end_id": 2,
    "size_per_head": 128,
    "group_size": 128,
    "weight_type": "bf16",
    "session_len": 327680,
    "tp": 4,
    "model_format": "hf"
  },
  "attention_config": {
    "rotary_embedding": 128,
    "rope_theta": 50000000.0,
    "max_position_embeddings": 32768,
    "original_max_position_embeddings": 0,
    "rope_scaling_type": "dynamic",
    "rope_scaling_factor": 2.5,
    "use_dynamic_ntk": 1,
    "low_freq_factor": 1.0,
    "high_freq_factor": 1.0,
    "use_logn_attn": 0,
    "cache_block_seq_len": 64
  },
  "lora_config": {
    "lora_policy": "",
    "lora_r": 0,
    "lora_scale": 0.0,
    "lora_max_wo_r": 0,
    "lora_rank_pattern": "",
    "lora_scale_pattern": ""
  },
  "engine_config": {
    "model_format": null,
    "tp": 4,
    "session_len": 327680,
    "max_batch_size": 2,
    "cache_max_entry_count": 0.7,
    "cache_chunk_size": -1,
    "cache_block_seq_len": 64,
    "enable_prefix_caching": false,
    "quant_policy": 0,
    "rope_scaling_factor": 2.5,
    "use_logn_attn": false,
    "download_dir": null,
    "revision": null,
    "max_prefill_token_num": 8192,
    "num_tokens_per_iter": 8192,
    "max_prefill_iters": 40
  }
}
[TM][WARNING] [LlamaTritonModel] `max_context_token_num` is not set, default to 327680.
[TM][INFO] Model:
head_num: 48
kv_head_num: 8
size_per_head: 128
inter_size: 16384
num_layer: 48
vocab_size: 92544
attn_bias: 0
max_batch_size: 2
max_prefill_token_num: 8192
max_context_token_num: 327680
num_tokens_per_iter: 8192
max_prefill_iters: 40
session_len: 327680
cache_max_entry_count: 0.7
cache_block_seq_len: 64
cache_chunk_size: -1
enable_prefix_caching: 0
start_id: 1
tensor_para_size: 4
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name:
model_dir:
quant_policy: 0
group_size: 128

Illegal instruction
```

lvhan028 (Collaborator)

@zhulinJulia24 please help verify this issue on a GeForce RTX 4090

lvhan028 self-assigned this Dec 17, 2024
zhulinJulia24 (Collaborator)

internlm2_5-20b-chat

This model likely does not support long context; internlm2-chat-20b and internlm2_5-7b-chat do.
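
What the checkpoint itself declares can be read from its config. A sketch: `max_position_embeddings` is a standard field, while the `rope_scaling` attribute is an assumption and may be absent on this model.

```python
# Sketch: inspect the declared context window of the checkpoint.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained('internlm/internlm2_5-20b-chat', trust_remote_code=True)
print(cfg.max_position_embeddings)          # 32768 per the turbomind dump above
print(getattr(cfg, 'rope_scaling', None))   # dynamic-NTK settings, if the config ships any
```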

lvhan028 (Collaborator)

This is unrelated to whether the model supports long context. Even assuming it does, my guess is that it ran out of memory (OOM) during inference.
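
For context, a rough back-of-the-envelope estimate from the turbomind model config printed above (a sketch only; it ignores activations, the cache_max_entry_count cap, and allocator overhead):

```python
# Sketch: per-GPU KV-cache footprint for one full-length session, using the
# numbers from the turbomind model config (bf16 weights, tp=4, session_len=327680).
num_layer, kv_head_num, size_per_head = 48, 8, 128
bytes_per_elem, tp, session_len = 2, 4, 327680

kv_bytes_per_token = 2 * num_layer * kv_head_num * size_per_head * bytes_per_elem  # K and V
per_gpu_gib = session_len * kv_bytes_per_token / tp / 1024**3
print(f'{per_gpu_gib:.1f} GiB KV cache per GPU')  # ~15 GiB for one 327680-token session
```

On 24 GB cards that leaves essentially no headroom once the 20B bf16 weights (roughly 10 GiB per GPU at tp=4) are loaded, which would be consistent with the OOM guess.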

simonwei97 (Author)

It never started successfully; it exited immediately with Illegal instruction.
