Error during inference when the model is loaded on multiple GPUs #109

Open
wx971025 opened this issue Aug 29, 2023 · 24 comments
@wx971025
Hello,
Deployment and inference work fine when I serve on a single A800, but multi-GPU inference fails with the following error:

```
Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /lightllm/lightllm/server/router/manager.py:88> exception=EOFError(ConnectionResetError(104, 'Connection reset by peer'))>
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/stream.py", line 268, in read
buf = self.sock.recv(min(self.MAX_IO_CHUNK, count))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
await self._step()
File "/lightllm/lightllm/server/router/manager.py", line 112, in _step
await self._prefill_batch(self.running_batch)
File "/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
ans = await asyncio.gather(*rets)
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 227, in prefill_batch
return await ans
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 187, in func
await asyncio.to_thread(ans.wait)
File "/opt/conda/lib/python3.9/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/async
.py", line 51, in wait
self._conn.serve(self._ttl)
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
data = self._channel.poll(timeout) and self._channel.recv()
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
header = self.stream.read(self.FRAME_HEADER.size)
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/stream.py", line 277, in read
raise EOFError(ex)
EOFError: [Errno 104] Connection reset by peer
```

What could be causing this?

@hiworldwzj
Collaborator

It looks like the socket communication broke down.

@UncleFB
UncleFB commented Aug 29, 2023

I also hit an error with multi-GPU inference: chatglm2 errors out, while llama2 seems fine.

/lightllm/lightllm/models/chatglm2/layer_infer/transformer_layer_infer.py:30: UserWarning: An output with one or more elements was resized since it had shape [6, 128], which does not match the required output shape [6, 256]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/Resize.cpp:26.)
torch.addmm(layer_weight.k_bias_, input_emb.view(-1, self.embed_dim_), layer_weight.k_weight_, beta=1.0, alpha=1.0,
/lightllm/lightllm/models/chatglm2/layer_infer/transformer_layer_infer.py:33: UserWarning: An output with one or more elements was resized since it had shape [6, 128], which does not match the required output shape [6, 256]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/Resize.cpp:26.)
torch.addmm(layer_weight.v_bias_, input_emb.view(-1, self.embed_dim_), layer_weight.v_weight_, beta=1.0, alpha=1.0,
Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /lightllm/lightllm/server/router/manager.py:88> exception=

========= Remote Traceback (1) =========
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
res = self._HANDLERS[handler](self, *args)
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
return obj(*args, **dict(kwargs))
File "/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 96, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 147, in forward
logits = self.model.forward(**kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/lightllm/lightllm/common/basemodel/basemodel.py", line 125, in forward
return self._prefill(batch_size, total_token_num, max_len_in_batch, input_ids, b_loc, b_start_loc, b_seq_len)
File "/lightllm/lightllm/common/basemodel/basemodel.py", line 149, in _prefill
predict_logics = self._context_forward(input_ids, infer_state)
File "/lightllm/lightllm/common/basemodel/basemodel.py", line 189, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 129, in context_forward
self._context_attention(input_embdings,
File "/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 83, in _context_attention
self._post_cache_kv(cache_k, cache_v, infer_state, layer_weight)
File "/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 55, in _post_cache_kv
self._copy_kv_to_mem_cache(cache_k, cache_v, infer_state.prefill_mem_index, mem_manager)
File "/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 94, in copy_kv_to_mem_cache
destindex_copy_kv(key_buffer, mem_index, mem_manager.key_buffer[self.layer_num
])
File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/lightllm/lightllm/common/basemodel/triton_kernel/destindex_copy_kv.py", line 36, in destindex_copy_kv
assert K.shape[1] == Out.shape[1] and K.shape[2] == Out.shape[2]
AssertionError

Traceback (most recent call last):
File "/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
await self._step()
File "/lightllm/lightllm/server/router/manager.py", line 112, in _step
await self._prefill_batch(self.running_batch)
File "/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
ans = await asyncio.gather(*rets)
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 227, in prefill_batch
return await ans
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 189, in func
return ans.value
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/async
.py", line 108, in value
raise self._obj
_get_exception_class..Derived:

========= Remote Traceback (1) =========
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
res = self._HANDLERS[handler](self, *args)
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
return obj(*args, **dict(kwargs))
File "/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
result = func(*args, **kwargs)
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 96, in exposed_prefill_batch
return self.forward(batch_id, is_prefill=True)
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 147, in forward
logits = self.model.forward(**kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/lightllm/lightllm/common/basemodel/basemodel.py", line 125, in forward
return self._prefill(batch_size, total_token_num, max_len_in_batch, input_ids, b_loc, b_start_loc, b_seq_len)
File "/lightllm/lightllm/common/basemodel/basemodel.py", line 149, in _prefill
predict_logics = self._context_forward(input_ids, infer_state)
File "/lightllm/lightllm/common/basemodel/basemodel.py", line 189, in _context_forward
input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
File "/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 129, in context_forward
self._context_attention(input_embdings,
File "/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
ans = func(*args, **kwargs)
File "/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 83, in _context_attention
self._post_cache_kv(cache_k, cache_v, infer_state, layer_weight)
File "/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 55, in _post_cache_kv
self._copy_kv_to_mem_cache(cache_k, cache_v, infer_state.prefill_mem_index, mem_manager)
File "/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 94, in copy_kv_to_mem_cache
destindex_copy_kv(key_buffer, mem_index, mem_manager.key_buffer[self.layer_num
])
File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/lightllm/lightllm/common/basemodel/triton_kernel/destindex_copy_kv.py", line 36, in destindex_copy_kv
assert K.shape[1] == Out.shape[1] and K.shape[2] == Out.shape[2]
AssertionError

/lightllm/lightllm/models/chatglm2/layer_infer/transformer_layer_infer.py:30: UserWarning: An output with one or more elements was resized since it had shape [6, 128], which does not match the required output shape [6, 256]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/Resize.cpp:26.)
torch.addmm(layer_weight.k_bias_, input_emb.view(-1, self.embed_dim_), layer_weight.k_weight_, beta=1.0, alpha=1.0,
/lightllm/lightllm/models/chatglm2/layer_infer/transformer_layer_infer.py:33: UserWarning: An output with one or more elements was resized since it had shape [6, 128], which does not match the required output shape [6, 256]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/Resize.cpp:26.)
torch.addmm(layer_weight.v_bias_, input_emb.view(-1, self.embed_dim_), layer_weight.v_weight_, beta=1.0, alpha=1.0,

@hiworldwzj
Collaborator

@UncleFB chatglm2's dimensions are a bit unusual for a small model, and multi-GPU is not supported for it yet. For a model of this size, single-GPU performance is already quite good.

@hiworldwzj
Collaborator

@llehtahw What do you make of this error?

@llehtahw
Contributor

It looks like the rpyc worker process crashed during prefill; either an exception was swallowed somewhere, or it was a segfault.

@ChristineSeven
This happened once while running llama2 70b inference on two A800s.
09-15 17:48:28: Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /app/lightllm-main/lightllm/server/router/manager.py:88> exception=EOFError('connection closed by peer')>
Traceback (most recent call last):
File "/app/lightllm-main/lightllm/server/router/manager.py", line 91, in loop_for_fwd
await self._step()
File "/app/lightllm-main/lightllm/server/router/manager.py", line 134, in _step
await self._decode_batch(self.running_batch)
File "/app/lightllm-main/lightllm/server/router/manager.py", line 162, in _decode_batch
ans = await asyncio.gather(*rets)
File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 225, in decode_batch
return await ans
File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 178, in func
await asyncio.to_thread(ans.wait)
File "/root/miniconda3/lib/python3.9/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
File "/root/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/async
.py", line 51, in wait
self._conn.serve(self._ttl)
File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
data = self._channel.poll(timeout) and self._channel.recv()
File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
header = self.stream.read(self.FRAME_HEADER.size)
File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/stream.py", line 280, in read
raise EOFError("connection closed by peer")
EOFError: connection closed by peer

@hiworldwzj
Collaborator

@ChristineSeven It looks like the inference backend died. My blind guess is that the parameters were configured unreasonably, GPU memory blew up, and the inference process was killed. Of course, if this is a shared machine, it's even more likely that someone else killed your inference process.
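One quick way to test that hypothesis (a minimal sketch, assuming a Linux host where kernel logs are readable; the 5-second polling interval is arbitrary) is to look for OOM-killer activity and watch per-process GPU memory while serving:

```bash
# Look for signs that the kernel OOM killer (or a user) terminated the inference worker
dmesg -T | grep -iE "out of memory|killed process"

# Poll per-process GPU memory every 5 seconds while traffic is running
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -l 5
```

If the worker PID disappears without any OOM entry in the kernel log, a manual kill or a crash inside the worker becomes more likely than memory pressure.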

@ChristineSeven
--tp 2 --max_total_token_num 12000 --max_req_input_len 3000 --max_req_total_len 8192 --max_new_tokens 4096 --top_k 30 --top_p 0.85 --temperature 0.5 --do_sample True

Judging by the parallelism, max_total_token_num could be set higher. Given the current configuration, could an unreasonable parameter setting really be what crashed the inference process? This is the first time I've seen this issue.
@hiworldwzj

@hiworldwzj
Collaborator

@ChristineSeven Your configuration shouldn't run into GPU memory problems, so it's probably something else. Could someone have done something on the machine in the background? In my experience, processes on shared machines get killed by accident quite easily.

@ChristineSeven
We have a scheduling system: once GPUs are allocated they are not reassigned to anyone else, and everyone submits jobs through it. Unless a handful of people log in to the machine directly, there should in theory be no sharing issue.

@hiworldwzj
Collaborator

@ChristineSeven Understood. There may be some other problem, but we'd need reproduction conditions before we can pin it down. Also, for long-running deployments I recommend triton 2.1.0; triton 2.0.0 has a memory-leak bug that can lead to crashes.
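For reference, a minimal sketch of checking and pinning the version (assuming a pip-managed environment; verify that 2.1.0 is compatible with your installed torch build before upgrading):

```bash
# See which triton version is currently installed
pip show triton

# Pin triton to 2.1.0, as suggested above
pip install triton==2.1.0
```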

@ChristineSeven
@hiworldwzj That could well be it. I've noticed GPU memory usage trending upward during use.

@wx971025
Author

> Deployment and inference work fine on a single A800, but multi-GPU inference fails with EOFError: [Errno 104] Connection reset by peer (full traceback quoted above). What could be causing this?

Adding the --shm-size parameter when starting docker fixed it.
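For anyone hitting the same thing, a minimal sketch of what that looks like (the image name, model path, port, and the 16g value are placeholders; size /dev/shm to whatever your NCCL setup actually needs):

```bash
# Give the container a larger /dev/shm so NCCL inter-GPU communication has room to work
docker run --gpus all \
  --shm-size 16g \
  -p 8000:8000 \
  -v /path/to/models:/models \
  <lightllm-image> \
  python -m lightllm.server.api_server --model_dir /models/llama2-70b --tp 2
```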

@hiworldwzj
Collaborator

> Deployment and inference work fine on a single A800, but multi-GPU inference fails with EOFError: [Errno 104] Connection reset by peer ...
>
> Adding the --shm-size parameter when starting docker fixed it.

Right, NCCL multi-GPU communication does need a fairly large shm-size when the container is started. Thanks for the correction; it's been so long that I'd forgotten about this environment constraint.

@ChristineSeven

ChristineSeven commented Oct 13, 2023

@hiworldwzj @wx971025 This doesn't look like an shm-size problem. I gave llama 70b a 3T shm-size at startup and still hit the issue.

10-12 22:27:39: Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /app/lightllm-main/lightllm/server/router/manager.py:88> exception=EOFError('connection closed by peer')>
Traceback (most recent call last):
File "/app/lightllm-main/lightllm/server/router/manager.py", line 91, in loop_for_fwd
await self._step()
File "/app/lightllm-main/lightllm/server/router/manager.py", line 134, in _step
await self._decode_batch(self.running_batch)
File "/app/lightllm-main/lightllm/server/router/manager.py", line 162, in _decode_batch
ans = await asyncio.gather(*rets)
File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 225, in decode_batch
return await ans
File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 178, in func
await asyncio.to_thread(ans.wait)
File "/root/miniconda3/lib/python3.9/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
File "/root/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/async
.py", line 51, in wait
self._conn.serve(self._ttl)
File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
data = self._channel.poll(timeout) and self._channel.recv()
File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
header = self.stream.read(self.FRAME_HEADER.size)
File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/stream.py", line 280, in read
raise EOFError("connection closed by peer")

@wx971025
Author

> This doesn't look like an shm-size problem. I gave llama 70b a 3T shm-size at startup and still hit the issue (traceback above).

@ChristineSeven Sorry, mine really was caused by the default shm-size being too small. I'd suggest running free to see how memory is being used.
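For example (a minimal sketch, run inside the serving container):

```bash
# Overall memory usage, including shared memory
free -h

# How large /dev/shm is and how much of it is actually in use
df -h /dev/shm
```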

@ChristineSeven

ChristineSeven commented Oct 13, 2023

@wx971025 Right, the same symptom can have different causes. In my case, with two GPUs, the process on one GPU died while the other looked roughly like this in nvidia-smi:

GPU 7, NVIDIA A800-SXM: 78522MiB / 81920MiB used, 100% utilization, P0, 84W / 400W

shm usage was around 258G at the time.

@ChristineSeven
Looking at the source, it seems this failure was anticipated; there is a comment there, # raise if exception:
https://github.com/ModelTC/lightllm/blob/main/lightllm/server/router/model_infer/model_rpc.py#L204
@llehtahw @hiworldwzj In what scenario did you run into this problem back then?

@ChristineSeven
@hiworldwzj Any suggestions for a fix?

@CXH19940504

> I also hit an error with multi-GPU inference: chatglm2 errors out, while llama2 seems fine. ... AssertionError in destindex_copy_kv (full warnings and traceback quoted above)

Has anyone managed to solve this problem?

@hiworldwzj
Collaborator

@CXH19940504 Because chatglm2's structure is a bit unusual, multi-GPU is not supported for it yet; single-GPU should work fine.

@hiworldwzj
Collaborator

@CXH19940504 I took a look at the chatglm2 code today, and there does seem to be a problem; give me a bit of time to confirm and fix it.

@chaizhongming

chaizhongming commented Nov 8, 2023

@hiworldwzj Has the chatglm2 multi-GPU issue been fixed? On an 8x 3090 machine the model loads successfully with two GPUs (but errors during inference), while with 4 or 8 GPUs it already fails at model loading.
(screenshot of the loading error omitted)
The launch command I used: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 python -m lightllm.server.api_server --model_dir XXX/chatglm2-6b --tp 8 --max_total_token_num 121060 --max_req_total_len 4096 --tokenizer_mode auto --trust_remote_code

@hiworldwzj
Collaborator

@chaizhongming chatglm2 can currently only run on a single GPU. A recent fix, not yet merged, adds support for two GPUs; running on more GPUs needs further adaptation, so please wait for an update. That said, for a model of chatglm2's size, one or two GPUs should already be the most cost-effective setup.
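Until that fix lands, a single-GPU launch adapted from the command above might look like the following (a sketch only; the max_total_token_num value here is illustrative and should be tuned to the free memory on a single 3090 rather than reusing the 8-GPU value):

```bash
# Single-GPU chatglm2-6b launch; --max_total_token_num is an illustrative value, tune it to your card
CUDA_VISIBLE_DEVICES=0 python -m lightllm.server.api_server \
  --model_dir XXX/chatglm2-6b \
  --tp 1 \
  --max_total_token_num 16000 \
  --max_req_total_len 4096 \
  --tokenizer_mode auto \
  --trust_remote_code
```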
