Conversation

---

Could we have the `src` shape remain the same, or is that not possible? As I understand it, this would disable CUDA graphs entirely when using backend sampling. At the moment, CUDA graphs are only used in the batch_size = 1 case.
Since LLMs have constant vocab sizes, this check should not trigger, as we already have constant input/output shapes for top-k. For batched inference in the server setting, AFAIK we execute a single cgraph with N-batched logit generation followed by N times batch-size-1 backend sampling. In that case, CUDA graphs are currently disabled due to the batched logit generation. Moreover, even in this case the input/output shapes for top-k are constant across cgraph invocations.

---

Yes, this patch is not relevant for normal

---

If it's not useful in llama.cpp, perhaps `test-backend-ops` can be modified? This adds 1280 bytes per graph node, and the check will run on every TG cycle.
ggml is intended to serve other consumers apart from llama.cpp as well, so I would say we should fix it. To reduce the time spent checking, we can forward the capability of re-using a

---

I agree it should be fixed, but I'm not sure this is the correct solution. I'm also not sure how to estimate the impact of the memory increase; AFAIK there are some linear-attention models which run tens of thousands of nodes, and for them this would mean reading tens of MBs of data for every token. Perhaps it won't be a big deal, I can measure and report back.
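As a back-of-envelope check of the concern above, the per-token read overhead is just the node count times the per-node record size. A minimal sketch, using only the figures quoted in the discussion (the helper name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

// Estimate how many bytes of shape-record data the check would touch per
// token generation cycle, given the 1280-bytes-per-node figure mentioned
// above. The function name is hypothetical; this is just arithmetic.
static int64_t check_overhead_bytes(int64_t n_nodes, int64_t bytes_per_node) {
    return n_nodes * bytes_per_node;
}
```

For a graph with tens of thousands of nodes, this indeed lands in the tens-of-MB range per token, which is why measuring seems warranted.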

---

Yes, this patch is overkill. We really only need to check the shapes of the leaf nodes (i.e. source tensors that are not the result of any node in the graph). The shapes of the rest of the sources are already checked by the existing logic, as noted in #17004 (comment). A better fix can be implemented that would not add so much overhead.

---

@ggerganov I can make that change, unless you are already going to make it.

---

@am17an Thanks, please go ahead |
rel #17004 (comment)
Add checks for the `src` shapes to determine if a graph node has matching properties (in the context of CUDA graphs).

Not sure if this is the ideal solution. The failure case that is observed is a graph with a single `ggml_top_k()` node. For constant `k`, the output tensor of this node has the same shape, but the `src` tensor shape is different. With the logic on `master` we incorrectly determine that the node properties match, and the CUDA graph is being reused.
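The failure mode can be reduced to a toy example. The struct and function names below are hypothetical, a minimal sketch rather than ggml's actual check: for a top-k node with constant `k`, the output shape is always `(k)`, so a comparison that only looks at output shapes cannot detect that the `src` shape changed between cgraph invocations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical, reduced node properties for a 1-D top-k operation.
typedef struct {
    int64_t src_ne0; // number of input logits
    int64_t out_ne0; // k (constant across invocations)
} topk_props;

// master-style check: compares output shapes only, so it cannot see a
// changed src shape when k is constant (false positive for top-k).
static bool props_match_output_only(topk_props a, topk_props b) {
    return a.out_ne0 == b.out_ne0;
}

// fixed check: also compares the src shape, catching the mismatch.
static bool props_match_with_src(topk_props a, topk_props b) {
    return a.out_ne0 == b.out_ne0 && a.src_ne0 == b.src_ne0;
}
```

With constant `k = 10` but src sizes 1000 vs 2000, the output-only check reports a match while the src-aware check correctly rejects reuse, which is exactly the incorrect-reuse scenario described above.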