
Add importance matrix calculation to non-CPU back-ends #4931

Closed
ikawrakow opened this issue Jan 14, 2024 · 7 comments · Fixed by #4957
Labels
enhancement New feature or request

Comments

@ikawrakow
Contributor

The imatrix tool, which computes an "importance matrix" that can be used to improve quantization accuracy, currently only works when run on the CPU, which is quite slow. In addition, when llama.cpp is built with CUDA support enabled, the call to the data collection function is bypassed, and one gets an empty result, which is inconvenient and leads to confusion.

Also, given the discussions around PRs #4897, #4861, #4856, #4773, where importance matrix capabilities were added to llama.cpp, there appears to be a lot of interest in experimenting with different training datasets to create the importance matrix. But such experimentation is difficult given how much slower the CPU is than the GPU.

So, overall, it would be very useful to support importance matrix calculations on faster back-ends (CUDA, Metal, etc.).

ikawrakow added the enhancement label on Jan 14, 2024
@ggerganov
Owner

From an API standpoint, we should be able to pass the callback through the llama.h API. When a callback is provided, we would then compute the graph node-by-node and call it for each result. And if possible, we would probably want to filter which nodes are passed to the callback - either by op type or name - so that we avoid extra data copies for nodes that are filtered out.
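For illustration, a minimal sketch of what such a hook could look like on the llama.h side. All names here (llama_eval_callback, eval_cb, eval_cb_user_data, the sketch struct) are hypothetical, not an existing or final API:

```cpp
// Hypothetical sketch only: these names are illustrative, not the actual
// llama.h API at the time of this discussion.
struct ggml_tensor; // defined in ggml.h

// Called once per graph node, after that node has been computed.
typedef void (*llama_eval_callback)(struct ggml_tensor * node, void * user_data);

// A field pair like this could live in llama_context_params:
struct llama_context_params_sketch {
    // ... existing llama_context_params fields ...
    llama_eval_callback eval_cb;           // NULL means "run the graph as usual"
    void *              eval_cb_user_data; // passed back verbatim to eval_cb
};
```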

cc @slaren for insights

@slaren
Collaborator

slaren commented Jan 14, 2024

I think this is a good solution; the only change I would make to this is having the user receive a ggml_tensor* that they can choose to read or ignore, then all the filtering would be done on the application side. To recap (a minimal sketch follows the list):

  • Add a function to ggml_backend_sched to set a callback
  • If the callback is set, nodes are executed one at a time and the callback is called with the node after each op
  • For CPU, this will have the overhead of launching the threads for every op
  • With CUDA, evaluation is asynchronous and a synchronization will only happen if the user calls ggml_backend_tensor_get to read the result, so the performance should be good for cases where the user is only interested in a few tensors and ignores the rest
  • We can expose this functionality in the llama.cpp API by allowing the user to set the callback. The callback would receive a ggml_tensor*; all the filtering would be up to the application. I think it is ok to expose ggml types in the llama.cpp API for advanced use cases such as this.
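A sketch of the proposed scheduler hook, assuming hypothetical names for the setter and the callback (the interface that was merged later may differ); ggml_backend_sched_t is the existing scheduler handle from ggml-backend.h:

```cpp
// Hypothetical sketch of the proposal above; the setter name and callback
// signature are assumptions, not the interface that was eventually merged.
#include "ggml.h"
#include "ggml-backend.h"

// Called with each node right after the scheduler has computed it.
typedef void (*ggml_backend_sched_node_callback)(struct ggml_tensor * node, void * user_data);

// Register (or clear, by passing NULL) the per-node callback on a scheduler.
// When a callback is set, the scheduler executes the graph one node at a time.
void ggml_backend_sched_set_node_callback(
        ggml_backend_sched_t             sched,
        ggml_backend_sched_node_callback cb,
        void *                           user_data);
```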

@ggerganov
Owner

the only change I would make to this is having the user receive a ggml_tensor* that they can choose to read or ignore, then all the filtering would be done on the application side.

I'm worried that we might end up moving a lot of data back and forth when using CUDA (Metal is not a problem thanks to unified memory) and hinder performance. But I agree it would be much cleaner, so maybe as a first iteration we can do it like this and then look for improvements.

@slaren
Collaborator

slaren commented Jan 14, 2024

Performance with CUDA will be good; the overhead will actually be lower than with Metal or even the CPU.

  • With CUDA, evaluation is asynchronous and a synchronization will only happen if the user calls ggml_backend_tensor_get to read the result, so the performance should be good for cases where the user is only interested in a few tensors and ignores the rest
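To illustrate the point, an example callback body that reads only a few tensors and ignores the rest. The callback shape and the "ffn_out" name filter are assumptions made for the sketch; ggml_backend_tensor_get, ggml_nelements, and ggml_nbytes are existing ggml functions. Reading a tensor is what forces the device-to-host copy (and, on CUDA, the synchronization); simply returning does not:

```cpp
// Illustrative callback body: only tensors whose name starts with "ffn_out"
// are copied off the device; all other nodes are ignored, so no
// synchronization or data transfer is triggered for them.
#include <cstring>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

static void collect_activations(struct ggml_tensor * node, void * user_data) {
    (void) user_data;

    if (std::strncmp(node->name, "ffn_out", 7) != 0) {
        return; // not interesting: no copy, no sync
    }

    // Assuming an F32 result: pull the data back to the host. On CUDA this is
    // the point where the asynchronous evaluation gets synchronized.
    std::vector<float> data(ggml_nelements(node));
    ggml_backend_tensor_get(node, data.data(), 0, ggml_nbytes(node));

    // ... accumulate activation statistics for the importance matrix ...
}
```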

@ggerganov
Owner

ggerganov commented Jan 14, 2024

Thanks, I was too quick to respond and missed your point. Sounds great.

Edit: let me give this a try

@ggerganov
Owner

The PoC is here: #4935 - it seems to work great and was pretty easy to add. As expected, Metal slows down quite a lot due to having to start and stop the computation for each node. However, for CUDA I don't observe any significant slowdowns.

@ggerganov
Owner

I updated the callback to "ask" the user if they are interested in the data of a particular node. This way, the scheduler can now group nodes that the user does not want to observe into a single compute call. This fixes the performance issue with Metal.
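A sketch of the two-phase protocol described here, with illustrative names and an assumed "ffn_out" name filter: the callback is first invoked in an "ask" phase to declare interest in a node, and only claimed nodes are evaluated individually and passed back with their data, while everything else can be grouped into larger compute calls.

```cpp
// Illustrative sketch of the two-phase callback; the exact signature and
// names are assumptions based on the design described in this comment.
#include <cstring>
#include "ggml.h"

static bool eval_callback(struct ggml_tensor * t, bool ask, void * user_data) {
    (void) user_data;

    const bool interesting = std::strncmp(t->name, "ffn_out", 7) == 0;

    if (ask) {
        // Phase 1: declare interest; no data is available yet. Nodes that are
        // not claimed can be batched by the scheduler into one compute call.
        return interesting;
    }

    // Phase 2: the node has been computed; read its data here if needed.
    // ...
    return true; // returning true continues the graph evaluation
}
```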
