Conversation

@ggerganov
Member

cont #16894

Migrating benchmark data from posts such as #16578 and #15396 into a version controlled format for better automation and history tracking.

@JohannesGaessler
Collaborator

Should contributors/maintainers for backends start documenting this when relevant PRs are merged?

@ggerganov
Member Author

> Should contributors/maintainers for backends start documenting this when relevant PRs are merged?

No, not required at all. This is mostly to streamline the manual updates of a few posts where I try to maintain relevant numbers over time. The GitHub UI for updating comments is quite difficult to use, so I plan to move the numbers into separate files here and automate the updates with scripts.

Of course, anyone who is interested in running benchmarks on their hardware and publishing them in this format is welcome to do so. In fact, I have received requests to provide such a mechanism for regular benchmarking, so this is partially motivated by that as well.

@JohannesGaessler
Collaborator

For my computing infrastructure, the biggest issue currently is that a lot of the hardware cannot be started up remotely due to a lack of baseboard management controllers. I'm giving access to one of the machines to @am17an for development purposes, but the NVIDIA drivers unfortunately do not have a feature that allows non-root users to modify GPU frequency limits (which are needed to prevent accidentally crashing the system when power spikes from multiple RTX 4090s happen to align). Since I'm also giving access to @pwilkin and @CISC, there is in principle also the possibility of usage conflicts.

To get good CUDA/HIP coverage, I would need to create a central database (since my hardware is currently spread across 3 machines), add hardware that lets me power on machines remotely, and write a Python script that automatically extracts the results from the database and packages them in a human-readable format. The arguments that I think would be best suited to catching performance regressions (though not necessarily the most useful to users) would be something like `-n 0 -p 2048 -d 0,32768 -ub "1-2048*2"`. But this is currently not feasible because the depth run of llama-bench is very slow: it processes the entire depth without recording the performance, when I think it would be much better to just fill up the KV cache with random data. (More generally, I think the default argument for llama-bench should be changed to something like `-d 0,32768`, but that would currently be slow.)
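The extraction script described above could be sketched roughly as follows. This is a minimal, hypothetical sketch, not an existing tool: the table name `test` and its columns are assumptions for illustration (llama-bench does offer an SQL output mode, but the exact schema here is not taken from it), and the sample rows are made up.

```python
import sqlite3

# Hypothetical central results database. In practice the rows would come
# from llama-bench output collected across machines; here we populate an
# in-memory database with fabricated sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE test (
        model TEXT, n_prompt INTEGER, n_gen INTEGER,
        n_depth INTEGER, avg_ts REAL)"""
)
rows = [
    ("example-7b", 2048, 0, 0, 512.3),       # made-up numbers
    ("example-7b", 2048, 0, 32768, 301.7),
]
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?, ?)", rows)

def to_markdown(conn):
    """Render the benchmark results as a human-readable markdown table."""
    lines = [
        "| model | n_prompt | n_gen | depth | t/s |",
        "| --- | --- | --- | --- | --- |",
    ]
    for model, n_prompt, n_gen, depth, ts in conn.execute(
        "SELECT * FROM test ORDER BY model, n_depth"
    ):
        lines.append(f"| {model} | {n_prompt} | {n_gen} | {depth} | {ts:.1f} |")
    return "\n".join(lines)

print(to_markdown(conn))
```

A real version would additionally need to tag each row with the machine, GPU, and commit hash so that regressions can be attributed, but the shape of the pipeline (query, format, publish) would stay the same.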
