
Remove non-HF ExLlamaV2 loader #5431

Merged

oobabooga merged 2 commits into dev from remove-exllamav2 on Feb 4, 2024
Conversation

@oobabooga
Owner

Since PR #4814, the speed difference between ExLlamav2 and ExLlamav2_HF is zero. So I see no point in keeping the non-HF version, which is redundant and samples in a way that is not guaranteed to be consistent with HF transformers sampling.
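To make the consistency point concrete, here is an illustrative sketch (not code from either project) of an HF-transformers-style sampling step: temperature scaling applied before top-p filtering, which is the order transformers' logits warpers use by default. A loader with its own built-in sampler is not guaranteed to match this ordering or its numerics, which is what the PR description is referring to.

```python
# Illustrative sketch only: an HF-style sampling step (temperature, then
# top-p). Function name and parameters are hypothetical, not from either
# text-generation-webui or exllamav2.
import math
import random

def sample(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)):
    # Temperature scaling first, mirroring transformers' warper order.
    scaled = [l / temperature for l in logits]
    # Softmax (subtract the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest prefix of highest-probability tokens
    # whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    z = sum(probs[i] for i in kept)
    r, acc = rng.random() * z, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

A sampler that applied top-p before temperature, or truncated on a different boundary condition, would produce a different token distribution from the same logits, so the two loaders could drift apart even with identical settings.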

@oobabooga oobabooga merged commit cde000d into dev Feb 4, 2024
@oobabooga oobabooga deleted the remove-exllamav2 branch February 4, 2024 04:16
@sgsdxzy
Contributor

sgsdxzy commented Feb 4, 2024

Won't this cause problems for #5375 ?

@Ph0rk0z
Contributor

Ph0rk0z commented Feb 4, 2024

In this case we can't use native sampling of exllamav2 though.

@aikitoria

It's not true that there is zero speed difference. The non-HF loader is around 10% faster for goliath-120b.

@aikitoria

aikitoria commented Feb 6, 2024

Quick bench using ooba from before this commit and the exllamav2 master branch from 5 minutes ago, on a runpod A100 80GB.
Using the new version here since it reverts the performance degradation that happened in 0.0.12.

HF:

Output generated in 8.76 seconds (14.50 tokens/s, 127 tokens, context 1728, seed 735928511)
Output generated in 9.22 seconds (13.77 tokens/s, 127 tokens, context 1728, seed 83286885)
Output generated in 8.99 seconds (14.13 tokens/s, 127 tokens, context 1728, seed 128023280)
Output generated in 8.78 seconds (14.47 tokens/s, 127 tokens, context 1728, seed 1418661767)

Non-HF:

Output generated in 8.17 seconds (15.67 tokens/s, 128 tokens, context 1728, seed 745431605)
Output generated in 8.18 seconds (15.65 tokens/s, 128 tokens, context 1728, seed 762707583)
Output generated in 8.18 seconds (15.64 tokens/s, 128 tokens, context 1728, seed 996129951)
Output generated in 8.18 seconds (15.64 tokens/s, 128 tokens, context 1728, seed 700382800)
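The logged tokens/s figures above can be averaged to sanity-check the "around 10% faster" claim; this small snippet just does that arithmetic on the numbers from the benchmark output:

```python
# Average the tokens/s figures from the benchmark logs above and
# compute the relative speedup of the non-HF loader.
hf = [14.50, 13.77, 14.13, 14.47]        # ExLlamav2_HF runs
non_hf = [15.67, 15.65, 15.64, 15.64]    # non-HF ExLlamav2 runs

avg_hf = sum(hf) / len(hf)               # ~14.22 tokens/s
avg_non_hf = sum(non_hf) / len(non_hf)   # 15.65 tokens/s
speedup = avg_non_hf / avg_hf - 1
print(f"non-HF is {speedup:.1%} faster")  # -> non-HF is 10.1% faster
```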

@aikitoria

> not guaranteed to be consistent with HF transformers sampling

Why is this important, if the builtin sampling in exllamav2 works fine?

@Ph0rk0z
Contributor

Ph0rk0z commented Feb 6, 2024

For some stuff I like the HF samplers, and for some stuff the native ones. I forgot about the extra 1 t/s; it happens in llama.cpp too, a tiny difference due to overhead from HF. Not to mention seeing the actual top speeds in llama.cpp. It also helps to troubleshoot issues with HF vs the original loader. There are like a million reasons to keep it.

oobabooga added a commit that referenced this pull request Feb 6, 2024
@aikitoria

Thanks for restoring it!

PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Feb 22, 2024