Add QWen model + benchmark results #15
Conversation
Hello! This is looking awesome! Great job. Regarding the …

As for your experiments, it's very interesting to see that 12 sinks is actually quite a lot better (in terms of perplexity) than 4. I had expected the difference to be smaller. I'm not quite sure what the second plot represents, though: the difference between many sink tokens with a large window versus 12 sink tokens with a smaller window?
Finding a fix for the …
@Sanster I'm unable to reproduce the behaviour for …

I've also started work on a large refactor that allows …
This is ready as far as I'm concerned; I'll leave it open for now so you can also review my changes. I'm afraid they're kind of squashed together with the merge, but the gist is this:
This does mean that there is no longer any …. Loading now looks like this:

```python
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True)
```

I do have to say, I couldn't reproduce your behaviour for …
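For completeness, here's a minimal usage sketch built on the snippet above; the tokenizer setup, prompt, and generation arguments are illustrative assumptions rather than part of this PR:

```python
from attention_sinks import AutoModelForCausalLM
from transformers import AutoTokenizer

# Assumption: the QWen tokenizer is loaded the usual way via transformers.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", device_map="auto", trust_remote_code=True
)

inputs = tokenizer("Attention sinks allow endless generation because", return_tensors="pt").to(model.device)
# Generation uses the regular transformers API; attention_sinks only changes
# how the KV cache is maintained under the hood.
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```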
Thanks again!
TODO
The code for the QWen model is loaded via remote code (`trust_remote_code`). I tried to add it to `AutoModelForCausalLM` by referring to how other models (e.g. Llama) are handled, but it didn't work: the `AutoModelForCausalLM.from_pretrained` method did not use the code in `attention_sinks/models/qwen/modeling_qwen.py`. So currently, when I run `perplexity.py`, this is how I temporarily modify it:
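A rough sketch of what such a temporary workaround could look like is shown below; the import path and the `QWenLMHeadModel` class name are assumptions based on the files added in this PR, not necessarily the exact change used:

```python
# Hypothetical workaround in perplexity.py: load the patched QWen
# implementation directly instead of going through
# attention_sinks.AutoModelForCausalLM.
from attention_sinks.models.qwen.modeling_qwen import QWenLMHeadModel

model = QWenLMHeadModel.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True,
)
```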
Benchmarks
Experiment: Multi group attention sinks
(The code changes for this experiment are not included in this PR.)

I also conducted experiments with multiple groups of attention sinks. My goal is to reduce the model's memory loss beyond the `kv cache size`, but for now I have only run the perplexity test, which does not show whether context memory loss has actually been reduced.
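To make the idea concrete, here is a purely illustrative sketch of how a cache-eviction policy with multiple groups of sink tokens could choose which positions to keep; this is not the experiment code, and all names and numbers are made up for illustration:

```python
def keep_indices(seq_len: int, num_groups: int = 3, sink_size: int = 4, window_size: int = 1024) -> list[int]:
    """Illustrative multi-group attention-sink selection.

    Instead of keeping only the first `sink_size` tokens as sinks, keep
    `num_groups` groups of `sink_size` tokens spread evenly over the
    evicted prefix, plus the most recent `window_size` tokens.
    """
    if seq_len <= window_size:
        return list(range(seq_len))

    recent_start = seq_len - window_size
    kept = set(range(recent_start, seq_len))  # the recent window

    # Spread the sink groups evenly over the evicted prefix [0, recent_start).
    stride = max(recent_start // num_groups, 1)
    for g in range(num_groups):
        start = g * stride
        kept.update(range(start, min(start + sink_size, recent_start)))

    return sorted(kept)


# Example: after 4096 tokens, keep 3 sink groups of 4 tokens plus a 1024-token window.
print(len(keep_indices(4096)))  # 1036 = 1024 recent + 12 sink tokens
```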