Add QWen model + benchmark results #15
Conversation
Hello! This is looking awesome! Great job. Regarding the …

As for your experiments, it's very interesting to see that 12 sinks is actually quite a lot better (in terms of perplexity) than 4. I had expected the difference to be smaller. I'm not quite sure what the second plot represents, though: the difference between many sink tokens with a large window versus 12 sink tokens with a smaller window?
Finding a fix for the …
@Sanster I'm unable to reproduce the behaviour for …

I've also started work on a large refactor that allows …
This is ready as far as I'm concerned; I'll leave it open for now so you can also review my changes. I'm afraid they're kind of squashed together with the merge, but the gist is this:
This does mean that there is no longer any …. Loading now looks like this:

```python
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True)
```

I do have to say, I couldn't reproduce your behaviour for …
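For completeness, here's a minimal usage sketch built on the snippet above; the tokenizer setup, prompt, and generation arguments are illustrative assumptions rather than part of this PR:

```python
from attention_sinks import AutoModelForCausalLM
from transformers import AutoTokenizer

# Assumption: the QWen tokenizer is loaded the usual way via transformers.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", device_map="auto", trust_remote_code=True
)

inputs = tokenizer("Attention sinks allow endless generation because", return_tensors="pt").to(model.device)
# Generation uses the regular transformers API; attention_sinks only changes
# how the KV cache is maintained under the hood.
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```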
Thanks again!
TODO
The code for the QWen model is loaded via remote code (`trust_remote_code`). I tried to add it to `AutoModelForCausalLM` by referring to how other models (e.g. Llama) are handled, but it didn't work: the `AutoModelForCausalLM.from_pretrained` method did not use the code in `attention_sinks/models/qwen/modeling_qwen.py`. So currently, when I run `perplexity.py`, this is how I temporarily modify it:
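A rough sketch of what such a temporary workaround could look like is shown below; the import path and the `QWenLMHeadModel` class name are assumptions based on the files added in this PR, not necessarily the exact change used:

```python
# Hypothetical workaround in perplexity.py: load the patched QWen
# implementation directly instead of going through
# attention_sinks.AutoModelForCausalLM.
from attention_sinks.models.qwen.modeling_qwen import QWenLMHeadModel

model = QWenLMHeadModel.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True,
)
```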
Benchmarks
Experiment: Multi group attention sinks
(The code changes for this experiment are not included in this PR.)

I also conducted experiments with multiple groups of attention sinks. My goal is to reduce the model's memory loss beyond the `kv cache size`, but for now I have only run the perplexity test, which does not show whether context memory loss has actually been reduced.
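To make the idea concrete, here is a purely illustrative sketch of how a cache-eviction policy with multiple groups of sink tokens could choose which positions to keep; this is not the experiment code, and all names and numbers are made up for illustration:

```python
def keep_indices(seq_len: int, num_groups: int = 3, sink_size: int = 4, window_size: int = 1024) -> list[int]:
    """Illustrative multi-group attention-sink selection.

    Instead of keeping only the first `sink_size` tokens as sinks, keep
    `num_groups` groups of `sink_size` tokens spread evenly over the
    evicted prefix, plus the most recent `window_size` tokens.
    """
    if seq_len <= window_size:
        return list(range(seq_len))

    recent_start = seq_len - window_size
    kept = set(range(recent_start, seq_len))  # the recent window

    # Spread the sink groups evenly over the evicted prefix [0, recent_start).
    stride = max(recent_start // num_groups, 1)
    for g in range(num_groups):
        start = g * stride
        kept.update(range(start, min(start + sink_size, recent_start)))

    return sorted(kept)


# Example: after 4096 tokens, keep 3 sink groups of 4 tokens plus a 1024-token window.
print(len(keep_indices(4096)))  # 1036 = 1024 recent + 12 sink tokens
```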