[REQUEST] Better document rope_scaling/rope_alpha in wiki, and add config of yarn_rope_factor/yarn_rope_original_max_position_embeddings #239

Open
2 of 3 tasks
Originalimoc opened this issue Nov 15, 2024 · 6 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@Originalimoc
Originalimoc commented Nov 15, 2024

Problem

No response

Solution

These are comments from the exllamav2 source. Adding them to the wiki would make it more informative:

    scale_pos_emb: float                        # Factor by which to scale positional embeddings, e.g. for 4096-token sequence use a scaling factor of 2.0, requires finetuned model or LoRA
    scale_alpha_value: float                    # Alpha value for NTK RoPE scaling. Similar to compress_pos_emb but works without finetuned model

Also add yarn_rope_factor/yarn_rope_original_max_position_embeddings to config.yml.
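To illustrate the difference between the two options for the wiki, here is a minimal sketch of how linear scaling and NTK alpha scaling change the RoPE rotation angles. It is not taken from exllamav2: the function names, the head dimension of 128, the base of 10000, and the exact alpha formula are assumptions for illustration.

```python
def rope_inv_freq(head_dim: int, base: float = 10000.0, alpha: float = 1.0) -> list[float]:
    """Inverse RoPE frequencies, with optional NTK-style alpha scaling.

    Assumed formula: alpha enlarges the base, which stretches the low
    frequencies while leaving the highest frequency untouched.
    """
    scaled_base = base * alpha ** (head_dim / (head_dim - 2))
    return [scaled_base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rope_angles(position: int, inv_freq: list[float], scale_pos_emb: float = 1.0) -> list[float]:
    """Rotation angles for one position.

    Linear scaling divides the position by the factor, so every
    frequency sees a compressed position.
    """
    return [(position / scale_pos_emb) * f for f in inv_freq]


plain = rope_inv_freq(128)
ntk = rope_inv_freq(128, alpha=2.0)

# Linear scaling by 2.0 makes position 4096 look like position 2048.
assert rope_angles(4096, plain, scale_pos_emb=2.0) == rope_angles(2048, plain)

# NTK alpha keeps the highest frequency and halves the lowest one.
print(ntk[0] / plain[0], ntk[-1] / plain[-1])
```

Under these assumptions, linear scaling compresses all frequencies equally (which is why the source comment says it needs a finetuned model or LoRA), while alpha scaling mostly affects the low frequencies.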

Alternatives

No response

Explanation

see
https://github.com/turboderp/exllamav2/blob/master/exllamav2/config.py
and
https://github.com/turboderp/exllamav2/blob/master/exllamav2/model_init.py:

    if args.rope_yarn:
        if config.alt_rope_method != "yarn":
            config.yarn_rope_original_max_position_embeddings = config.max_seq_len
        config.alt_rope_method = "yarn"
        config.yarn_rope_factor = args.rope_yarn

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
@bdashore3 bdashore3 added documentation Improvements or additions to documentation enhancement New feature or request labels Nov 15, 2024
@DocShotgun
Member

Yarn scaling can be specified in the model's config.json and this will be handled automatically by exl2, similarly to how su/longrope is handled for phi3 and llama3 rope is handled for l3.1/l3.2:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

I was looking into the yarn settings and possibly exposing them for override in tabby, but this opens up a whole can of worms in terms of number of args and sanity-checking the configuration. Both of the options currently exposed by tabby - linear rope and ntk/alpha - can be applied (doesn't necessarily mean should) regardless of whether the model has another rope method specified (yarn/su/llama3). However, yarn/su/llama3 rope settings would be mutually exclusive with each other, and we would need to enforce only one of them being active at a time.
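As a rough sketch of the sanity check described above, the following rejects configurations that enable more than one of the mutually exclusive methods. The function name and the way methods are passed in are hypothetical; nothing like this is claimed to exist in tabby or exllamav2.

```python
from typing import Optional

# Methods the comment above describes as mutually exclusive with each other.
# Linear rope and ntk/alpha are deliberately not listed, since they can be
# combined with any of these.
EXCLUSIVE_ROPE_METHODS = frozenset({"yarn", "su", "llama3"})


def pick_rope_method(requested: set[str]) -> Optional[str]:
    """Return the single requested method, or None if none was requested.

    Raises ValueError for unknown names or for more than one method.
    """
    unknown = requested - EXCLUSIVE_ROPE_METHODS
    if unknown:
        raise ValueError(f"unknown rope method(s): {sorted(unknown)}")
    if len(requested) > 1:
        raise ValueError(f"only one of {sorted(requested)} may be active at a time")
    return next(iter(requested), None)


print(pick_rope_method({"yarn"}))
```

Even this small check hints at the "can of worms": each method also carries its own parameters, which would need validating in the same way.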

@Originalimoc
Author

Yes 😅
I would even suggest removing those two options from config.yml altogether and letting users configure RoPE scaling in the model's config.json.

@necrogay

Honestly, I either don’t fully understand how this is supposed to work or I’m missing something. Let’s say I’m using Qwen32B, which has a default context length of 32,768. If I specify rope_scaling: 4.0 in the config.json, should Tabby load it with a 130k context? For me, this doesn’t happen—it still loads with the default 32,768 context length. Even if this worked, such a large context is too much for my system—I’m limited to 65k for this model. It would be great if there were a way to specify the maximum context length for each model individually, as well as assign them aliases. For example, Qwen32B-32k, Qwen32B-65k, and so on, to load the same models when accessing them via the OAI API.

In practice, when I specify max_seq_len: 65535 in the config, the model does load with those settings. However, this applies to all models at once. For instance, if I want to load Qwen14B with a 131k context, it becomes an issue without manually editing the config. The only solution I’ve found so far is to create separate YAML configs for specific models and launch them directly through Tabby with commands like:

    .\start.bat --config .\Qwen32B-32k.yml
    .\start.bat --config .\Qwen32B-65k.yml
    .\start.bat --config .\Qwen14B-131k.yml

Each time, I have to stop and restart Tabby with new settings, which is not very convenient. Perhaps I don’t fully understand how to set up automatic changes properly. I’d appreciate any advice on this issue.

@DocShotgun
Member

  1. If you want to use Qwen2.5-32B-Instruct, which recommends yarn scaling of 4.0 to reach 131072 ctx, follow the official instructions (https://huggingface.co/Qwen/Qwen2.5-32B-Instruct#processing-long-texts) and edit config.json to add the rope_scaling section, then set max_position_embeddings to 131072. After that, specify whatever ctx length fits in your VRAM via max_seq_len/cache_size when loading the model in tabby. I don't understand the part about your setting changes applying to "all models at once".
  2. We strongly recommend against using multiple config.yml files to load different models. Instead, either use the /v1/model/load/ API endpoint to load a different model than the one loaded on launch, or enable in-line model loading and pass the name of the desired model with the (chat) completion request. If you don't want to pass loading params in the payload to the load endpoint, you can create a tabby_config.yml file in each model's directory containing the loading params for that model.
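The two steps above could be sketched roughly as follows. This is an illustration only: the helper names are made up, the model name is a placeholder, and the `name` key in the load payload is an assumption about the request schema, so check the tabby API docs before relying on it. Only max_seq_len and cache_size are named in this thread.

```python
import copy
import json


def enable_yarn(config: dict, factor: float = 4.0) -> dict:
    """Return a copy of a config.json dict with yarn scaling added.

    Mirrors point 1: add the rope_scaling section and raise
    max_position_embeddings to original * factor.
    """
    patched = copy.deepcopy(config)
    if "rope_scaling" in patched:
        # Already patched; applying the factor twice would be wrong.
        return patched
    original = patched["max_position_embeddings"]
    patched["rope_scaling"] = {
        "factor": factor,
        "original_max_position_embeddings": original,
        "type": "yarn",
    }
    patched["max_position_embeddings"] = int(original * factor)
    return patched


def build_load_payload(name: str, max_seq_len: int, cache_size: int) -> dict:
    """Body for a POST to /v1/model/load (the "name" key is an assumption)."""
    return {"name": name, "max_seq_len": max_seq_len, "cache_size": cache_size}


config = {"max_position_embeddings": 32768}
patched = enable_yarn(config)
print(json.dumps(patched, indent=2))

# Load the same model with a context that fits in VRAM (placeholder name).
print(json.dumps(build_load_payload("Qwen2.5-32B-Instruct-exl2", 65536, 65536)))
```

Because the context length is chosen per load request, the same model directory can be loaded at 32k or 65k without keeping separate config.yml files.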

@necrogay

necrogay commented Dec 1, 2024

> 1. If you want to use Qwen2.5-32B-Instruct which recommends yarn scaling of 4.0 to reach 131072 ctx, you would simply follow the official instructions here https://huggingface.co/Qwen/Qwen2.5-32B-Instruct#processing-long-texts to edit `config.json` to add this section in, and then edit `max_position_embeddings` to 131072. Then specify whatever ctx length fits in your vram in `max_seq_len`/`cache_size` when loading the model in tabby. I'm not understanding the part about your setting changes applying to "all models at once".

Do we really need to edit the max_position_embeddings parameter? I'm a bit confused.

"max_position_embeddings": 32768,
"rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
}

Changing max_seq_len or cache_size in the yml file, along with editing the model configuration as shown above, is the only thing that allows me to load a context length above 32k for the model.

> 2. We very much do not recommend using multiple different `config.yml` files to load different models. You are meant to either use the `/v1/model/load/` API endpoint to request to load a different model than what was loaded on launch, or to enable in-line model loading and passing the name of the desired model with the (chat) completion request. You can create a `tabby_config.yml` file in the directory of each of your models, containing the loading params for said model if you don't want to pass them via payload to the load endpoint.

I didn’t know that the tabby_config.yml file could be placed in the directory of each model. This could indeed be useful if it works as intended. However, I just tried this approach, and it didn’t work for me.

In any case, this method doesn’t cover the need to load the same model with different context lengths. For example, this can be relevant when connecting to a draft model.

Why don’t you recommend using separate configurations for connecting different models?

That said, I’ve already solved the issue in my own way. I wrote a small proxy application that starts and stops Tabby with the required configurations based on requests through the OpenAI API. It’s a bit of an unconventional solution, but it works for me exactly as I wanted.

@Originalimoc
Author

> Do we really need to edit the max_position_embeddings parameter?
