Add RWKV2 (fast) #17230
-- On second thought: it's not immediately clear to me how many people will use this particular model, or how it will perform. What I'd really like to do is implement and develop it on the Hub and see whether it's useful and popular there. I spent some time with the docs, and the route for adding new model architectures seems to preferentially support adding directly to the main library.
To answer your question: if it performs better than the other CausalLM models out there, it will most likely get used. Make a PR, build an initial version that can be run on HF, and see if any of the HF devs are willing to chime in. I am interested in this work, particularly because it solves a problem I haven't seen solved before: being able to run CausalLM models on CPU. And my work stretches beyond the KoboldAI team; I know there are more people out there who would benefit from CPU models, given the high prices that GPUs currently command.
Work is going OK. We're porting the GPT-like part to Transformers first, for training and inference, and will work out the fast RNN inference-only part after the GPT part passes tests.
Where is your work at? I have worked on this model and would like to contribute. I'm also now experienced at troubleshooting parts of this model (mostly inference accuracy, though), and have spent time understanding the CUDA kernels. I have some experience with adapting new codebases to unexpected feature-set combinations.
I'm also curious how this one is coming along. (I just saw the original paper today. Not sure how I missed it...)
@leondz are you guys still working on this? I am looking to get into this if it can work on edge devices.
Some time ago I looked a little into continuing this, but other things came up. Since that work, RWKV is on version 4 now (although the changes between versions are not generally complex): https://github.com/BlinkDL/RWKV-LM
I can't understand why this hasn't seen wider adoption; it makes me a bit skeptical. If it's better in every way than the original transformer paper, why wouldn't we see adoption from Meta, OpenAI, DeepMind, etc.?
You could ask the same about any model or technology near the top of a leaderboard. Things happen because people do the work or make the business decisions behind them happening. There are scads and scads of things better than the original transformer paper, but they're not normative yet.
This is better, but GPT is good enough for most applications.
It's not presented well and clearly. I am working on a fork or huggingface integration that answers questions; this is pretty much a breakthrough model IMO, and I am just making sure the runtimes are true. It's still in the R&D phase; the adoption phase comes soon after.
I spent about a month working on this, but the code wasn't stable and wasn't version-controlled in the normal way, which made refactoring really tricky. Then time ran out. I think if the engineering side of things is fixed and there's a stable release, it's a great model - definitely more data-efficient than its competitors, which is really the core factor now.
For our own project we have kind of basic support for it worked around in with the original codebase, but the reason we don't finetune it or support it properly is that huggingface support is missing and we are tightly integrated with huggingface. I assume other providers/projects have the same issue. For adoption I'd love to see RWKV land in huggingface so we can begin to offer it to our users the proper way, without them relying on manual steps and without missing features for this model.
Yeah, but why doesn't OpenAI literally just spend one month on this with 10 people and use it? I think it has some drawback, but no one can tell me what it is... It feels reasonable that all new papers from Google and OpenAI should use this.
There are a number of papers with a similar "exponential moving average" design now. For example, S4D uses slightly fancier kernels: https://github.com/HazyResearch/state-spaces (while I find simple kernels are enough). RWKV is weaker at LAMBADA (compared with GPT) when the model is small (< 3B), but I find that adding one single tiny QKV attention is enough to solve it (it helps a small model copy words from the prompt). Moreover, it's reasonable to expect a competitive linear-time attention model, because when human novelists write very long stories the speed is consistent (except GRRM lol).
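For readers wondering what "one single tiny QKV attention" could look like in practice, here is a minimal sketch of a small causal attention head added on top of an otherwise attention-free block. The module name, dimensions, and placement are assumptions for illustration, not the actual RWKV-4a code.

```python
import torch
import torch.nn as nn

class TinyQKVAttention(nn.Module):
    """Hypothetical sketch: one small causal attention head bolted onto an
    otherwise attention-free model, to help a small model copy prompt tokens."""

    def __init__(self, d_model: int, d_attn: int = 64):
        super().__init__()
        self.q = nn.Linear(d_model, d_attn, bias=False)
        self.k = nn.Linear(d_model, d_attn, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_attn ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = torch.einsum("btd,bsd->bts", q, k) * self.scale
        # causal mask so a token can only attend to (copy from) earlier positions
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        return x + torch.einsum("bts,bsd->btd", scores.softmax(dim=-1), v)
```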
I don't think this project is well known; there's a huge ecosystem built on just what works right now, i.e. T5 and GPT-x. For example, Perceiver IO and Perceiver AR by DeepMind seem to do something similar to get linear attention. To get this project to that level of popularity we have to build various production-level proofs; most people already understand the challenges of the T5 and GPT-x series. Second, from a product perspective the model isn't as important; it's the data that is important. People are betting that it's smarter to deploy a product with shitty AI and wait for the improvement before investing in the R&D. They build the product and make it easy to replace the AI portion of it in 10 minutes. These factors make it difficult for projects and independent researchers to get the spotlight they need.
I understand. But this is the only architecture that has infinite context length.
"...this is the only architecture that has infinite context length." Wait, really?... How did I miss that? I thought it was just a faster, more efficient approach. |
"So it's combining the best of RNN and transformer - great performance,
fast inference, saves VRAM, fast training, "infinite" ctx_len, and free
sentence embedding."
The context length is presently limited by the accuracy of the floating-point representation, due to the heavily simplified and unified architecture. RWKV is a strong combination of speed and long context.
Right, okay. Well, that's pretty compelling, for sure...
I think it's also limited by memory.
There is no memory limit associated with context length that I am aware of with these models. The state can be retained in a recurrent manner, using only as much memory as is available for accelerated parallel operation.
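To make the recurrent-state point concrete, here is a minimal sketch of RNN-mode generation where only a fixed-size state is carried between tokens, so memory use does not grow with context length. The `model(token_id, state) -> (logits, state)` interface is an assumption for illustration, not the actual RWKV API.

```python
import torch

@torch.no_grad()
def generate_recurrently(model, tokenizer, prompt: str, n_tokens: int = 100):
    """Feed tokens one at a time, carrying a fixed-size state between steps.
    Only the state tensor is kept around, never the full sequence of
    activations, so memory is constant in the number of tokens processed."""
    state = None
    token_ids = tokenizer.encode(prompt)
    for tok in token_ids:                      # ingest the prompt
        logits, state = model(tok, state)
    out = list(token_ids)
    for _ in range(n_tokens):                  # generate new tokens greedily
        next_tok = int(torch.argmax(logits))
        out.append(next_tok)
        logits, state = model(next_tok, state)
    return tokenizer.decode(out)
```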
So you are telling me that the context is effectively encoded into the state? I am referring to the context length the model consumes. I guess what you are trying to say is that because we have a state, the model can look into that state for any context size, and as a result it has an infinite context length? I looked into the code and it says `T_MAX = 1024 # increase this if your ctx_len is long [NOTE: TAKES LOTS OF VRAM!]`, so it appears to have a limit based on memory. @BlinkDL can you clarify?
I should let Blink clarify, but regarding T_MAX: https://github.com/BlinkDL/RWKV-LM/blob/a268cd2e40351ee31c30c5f8a5d1266d35b41829/RWKV-v4neo/src/model.py#L34
Since the model support for this stalled, perhaps someone on HF's side such as @younesbelkada can help get this model supported?
I am not using the correct method to train it because I am lazy. But you can always finetune the model to support a longer ctx_len. For example, it's fine-tuned to 4096 here: https://huggingface.co/BlinkDL/rwkv-4-pile-3b. With the correct training method, I estimate the effective ctx_len can be at least 100K.
So it doesn't have "infinite" ctx_len.
I suspect that, technically, if you used a rational-number representation rather than floating point it would have infinite context length. Aside: I'm not an ML researcher, but I don't know why downscaling like this doesn't get more attention. It seems context length could be fully infinite by re-encoding past information according to what is helpful for future states, and a network wired to discover its own architecture would quickly find this.
Wow, very cool @leondz !
It's absolutely BlinkDL's project, so up to them and they get the headline credit, but that sounds lovely - I'm down :)
Can you share your slides? :) Consider this a community project, and we can build an ecosystem on top of RWKV, like what happened with Stable Diffusion. I will focus on improving the algorithm & model - now training RWKV-4a with one single tiny extra attention (just a few extra lines compared with RWKV-4) to further improve some difficult zero-shot tasks (such as LAMBADA) for smaller models.
Can I also get the slides? Perhaps a Google Docs link would be quickest. There are a few parts of this architecture that are still fuzzy to me.
I wasn’t aware. It’s too bad we didn’t take these things farther; I was having the opposite issue. @ArEnSc, please let us know if there are any snags preventing opening a PR, so somebody else can step in too.
It's important to say that this was due to the pace and mode of development, not the model's quality!
It might not be fully helpful, but I have a repository with a bunch of different variations on inference. For example, https://github.com/harrisonvanderbyl/rwkv_chatbot/blob/main/src/model_run_onnx.py is a file where I have made the code compatible with ONNX, TensorFlow, and IREE inference converters (with only some minor tweaking).
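As a rough illustration of why the per-token recurrent step converts easily, here is a sketch of exporting a single step to ONNX using a stand-in module; the state layout, dimensions, and maths are made up for the example and do not match the real model.

```python
import torch
import torch.nn as nn

class DummyRecurrentCell(nn.Module):
    """Stand-in for a single RWKV step: (token embedding, state) -> (logits, new state).
    Purely illustrative; the real model's state layout and computation differ."""

    def __init__(self, d_model: int = 768, vocab: int = 50277):
        super().__init__()
        self.mix = nn.Linear(d_model * 2, d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x, state):
        h = torch.tanh(self.mix(torch.cat([x, state], dim=-1)))
        return self.head(h), h

cell = DummyRecurrentCell().eval()
dummy_x = torch.zeros(1, 768)       # one token's embedding
dummy_state = torch.zeros(1, 768)   # fixed-size recurrent state

# Exporting the per-token step keeps the graph static, which is what makes
# ONNX / TensorFlow / IREE conversion straightforward for an RNN-style model.
torch.onnx.export(
    cell,
    (dummy_x, dummy_state),
    "rwkv_cell.onnx",
    input_names=["x", "state"],
    output_names=["logits", "new_state"],
    opset_version=17,
)
```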
@ArthurZucker
Hi @ArEnSc. After that, you need to run
closer! but still problems
I think here you need to install MeCab, and then re-run.
I had the same issue when installing; you should make sure to install
Is a full dev environment needed to start with? Personally, it would be quite inspiring to see a PR even if it didn't pass tests.
@ArEnSc did you manage to open a PR? I think it's OK to leave it as a draft even if the tests don't pass (i.e. eventually there's no need to install the dev env, at least at the beginning; we can in the worst case take over the PR if needed). Let us know what you think!
Yeah, hey, sorry guys! Probably sometime this week or today. My day job is iOS development; it isn't in MLE. I just moonlight a side job in NLP and speech synthesis in the media-creation domain. Looking to transition eventually; hopefully this PR will be proof of my capabilities, so I won't abandon it =)
#20737 here is the draft; probably generating all the scaffolding soon.
There is recent active work on interfacing multiple backends to rwkv at https://github.com/harrisonvanderbyl/rwkv_chatbot/blob/main/src/rwkvops.py#L914 (list down at end of file)
Yeah, we will be looking into that as soon as I figure out how the architecture works from a high level. I might have some questions, but I am tracing the model now.
I have made a very simple and dumb wrapper for RWKV; it depends on the rwkv library. I'd like to tag @zphang: he recently implemented LLaMA support in transformers. Maybe adding RWKV would interest him as well.
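For anyone who wants to try a checkpoint without transformers support in the meantime, a minimal usage sketch with the `rwkv` pip package might look like the following; the checkpoint path, strategy string, and tokenizer file are placeholders taken from the package's examples, so double-check them against its README.

```python
# pip install rwkv torch
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

# Placeholder paths: point these at a downloaded checkpoint and the
# tokenizer JSON shipped with the package's examples.
model = RWKV(model="RWKV-4-Pile-3B-20221110-ctx4096", strategy="cpu fp32")
pipeline = PIPELINE(model, "20B_tokenizer.json")

# Greedy-ish sampling with the package defaults; CPU-only works, just slowly.
print(pipeline.generate("The RWKV architecture is", token_count=64))
```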
This is by far one of the best models right now; the performance of the 7B is outstanding.
Because nobody tried implementing it?
That's HF's mission, so I was wondering how HF has missed the best model in the industry. It makes me think about the bias behind what this "Open" platform says versus what it does. And because of that, I was wondering why the HF teams are not giving a hand to port this in.
There is already an open PR by @ArEnSc
Two things: Llama was so fast because people actively wanted to use it. Meta releases something, HF jumps in line and puts a PR together to support it. Since RWKV is not that big, no support. I am waiting eagerly for support...
Hi there,
Model description
I would like to implement a new model architecture.
Short description
RWKV v2 is an "RNN with transformer-level performance, without using attention. Similar to Apple's Attention Free Transformer. All trained models open-source. Inference is very fast (even on CPUs) and might work on cell phones. There's also a GPT-type implementation." -- (Hochreiter's description)
RWKV v2 is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (these are called "gates"), while in RWKV v2 you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect. RWKV can leverage GPUs, but doesn't need to.
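To illustrate the data-independent decay described above, here is a toy numerical sketch (not the full RWKV formula, which also normalizes by a matching sum of weights and uses a separate bonus weight for the current token): each channel carries its own fixed, trainable decay, so moving a value from a 0.8-channel to a 0.5-channel just changes how fast it fades.

```python
import numpy as np

d = 4                                    # number of channels
w = np.array([0.9, 0.8, 0.5, 0.1])       # per-channel decay, independent of the data
state = np.zeros(d)

def step(state, kv):
    # Because w does not depend on the input, every timestep applies the same
    # elementwise decay, so the whole sequence can also be computed in parallel
    # during training (it is just a convolution with a fixed kernel per channel).
    return w * state + kv

for kv in np.random.randn(10, d):        # toy inputs standing in for k/v products
    state = step(state, kv)
print(state)
```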
Open source status
Provide useful links for the implementation
Implementation and weights
There's an implementation at BlinkDL/RWKV-LM, which also gives a detailed description of the model internals and some performance benchmarks. Model weights are currently being trained on a few datasets, including the Pile (see e.g. BlinkDL/RWKV-v2-RNN-Pile) and Danish Gigaword (by me). Both will be openly available - some checkpoints for the Pile already are, even though it's an ongoing process.
Status
The model seems quite exciting and I'm able to replicate preliminary results. I'm already talking with @BlinkDL about the implementation. I'm happy to implement/port the model architecture (for both RNN and GPT variants), tokenizer, and tests myself (and have already started) and would appreciate help and advice.