Add RWKV2 (fast) #17230

Closed
leondz opened this issue May 13, 2022 · 72 comments · Fixed by #22797


leondz commented May 13, 2022

Model description

I would like to implement a new model architecture.

Short description

RWKV v2 is an "RNN with transformer-level performance, without using attention. Similar to Apple's Attention Free Transformer. All trained models open-source. Inference is very fast (even on CPUs) and might work on cell phones. There's also a GPT-type implementation." -- (Hochreiter's description)

RWKV v2 is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in a usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV v2 you simply move the information from a W-0.8 channel to a W-0.5 channel to achieve the same effect. RWKV can leverage GPUs, but doesn't need them.
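For intuition, here is a minimal NumPy sketch (mine, not code from the RWKV repo) of that idea: a per-channel exponential moving average whose decay is fixed per channel rather than computed from the input. Because the decay is data-independent, the same output can be computed either recurrently or in parallel across time as a plain weighted sum.

    import numpy as np

    def decay_mix(x, w):
        """Per-channel EMA with a data-independent (but trainable) decay.

        x: (T, C) input sequence (float)
        w: (C,)   decay per channel, e.g. 0.8 for a slow channel, 0.5 for a fast one
        """
        T, C = x.shape
        state = np.zeros(C)
        out = np.empty_like(x)
        for t in range(T):
            state = w * state + x[t]   # no data-dependent gate: w is the same at every step
            out[t] = state
        return out

    def decay_mix_parallel(x, w):
        """Same result with no sequential loop: out[t] = sum_{i<=t} w**(t-i) * x[i]."""
        T, C = x.shape
        t = np.arange(T)
        lag = t[:, None] - t[None, :]                                  # (T, T)
        weights = np.where(lag[..., None] >= 0,
                           w ** np.maximum(lag, 0)[..., None], 0.0)    # (T, T, C)
        return np.einsum('tic,ic->tc', weights, x)

This only illustrates the time-decay channels described above, not the full WKV operator (which also normalizes by a similarly decayed sum over exp(k)).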

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

Implementation and weights

There's an implementation at BlinkDL/RWKV-LM, which also gives a detailed description of the model internals and some performance benchmarks. Model weights are currently being trained on a few datasets, including the Pile (see e.g. BlinkDL/RWKV-v2-RNN-Pile) and Danish Gigaword (by me). Both will be openly available - some checkpoints for the Pile already are, even though training is ongoing.

Status

The model seems quite exciting and I'm able to replicate preliminary results. I'm already talking with @BlinkDL about the implementation. I'm happy to implement/port the model architecture (for both RNN and GPT variants), tokenizer, and tests myself (and have already started) and would appreciate help and advice.


leondz commented May 16, 2022

-- on second thoughts: it's not immediately clear to me how many people will use this particular model, or how it will perform. What I'd really like to do is implement and develop it on the Hub, and see if it's useful/popular there. I spent some time with the docs, and the route to adding new model architectures seems to preferentially support adding directly to transformers. Tooling for new model architectures that works on the Hub (e.g. cookiecutter, class organisation, and tests) would be super neat. Is that something there's any interest in?

@mrseeker

-- on second thoughts: it's not immediately clear to me how many people will use this particular model, or how it will perform.

To answer your question: if it performs better than the other CausalLM models out there, it will most likely get used. Make a PR, build an initial version that can be run on HF, and see if any of the HF devs are willing to chime in. I am interested in this work, particularly because it solves a problem I haven't seen solved before: being able to run CausalLM models on CPU. And my work stretches beyond the KoboldAI team; I know there are more people out there who would benefit from CPU models, given how expensive running models on GPUs currently is.


leondz commented May 20, 2022

Work is going OK. We're porting the GPT-like part to Transformers first, for training and inference, and will work out the fast RNN inference-only part after the GPT part passes tests.


xloem commented Jun 26, 2022

Where is your work at? I have worked on this model and would like to contribute. I'm also experienced now at troubleshooting parts of this model (mostly inference accuracy, though), and have spent time understanding the CUDA kernels. I have some experience with adapting new codebases to unexpected feature-set combinations.

@jbmaxwell

I'm also curious how this one is coming along. (I just saw the original paper today. Not sure how I missed it...)


ArEnSc commented Oct 4, 2022

@leondz are you guys still working on this? I am looking to get into this if it can work on edge devices.


xloem commented Oct 5, 2022

Some time ago I looked a little into continuing this, but other things came up.
After that experience, I would recommend that future implementers start a new fork rather than working off the existing one: very little has been done, so it can take extra effort to learn the existing situation without much return.
For the record:
leondz's branch is at https://github.com/leondz/transformers/tree/rwkv-v2 .
I added smidges to it at https://github.com/xloem/transformers/tree/rwkv-v2 and https://github.com/xloem/transformers/tree/rwkv-v2-disable_non_clm_for_now .

Since that work, RWKV is on version 4 now (although the changes between versions are not generally complex): https://github.com/BlinkDL/RWKV-LM

@mrconter1

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?


xloem commented Nov 21, 2022

You could ask the same about any model or technology near the top of a leaderboard. Things happen because people do the work or make the business decisions behind them happening. There are scads and scads of things better than the original transformer paper, but they're not normative yet.


BlinkDL commented Nov 22, 2022

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

This is better but GPT is good enough for most applications.
I will just keep training larger models. RWKV 14B release soon.


ArEnSc commented Nov 22, 2022

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

It's not presented well and clearly. I am working on a fork / Hugging Face integration that answers those questions; this is pretty much a breakthrough model IMO, I am just making sure the claimed runtimes are true. It's still in the R&D phase; the adoption phase comes soon after.


leondz commented Nov 22, 2022

I spent about a month working on this but the code wasn't stable and wasn't version controlled in the normal way, which made refactoring really tricky. Then time ran out. I think if the engineering side of things is fixed, and there's a stable release, it's a great model - definitely more data-efficient than competitors, which is really the core factor now.


henk717 commented Nov 30, 2022

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

For our own project we have kind of basic support for it worked around with the original base, but the reason we don't finetune it or support it properly is that Hugging Face support is missing and we are tightly integrated with Hugging Face. I assume other providers/projects have the same issue. For adoption I'd love to see RWKV land in Hugging Face, so we can begin to offer it to our users the proper way, without them relying on manual steps and without missing features for this model.


mrconter1 commented Nov 30, 2022 via email


BlinkDL commented Nov 30, 2022

Yeah, but why doesn't OpenAI literally just spend one month on this with 10 guys and use this? I think this has some drawback, but no one can tell me what it is... It feels reasonable that all new papers from Google and OpenAI should use this.

There are a number of papers with similar "exponential moving average" design now.

For example, S4D is using slightly fancier kernels: https://github.com/HazyResearch/state-spaces (while I find simple kernels are enough).

RWKV is weaker at LAMBADA (compared with GPT) when the model is small (< 3B), but I find that adding one single tiny QKV attention is enough to solve it (it helps a small model copy words from the prompt).

Moreover, it's reasonable to expect a competitive linear-time attention model, because when human novelists write very long stories the speed is consistent (except GRRM lol).
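For readers wondering what "one single tiny QKV attention" means in practice: it is just one ordinary causal attention head with a very small head dimension, bolted onto the otherwise attention-free model. A generic sketch (not the actual RWKV-4a code; names and shapes are mine):

    import numpy as np

    def tiny_qkv_attention(x, wq, wk, wv):
        """One small causal attention head; wq, wk, wv are (C, head_dim) with a tiny head_dim."""
        T, head_dim = x.shape[0], wq.shape[1]
        q, k, v = x @ wq, x @ wk, x @ wv                    # (T, head_dim) each
        scores = (q @ k.T) / np.sqrt(head_dim)              # (T, T)
        mask = np.tril(np.ones((T, T), dtype=bool))         # causal: token t sees tokens <= t
        scores = np.where(mask, scores, -np.inf)
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        return probs @ v                                    # (T, head_dim)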


ArEnSc commented Dec 1, 2022

I don't think this project is well known; there's a huge ecosystem based on just what works right now, i.e. T5 and GPT-x. For example, Perceiver IO and Perceiver AR by DeepMind seem to do something similar to get linear attention. To get this project to that level of popularity we have to build various production-level proofs; most people already understand the challenges of the T5 and GPT-x series. Second, from a product perspective the model isn't as important as the data. People are making the bet that it's smarter to deploy a product with shitty AI and wait for the improvement before investing in the R&D; they build the product and make it easy to replace the AI portion of it in 10 minutes. These factors make it difficult for projects and independent researchers to get the spotlight they need.


mrconter1 commented Dec 1, 2022 via email


jbmaxwell commented Dec 1, 2022

"...this is the only architecture that has infinite context length."

Wait, really?... How did I miss that? I thought it was just a faster, more efficient approach.


mrconter1 commented Dec 1, 2022 via email


xloem commented Dec 1, 2022

The context length is presently limited by the accuracy of the floating point representation, due to the heavily simplified and unified architecture. RWKV is a strong combination of speed and long-context.

@jbmaxwell

Right, okay. Well, that's pretty compelling, for sure...


ArEnSc commented Dec 1, 2022

The context length is presently limited by the accuracy of the floating point representation, due to the heavily simplified and unified architecture. RWKV is a strong combination of speed and long-context.

I think it's also limited by memory.


xloem commented Dec 2, 2022

There is no memory limit associated with context length that I am aware of with these models. State can be retained in a recurrent manner, providing for using only however much memory is available for accelerated parallel operation.
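To make that concrete, here is a rough sketch of RNN-mode generation with a fixed-size carried state. step is a hypothetical single-token forward function standing in for the real model; the point is that memory per token is constant no matter how long the context grows:

    def generate_recurrent(prompt_ids, n_new, step, sample, state=None):
        """RNN-mode generation: the entire history lives in `state`, never in a growing cache.

        step(token_id, state) -> (logits, new_state)   # hypothetical single-step model call
        sample(logits)        -> token_id
        """
        logits = None
        for tok in prompt_ids:                 # fold the prompt into the state, token by token
            logits, state = step(tok, state)
        out = list(prompt_ids)
        for _ in range(n_new):
            tok = sample(logits)
            out.append(tok)
            logits, state = step(tok, state)   # state size is fixed; context length is unbounded
        return out, state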


ArEnSc commented Dec 2, 2022

There is no memory limit associated with context length that I am aware of with these models. State can be retained in a recurrent manner, providing for using only however much memory is available for accelerated parallel operation.

So you are telling me that the context is effectively encoded into the state. I am referring to the context length the model consumes. I guess what you are trying to say is that because we have a state, the model can look into that state for any context size, and as a result it has an infinite context length? I looked into the code and it says

  T_MAX = 1024 # increase this if your ctx_len is long [NOTE: TAKES LOTS OF VRAM!]

so it appears to have a limit based on memory. @BlinkDL can you clarify?



henk717 commented Dec 3, 2022

Since the model support for this stalled, perhaps someone on HF's side such as @younesbelkada can help get this model supported?


BlinkDL commented Dec 3, 2022

There is no memory limit associated with context length that I am aware of with these models. State can be retained in a recurrent manner, providing for using only however much memory is available for accelerated parallel operation.

So you are telling me that the context is effectively encoded into the state. I am referring to the context length the model consumes. I guess what you are trying to say is that because we have a state, the model can look into that state for any context size, and as a result it has an infinite context length? I looked into the code and it says

  T_MAX = 1024 # increase this if your ctx_len is long [NOTE: TAKES LOTS OF VRAM!]

so it appears to have a limit based on memory. @BlinkDL can you clarify?

I am not using the correct method to train it because I am lazy. But you can always finetune the model to support longer ctxlen. For example, fine-tuned to 4096 here:

https://huggingface.co/BlinkDL/rwkv-4-pile-3b

With the correct training method, I estimate the effective ctx_len can at least be 100K.


mrconter1 commented Dec 3, 2022 via email


xloem commented Dec 3, 2022

I suspect technically if you used a rational number representation rather than floating point it would have infinite context length.

Aside: I’m not an ML researcher, but I don’t know why downscaling like this doesn’t get more attention. It seems context length could be fully infinite by re-encoding past information for what is helpful for future states, and a network wired to discover its own architecture would quickly find this.
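A toy illustration of the floating-point limit (not something anyone would train with): keep the decayed state as an exact rational with Python's fractions module and an old token's contribution is attenuated but never rounded away, whereas the float version eventually underflows to exactly zero. The decay value and step count below are arbitrary.

    from fractions import Fraction

    decay = Fraction(1, 2)          # toy decay of 0.5 per step
    state_exact, state_float = Fraction(0), 0.0

    for x in [1.0] + [0.0] * 1100:  # one token, then 1100 steps of silence
        state_exact = decay * state_exact + Fraction(x)
        state_float = 0.5 * state_float + x

    print(state_exact == Fraction(1, 2**1100))   # True: the contribution survives exactly
    print(state_float == 0.0)                    # True: float64 has underflowed to zero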


younesbelkada commented Dec 5, 2022

Wow, very cool @leondz!
Would also be very keen to have a look at the tutorial you made; we could ultimately put it on the HF blog to announce the release of this architecture (of course once we figure out everything about the integration, and happy to help you on the post too). How does that sound?


leondz commented Dec 5, 2022

It's absolutely BlinkDL's project, so up to them and they get the headline credit, but that sounds lovely - I'm down :)


BlinkDL commented Dec 5, 2022

It's absolutely BlinkDL's project, so up to them and they get the headline credit, but that sounds lovely - I'm down :)

Can you share your slides? :)

Consider this a community project, and we can build an ecosystem on top of RWKV, like what happened with Stable Diffusion.

I will focus on improving the algorithm & model - now training RWKV-4a with one single tiny extra attention (just a few extra lines compared with RWKV-4) to further improve some difficult zero-shot tasks (such as LAMBADA) for smaller models.


ArEnSc commented Dec 5, 2022

Hey,

This integration went fine until two snags were hit:

  1. the code for reading input couldn't be reproduced
  2. the code for training couldn't be reproduced

I would love to see these stable & independent in their own branch. There was no hope of getting RWKV2 to pass the HF model implementation requirements (esp. the model weights precisely matching!) without these being established, but maybe things are better now.

Re: uptake - this model kicks ass, imo the restrictions have only been the difficulty of re-using/reproducing the codebase while it was under development, and that the paper hadn't been written. The math all checks out (I even wrote some tutorial slides for teaching the model) and the implementations have been elegant, it's just engineering issues in the way. Once a reproducible training codebase & paper are out, it's 🚀 time!

-- also would be super cool to have integrated the fast RNN inference if that's still working, but again the implementation and interface was fluid last time I tried to integrate this, and you can't integrate a moving implementation.

Can I also get the slides? Perhaps a Google Docs link would be quickest - there are a few parts of this architecture that are still fuzzy to me.


xloem commented Dec 5, 2022

  1. the code for reading input couldn't be reproduced
  2. the code for training couldn't be reproduced

I wasn’t aware. It’s too bad we didn’t take these things farther; I was having the opposite issue. @ArEnSc , please let us know if there are any snags preventing opening a PR so somebody else can step in too.


leondz commented Dec 5, 2022

  1. the code for reading input couldn't be reproduced
  2. the code for training couldn't be reproduced

I wasn’t aware. It’s too bad we didn’t take these things farther; I was having the opposite issue. @ArEnSc , please let us know if there are any snags preventing opening a PR so somebody else can step in too.

It's important to say that this was due to the pace and mode of development, not the model's quality!

@harrisonvanderbyl

Might not be fully helpful, but I have a repository with a bunch of different variations on inference.

https://github.com/harrisonvanderbyl/rwkv_chatbot/blob/main/src/model_run_onnx.py, for example, is a file where I have made the code compatible with ONNX, TensorFlow, and IREE inference converters (with only some minor tweaking).


ArEnSc commented Dec 7, 2022

@ArthurZucker
Hey, I am getting issues setting up the dev environment.
I am on Python 3.8.10, updated to the latest pip3. I create a venv using 3.8.10 and then run this command.
I am on macOS Monterey, M1 Pro.
Which version of Python should I be developing on?

 pip3 install -e ".[dev]"
ERROR: Could not find a version that satisfies the requirement tensorflow-text; extra == "dev" (from transformers[dev]) (from versions: none)
ERROR: No matching distribution found for tensorflow-text; extra == "dev"


younesbelkada commented Dec 8, 2022

Hi @ArEnSc,
Indeed, it's a bit tricky to install the dev environment on a Mac M1.
Could you please replace your setup.py with this one: https://gist.github.com/younesbelkada/ce24f0b517db46502792c4b638d4f5b9 and run your command again?

After that, you need to run pip3 install numpy --upgrade and everything should work fine.


ArEnSc commented Dec 10, 2022

@younesbelkada

(.env) michaelchung@michaels-mbp transformers % pip install -e ".[dev]"

Obtaining file:///Users/michaelchung/Code/transformers
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting packaging>=20.0
  Using cached packaging-22.0-py3-none-any.whl (42 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Using cached tokenizers-0.13.2.tar.gz (359 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting requests
  Using cached requests-2.28.1-py3-none-any.whl (62 kB)
Collecting numpy>=1.17
  Using cached numpy-1.23.5-cp38-cp38-macosx_11_0_arm64.whl (13.3 MB)
Collecting tqdm>=4.27
  Using cached tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting regex!=2019.12.17
  Using cached regex-2022.10.31-cp38-cp38-macosx_11_0_arm64.whl (287 kB)
Collecting filelock
  Using cached filelock-3.8.2-py3-none-any.whl (10 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Using cached huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting pyyaml>=5.1
  Using cached PyYAML-6.0-cp38-cp38-macosx_12_0_arm64.whl
Collecting pytest-xdist
  Using cached pytest_xdist-3.1.0-py3-none-any.whl (36 kB)
Collecting rjieba
  Using cached rjieba-0.1.11-cp36-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (5.7 MB)
Collecting unidic>=1.0.2
  Using cached unidic-1.1.0.tar.gz (7.7 kB)
  Preparing metadata (setup.py) ... done
Collecting phonemizer
  Using cached phonemizer-3.2.1-py3-none-any.whl (90 kB)
Collecting jaxlib<=0.3.6,>=0.1.65
  Using cached jaxlib-0.3.5-cp38-none-macosx_11_0_arm64.whl (61.3 MB)
Collecting codecarbon==1.2.0
  Using cached codecarbon-1.2.0-py3-none-any.whl (135 kB)
Collecting pyctcdecode>=0.4.0
  Using cached pyctcdecode-0.4.0-py2.py3-none-any.whl (45 kB)
Collecting flake8>=3.8.3
  Using cached flake8-6.0.0-py2.py3-none-any.whl (57 kB)
Collecting sacremoses
  Using cached sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorflow-metal
  Using cached tensorflow_metal-0.7.0-cp38-cp38-macosx_12_0_arm64.whl (1.4 MB)
Collecting GitPython<3.1.19
  Using cached GitPython-3.1.18-py3-none-any.whl (170 kB)
Collecting datasets!=2.5.0
  Using cached datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting scikit-learn
  Using cached scikit_learn-1.2.0-cp38-cp38-macosx_12_0_arm64.whl (8.2 MB)
Collecting sudachidict-core>=20220729
  Using cached SudachiDict-core-20221021.tar.gz (9.0 kB)
  Preparing metadata (setup.py) ... done
Collecting sacrebleu<2.0.0,>=1.4.12
  Using cached sacrebleu-1.5.1-py3-none-any.whl (54 kB)
Collecting Pillow
  Using cached Pillow-9.3.0-cp38-cp38-macosx_11_0_arm64.whl (2.9 MB)
Collecting tf2onnx
  Using cached tf2onnx-1.13.0-py3-none-any.whl (442 kB)
Collecting sentencepiece!=0.1.92,>=0.1.91
  Using cached sentencepiece-0.1.97-cp38-cp38-macosx_11_0_arm64.whl (1.1 MB)
Collecting evaluate>=0.2.0
  Using cached evaluate-0.3.0-py3-none-any.whl (72 kB)
Collecting fugashi>=1.0
  Using cached fugashi-1.2.1.tar.gz (337 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [8 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/private/var/folders/jn/8d33s3c55jv5pctdc6wdnm2h0000gn/T/pip-install-xf18599w/fugashi_18a210c9f68f4c1fb6ece4f85f9f7479/setup.py", line 15, in <module>
          output, data_files = check_libmecab()
        File "/private/var/folders/jn/8d33s3c55jv5pctdc6wdnm2h0000gn/T/pip-install-xf18599w/fugashi_18a210c9f68f4c1fb6ece4f85f9f7479/fugashi_util.py", line 58, in check_libmecab
          raise RuntimeError("Could not configure working env. Have you installed MeCab?")
      RuntimeError: Could not configure working env. Have you installed MeCab?
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(.env) michaelchung@michaels-mbp transformers % 

closer! but still problems

@younesbelkada

I think here you need to install MeCab through brew - can you try to run:

brew install mecab
brew install mecab-ipadic

and re-run pip install -e ".[dev]" again?


ArthurZucker commented Dec 12, 2022

I had the same issue when installing; you should make sure to install fugashi==1.1.2a6 (and ignore the MeCab part).
You can also follow the short guide from #18355.


xloem commented Dec 12, 2022

Is a full dev environment needed to start with? Personally it would be quite inspiring to see a PR even if it didn't pass tests.

@younesbelkada

@ArEnSc did you manage to open a PR? I think it's ok to leave it as a draft even if the tests don't pass (i.e. no need to install the dev env, at least for the beginning; we can in the worst case take over the PR if needed). Let us know what you think!


ArEnSc commented Dec 12, 2022

Yeah, hey, sorry guys! Probably sometime this week or today. My day job is iOS development, not MLE; I just moonlight in NLP and speech synthesis in the media-creation domain. Looking to transition eventually - hopefully this PR will be proof of my capabilities, so I won't abandon it =)

ArEnSc mentioned this issue Dec 12, 2022

ArEnSc commented Dec 12, 2022

#20737 - here is the draft; probably generating all the scaffolding soon.


xloem commented Jan 5, 2023

There is recent active work on interfacing multiple backends to RWKV at https://github.com/harrisonvanderbyl/rwkv_chatbot/blob/main/src/rwkvops.py#L914 (see the list at the end of the file).
EDIT: dev discussion happens in the RWKV Discord, where unfortunately I am not active.


ArEnSc commented Jan 5, 2023

Yeah, we will be looking into that. As soon as I figure out how the architecture works at a high level I might have some questions, but I am tracing the model now.

@oobabooga

I have made a very simple and dumb wrapper for RWKV including RWKVModel.from_pretrained and RWKVModel.generate functions that could maybe serve as inspiration: RWKV.py

This depends on the rwkv library: pip install rwkv==0.0.6

I'd like to tag @zphang. He recently implemented LLaMA support in transformers. Maybe adding RWKV would interest him as well.
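For anyone who wants the rough shape of such a wrapper without opening RWKV.py, a sketch is below. The class and method names mirror the comment above; the single-step interface in _step and the load_rwkv_checkpoint loader are assumptions for illustration, not the rwkv package's documented API.

    import numpy as np

    class RWKVModel:
        """Sketch of a from_pretrained/generate wrapper that carries the recurrent state."""

        def __init__(self, model, tokenizer):
            self.model = model          # underlying RWKV model object (interface assumed in _step)
            self.tokenizer = tokenizer  # anything with encode()/decode()

        @classmethod
        def from_pretrained(cls, checkpoint_path, tokenizer):
            model = load_rwkv_checkpoint(checkpoint_path)   # hypothetical loader
            return cls(model, tokenizer)

        def _step(self, token, state):
            # Assumed single-step interface: forward(tokens, state) -> (logits, new_state)
            return self.model.forward([token], state)

        def generate(self, prompt, max_new_tokens=50, temperature=1.0):
            ids = self.tokenizer.encode(prompt)
            logits, state = None, None
            for tok in ids:                        # prime the recurrent state on the prompt
                logits, state = self._step(tok, state)
            out = list(ids)
            for _ in range(max_new_tokens):
                scaled = np.asarray(logits, dtype=np.float64) / temperature
                probs = np.exp(scaled - scaled.max())
                probs /= probs.sum()
                tok = int(np.random.choice(len(probs), p=probs))
                out.append(tok)
                logits, state = self._step(tok, state)
            return self.tokenizer.decode(out)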


fblgit commented Apr 10, 2023

This is by far one of the best models right now; the performance of the 7B is outstanding.
How come the best model is not supported by HF?

@mrseeker

Because nobody tried implementing it?


fblgit commented Apr 11, 2023

We want to have a positive impact on the AI field. We think the direction of more responsible AI is through openly sharing models, datasets, training procedures, evaluation metrics and working together to solve issues. We believe open source and open science bring trust, robustness, reproducibility, and continuous innovation. With this in mind, we are leading [BigScience](https://bigscience.huggingface.co/), a collaborative workshop around the study and creation of very large language models gathering more than 1,000 researchers of all backgrounds and disciplines.

That's HF's mission, so I was wondering how HF has missed the best model in the industry. It makes me wonder about the gap between what this "Open" platform says and what it does.

And because of that, I was wondering why the HF teams are not giving a hand to port this in.
I saw the LLaMA integration go in at flash speed, with HF coverage... so why hasn't this?


flozi00 commented Apr 11, 2023

There is already an open PR by @ArEnSc

@mrseeker

Two things:
If there are open PRs, mention their numbers so we can keep track of what is stale, duplicate, etc.

Llama was so fast because people actively wanted to use it. Meta releases something, HF jumps in line and puts a PR together to support it. Since RWKV is not that big, no support. I am waiting eagerly for support...

@younesbelkada

Hi there,
I am also super excited about this model. I think that PR will go stale, as there has been no activity for a while. If someone wants to take the lead on it, I would be happy to assist, together with @ArthurZucker!


fblgit commented Apr 11, 2023

Well, I won't go into the politics of whether a big or a small company should get community support, having in mind their resources and manpower.

Projects like this, which are highly relevant, go unsupported. It's trending on GitHub... what else are we looking for?

#17230
#20809
#21875
#20737

younesbelkada mentioned this issue May 3, 2023