
Conversation


@vpj vpj commented Mar 27, 2023

What does this PR do?

This PR adds the 9B parameter GeoV language model trained by Georges Harik.

@vpj vpj marked this pull request as draft March 27, 2023 16:30
@vpj vpj marked this pull request as ready for review March 28, 2023 09:59
Collaborator

sgugger commented Mar 28, 2023

cc @ArthurZucker

Author

vpj commented Mar 30, 2023

@ArthurZucker
Collaborator

Hey @vpj, feel free to ping me for any guidance! 😉 Also, if you need a review, tell me.

Author

vpj commented Mar 30, 2023

Yeah, I need a review. I'm new to Hugging Face transformers; I just went by the tutorials. Let me know what else needs to be done in order to merge this. Thanks

Author

vpj commented Mar 31, 2023

@ArthurZucker

@ArthurZucker
Collaborator

Sure! Reviewing now!

@ArthurZucker (Collaborator) left a comment


Good work! It's already pretty clean.
My main comments are about the naming and the missing "copied from" statements.
Plus, I am pretty sure we don't need a new tokenizer.

Collaborator


Let's change the order of the classes. General functions should be at the beginning. It's a nit, but it's a convention that makes it easier to read the entire file! Follow, for example, what you can see in T5 or any other model!

Author


I used what was in gpt-neox. I moved the PreTrainedModel class down; let me know if it's ok.

model.to(torch_device)

inputs = tokenizer("My favorite food is", return_tensors="pt").to(torch_device)
expected_output = "My favorite food is pizza. I love pizza. I love pizza. I"
Collaborator


Suggested change:
-expected_output = "My favorite food is pizza. I love pizza. I love pizza. I"
+EXPECTED_OUTPUT = "My favorite food is pizza. I love pizza. I love pizza. I"

Also do you mind adding a test with the logits of the model?
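For context, an "expected logits" integration test usually hard-codes a small slice of the model's output and compares it against reference values with a tolerance. A minimal sketch of the pattern (the numbers below are made-up stand-ins, not real GeoV logits; in the actual test the slice would come from model(input_ids).logits):

```python
import math

# Sketch of the "expected logits slice" pattern used in transformers
# integration tests. In the real test, logits_slice would be computed from
# model(input_ids).logits, and EXPECTED_SLICE would hold values recorded
# from a trusted run of the model. Values here are illustrative stand-ins.
logits_slice = [0.1093, -1.2458, 2.3311]    # pretend: logits[0, 0, :3]
EXPECTED_SLICE = [0.1093, -1.2458, 2.3311]  # hard-coded reference values

def allclose(a, b, atol=1e-4):
    """Element-wise comparison with an absolute tolerance."""
    return all(math.isclose(x, y, abs_tol=atol) for x, y in zip(a, b))

assert allclose(logits_slice, EXPECTED_SLICE)
print("logits match")
```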

Author


Sorry, I didn't get what you meant about the test with the logits of the model.

Collaborator


This model should be almost entirely the same as Reformer or GPTNeoX, so let's add "copied from" statements wherever we can! Also, we should rename every GeoV to Geov in the names of the classes; it's gonna be more convenient.

Author


Why rename GeoV to Geov?

Collaborator


Pretty sure we don't need a new tokenizer for this. We just have to add \n as a special token and it will not be processed by the spm model (consider looking at the reformer or big_bird tokenizers, which should be usable).

Author


Looking into it

Author


I previously tried adding \n as a special token. But from tokenization_utils.py and tokenization_utils_base.py, it looked to me like it assigns the id len(self) + 1 to new tokens, whereas we need to add a special token with a specific id. I went through reformer but couldn't figure out how to do that. Can you help?

Collaborator


Sure. Gimme a bit of time; I'll try to find the best solution to deal with this.

Author

vpj commented Apr 3, 2023

FAILED tests/models/whisper/test_modeling_flax_whisper.py::FlaxWhisperModelTest::test_equivalence_pt_to_flax - AssertionError: 1.1205673e-05 not less than or equal to 1e-05 : outputs.encoder_last_hidden_state: Difference between PyTorch and Flax is 1.1205673217773438e-05 (>= 1e-05).

This is why the torch_and_flax test is failing.

Author

vpj commented Apr 3, 2023

@ArthurZucker I very much appreciate the help so far. Can you please help me get this PR ready by tomorrow, since I won't be available for a week after that? Thank you

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Author

vpj commented Apr 3, 2023

FAILED tests/extended/test_trainer_ext.py::TestTrainerExt::test_run_seq2seq_no_dist - TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
FAILED tests/models/pix2struct/test_image_processing_pix2struct.py::Pix2StructImageProcessingTest::test_expected_patches - PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f2bc8801950>

These errors are for tests_torch

@ArthurZucker
Collaborator

These two tests seem unrelated to your PR; pull from main and normally they should disappear

@ArthurZucker
Collaborator

Ok, the history of the PR got a little bit messed up 😅 It's alright, it can happen from time to time! You can either rebase on main starting from [4bd65f3](https://github.com/huggingface/transformers/pull/22403/commits/4bd65f354c8366feccae95278c1f0b3a85110b3b) (for example, as it is one commit that is unaffected), or you can do a soft reset to your first commit, add only your modifications, and force push. And then pull from main
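The soft-reset route can be sketched as follows. This is a hedged illustration played out in a throwaway repository so the commands are safe to run as-is; on the real PR the branch name, commit messages, and push target all differ:

```shell
# Demonstration of "soft reset to your first commit, then force push",
# in a disposable repo (commit messages here are illustrative):
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "add GeoV model"
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "accidental merge noise"
# Undo the later commit(s) but keep all file changes staged:
git reset -q --soft HEAD~1
git rev-list --count HEAD   # prints 1: only the first commit remains
# On the real PR branch you would then re-commit your changes and run:
#   git push --force origin <your-branch>
```

Merging main afterwards (rather than rebasing again) keeps the already-reviewed history intact.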

@ArthurZucker
Collaborator

The styling depends on the version of black that you are using. It seems like most of the tests are now good; the last one should pass with a pip install black==23.1

Author

vpj commented Apr 3, 2023

Yeah, I messed it up by doing a rebase instead of a merge

Author

vpj commented Apr 3, 2023

I am not sure what the check means by imports order/format; to me it looks quite similar to other files.

@ArthurZucker
Collaborator

Ok, this is ruff acting up. I use ruff 0.0.258 (we recently pinned the correct one)

Author

vpj commented Apr 3, 2023

What should I do? Do I have to install ruff 0.0.258 and run make style?

@ArthurZucker (Collaborator) left a comment


Okay, thanks a lot for all the cleanup! With this I can better see the actual changes. Given this, I am not really sure that we have to go through all the trouble of adding everything to transformers!
The easiest way to share the model is to put it on the hub using this tutorial.

I am sorry if this is more work, as you must have created the dev env and so on! But the way the code is now looks good! And it should blend in properly as a Custom Model.

Especially for the tokenization (you would have had to add a testing file), you can just do something like

class GeovTokenizer(RoformerTokenizer):
    def __init__(self):
        super().__init__()

    def convert_tokens_to_ids(self):
        ...

    def convert_ids_to_tokens(self):
        ...

which would be the only things you would have to rewrite! I can also help you as much as I can on adding this to the hub directly!


return outputs

@classmethod
Collaborator


I don't see it on the attention class, but yes, if the whole attention class is wrapped, you don't need to put it everywhere.

Author

vpj commented Apr 3, 2023

Oh thanks, didn't know that. I saw this https://huggingface.co/docs/transformers/add_new_model and thought I had to create a pull request.

So, just to make sure I'm clear, should I close this PR and share the model according to https://huggingface.co/docs/transformers/custom_models?

@ArthurZucker
Collaborator

Yes! It would be the best 😉 Thanks for your understanding!

Author

vpj commented Apr 4, 2023

Just out of curiosity, how do you choose which models to add to the repo and what goes to the hub?

Author

vpj commented Apr 4, 2023

I added it to the hub, but it doesn't work with pipelines (text-generation).

How do I register the model for text-generation?

This is what I'm doing now:

  GeoVConfig.register_for_auto_class()
  GeoVModel.register_for_auto_class("AutoModel")
  GeoVForCausalLM.register_for_auto_class("AutoModelForCausalLM")
  GeoVTokenizer.register_for_auto_class()

Trying to load the pipeline with

generator = pipeline(model="GeoV/GeoV-9b", trust_remote_code=True)

gives the error

The model 'GeoVForCausalLM' is not supported for text-generation. Supported models are ...

Thanks

Collaborator

ArthurZucker commented Apr 4, 2023

Just out of curiosity, how do you choose which models to add to the repo and what goes to the hub?

The more we grow, the more we are trying to add models to the hub! Especially if the model does not have a lot of changes compared to a model that we already support!

For the issue, I think you have to update the mapping MODEL_FOR_MASKED_LM_MAPPING by adding your model

Author

vpj commented Apr 4, 2023

How can I change MODEL_FOR_MASKED_LM_MAPPING if I'm adding to the hub?

@ArthurZucker
Collaborator

The same way you did for the AUTO_CONFIG_MAPPING.
An example from here:

config.json:
...
"auto_map": {
  "AutoConfig": "configuration_glm.GLMConfig",
  "AutoModel": "modeling_glm.GLMModel",
  "AutoModelForSeq2SeqLM": "modeling_glm.GLMForConditionalGeneration",
  "AutoModelForMultipleChoice": "modeling_glm.GLMForMultipleChoice",
  "AutoModelForSequenceClassification": "modeling_glm.GLMForSequenceClassification"
  },
...
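Adapted to this model, the corresponding entry would presumably look something like the fragment below. This is a sketch: the module file names configuration_geov.py and modeling_geov.py are assumptions about how the hub repo is laid out, while the class names are the ones already used in this PR. Note also that the text-generation pipeline generally resolves models through the causal-LM mapping:

```json
"auto_map": {
  "AutoConfig": "configuration_geov.GeoVConfig",
  "AutoModel": "modeling_geov.GeoVModel",
  "AutoModelForCausalLM": "modeling_geov.GeoVForCausalLM"
}
```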

@ArthurZucker
Collaborator

So, my bad, you just need to add AutoModelForMaskedLM!

Author

vpj commented Apr 5, 2023

This is a causal LM; is it OK to add it to masked LM?

@ArthurZucker
Collaborator

Ah, sorry; for a causal LM it should be AutoModelForSeq2SeqLM

Contributor

github-actions bot commented May 5, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this May 13, 2023

4 participants