
skip some gpt_neox tests that require 80G RAM#17923

Merged
sgugger merged 4 commits into huggingface:main from ydshieh:skip_a_gpt_neox_test
Jul 1, 2022

Conversation

@ydshieh
Collaborator

@ydshieh ydshieh commented Jun 28, 2022

What does this PR do?

GPT-NeoX requires ~80G RAM to run. Our CI runners have only 60G RAM. Skip a few tests for now.

Do you think it's better to use something like

@unittest.skipUnless(psutil.virtual_memory().total / 1024 ** 3 > 80, "GPT-NeoX requires 80G RAM for testing")

The problem is that psutil is not in the requirements.
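For context, a minimal sketch of how such a guard could look without adding psutil to the requirements, assuming the CI runners are Linux (os.sysconf is not available everywhere); the decorator name is illustrative:

import os
import unittest

def total_ram_gb():
    # Total physical memory in GiB, read via os.sysconf (Linux/Unix only),
    # so no extra dependency on psutil is needed.
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1024**3

# Hypothetical decorator: skips the test unless the machine has ~80G of RAM.
require_80gb_ram = unittest.skipUnless(total_ram_gb() >= 80, "GPT-NeoX requires 80G RAM for testing")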

@ydshieh ydshieh requested review from sgugger and stas00 June 28, 2022 12:23
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jun 28, 2022

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger sgugger left a comment


We will never have a runner with 80GB of RAM, so if there is no better alternative, we should just delete those tests.

@stas00
Contributor

stas00 commented Jun 28, 2022

What Sylvain said, and I'd ask a broader question - why are we running the same test on multiple checkpoints of the same model that differ only in size? The purpose of our test suite is not to test models on the hub, it's to test the model's code. So such tests should never be there in the first place.

  • 99% of the time the tests should be run against tiny random models, most of which reside under https://huggingface.co/hf-internal-testing - these are functional tests.
  • 1% of tests should be against the smallest non-random model to test the quality of the results. And typically these are @slow tests.

Of course, the % breakdown is symbolic, the point I was trying to convey is that most tests should be really fast in download and execution.


If there is a need to test models on the hub, there should be a separate CI whose only job is to load the models and run some basic tests on them. That CI would need a ton of CPU and GPU memory and a large number of GPUs for obvious reasons - e.g. t5-11b and other huge models.
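
For context, a minimal sketch of the kind of tiny-random functional test described in the first bullet above; the config values are illustrative, not the ones used in the actual test suite:

import torch
from transformers import GPTNeoXConfig, GPTNeoXModel

# Build a tiny randomly initialized model: fast to construct, no large download,
# and it exercises the modeling code rather than a hub checkpoint.
config = GPTNeoXConfig(vocab_size=1024, hidden_size=32, num_hidden_layers=2, num_attention_heads=4, intermediate_size=64)
model = GPTNeoXModel(config)

input_ids = torch.randint(0, config.vocab_size, (1, 8))
outputs = model(input_ids)
assert outputs.last_hidden_state.shape == (1, 8, config.hidden_size)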

@ydshieh
Collaborator Author

ydshieh commented Jun 28, 2022

Hi @stas00

The related tests here are decorated with @slow and run in the daily scheduled CI, not the push CI. And only one checkpoint is tested: GPT_NEOX_PRETRAINED_MODEL_ARCHIVE_LIST[:1].

For test_model_from_pretrained, I think we can use tiny random models in hf-internal-testing for GPTNeoX if we want to keep the test. However, we always have integration tests (like GPTNeoXModelIntegrationTest) which are important to have.

Note that on the scheduled CI, we use a cache server (FileStore on GCP), so there is no real downloading (i.e. downloading is very fast, since it happens within GCP's network).

The runners also have 16 vCPUs and 60G RAM.

@stas00
Contributor

stas00 commented Jun 28, 2022

Ah, good point, I missed [:1] - why is there a loop then?

for model_name in GPT_NEOX_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:

We should probably write out the desired smallest real model explicitly then - and perhaps it's small enough to fit?
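
For reference, a sketch of what the @slow test might look like with the checkpoint written out explicitly instead of looping over the archive list (EleutherAI/gpt-neox-20b is the only published checkpoint, hence the memory problem):

import unittest
from transformers import GPTNeoXModel
from transformers.testing_utils import slow

class GPTNeoXModelTest(unittest.TestCase):  # illustrative fragment, not the full test class
    @slow
    def test_model_from_pretrained(self):
        # Explicit checkpoint instead of looping over GPT_NEOX_PRETRAINED_MODEL_ARCHIVE_LIST[:1];
        # this still needs ~80G of RAM to load the 20B weights.
        model = GPTNeoXModel.from_pretrained("EleutherAI/gpt-neox-20b")
        self.assertIsNotNone(model)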

@sgugger
Collaborator

sgugger commented Jun 28, 2022

The main point is that GPT-NeoX does not come with a smaller pretrained model; there is only the 20B version.

@ydshieh
Collaborator Author

ydshieh commented Jun 28, 2022

Ah, good point, I missed [:1] - why is there a loop then?

for model_name in GPT_NEOX_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:

We should probably write out the desired smallest real model explicitly then - and perhaps it's small enough to fit?

I think this is from old code. We don't want to maintain ...PRETRAINED_MODEL_ARCHIVE_LIST anymore, and for some models, we do use the explicit checkpoint name.

I will just remove the 2 tests here.

@ydshieh
Collaborator Author

ydshieh commented Jun 28, 2022

Removed. I will rebase on main later to see if all tests pass.

@ydshieh
Collaborator Author

ydshieh commented Jul 1, 2022

I am ready for the merge :-)

@sgugger sgugger merged commit 14fb8a6 into huggingface:main Jul 1, 2022
viclzhu pushed a commit to viclzhu/transformers that referenced this pull request Jul 18, 2022
* skip some gpt_neox tests that require 80G RAM

* remove tests

* fix quality

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
@ydshieh ydshieh deleted the skip_a_gpt_neox_test branch September 7, 2022 08:11