ValueError when predicting with pretrained models #150
This code was designed for TPUs, and although it should work on GPUs, that's not something we officially support. We recommend using the GPT-NeoX repo instead for GPU training. That said, it seems like it's having trouble identifying your GPU. Try the command
Also, does your machine have 255 GPUs? Otherwise I have no idea where it's getting that number from...
No, my machine has only 1 GPU lol. I haven't used Mesh TensorFlow before, but I found this issue: the device appears to be registered as device 0, other TensorFlow models pick it up as 0, and when first loading TensorFlow I see the typical "Adding visible gpu devices: 0 ... Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6709 MB memory) -> physical GPU (device: 0 ..." messages
I got around this error by setting params['mesh_shape'] = []. Not sure if this broke something else, because now I'm getting the error:
(although it appeared to build the model properly before displaying this)
The issue was XLA devices not being enabled. Setting the mesh shape to 1x1 and adding
should make this work for GPUs. (I still couldn't run it because my GPU is too small, but the above errors no longer appeared.)
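The exact snippet referenced above was not preserved. As an assumption (not the commenter's verified code): TensorFlow 2.4 stopped creating XLA:CPU/XLA:GPU devices by default, and a common way to re-enable them is to set the TF_XLA_FLAGS environment variable before TensorFlow is imported:

```python
# Assumed fix: re-enable XLA devices, which TF 2.4+ no longer creates
# by default. This must run before `import tensorflow`.
import os

os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_xla_devices"

# import tensorflow as tf  # import TF only after the flag is set
```

If placed at the top of main.py, this takes effect for everything the script subsequently imports.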
Great to know! Thanks for chasing this down for us. I’m going to leave this open as a reminder to make sure this is in the next update. |
Thanks, that worked for me on 1 GPU. How do I set up a mesh for a multi-GPU system?
Hello. I have the same error and would appreciate knowing which files to edit to apply the above solutions:

- model_fns.py: changing `mesh_shape = mtf.convert_to_shape(params["mesh_shape"])` to `mesh_shape = mtf.convert_to_shape(params["1x1"])` produces an error
- config.json (GPT3_XL_Pile): `"mesh_shape" : "x:128,y:2",` (is this where "1x1" goes?)
- another .py file?
- `import os`: add to 'main.py' and/or 'model_fns.py' (???)

As a sidebar, I managed to convert both models using your 'convert_gpt.py' repo script. Keep in mind to change the huggingface config.json files to match the same "n_head" values, or it will generate gibberish. I was able to sample from transformer-based repos with `"n_head" : 16,` (GPT3_XL) and `"n_head" : 20,` (GPT3_2.7B). Cheers
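For what it's worth, a plausible single-GPU edit (a sketch, assuming the `"mesh_shape"` key in config.json is what model_fns.py reads into `params["mesh_shape"]`; all other keys unchanged) would shrink the mesh to a trivial single-device one:

```json
{
  "mesh_shape": "x:1,y:1"
}
```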
@GenTxt I haven't actually run the model on GPU, so I'll leave that question for @iocuydi or @soneo1127 to answer. I am intrigued by your sidebar, though. Did you test your converted model on long inputs? We are under the impression that that file doesn't work as-is, because our model uses both local and global attention. Specifically, we think that for short contexts (less than 256 characters, if I recall correctly) it works fine, but for full contexts it does not. HF is working on (what we think is) the problem on their end, but it'd be a big win if that turned out to be unnecessary.
@soneo1127 I'm going to recommend you check out the Mesh TF documentation for further info. |
Have tested both models with long inputs, and the max good output is around 400-500+ before the text turns to gibberish. For some reason it starts jamming fragments of words and letters together, similar in appearance to low-epoch character-based LSTM training. The good output from both 2.7B and XL is on par with, and often better than, 1558M GPT-2. It doesn't go beyond the default 1024 even after editing the transformers files. Will wait for a proper HF conversion of the models, which will hopefully solve all of those issues. In the meantime, I would appreciate the requested info from others in this thread. Cheers,
I just double-checked, and it's actually ~512 where performance should fall off a cliff. For prompts of length 400-512, I would expect the initial tokens to be good, but as the model goes on it devolves into gibberish. Is that what you see? It's good to hear that the model is often better than 1.5B GPT-2: that's what our preliminary testing has shown too. The next update to the README will include the following table:
Hi, I think the easiest way to use multiple GPUs is to change mesh_shape.
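As an illustrative sketch only (hypothetical values; the axis names, and which model dimensions get split across them, depend on the layout rules in the rest of the config), a 2-GPU machine might split the mesh along one axis in config.json:

```json
{
  "mesh_shape": "x:1,y:2"
}
```

The product of the mesh dimensions should match the number of devices, and the config's layout must map at least one model dimension onto the split axis for the extra GPU to be used.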
Describe the bug
When using GPT3XL to perform inference with the --predict flag as shown in the examples, the following error is thrown:
This is with a single GTX 1070 GPU.
The commands that both produced this error were:
python main.py --model=gpt3xl/config.json --predict --prompt=prompt.txt
python main.py --model=gpt3xl/config.json --predict --prompt=prompt.txt --gpu_ids=['device:GPU:0']