We are trying to train a BART model for German from scratch using the GC4 corpus. For testing purposes, we use only 20 GB of the dataset for training, in a container with 250 GB of RAM and one NVIDIA A100.

Our Requirements.txt:
aiohttp==3.8.1; python_version >= "3.7"
aiosignal==1.2.0; python_version >= "3.6"
async-timeout==4.0.2; python_version >= "3.6"
attrs==22.1.0; python_version >= "3.6"
blis==0.7.8; python_version >= "3.6"
catalogue==2.0.8; python_version >= "3.6"
certifi==2022.6.15; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
charset-normalizer==2.1.0; python_full_version >= "3.7.0" and python_version >= "3.7" and python_version < "4"
click==8.1.3; python_version >= "3.7"
colorama==0.4.5; python_full_version >= "3.7.0" and platform_system == "Windows" and python_version >= "3.6" and (python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" or platform_system == "Windows" and python_version >= "3.7" and python_full_version >= "3.5.0")
cymem==2.0.6; python_version >= "3.6"
datasets==2.4.0
dill==0.3.5.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.7.0"
docker-pycreds==0.4.0; python_version >= "3.6"
elastic-transport==8.1.2; python_version >= "3.6" and python_version < "4"
elasticsearch==8.3.3; python_version >= "3.6" and python_version < "4"
filelock==3.8.0; python_version >= "3.7" and python_full_version >= "3.7.0"
frozenlist==1.3.1; python_version >= "3.7"
fsspec==2022.7.1; python_version >= "3.7"
gitdb==4.0.9; python_version >= "3.7"
gitpython==3.1.27; python_version >= "3.7"
huggingface-hub==0.8.1; python_full_version >= "3.7.0"
idna==3.3; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
jinja2==3.1.2; python_version >= "3.7"
langcodes==3.3.0; python_version >= "3.6"
markupsafe==2.1.1; python_version >= "3.7"
multidict==6.0.2; python_version >= "3.7"
multiprocess==0.70.13; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.7.0"
murmurhash==1.0.7; python_version >= "3.6"
numpy==1.23.1
packaging==21.3; python_version >= "3.6" and python_full_version >= "3.7.0"
pandas==1.4.3; python_version >= "3.8"
pathtools==0.1.2; python_version >= "3.6"
pathy==0.6.2; python_version >= "3.6"
preshed==3.0.6; python_version >= "3.6"
promise==2.3; python_version >= "3.6"
protobuf==3.20.1; python_version >= "3.7"
psutil==5.9.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pyarrow==9.0.0; python_version >= "3.7"
pydantic==1.9.2; python_full_version >= "3.6.1" and python_version >= "3.6"
pyparsing==3.0.9; python_version >= "3.6" and python_full_version >= "3.7.0"
python-dateutil==2.8.2; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
pytz==2022.2; python_version >= "3.8"
pyyaml==6.0; python_version >= "3.6" and python_full_version >= "3.7.0"
regex==2022.7.25; python_version >= "3.6" and python_full_version >= "3.7.0"
requests==2.28.1; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
responses==0.18.0; python_version >= "3.7"
sentry-sdk==1.9.4; python_version >= "3.6"
setproctitle==1.3.2; python_version >= "3.7"
shortuuid==1.0.9; python_version >= "3.6"
six==1.16.0; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
smart-open==5.2.1; python_version >= "3.6" and python_version < "4.0"
smmap==5.0.0; python_version >= "3.7"
spacy-legacy==3.0.9; python_version >= "3.6"
spacy-loggers==1.0.3; python_version >= "3.6"
spacy==3.4.1; python_version >= "3.6"
srsly==2.4.4; python_version >= "3.6"
thinc==8.1.0; python_version >= "3.6"
tokenizers==0.12.1; python_full_version >= "3.7.0"
tqdm==4.64.0; python_full_version >= "3.7.0" and python_version >= "3.6"
transformers==4.21.1; python_full_version >= "3.7.0"
typer==0.4.2; python_version >= "3.6"
typing-extensions==4.3.0; python_version >= "3.7" and python_full_version >= "3.7.0"
urllib3==1.26.11; python_full_version >= "3.7.0" and python_version < "4" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.7") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6")
wandb==0.13.1; python_version >= "3.6"
wasabi==0.10.1; python_version >= "3.6"
xxhash==3.0.0; python_version >= "3.6"
yarl==1.8.1; python_version >= "3.7"
ftfy==6.1.1
git+https://github.com/facebookresearch/fairseq.git
tensorboardX==2.5.1
debugpy==1.6.3
Most importantly, we use git+https://github.com/facebookresearch/fairseq.git to install fairseq, as we could not get the denoising task to work when installing fairseq from PyPI. We used the following commit: 176cd93.
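For reference, pinning that commit with pip looks roughly like this (the short hash is the one mentioned above; the exact pin syntax here is just an illustration):

    pip install "git+https://github.com/facebookresearch/fairseq.git@176cd93"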
We only use one worker for the preprocessing as fairseq-preprocess gets stuck when using more than one worker.
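For illustration, a single-worker preprocessing call for a monolingual (source-only) denoising setup looks roughly like the sketch below; the file names and output directory are placeholders, not our actual paths:

    # single worker, since more than one worker hangs for us
    fairseq-preprocess \
        --only-source \
        --trainpref train.bpe \
        --validpref valid.bpe \
        --destdir data-bin/gc4 \
        --workers 1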
When running this script with a train file containing 20 GB of data, fairseq-train runs out of memory and the container crashes without any error message. After adding a wandb project, we observed that training of the first epoch starts but does not complete; the pod runs out of memory before that.
The minimal training command we built follows the suggestions from #1899. This issue might be related to #4930.
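For context, a denoising training command along the lines of the #1899 suggestions looks roughly like the sketch below; the architecture, noising parameters, and optimizer settings are illustrative placeholders rather than our exact command:

    # BART-style denoising pre-training (hyperparameters are placeholders)
    fairseq-train data-bin/gc4 \
        --task denoising \
        --arch bart_base \
        --criterion cross_entropy \
        --tokens-per-sample 512 \
        --max-tokens 4096 \
        --mask 0.3 \
        --mask-length span-poisson \
        --poisson-lambda 3.5 \
        --replace-length 1 \
        --permute-sentences 1.0 \
        --optimizer adam \
        --adam-betas "(0.9, 0.98)" \
        --lr 0.0004 \
        --lr-scheduler polynomial_decay \
        --warmup-updates 10000 \
        --total-num-update 500000 \
        --max-update 500000 \
        --fp16 \
        --wandb-project <your-wandb-project>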
Is this amount of RAM usage expected?
Thank you very much in advance!