Denoising Task crashes OOM #5076

Open
BUCKFAE opened this issue Apr 13, 2023 · 0 comments

Hey!

We are trying to train a BART model for German from scratch using the GC4 corpus. For testing purposes, we use only 20 GB of the dataset for training, in a container with 250 GB of RAM and one NVIDIA A100.

Dockerfile
FROM nvidia/cuda:11.3.1-devel-ubuntu20.04

SHELL ["/bin/bash", "-c"]

ENV PYTHONUNBUFFERED=1
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=Europe/Berlin

RUN apt-get update \
 && apt-get install -y software-properties-common \
 && add-apt-repository -y ppa:deadsnakes/ppa \
 && apt-get update && apt-get install -y \
 python3.9-dev \
 python3.9-venv \
 python3.9-distutils \
 python3-pip \
 git \
 llvm \
 vim \
 neovim \
 tree \
 curl \
 wget \
 htop \
 zsh \
 && rm -rf /var/lib/apt/lists/*

ENV HOME=/tmp
RUN ln -sf /usr/bin/python3.9 /usr/bin/python3

WORKDIR /workdir/code

COPY requirements.txt .
RUN pip3 install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113

requirements.txt
aiohttp==3.8.1; python_version >= "3.7"
aiosignal==1.2.0; python_version >= "3.6"
async-timeout==4.0.2; python_version >= "3.6"
attrs==22.1.0; python_version >= "3.6"
blis==0.7.8; python_version >= "3.6"
catalogue==2.0.8; python_version >= "3.6"
certifi==2022.6.15; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
charset-normalizer==2.1.0; python_full_version >= "3.7.0" and python_version >= "3.7" and python_version < "4"
click==8.1.3; python_version >= "3.7"
colorama==0.4.5; python_full_version >= "3.7.0" and platform_system == "Windows" and python_version >= "3.6" and (python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" or platform_system == "Windows" and python_version >= "3.7" and python_full_version >= "3.5.0")
cymem==2.0.6; python_version >= "3.6"
datasets==2.4.0
dill==0.3.5.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.7.0"
docker-pycreds==0.4.0; python_version >= "3.6"
elastic-transport==8.1.2; python_version >= "3.6" and python_version < "4"
elasticsearch==8.3.3; python_version >= "3.6" and python_version < "4"
filelock==3.8.0; python_version >= "3.7" and python_full_version >= "3.7.0"
frozenlist==1.3.1; python_version >= "3.7"
fsspec==2022.7.1; python_version >= "3.7"
gitdb==4.0.9; python_version >= "3.7"
gitpython==3.1.27; python_version >= "3.7"
huggingface-hub==0.8.1; python_full_version >= "3.7.0"
idna==3.3; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
jinja2==3.1.2; python_version >= "3.7"
langcodes==3.3.0; python_version >= "3.6"
markupsafe==2.1.1; python_version >= "3.7"
multidict==6.0.2; python_version >= "3.7"
multiprocess==0.70.13; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.7.0"
murmurhash==1.0.7; python_version >= "3.6"
numpy==1.23.1
packaging==21.3; python_version >= "3.6" and python_full_version >= "3.7.0"
pandas==1.4.3; python_version >= "3.8"
pathtools==0.1.2; python_version >= "3.6"
pathy==0.6.2; python_version >= "3.6"
preshed==3.0.6; python_version >= "3.6"
promise==2.3; python_version >= "3.6"
protobuf==3.20.1; python_version >= "3.7"
psutil==5.9.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pyarrow==9.0.0; python_version >= "3.7"
pydantic==1.9.2; python_full_version >= "3.6.1" and python_version >= "3.6"
pyparsing==3.0.9; python_version >= "3.6" and python_full_version >= "3.7.0"
python-dateutil==2.8.2; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
pytz==2022.2; python_version >= "3.8"
pyyaml==6.0; python_version >= "3.6" and python_full_version >= "3.7.0"
regex==2022.7.25; python_version >= "3.6" and python_full_version >= "3.7.0"
requests==2.28.1; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
responses==0.18.0; python_version >= "3.7"
sentry-sdk==1.9.4; python_version >= "3.6"
setproctitle==1.3.2; python_version >= "3.7"
shortuuid==1.0.9; python_version >= "3.6"
six==1.16.0; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
smart-open==5.2.1; python_version >= "3.6" and python_version < "4.0"
smmap==5.0.0; python_version >= "3.7"
spacy-legacy==3.0.9; python_version >= "3.6"
spacy-loggers==1.0.3; python_version >= "3.6"
spacy==3.4.1; python_version >= "3.6"
srsly==2.4.4; python_version >= "3.6"
thinc==8.1.0; python_version >= "3.6"
tokenizers==0.12.1; python_full_version >= "3.7.0"
tqdm==4.64.0; python_full_version >= "3.7.0" and python_version >= "3.6"
transformers==4.21.1; python_full_version >= "3.7.0"
typer==0.4.2; python_version >= "3.6"
typing-extensions==4.3.0; python_version >= "3.7" and python_full_version >= "3.7.0"
urllib3==1.26.11; python_full_version >= "3.7.0" and python_version < "4" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.7") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6")
wandb==0.13.1; python_version >= "3.6"
wasabi==0.10.1; python_version >= "3.6"
xxhash==3.0.0; python_version >= "3.6"
yarl==1.8.1; python_version >= "3.7"
ftfy==6.1.1
git+https://github.com/facebookresearch/fairseq.git
tensorboardX==2.5.1
debugpy==1.6.3

Most importantly, we install fairseq via git+https://github.com/facebookresearch/fairseq.git, because we could not get the denoising task to work when installing fairseq from PyPI.
We used the following commit: 176cd93

We use the following minimal example:

#!/bin/bash
train_dir="/datasets/text/germancolossal4/debug"
out_dir="data/preprocessed"

echo "Preprocessing data: ${train_dir}"

fairseq-preprocess \
    --trainpref ${train_dir}/train \
    --validpref ${train_dir}/valid \
    --testpref ${train_dir}/test \
    --task denoising \
    --criterion cross_entropy \
    --optimizer adam \
    --only-source \
    --workers 1 \
    --destdir ${out_dir}

echo "Finished preprocessing:"

fairseq-train ${out_dir} \
    --task denoising \
    --arch bart_base \
    --batch-size 1 \
    --skip-invalid-size-inputs-valid-test \
    --optimizer adam

We only use one worker for the preprocessing as fairseq-preprocess gets stuck when using more than one worker.

When we run this script with a 20 GB training file, fairseq-train runs out of memory and the container crashes without any error message. After adding a wandb project, we observed that training of the first epoch starts but never completes: the pod runs out of memory before that.
The minimal training command we built follows the suggestions from #1899. This issue might be related to #4930.
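
Since the container dies without an error message, one way to see when the memory grows (while loading the dataset or during the first epoch) might be to log the resident set size of the fairseq-train process from a second shell in the same container. The following is only a minimal sketch, not part of fairseq: it assumes a Linux container with /proc available, the function names are made up for illustration, and it tracks a single PID only, so memory held by child processes is not included.

```python
import re
import sys
import time
from typing import Optional

# Matches the "VmRSS:  123456 kB" line of /proc/<pid>/status (Linux only).
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status."""
    match = _VMRSS.search(status_text)
    return int(match.group(1)) if match else None


def rss_kib(pid: int) -> Optional[int]:
    """Read the resident set size of a process; None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_vmrss_kib(f.read())
    except OSError:
        return None


def watch(pid: int, interval_s: float = 5.0) -> None:
    """Print a timestamped RSS reading until the process disappears."""
    while True:
        rss = rss_kib(pid)
        if rss is None:
            break
        stamp = time.strftime("%H:%M:%S")
        # KiB -> GiB
        print(f"{stamp} pid={pid} rss={rss / 1024**2:.2f} GiB", flush=True)
        time.sleep(interval_s)


if __name__ == "__main__":
    # Usage: python3 watch_rss.py <pid>
    if len(sys.argv) == 2 and sys.argv[1].isdigit():
        watch(int(sys.argv[1]))
```

Saved under a hypothetical name such as watch_rss.py, it could be started with `python3 watch_rss.py "$(pgrep -f fairseq-train | head -n 1)"`; redirecting the output to a file on a mounted volume would keep the last readings even if the pod is killed.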

Is this amount of RAM usage expected?

Thank you very much in advance!
