Denoising Task crashes OOM #5076

BUCKFAE · 2023-04-13T18:47:02Z

Hey!

We are trying to train a BART model for German from scratch using the GC4 corpus. For testing purposes, we use only 20GB of the dataset for training, in a container with 250GB of RAM and one NVIDIA A100.

Dockerfile

FROM nvidia/cuda:11.3.1-devel-ubuntu20.04

SHELL ["/bin/bash", "-c"]

ENV PYTHONUNBUFFERED=1
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=Europe/Berlin

RUN apt-get update \
 && apt-get install -y software-properties-common \
 && add-apt-repository -y ppa:deadsnakes/ppa \
 && apt-get update && apt-get install -y \
 python3.9-dev \
 python3.9-venv \
 python3.9-distutils \
 python3-pip \
 git \
 llvm \
 vim \
 neovim \
 tree \
 curl \
 wget \
 htop \
 zsh \
 && rm -rf /var/lib/apt/lists/*

ENV HOME=/tmp
RUN ln -sf /usr/bin/python3.9 /usr/bin/python3

WORKDIR /workdir/code

COPY requirements.txt .
RUN pip3 install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113

Requirements.txt

aiohttp==3.8.1; python_version >= "3.7"
aiosignal==1.2.0; python_version >= "3.6"
async-timeout==4.0.2; python_version >= "3.6"
attrs==22.1.0; python_version >= "3.6"
blis==0.7.8; python_version >= "3.6"
catalogue==2.0.8; python_version >= "3.6"
certifi==2022.6.15; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
charset-normalizer==2.1.0; python_full_version >= "3.7.0" and python_version >= "3.7" and python_version < "4"
click==8.1.3; python_version >= "3.7"
colorama==0.4.5; python_full_version >= "3.7.0" and platform_system == "Windows" and python_version >= "3.6" and (python_version >= "3.7" and python_full_version < "3.0.0" and platform_system == "Windows" or platform_system == "Windows" and python_version >= "3.7" and python_full_version >= "3.5.0")
cymem==2.0.6; python_version >= "3.6"
datasets==2.4.0
dill==0.3.5.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.7.0"
docker-pycreds==0.4.0; python_version >= "3.6"
elastic-transport==8.1.2; python_version >= "3.6" and python_version < "4"
elasticsearch==8.3.3; python_version >= "3.6" and python_version < "4"
filelock==3.8.0; python_version >= "3.7" and python_full_version >= "3.7.0"
frozenlist==1.3.1; python_version >= "3.7"
fsspec==2022.7.1; python_version >= "3.7"
gitdb==4.0.9; python_version >= "3.7"
gitpython==3.1.27; python_version >= "3.7"
huggingface-hub==0.8.1; python_full_version >= "3.7.0"
idna==3.3; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
jinja2==3.1.2; python_version >= "3.7"
langcodes==3.3.0; python_version >= "3.6"
markupsafe==2.1.1; python_version >= "3.7"
multidict==6.0.2; python_version >= "3.7"
multiprocess==0.70.13; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.7.0"
murmurhash==1.0.7; python_version >= "3.6"
numpy==1.23.1
packaging==21.3; python_version >= "3.6" and python_full_version >= "3.7.0"
pandas==1.4.3; python_version >= "3.8"
pathtools==0.1.2; python_version >= "3.6"
pathy==0.6.2; python_version >= "3.6"
preshed==3.0.6; python_version >= "3.6"
promise==2.3; python_version >= "3.6"
protobuf==3.20.1; python_version >= "3.7"
psutil==5.9.1; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pyarrow==9.0.0; python_version >= "3.7"
pydantic==1.9.2; python_full_version >= "3.6.1" and python_version >= "3.6"
pyparsing==3.0.9; python_version >= "3.6" and python_full_version >= "3.7.0"
python-dateutil==2.8.2; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
pytz==2022.2; python_version >= "3.8"
pyyaml==6.0; python_version >= "3.6" and python_full_version >= "3.7.0"
regex==2022.7.25; python_version >= "3.6" and python_full_version >= "3.7.0"
requests==2.28.1; python_version >= "3.7" and python_version < "4" and python_full_version >= "3.7.0"
responses==0.18.0; python_version >= "3.7"
sentry-sdk==1.9.4; python_version >= "3.6"
setproctitle==1.3.2; python_version >= "3.7"
shortuuid==1.0.9; python_version >= "3.6"
six==1.16.0; python_version >= "3.8" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.8"
smart-open==5.2.1; python_version >= "3.6" and python_version < "4.0"
smmap==5.0.0; python_version >= "3.7"
spacy-legacy==3.0.9; python_version >= "3.6"
spacy-loggers==1.0.3; python_version >= "3.6"
spacy==3.4.1; python_version >= "3.6"
srsly==2.4.4; python_version >= "3.6"
thinc==8.1.0; python_version >= "3.6"
tokenizers==0.12.1; python_full_version >= "3.7.0"
tqdm==4.64.0; python_full_version >= "3.7.0" and python_version >= "3.6"
transformers==4.21.1; python_full_version >= "3.7.0"
typer==0.4.2; python_version >= "3.6"
typing-extensions==4.3.0; python_version >= "3.7" and python_full_version >= "3.7.0"
urllib3==1.26.11; python_full_version >= "3.7.0" and python_version < "4" and python_version >= "3.7" and (python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.7") and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "4" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6") and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.6.0" and python_version < "4" and python_version >= "3.6")
wandb==0.13.1; python_version >= "3.6"
wasabi==0.10.1; python_version >= "3.6"
xxhash==3.0.0; python_version >= "3.6"
yarl==1.8.1; python_version >= "3.7"
ftfy==6.1.1
git+https://github.com/facebookresearch/fairseq.git
tensorboardX==2.5.1
debugpy==1.6.3

Most importantly, we install fairseq via git+https://github.com/facebookresearch/fairseq.git because we could not get the denoising task to work when installing fairseq from PyPI.
We used commit 176cd93.

We use the following minimal example:

#!/bin/bash
train_dir="/datasets/text/germancolossal4/debug"
out_dir="data/preprocessed"

echo "Preprocessing data: ${train_dir}"

fairseq-preprocess \
    --trainpref ${train_dir}/train \
    --validpref ${train_dir}/valid \
    --testpref ${train_dir}/test \
    --task denoising \
    --criterion cross_entropy \
    --optimizer adam \
    --only-source \
    --workers 1 \
    --destdir ${out_dir}

echo "Finished preprocessing:"

fairseq-train ${out_dir} \
    --task denoising \
    --arch bart_base \
    --batch-size 1 \
    --skip-invalid-size-inputs-valid-test \
    --optimizer adam

We only use one worker for the preprocessing as fairseq-preprocess gets stuck when using more than one worker.

When running this script with a 20GB training file, fairseq-train runs out of memory and the container crashes without any error message. After adding a wandb project, we observed that training of the first epoch starts but does not complete; the pod runs out of memory before that.
The minimal training command we built follows the suggestions from #1899. This issue might be related to #4930.
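
Since the container dies without an error message, one way to see how memory grows before the crash is to poll the resident set size of the training process and write it to a file on a mounted volume. The following is a minimal sketch, not part of fairseq: the helper names `read_rss_kib` and `run_and_log_rss` and the log format are made up for illustration, and it assumes Linux because it reads `/proc/<pid>/status`.

```python
import subprocess
import time
from pathlib import Path


def read_rss_kib(pid):
    """Return the resident set size of `pid` in KiB, or None if unavailable.

    Reads the VmRSS line from /proc/<pid>/status, so this only works on Linux.
    """
    try:
        for line in Path(f"/proc/{pid}/status").read_text().splitlines():
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    except (OSError, ValueError, IndexError):
        pass
    return None


def run_and_log_rss(cmd, log_path, interval_s=5.0):
    """Run `cmd` and append one "elapsed_seconds<TAB>rss_kib" line per poll."""
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    # buffering=1 means line-buffered, so each sample is flushed as it is written
    with open(log_path, "a", buffering=1) as log:
        while proc.poll() is None:
            rss = read_rss_kib(proc.pid)
            if rss is not None:
                log.write(f"{time.monotonic() - start:.1f}\t{rss}\n")
            time.sleep(interval_s)
    return proc.returncode
```

It could be called as `run_and_log_rss(["fairseq-train", "data/preprocessed", "--task", "denoising", "--arch", "bart_base", "--batch-size", "1", "--skip-invalid-size-inputs-valid-test", "--optimizer", "adam"], "rss.log")`. Note that this only samples the main process; memory held by child processes such as data loader workers is not included.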

Is this amount of RAM usage expected?
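
One factor that might be worth ruling out is the vocabulary size. The script above has no BPE or tokenizer step, so fairseq-preprocess builds `dict.txt` from whitespace-separated tokens, and every dictionary entry becomes a row of the embedding table. The sketch below is a rough check and not a diagnosis: it counts the entries in `data/preprocessed/dict.txt` and estimates the fp32 size of one embedding table at the bart_base width of 768. The function names are hypothetical, and the count of five special symbols missing from the file is an assumption.

```python
EMBED_DIM = 768        # embedding width of bart_base
BYTES_PER_FP32 = 4
EXTRA_SYMBOLS = 5      # assumption: <s>, <pad>, </s>, <unk> and <mask> are not listed in dict.txt


def count_dict_entries(dict_path):
    """Count non-empty lines in a fairseq dictionary file ("<symbol> <count>" per line)."""
    with open(dict_path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def embedding_gib(vocab_size, copies=1):
    """Size in GiB of `copies` fp32 tensors of shape (vocab_size, EMBED_DIM)."""
    return vocab_size * EMBED_DIM * BYTES_PER_FP32 * copies / 2**30


def report(dict_path):
    """Estimate embedding memory for the dictionary stored at `dict_path`."""
    vocab = count_dict_entries(dict_path) + EXTRA_SYMBOLS
    return {
        "vocab_size": vocab,
        "weights_gib": embedding_gib(vocab),
        # weights + gradient + Adam's two moment estimates
        "weights_grad_adam_gib": embedding_gib(vocab, copies=4),
    }
```

Calling `report("data/preprocessed/dict.txt")` gives an order-of-magnitude figure only. Whether those tensors end up in host RAM or on the GPU depends on the training setup, and the dataset index and data loading add their own memory on top.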

Thank you very much in advance!


BUCKFAE added the "needs triage" and "question" labels on Apr 13, 2023
