This repository was archived by the owner on Oct 19, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 355
This repository was archived by the owner on Oct 19, 2024. It is now read-only.
Ray spill out of disk error when using alpa to auto-parallelize llama #969
Copy link
Copy link
Open
Description
Please describe the bug
When I tried to use alpa to parallelize llama-7b model on ray cluster (one node with 8 GPUs), disk space will continue to grow and never stop due to ray object spilling. Finally the program will throw out of disk space error.
Please describe the expected behavior
As expected, alpa training will run normally.
System information and environment
- OS Platform and Distribution: Ubuntu 20.04 docker
- Python version: 3.10.13
- CUDA version: 11.8
- NCCL version: 2.16.2
- cupy version: cupy-cuda11x==12.2.0
- GPU model and memory: NVIDIA A800 80GB
- Alpa version: 1.0.0.dev0, build from source (alpa main branch)
- TensorFlow version: 2.11.0
- JAX version: 0.3.22
- Ray version:
>>> print(ray.__version__)
2.1.0
>>> print(ray.__commit__)
be49bde7ee4f6adb3f8710aee0665c27f9f0bb62
To Reproduce
Steps to reproduce the behavior:
- LLaMa model used: https://github.com/young-geng/EasyLM/tree/main/EasyLM/models/llama
ray start --head
cd examples/llama_finetune
bash run_llama.sh
Error Logs
cd examples/llama_finetune && bash run_llama.sh
2023-11-21 11:15:40.798832: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /fs/llm/zigzagcai/gcc_10.2.0/lib64:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-11-21 11:15:40.798902: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /fs/llm/zigzagcai/gcc_10.2.0/lib64:/usr/local/nccl-rdma-sharp-plugins/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-11-21 11:15:40.798911: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-11-21 11:15:42,074 INFO worker.py:1342 -- Connecting to existing Ray cluster at address: xxx.xx.x.xxx:6379...
2023-11-21 11:15:42,080 INFO worker.py:1519 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
INFO:__main__:Training/evaluation parameters TrainingArguments(output_dir='./output', overwrite_output_dir=True, do_train=True, do_eval=False, per_device_train_batch_size=32, per_device_eval_batch_size=16, num_micro_batches=32, operator_parallel=1, pipeline_parallel=1, use_remat=True, learning_rate=0.0005, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, adafactor=False, num_train_epochs=3.0, warmup_ratio=0.03, logging_steps=1, save_steps=3000, eval_steps=1000, seed=42, push_to_hub=False, hub_model_id=None, hub_token=None)
Model config LLaMAConfig {
"attn_pdrop": 0.0,
"bos_token_id": 0,
"embd_pdrop": 0.0,
"eos_token_id": 1,
"fcm_max_ratio": 0.0,
"fcm_min_ratio": 0.0,
"gradient_checkpointing": true,
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_sequence_length": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"resid_pdrop": 0.0,
"rms_norm_eps": 1e-06,
"tie_word_embeddings": false,
"transformers_version": "4.28.1",
"use_cache": true,
"vocab_size": 32000
}
loading file tokenizer.model
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"transformers_version": "4.28.1"
}
loading configuration file /root/llama/llama-7b/config.json
Model config LlamaConfig {
"_name_or_path": "/root/llama/llama-7b",
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"max_sequence_length": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"pad_token_id": 0,
"rms_norm_eps": 1e-06,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.28.1",
"use_cache": true,
"vocab_size": 32000
}
loading weights file /root/llama/llama-7b/model.safetensors.index.json
Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "4.28.1"
}
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.44s/it]
All model checkpoint weights were used when initializing LlamaForCausalLM.
All the weights of LlamaForCausalLM were initialized from the model checkpoint at /root/llama/llama-7b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
loading configuration file /root/llama/llama-7b/generation_config.json
Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0,
"transformers_version": "4.28.1"
}
Loading data...
#train 44425, #eval 907
Formatting inputs...Skip in lazy mode
Formatting inputs...Skip in lazy mode
INFO:__main__:***** Build dataset *****
INFO:__main__:***** Running training *****
INFO:__main__: Num examples = 44425
INFO:__main__: Num Epochs = 3
INFO:__main__: Batch size per device (w. accumulation) = 32
INFO:__main__: Global train batch size (w. parallel & distributed) = 256
INFO:__main__: Total optimization steps = 519
Initial compilation. This might take some minutes...
Epoch ... : 0%| | 0/3 [00:00<?, ?it/s(raylet) Spilled 1049654 MiB, 1571 objects, write throughput 663 MiB/s. | 0/173 [00:00<?, ?it/s]
Epoch ... : 0%| | 0/3 [16:44<?, ?it/s]
Traceback (most recent call last):
File "/fs/llm/zigzagcai/alpa/examples/llama_finetune/run_easylm_flax.py", line 886, in <module>
main()
File "/fs/llm/zigzagcai/alpa/examples/llama_finetune/run_easylm_flax.py", line 752, in main
state, train_metric = p_train_step(state, batch)
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/fs/llm/zigzagcai/alpa/alpa/api.py", line 130, in __call__
out = executable.launch_on_driver(*args_flat)
File "/fs/llm/zigzagcai/alpa/alpa/mesh_executable.py", line 665, in launch_on_driver
input_bufs = physical_mesh.shard_args_to_bufs(
File "/fs/llm/zigzagcai/alpa/alpa/device_mesh.py", line 1325, in shard_args_to_bufs
ref = shard_arg_handlers[type(arg)](arg, self, indices)[0]
File "/fs/llm/zigzagcai/alpa/alpa/device_mesh.py", line 2484, in _shard_device_array
return _shard_array(np.asarray(array), device_mesh, indices, num_batch,
File "/fs/llm/zigzagcai/alpa/alpa/device_mesh.py", line 2477, in _shard_array
return _device_mesh_put(device_mesh, datas, num_batch, batch_dim)
File "/fs/llm/zigzagcai/alpa/alpa/device_mesh.py", line 2434, in _device_mesh_put
device_mesh.workers[host_id].put_buffers.remote(
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/ray/actor.py", line 138, in remote
return self._remote(args, kwargs)
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 425, in _start_span
return method(self, args, kwargs, *_args, **_kwargs)
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/ray/actor.py", line 184, in _remote
return invocation(args, kwargs)
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/ray/actor.py", line 171, in invocation
return actor._actor_method_call(
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/ray/actor.py", line 1170, in _actor_method_call
object_refs = worker.core_worker.submit_actor_task(
File "python/ray/_raylet.pyx", line 1982, in ray._raylet.CoreWorker.submit_actor_task
File "python/ray/_raylet.pyx", line 1987, in ray._raylet.CoreWorker.submit_actor_task
File "python/ray/_raylet.pyx", line 402, in ray._raylet.prepare_args_and_increment_put_refs
File "python/ray/_raylet.pyx", line 393, in ray._raylet.prepare_args_and_increment_put_refs
File "python/ray/_raylet.pyx", line 482, in ray._raylet.prepare_args_internal
File "python/ray/_raylet.pyx", line 1599, in ray._raylet.CoreWorker.put_serialized_object_and_increment_local_ref
File "python/ray/_raylet.pyx", line 1488, in ray._raylet.CoreWorker._create_put_buffer
File "python/ray/_raylet.pyx", line 188, in ray._raylet.check_status
jax._src.traceback_util.UnfilteredStackTrace: ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fs/llm/zigzagcai/alpa/examples/llama_finetune/run_easylm_flax.py", line 886, in <module>
main()
File "/fs/llm/zigzagcai/alpa/examples/llama_finetune/run_easylm_flax.py", line 752, in main
state, train_metric = p_train_step(state, batch)
File "/fs/llm/zigzagcai/alpa/alpa/mesh_executable.py", line 665, in launch_on_driver
input_bufs = physical_mesh.shard_args_to_bufs(
File "/fs/llm/zigzagcai/alpa/alpa/device_mesh.py", line 1325, in shard_args_to_bufs
ref = shard_arg_handlers[type(arg)](arg, self, indices)[0]
File "/fs/llm/zigzagcai/alpa/alpa/device_mesh.py", line 2484, in _shard_device_array
return _shard_array(np.asarray(array), device_mesh, indices, num_batch,
File "/fs/llm/zigzagcai/alpa/alpa/device_mesh.py", line 2477, in _shard_array
return _device_mesh_put(device_mesh, datas, num_batch, batch_dim)
File "/fs/llm/zigzagcai/alpa/alpa/device_mesh.py", line 2434, in _device_mesh_put
device_mesh.workers[host_id].put_buffers.remote(
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/ray/actor.py", line 138, in remote
return self._remote(args, kwargs)
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 425, in _start_span
return method(self, args, kwargs, *_args, **_kwargs)
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/ray/actor.py", line 184, in _remote
return invocation(args, kwargs)
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/ray/actor.py", line 171, in invocation
return actor._actor_method_call(
File "/root/miniconda3/envs/alpa/lib/python3.10/site-packages/ray/actor.py", line 1170, in _actor_method_call
object_refs = worker.core_worker.submit_actor_task(
File "python/ray/_raylet.pyx", line 1982, in ray._raylet.CoreWorker.submit_actor_task
File "python/ray/_raylet.pyx", line 1987, in ray._raylet.CoreWorker.submit_actor_task
File "python/ray/_raylet.pyx", line 402, in ray._raylet.prepare_args_and_increment_put_refs
File "python/ray/_raylet.pyx", line 393, in ray._raylet.prepare_args_and_increment_put_refs
File "python/ray/_raylet.pyx", line 482, in ray._raylet.prepare_args_internal
File "python/ray/_raylet.pyx", line 1599, in ray._raylet.CoreWorker.put_serialized_object_and_increment_local_ref
File "python/ray/_raylet.pyx", line 1488, in ray._raylet.CoreWorker._create_put_buffer
File "python/ray/_raylet.pyx", line 188, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.
(raylet) [2023-11-21 11:36:54,055 E 446708 446738] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-11-21_10-23-31_768034_446624 is over 95% full, available space: 42212503552; capacity: 844367142912. Object creation will fail if spilling is required.
Metadata
Metadata
Assignees
Labels
No labels