
Assertion Error when Setting pipe_parallel_size or model_parallel_size in GPT-NeoX #1251

Open
lieh1203 opened this issue Jul 10, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments


lieh1203 commented Jul 10, 2024

Hello,

I am encountering an issue with the GPT-NeoX library. When I set either pipe_parallel_size or model_parallel_size to 2, training fails with an assertion error. The full launcher output is:

[2024-07-11 06:28:59,215] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
NeoXArgs.from_ymls() ['configs/2-7B.yml', 'configs/local_setup.yml']
INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4
-------------------- arguments --------------------
  attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated
  batch_size ...................... 4...........................updated
  checkpoint_activations .......... True........................updated
  checkpoint_factor ............... 10000.......................updated
  config_files .................... {'2-7B.yml': '# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   "pipe_parallel_size": 2,\n   "model_parallel_size": 1,\n   #"gradient_accumulation_steps": 2,\n\n   # model settings\n   "num_layers": 32,\n   "hidden_size": 2560,\n   "num_attention_heads": 32,\n   "seq_length": 2048,\n   "max_position_embeddings": 2048,\n   "norm": "layernorm",\n   "pos_emb": "rotary",\n   "no_weight_tying": true,\n   "gpt_j_residual": false,\n   "output_layer_parallelism": "column",\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   "scaled_upper_triang_masked_softmax_fusion": false,\n   "bias_gelu_fusion": false,\n   "rope_fusion": false,\n   "layernorm_fusion": false,\n\n   # init methods\n   "init_method": "small_init",\n   "output_layer_init_method": "wang_init",\n\n   # optimizer settings\n   "optimizer": {\n     "type": "Adam",\n     "params": {\n       "lr": 0.00016,\n       "betas": [0.9, 0.95],\n       "eps": 1.0e-8,\n     }\n   },\n   "min_lr": 0.000016,\n\n   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n   "zero_optimization": {\n    "stage": 3,\n    "allgather_partitions": True,\n    "allgather_bucket_size": 500000000,\n    "overlap_comm": True,\n    "reduce_scatter": True,\n    "reduce_bucket_size": 500000000,\n    "contiguous_gradients": True,\n  },\n\n   # batch / data settings\n   "train_micro_batch_size_per_gpu": 4,\n   "data_impl": "mmap",\n\n   # activation checkpointing\n   "checkpoint_activations": true,\n   "checkpoint_num_layers": 1,\n   "partition_activations": true,\n   "synchronize_each_layer": true,\n\n   # regularization\n   "gradient_clipping": 1.0,\n   "weight_decay": 0.1,\n   "hidden_dropout": 0,\n   "attention_dropout": 0,\n\n   # precision settings\n   
"fp16": {\n     "fp16": true,\n     "enabled": true,\n     "loss_scale": 0,\n     "loss_scale_window": 1000,\n     "hysteresis": 2,\n     "min_loss_scale": 1\n   },\n\n   # misc. training settings\n   "train_iters": 320000,\n   "lr_decay_iters": 320000,\n   "distributed_backend": "nccl",\n   "lr_decay_style": "cosine",\n   "warmup": 0.01,\n   "checkpoint_factor": 10000,\n   "eval_interval": 1000,\n   "eval_iters": 10,\n\n   # logging\n   "log_interval": 100,\n   "steps_per_print": 10,\n   "keep_last_n_checkpoints": 4,\n   "wall_clock_breakdown": true,\n}\n', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n  "data_path": "data/processed_data/mydataset_text_document",\n  "tokenizer_type":"HFTokenizer",\n  # or for weighted datasets:\n  # "train-data-paths": ["data/enwik8/enwik8_text_document", "data/enwik8/enwik8_text_document"],\n  # "test-data-paths": ["data/enwik8/enwik8_text_document", "data/enwik8/enwik8_text_document"],\n  # "valid-data-paths": ["data/enwik8/enwik8_text_document", "data/enwik8/enwik8_text_document"],\n  # "train-data-weights": [1., 2.],\n  # "test-data-weights": [2., 1.],\n  # "valid-data-weights": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group.\n  # WARNING: setting this to True will override any user provided weights\n  # "weight_by_num_documents": false,\n  # "weighted_sampler_alpha": 0.3,\n\n  "vocab_file": "ckpts/20B_tokenizer.json",\n  "merge_file": "data/gpt2-merges.txt",\n\n  "save": "checkpoints",\n  "load": "checkpoints",\n  "checkpoint_validation_with_forward_pass": False,\n\n  "tensorboard_dir": "tensorboard",\n  "log_dir": "logs",\n  "use_wandb": False,\n  "wandb_host": "https://api.wandb.ai",\n  "wandb_project": "neox"\n}\n'}updated
  data_impl ....................... mmap........................updated
  data_path ....................... data/processed_data/mydataset_text_documentupdated
  dynamic_loss_scale .............. True........................updated
  eval_iters ...................... 10..........................updated
  fp16 ............................ {'fp16': True, 'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
  global_num_gpus ................. 4...........................updated
  hidden_size ..................... 2560........................updated
  init_method ..................... small_init..................updated
  is_pipe_parallel ................ True........................updated
  keep_last_n_checkpoints ......... 4...........................updated
  load ............................ checkpoints.................updated
  log_dir ......................... logs........................updated
  lr .............................. 0.00016.....................updated
  lr_decay_iters .................. 320000......................updated
  lr_decay_style .................. cosine......................updated
  max_position_embeddings ......... 2048........................updated
  merge_file ...................... data/gpt2-merges.txt........updated
  min_lr .......................... 1.6e-05.....................updated
  no_weight_tying ................. True........................updated
  num_attention_heads ............. 32..........................updated
  num_layers ...................... 32..........................updated
  optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.00016, 'betas': [0.9, 0.95], 'eps': 1e-08}}updated
  optimizer_type .................. Adam........................updated
  output_layer_init_method ........ wang_init...................updated
  partition_activations ........... True........................updated
  pipe_parallel_size .............. 2...........................updated
  pos_emb ......................... rotary......................updated
  precision ....................... fp16........................updated
  save ............................ checkpoints.................updated
  save_iters ...................... [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000]updated
  seq_length ...................... 2048........................updated
  sparsity_config ................. {}..........................updated
  synchronize_each_layer .......... True........................updated
  tensorboard_dir ................. tensorboard.................updated
  text_gen_type ................... unconditional...............updated
  tokenizer_type .................. HFTokenizer.................updated
  train_batch_size ................ 8...........................updated
  train_iters ..................... 320000......................updated
  train_micro_batch_size_per_gpu .. 4...........................updated
  use_wandb ....................... False.......................updated
  user_script ..................... train.py....................updated
  vocab_file ...................... ckpts/20B_tokenizer.json....updated
  wall_clock_breakdown ............ True........................updated
  zero_allgather_bucket_size ...... 500000000...................updated
  zero_contiguous_gradients ....... True........................updated
  zero_optimization ............... {'stage': 3, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True}updated
  zero_reduce_bucket_size ......... 500000000...................updated
  zero_reduce_scatter ............. True........................updated
  zero_stage ...................... 3...........................updated
  account ......................... None........................default
  activation ...................... gelu........................default
  activation_checkpointing ........ None........................default
  adlr_autoresume ................. False.......................default
  adlr_autoresume_interval ........ 1000........................default
  amp ............................. None........................default
  apply_query_key_layer_scaling ... False.......................default
  attention_dropout ............... 0...........................default
  attention_softmax_in_fp32 ....... False.......................default
  autotuning ...................... None........................default
  autotuning_run .................. None........................default
  base_shapes_file ................ None........................default
  bf16 ............................ None........................default
  bias_dropout_fusion ............. False.......................default
  bias_gelu_fusion ................ False.......................default
  char_level_ppl .................. False.......................default
  checkpoint ...................... None........................default
  checkpoint_in_cpu ............... False.......................default
  checkpoint_num_layers ........... 1...........................default
  checkpoint_scale ................ linear......................default
  checkpoint_validation_with_forward_pass  False................default
  clip_grad ....................... 1.0.........................default
  comment ......................... None........................default
  comms_logger .................... None........................default
  communication_data_type ......... None........................default
  compression_training ............ None........................default
  contiguous_checkpointing ........ False.......................default
  coord_check ..................... False.......................default
  create_moe_param_group .......... True........................default
  csv_monitor ..................... None........................default
  curriculum_learning ............. None........................default
  curriculum_seqlen ............... 0...........................default
  data_efficiency ................. None........................default
  data_types ...................... None........................default
  deepscale ....................... False.......................default
  deepscale_config ................ None........................default
  deepspeed ....................... True........................default
  deepspeed_activation_checkpointing  True......................default
  deepspeed_extra_args ............ None........................default
  deepspeed_mpi ................... False.......................default
  deepspeed_slurm ................. False.......................default
  detect_nvlink_pairs ............. False.......................default
  distributed_backend ............. nccl........................default
  do_test ......................... None........................default
  do_train ........................ None........................default
  do_valid ........................ None........................default
  dump_state ...................... False.......................default
  elasticity ...................... None........................default
  enable_expert_tensor_parallelism  False.......................default
  eod_mask_loss ................... False.......................default
  eval_interval ................... 1000........................default
  eval_results_prefix ............. ............................default
  eval_tasks ...................... None........................default
  exclude ......................... None........................default
  exit_interval ................... None........................default
  expert_interval ................. 2...........................default
  extra_save_iters ................ None........................default
  finetune ........................ False.......................default
  flops_profiler .................. None........................default
  force_multi ..................... False.......................default
  fp16_lm_cross_entropy ........... False.......................default
  fp32_allreduce .................. False.......................default
  git_hash ........................ None........................default
  gmlp_attn_dim ................... 64..........................default
  gpt_j_residual .................. False.......................default
  gpt_j_tied ...................... False.......................default
  gradient_accumulation_steps ..... 1...........................default
  gradient_clipping ............... 1.0.........................default
  gradient_noise_scale_cpu_offload  False.......................default
  gradient_noise_scale_n_batches .. 5...........................default
  gradient_predivide_factor ....... 1.0.........................default
  hidden_dropout .................. 0...........................default
  hostfile ........................ None........................default
  hysteresis ...................... 2...........................default
  include ......................... None........................default
  init_method_std ................. 0.02........................default
  intermediate_size ............... None........................default
  iteration ....................... None........................default
  label_data_paths ................ None........................default
  launcher ........................ pdsh........................default
  layernorm_epsilon ............... 1e-05.......................default
  layernorm_fusion ................ False.......................default
  lazy_mpu_init ................... False.......................default
  local_rank ...................... None........................default
  log_grad_norm ................... False.......................default
  log_grad_pct_zeros .............. False.......................default
  log_gradient_noise_scale ........ False.......................default
  log_interval .................... 100.........................default
  log_optimizer_states ............ False.......................default
  log_param_norm .................. False.......................default
  loss_scale ...................... None........................default
  loss_scale_window ............... 1000.0......................default
  make_vocab_size_divisible_by .... 128.........................default
  mamba_causal_conv_fusion ........ False.......................default
  mamba_inner_func_fusion ......... False.......................default
  mamba_selective_fp32_params ..... True........................default
  mamba_selective_scan_fusion ..... False.......................default
  mamba_use_bias_in_conv .......... True........................default
  mamba_use_bias_in_linears ....... False.......................default
  master_addr ..................... None........................default
  master_port ..................... 29500.......................default
  maximum_tokens .................. 64..........................default
  memory_profiling ................ False.......................default
  memory_profiling_path ........... None........................default
  min_scale ....................... 1.0.........................default
  mlp_type ........................ regular.....................default
  mmap_warmup ..................... False.......................default
  model_parallel_size ............. 1...........................default
  moe_eval_capacity_factor ........ 1.0.........................default
  moe_expert_parallel_size ........ 1...........................default
  moe_glu ......................... False.......................default
  moe_jitter_eps .................. None........................default
  moe_lbl_in_fp32 ................. False.......................default
  moe_loss_coeff .................. 0.1.........................default
  moe_min_capacity ................ 4...........................default
  moe_num_experts ................. 1...........................default
  moe_token_dropping .............. False.......................default
  moe_top_k ....................... 1...........................default
  moe_train_capacity_factor ....... 1.0.........................default
  moe_type ........................ megablocks..................default
  moe_use_residual ................ True........................default
  mup_attn_temp ................... 1.0.........................default
  mup_embedding_mult .............. 1.0.........................default
  mup_init_scale .................. 1.0.........................default
  mup_output_temp ................. 1.0.........................default
  mup_rp_embedding_mult ........... 1.0.........................default
  mup_width_scale ................. 2...........................default
  no_load_optim ................... False.......................default
  no_load_rng ..................... False.......................default
  no_save_optim ................... False.......................default
  no_save_rng ..................... False.......................default
  no_ssh_check .................... False.......................default
  norm ............................ layernorm...................default
  num_gpus ........................ None........................default
  num_kv_heads .................... None........................default
  num_nodes ....................... -1..........................default
  num_samples ..................... 1...........................default
  num_unique_layers ............... None........................default
  num_workers ..................... 2...........................default
  onnx_safe ....................... False.......................default
  opt_pos_emb_offset .............. 0...........................default
  output_layer_parallelism ........ column......................default
  override_lr_scheduler ........... False.......................default
  padded_vocab_size ............... None........................default
  param_sharing_style ............. grouped.....................default
  pipe_partition_method ........... type:transformer|mlp........default
  prescale_gradients .............. False.......................default
  profile ......................... False.......................default
  profile_backward ................ False.......................default
  profile_step_start .............. 10..........................default
  profile_step_stop ............... 12..........................default
  prompt_end ...................... \n..........................default
  rank ............................ None........................default
  recompute ....................... False.......................default
  return_logits ................... False.......................default
  rms_norm_epsilon ................ 1e-08.......................default
  rope_fusion ..................... False.......................default
  rotary_emb_base ................. 10000.......................default
  rotary_pct ...................... 1.0.........................default
  rotary_save_freqs_buffer ........ False.......................default
  rpe_max_distance ................ 128.........................default
  rpe_num_buckets ................. 32..........................default
  s3_chunk_size ................... 104857600...................default
  s3_path ......................... None........................default
  sample_input_file ............... None........................default
  sample_output_file .............. samples.txt.................default
  save_base_shapes ................ False.......................default
  scaled_masked_softmax_fusion .... False.......................default
  scaled_upper_triang_masked_softmax_fusion  False..............default
  scalenorm_epsilon ............... 1e-08.......................default
  scheduler ....................... None........................default
  seed ............................ 1234........................default
  short_seq_prob .................. 0.1.........................default
  sliding_window_width ............ None........................default
  soft_prompt_tuning .............. None........................default
  sparse_attention ................ None........................default
  sparse_gradients ................ False.......................default
  split ........................... 969, 30, 1..................default
  steps_per_print ................. 10..........................default
  temperature ..................... 0.0.........................default
  tensorboard ..................... None........................default
  test_data_paths ................. None........................default
  test_data_weights ............... None........................default
  top_k ........................... 0...........................default
  top_p ........................... 0.0.........................default
  train_data_paths ................ None........................default
  train_data_weights .............. None........................default
  use_bias_in_attn_linear ......... True........................default
  use_bias_in_norms ............... True........................default
  use_bnb_optimizer ............... False.......................default
  use_checkpoint_lr_scheduler ..... False.......................default
  use_cpu_initialization .......... False.......................default
  use_mup ......................... False.......................default
  use_qk_layernorm ................ False.......................default
  use_shared_fs ................... True........................default
  use_tutel ....................... False.......................default
  valid_data_paths ................ None........................default
  valid_data_weights .............. None........................default
  wandb ........................... None........................default
  wandb_group ..................... None........................default
  wandb_host ...................... https://api.wandb.ai........default
  wandb_init_all_ranks ............ False.......................default
  wandb_project ................... neox........................default
  wandb_team ...................... None........................default
  warmup .......................... 0.01........................default
  weight_by_num_documents ......... False.......................default
  weight_decay .................... 0.1.........................default
  weighted_sampler_alpha .......... 1.0.........................default
  world_size ...................... None........................default
---------------- end of arguments ----------------
NeoXArgs.configure_distributed_args() using world size: 1 and model-parallel size: 1 
[2024-07-11 06:29:02,427] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-07-11 06:29:02,427] [INFO] [runner.py:568:main] cmd = /usr/local/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed_config eyJ0cmFpbl9iYXRjaF9zaXplIjogOCwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDQsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDAxNiwgImJldGFzIjogWzAuOSwgMC45NV0sICJlcHMiOiAxZS0wOH19LCAiZnAxNiI6IHsiZnAxNiI6IHRydWUsICJlbmFibGVkIjogdHJ1ZSwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMywgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ== --megatron_config {"train_batch_size": 8, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.00016, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "zero_optimization": {"stage": 3, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 32, "hidden_size": 2560, "num_attention_heads": 32, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", 
"global", "global", "global", "global"], "sparsity_config": {}, "init_method": "small_init", "output_layer_init_method": "wang_init", "lr_decay_style": "cosine", "lr_decay_iters": 320000, "min_lr": 1.6e-05, "optimizer_type": "Adam", "zero_stage": 3, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.00016, "tokenizer_type": "HFTokenizer", "data_path": "data/processed_data/mydataset_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"2-7B.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe_parallel_size\": 2,\n   \"model_parallel_size\": 1,\n   #\"gradient_accumulation_steps\": 2,\n\n   # model settings\n   \"num_layers\": 32,\n   \"hidden_size\": 2560,\n   \"num_attention_heads\": 32,\n   \"seq_length\": 2048,\n   \"max_position_embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos_emb\": \"rotary\",\n   \"no_weight_tying\": true,\n   \"gpt_j_residual\": false,\n   \"output_layer_parallelism\": \"column\",\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled_upper_triang_masked_softmax_fusion\": false,\n   \"bias_gelu_fusion\": false,\n   \"rope_fusion\": false,\n   \"layernorm_fusion\": false,\n\n   # init methods\n   \"init_method\": \"small_init\",\n   \"output_layer_init_method\": \"wang_init\",\n\n   # optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.00016,\n       \"betas\": [0.9, 0.95],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"min_lr\": 0.000016,\n\n   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n   \"zero_optimization\": {\n    \"stage\": 3,\n    \"allgather_partitions\": True,\n    
\"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    \"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data_impl\": \"mmap\",\n\n   # activation checkpointing\n   \"checkpoint_activations\": true,\n   \"checkpoint_num_layers\": 1,\n   \"partition_activations\": true,\n   \"synchronize_each_layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight_decay\": 0.1,\n   \"hidden_dropout\": 0,\n   \"attention_dropout\": 0,\n\n   # precision settings\n   \"fp16\": {\n     \"fp16\": true,\n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train_iters\": 320000,\n   \"lr_decay_iters\": 320000,\n   \"distributed_backend\": \"nccl\",\n   \"lr_decay_style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"checkpoint_factor\": 10000,\n   \"eval_interval\": 1000,\n   \"eval_iters\": 10,\n\n   # logging\n   \"log_interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep_last_n_checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data_path\": \"data/processed_data/mydataset_text_document\",\n  \"tokenizer_type\":\"HFTokenizer\",\n  # or for weighted datasets:\n  # \"train-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"test-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"valid-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of 
data according to the number of documents in each group.\n  # WARNING: setting this to True will override any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab_file\": \"ckpts/20B_tokenizer.json\",\n  \"merge_file\": \"data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n\n  \"tensorboard_dir\": \"tensorboard\",\n  \"log_dir\": \"logs\",\n  \"use_wandb\": False,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "checkpoint_factor": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "vocab_file": "ckpts/20B_tokenizer.json", "merge_file": "data/gpt2-merges.txt", "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "dynamic_loss_scale": true, "pipe_parallel_size": 2, "world_size": 1, "is_pipe_parallel": true, "use_wandb": false, "log_dir": "logs", "tensorboard_dir": "tensorboard", "text_gen_type": "unconditional", "local_rank": 0, "rank": 0, "user_script": "train.py", "save_iters": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000], "global_num_gpus": 4}
[2024-07-11 06:29:04,147] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.16.2-1
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.16.2-1
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.16.2-1+cuda11.8
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.16.2-1+cuda11.8
[2024-07-11 06:29:06,944] [INFO] [launch.py:139:main] 0 NCCL_VERSION=2.16.2-1
[2024-07-11 06:29:06,944] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-07-11 06:29:06,944] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-07-11 06:29:06,944] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-07-11 06:29:06,944] [INFO] [launch.py:164:main] dist_world_size=4
[2024-07-11 06:29:06,944] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-07-11 06:29:06,954] [INFO] [launch.py:256:main] process 2159 spawned with command: ['/usr/local/miniconda3/bin/python', '-u', 'train.py', '--local_rank=0', '--deepspeed_config', 'eyJ0cmFpbl9iYXRjaF9zaXplIjogOCwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDQsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDAxNiwgImJldGFzIjogWzAuOSwgMC45NV0sICJlcHMiOiAxZS0wOH19LCAiZnAxNiI6IHsiZnAxNiI6IHRydWUsICJlbmFibGVkIjogdHJ1ZSwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMywgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ==', '--megatron_config', '{"train_batch_size": 8, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.00016, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "zero_optimization": {"stage": 3, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 32, "hidden_size": 2560, "num_attention_heads": 32, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "init_method": "small_init", 
"output_layer_init_method": "wang_init", "lr_decay_style": "cosine", "lr_decay_iters": 320000, "min_lr": 1.6e-05, "optimizer_type": "Adam", "zero_stage": 3, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.00016, "tokenizer_type": "HFTokenizer", "data_path": "data/processed_data/mydataset_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"2-7B.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe_parallel_size\": 2,\n   \"model_parallel_size\": 1,\n   #\"gradient_accumulation_steps\": 2,\n\n   # model settings\n   \"num_layers\": 32,\n   \"hidden_size\": 2560,\n   \"num_attention_heads\": 32,\n   \"seq_length\": 2048,\n   \"max_position_embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos_emb\": \"rotary\",\n   \"no_weight_tying\": true,\n   \"gpt_j_residual\": false,\n   \"output_layer_parallelism\": \"column\",\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled_upper_triang_masked_softmax_fusion\": false,\n   \"bias_gelu_fusion\": false,\n   \"rope_fusion\": false,\n   \"layernorm_fusion\": false,\n\n   # init methods\n   \"init_method\": \"small_init\",\n   \"output_layer_init_method\": \"wang_init\",\n\n   # optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.00016,\n       \"betas\": [0.9, 0.95],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"min_lr\": 0.000016,\n\n   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n   \"zero_optimization\": {\n    \"stage\": 3,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    
\"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data_impl\": \"mmap\",\n\n   # activation checkpointing\n   \"checkpoint_activations\": true,\n   \"checkpoint_num_layers\": 1,\n   \"partition_activations\": true,\n   \"synchronize_each_layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight_decay\": 0.1,\n   \"hidden_dropout\": 0,\n   \"attention_dropout\": 0,\n\n   # precision settings\n   \"fp16\": {\n     \"fp16\": true,\n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train_iters\": 320000,\n   \"lr_decay_iters\": 320000,\n   \"distributed_backend\": \"nccl\",\n   \"lr_decay_style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"checkpoint_factor\": 10000,\n   \"eval_interval\": 1000,\n   \"eval_iters\": 10,\n\n   # logging\n   \"log_interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep_last_n_checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data_path\": \"data/processed_data/mydataset_text_document\",\n  \"tokenizer_type\":\"HFTokenizer\",\n  # or for weighted datasets:\n  # \"train-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"test-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"valid-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group.\n  # WARNING: setting this to True will override 
any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab_file\": \"ckpts/20B_tokenizer.json\",\n  \"merge_file\": \"data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n\n  \"tensorboard_dir\": \"tensorboard\",\n  \"log_dir\": \"logs\",\n  \"use_wandb\": False,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "checkpoint_factor": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "vocab_file": "ckpts/20B_tokenizer.json", "merge_file": "data/gpt2-merges.txt", "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "dynamic_loss_scale": true, "pipe_parallel_size": 2, "world_size": 1, "is_pipe_parallel": true, "use_wandb": false, "log_dir": "logs", "tensorboard_dir": "tensorboard", "text_gen_type": "unconditional", "local_rank": 0, "rank": 0, "user_script": "train.py", "save_iters": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000], "global_num_gpus": 4}']
[2024-07-11 06:29:06,961] [INFO] [launch.py:256:main] process 2160 spawned with command: ['/usr/local/miniconda3/bin/python', '-u', 'train.py', '--local_rank=1', '--deepspeed_config', 'eyJ0cmFpbl9iYXRjaF9zaXplIjogOCwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDQsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDAxNiwgImJldGFzIjogWzAuOSwgMC45NV0sICJlcHMiOiAxZS0wOH19LCAiZnAxNiI6IHsiZnAxNiI6IHRydWUsICJlbmFibGVkIjogdHJ1ZSwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMywgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ==', '--megatron_config', '{"train_batch_size": 8, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.00016, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "zero_optimization": {"stage": 3, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 32, "hidden_size": 2560, "num_attention_heads": 32, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "init_method": "small_init", 
"output_layer_init_method": "wang_init", "lr_decay_style": "cosine", "lr_decay_iters": 320000, "min_lr": 1.6e-05, "optimizer_type": "Adam", "zero_stage": 3, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.00016, "tokenizer_type": "HFTokenizer", "data_path": "data/processed_data/mydataset_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"2-7B.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe_parallel_size\": 2,\n   \"model_parallel_size\": 1,\n   #\"gradient_accumulation_steps\": 2,\n\n   # model settings\n   \"num_layers\": 32,\n   \"hidden_size\": 2560,\n   \"num_attention_heads\": 32,\n   \"seq_length\": 2048,\n   \"max_position_embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos_emb\": \"rotary\",\n   \"no_weight_tying\": true,\n   \"gpt_j_residual\": false,\n   \"output_layer_parallelism\": \"column\",\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled_upper_triang_masked_softmax_fusion\": false,\n   \"bias_gelu_fusion\": false,\n   \"rope_fusion\": false,\n   \"layernorm_fusion\": false,\n\n   # init methods\n   \"init_method\": \"small_init\",\n   \"output_layer_init_method\": \"wang_init\",\n\n   # optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.00016,\n       \"betas\": [0.9, 0.95],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"min_lr\": 0.000016,\n\n   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n   \"zero_optimization\": {\n    \"stage\": 3,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    
\"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data_impl\": \"mmap\",\n\n   # activation checkpointing\n   \"checkpoint_activations\": true,\n   \"checkpoint_num_layers\": 1,\n   \"partition_activations\": true,\n   \"synchronize_each_layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight_decay\": 0.1,\n   \"hidden_dropout\": 0,\n   \"attention_dropout\": 0,\n\n   # precision settings\n   \"fp16\": {\n     \"fp16\": true,\n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train_iters\": 320000,\n   \"lr_decay_iters\": 320000,\n   \"distributed_backend\": \"nccl\",\n   \"lr_decay_style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"checkpoint_factor\": 10000,\n   \"eval_interval\": 1000,\n   \"eval_iters\": 10,\n\n   # logging\n   \"log_interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep_last_n_checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data_path\": \"data/processed_data/mydataset_text_document\",\n  \"tokenizer_type\":\"HFTokenizer\",\n  # or for weighted datasets:\n  # \"train-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"test-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"valid-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group.\n  # WARNING: setting this to True will override 
any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab_file\": \"ckpts/20B_tokenizer.json\",\n  \"merge_file\": \"data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n\n  \"tensorboard_dir\": \"tensorboard\",\n  \"log_dir\": \"logs\",\n  \"use_wandb\": False,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "checkpoint_factor": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "vocab_file": "ckpts/20B_tokenizer.json", "merge_file": "data/gpt2-merges.txt", "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "dynamic_loss_scale": true, "pipe_parallel_size": 2, "world_size": 1, "is_pipe_parallel": true, "use_wandb": false, "log_dir": "logs", "tensorboard_dir": "tensorboard", "text_gen_type": "unconditional", "local_rank": 0, "rank": 0, "user_script": "train.py", "save_iters": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000], "global_num_gpus": 4}']
[2024-07-11 06:29:06,967] [INFO] [launch.py:256:main] process 2161 spawned with command: ['/usr/local/miniconda3/bin/python', '-u', 'train.py', '--local_rank=2', '--deepspeed_config', 'eyJ0cmFpbl9iYXRjaF9zaXplIjogOCwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDQsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDAxNiwgImJldGFzIjogWzAuOSwgMC45NV0sICJlcHMiOiAxZS0wOH19LCAiZnAxNiI6IHsiZnAxNiI6IHRydWUsICJlbmFibGVkIjogdHJ1ZSwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMywgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ==', '--megatron_config', '{"train_batch_size": 8, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.00016, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "zero_optimization": {"stage": 3, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 32, "hidden_size": 2560, "num_attention_heads": 32, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "init_method": "small_init", 
"output_layer_init_method": "wang_init", "lr_decay_style": "cosine", "lr_decay_iters": 320000, "min_lr": 1.6e-05, "optimizer_type": "Adam", "zero_stage": 3, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.00016, "tokenizer_type": "HFTokenizer", "data_path": "data/processed_data/mydataset_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"2-7B.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe_parallel_size\": 2,\n   \"model_parallel_size\": 1,\n   #\"gradient_accumulation_steps\": 2,\n\n   # model settings\n   \"num_layers\": 32,\n   \"hidden_size\": 2560,\n   \"num_attention_heads\": 32,\n   \"seq_length\": 2048,\n   \"max_position_embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos_emb\": \"rotary\",\n   \"no_weight_tying\": true,\n   \"gpt_j_residual\": false,\n   \"output_layer_parallelism\": \"column\",\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled_upper_triang_masked_softmax_fusion\": false,\n   \"bias_gelu_fusion\": false,\n   \"rope_fusion\": false,\n   \"layernorm_fusion\": false,\n\n   # init methods\n   \"init_method\": \"small_init\",\n   \"output_layer_init_method\": \"wang_init\",\n\n   # optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.00016,\n       \"betas\": [0.9, 0.95],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"min_lr\": 0.000016,\n\n   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n   \"zero_optimization\": {\n    \"stage\": 3,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    
\"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data_impl\": \"mmap\",\n\n   # activation checkpointing\n   \"checkpoint_activations\": true,\n   \"checkpoint_num_layers\": 1,\n   \"partition_activations\": true,\n   \"synchronize_each_layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight_decay\": 0.1,\n   \"hidden_dropout\": 0,\n   \"attention_dropout\": 0,\n\n   # precision settings\n   \"fp16\": {\n     \"fp16\": true,\n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train_iters\": 320000,\n   \"lr_decay_iters\": 320000,\n   \"distributed_backend\": \"nccl\",\n   \"lr_decay_style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"checkpoint_factor\": 10000,\n   \"eval_interval\": 1000,\n   \"eval_iters\": 10,\n\n   # logging\n   \"log_interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep_last_n_checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data_path\": \"data/processed_data/mydataset_text_document\",\n  \"tokenizer_type\":\"HFTokenizer\",\n  # or for weighted datasets:\n  # \"train-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"test-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"valid-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group.\n  # WARNING: setting this to True will override 
any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab_file\": \"ckpts/20B_tokenizer.json\",\n  \"merge_file\": \"data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n\n  \"tensorboard_dir\": \"tensorboard\",\n  \"log_dir\": \"logs\",\n  \"use_wandb\": False,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "checkpoint_factor": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "vocab_file": "ckpts/20B_tokenizer.json", "merge_file": "data/gpt2-merges.txt", "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "dynamic_loss_scale": true, "pipe_parallel_size": 2, "world_size": 1, "is_pipe_parallel": true, "use_wandb": false, "log_dir": "logs", "tensorboard_dir": "tensorboard", "text_gen_type": "unconditional", "local_rank": 0, "rank": 0, "user_script": "train.py", "save_iters": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000], "global_num_gpus": 4}']
[2024-07-11 06:29:06,973] [INFO] [launch.py:256:main] process 2162 spawned with command: ['/usr/local/miniconda3/bin/python', '-u', 'train.py', '--local_rank=3', '--deepspeed_config', 'eyJ0cmFpbl9iYXRjaF9zaXplIjogOCwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDQsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDAxNiwgImJldGFzIjogWzAuOSwgMC45NV0sICJlcHMiOiAxZS0wOH19LCAiZnAxNiI6IHsiZnAxNiI6IHRydWUsICJlbmFibGVkIjogdHJ1ZSwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMywgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ==', '--megatron_config', '{"train_batch_size": 8, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.00016, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "zero_optimization": {"stage": 3, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 32, "hidden_size": 2560, "num_attention_heads": 32, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "init_method": "small_init", 
"output_layer_init_method": "wang_init", "lr_decay_style": "cosine", "lr_decay_iters": 320000, "min_lr": 1.6e-05, "optimizer_type": "Adam", "zero_stage": 3, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.00016, "tokenizer_type": "HFTokenizer", "data_path": "data/processed_data/mydataset_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"2-7B.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe_parallel_size\": 2,\n   \"model_parallel_size\": 1,\n   #\"gradient_accumulation_steps\": 2,\n\n   # model settings\n   \"num_layers\": 32,\n   \"hidden_size\": 2560,\n   \"num_attention_heads\": 32,\n   \"seq_length\": 2048,\n   \"max_position_embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos_emb\": \"rotary\",\n   \"no_weight_tying\": true,\n   \"gpt_j_residual\": false,\n   \"output_layer_parallelism\": \"column\",\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled_upper_triang_masked_softmax_fusion\": false,\n   \"bias_gelu_fusion\": false,\n   \"rope_fusion\": false,\n   \"layernorm_fusion\": false,\n\n   # init methods\n   \"init_method\": \"small_init\",\n   \"output_layer_init_method\": \"wang_init\",\n\n   # optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.00016,\n       \"betas\": [0.9, 0.95],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"min_lr\": 0.000016,\n\n   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n   \"zero_optimization\": {\n    \"stage\": 3,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    
\"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data_impl\": \"mmap\",\n\n   # activation checkpointing\n   \"checkpoint_activations\": true,\n   \"checkpoint_num_layers\": 1,\n   \"partition_activations\": true,\n   \"synchronize_each_layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight_decay\": 0.1,\n   \"hidden_dropout\": 0,\n   \"attention_dropout\": 0,\n\n   # precision settings\n   \"fp16\": {\n     \"fp16\": true,\n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train_iters\": 320000,\n   \"lr_decay_iters\": 320000,\n   \"distributed_backend\": \"nccl\",\n   \"lr_decay_style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"checkpoint_factor\": 10000,\n   \"eval_interval\": 1000,\n   \"eval_iters\": 10,\n\n   # logging\n   \"log_interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep_last_n_checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data_path\": \"data/processed_data/mydataset_text_document\",\n  \"tokenizer_type\":\"HFTokenizer\",\n  # or for weighted datasets:\n  # \"train-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"test-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"valid-data-paths\": [\"data/enwik8/enwik8_text_document\", \"data/enwik8/enwik8_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group.\n  # WARNING: setting this to True will override 
any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab_file\": \"ckpts/20B_tokenizer.json\",\n  \"merge_file\": \"data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n\n  \"tensorboard_dir\": \"tensorboard\",\n  \"log_dir\": \"logs\",\n  \"use_wandb\": False,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "checkpoint_factor": 10000, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "vocab_file": "ckpts/20B_tokenizer.json", "merge_file": "data/gpt2-merges.txt", "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "dynamic_loss_scale": true, "pipe_parallel_size": 2, "world_size": 1, "is_pipe_parallel": true, "use_wandb": false, "log_dir": "logs", "tensorboard_dir": "tensorboard", "text_gen_type": "unconditional", "local_rank": 0, "rank": 0, "user_script": "train.py", "save_iters": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000], "global_num_gpus": 4}']
[2024-07-11 06:29:09,055] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-11 06:29:09,059] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-11 06:29:09,076] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-11 06:29:09,084] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Unable to import Mamba kernels. Install them from our requirements/requirements-mamba.txt, or directly from https://github.com/state-spaces/mamba
For s3 checkpointing, please install boto3 either using requirements/requirements-s3.txt or https://github.com/boto/boto3
For s3 checkpointing, please install hf_transfer either using requirements/requirements-s3.txt or https://github.com/huggingface/hf_transfer
[2024-07-11 06:29:12,526] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-11 06:29:12,613] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-11 06:29:12,664] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-11 06:29:12,664] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1 
> building HFTokenizer tokenizer ...
 > padded vocab (size: 50277) with 27 dummy tokens (new size: 50304)
> setting tensorboard ...
torch distributed is already initialized, skipping initialization ...
> initializing model parallel with size 1
MPU DP: [0, 1]
MPU DP: [2, 3]
MPU PP: [0, 2]
MPU PP: [1, 3]
MPU IO: [0, 1, 2, 3]
MPU MP: [0]
MPU MP: [1]
MPU MP: [2]
MPU MP: [3]
> setting random seeds to 1234 ...
[2024-07-11 06:29:13,784] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/hy-tmp/gpt-neox-main/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/hy-tmp/gpt-neox-main/megatron/data'
Traceback (most recent call last):
  File "train.py", line 35, in <module>
    main()
  File "train.py", line 31, in main
    pretrain(neox_args=neox_args)
  File "/hy-tmp/gpt-neox-main/megatron/training.py", line 195, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/hy-tmp/gpt-neox-main/megatron/training.py", line 743, in setup_model_and_optimizer
    model = get_model(neox_args=neox_args, use_cache=use_cache)
  File "/hy-tmp/gpt-neox-main/megatron/training.py", line 486, in get_model
    with deepspeed.zero.Init(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 933, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 796, in __init__
    self._configure_train_batch_size()
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 979, in _configure_train_batch_size
    self._batch_assertion()
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 925, in _batch_assertion
    assert (grad_acc > 0), f"Gradient accumulation steps: {grad_acc} has to be greater than 0"
AssertionError: Gradient accumulation steps: 0 has to be greater than 0
building GPT2 model ...

I am trying to enable parallelism but this error is preventing me from proceeding.

Here are some details about my setup:

  • Number of GPUs: 4 GPUs, each with 24GB memory, on a single server
  • Python Version: 3.8.10
  • DeepSpeed Version: 0.14.4
  • CUDA Version: 11.8 (cu118 build)
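For context, here is a rough sketch of the arithmetic behind the assertion, under the assumption that DeepSpeed derives gradient accumulation as train batch size divided by (micro batch size × data-parallel degree); `derive_grad_acc` is an illustrative name, not a DeepSpeed API:

```python
# Editor's sketch (not the actual DeepSpeed source) of how
# _configure_train_batch_size() can arrive at grad_acc == 0.

def derive_grad_acc(train_batch_size: int, micro_batch_per_gpu: int, dp_degree: int) -> int:
    # train_batch = micro_batch * grad_acc * data_parallel_degree,
    # so grad_acc falls out by integer division.
    return train_batch_size // (micro_batch_per_gpu * dp_degree)

# 4 GPUs, no pipe/model parallelism: data-parallel degree is 4.
assert derive_grad_acc(16, 4, 4) == 1  # grad_acc > 0, assertion passes

# With "pipe_parallel_size": 2 on 4 GPUs the data-parallel degree is 2.
# If train_batch_size is fixed elsewhere at micro_batch * 1 = 4,
# integer division floors to 0 and trips the assertion in the log above.
assert derive_grad_acc(4, 4, 2) == 0
```

If this is what is happening, setting "gradient_accumulation_steps" or "train_batch_size" explicitly would keep the derived value positive.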

Below is the content of my 2-7B.yml configuration file:

# GPT-2 pretraining setup
{
   # parallelism settings (you will want to change these based on your cluster setup, ideally
   # scheduling pipeline stages across node boundaries)
   "pipe_parallel_size": 2,
   "model_parallel_size": 1,

   # model settings
   "num_layers": 32,
   "hidden_size": 2560,
   "num_attention_heads": 32,
   "seq_length": 2048,
   "max_position_embeddings": 2048,
   "norm": "layernorm",
   "pos_emb": "rotary",
   "no_weight_tying": true,
   "gpt_j_residual": false,
   "output_layer_parallelism": "column",

   # these should provide some speedup but takes a while to build, set to true if desired
   "scaled_upper_triang_masked_softmax_fusion": false,
   "bias_gelu_fusion": false,
   "rope_fusion": false,
   "layernorm_fusion": false,

   # init methods
   "init_method": "small_init",
   "output_layer_init_method": "wang_init",

   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.00016,
       "betas": [0.9, 0.95],
       "eps": 1.0e-8
     }
   },
   "min_lr": 0.000016,

   # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
   "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true,
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 4,
   "data_impl": "mmap",

   # activation checkpointing
   "checkpoint_activations": true,
   "checkpoint_num_layers": 1,
   "partition_activations": true,
   "synchronize_each_layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight_decay": 0.1,
   "hidden_dropout": 0,
   "attention_dropout": 0,

   # precision settings
   "fp16": {
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train_iters": 320000,
   "lr_decay_iters": 320000,
   "distributed_backend": "nccl",
   "lr_decay_style": "cosine",
   "warmup": 0.01,
   "checkpoint_factor": 10000,
   "eval_interval": 1000,
   "eval_iters": 10,

   # logging
   "log_interval": 100,
   "steps_per_print": 10,
   "keep_last_n_checkpoints": 4,
   "wall_clock_breakdown": true,
}

I am using the following command to start the training:

python ./deepy.py train.py -d configs 2-7B.yml local_setup.yml

I would appreciate any guidance or suggestions on how to resolve this issue.

Thank you!

@lieh1203 lieh1203 added the bug Something isn't working label Jul 10, 2024
@lieh1203 lieh1203 reopened this Jul 11, 2024
@StellaAthena
Member

StellaAthena commented Jul 12, 2024

What happens if you add "gradient_accumulation_steps": 16 to your config file?
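For anyone following along, the suggested setting would sit alongside the other batch settings in `2-7B.yml` (placement here is illustrative):

```yaml
   # batch / data settings
   "train_micro_batch_size_per_gpu": 4,
   "gradient_accumulation_steps": 16,
   "data_impl": "mmap",
```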

@lieh1203
Author

lieh1203 commented Jul 13, 2024

Hi @StellaAthena!

Could this error be caused by enabling pipeline parallelism and ZeRO Stage 3 at the same time?

@StellaAthena
Member

Hi @StellaAthena!

Could this error be caused by enabling pipeline parallelism and ZeRO Stage 3 at the same time?

ZeRO-3 and PP > 1 should error, but I'm surprised it would error like this. Does it go away if you use ZeRO-1?
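For reference, a minimal sketch of the change being suggested — switching the `zero_optimization` block in `2-7B.yml` to stage 1, which shards only optimizer states and is compatible with pipeline parallelism:

```yaml
   "zero_optimization": {
     "stage": 1,
     "allgather_partitions": true,
     "allgather_bucket_size": 500000000,
     "overlap_comm": true,
     "reduce_scatter": true,
     "reduce_bucket_size": 500000000,
     "contiguous_gradients": true,
   },
```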
