PaddlePaddle/PaddleOCR#13912
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eno1       # on the master (192.168.8.134)
export NCCL_SOCKET_IFNAME=enp96s0f0  # on the worker (192.168.8.131)
python -m paddle.distributed.launch --ips="192.168.8.134,192.168.8.131" --gpus 3 tools/train.py -c ./DIY/configs/rec/rec_svtrnet-hw_english_word.yml
(py310_ppocr) ai@hf-13f-gpu-134:/workspace3/code/paddle-ocr-contribute$ python -m paddle.distributed.launch --ips="192.168.8.134,192.168.8.131" --gpus 3 tools/train.py -c ./DIY/configs/rec/rec_svtrnet-hw_english_word.yml /home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md warnings.warn(warning_message) LAUNCH INFO 2024-09-26 09:28:01,525 ----------- Configuration ---------------------- LAUNCH INFO 2024-09-26 09:28:01,525 auto_cluster_config: 0 LAUNCH INFO 2024-09-26 09:28:01,525 auto_parallel_config: None LAUNCH INFO 2024-09-26 09:28:01,525 auto_tuner_json: None LAUNCH INFO 2024-09-26 09:28:01,525 devices: 3 LAUNCH INFO 2024-09-26 09:28:01,525 elastic_level: -1 LAUNCH INFO 2024-09-26 09:28:01,525 elastic_timeout: 30 LAUNCH INFO 2024-09-26 09:28:01,525 enable_gpu_log: True LAUNCH INFO 2024-09-26 09:28:01,525 gloo_port: 6767 LAUNCH INFO 2024-09-26 09:28:01,525 host: None LAUNCH INFO 2024-09-26 09:28:01,525 ips: 192.168.8.134,192.168.8.131 LAUNCH INFO 2024-09-26 09:28:01,525 job_id: default LAUNCH INFO 2024-09-26 09:28:01,525 legacy: False LAUNCH INFO 2024-09-26 09:28:01,525 log_dir: log LAUNCH INFO 2024-09-26 09:28:01,525 log_level: INFO LAUNCH INFO 2024-09-26 09:28:01,525 log_overwrite: False LAUNCH INFO 2024-09-26 09:28:01,525 master: None LAUNCH INFO 2024-09-26 09:28:01,525 max_restart: 3 LAUNCH INFO 2024-09-26 09:28:01,525 nnodes: 1 LAUNCH INFO 2024-09-26 09:28:01,525 nproc_per_node: None LAUNCH INFO 2024-09-26 09:28:01,525 rank: -1 LAUNCH INFO 2024-09-26 09:28:01,525 run_mode: collective LAUNCH INFO 2024-09-26 09:28:01,525 server_num: None LAUNCH INFO 2024-09-26 09:28:01,525 servers: LAUNCH INFO 2024-09-26 09:28:01,525 sort_ip: False LAUNCH INFO 2024-09-26 09:28:01,525 start_port: 6070 LAUNCH INFO 
2024-09-26 09:28:01,525 trainer_num: None LAUNCH INFO 2024-09-26 09:28:01,525 trainers: LAUNCH INFO 2024-09-26 09:28:01,525 training_script: tools/train.py LAUNCH INFO 2024-09-26 09:28:01,525 training_script_args: ['-c', './DIY/configs/rec/rec_svtrnet-hw_english_word.yml'] LAUNCH INFO 2024-09-26 09:28:01,525 with_gloo: 1 LAUNCH INFO 2024-09-26 09:28:01,526 -------------------------------------------------- LAUNCH INFO 2024-09-26 09:28:01,526 Job: default, mode collective, replicas 1[1:1], elastic False LAUNCH INFO 2024-09-26 09:28:01,527 Run Pod: pcelng, replicas 1, status ready LAUNCH INFO 2024-09-26 09:28:01,545 Watching Pod: pcelng, replicas 1, status running /home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md warnings.warn(warning_message) [2024/09/26 09:28:03] ppocr INFO: Architecture : [2024/09/26 09:28:03] ppocr INFO: Backbone : [2024/09/26 09:28:03] ppocr INFO: depth : [3, 6, 3] [2024/09/26 09:28:03] ppocr INFO: embed_dim : [64, 128, 256] [2024/09/26 09:28:03] ppocr INFO: img_size : [32, 600] [2024/09/26 09:28:03] ppocr INFO: last_stage : True [2024/09/26 09:28:03] ppocr INFO: local_mixer : [[7, 11], [7, 11], [7, 11]] [2024/09/26 09:28:03] ppocr INFO: mixer : ['Local', 'Local', 'Local', 'Local', 'Local', 'Local', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global'] [2024/09/26 09:28:03] ppocr INFO: name : SVTRNet [2024/09/26 09:28:03] ppocr INFO: num_heads : [2, 4, 8] [2024/09/26 09:28:03] ppocr INFO: out_channels : 192 [2024/09/26 09:28:03] ppocr INFO: out_char_num : 50 [2024/09/26 09:28:03] ppocr INFO: patch_merging : Conv [2024/09/26 09:28:03] ppocr INFO: prenorm : False [2024/09/26 09:28:03] ppocr INFO: Head : [2024/09/26 09:28:03] ppocr INFO: name : CTCHead [2024/09/26 09:28:03] ppocr 
INFO: Neck : [2024/09/26 09:28:03] ppocr INFO: encoder_type : reshape [2024/09/26 09:28:03] ppocr INFO: name : SequenceEncoder [2024/09/26 09:28:03] ppocr INFO: Transform : [2024/09/26 09:28:03] ppocr INFO: name : STN_ON [2024/09/26 09:28:03] ppocr INFO: num_control_points : 20 [2024/09/26 09:28:03] ppocr INFO: stn_activation : none [2024/09/26 09:28:03] ppocr INFO: tps_inputsize : [32, 64] [2024/09/26 09:28:03] ppocr INFO: tps_margins : [0.05, 0.05] [2024/09/26 09:28:03] ppocr INFO: tps_outputsize : [32, 600] [2024/09/26 09:28:03] ppocr INFO: algorithm : SVTR [2024/09/26 09:28:03] ppocr INFO: model_type : rec [2024/09/26 09:28:03] ppocr INFO: Eval : [2024/09/26 09:28:03] ppocr INFO: dataset : [2024/09/26 09:28:03] ppocr INFO: data_dir : ./datasets/hw_english_word_dictation-rec [2024/09/26 09:28:03] ppocr INFO: label_file_list : ['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt'] [2024/09/26 09:28:03] ppocr INFO: name : SimpleDataSet [2024/09/26 09:28:03] ppocr INFO: transforms : [2024/09/26 09:28:03] ppocr INFO: DecodeImage : [2024/09/26 09:28:03] ppocr INFO: channel_first : False [2024/09/26 09:28:03] ppocr INFO: img_mode : BGR [2024/09/26 09:28:03] ppocr INFO: CTCLabelEncode : None [2024/09/26 09:28:03] ppocr INFO: SVTRRecResizeImg : [2024/09/26 09:28:03] ppocr INFO: image_shape : [3, 32, 600] [2024/09/26 09:28:03] ppocr INFO: padding : True [2024/09/26 09:28:03] ppocr INFO: KeepKeys : [2024/09/26 09:28:03] ppocr INFO: keep_keys : ['image', 'label', 'length'] [2024/09/26 09:28:03] ppocr INFO: loader : [2024/09/26 09:28:03] ppocr INFO: batch_size_per_card : 100 [2024/09/26 09:28:03] ppocr INFO: drop_last : False [2024/09/26 09:28:03] ppocr INFO: num_workers : 12 [2024/09/26 09:28:03] ppocr INFO: shuffle : False [2024/09/26 09:28:03] ppocr INFO: Global : [2024/09/26 09:28:03] ppocr INFO: cal_metric_during_train : True [2024/09/26 09:28:03] ppocr INFO: character_dict_path : ./DIY/character/hw_english_word.txt [2024/09/26 09:28:03] ppocr INFO: 
character_type : en [2024/09/26 09:28:03] ppocr INFO: checkpoints : None [2024/09/26 09:28:03] ppocr INFO: d2s_train_image_shape : [3, 32, 600] [2024/09/26 09:28:03] ppocr INFO: distributed : True [2024/09/26 09:28:03] ppocr INFO: epoch_num : 150 [2024/09/26 09:28:03] ppocr INFO: eval_batch_step : [0, 9736] [2024/09/26 09:28:03] ppocr INFO: infer_img : ./datasets/hw_score_data/images_08_test/ [2024/09/26 09:28:03] ppocr INFO: infer_mode : False [2024/09/26 09:28:03] ppocr INFO: log_smooth_window : 1 [2024/09/26 09:28:03] ppocr INFO: max_text_length : 50 [2024/09/26 09:28:03] ppocr INFO: pretrained_model : None [2024/09/26 09:28:03] ppocr INFO: print_batch_step : 1 [2024/09/26 09:28:03] ppocr INFO: save_epoch_step : 1 [2024/09/26 09:28:03] ppocr INFO: save_inference_dir : ./output/rec_svtrnet-hw_english_word/infer_model/ [2024/09/26 09:28:03] ppocr INFO: save_model_dir : ./output/rec_svtrnet-hw_english_word/ [2024/09/26 09:28:03] ppocr INFO: save_res_path : ./output/rec_svtrnet-hw_english_word/rec/predicts_rec_svtrnet-hw_english_word.txt [2024/09/26 09:28:03] ppocr INFO: use_gpu : True [2024/09/26 09:28:03] ppocr INFO: use_space_char : True [2024/09/26 09:28:03] ppocr INFO: use_visualdl : False [2024/09/26 09:28:03] ppocr INFO: Loss : [2024/09/26 09:28:03] ppocr INFO: name : CTCLoss [2024/09/26 09:28:03] ppocr INFO: Metric : [2024/09/26 09:28:03] ppocr INFO: main_indicator : acc [2024/09/26 09:28:03] ppocr INFO: name : RecMetric [2024/09/26 09:28:03] ppocr INFO: Optimizer : [2024/09/26 09:28:03] ppocr INFO: beta1 : 0.9 [2024/09/26 09:28:03] ppocr INFO: beta2 : 0.99 [2024/09/26 09:28:03] ppocr INFO: epsilon : 1e-08 [2024/09/26 09:28:03] ppocr INFO: lr : [2024/09/26 09:28:03] ppocr INFO: learning_rate : 0.0005 [2024/09/26 09:28:03] ppocr INFO: name : Cosine [2024/09/26 09:28:03] ppocr INFO: warmup_epoch : 2 [2024/09/26 09:28:03] ppocr INFO: name : AdamW [2024/09/26 09:28:03] ppocr INFO: no_weight_decay_name : norm pos_embed [2024/09/26 09:28:03] ppocr INFO: 
one_dim_param_no_weight_decay : True [2024/09/26 09:28:03] ppocr INFO: weight_decay : 0.05 [2024/09/26 09:28:03] ppocr INFO: PostProcess : [2024/09/26 09:28:03] ppocr INFO: name : CTCLabelDecode [2024/09/26 09:28:03] ppocr INFO: Train : [2024/09/26 09:28:03] ppocr INFO: dataset : [2024/09/26 09:28:03] ppocr INFO: data_dir : ./datasets/hw_english_word_dictation-rec [2024/09/26 09:28:03] ppocr INFO: label_file_list : ['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt'] [2024/09/26 09:28:03] ppocr INFO: name : SimpleDataSet [2024/09/26 09:28:03] ppocr INFO: ratio_list : [1] [2024/09/26 09:28:03] ppocr INFO: transforms : [2024/09/26 09:28:03] ppocr INFO: DecodeImage : [2024/09/26 09:28:03] ppocr INFO: channel_first : False [2024/09/26 09:28:03] ppocr INFO: img_mode : BGR [2024/09/26 09:28:03] ppocr INFO: CTCLabelEncode : None [2024/09/26 09:28:03] ppocr INFO: SVTRRecResizeImg : [2024/09/26 09:28:03] ppocr INFO: image_shape : [3, 32, 600] [2024/09/26 09:28:03] ppocr INFO: padding : True [2024/09/26 09:28:03] ppocr INFO: KeepKeys : [2024/09/26 09:28:03] ppocr INFO: keep_keys : ['image', 'label', 'length'] [2024/09/26 09:28:03] ppocr INFO: loader : [2024/09/26 09:28:03] ppocr INFO: batch_size_per_card : 100 [2024/09/26 09:28:03] ppocr INFO: drop_last : False [2024/09/26 09:28:03] ppocr INFO: num_workers : 12 [2024/09/26 09:28:03] ppocr INFO: shuffle : True [2024/09/26 09:28:03] ppocr INFO: profiler_options : None [2024/09/26 09:28:03] ppocr INFO: train with paddle 3.0.0 and device Place(gpu:3) ======================= Modified FLAGS detected ======================= FLAGS(name='FLAGS_cudnn_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cudnn/lib', default_value='') FLAGS(name='FLAGS_cublas_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cublas/lib', default_value='') FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, 
default_value=False) FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='') FLAGS(name='FLAGS_cusolver_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cusolver/lib', default_value='') FLAGS(name='FLAGS_nvidia_package_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia', default_value='') FLAGS(name='FLAGS_nccl_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/nccl/lib', default_value='') FLAGS(name='FLAGS_cusparse_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cusparse/lib', default_value='') FLAGS(name='FLAGS_cupti_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cuda_cupti/lib', default_value='') FLAGS(name='FLAGS_curand_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/curand/lib', default_value='') ======================================================================= I0926 09:28:03.863519 17650 tcp_utils.cc:181] The server starts to listen on IP_ANY:6070 I0926 09:28:03.863718 17650 tcp_utils.cc:130] Successfully connected to 192.168.8.134:6070 I0926 09:28:11.758026 17650 process_group_nccl.cc:150] ProcessGroupNCCL pg_timeout_ 1800000 I0926 09:28:11.758095 17650 process_group_nccl.cc:151] ProcessGroupNCCL nccl_comm_init_option_ 0 [2024/09/26 09:28:12] ppocr INFO: Initialize indexs of datasets:['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt'] [2024/09/26 09:28:12] ppocr INFO: Initialize indexs of datasets:['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt'] W0926 09:28:12.926563 17650 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 8.9, Driver API Version: 12.5, Runtime API Version: 12.3 W0926 09:28:12.928436 17650 gpu_resources.cc:164] device: 3, cuDNN Version: 9.0. 
[2024/09/26 09:28:13] ppocr INFO: train dataloader has 1954 iters
[2024/09/26 09:28:13] ppocr INFO: valid dataloader has 3907 iters
[2024/09/26 09:28:13] ppocr INFO: train from scratch
hf-13f-gpu-134:17650:17650 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
hf-13f-gpu-134:17650:17650 [3] NCCL INFO Bootstrap : Using eno1:192.168.8.134<0>
hf-13f-gpu-134:17650:17650 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hf-13f-gpu-134:17650:17650 [3] NCCL INFO cudaDriverVersion 12050
NCCL version 2.19.3+cuda12.3
hf-13f-gpu-134:17650:18005 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
hf-13f-gpu-134:17650:18005 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
hf-13f-gpu-134:17650:18005 [3] NCCL INFO NET/Socket : Using [0]eno1:192.168.8.134<0>
hf-13f-gpu-134:17650:18005 [3] NCCL INFO Using non-device net plugin version 0
hf-13f-gpu-134:17650:18005 [3] NCCL INFO Using network Socket
(py310_ppocr) ai@hf-13f-gpu-131:/workspace2/xlg/code/paddle-ocr-contribute$ python -m paddle.distributed.launch --ips="192.168.8.134,192.168.8.131" --gpus 3 tools/train.py -c ./DIY/configs/rec/rec_svtrnet-hw_english_word.yml /home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md warnings.warn(warning_message) LAUNCH INFO 2024-09-26 09:28:09,243 ----------- Configuration ---------------------- LAUNCH INFO 2024-09-26 09:28:09,243 auto_cluster_config: 0 LAUNCH INFO 2024-09-26 09:28:09,243 auto_parallel_config: None LAUNCH INFO 2024-09-26 09:28:09,243 auto_tuner_json: None LAUNCH INFO 2024-09-26 09:28:09,243 devices: 3 LAUNCH INFO 2024-09-26 09:28:09,243 elastic_level: -1 LAUNCH INFO 2024-09-26 09:28:09,243 elastic_timeout: 30 LAUNCH INFO 2024-09-26 09:28:09,243 enable_gpu_log: True LAUNCH INFO 2024-09-26 09:28:09,243 gloo_port: 6767 LAUNCH INFO 2024-09-26 09:28:09,243 host: None LAUNCH INFO 2024-09-26 09:28:09,243 ips: 192.168.8.134,192.168.8.131 LAUNCH INFO 2024-09-26 09:28:09,243 job_id: default LAUNCH INFO 2024-09-26 09:28:09,243 legacy: False LAUNCH INFO 2024-09-26 09:28:09,243 log_dir: log LAUNCH INFO 2024-09-26 09:28:09,243 log_level: INFO LAUNCH INFO 2024-09-26 09:28:09,243 log_overwrite: False LAUNCH INFO 2024-09-26 09:28:09,243 master: None LAUNCH INFO 2024-09-26 09:28:09,243 max_restart: 3 LAUNCH INFO 2024-09-26 09:28:09,243 nnodes: 1 LAUNCH INFO 2024-09-26 09:28:09,243 nproc_per_node: None LAUNCH INFO 2024-09-26 09:28:09,243 rank: -1 LAUNCH INFO 2024-09-26 09:28:09,243 run_mode: collective LAUNCH INFO 2024-09-26 09:28:09,243 server_num: None LAUNCH INFO 2024-09-26 09:28:09,243 servers: LAUNCH INFO 2024-09-26 09:28:09,243 sort_ip: False LAUNCH INFO 2024-09-26 09:28:09,243 start_port: 6070 LAUNCH INFO 
2024-09-26 09:28:09,243 trainer_num: None LAUNCH INFO 2024-09-26 09:28:09,243 trainers: LAUNCH INFO 2024-09-26 09:28:09,243 training_script: tools/train.py LAUNCH INFO 2024-09-26 09:28:09,243 training_script_args: ['-c', './DIY/configs/rec/rec_svtrnet-hw_english_word.yml'] LAUNCH INFO 2024-09-26 09:28:09,243 with_gloo: 1 LAUNCH INFO 2024-09-26 09:28:09,243 -------------------------------------------------- LAUNCH INFO 2024-09-26 09:28:09,244 Job: default, mode collective, replicas 1[1:1], elastic False LAUNCH INFO 2024-09-26 09:28:09,244 Run Pod: nkpvvc, replicas 1, status ready LAUNCH INFO 2024-09-26 09:28:09,265 Watching Pod: nkpvvc, replicas 1, status running /home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md warnings.warn(warning_message) [2024/09/26 09:28:11] ppocr INFO: Architecture : [2024/09/26 09:28:11] ppocr INFO: Backbone : [2024/09/26 09:28:11] ppocr INFO: depth : [3, 6, 3] [2024/09/26 09:28:11] ppocr INFO: embed_dim : [64, 128, 256] [2024/09/26 09:28:11] ppocr INFO: img_size : [32, 600] [2024/09/26 09:28:11] ppocr INFO: last_stage : True [2024/09/26 09:28:11] ppocr INFO: local_mixer : [[7, 11], [7, 11], [7, 11]] [2024/09/26 09:28:11] ppocr INFO: mixer : ['Local', 'Local', 'Local', 'Local', 'Local', 'Local', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global'] [2024/09/26 09:28:11] ppocr INFO: name : SVTRNet [2024/09/26 09:28:11] ppocr INFO: num_heads : [2, 4, 8] [2024/09/26 09:28:11] ppocr INFO: out_channels : 192 [2024/09/26 09:28:11] ppocr INFO: out_char_num : 50 [2024/09/26 09:28:11] ppocr INFO: patch_merging : Conv [2024/09/26 09:28:11] ppocr INFO: prenorm : False [2024/09/26 09:28:11] ppocr INFO: Head : [2024/09/26 09:28:11] ppocr INFO: name : CTCHead [2024/09/26 09:28:11] ppocr 
INFO: Neck : [2024/09/26 09:28:11] ppocr INFO: encoder_type : reshape [2024/09/26 09:28:11] ppocr INFO: name : SequenceEncoder [2024/09/26 09:28:11] ppocr INFO: Transform : [2024/09/26 09:28:11] ppocr INFO: name : STN_ON [2024/09/26 09:28:11] ppocr INFO: num_control_points : 20 [2024/09/26 09:28:11] ppocr INFO: stn_activation : none [2024/09/26 09:28:11] ppocr INFO: tps_inputsize : [32, 64] [2024/09/26 09:28:11] ppocr INFO: tps_margins : [0.05, 0.05] [2024/09/26 09:28:11] ppocr INFO: tps_outputsize : [32, 600] [2024/09/26 09:28:11] ppocr INFO: algorithm : SVTR [2024/09/26 09:28:11] ppocr INFO: model_type : rec [2024/09/26 09:28:11] ppocr INFO: Eval : [2024/09/26 09:28:11] ppocr INFO: dataset : [2024/09/26 09:28:11] ppocr INFO: data_dir : ./datasets/hw_english_word_dictation-rec [2024/09/26 09:28:11] ppocr INFO: label_file_list : ['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt'] [2024/09/26 09:28:11] ppocr INFO: name : SimpleDataSet [2024/09/26 09:28:11] ppocr INFO: transforms : [2024/09/26 09:28:11] ppocr INFO: DecodeImage : [2024/09/26 09:28:11] ppocr INFO: channel_first : False [2024/09/26 09:28:11] ppocr INFO: img_mode : BGR [2024/09/26 09:28:11] ppocr INFO: CTCLabelEncode : None [2024/09/26 09:28:11] ppocr INFO: SVTRRecResizeImg : [2024/09/26 09:28:11] ppocr INFO: image_shape : [3, 32, 600] [2024/09/26 09:28:11] ppocr INFO: padding : True [2024/09/26 09:28:11] ppocr INFO: KeepKeys : [2024/09/26 09:28:11] ppocr INFO: keep_keys : ['image', 'label', 'length'] [2024/09/26 09:28:11] ppocr INFO: loader : [2024/09/26 09:28:11] ppocr INFO: batch_size_per_card : 100 [2024/09/26 09:28:11] ppocr INFO: drop_last : False [2024/09/26 09:28:11] ppocr INFO: num_workers : 12 [2024/09/26 09:28:11] ppocr INFO: shuffle : False [2024/09/26 09:28:11] ppocr INFO: Global : [2024/09/26 09:28:11] ppocr INFO: cal_metric_during_train : True [2024/09/26 09:28:11] ppocr INFO: character_dict_path : ./DIY/character/hw_english_word.txt [2024/09/26 09:28:11] ppocr INFO: 
character_type : en [2024/09/26 09:28:11] ppocr INFO: checkpoints : None [2024/09/26 09:28:11] ppocr INFO: d2s_train_image_shape : [3, 32, 600] [2024/09/26 09:28:11] ppocr INFO: distributed : True [2024/09/26 09:28:11] ppocr INFO: epoch_num : 150 [2024/09/26 09:28:11] ppocr INFO: eval_batch_step : [0, 9736] [2024/09/26 09:28:11] ppocr INFO: infer_img : ./datasets/hw_score_data/images_08_test/ [2024/09/26 09:28:11] ppocr INFO: infer_mode : False [2024/09/26 09:28:11] ppocr INFO: log_smooth_window : 1 [2024/09/26 09:28:11] ppocr INFO: max_text_length : 50 [2024/09/26 09:28:11] ppocr INFO: pretrained_model : None [2024/09/26 09:28:11] ppocr INFO: print_batch_step : 1 [2024/09/26 09:28:11] ppocr INFO: save_epoch_step : 1 [2024/09/26 09:28:11] ppocr INFO: save_inference_dir : ./output/rec_svtrnet-hw_english_word/infer_model/ [2024/09/26 09:28:11] ppocr INFO: save_model_dir : ./output/rec_svtrnet-hw_english_word/ [2024/09/26 09:28:11] ppocr INFO: save_res_path : ./output/rec_svtrnet-hw_english_word/rec/predicts_rec_svtrnet-hw_english_word.txt [2024/09/26 09:28:11] ppocr INFO: use_gpu : True [2024/09/26 09:28:11] ppocr INFO: use_space_char : True [2024/09/26 09:28:11] ppocr INFO: use_visualdl : False [2024/09/26 09:28:11] ppocr INFO: Loss : [2024/09/26 09:28:11] ppocr INFO: name : CTCLoss [2024/09/26 09:28:11] ppocr INFO: Metric : [2024/09/26 09:28:11] ppocr INFO: main_indicator : acc [2024/09/26 09:28:11] ppocr INFO: name : RecMetric [2024/09/26 09:28:11] ppocr INFO: Optimizer : [2024/09/26 09:28:11] ppocr INFO: beta1 : 0.9 [2024/09/26 09:28:11] ppocr INFO: beta2 : 0.99 [2024/09/26 09:28:11] ppocr INFO: epsilon : 1e-08 [2024/09/26 09:28:11] ppocr INFO: lr : [2024/09/26 09:28:11] ppocr INFO: learning_rate : 0.0005 [2024/09/26 09:28:11] ppocr INFO: name : Cosine [2024/09/26 09:28:11] ppocr INFO: warmup_epoch : 2 [2024/09/26 09:28:11] ppocr INFO: name : AdamW [2024/09/26 09:28:11] ppocr INFO: no_weight_decay_name : norm pos_embed [2024/09/26 09:28:11] ppocr INFO: 
one_dim_param_no_weight_decay : True [2024/09/26 09:28:11] ppocr INFO: weight_decay : 0.05 [2024/09/26 09:28:11] ppocr INFO: PostProcess : [2024/09/26 09:28:11] ppocr INFO: name : CTCLabelDecode [2024/09/26 09:28:11] ppocr INFO: Train : [2024/09/26 09:28:11] ppocr INFO: dataset : [2024/09/26 09:28:11] ppocr INFO: data_dir : ./datasets/hw_english_word_dictation-rec [2024/09/26 09:28:11] ppocr INFO: label_file_list : ['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt'] [2024/09/26 09:28:11] ppocr INFO: name : SimpleDataSet [2024/09/26 09:28:11] ppocr INFO: ratio_list : [1] [2024/09/26 09:28:11] ppocr INFO: transforms : [2024/09/26 09:28:11] ppocr INFO: DecodeImage : [2024/09/26 09:28:11] ppocr INFO: channel_first : False [2024/09/26 09:28:11] ppocr INFO: img_mode : BGR [2024/09/26 09:28:11] ppocr INFO: CTCLabelEncode : None [2024/09/26 09:28:11] ppocr INFO: SVTRRecResizeImg : [2024/09/26 09:28:11] ppocr INFO: image_shape : [3, 32, 600] [2024/09/26 09:28:11] ppocr INFO: padding : True [2024/09/26 09:28:11] ppocr INFO: KeepKeys : [2024/09/26 09:28:11] ppocr INFO: keep_keys : ['image', 'label', 'length'] [2024/09/26 09:28:11] ppocr INFO: loader : [2024/09/26 09:28:11] ppocr INFO: batch_size_per_card : 100 [2024/09/26 09:28:11] ppocr INFO: drop_last : False [2024/09/26 09:28:11] ppocr INFO: num_workers : 12 [2024/09/26 09:28:11] ppocr INFO: shuffle : True [2024/09/26 09:28:11] ppocr INFO: profiler_options : None [2024/09/26 09:28:11] ppocr INFO: train with paddle 3.0.0 and device Place(gpu:3) ======================= Modified FLAGS detected ======================= FLAGS(name='FLAGS_cusparse_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cusparse/lib', default_value='') FLAGS(name='FLAGS_cublas_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cublas/lib', default_value='') FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, 
default_value=False) FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='') FLAGS(name='FLAGS_nccl_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/nccl/lib', default_value='') FLAGS(name='FLAGS_curand_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/curand/lib', default_value='') FLAGS(name='FLAGS_cupti_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cuda_cupti/lib', default_value='') FLAGS(name='FLAGS_nvidia_package_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia', default_value='') FLAGS(name='FLAGS_cusolver_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cusolver/lib', default_value='') FLAGS(name='FLAGS_cudnn_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cudnn/lib', default_value='') ======================================================================= I0926 09:28:11.714093 14571 tcp_utils.cc:181] The server starts to listen on IP_ANY:6070 I0926 09:28:11.714478 14571 tcp_utils.cc:130] Successfully connected to 192.168.8.134:6070 I0926 09:28:11.715672 14571 process_group_nccl.cc:150] ProcessGroupNCCL pg_timeout_ 1800000 I0926 09:28:11.715679 14571 process_group_nccl.cc:151] ProcessGroupNCCL nccl_comm_init_option_ 0 [2024/09/26 09:28:11] ppocr INFO: Initialize indexs of datasets:['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt'] [2024/09/26 09:28:12] ppocr INFO: Initialize indexs of datasets:['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt'] W0926 09:28:12.378768 14571 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 8.6, Driver API Version: 12.5, Runtime API Version: 12.3 W0926 09:28:12.379539 14571 gpu_resources.cc:164] device: 3, cuDNN Version: 9.0. 
[2024/09/26 09:28:12] ppocr INFO: train dataloader has 1954 iters
[2024/09/26 09:28:12] ppocr INFO: valid dataloader has 3907 iters
[2024/09/26 09:28:12] ppocr INFO: train from scratch
hf-13f-gpu-131:14571:14571 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp96s0f0
hf-13f-gpu-131:14571:14571 [3] NCCL INFO Bootstrap : Using enp96s0f0:192.168.8.131<0>
hf-13f-gpu-131:14571:14571 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hf-13f-gpu-131:14571:14571 [3] NCCL INFO cudaDriverVersion 12050
NCCL version 2.19.3+cuda12.3
hf-13f-gpu-131:14571:14710 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
hf-13f-gpu-131:14571:14710 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp96s0f0
hf-13f-gpu-131:14571:14710 [3] NCCL INFO NET/Socket : Using [0]enp96s0f0:192.168.8.131<0>
hf-13f-gpu-131:14571:14710 [3] NCCL INFO Using non-device net plugin version 0
hf-13f-gpu-131:14571:14710 [3] NCCL INFO Using network Socket
Hello, we are contacting the relevant colleagues to reproduce this. Thank you for your feedback!
Thank you very much.
Hello, you could try the following launch command: python -m paddle.distributed.launch --ips="192.168.8.134,192.168.8.131" --host 192.168.8.134 --nnodes 2 --master 192.168.8.134:55555 --gpus 3 tools/train.py -c ./DIY/configs/rec/rec_svtrnet-hw_english_word.yml
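The launcher logs in this issue print "ips: 192.168.8.134,192.168.8.131" together with "nnodes: 1", while the suggested command sets --nnodes 2 explicitly. The thread does not say which of the added flags resolved the hang, so the following is only a minimal sketch for spotting that particular mismatch in a captured launcher log. The function name node_count_mismatch is hypothetical, and the regular expressions assume the "key: value" format seen in the logs above.

```python
import re


def node_count_mismatch(launch_log: str):
    """Return (ip_count, nnodes) if they differ in the launcher's config dump, else None."""
    ips = re.search(r"\bips: (\S+)", launch_log)
    nnodes = re.search(r"\bnnodes: (\d+)", launch_log)
    if not ips or not nnodes:
        return None  # config dump not found; nothing to compare
    ip_count = len([ip for ip in ips.group(1).split(",") if ip])
    declared = int(nnodes.group(1))
    return (ip_count, declared) if ip_count != declared else None


# Two entries copied from the master log in this issue.
sample = (
    "LAUNCH INFO 2024-09-26 09:28:01,525 ips: 192.168.8.134,192.168.8.131 "
    "LAUNCH INFO 2024-09-26 09:28:01,525 nnodes: 1"
)
print(node_count_mismatch(sample))  # (2, 1)
```

A non-None result only means the two values disagree; it does not by itself prove that the launcher misbehaves.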
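Thank you so much! Running directly on the host machines solves the problem, and Docker with --network=host also works. Docker without --network=host still fails. How can that be solved?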
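One possible explanation, not confirmed anywhere in this thread: a container on Docker's default bridge network has its own network namespace, so host addresses such as 192.168.8.134 and host interfaces such as eno1 are not present inside it. A minimal sketch for checking this from inside the container is to try binding a socket to the address passed via --host. The helper name ip_is_local is hypothetical.

```python
import socket


def ip_is_local(ip: str) -> bool:
    """Return True if `ip` can be bound in the current network namespace."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((ip, 0))  # port 0 lets the OS pick any free port
            return True
        except OSError:
            # Typically "Cannot assign requested address" when no local
            # interface carries this IP.
            return False
```

For example, ip_is_local("192.168.8.134") run inside the container on the master would be expected to return False under bridge networking and True with --network=host, assuming the explanation above applies.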
lijialin03
Describe the Bug
PaddlePaddle/PaddleOCR#13912
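The whole environment runs in Docker containers on two machines, a master (134) and a worker (131). The containers are started with --ipc=host --network=host --gpus all. The NIC used for NCCL communication has been specified on each machine, passwordless SSH is set up in both directions on port 22, and the two machines can ping each other.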
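As a quick sanity check of the NIC setting described above, the following minimal sketch verifies that the interface named in NCCL_SOCKET_IFNAME is visible to the current process. The function names are hypothetical, and the sketch assumes the variable holds a single exact interface name (as in this issue); NCCL also accepts other forms of this value, which are not handled here.

```python
import os
import socket


def interface_exists(name: str) -> bool:
    """Return True if a network interface called `name` is visible to this process."""
    return name in {ifname for _, ifname in socket.if_nameindex()}


def check_nccl_ifname() -> str:
    """Describe whether NCCL_SOCKET_IFNAME points at an interface that exists here."""
    name = os.environ.get("NCCL_SOCKET_IFNAME")
    if not name:
        return "NCCL_SOCKET_IFNAME is not set"
    if interface_exists(name):
        return f"{name}: found"
    return f"{name}: not found in this network namespace"


if __name__ == "__main__":
    print(check_nccl_ifname())
```

Running it inside each container, rather than on the host, matters because the set of visible interfaces depends on the container's network mode.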
Reference: https://paddlepaddle.github.io/PaddleOCR/ppocr/blog/distributed_training.html
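However, both machines hang once they reach "NCCL INFO Using network Socket". The selected GPU on each machine holds only a small amount of memory and shows no utilization: 520 MB on the master and 410 MB on the worker.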
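Since ping only exercises ICMP, a TCP-level check between the two containers can narrow things down. Below is a minimal sketch, with the hypothetical helper name can_connect, that tests whether one TCP endpoint is reachable. Note that the logs above already show a successful connection to 192.168.8.134:6070, so the hang happens after that rendezvous; NCCL's socket transport opens its own connections on other ports, which a single-port check does not cover.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Example (run from the worker while the master is listening):
#   can_connect("192.168.8.134", 6070)
```

A False result from the worker toward a port the master is listening on would point at firewalling or container networking rather than at PaddleOCR itself.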
Command run on the master:
Command run on the worker:
Additional Supplementary Information
🏃♂️ Environment
OS: Docker, Ubuntu 20.04
paddleocr:0.1.0.dev0+d20240926
paddlepaddle-gpu:3.0.0.dev20240925
cuda: 12.3
nccl: 2.19.3+cu12.3
🌰 Minimal Reproducible Example
Master node (192.168.8.134)
Worker node (192.168.8.131)