How do I get the program to run on the GPU? #8

Open
zhouyang-bigdata opened this issue Jul 8, 2020 · 20 comments

Comments

@zhouyang-bigdata

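How do I get the program to run on the GPU? The server is a Tencent Cloud GPU instance and I am using your command, but no matter what I try, it runs on the CPU. Do I need to install any additional modules?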
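
One way to check whether TensorFlow can see the GPU at all, before launching training, is to list the local devices. This is a minimal sketch, not taken from the repository: `visible_gpus` is a hypothetical helper name, and it assumes a GPU-enabled TensorFlow build (in TensorFlow 1.x that is the separate `tensorflow-gpu` pip package) with a matching CUDA and cuDNN installation.

```python
def visible_gpus():
    """Return the names of the GPU devices TensorFlow can see.

    Returns None when TensorFlow is not importable in this environment.
    """
    try:
        # device_lib ships with TensorFlow itself; no extra package is needed.
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif gpus:
        print("GPUs visible to TensorFlow:", gpus)
    else:
        print("No GPU visible; training would fall back to the CPU")
```

An empty list would suggest that the CPU-only package is installed or that the CUDA libraries cannot be loaded, which is worth ruling out before changing anything in the training command.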

@HuiResearch
Owner

HuiResearch commented Jul 8, 2020 via email

@zhouyang-bigdata
Author

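After some trial and error, I switched to a different system image on Tencent Cloud, and it now appears to be running on the GPU. How long does this training usually take?
The log is as follows: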

INFO:tensorflow: name = bert/encoder/layer_11/attention/output/dense/kernel:0, shape = (768, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/intermediate/dense/kernel:0, shape = (768, 3072), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/intermediate/dense/bias:0, shape = (3072,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/dense/kernel:0, shape = (3072, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/pooler/dense/kernel:0, shape = (768, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/pooler/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bidirectional_rnn/fw/basic_lstm_cell/kernel:0, shape = (968, 800)
INFO:tensorflow: name = bidirectional_rnn/fw/basic_lstm_cell/bias:0, shape = (800,)
INFO:tensorflow: name = bidirectional_rnn/bw/basic_lstm_cell/kernel:0, shape = (968, 800)
INFO:tensorflow: name = bidirectional_rnn/bw/basic_lstm_cell/bias:0, shape = (800,)
INFO:tensorflow: name = u_omega:0, shape = (1168,)
INFO:tensorflow: name = output_weights:0, shape = (20, 1168)
INFO:tensorflow: name = output_bias:0, shape = (20,)
WARNING:tensorflow:From /usr/local/lib/python3.6/site-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2020-07-09 09:42:55.471208: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-09 09:42:55.649475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 09:42:55.650252: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x8a58eb0 executing computations on platform CUDA. Devices:
2020-07-09 09:42:55.650293: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-07-09 09:42:55.663763: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2020-07-09 09:42:55.664574: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x9a2a320 executing computations on platform Host. Devices:
2020-07-09 09:42:55.664608: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2020-07-09 09:42:55.665581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:08.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-07-09 09:42:55.665603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-09 09:42:55.666469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-09 09:42:55.666486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-07-09 09:42:55.666493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-07-09 09:42:55.666986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30459 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:08.0, compute capability: 7.0)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ckpt/divorce/model.ckpt.
2020-07-09 09:43:22.977919: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
INFO:tensorflow:global_step/sec: 2.03341
INFO:tensorflow:examples/sec: 65.0693
INFO:tensorflow:global_step/sec: 2.27393
INFO:tensorflow:examples/sec: 72.7656
INFO:tensorflow:global_step/sec: 2.27549
INFO:tensorflow:examples/sec: 72.8157
INFO:tensorflow:global_step/sec: 2.2709
INFO:tensorflow:examples/sec: 72.6686
INFO:tensorflow:global_step/sec: 2.27153
INFO:tensorflow:examples/sec: 72.6891
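
The line to look for in a log like the one above is `Created TensorFlow device (... device:GPU:0 ...)`, which indicates that a device was created on the physical GPU. As a small sketch (the helper name `gpu_devices_in_log` is made up for illustration), a saved log can be scanned for such lines:

```python
import re

# Matches the device-creation message and captures the GPU index and memory size.
_GPU_DEVICE = re.compile(
    r"Created TensorFlow device \(\S*/device:GPU:(\d+) with (\d+) MB memory\)")


def gpu_devices_in_log(lines):
    """Return (gpu_index, memory_mb) for every GPU device-creation line."""
    found = []
    for line in lines:
        match = _GPU_DEVICE.search(line)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found


# Two lines copied from the log above.
sample = [
    "INFO:tensorflow:Graph was finalized.",
    "2020-07-09 09:42:55.666986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] "
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30459 MB memory) "
    "-> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:08.0, "
    "compute capability: 7.0)",
]
print(gpu_devices_in_log(sample))
```

If the list comes back empty for a full training log, the run most likely stayed on the CPU.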

@HuiResearch
Owner

HuiResearch commented Jul 9, 2020 via email

@zhouyang-bigdata
Author

I learned something new. When I used PyTorch before, I never paid attention to steps.
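
Since the log reports throughput in steps rather than epochs, a rough training-time estimate can be worked out from `global_step/sec`. In the log above, `examples/sec` divided by `global_step/sec` is about 32 (72.7656 / 2.27393), which suggests a batch size of 32. The sketch below is an illustration only: the example count and epoch count are placeholder values, not the repository's settings.

```python
import math


def estimate_training_seconds(num_examples, batch_size, num_epochs, steps_per_sec):
    """Estimate total steps and wall-clock seconds from the logged step rate."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return total_steps, total_steps / steps_per_sec


# Batch size implied by the log: examples/sec divided by global_step/sec.
batch_size = round(72.7656 / 2.27393)

# num_examples and num_epochs are hypothetical placeholders.
total_steps, seconds = estimate_training_seconds(
    num_examples=10000, batch_size=batch_size, num_epochs=3, steps_per_sec=2.27)
print(batch_size, total_steps, round(seconds / 60, 1), "minutes")
```

The estimate ignores checkpoint saving and evaluation pauses, so the real run will take somewhat longer.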

@HuiResearch
Owner

HuiResearch commented Jul 9, 2020 via email

@zhouyang-bigdata
Author

Training has finished and the accuracy is 0.83. Were the training set and the test set used together for training?
Log:

99%|█████████▉| 249/252 [01:36<00:01, 2.76it/s]
99%|█████████▉| 250/252 [01:36<00:00, 2.77it/s]
100%|█████████▉| 251/252 [01:37<00:00, 2.76it/s]
100%|██████████| 252/252 [01:37<00:00, 2.76it/s]
100%|██████████| 252/252 [01:37<00:00, 2.58it/s]
INFO:root:Model prediction finished

INFO:root:Per-class F values:

INFO:root:{'1': 0.96, '2': 0.92, '3': 0.91, '4': 0.93, '5': 0.91, '6': 0.93, '7': 0.93, '8': 0.97, '9': 0.98, '10': 0.88, '11': 0.84, '12': 0.25, '13': 0.83, '14': 0.47, '15': 0.82, '16': 0.79, '17': 0.72, '18': 0.03, '19': 0.32, '20': 0.64}
INFO:root:Overall score: 0.8298107041994647
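
The per-class F values show where the model struggles more clearly than the single overall score does. As a quick sketch using only the numbers from the log above, the unweighted mean of the 20 values is about 0.75, noticeably lower than the reported 0.83. That gap would be consistent with an overall score that also gives frequent classes more weight (for example, a micro-averaged component), but how the score is combined is not shown in this thread, so treat that as an assumption.

```python
# Per-class F values copied from the log above.
f_values = {
    '1': 0.96, '2': 0.92, '3': 0.91, '4': 0.93, '5': 0.91,
    '6': 0.93, '7': 0.93, '8': 0.97, '9': 0.98, '10': 0.88,
    '11': 0.84, '12': 0.25, '13': 0.83, '14': 0.47, '15': 0.82,
    '16': 0.79, '17': 0.72, '18': 0.03, '19': 0.32, '20': 0.64,
}

# Unweighted (macro) mean over all classes.
macro_f = sum(f_values.values()) / len(f_values)

# Classes below an arbitrary cut-off of 0.5, chosen only for illustration.
weak_tags = [tag for tag, f in f_values.items() if f < 0.5]

print(round(macro_f, 4), weak_tags)
```

The tags flagged as weak are the ones worth inspecting first, for instance by checking how many training samples carry them.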

@HuiResearch
Owner

HuiResearch commented Jul 9, 2020 via email

@zhouyang-bigdata
Author

zhouyang-bigdata commented Jul 9, 2020

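It is probably because the training set (divorce) is fairly small. I see that the training set is only 1.93M, whereas the official training set I downloaded earlier is 6.08M.
Also, how do I set up training on multiple GPUs?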

@zhouyang-bigdata
Author

zhouyang-bigdata commented Jul 9, 2020

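After switching to the 6.08M dataset (divorce), the accuracy dropped by only about 0.02 (from 0.83 to 0.81). Surprising... Is that reasonable? I could not reproduce your accuracy of 0.73.
The log is as follows: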

98%|█████████▊| 247/252 [01:35<00:01, 2.76it/s]
98%|█████████▊| 248/252 [01:36<00:01, 2.75it/s]
99%|█████████▉| 249/252 [01:36<00:01, 2.75it/s]
99%|█████████▉| 250/252 [01:37<00:00, 2.74it/s]
100%|█████████▉| 251/252 [01:37<00:00, 2.73it/s]
100%|██████████| 252/252 [01:37<00:00, 2.73it/s]
INFO:root:Model prediction finished

INFO:root:Per-class F values:

INFO:root:{'1': 0.95, '2': 0.91, '3': 0.91, '4': 0.94, '5': 0.9, '6': 0.92, '7': 0.92, '8': 0.96, '9': 0.98, '10': 0.87, '11': 0.84, '12': 0.21, '13': 0.8, '14': 0.31, '15': 0.81, '16': 0.77, '17': 0.64, '18': 0.0, '19': 0.2, '20': 0.61}
INFO:root:Overall score: 0.811298028037752

@zhouyang-bigdata
Author

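This is probably a problem with the file name of my training data. I will train a few more times and check again.

After switching to the 6.08M dataset (divorce), the accuracy dropped by only about 0.02 (from 0.83 to 0.81). Surprising... Is that reasonable? I could not reproduce your accuracy of 0.73.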

@zhouyang-bigdata
Author

zhouyang-bigdata commented Jul 10, 2020

I have two questions:
(1) How large was the training data you used? I would like to reproduce your result.
(2) How do I set up training on multiple GPUs?
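
On question (2), a hedged sketch of the usual ingredients for multi-GPU training with a TensorFlow Estimator follows. It is not the repository's method. The helper names are made up for illustration, `make_run_config` assumes TensorFlow 1.14 or 1.15 (where `tf.distribute.MirroredStrategy` is available), and it will not plug in directly if the training script is built on `TPUEstimator`, as the original BERT code is.

```python
import os


def select_gpus(gpu_ids):
    """Restrict the process to the given GPUs.

    Must be called before TensorFlow initializes CUDA.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]


def per_replica_batch_size(global_batch_size, num_gpus):
    """Split a global batch evenly across data-parallel replicas."""
    if num_gpus <= 0 or global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the GPU count")
    return global_batch_size // num_gpus


def make_run_config(model_dir):
    """Build an Estimator RunConfig that mirrors variables across visible GPUs.

    Assumption: TensorFlow 1.14/1.15 and a plain tf.estimator.Estimator.
    """
    import tensorflow as tf
    strategy = tf.distribute.MirroredStrategy()
    return tf.estimator.RunConfig(model_dir=model_dir, train_distribute=strategy)


if __name__ == "__main__":
    print(select_gpus([0, 1]))
    print(per_replica_batch_size(32, 2))
```

With data parallelism each GPU processes its own slice of the batch, so the per-replica batch size and the learning rate may both need retuning.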

@HuiResearch
Owner

HuiResearch commented Jul 10, 2020 via email

@zhouyang-bigdata
Author

zhouyang-bigdata commented Jul 14, 2020

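Hello, I trained a model for the loan category with your data and the accuracy is 0.61. Is that within a reasonable range? It looks like the F values for every tag after the 10th are 0.0. Does that mean those tags were never matched?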

'11': 0.0, '12': 0.0, '13': 0.0, '14': 0.0, '15': 0.0, '16': 0.0, '17': 0.0, '18': 0.0, '19': 0.0, '20': 0.0

The full log is as follows:

92%|█████████▏| 47/51 [01:11<00:05, 1.45s/it]
94%|█████████▍| 48/51 [01:13<00:04, 1.44s/it]
96%|█████████▌| 49/51 [01:14<00:02, 1.45s/it]
98%|█████████▊| 50/51 [01:15<00:01, 1.45s/it]
100%|██████████| 51/51 [01:17<00:00, 1.45s/it]
INFO:root:Model prediction finished

INFO:root:Per-class F values:

INFO:root:{'1': 0.87, '2': 0.81, '3': 0.8, '4': 0.76, '5': 0.8, '6': 0.6, '7': 0.86, '8': 0.96, '9': 0.82, '10': 0.89, '11': 0.0, '12': 0.0, '13': 0.0, '14': 0.0, '15': 0.0, '16': 0.0, '17': 0.0, '18': 0.0, '19': 0.0, '20': 0.0}
INFO:root:Overall score: 0.6121632632937084
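
An F value of exactly 0.0 for a tag can mean two different things: the tag never occurs in the evaluation data, or it occurs but the model never predicts it correctly. A quick way to tell these apart is to count, per tag, how often it appears in the gold labels and in the predictions. The sketch below assumes each sample's labels are a list of tag IDs given as strings; the actual data format in the repository may differ, and the toy data is invented.

```python
from collections import Counter


def tag_support(target_labels, predict_labels, tags):
    """Count gold occurrences, predicted occurrences and correct hits per tag."""
    gold = Counter(t for sample in target_labels for t in set(sample))
    predicted = Counter(t for sample in predict_labels for t in set(sample))
    correct = Counter()
    for gold_sample, pred_sample in zip(target_labels, predict_labels):
        correct.update(set(gold_sample) & set(pred_sample))
    return {t: {"gold": gold[t], "predicted": predicted[t], "correct": correct[t]}
            for t in tags}


# Toy data: tag '11' is in the gold labels but never predicted,
# tag '12' does not occur at all.
targets = [['1'], ['1', '11'], ['11'], []]
predictions = [['1'], ['1'], [], []]
report = tag_support(targets, predictions, tags=['1', '11', '12'])
for tag, counts in report.items():
    print(tag, counts)
```

A tag with zero gold occurrences points at the data or the tag file, while a tag with gold occurrences but zero predictions points at the model or its decision threshold.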

The code in evaluation.py is as follows:

import logging

# load_file, BERTModel and evaluate are assumed to be defined earlier in evaluation.py.
if __name__ == '__main__':
    task = "loan"

    # Pass in the held-out test split here. Because this is only a quick test
    # while tidying up the code, the training data is loaded instead.
    sentences, labels = load_file("data/loan/data_small_selected.json")
    # sentences, labels = load_file("my_test_data.json")
    logging.info("Loading the BERT model")
    model_1 = BERTModel(task=task, pb_model="pb/loan/model.pb",
                        tagDir="data/loan/tags.txt", threshold=[0.5] * 20,
                        vocab_file="chinese_L-12_H-768_A-12/vocab.txt")

    logging.info("BERT model loaded, starting prediction!!!\n")
    logging.info("Model prediction started\n")
    predicts_1 = model_1.getAllResult(sentences)
    print(predicts_1)
    logging.info("Results:\n")
    logging.info(predicts_1)
    logging.info("Model prediction finished\n")

    logging.info("Per-class F values:\n")
    score_1, f1_1 = evaluate(predict_labels=predicts_1, target_labels=labels, tag_dir="data/loan/tags.txt")
    logging.info(f1_1)
    logging.info("Overall score: {}".format(score_1))
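
The `threshold=[0.5] * 20` argument suggests one decision threshold per tag. How `BERTModel` applies it internally is not shown in this thread; the sketch below only illustrates the common multi-label pattern, in which a tag is emitted when its predicted probability reaches that tag's threshold. The function name and the sample probabilities are made up for illustration.

```python
def apply_thresholds(probabilities, thresholds, tags):
    """Return the tags whose probability reaches the per-tag threshold."""
    if not (len(probabilities) == len(thresholds) == len(tags)):
        raise ValueError("probabilities, thresholds and tags must have equal length")
    return [tag for tag, p, th in zip(tags, probabilities, thresholds) if p >= th]


tags = ['1', '2', '3', '4']
probabilities = [0.91, 0.12, 0.48, 0.50]  # invented sigmoid outputs for one sentence

default = apply_thresholds(probabilities, [0.5] * 4, tags)
relaxed = apply_thresholds(probabilities, [0.5, 0.5, 0.4, 0.5], tags)
print(default, relaxed)
```

Lowering the threshold for a rarely predicted tag trades precision for recall, which is one thing to try when a tag's F value stays at 0.0 even though the tag does occur in the data.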

@zhouyang-bigdata
Author

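This is probably a problem with the file name of my training data. I will train a few more times and check again.

After switching to the 6.08M dataset (divorce), the accuracy dropped by only about 0.02 (from 0.83 to 0.81). Surprising... Is that reasonable? I could not reproduce your accuracy of 0.73.

This is most likely because I was testing on the wrong data. After switching to data_small_selected.json I ran it twice and got 0.71, which is very close, but I never reproduced 0.73.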
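
Because the score depends heavily on which file is evaluated, it helps to fix a held-out split once and reuse it across runs. A minimal sketch, assuming the samples are already loaded into two parallel lists (the function name, the 10% ratio and the seed are arbitrary choices for illustration):

```python
import random


def train_test_split(sentences, labels, test_ratio=0.1, seed=42):
    """Shuffle deterministically and hold out a fraction of the samples."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)  # same seed, same split on every run
    n_test = max(1, int(len(indices) * test_ratio))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [(sentences[i], labels[i]) for i in train_idx]
    test = [(sentences[i], labels[i]) for i in test_idx]
    return train, test


# Toy data standing in for the real sentences and tag lists.
toy_sentences = ["sentence {}".format(i) for i in range(20)]
toy_labels = [[str(i % 3 + 1)] for i in range(20)]
train, test = train_test_split(toy_sentences, toy_labels)
print(len(train), len(test))
```

Scores measured on a split like this are only comparable with other runs on the same split, not with a score from a different test set.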

@HuiResearch
Owner

HuiResearch commented Jul 15, 2020 via email

@zhouyang-bigdata
Author

Quoted email reply: The test data is not even the same... My score is the one from the official site's test, and I did not publish some of the trick code on GitHub; I only described it in the README.

There is one more thing I would like to ask about:

Hello, I trained a model for the loan category with your data and the accuracy is 0.61. Is that within a reasonable range? It looks like the F values for every tag after the 10th are 0.0. Does that mean those tags were never matched?

'11': 0.0, '12': 0.0, '13': 0.0, '14': 0.0, '15': 0.0, '16': 0.0, '17': 0.0, '18': 0.0, '19': 0.0, '20': 0.0

My QQ is 2648759823. What is yours?

@zhouyang-bigdata
Author

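Could you tell me whether this is normal output for the model?

There is one more thing I would like to ask about:

Hello, I trained a model for the loan category with your data and the accuracy is 0.61. Is that within a reasonable range? It looks like the F values for every tag after the 10th are 0.0. Does that mean those tags were never matched?

'11': 0.0, '12': 0.0, '13': 0.0, '14': 0.0, '15': 0.0, '16': 0.0, '17': 0.0, '18': 0.0, '19': 0.0, '20': 0.0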

@zhouyang-bigdata
Author

zhouyang-bigdata commented Jul 17, 2020

Hello, I am building a demo on my own for extracting key elements from news and articles, and I am stuck here. Could we chat on QQ?

@zhouyang-bigdata
Author

Hello, could we chat on QQ? I would like to ask you a few things.

@zhouyang-bigdata
Author

Hello, could we chat on QQ? I would like to ask you a few things. My QQ is 2648759823.
