How do I get the program to run on the GPU? #8
Comments
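Hi, you can first test whether your TensorFlow installation can use the GPU. If it cannot, it is most likely a TensorFlow installation problem, for example a CUDA version that does not match.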
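A quick way to run that check is sketched below. It assumes TensorFlow is importable in the same Python environment that is used for training. `list_tf_gpus` is a hypothetical helper name; `device_lib.list_local_devices()` is TensorFlow's own device listing.

```python
def list_tf_gpus():
    """Return the names of GPUs visible to TensorFlow, or None if TF is missing."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None  # TensorFlow is not installed in this environment
    # Each entry describes one local device; keep only the GPU entries.
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


if __name__ == "__main__":
    gpus = list_tf_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment.")
    elif gpus:
        print("TensorFlow can use:", gpus)
    else:
        print("No GPU visible to TensorFlow; check the TensorFlow build "
              "and the CUDA/cuDNN versions.")
```

An empty list usually points to a CPU-only TensorFlow build or to a CUDA/cuDNN version that the installed build cannot load.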
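On 2020-07-08 at 19:24, zhouyang-bigdata wrote:
How do I get the program to run on the GPU? The server is a Tencent Cloud GPU instance and I used your command, but no matter what I try it always runs on the CPU. Do I need to install any additional modules?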
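After some fiddling, I switched to a different system image on Tencent Cloud, and it now appears to be running on the GPU. How long does this training usually take?
It depends on the number of epochs you set and on the server's performance. In your case each step should take about 2 s, so you can work out the approximate total time yourself.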
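The calculation can be sketched as follows. The log quoted in this thread reports `global_step/sec` values of about 2.27, which reads as steps per second (roughly 0.44 s per step), and `examples/sec` divided by `global_step/sec` gives a batch size of about 32. The dataset size and epoch count below are hypothetical placeholders, and the step formula is an assumption about how the training script counts steps.

```python
def total_train_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps, assuming one step per batch."""
    return int(num_examples / batch_size * epochs)


def estimate_seconds(num_examples, batch_size, epochs, steps_per_sec):
    """Estimated wall-clock training time in seconds."""
    return total_train_steps(num_examples, batch_size, epochs) / steps_per_sec


if __name__ == "__main__":
    # 10000 examples and 3 epochs are made-up values for illustration;
    # 32 and 2.27 are read off the log in this thread.
    seconds = estimate_seconds(num_examples=10000, batch_size=32,
                               epochs=3, steps_per_sec=2.27)
    print("about %.1f minutes" % (seconds / 60))
```

Checkpoint saving and evaluation add time on top of this, so the figure is a lower bound rather than a prediction.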
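On 2020-07-09 at 09:50, zhouyang-bigdata wrote:
After some fiddling, I switched to a different system image on Tencent Cloud, and it now appears to be running on the GPU. How long does this training usually take?
The log is as follows: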
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/dense/kernel:0, shape = (768, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/attention/output/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/intermediate/dense/kernel:0, shape = (768, 3072), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/intermediate/dense/bias:0, shape = (3072,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/dense/kernel:0, shape = (3072, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/LayerNorm/beta:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/encoder/layer_11/output/LayerNorm/gamma:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/pooler/dense/kernel:0, shape = (768, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/pooler/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = bidirectional_rnn/fw/basic_lstm_cell/kernel:0, shape = (968, 800)
INFO:tensorflow: name = bidirectional_rnn/fw/basic_lstm_cell/bias:0, shape = (800,)
INFO:tensorflow: name = bidirectional_rnn/bw/basic_lstm_cell/kernel:0, shape = (968, 800)
INFO:tensorflow: name = bidirectional_rnn/bw/basic_lstm_cell/bias:0, shape = (800,)
INFO:tensorflow: name = u_omega:0, shape = (1168,)
INFO:tensorflow: name = output_weights:0, shape = (20, 1168)
INFO:tensorflow: name = output_bias:0, shape = (20,)
WARNING:tensorflow:From /usr/local/lib/python3.6/site-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2020-07-09 09:42:55.471208: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-09 09:42:55.649475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 09:42:55.650252: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x8a58eb0 executing computations on platform CUDA. Devices:
2020-07-09 09:42:55.650293: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-07-09 09:42:55.663763: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2020-07-09 09:42:55.664574: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x9a2a320 executing computations on platform Host. Devices:
2020-07-09 09:42:55.664608: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2020-07-09 09:42:55.665581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:08.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2020-07-09 09:42:55.665603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-09 09:42:55.666469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-09 09:42:55.666486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-07-09 09:42:55.666493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-07-09 09:42:55.666986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30459 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:08.0, compute capability: 7.0)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ckpt/divorce/model.ckpt.
2020-07-09 09:43:22.977919: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
INFO:tensorflow:global_step/sec: 2.03341
INFO:tensorflow:examples/sec: 65.0693
INFO:tensorflow:global_step/sec: 2.27393
INFO:tensorflow:examples/sec: 72.7656
INFO:tensorflow:global_step/sec: 2.27549
INFO:tensorflow:examples/sec: 72.8157
INFO:tensorflow:global_step/sec: 2.2709
INFO:tensorflow:examples/sec: 72.6686
INFO:tensorflow:global_step/sec: 2.27153
INFO:tensorflow:examples/sec: 72.6891
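Two things in this log answer the original question: the `Created TensorFlow device (... device:GPU:0 ...)` line shows that TensorFlow placed a device on the Tesla V100, and the `global_step/sec` lines give the training speed. A minimal sketch for pulling both out of a saved log is shown below; `summarize_log` and the regular expressions are hypothetical helpers written against the line formats seen above, not part of TensorFlow.

```python
import re

# A few lines copied from the log above, used as sample input.
SAMPLE_LOG = """\
2020-07-09 09:42:55.666986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30459 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:08.0, compute capability: 7.0)
INFO:tensorflow:global_step/sec: 2.03341
INFO:tensorflow:examples/sec: 65.0693
INFO:tensorflow:global_step/sec: 2.27393
INFO:tensorflow:examples/sec: 72.7656
"""

GPU_RE = re.compile(
    r"Created TensorFlow device \(\S*device:GPU:(\d+).*?name: ([^,]+),")
STEP_RE = re.compile(r"global_step/sec: ([0-9.]+)")


def summarize_log(text):
    """Return ([(gpu_index, gpu_name), ...], mean steps per second or None)."""
    gpus = [(int(index), name) for index, name in GPU_RE.findall(text)]
    rates = [float(value) for value in STEP_RE.findall(text)]
    mean_rate = sum(rates) / len(rates) if rates else None
    return gpus, mean_rate


if __name__ == "__main__":
    gpus, mean_rate = summarize_log(SAMPLE_LOG)
    print("GPUs used by TensorFlow:", gpus)
    print("mean global_step/sec:", mean_rate)
```

If the list of GPUs comes back empty for a full training log, the run was most likely CPU-only.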
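I learned something new. When I used PyTorch before, I never paid attention to steps.
This is TensorFlow's Estimator training style. If you changed it to the Session style, I think it would be more flexible, and you could print training logs and progress the way you do in torch, although it is somewhat more complicated.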
On 2020-07-09 at 10:12, zhouyang-bigdata wrote:
I learned something new. When I used PyTorch before, I never paid attention to steps.
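After training, the accuracy is 0.83. I guess the training set and the test set were trained together?
There may be some overlap, since the organizers never released the test data. It could also be because you used the divorce dataset, which gives somewhat better results than the other two.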
On 2020-07-09 at 10:53, zhouyang-bigdata wrote:
After training, the accuracy is 0.83. I guess the training set and the test set were trained together?
Log:
99%|█████████▉| 249/252 [01:36<00:01, 2.76it/s]
99%|█████████▉| 250/252 [01:36<00:00, 2.77it/s]
100%|█████████▉| 251/252 [01:37<00:00, 2.76it/s]
100%|██████████| 252/252 [01:37<00:00, 2.76it/s]
100%|██████████| 252/252 [01:37<00:00, 2.58it/s]
INFO:root:Model prediction finished
INFO:root:Per-class F values:
INFO:root:{'1': 0.96, '2': 0.92, '3': 0.91, '4': 0.93, '5': 0.91, '6': 0.93, '7': 0.93, '8': 0.97, '9': 0.98, '10': 0.88, '11': 0.84, '12': 0.25, '13': 0.83, '14': 0.47, '15': 0.82, '16': 0.79, '17': 0.72, '18': 0.03, '19': 0.32, '20': 0.64}
INFO:root:Overall score: 0.8298107041994647
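The per-class values in this log can be compared with the overall score in a few lines of Python. The sketch below copies the dictionary from the log; `macro_average` and `weak_classes` are hypothetical helper names, and the 0.5 threshold is arbitrary. The thread does not show how the evaluation script combines the per-class values, so this only computes their unweighted mean.

```python
# Per-class F values copied from the log in this thread.
F_BY_CLASS = {
    '1': 0.96, '2': 0.92, '3': 0.91, '4': 0.93, '5': 0.91,
    '6': 0.93, '7': 0.93, '8': 0.97, '9': 0.98, '10': 0.88,
    '11': 0.84, '12': 0.25, '13': 0.83, '14': 0.47, '15': 0.82,
    '16': 0.79, '17': 0.72, '18': 0.03, '19': 0.32, '20': 0.64,
}


def macro_average(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores.values()) / len(scores)


def weak_classes(scores, threshold=0.5):
    """Class ids whose score is below the threshold, worst first."""
    weak = [cls for cls, value in scores.items() if value < threshold]
    return sorted(weak, key=lambda cls: scores[cls])


if __name__ == "__main__":
    print("unweighted mean F: %.4f" % macro_average(F_BY_CLASS))
    print("classes below 0.5:", weak_classes(F_BY_CLASS))
```

The unweighted mean comes out near 0.75, below the reported 0.83. That would be consistent with an overall score that gives more weight to the frequent classes, though that reading is an assumption. Classes 12, 14, 18 and 19 are the ones pulling the mean down.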
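It is probably because the training set (divorce) is small. I see that the training set is only 1.93M, whereas the official training set I downloaded earlier is 6.08M.
After using the 6.08M data (divorce), the accuracy dropped by only 0.1. Amazing... Is this reasonable? I could not reproduce your accuracy of 0.73.
This is probably a problem with the file name of my training data. I will train a few more times and check again.
I have two questions:
My training data was exactly all of the data I shared. As for multiple GPUs, the BERT team's code cannot do this as it stands; you would need to change the optimizer part or use Horovod. Alternatively, switch to PyTorch, where multi-GPU training is very convenient...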
On 2020-07-10 at 18:00, zhouyang-bigdata wrote:
I have two questions:
(1) How large was the training data you used before?
(2) How do I set up training on multiple GPUs together?
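Hi, I trained a model for the loan category with your data, and the accuracy is 0.61. Is that within a reasonable range? It looks like the F value of every tag after the 10th is 0.0. Does that mean those tags were never matched?
As follows:
The evaluation.py code is as follows:
This was most likely because the data I used for testing was wrong. After switching to data_small_selected.json I tried twice and got 0.71, which is very close, but I have never reproduced 0.73.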
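On the earlier question about tags whose F value is 0.0: a per-tag F value of zero simply means there were no true positives for that tag, which happens when the tag is never predicted or never appears in the gold labels of the evaluation file. The sketch below illustrates this with a generic multi-label F1. It is not the repository's evaluation.py, and the tag names and toy data are hypothetical.

```python
def per_tag_f1(gold, pred, tags):
    """Per-tag F1 for multi-label data.

    gold and pred are lists of tag sets, one set per sentence.
    """
    scores = {}
    for tag in tags:
        tp = sum(1 for g, p in zip(gold, pred) if tag in g and tag in p)
        fp = sum(1 for g, p in zip(gold, pred) if tag not in g and tag in p)
        fn = sum(1 for g, p in zip(gold, pred) if tag in g and tag not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[tag] = 2 * precision * recall / (precision + recall)
        else:
            scores[tag] = 0.0  # no true positives for this tag
    return scores


if __name__ == "__main__":
    # Toy data: "LN11" and "LN12" occur in the gold labels but are never predicted.
    gold = [{"LN1"}, {"LN1", "LN11"}, set(), {"LN12"}]
    pred = [{"LN1"}, {"LN1"}, set(), set()]
    print(per_tag_f1(gold, pred, ["LN1", "LN11", "LN12"]))
```

With an evaluation file whose labels do not line up with the training data, many rare tags end up in this zero-true-positive case at once, which fits the observation that the scores recovered after switching test files.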
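The test data are all different... My score is the one from the official website's test set, and I did not publish some of the trick code on GitHub; I only described it in the README.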
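On 2020-07-15 at 09:55, zhouyang-bigdata wrote:
This is probably a problem with the file name of my training data. I will train a few more times and check again.
After using the 6.08M data (divorce), the accuracy dropped by only 0.1. Amazing... Is this reasonable? I could not reproduce your accuracy of 0.73.
This was most likely because the data I used for testing was wrong. After switching to data_small_selected.json I tried twice and got 0.71, which is very close, but I have never reproduced 0.73.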
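One more thing I would like to ask, regarding:
My QQ is 2648759823. What is yours?
Is this the model's normal output?
Hi, I am building a demo for extracting key elements from news and articles, and I am stuck here. Could we chat on QQ?
Hi, could we chat on QQ? I would like to ask for some advice.
Hi, could we chat on QQ? I would like to ask for some advice. My QQ is 2648759823.
How do I get the program to run on the GPU? The server is a Tencent Cloud GPU instance and I used your command, but no matter what I try it always runs on the CPU. Do I need to install any additional modules?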