This repository has been archived by the owner on Feb 1, 2022. It is now read-only.
Does this error occur randomly, or does it appear only after a particular operation? I ran some tests and found that the status goes wrong only when a worker breaks down and the scheduler completes at the same time. Does your scheduler stop along with the worker?
A completed scheduler leads to the mxjob being marked as succeeded. It's possible that when the scheduler completed, the worker was still running rather than in an error state, so the mxjob status was set to succeeded.
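To make the timing issue concrete, here is a minimal Python sketch of the situation described above. It is not the mxnet-operator's actual code; `Phase`, `evaluate`, and `evaluate_strict` are hypothetical names, and the rule "scheduler completed means job succeeded" is taken only from the explanation in this thread.

```python
from enum import Enum


class Phase(Enum):
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


def evaluate(replicas):
    """Hypothetical rule from this thread: a completed scheduler
    marks the whole job as succeeded, whatever the workers are doing."""
    if replicas["scheduler"] is Phase.SUCCEEDED:
        return Phase.SUCCEEDED
    workers = [p for name, p in replicas.items() if name.startswith("worker")]
    if any(p is Phase.FAILED for p in workers):
        return Phase.FAILED
    return Phase.RUNNING


def evaluate_strict(replicas):
    """Assumed alternative: check worker failures first, and only report
    success once the scheduler is done and no worker is still running."""
    workers = [p for name, p in replicas.items() if name.startswith("worker")]
    if any(p is Phase.FAILED for p in workers):
        return Phase.FAILED
    if replicas["scheduler"] is Phase.SUCCEEDED and all(
        p is Phase.SUCCEEDED for p in workers
    ):
        return Phase.SUCCEEDED
    return Phase.RUNNING


# Snapshot taken at the moment the scheduler completes: the worker has
# not been observed as failed yet, so it still looks like it is running.
snapshot = {"scheduler": Phase.SUCCEEDED, "worker-0": Phase.RUNNING}
print(evaluate(snapshot).value)
print(evaluate_strict(snapshot).value)
```

With the first rule the job is reported as succeeded from that snapshot even if the worker errors out a moment later, which matches the behaviour reported in this issue. The stricter rule keeps the job in a running state until the worker's final phase is known.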
kubeflow version: 0.5.0
mxnet-operator version: v1beta1
kubernetes dashboard display:
worker-0 log:
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, data_dir='/admin/public/model/mxnet_distributed/data', disp_batches=10, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='1, 28, 28', initializer='default', kv_store='dist_device_sync', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=6000, num_layers=2, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
Traceback (most recent call last):
File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 99, in <module>
fit.fit(args, sym, get_mnist_iter)
File "/admin/public/model/mxnet_model/mxnet_distributed/common/fit.py", line 180, in fit
(train, val) = data_loader(args, kv)
File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 57, in get_mnist_iter
'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 37, in read_data
with gzip.open(os.path.join(args.data_dir,label)) as flbl:
File "/opt/conda/lib/python3.6/gzip.py", line 53, in open
binary_file = GzipFile(filename, gz_mode, compresslevel)
File "/opt/conda/lib/python3.6/gzip.py", line 163, in __init__
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/admin/public/model/mxnet_distributed/data/train-labels-idx1-ubyte.gz'
mxjob status: