Universal Transformer appears to be buggy and not converging correctly #1191
Summary
Universal Transformer appears to be buggy and not converging correctly:
- Universal Transformer (UT) does not converge on multi_nli as of the latest tensor2tensor master (9729521). See below for reproduction steps.
- UT does converge on multi_nli as of the August 3 2018 commit 5fff1ca. We did not run this to completion, but it was making meaningful progress, unlike the run below, so we stopped it and considered it successful.
- To confirm this was not simply an odd issue with multi_nli, we tried UT on a number of other problems (exact repro not shown below), including ‘lambada_rc’ and ‘stanford_nli’, run at commit ca628e4 from around October 16 2018. All of these failed to converge.
Environment information
Docker image: based on nvidia/cuda:9.0-devel-ubuntu16.04
TF version: tensorflow-gpu=1.11.0
T2T version: tensor2tensor master at commit 9729521 (Oct 30 2018)
We also saw the same failure on tf-nightly-gpu==1.13.0.dev20181022.
Reproduce
Problem: multi_nli
Model: universal_transformer
Hparams_set: universal_transformer_tiny
python3 /usr/src/t2t/tensor2tensor/bin/t2t-trainer \
  --data_dir="DATA_DIR" \
  --eval_early_stopping_steps="10000" \
  --eval_steps="10000" \
  --generate_data="True" \
  --hparams="" \
  --hparams_set="universal_transformer_tiny" \
  --iterations_per_loop="2000" \
  --keep_checkpoint_max="80" \
  --local_eval_frequency="2000" \
  --model="universal_transformer" \
  --output_dir="OUTPUT_DIR" \
  --problem="multi_nli" \
  --t2t_usr_dir="T2T_USR_DIR" \
  --tmp_dir="T2T_TMP_DIR"
The run was stopped after 50000 steps because it was not converging: the loss fluctuated between 1.098 and 1.099. The final evaluation line was:
INFO:tensorflow:Saving dict for global step 50000: global_step = 50000, loss = 1.0991247, metrics-multi_nli/targets/accuracy = 0.31821653, metrics-multi_nli/targets/accuracy_per_sequence = 0.31821653, metrics-multi_nli/targets/accuracy_top5 = 1.0, metrics-multi_nli/targets/approx_bleu_score = 0.7479816, metrics-multi_nli/targets/neg_log_perplexity = -1.099124, metrics-multi_nli/targets/rouge_2_fscore = 0.0, metrics-multi_nli/targets/rouge_L_fscore = 0.31869644
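For context on why these numbers suggest the model is not learning at all, here is a minimal sketch that compares the reported metrics against chance level. It assumes multi_nli is a three-way classification task (entailment, neutral, contradiction), so a uniform guess would give a cross-entropy of ln 3 and an accuracy of 1/3. The variable names and the regex are illustrative and not part of tensor2tensor.

```python
import math
import re

# Eval line copied from the report above, truncated to the fields used here.
log_line = (
    "INFO:tensorflow:Saving dict for global step 50000: global_step = 50000, "
    "loss = 1.0991247, metrics-multi_nli/targets/accuracy = 0.31821653"
)

# Pull out "name = value" pairs; names may contain letters, digits, _, - and /.
metrics = {
    name: float(value)
    for name, value in re.findall(r"([\w/\-]+) = ([-+]?\d+(?:\.\d+)?)", log_line)
}

# Assumption: three target classes, so a uniform prediction has loss ln(3).
num_classes = 3
chance_loss = math.log(num_classes)
chance_accuracy = 1.0 / num_classes

loss_gap = abs(metrics["loss"] - chance_loss)
accuracy_gap = abs(metrics["metrics-multi_nli/targets/accuracy"] - chance_accuracy)

print(f"chance loss     = {chance_loss:.4f}, gap to reported loss     = {loss_gap:.4f}")
print(f"chance accuracy = {chance_accuracy:.4f}, gap to reported accuracy = {accuracy_gap:.4f}")
```

Under that assumption, a loss pinned at roughly ln 3 with accuracy near 1/3 is what a model emitting a near-uniform distribution over the three labels would produce, which fits the report that the run made no progress over 50000 steps.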