Universal Transformer appears to be buggy and not converging correctly

Summary
---------
Universal Transformer appears to be buggy and not converging correctly:

- Universal Transformer does not converge on multi_nli as of the latest tensor2tensor master (9729521bc3cd4952c42dcfda53699e14bee7b409). See the reproduction below.

- UT does converge on multi_nli as of the August 3 2018 commit 5fff1cad2977f063b981e5d8b839bf9d7008e232 (we didn’t run it to completion, but unlike the run below it was making meaningful progress, so we stopped it and counted it as successful).

- To confirm this was not simply an odd issue with multi_nli, we tried UT on a number of other problems (exact repro not shown below), including ‘lambada_rc’ and ‘stanford_nli’ (run at commit ca628e4fcb04ff42ed21549a4f73e6dfa68a5f7a from around October 16 2018). All of these failed to converge.
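
The informal criterion used above ("making meaningful progress" versus "failed to converge") could be made concrete with a small check like the one below. This is only a sketch: the function name, the half-and-half split, and the `min_improvement` threshold are assumptions for illustration, not part of tensor2tensor, and the loss lists are made-up values rather than logged results.

```python
def is_making_progress(eval_losses, min_improvement=0.01):
    """Return True if the best loss in the second half of the run beats
    the best loss in the first half by at least min_improvement."""
    if len(eval_losses) < 2:
        return False
    mid = len(eval_losses) // 2
    early_best = min(eval_losses[:mid])
    late_best = min(eval_losses[mid:])
    return early_best - late_best >= min_improvement


# Hypothetical eval-loss sequences (illustrative only, not taken from logs).
flat_run = [1.0990, 1.0985, 1.0991, 1.0988, 1.0991]
improving_run = [1.09, 1.02, 0.95, 0.90, 0.86]

print(is_making_progress(flat_run))
print(is_making_progress(improving_run))
```

A check of this kind is deliberately crude; it only separates a loss curve that is flat from one that is still trending down.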

Environment information
-------------------------

Docker image based off nvidia/cuda:9.0-devel-ubuntu16.04

TF version: tensorflow-gpu==1.11.0

T2T version: tensor2tensor master at commit 9729521bc3cd4952c42dcfda53699e14bee7b409 (Oct 30 2018).

We also saw the same failure on tf-nightly-gpu==1.13.0.dev20181022.

Reproduce
-----------
Problem: multi_nli
Model: universal_transformer
Hparams_set: universal_transformer_tiny

python3 /usr/src/t2t/tensor2tensor/bin/t2t-trainer \
--data_dir="DATA_DIR" \
--eval_early_stopping_steps="10000" \
--eval_steps="10000" \
--generate_data="True" \
--hparams="" \
--hparams_set="universal_transformer_tiny" \
--iterations_per_loop="2000" \
--keep_checkpoint_max="80" \
--local_eval_frequency="2000" \
--model="universal_transformer" \
--output_dir="OUTPUT_DIR" \
--problem="multi_nli" \
--t2t_usr_dir="T2T_USR_DIR" \
--tmp_dir="T2T_TMP_DIR"

The run was stopped after 50000 steps for lack of convergence: the loss fluctuated between 1.098 and 1.099 throughout.

INFO:tensorflow:Saving dict for global step 50000: global_step = 50000, loss = 1.0991247, metrics-multi_nli/targets/accuracy = 0.31821653, metrics-multi_nli/targets/accuracy_per_sequence = 0.31821653, metrics-multi_nli/targets/accuracy_top5 = 1.0, metrics-multi_nli/targets/approx_bleu_score = 0.7479816, metrics-multi_nli/targets/neg_log_perplexity = -1.099124, metrics-multi_nli/targets/rouge_2_fscore = 0.0, metrics-multi_nli/targets/rouge_L_fscore = 0.31869644
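
For context, the reported loss sits almost exactly at ln(3) ≈ 1.0986, which is the cross-entropy of a uniform guess over three labels, and the accuracy is close to 1/3. The sketch below parses a shortened copy of the eval line above and checks that arithmetic. It assumes multi_nli is scored as a three-class problem with cross-entropy in nats (consistent with accuracy_top5 = 1.0, but not stated in this report); `parse_eval_metrics` is a hypothetical helper, not a tensor2tensor function.

```python
import math
import re

# Shortened copy of the step-50000 eval line from this report.
log_line = (
    "INFO:tensorflow:Saving dict for global step 50000: global_step = 50000, "
    "loss = 1.0991247, metrics-multi_nli/targets/accuracy = 0.31821653, "
    "metrics-multi_nli/targets/accuracy_top5 = 1.0"
)


def parse_eval_metrics(line):
    """Extract 'name = number' pairs from an eval log line into a dict of floats."""
    pairs = re.findall(r"([\w/\-]+) = (-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)", line)
    return {name: float(value) for name, value in pairs}


NUM_CLASSES = 3  # assumption: entailment / neutral / contradiction

metrics = parse_eval_metrics(log_line)
chance_loss = math.log(NUM_CLASSES)   # loss of a uniform prediction, in nats
chance_accuracy = 1.0 / NUM_CLASSES

print(f"loss:     reported {metrics['loss']:.4f}, chance {chance_loss:.4f}")
print(f"accuracy: reported {metrics['metrics-multi_nli/targets/accuracy']:.4f}, "
      f"chance {chance_accuracy:.4f}")
```

If the three-class assumption holds, this suggests the model is emitting a near-uniform prediction regardless of input, rather than learning slowly.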

