TL;DR: The authors propose a novel encoder-decoder neural network architecture. The encoder RNN encodes a variable-length sequence into a fixed-length vector representation, and the decoder generates a new variable-length sequence conditioned on this representation. The authors also introduce a new hidden unit (now called the GRU) to be used with this architecture. The model is evaluated on a statistical machine translation task, where its phrase scores are fed as an additional feature to the log-linear model of a phrase-based SMT system, leading to improved BLEU scores. The authors also find that the model learns syntactically and semantically meaningful representations of both words and phrases.
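A minimal sketch of the encoder-decoder in scoring mode, assuming plain tanh cells in place of the GRU and toy dimensions; all parameter names (`W_enc`, `E_src`, etc.) are made up for illustration and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
emb, hid, vocab = 8, 16, 50  # toy sizes; the paper uses 100-d embeddings and much larger hidden states

# Hypothetical parameters; plain tanh cells stand in for the GRU here.
W_enc = rng.normal(scale=0.1, size=(hid, emb + hid))
W_dec = rng.normal(scale=0.1, size=(hid, emb + hid + hid))  # prev word, prev state, summary c
W_out = rng.normal(scale=0.1, size=(vocab, hid))
E_src = rng.normal(scale=0.1, size=(vocab, emb))
E_tgt = rng.normal(scale=0.1, size=(vocab, emb))

def encode(src_ids):
    """Encoder RNN: fold the source sequence into a fixed-length vector c."""
    h = np.zeros(hid)
    for i in src_ids:
        h = np.tanh(W_enc @ np.concatenate([E_src[i], h]))
    return h

def score(tgt_ids, c):
    """Decoder in scoring mode: accumulate log p(y_t | y_<t, c)."""
    h, y_prev, logp = np.zeros(hid), np.zeros(emb), 0.0
    for t in tgt_ids:
        h = np.tanh(W_dec @ np.concatenate([y_prev, h, c]))
        logits = W_out @ h
        logits -= logits.max()                        # numerical stability
        logp += logits[t] - np.log(np.exp(logits).sum())
        y_prev = E_tgt[t]
    return logp

print(score([4, 7, 2], encode([3, 1, 9])))  # log-probability of a toy target given a toy source
```

Generation mode would instead sample (or take the argmax) from the softmax at each step and feed the chosen word back in as the next input.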
- New encoder-decoder architecture, seq2seq. Decoder conditioned on thought vector.
- Architecture can be used both to score a given source/target pair and to generate a target sequence from a source
- New hidden unit type, now called GRU: a simplified alternative to the LSTM with reset and update gates (see the sketch after this list)
- The architecture could in principle replace the whole SMT pipeline, but this paper only uses it to rescore phrase pairs within an existing phrase-based system
- 15k vocabulary (covers 93% of the dataset), 100-dim embeddings, 500 maxout units in the final affine layer, batch size 64, adagrad, 384M words, 3 days of training time
- The model is trained without phrase-pair frequency information, so it is expected to capture linguistic regularities rather than just corpus statistics
- Visualizations of both word embeddings and phrase representations (thought vectors)
- Why not just use LSTM units?
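A minimal sketch of the gated unit (GRU), following the paper's bias-free formulation; the parameter names in `p` are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One step of the proposed hidden unit; p holds the weight matrices."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)            # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)            # reset gate
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev))  # candidate state, with reset applied
    return z * h_prev + (1.0 - z) * h_tilde                # interpolate old and candidate state

# Toy usage: fold a random sequence of 5 inputs into a hidden state.
dim_x, dim_h = 4, 6
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(dim_h, dim_x if k.startswith("W") else dim_h))
     for k in ("Wz", "Uz", "Wr", "Ur", "W", "U")}
h = np.zeros(dim_h)
for x in rng.normal(size=(5, dim_x)):
    h = gru_step(x, h, p)
print(h)
```

Compared to the LSTM, there is no separate memory cell and no output gate, which is presumably what the "simplified LSTM" note above refers to. Note that some later formulations swap the roles of z and (1 - z) in the final interpolation.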