
Attempt at improved input/output handling #477

Merged
merged 32 commits from refactor-sent into master on Jul 30, 2018
Conversation

@msperber (Contributor) commented Jul 21, 2018

I've encountered several problems related to the Input / Output interfaces when working with multiple corpora/vocabs/models as well as reporting:

  • The information needed to convert inputs and outputs to strings (vocabs, output processors) is passed around separately. This was not a problem with simple seq2seq models, but making sure that the correct vocab or output processor is used in the right place is getting increasingly complicated (in the case of reporting, for example, I couldn't find a good solution).
  • The distinction between input and output is not always clear. For example, in cascaded model types outputs are re-used as inputs, and with forced decoding both are identical in some sense. The main feature of outputs is their string representation, but this is useful for inputs as well (e.g. for reporting).
  • Inputs have no sentence number, but one is needed in inference and reporting, where it is currently passed separately.
  • There is currently no way of chaining multiple output processors, but this would be useful, e.g. to undo BPE and then undo tokenization (see the sketch after this list).
  • Padding tokens are controlled by the batcher, but this creates problems with compound inputs, and it's also not clear what the benefit of configurable padding tokens is. Instead, padding tokens should be handled inside the inputs directly.
  • CharFromWordTextReader is a bit non-standard: it represents word boundaries through explicit boundary indices, while the standard XNMT character models represent spaces through a special token. It should be moved to the specialized_encoders package to prevent people from using it unless they know what they are doing. The boundary-index behavior is needed for the segmenting encoder, but I think the special-token behavior is what one would expect, and it should be the default. @philip30, please check whether this would still work for you? You would probably just replace CharFromWordTextReader {} with CharTextReader { space_token: ~ } in your code.
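(Illustrative aside: a minimal sketch of what chained output processors could look like. The processor classes and names below are hypothetical, not the actual xnmt API.)

```python
# Hypothetical sketch of chained output processors (not the actual xnmt API):
# each processor maps a string to a string, and a chain applies them in order,
# e.g. first undoing BPE, then undoing tokenization.

class OutputProcessor:
    def process(self, s: str) -> str:
        raise NotImplementedError

class JoinBpeProcessor(OutputProcessor):
    """Undo BPE segmentation by removing the '@@ ' continuation marker."""
    def process(self, s: str) -> str:
        return s.replace("@@ ", "")

class DetokenizeProcessor(OutputProcessor):
    """Very rough detokenizer: re-attach punctuation to the preceding token."""
    def process(self, s: str) -> str:
        for punct in (" .", " ,", " !", " ?"):
            s = s.replace(punct, punct.strip())
        return s

class ProcessorChain(OutputProcessor):
    """Apply a list of processors left to right."""
    def __init__(self, processors):
        self.processors = processors
    def process(self, s: str) -> str:
        for p in self.processors:
            s = p.process(s)
        return s

chain = ProcessorChain([JoinBpeProcessor(), DetokenizeProcessor()])
print(chain.process("a power@@ ful idea ."))  # -> 'a powerful idea.'
```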

This PR attempts to solve the above by merging the Input and Output classes into a new Sentence class, plus some interface cleanup.
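(Illustrative aside: to make the intent concrete, here is a minimal sketch of a merged sentence type that carries its own vocab, sentence number, and padding logic. All names are hypothetical and do not reproduce the classes in this PR.)

```python
# Minimal sketch of the merged-Sentence idea (hypothetical names, not the
# actual classes from this PR): the sentence bundles the word ids with the
# vocab needed for string conversion, carries its own sentence number, and
# handles padding itself instead of leaving it to the batcher.

from typing import List, Optional, Sequence

class SimpleSentence:
    PAD_ID = 0  # assumed padding id; real code would derive this from the vocab

    def __init__(self, word_ids: List[int], vocab: Sequence[str],
                 idx: Optional[int] = None) -> None:
        self.word_ids = word_ids
        self.vocab = vocab  # needed to render the sentence as a string
        self.idx = idx      # sentence number, used in inference and reporting

    def sent_len(self) -> int:
        return len(self.word_ids)

    def padded(self, pad_len: int) -> "SimpleSentence":
        """Return a copy padded to pad_len; the pad token is chosen here,
        inside the sentence, rather than by the batcher."""
        pad = [self.PAD_ID] * max(0, pad_len - len(self.word_ids))
        return SimpleSentence(self.word_ids + pad, self.vocab, self.idx)

    def sent_str(self) -> str:
        """String representation, useful for both inputs and outputs."""
        return " ".join(self.vocab[w] for w in self.word_ids)
```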

@philip30 (Contributor) commented

I think it is fine. Though, why don't we just use the PlainTextReader for that purpose? I guess CharFromWordTextReader is the one that is specially designed for the SegmentingTransducer. Previously I handled this using two kinds of input, but I think that won't be necessary any longer, because it is always annoying to create two kinds of input (the space-separated characters plus the segmentation boundaries).

@msperber (Contributor, Author) commented Jul 23, 2018

This is ready from my side. Although it changes (merges) some of the core data structures, it is not a breaking change, and it will be possible to use the same config files after this commit is merged (update: one breaking change is that batcher.src_pad_token is no longer supported, but I don't expect this to cause any major problems).

The main things I could see being discussed:

  • whether there are any problems with merging Input and Output. I'm not sure what the original intention was for separating these.
  • if Sentence is an acceptable name, given that this results in class names like ScalarSentence etc.

@msperber (Contributor, Author) commented

@philip30 Yeah, I think you're right; it would be better to just use PlainTextInputReader with some special options to read characters, etc. I've restored your original reader and moved it inside the specialized_encoders package because, as you said, it's specially designed for the SegmentingTransducer.
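(Illustrative aside: a tiny sketch of the character-reading behavior being discussed, with spaces mapped to a special token. The function and token name are hypothetical, not actual reader options.)

```python
# Hypothetical sketch (not the actual xnmt reader): split a line into
# character tokens, representing spaces with a special token, which is the
# behavior the standard XNMT character models expect.

from typing import List

SPACE_TOKEN = "__space__"  # hypothetical name for the special space token

def read_chars(line: str, space_token: str = SPACE_TOKEN) -> List[str]:
    """Tokenize a line into characters, mapping ' ' to space_token."""
    return [space_token if c == " " else c for c in line.strip()]

print(read_chars("a cat"))  # -> ['a', '__space__', 'c', 'a', 't']
```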

@msperber changed the title from "[WIP] Attempt at improved input/output handling" to "Attempt at improved input/output handling" on Jul 23, 2018
@neubig (Contributor) left a comment


LGTM, thanks for cleaning up the code. I'm happy to have you merge after you consider the two comments I left.

xnmt/reports.py Outdated
@@ -62,16 +76,30 @@ def add_sent_for_report(self, sent_info: Dict[str,Any]) -> None:
self._sent_info_list = []
self._sent_info_list.append(sent_info)

def report_global_info(self, glob_info: Dict[str, Any]) -> None:
@neubig (Contributor):

I'm a little bit hesitant about "Globals", as I've found that these can break modularity and add complex data dependencies. That being said, this might be the best solution here? If you don't think there's another better way, we can go with this.

@msperber (Author):

Yes, I agree it's not a very good design, but this is a problem with the whole reporting mechanism, which spreads global information on either a per-sentence basis (report_sent_info()) or a per-corpus basis (report_global_info()). Should I just rename the latter to report_corpus_info and leave improvements to the reporting mechanism to a future PR? I honestly don't have a good idea right now for how to make it more modular without adding complexity.
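(Illustrative aside: a minimal sketch of the two reporting granularities under discussion. Only the method names report_sent_info / report_global_info / report_corpus_info come from the thread; the surrounding class is hypothetical.)

```python
# Hypothetical sketch of the two reporting granularities; not the actual
# xnmt reporting code.

from typing import Any, Dict, List

class Reporter:
    def __init__(self) -> None:
        self._sent_info_list: List[Dict[str, Any]] = []
        self._corpus_info: Dict[str, Any] = {}

    def report_sent_info(self, sent_info: Dict[str, Any]) -> None:
        """Collect information that applies to a single sentence."""
        self._sent_info_list.append(sent_info)

    def report_corpus_info(self, corpus_info: Dict[str, Any]) -> None:
        """Collect information that applies to the whole corpus
        (the proposed rename of report_global_info)."""
        self._corpus_info.update(corpus_info)
```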

@neubig (Contributor):

Sure.

@@ -233,25 +244,6 @@ def count_words(self, trg_words):
def vocab_size(self):
return len(self.vocab)

class CharFromWordTextReader(PlainTextReader, Serializable):
@neubig (Contributor):

I don't think this necessarily needs to be moved to the specialized place; it seems general enough that it could be used, for example, in any model that uses character-based representations of words. However, it does need to be documented and given type annotations. I don't think it should be a SimpleSentence, but rather a SegmentedSentence or something.

I also think a better representation would perhaps be a list of lists, where each interior list represents a single word; SegmentedSentence could provide options to convert between the two representations seamlessly (see the sketch below).
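(Illustrative aside: a short sketch of the suggested list-of-lists representation, with conversion between the nested and flat forms. The class body is hypothetical, not code from the PR.)

```python
# Hypothetical sketch of the suggested SegmentedSentence representation:
# one interior list per word, with conversion to and from a flat list that
# marks word boundaries with a special boundary id.

from typing import List

class SegmentedSentence:
    def __init__(self, words: List[List[int]]) -> None:
        self.words = words  # each interior list holds one word's character ids

    def as_flat(self, boundary_id: int) -> List[int]:
        """Flatten to a single list, inserting boundary_id between words."""
        flat: List[int] = []
        for i, word in enumerate(self.words):
            if i > 0:
                flat.append(boundary_id)
            flat.extend(word)
        return flat

    @classmethod
    def from_flat(cls, chars: List[int], boundary_id: int) -> "SegmentedSentence":
        """Rebuild the per-word lists from a flat list with boundary tokens."""
        words: List[List[int]] = [[]]
        for c in chars:
            if c == boundary_id:
                words.append([])
            else:
                words[-1].append(c)
        return cls(words)
```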

@msperber (Author):

Sure! I would just move it back and add a TODO for @philip30 if that's ok?

@neubig (Contributor):

Sure.

@msperber (Contributor, Author) commented

Done.

msperber added 2 commits July 30, 2018 10:16
Conflicts:
	test/test_batchers.py
	test/test_beam_search.py
	test/test_decoders.py
	test/test_input_reader.py
	test/test_segmenting.py
	test/test_training.py
	xnmt/__init__.py
	xnmt/batchers.py
	xnmt/classifiers.py
	xnmt/eval_tasks.py
	xnmt/infererences.py
	xnmt/input.py
	xnmt/input_readers.py
	xnmt/loss_calculators.py
	xnmt/loss_trackers.py
	xnmt/model_base.py
	xnmt/output.py
	xnmt/reports.py
	xnmt/sequence_labelers.py
	xnmt/translators.py
@msperber merged commit b74fd36 into master on Jul 30, 2018
@neubig deleted the refactor-sent branch on July 30, 2018 12:41