
Attempt at improved input/output handling #477

Merged
merged 32 commits from refactor-sent into master on Jul 30, 2018
Conversation

@msperber (Contributor) commented Jul 21, 2018

I've encountered several problems related to the Input / Output interfaces when working with multiple corpora/vocabs/models as well as reporting:

  • The information needed to convert inputs and outputs to strings (vocabs, output processors) is passed around separately. This was not a problem with simple seq2seq models, but making sure that the correct vocab or output processor is used in the right place is getting increasingly complicated (in the case of reporting, for example, I couldn't find a good solution).
  • The distinction between input and output is not always clear. For example, in cascaded model types outputs are re-used as inputs, and with forced decoding both are identical in some sense. The main feature of outputs is their string representation, but this is useful for inputs as well (e.g. for reporting).
  • Inputs have no sentence number, but one is needed in inference and reporting, where it is currently passed separately.
  • There is currently no way of chaining multiple output processors, but this would be useful, e.g. to undo BPE and then undo tokenization (see the sketch after this list).
  • Padding tokens are controlled by the batcher, but this creates problems with compound inputs, and it's also not clear what the benefit of configurable padding tokens is. Instead, padding tokens should be handled inside the inputs directly.
  • CharFromWordTextReader is a bit non-standard: it represents word boundaries through explicit boundary indices, while the standard XNMT character models represent spaces through a special token. It should be moved to the specialized_encoders package to prevent people from using it unless they know what they are doing. The boundary-index behavior is needed for the segmenting encoder, but I think the special-token behavior is what one would expect, and it should be the default. @philip30, please check whether this would still work for you? You would probably just replace CharFromWordTextReader {} with CharTextReader { space_token: ~ } in your code.
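(Illustrative aside: a minimal sketch of what chained output processors could look like. The processor classes and names below are hypothetical, not the actual xnmt API.)

```python
# Hypothetical sketch of chained output processors (not the actual xnmt API):
# each processor maps a string to a string, and a chain applies them in order,
# e.g. first undoing BPE, then undoing tokenization.

class OutputProcessor:
    def process(self, s: str) -> str:
        raise NotImplementedError

class JoinBpeProcessor(OutputProcessor):
    """Undo BPE segmentation by removing the '@@ ' continuation marker."""
    def process(self, s: str) -> str:
        return s.replace("@@ ", "")

class DetokenizeProcessor(OutputProcessor):
    """Very rough detokenizer: re-attach punctuation to the preceding token."""
    def process(self, s: str) -> str:
        for punct in (" .", " ,", " !", " ?"):
            s = s.replace(punct, punct.strip())
        return s

class ProcessorChain(OutputProcessor):
    """Apply a list of processors left to right."""
    def __init__(self, processors):
        self.processors = processors
    def process(self, s: str) -> str:
        for p in self.processors:
            s = p.process(s)
        return s

chain = ProcessorChain([JoinBpeProcessor(), DetokenizeProcessor()])
print(chain.process("a power@@ ful idea ."))  # -> 'a powerful idea.'
```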

This PR attempts to solve the above by merging the Input and Output classes into a new Sentence class, plus some interface cleanup.
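(Illustrative aside: to make the intent concrete, here is a minimal sketch of a merged sentence type that carries its own vocab, sentence number, and padding logic. All names are hypothetical and do not reproduce the classes in this PR.)

```python
# Minimal sketch of the merged-Sentence idea (hypothetical names, not the
# actual classes from this PR): the sentence bundles the word ids with the
# vocab needed for string conversion, carries its own sentence number, and
# handles padding itself instead of leaving it to the batcher.

from typing import List, Optional, Sequence

class SimpleSentence:
    PAD_ID = 0  # assumed padding id; real code would derive this from the vocab

    def __init__(self, word_ids: List[int], vocab: Sequence[str],
                 idx: Optional[int] = None) -> None:
        self.word_ids = word_ids
        self.vocab = vocab  # needed to render the sentence as a string
        self.idx = idx      # sentence number, used in inference and reporting

    def sent_len(self) -> int:
        return len(self.word_ids)

    def padded(self, pad_len: int) -> "SimpleSentence":
        """Return a copy padded to pad_len; the pad token is chosen here,
        inside the sentence, rather than by the batcher."""
        pad = [self.PAD_ID] * max(0, pad_len - len(self.word_ids))
        return SimpleSentence(self.word_ids + pad, self.vocab, self.idx)

    def sent_str(self) -> str:
        """String representation, useful for both inputs and outputs."""
        return " ".join(self.vocab[w] for w in self.word_ids)
```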

@philip30 (Contributor) commented

I think it is fine. Though, why don't we just use the PlainTextReader for that purpose? I guess CharFromWordTextReader is the one that is specially designed for the SegmentingTransducer. Previously I handled this using two kinds of input, but I think that won't be necessary any longer, because it is always annoying to create two kinds of input (the space-separated characters plus the segmentation boundaries).

@msperber (Contributor, Author) commented Jul 23, 2018

This is ready from my side. Although it changes (merges) some of the core data structures, it is not a breaking change, and it will be possible to use the same config files after this commit is merged (update: one breaking change is that batcher.src_pad_token is no longer supported, but I don't expect this to cause any major problems).

The main things I could see being discussed:

  • whether there are any problems with merging Input and Output. I'm not sure what the original intention was for separating these.
  • if Sentence is an acceptable name, given that this results in class names like ScalarSentence etc.

@msperber (Contributor, Author) commented

@philip30 Yeah, I think you're right; it would be better to just use PlainTextInputReader with some special options to read characters, etc. I've restored your original reader and moved it inside the specialized_encoders package because, as you said, it's specially designed for the SegmentingTransducer.
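(Illustrative aside: a tiny sketch of the character-reading behavior being discussed, with spaces mapped to a special token. The function and token name are hypothetical, not actual reader options.)

```python
# Hypothetical sketch (not the actual xnmt reader): split a line into
# character tokens, representing spaces with a special token, which is the
# behavior the standard XNMT character models expect.

from typing import List

SPACE_TOKEN = "__space__"  # hypothetical name for the special space token

def read_chars(line: str, space_token: str = SPACE_TOKEN) -> List[str]:
    """Tokenize a line into characters, mapping ' ' to space_token."""
    return [space_token if c == " " else c for c in line.strip()]

print(read_chars("a cat"))  # -> ['a', '__space__', 'c', 'a', 't']
```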

@msperber changed the title from "[WIP] Attempt at improved input/output handling" to "Attempt at improved input/output handling" on Jul 23, 2018
@neubig (Contributor) left a comment


LGTM, thanks for cleaning up the code. I'm happy to have you merge after you consider the two comments I left.

xnmt/reports.py Outdated
@@ -62,16 +76,30 @@ def add_sent_for_report(self, sent_info: Dict[str,Any]) -> None:
self._sent_info_list = []
self._sent_info_list.append(sent_info)

def report_global_info(self, glob_info: Dict[str, Any]) -> None:
@neubig (Contributor):

I'm a little bit hesitant about "Globals", as I've found that these can break modularity and add complex data dependencies. That being said, this might be the best solution here? If you don't think there's another better way, we can go with this.

@msperber (Author):

Yes, I agree it's not a very good design, but this is a problem with the whole reporting mechanism, which spreads global information on either a per-sentence basis (report_sent_info()) or a per-corpus basis (report_global_info()). Should I just rename the latter to report_corpus_info and leave improvements to the reporting mechanism to a future PR? I honestly don't have a good idea right now for how to make it more modular without adding complexity.
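(Illustrative aside: a minimal sketch of the two reporting granularities under discussion. Only the method names report_sent_info / report_global_info / report_corpus_info come from the thread; the surrounding class is hypothetical.)

```python
# Hypothetical sketch of the two reporting granularities; not the actual
# xnmt reporting code.

from typing import Any, Dict, List

class Reporter:
    def __init__(self) -> None:
        self._sent_info_list: List[Dict[str, Any]] = []
        self._corpus_info: Dict[str, Any] = {}

    def report_sent_info(self, sent_info: Dict[str, Any]) -> None:
        """Collect information that applies to a single sentence."""
        self._sent_info_list.append(sent_info)

    def report_corpus_info(self, corpus_info: Dict[str, Any]) -> None:
        """Collect information that applies to the whole corpus
        (the proposed rename of report_global_info)."""
        self._corpus_info.update(corpus_info)
```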

@neubig (Contributor):

Sure.

@@ -233,25 +244,6 @@ def count_words(self, trg_words):
def vocab_size(self):
return len(self.vocab)

class CharFromWordTextReader(PlainTextReader, Serializable):
@neubig (Contributor):

I don't think this necessarily needs to be moved to the specialized place; it seems general enough that it could be used, for example, in any model that uses character-based representations of words. However, it does need to be documented and given type annotations. I don't think it should be a SimpleSentence, but rather a SegmentedSentence or something.

I also think a better representation would perhaps be a list of lists, where each interior list represents a single word; SegmentedSentence could provide options to convert between the two representations seamlessly (see the sketch below).
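(Illustrative aside: a short sketch of the suggested list-of-lists representation, with conversion between the nested and flat forms. The class body is hypothetical, not code from the PR.)

```python
# Hypothetical sketch of the suggested SegmentedSentence representation:
# one interior list per word, with conversion to and from a flat list that
# marks word boundaries with a special boundary id.

from typing import List

class SegmentedSentence:
    def __init__(self, words: List[List[int]]) -> None:
        self.words = words  # each interior list holds one word's character ids

    def as_flat(self, boundary_id: int) -> List[int]:
        """Flatten to a single list, inserting boundary_id between words."""
        flat: List[int] = []
        for i, word in enumerate(self.words):
            if i > 0:
                flat.append(boundary_id)
            flat.extend(word)
        return flat

    @classmethod
    def from_flat(cls, chars: List[int], boundary_id: int) -> "SegmentedSentence":
        """Rebuild the per-word lists from a flat list with boundary tokens."""
        words: List[List[int]] = [[]]
        for c in chars:
            if c == boundary_id:
                words.append([])
            else:
                words[-1].append(c)
        return cls(words)
```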

@msperber (Author):

Sure! I would just move it back and add a TODO for @philip30 if that's ok?

@neubig (Contributor):

Sure.

@msperber (Contributor, Author) commented

Done.

msperber added 2 commits July 30, 2018 10:16
Conflicts:
	test/test_batchers.py
	test/test_beam_search.py
	test/test_decoders.py
	test/test_input_reader.py
	test/test_segmenting.py
	test/test_training.py
	xnmt/__init__.py
	xnmt/batchers.py
	xnmt/classifiers.py
	xnmt/eval_tasks.py
	xnmt/infererences.py
	xnmt/input.py
	xnmt/input_readers.py
	xnmt/loss_calculators.py
	xnmt/loss_trackers.py
	xnmt/model_base.py
	xnmt/output.py
	xnmt/reports.py
	xnmt/sequence_labelers.py
	xnmt/translators.py
@msperber merged commit b74fd36 into master on Jul 30, 2018
@neubig deleted the refactor-sent branch on July 30, 2018 12:41