-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt at improved input/output handling #477
Conversation
I think it is fine. Though why we just don't use the PlainTextReader for that purpose? I guess that CharFromWordTextReader is the one that is specially designed for that SegmentingTransducer. Previously I handled that using 2 kinds of input but I think that won't be necessary any longer because it is always annoying to create 2 kinds of input (The space separated characters + Segmentation Boundaries). |
This would be ready from my side. Although it changes (merges) some of the core data structures, this is not a breaking change and it will be possible to use the same config files after this commit is merged (update: one breaking change is that The main things I could see might be discussed:
|
@philip30 yeah I think you're right, it would be better to just use PlainTextInputReader with some special options to read characters etc. I've restored your original reader and moved it inside the specialized_encoders package because as you said its specially designed for the SegmentingTransducer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for cleaning up the code. I'm happy to have you merge after you consider the two comments I left.
xnmt/reports.py
Outdated
@@ -62,16 +76,30 @@ def add_sent_for_report(self, sent_info: Dict[str,Any]) -> None: | |||
self._sent_info_list = [] | |||
self._sent_info_list.append(sent_info) | |||
|
|||
def report_global_info(self, glob_info: Dict[str, Any]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little bit hesitant about "Globals", as I've found that these can break modularity and add complex data dependencies. That being said, this might be the best solution here? If you don't think there's another better way, we can go with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree it's not a very good design, but this is a problem with the whole reporting mechanism which spreads global information on either a sentence base (report_sent_info()
) or corpus base (report_global_info()
). Should I just rename the latter to report_corpus_info
, and leave improvements to the reporting to a future PR? I honestly don't have a good idea right now for how to make it more modular without adding complexity to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
xnmt/input_reader.py
Outdated
@@ -233,25 +244,6 @@ def count_words(self, trg_words): | |||
def vocab_size(self): | |||
return len(self.vocab) | |||
|
|||
class CharFromWordTextReader(PlainTextReader, Serializable): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this necessarily needs to be moved to the specialized place, it seems general enough, for example it could be used in any model that uses character-based representations of words. However, it does need to be documented and given type annotations. I don't think it should be a SimpleSentence
but rather a SegmentedSentence
or something.
I also think perhaps a better representation would be a list of lists, where each interior list represents a single word, but perhaps SegmentedSentence
could provide options to convert between the two representations seamlessly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! I would just move it back and add a TODO for @philip30 if that's ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
Done. |
Conflicts: test/test_batchers.py test/test_beam_search.py test/test_decoders.py test/test_input_reader.py test/test_segmenting.py test/test_training.py xnmt/__init__.py xnmt/batchers.py xnmt/classifiers.py xnmt/eval_tasks.py xnmt/infererences.py xnmt/input.py xnmt/input_readers.py xnmt/loss_calculators.py xnmt/loss_trackers.py xnmt/model_base.py xnmt/output.py xnmt/reports.py xnmt/sequence_labelers.py xnmt/translators.py
I've encountered several problems related to the
Input
/Output
interfaces when working with multiple corpora/vocabs/models as well as reporting:CharFromWordTextReader
is a bit non-standard: it represents word boundaries through explicit boundary indices, while the standard XNMT character models represent spaces through a special token. It should be moved to the specialized_encoders package to prevent people from using it unless they know what they are doing.The former is needed for the segmenting encoder, but I think the latter is the behavior one would expect and should be the default. @philip30 please check if this would still work for you? You would probably just replaceCharFromWordTextReader {}
byCharTextReader { space_token: ~ }
in your code.This PR attempts to solve the above by merging
Input
andOutput
classes into a newSentence
class, plus some interface cleanup.