diff --git a/.gitignore b/.gitignore index e127122793cd..b7fa7a0e3202 100644 --- a/.gitignore +++ b/.gitignore @@ -27,6 +27,7 @@ eggs/ .eggs/ lib/ lib64/ +#parts/ sdist/ var/ wheels/ @@ -143,3 +144,4 @@ examples/*/outputs examples/*/wandb examples/*/data wandb +dump.py diff --git a/LICENSE b/LICENSE index 261eeb9e9f8b..8f031424e2b1 100644 --- a/LICENSE +++ b/LICENSE @@ -1,201 +1 @@ - Apache License - Version 2.0, January 2004 - http://www.apache.org/licenses/ - - TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION - - 1. Definitions. - - "License" shall mean the terms and conditions for use, reproduction, - and distribution as defined by Sections 1 through 9 of this document. - - "Licensor" shall mean the copyright owner or entity authorized by - the copyright owner that is granting the License. - - "Legal Entity" shall mean the union of the acting entity and all - other entities that control, are controlled by, or are under common - control with that entity. For the purposes of this definition, - "control" means (i) the power, direct or indirect, to cause the - direction or management of such entity, whether by contract or - otherwise, or (ii) ownership of fifty percent (50%) or more of the - outstanding shares, or (iii) beneficial ownership of such entity. - - "You" (or "Your") shall mean an individual or Legal Entity - exercising permissions granted by this License. - - "Source" form shall mean the preferred form for making modifications, - including but not limited to software source code, documentation - source, and configuration files. - - "Object" form shall mean any form resulting from mechanical - transformation or translation of a Source form, including but - not limited to compiled object code, generated documentation, - and conversions to other media types. - - "Work" shall mean the work of authorship, whether in Source or - Object form, made available under the License, as indicated by a - copyright notice that is included in or attached to the work - (an example is provided in the Appendix below). - - "Derivative Works" shall mean any work, whether in Source or Object - form, that is based on (or derived from) the Work and for which the - editorial revisions, annotations, elaborations, or other modifications - represent, as a whole, an original work of authorship. For the purposes - of this License, Derivative Works shall not include works that remain - separable from, or merely link (or bind by name) to the interfaces of, - the Work and Derivative Works thereof. - - "Contribution" shall mean any work of authorship, including - the original version of the Work and any modifications or additions - to that Work or Derivative Works thereof, that is intentionally - submitted to Licensor for inclusion in the Work by the copyright owner - or by an individual or Legal Entity authorized to submit on behalf of - the copyright owner. For the purposes of this definition, "submitted" - means any form of electronic, verbal, or written communication sent - to the Licensor or its representatives, including but not limited to - communication on electronic mailing lists, source code control systems, - and issue tracking systems that are managed by, or on behalf of, the - Licensor for the purpose of discussing and improving the Work, but - excluding communication that is conspicuously marked or otherwise - designated in writing by the copyright owner as "Not a Contribution." 
- - "Contributor" shall mean Licensor and any individual or Legal Entity - on behalf of whom a Contribution has been received by Licensor and - subsequently incorporated within the Work. - - 2. Grant of Copyright License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - copyright license to reproduce, prepare Derivative Works of, - publicly display, publicly perform, sublicense, and distribute the - Work and such Derivative Works in Source or Object form. - - 3. Grant of Patent License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - (except as stated in this section) patent license to make, have made, - use, offer to sell, sell, import, and otherwise transfer the Work, - where such license applies only to those patent claims licensable - by such Contributor that are necessarily infringed by their - Contribution(s) alone or by combination of their Contribution(s) - with the Work to which such Contribution(s) was submitted. If You - institute patent litigation against any entity (including a - cross-claim or counterclaim in a lawsuit) alleging that the Work - or a Contribution incorporated within the Work constitutes direct - or contributory patent infringement, then any patent licenses - granted to You under this License for that Work shall terminate - as of the date such litigation is filed. - - 4. Redistribution. You may reproduce and distribute copies of the - Work or Derivative Works thereof in any medium, with or without - modifications, and in Source or Object form, provided that You - meet the following conditions: - - (a) You must give any other recipients of the Work or - Derivative Works a copy of this License; and - - (b) You must cause any modified files to carry prominent notices - stating that You changed the files; and - - (c) You must retain, in the Source form of any Derivative Works - that You distribute, all copyright, patent, trademark, and - attribution notices from the Source form of the Work, - excluding those notices that do not pertain to any part of - the Derivative Works; and - - (d) If the Work includes a "NOTICE" text file as part of its - distribution, then any Derivative Works that You distribute must - include a readable copy of the attribution notices contained - within such NOTICE file, excluding those notices that do not - pertain to any part of the Derivative Works, in at least one - of the following places: within a NOTICE text file distributed - as part of the Derivative Works; within the Source form or - documentation, if provided along with the Derivative Works; or, - within a display generated by the Derivative Works, if and - wherever such third-party notices normally appear. The contents - of the NOTICE file are for informational purposes only and - do not modify the License. You may add Your own attribution - notices within Derivative Works that You distribute, alongside - or as an addendum to the NOTICE text from the Work, provided - that such additional attribution notices cannot be construed - as modifying the License. 
- - You may add Your own copyright statement to Your modifications and - may provide additional or different license terms and conditions - for use, reproduction, or distribution of Your modifications, or - for any such Derivative Works as a whole, provided Your use, - reproduction, and distribution of the Work otherwise complies with - the conditions stated in this License. - - 5. Submission of Contributions. Unless You explicitly state otherwise, - any Contribution intentionally submitted for inclusion in the Work - by You to the Licensor shall be under the terms and conditions of - this License, without any additional terms or conditions. - Notwithstanding the above, nothing herein shall supersede or modify - the terms of any separate license agreement you may have executed - with Licensor regarding such Contributions. - - 6. Trademarks. This License does not grant permission to use the trade - names, trademarks, service marks, or product names of the Licensor, - except as required for reasonable and customary use in describing the - origin of the Work and reproducing the content of the NOTICE file. - - 7. Disclaimer of Warranty. Unless required by applicable law or - agreed to in writing, Licensor provides the Work (and each - Contributor provides its Contributions) on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - implied, including, without limitation, any warranties or conditions - of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A - PARTICULAR PURPOSE. You are solely responsible for determining the - appropriateness of using or redistributing the Work and assume any - risks associated with Your exercise of permissions under this License. - - 8. Limitation of Liability. In no event and under no legal theory, - whether in tort (including negligence), contract, or otherwise, - unless required by applicable law (such as deliberate and grossly - negligent acts) or agreed to in writing, shall any Contributor be - liable to You for damages, including any direct, indirect, special, - incidental, or consequential damages of any character arising as a - result of this License or out of the use or inability to use the - Work (including but not limited to damages for loss of goodwill, - work stoppage, computer failure or malfunction, or any and all - other commercial damages or losses), even if such Contributor - has been advised of the possibility of such damages. - - 9. Accepting Warranty or Additional Liability. While redistributing - the Work or Derivative Works thereof, You may choose to offer, - and charge a fee for, acceptance of support, warranty, indemnity, - or other liability obligations and/or rights consistent with this - License. However, in accepting such obligations, You may act only - on Your own behalf and on Your sole responsibility, not on behalf - of any other Contributor, and only if You agree to indemnify, - defend, and hold each Contributor harmless for any liability - incurred by, or claims asserted against, such Contributor by reason - of your accepting any such warranty or additional liability. - - END OF TERMS AND CONDITIONS - - APPENDIX: How to apply the Apache License to your work. - - To apply the Apache License to your work, attach the following - boilerplate notice, with the fields enclosed by brackets "[]" - replaced with your own identifying information. (Don't include - the brackets!) The text should be enclosed in the appropriate - comment syntax for the file format. 
We also recommend that a - file or class name and description of purpose be included on the - same "printed page" as the copyright notice for easier - identification within third-party archives. - - Copyright [yyyy] [name of copyright owner] - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. +Please refer to per-package licenses \ No newline at end of file diff --git a/README.rst b/README.rst index eef09ca3a8a8..982c9a43412d 100644 --- a/README.rst +++ b/README.rst @@ -15,6 +15,10 @@ Neural Modules’ inputs and outputs have Neural Type for semantic checking. An application built with NeMo application is a Directed Acyclic Graph(DAG) of connected modules enabling researchers to define and build new speech and nlp networks easily through API Compatible modules. +**Documentation and Tutorials** + +Please refer to the HTML documentation in the `docs` folder + **VIDEO** @@ -31,11 +35,6 @@ An application built with NeMo application is a Directed Acyclic Graph(DAG) of c * **Collections** - NeMo comes with collections - related group of modules such as `nemo_asr` (for Speech Recognition) and `nemo_nlp` for NLP -**Documentation** - -Please refer to the HTML documentation in the `docs` folder - - **Requirements** 1) Python 3.6 or 3.7 @@ -60,7 +59,7 @@ Run this: 2) Go to `nemo` folder and do: `python setup.py install` 3) Install collections: a) ASR collection from `collections/nemo_asr` do: `python setup.py install` - b) NLP collection coming soon ... + b) NLP collection from `collections/nemo_nlp` do: `python setup.py install` 4) For development do: `python setup.py develop` instead of `python setup.py install` in Step (3) above 5) Go to `examples/start_here` to get started with few simple examples diff --git a/collections/nemo_asr/nemo_asr/__init__.py b/collections/nemo_asr/nemo_asr/__init__.py index bfcd12745b68..38123be26b73 100644 --- a/collections/nemo_asr/nemo_asr/__init__.py +++ b/collections/nemo_asr/nemo_asr/__init__.py @@ -1,4 +1,17 @@ -# Copyright (c) 2019 NVIDIA Corporation +# Copyright 2019 AI Applications Design Team at NVIDIA. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== from nemo.core import Backend from .data_layer import AudioToTextDataLayer, AudioPreprocessing, \ @@ -11,3 +24,4 @@ name = "nemo_asr" backend = Backend.PyTorch +__version__ = "0.1" diff --git a/collections/nemo_asr/nemo_asr/data_layer.py b/collections/nemo_asr/nemo_asr/data_layer.py index 1626103f8e66..c425cda89cbb 100644 --- a/collections/nemo_asr/nemo_asr/data_layer.py +++ b/collections/nemo_asr/nemo_asr/data_layer.py @@ -6,7 +6,7 @@ import torch from apex import amp -from nemo.backends.pytorch.nm import DataLayerNM, NonTrainableNM +from nemo.backends.pytorch.nm import DataLayerNM, TrainableNM, NonTrainableNM from nemo.core import Optimization, DeviceType from nemo.core.neural_types import * from .parts.dataset import AudioDataset, seq_collate_fn @@ -112,13 +112,12 @@ def __init__( labels=labels, featurizer=self._featurizer, max_duration=max_duration, min_duration=min_duration, normalize=normalize_transcripts, - trim=trim_silence, verbose=self._master_process, + trim=trim_silence, logger=self._logger, eos_id=eos_id, load_audio=load_audio ) if self._placement == DeviceType.AllGpu: - if self._master_process: - print('Parallelizing DATALAYER') + self._logger.info('Parallelizing DATALAYER') sampler = torch.utils.data.distributed.DistributedSampler( self._dataset) else: @@ -146,7 +145,7 @@ def data_iterator(self): return self._dataloader -class AudioPreprocessing(NonTrainableNM): +class AudioPreprocessing(TrainableNM): """ Neural Module that does batch processing of audio files and converts them to spectrogram representations @@ -232,7 +231,7 @@ def __init__( raise NotImplementedError("AudioPreprocessing currently only " "accepts 'fbank' or 'logfbank' as " "feat_type") - NonTrainableNM.__init__(self, **kwargs) + TrainableNM.__init__(self, **kwargs) self.featurizer = FilterbankFeatures( sample_rate=sample_rate, @@ -248,14 +247,14 @@ def __init__( dither=dither, pad_to=pad_to, frame_splicing=frame_splicing, - stft_conv=stft_conv + stft_conv=stft_conv, + logger=self._logger ) # _pre_procesing_config = self.local_parameters # self.featurizer = FeatureFactory.from_config(_pre_procesing_config) self.featurizer.to(self._device) - stft_conv = kwargs.get("stft_conv", False) - self.disable_casts = (self._opt_level != Optimization.nothing and + self.disable_casts = (self._opt_level == Optimization.mxprO1 and not stft_conv) def forward(self, input_signal, length): diff --git a/collections/nemo_asr/nemo_asr/helpers.py b/collections/nemo_asr/nemo_asr/helpers.py index a16b1d2c3fd1..aa73b9d5e0f8 100644 --- a/collections/nemo_asr/nemo_asr/helpers.py +++ b/collections/nemo_asr/nemo_asr/helpers.py @@ -28,7 +28,8 @@ def __ctc_decoder_predictions_tensor(tensor, labels): return hypotheses -def monitor_asr_train_progress(tensors: list, labels: list, tb_logger=None): +def monitor_asr_train_progress(tensors: list, labels: list, tb_logger=None, + logger=None): """ Takes output of greedy ctc decoder and performs ctc decoding algorithm to remove duplicates and special symbol. 
Prints sample to screen, computes @@ -46,8 +47,8 @@ def monitor_asr_train_progress(tensors: list, labels: list, tb_logger=None): labels_map = dict([(i, labels[i]) for i in range(len(labels))]) with torch.no_grad(): # prediction_cpu_tensor = tensors[0].long().cpu() - targets_cpu_tensor = tensors[1].long().cpu() - tgt_lenths_cpu_tensor = tensors[2].long().cpu() + targets_cpu_tensor = tensors[2].long().cpu() + tgt_lenths_cpu_tensor = tensors[3].long().cpu() # iterate over batch for ind in range(targets_cpu_tensor.shape[0]): @@ -56,14 +57,21 @@ def monitor_asr_train_progress(tensors: list, labels: list, tb_logger=None): reference = ''.join([labels_map[c] for c in target]) references.append(reference) hypotheses = __ctc_decoder_predictions_tensor( - tensors[0], labels=labels) + tensors[1], labels=labels) tag = "training_batch_WER" wer = word_error_rate(hypotheses, references) if tb_logger is not None: tb_logger.add_scalar(tag, wer) - print('{0}: {1}'.format(tag, wer)) - print('Prediction: {0}'.format(hypotheses[0])) - print('Reference: {0}'.format(references[0])) + if logger: + logger.info(f'Loss: {tensors[0]}') + logger.info(f'{tag}: {wer*100 : 5.2f}%') + logger.info(f'Prediction: {hypotheses[0]}') + logger.info(f'Reference: {references[0]}') + else: + print(f'Loss: {tensors[0]}') + print(f'{tag}: {wer*100 : 5.2f}%') + print(f'Prediction: {hypotheses[0]}') + print(f'Reference: {references[0]}') def __gather_losses(losses_list: list) -> list: @@ -126,7 +134,7 @@ def process_evaluation_batch(tensors: dict, global_vars: dict, labels: list): labels=labels) -def process_evaluation_epoch(global_vars: dict, tag=None): +def process_evaluation_epoch(global_vars: dict, tag=None, logger=None): """ Calculates the aggregated loss and WER across the entire evaluation dataset """ @@ -136,14 +144,22 @@ def process_evaluation_epoch(global_vars: dict, tag=None): wer = word_error_rate(hypotheses=hypotheses, references=references) if tag is None: - print("==========>>>>>>Evaluation Loss: {0}".format(eloss)) - print("==========>>>>>>Evaluation WER: {0}".format(wer)) - return dict({"Evaluation Loss": eloss, "Evaluation WER": wer}) + if logger: + logger.info(f"==========>>>>>>Evaluation Loss: {eloss}") + logger.info(f"==========>>>>>>Evaluation WER: {wer*100 : 5.2f}%") + else: + print(f"==========>>>>>>Evaluation Loss: {eloss}") + print(f"==========>>>>>>Evaluation WER: {wer*100 : 5.2f}%") + return {"Evaluation_Loss": eloss, "Evaluation_WER": wer} else: - print("==========>>>>>>Evaluation Loss {1}: {0}".format(eloss, tag)) - print("==========>>>>>>Evaluation WER {1}: {0}".format(wer, tag)) - return dict({"Evaluation Loss {0}".format(tag): eloss, - "Evaluation WER {0}".format(tag): wer}) + if logger: + logger.info(f"==========>>>>>>Evaluation Loss {tag}: {eloss}") + logger.info(f"==========>>>>>>Evaluation WER {tag}: " + f"{wer*100 : 5.2f}%") + else: + print(f"==========>>>>>>Evaluation Loss {tag}: {eloss}") + print(f"==========>>>>>>Evaluation WER {tag}: {wer*100 : 5.2f}%") + return {f"Evaluation_Loss_{tag}": eloss, f"Evaluation_WER_{tag}": wer} def post_process_predictions(predictions, labels): diff --git a/collections/nemo_asr/nemo_asr/las/helpers.py b/collections/nemo_asr/nemo_asr/las/helpers.py index 6043f372fa8e..993bf3385605 100644 --- a/collections/nemo_asr/nemo_asr/las/helpers.py +++ b/collections/nemo_asr/nemo_asr/las/helpers.py @@ -1,5 +1,5 @@ from itertools import chain -from pprint import pprint +from pprint import pformat import torch from nemo.backends.pytorch.common.metrics import char_lm_metrics @@ 
-10,7 +10,7 @@ def process_evaluation_batch(tensors, global_vars, labels, specials, - tb_writer=None): + tb_writer=None, write_attn=True): loss, log_probs = ([],) * 2 transcripts, transcript_texts = ([],) * 2 predictions, prediction_texts = ([],) * 2 @@ -43,7 +43,7 @@ def process_evaluation_batch(tensors, global_vars, labels, specials, global_vars['prediction_texts'].extend(prediction_texts) # TODO: Add step number? - if tb_writer is not None and len(attention_weights): + if tb_writer is not None and len(attention_weights) and write_attn: sample_len = len(prediction_texts[0][0]) if sample_len > 0: attention_weights = attention_weights[0][0, :sample_len, :] @@ -55,7 +55,7 @@ def process_evaluation_batch(tensors, global_vars, labels, specials, def process_evaluation_epoch(global_vars, metrics=('loss', 'bpc', 'ppl'), calc_wer=False, - log=True, mode='eval', tag='none'): + logger=None, mode='eval', tag='none'): tag = '_'.join(tag.lower().strip().split()) return_dict = {} for metric in metrics: @@ -70,17 +70,17 @@ def process_evaluation_epoch(global_vars, transcript_texts = list(chain(*global_vars['transcript_texts'])) prediction_texts = list(chain(*global_vars['prediction_texts'])) - if log: - print(f'Ten examples (transcripts and predictions)') - print(transcript_texts[:10]) - print(prediction_texts[:10]) + if logger: + logger.info(f'Ten examples (transcripts and predictions)') + logger.info(transcript_texts[:10]) + logger.info(prediction_texts[:10]) wer = word_error_rate(hypotheses=prediction_texts, references=transcript_texts) return_dict[f'metric/{mode}_wer_{tag}'] = wer - if log: - pprint(return_dict) + if logger: + logger.info(pformat(return_dict)) return return_dict diff --git a/collections/nemo_asr/nemo_asr/parts/dataset.py b/collections/nemo_asr/nemo_asr/parts/dataset.py index 69bfa7a57021..e1297aff417e 100644 --- a/collections/nemo_asr/nemo_asr/parts/dataset.py +++ b/collections/nemo_asr/nemo_asr/parts/dataset.py @@ -74,7 +74,7 @@ class AudioDataset(Dataset): def __init__(self, manifest_filepath, labels, featurizer, max_duration=None, min_duration=None, max_utts=0, normalize=True, - trim=False, eos_id=None, verbose=False, load_audio=True): + trim=False, eos_id=None, logger=False, load_audio=True): """ Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations @@ -112,8 +112,8 @@ def __init__(self, manifest_filepath, labels, featurizer, self.trim = trim self.eos_id = eos_id self.load_audio = load_audio - if verbose: - print( + if logger: + logger.info( "Dataset loaded with {0:.2f} hours. 
Filtered {1:.2f} " "hours.".format( self.manifest.duration / 3600, diff --git a/collections/nemo_asr/nemo_asr/parts/features.py b/collections/nemo_asr/nemo_asr/parts/features.py index 23ecebb9cc38..8cf85f153cd2 100644 --- a/collections/nemo_asr/nemo_asr/parts/features.py +++ b/collections/nemo_asr/nemo_asr/parts/features.py @@ -196,9 +196,12 @@ def __init__(self, sample_rate=8000, window_size=0.02, window_stride=0.01, preemph=0.97, nfilt=64, lowfreq=0, highfreq=None, log=True, dither=CONSTANT, pad_to=16, max_duration=16.7, - frame_splicing=1, stft_conv=False): + frame_splicing=1, stft_conv=False, logger=None): super(FilterbankFeatures, self).__init__() - print("PADDING: {}".format(pad_to)) + if logger: + logger.info(f"PADDING: {pad_to}") + else: + print(f"PADDING: {pad_to}") self.win_length = int(sample_rate * window_size) self.hop_length = int(sample_rate * window_stride) @@ -206,7 +209,10 @@ def __init__(self, sample_rate=8000, window_size=0.02, window_stride=0.01, self.stft_conv = stft_conv if stft_conv: - print("STFT using conv") + if logger: + logger.info("STFT using conv") + else: + print("STFT using conv") # Create helper class to patch forward func for use with AMP class STFTPatch(STFT): diff --git a/collections/nemo_asr/setup.py b/collections/nemo_asr/setup.py index 7a28bb877060..bc5266761866 100644 --- a/collections/nemo_asr/setup.py +++ b/collections/nemo_asr/setup.py @@ -5,8 +5,8 @@ setuptools.setup( name="nemo_asr", - version="0.0.1", - author="AI Applications @ NVIDIA", + version="0.3", + author="NVIDIA", author_email="okuchaiev@nvidia.com", description="Collection of Neural Modules for Speech Recognition", long_description=long_description, @@ -16,10 +16,10 @@ classifiers=[ "Programming Language :: Python :: 3", "Operating System :: OS Independent", + "License :: OSI Approved :: Apache License 2.0" ], install_requires=[ - 'nemo', - 'toml', + 'nemo_toolkit', 'librosa', 'num2words', 'inflect', diff --git a/collections/nemo_nlp/LICENSE b/collections/nemo_nlp/LICENSE new file mode 100644 index 000000000000..261eeb9e9f8b --- /dev/null +++ b/collections/nemo_nlp/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. 
+ + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/collections/nemo_nlp/README.md b/collections/nemo_nlp/README.md new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/collections/nemo_nlp/nemo_nlp/__init__.py b/collections/nemo_nlp/nemo_nlp/__init__.py new file mode 100644 index 000000000000..c1b82569c519 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/__init__.py @@ -0,0 +1,36 @@ +# Copyright 2019 AI Applications Design Team at NVIDIA. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== + +from .data import * +from .transformer_nm import TransformerEncoderNM, TransformerDecoderNM, \ + TransformerLogSoftmaxNM, PaddedSmoothedCrossEntropyLossNM, \ + BeamSearchTranslatorNM, GreedyLanguageGeneratorNM +from .bert import MaskedLanguageModelingLossNM, \ + SentenceClassificationLogSoftmaxNM, NextSentencePredictionLossNM, \ + LossAggregatorNM, QuestionAnsweringPredictionLoss, \ + TokenClassificationLoss, SequenceClassifier, \ + JointIntentSlotLoss, ZerosLikeNM, \ + JointIntentSlotClassifier +from .nlp_utils import read_intent_slot_outputs +from . import transformer, huggingface + +from .callbacks import * + +from nemo.core import Backend + + +name = "nemo_nlp" +backend = Backend.PyTorch +__version__ = "0.1" diff --git a/collections/nemo_nlp/nemo_nlp/bert.py b/collections/nemo_nlp/nemo_nlp/bert.py new file mode 100644 index 000000000000..d25e32bbd309 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/bert.py @@ -0,0 +1,511 @@ +# Copyright (c) 2019 NVIDIA Corporation +""" +This package contains BERT Neural Module +""" +import torch +import torch.nn as nn + +from nemo.backends.pytorch.nm import TrainableNM, LossNM +from nemo.core.neural_types import * +from .transformer import ClassificationLogSoftmax +from .transformer import SmoothedCrossEntropyLoss +from .transformer import SequenceClassificationLoss +from .transformer.utils import transformer_weights_init + + +class MaskedLanguageModelingLossNM(TrainableNM): + @staticmethod + def create_ports(): + input_ports = { + "log_probs": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + "output_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "output_mask": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + } + + output_ports = {"loss": NeuralType(None)} + return input_ports, output_ports + + def __init__(self, **kwargs): + TrainableNM.__init__(self, **kwargs) + label_smoothing = self.local_parameters.get("label_smoothing", 0.0) + self._loss_fn = SmoothedCrossEntropyLoss(label_smoothing) + + def forward(self, log_probs, output_ids, output_mask): + loss = self._loss_fn(log_probs, output_ids, output_mask) + return loss + + +class SentenceClassificationLogSoftmaxNM(TrainableNM): + @staticmethod + def create_ports(): + input_ports = { + "hidden_states": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + } + + output_ports = { + "log_probs": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + } + return input_ports, output_ports + + def __init__(self, *, d_model, num_classes, **kwargs): + TrainableNM.__init__(self, **kwargs) + + self.log_softmax = ClassificationLogSoftmax( + hidden_size=d_model, + num_classes=num_classes + ) + + self.log_softmax.apply(transformer_weights_init) + self.log_softmax.to(self._device) + + def forward(self, hidden_states): + log_probs = self.log_softmax(hidden_states) + return log_probs + + +class NextSentencePredictionLossNM(TrainableNM): + @staticmethod + def create_ports(): + input_ports = { + "log_probs": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + "labels": + NeuralType({0: AxisType(BatchTag)}), + } + + output_ports = {"loss": NeuralType(None)} + return input_ports, output_ports + + def __init__(self, **kwargs): + TrainableNM.__init__(self, **kwargs) + self._loss_fn = SequenceClassificationLoss() 
+ + def forward(self, log_probs, labels): + loss = self._loss_fn(log_probs, labels) + return loss + + +class LossAggregatorNM(LossNM): + @staticmethod + def create_ports(num_losses=2): + input_ports = {} + for i in range(num_losses): + input_ports["loss_" + str(i + 1)] = NeuralType(None) + + output_ports = {"loss": NeuralType(None)} + return input_ports, output_ports + + def __init__(self, *, num_inputs, **kwargs): + kwargs["create_port_args"] = {"num_losses": num_inputs} + LossNM.__init__(self, **kwargs) + + def _loss_function(self, **kwargs): + values = [kwargs[x] for x in sorted(kwargs.keys())] + loss = values[0] + for loss_i in values[1:]: + loss = loss.add(loss_i.item()) + return loss + + +class QuestionAnsweringPredictionLoss(TrainableNM): + @staticmethod + def create_ports(): + input_ports = { + "hidden_states": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + "start_positions": + NeuralType({0: AxisType(BatchTag)}), + "end_positions": + NeuralType({0: AxisType(BatchTag)}) + } + + output_ports = { + "loss": + NeuralType(None), + "start_logits": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "end_logits": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }) + } + return input_ports, output_ports + + def __init__(self, **kwargs): + TrainableNM.__init__(self, **kwargs) + + self.hidden_size = self.local_parameters["d_model"] + + self.qa_outputs = nn.Linear(self.hidden_size, 2) + self.qa_outputs.apply(transformer_weights_init) + self.qa_outputs.to(self._device) + + def forward(self, hidden_states, start_positions, end_positions): + + logits = self.qa_outputs(hidden_states) + + start_logits, end_logits = logits.split(1, dim=-1) + start_logits = start_logits.squeeze(-1) + end_logits = end_logits.squeeze(-1) + + # If we are on multi-GPU, split add a dimension + if len(start_positions.size()) > 1: + start_positions = start_positions.squeeze(-1) + if len(end_positions.size()) > 1: + end_positions = end_positions.squeeze(-1) + + # sometimes the start/end positions are outside our model inputs, + # we ignore these terms + ignored_index = start_logits.size(1) + start_positions.clamp_(0, ignored_index) + end_positions.clamp_(0, ignored_index) + + loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index) + start_loss = loss_fct(start_logits, start_positions) + end_loss = loss_fct(end_logits, end_positions) + total_loss = (start_loss + end_loss) / 2 + + return total_loss, start_logits, end_logits + + +class TokenClassificationLoss(TrainableNM): + @staticmethod + def create_ports(): + input_ports = { + "hidden_states": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + "labels": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }) + } + + output_ports = { + "loss": + NeuralType(None), + "logits": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }) + } + return input_ports, output_ports + + def __init__(self, **kwargs): + TrainableNM.__init__(self, **kwargs) + + self.hidden_size = self.local_parameters["d_model"] + self.num_labels = self.local_parameters["num_labels"] + self.dropout = nn.Dropout(self.local_parameters["dropout"]) + self.classifier = nn.Linear(self.hidden_size, self.num_labels) + + self.apply( + lambda module: transformer_weights_init(module, xavier=False)) + self.to(self._device) + + def forward(self, 
hidden_states, labels, input_mask): + + hidden_states = self.dropout(hidden_states) + logits = self.classifier(hidden_states) + + loss_fct = nn.CrossEntropyLoss() + + active_loss = input_mask.view(-1) > 0.5 + active_logits = logits.view(-1, self.num_labels)[active_loss] + active_labels = labels.view(-1)[active_loss] + + loss = loss_fct(active_logits, active_labels) + + return loss, logits + + +class ZerosLikeNM(TrainableNM): + @staticmethod + def create_ports(): + input_ports = { + "input_type_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + }) + } + + output_ports = { + "input_type_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + }) + } + return input_ports, output_ports + + def __init__(self, **kwargs): + TrainableNM.__init__(self, **kwargs) + + def forward(self, input_type_ids): + return torch.zeros_like(input_type_ids).long() + + +class SequenceClassifier(TrainableNM): + """ + Loss function for the joint intent classification and slot + filling task. + + The loss is a joint loss of both tasks, aim to maximize: + p(y^i | x)P(y^s1, y^s2, ..., y^sn | x) + + with y^i being the predicted intent and y^s1, y^s2, ..., y^sn + are the predicted slots corresponding to x1, x2, ..., xn. + + Args: + hidden_states: output of the hidden layers + intents: ground truth intents, + slots: ground truth slots. + input_mask: to differentiate from original tokens and paddings + intent_loss_weight: the loss is the sum of: + intent_loss_weight * intent_loss + + (1 - intent_loss_weight) * slot_loss + + """ + @staticmethod + def create_ports(): + input_ports = { + "hidden_states": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }) + } + + output_ports = { + "logits": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(ChannelTag) + }), + } + return input_ports, output_ports + + def __init__(self, **kwargs): + TrainableNM.__init__(self, **kwargs) + self.hidden_size = self.local_parameters["d_model"] + self.num_classes = self.local_parameters["num_classes"] + self.dropout = nn.Dropout(self.local_parameters["dropout"]) + self.dense = nn.Linear(self.hidden_size, self.hidden_size) + self.classifier = nn.Linear(self.hidden_size, self.num_classes) + self.apply( + lambda module: transformer_weights_init(module, xavier=False)) + self.to(self._device) + + def forward(self, hidden_states): + hidden_states = self.dropout(hidden_states) + hidden_states = self.dense(hidden_states[:, 0]) + hidden_states = torch.relu(hidden_states) + logits = self.classifier(hidden_states) + + return logits + + +class JointIntentSlotClassifier(TrainableNM): + """ + Loss function for the joint intent classification and slot + filling task. + + The loss is a joint loss of both tasks, aim to maximize: + p(y^i | x)P(y^s1, y^s2, ..., y^sn | x) + + with y^i being the predicted intent and y^s1, y^s2, ..., y^sn + are the predicted slots corresponding to x1, x2, ..., xn. + + Args: + hidden_states: output of the hidden layers + intents: ground truth intents, + slots: ground truth slots. 
+ input_mask: to differentiate from original tokens and paddings + intent_loss_weight: the loss is the sum of: + intent_loss_weight * intent_loss + + (1 - intent_loss_weight) * slot_loss + + """ + @staticmethod + def create_ports(): + input_ports = { + "hidden_states": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + "input_mask": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + } + + output_ports = { + "intent_logits": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(ChannelTag) + }), + "slot_logits": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }) + } + return input_ports, output_ports + + def __init__(self, + hidden_size, + num_intents, + num_slots, + dropout, + **kwargs): + TrainableNM.__init__(self, **kwargs) + self.hidden_size = hidden_size + self.num_intents = num_intents + self.num_slots = num_slots + self.dropout = nn.Dropout(dropout) + self.intent_dense = nn.Linear(self.hidden_size, self.hidden_size) + self.intent_classifier = nn.Linear(self.hidden_size, self.num_intents) + self.slot_dense = nn.Linear(self.hidden_size, self.hidden_size) + self.slot_classifier = nn.Linear(self.hidden_size, self.num_slots) + self.apply( + lambda module: transformer_weights_init(module, xavier=False)) + self.to(self._device) + + def forward(self, hidden_states): + hidden_states = self.dropout(hidden_states) + + intent_states = self.intent_dense(hidden_states[:, 0]) + intent_states = torch.relu(intent_states) + intent_logits = self.intent_classifier(intent_states) + + # slot_states = self.slot_dense(hidden_states[1:, :]) + slot_states = self.slot_dense(hidden_states) + slot_states = torch.relu(slot_states) + slot_logits = self.slot_classifier(slot_states) + + return intent_logits, slot_logits + + +class JointIntentSlotLoss(LossNM): + """ + Loss function for the joint intent classification and slot + filling task. + + The loss is a joint loss of both tasks, aim to maximize: + p(y^i | x)P(y^s1, y^s2, ..., y^sn | x) + + with y^i being the predicted intent and y^s1, y^s2, ..., y^sn + are the predicted slots corresponding to x1, x2, ..., xn. + + Args: + hidden_states: output of the hidden layers + intents: ground truth intents, + slots: ground truth slots. 
+ input_mask: to differentiate from original tokens and paddings + intent_loss_weight: the loss is the sum of: + intent_loss_weight * intent_loss + + (1 - intent_loss_weight) * slot_loss + + """ + @staticmethod + def create_ports(): + input_ports = { + "intent_logits": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(ChannelTag) + }), + "slot_logits": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + "input_mask": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "intents": NeuralType({ + 0: AxisType(BatchTag), + }), + "slots": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + # "intent_loss_weight": NeuralType(None) + } + + output_ports = { + "loss": NeuralType(None), + } + return input_ports, output_ports + + def __init__(self, num_slots, **kwargs): + LossNM.__init__(self, **kwargs) + self.num_slots = num_slots + self._criterion = nn.CrossEntropyLoss() + + def _loss_function(self, + intent_logits, + slot_logits, + input_mask, + intents, + slots, + intent_loss_weight=0.6): + intent_loss = self._criterion(intent_logits, intents) + + active_loss = input_mask.view(-1) > 0.5 + active_logits = slot_logits.view(-1, self.num_slots)[active_loss] + active_labels = slots.view(-1)[active_loss] + + slot_loss = self._criterion(active_logits, active_labels) + loss = intent_loss * intent_loss_weight + \ + slot_loss * (1 - intent_loss_weight) + + return loss diff --git a/collections/nemo_nlp/nemo_nlp/callbacks/__init__.py b/collections/nemo_nlp/nemo_nlp/callbacks/__init__.py new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/collections/nemo_nlp/nemo_nlp/callbacks/bert_pretraining.py b/collections/nemo_nlp/nemo_nlp/callbacks/bert_pretraining.py new file mode 100644 index 000000000000..4abe25545f74 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/callbacks/bert_pretraining.py @@ -0,0 +1,35 @@ +# Copyright (c) 2019 NVIDIA Corporation +import numpy as np + + +def eval_iter_callback(tensors, global_vars): + if "dev_mlm_loss" not in global_vars.keys(): + global_vars["dev_mlm_loss"] = [] + if "dev_nsp_loss" not in global_vars.keys(): + global_vars["dev_nsp_loss"] = [] + keys = list(tensors.keys()) + # TODO: referring to these by name here is error-prone + for dev_mlm_loss in tensors[keys[1]]: + global_vars["dev_mlm_loss"].append(dev_mlm_loss.item()) + + if len(keys) > 2: + for dev_nsp_loss in tensors[keys[2]]: + global_vars["dev_nsp_loss"].append(dev_nsp_loss.item()) + + +def eval_epochs_done_callback(global_vars): + if 'dev_mlm_loss' in global_vars: + mlm_loss = np.mean(global_vars["dev_mlm_loss"]) + print("Dev MLM perplexity: {0}".format(np.round(np.exp(mlm_loss), 3))) + global_vars["dev_mlm_loss"] = [] + else: + mlm_loss = -123.0 + + if 'dev_nsp_loss' in global_vars: + nsp_loss = np.mean(global_vars["dev_nsp_loss"]) + print("Dev NSP perplexity: {0}".format(np.round(np.exp(nsp_loss), 3))) + global_vars["dev_nsp_loss"] = [] + else: + nsp_loss = -123.0 + + return dict({"Dev MLM loss": mlm_loss, "Dev NSP loss": nsp_loss}) diff --git a/collections/nemo_nlp/nemo_nlp/callbacks/joint_intent_slot.py b/collections/nemo_nlp/nemo_nlp/callbacks/joint_intent_slot.py new file mode 100644 index 000000000000..0305098dbb69 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/callbacks/joint_intent_slot.py @@ -0,0 +1,103 @@ +# Copyright (c) 2019 NVIDIA Corporation +import os +import random +import time + +import matplotlib +from matplotlib import pyplot as plt +import numpy as np +from sklearn.metrics import 
confusion_matrix, classification_report + +from nemo.utils.exp_logging import get_logger +matplotlib.use("TkAgg") + + +logger = get_logger('') + + +def tensor2list(tensor): + return tensor.detach().cpu().tolist() + + +def eval_iter_callback(tensors, + global_vars, + eval_data_layer): + if "all_intent_preds" not in global_vars.keys(): + global_vars["all_intent_preds"] = [] + if "all_intent_labels" not in global_vars.keys(): + global_vars["all_intent_labels"] = [] + if "all_slot_preds" not in global_vars.keys(): + global_vars["all_slot_preds"] = [] + if "all_slot_labels" not in global_vars.keys(): + global_vars["all_slot_labels"] = [] + + intent_logits_lists, intent_labels_lists = [], [] + slot_logits_lists, slot_labels_lists = [], [] + + for kv, v in tensors.items(): + if kv.startswith('intent_logits'): + for v_tensor in v: + for logit_tensor in v_tensor: + intent_logits_lists.append(tensor2list(logit_tensor)) + + if kv.startswith('intents'): + for v_tensor in v: + for label_tensor in v_tensor: + intent_labels_lists.append(tensor2list(label_tensor)) + + if kv.startswith('slot_logits'): + for v_tensor in v: + for logit_tensor in v_tensor: + slot_logits_lists.append(tensor2list(logit_tensor)) + + if kv.startswith('slots'): + for v_tensor in v: + for label_tensor in v_tensor: + slot_labels_lists.extend(tensor2list(label_tensor)) + + intent_preds = list(np.argmax(np.asarray(intent_logits_lists), 1)) + slot_preds = list(np.argmax(np.asarray(slot_logits_lists), 2).flatten()) + global_vars["all_intent_preds"].extend(intent_preds) + global_vars["all_intent_labels"].extend(intent_labels_lists) + global_vars["all_slot_preds"].extend(slot_preds) + global_vars["all_slot_labels"].extend(slot_labels_lists) + + +def list2str(l): + return ' '.join([str(j) for j in l]) + + +def eval_epochs_done_callback(global_vars, graph_fold): + intent_labels = np.asarray(global_vars['all_intent_labels']) + intent_preds = np.asarray(global_vars['all_intent_preds']) + correct_preds = sum(intent_labels == intent_preds) + intent_accuracy = correct_preds / intent_labels.shape[0] + logger.info(f'Intent accuracy: {intent_accuracy}') + + slot_labels = np.asarray(global_vars['all_slot_labels']) + slot_preds = np.asarray(global_vars['all_slot_preds']) + slot_accuracy = sum(slot_labels == slot_preds) / slot_labels.shape[0] + logger.info(f'Slot accuracy: {slot_accuracy}') + + i = 0 + if intent_preds.shape[0] > 21: + i = random.randint(0, intent_preds.shape[0] - 21) + logger.info("Sampled i_preds: [%s]" % list2str(intent_preds[i:i+20])) + logger.info("Sampled intents: [%s]" % list2str(intent_labels[i:i+20])) + logger.info("Sampled s_preds: [%s]" % list2str(slot_preds[i:i+20])) + logger.info("Sampled slots: [%s]" % list2str(slot_labels[i:i+20])) + cm = confusion_matrix(intent_labels, intent_preds) + fig = plt.figure() + ax = fig.add_subplot(111) + cax = ax.matshow(cm) + plt.title('Confusion matrix of the classifier') + fig.colorbar(cax) + plt.xlabel('Predicted') + plt.ylabel('True') + os.makedirs(graph_fold, exist_ok=True) + plt.savefig(os.path.join(graph_fold, time.strftime('%Y%m%d-%H%M%S'))) + + logger.info(classification_report(intent_labels, intent_preds)) + + return dict({'intent_accuracy': intent_accuracy, + 'slot_accuracy': slot_accuracy}) diff --git a/collections/nemo_nlp/nemo_nlp/callbacks/language_modeling.py b/collections/nemo_nlp/nemo_nlp/callbacks/language_modeling.py new file mode 100644 index 000000000000..6ea43ff6ea6c --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/callbacks/language_modeling.py @@ -0,0 +1,30 @@ +# 
Copyright (c) 2019 NVIDIA Corporation +import numpy as np + + +GLOBAL_KEYS = ["eval_loss", "sys"] + + +def eval_iter_callback(tensors, global_vars): + + for key in GLOBAL_KEYS: + if key not in global_vars.keys(): + global_vars[key] = [] + + for kv, v in tensors.items(): + if "loss" in kv: + for eval_loss in v: + global_vars["eval_loss"].append(eval_loss.item()) + + +def eval_epochs_done_callback(global_vars): + eval_loss = np.mean(global_vars["eval_loss"]) + eval_ppl = np.exp(eval_loss) + + print("------------------------------------------------------------") + print("Validation loss: {0}".format(np.round(eval_loss, 3))) + print("Validation ppl: {0}".format(np.round(eval_ppl, 3))) + print("------------------------------------------------------------") + for key in GLOBAL_KEYS: + global_vars[key] = [] + return dict({"Eval loss": eval_loss, "Eval ppl": eval_ppl}) diff --git a/collections/nemo_nlp/nemo_nlp/callbacks/ner.py b/collections/nemo_nlp/nemo_nlp/callbacks/ner.py new file mode 100644 index 000000000000..2e6b2c367961 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/callbacks/ner.py @@ -0,0 +1,98 @@ +# Copyright (c) 2019 NVIDIA Corporation + + +def eval_iter_callback(tensors, global_vars, eval_data_layer, tag_ids): + if "correct_tags" not in global_vars.keys(): + global_vars["correct_tags"] = 0 + if "token_count" not in global_vars.keys(): + global_vars["token_count"] = 0 + if "correct_chunks" not in global_vars.keys(): + global_vars["correct_chunks"] = 0 + if "predicted_chunks" not in global_vars.keys(): + global_vars["predicted_chunks"] = 0 + if "total_chunks" not in global_vars.keys(): + global_vars["total_chunks"] = 0 + if "lines" not in global_vars.keys(): + global_vars["lines"] = [] + + logits_lists = [] + seq_ids = [] + + for kv, v in tensors.items(): + if "logits" in kv: + for v_tensor in v: + for logit_tensor in v_tensor: + logits_lists.append(logit_tensor.detach().cpu().tolist()) + + if "seq_ids" in kv: + for v_tensor in v: + for seq_id_tensor in v_tensor: + seq_ids.append(seq_id_tensor.detach().cpu().tolist()) + + correct_tags, token_count, correct_chunks, predicted_chunks, \ + total_chunks, lines = \ + eval_data_layer.eval_preds(logits_lists, seq_ids, tag_ids) + + global_vars["correct_tags"] += correct_tags + global_vars["token_count"] += token_count + global_vars["correct_chunks"] += correct_chunks + global_vars["predicted_chunks"] += predicted_chunks + global_vars["total_chunks"] += total_chunks + global_vars["lines"].extend(lines) + + +def eval_epochs_done_callback(global_vars, tag_ids, output_filename): + correct_tags = global_vars["correct_tags"] + token_count = global_vars["token_count"] + correct_chunks = global_vars["correct_chunks"] + predicted_chunks = global_vars["predicted_chunks"] + total_chunks = global_vars["total_chunks"] + lines = global_vars["lines"] + + if output_filename is not None: + # Create output file that can be evaluated by conlleval.pl script + tag_ids = {tag_ids[k]: k for k in tag_ids} + + last_label = "" + last_prediction = "" + + with open(output_filename, "w") as f: + for line in lines: + if line["word"] == "": + f.write("\n") + last_label = "" + last_prediction = "" + continue + + label = tag_ids[int(line["label"])] + prediction = tag_ids[int(line["prediction"])] + + # Correctly precede tags with B- and I- as necessary (slightly + # modified from https://www.clips.uantwerpen.be/conll2003/ner/) + if label != "O": + if last_label == line["label"]: + label = "I-{}".format(label) + else: + label = "B-{}".format(label) + + if prediction != "O": + 
if last_prediction == line["prediction"]: + prediction = "I-{}".format(prediction) + else: + prediction = "B-{}".format(prediction) + + last_label = line["label"] + last_prediction = line["prediction"] + + f.write("{}\t{}\t{}\n".format(line["word"], label, prediction)) + + accuracy = correct_tags / token_count + + p = correct_chunks / predicted_chunks if predicted_chunks > 0 else 0 + r = correct_chunks / total_chunks if total_chunks > 0 else 0 + f1 = 2 * p * r / (p + r) if p > 0 and r > 0 else 0 + + print(f"Accuracy = {accuracy}") + print(f"F1 = {f1}") + + return {"accuracy": accuracy, "f1": f1} diff --git a/collections/nemo_nlp/nemo_nlp/callbacks/sentence_classification.py b/collections/nemo_nlp/nemo_nlp/callbacks/sentence_classification.py new file mode 100644 index 000000000000..10162f7e5943 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/callbacks/sentence_classification.py @@ -0,0 +1,71 @@ +# Copyright (c) 2019 NVIDIA Corporation +import os +import random +import time + +import logging + +import matplotlib +from matplotlib import pyplot as plt +import numpy as np +from sklearn.metrics import confusion_matrix, classification_report + +matplotlib.use("TkAgg") +logger = logging.getLogger('log') + + +def eval_iter_callback(tensors, + global_vars, + eval_data_layer): + if "all_preds" not in global_vars.keys(): + global_vars["all_preds"] = [] + if "all_labels" not in global_vars.keys(): + global_vars["all_labels"] = [] + + logits_lists = [] + labels_lists = [] + + for kv, v in tensors.items(): + if 'logits' in kv: + for v_tensor in v: + for logit_tensor in v_tensor: + logits_lists.append(logit_tensor.detach().cpu().tolist()) + + if 'labels' in kv: + for v_tensor in v: + for label_tensor in v_tensor: + labels_lists.append(label_tensor.detach().cpu().tolist()) + + preds = list(np.argmax(np.asarray(logits_lists), 1)) + global_vars["all_preds"].extend(preds) + global_vars["all_labels"].extend(labels_lists) + + +def list2str(l): + return ' '.join([str(j) for j in l]) + + +def eval_epochs_done_callback(global_vars, graph_fold): + labels = np.asarray(global_vars['all_labels']) + preds = np.asarray(global_vars['all_preds']) + accuracy = sum(labels == preds) / labels.shape[0] + logger.info(f'Accuracy: {accuracy}') + i = 0 + if preds.shape[0] > 21: + i = random.randint(0, preds.shape[0] - 21) + logger.info("Sampled preds: [%s]" % list2str(preds[i:i+20])) + logger.info("Sampled labels: [%s]" % list2str(labels[i:i+20])) + cm = confusion_matrix(labels, preds) + fig = plt.figure() + ax = fig.add_subplot(111) + cax = ax.matshow(cm) + plt.title('Confusion matrix of the classifier') + fig.colorbar(cax) + plt.xlabel('Predicted') + plt.ylabel('True') + os.makedirs(graph_fold, exist_ok=True) + plt.savefig(os.path.join(graph_fold, time.strftime('%Y%m%d-%H%M%S'))) + + logger.info(classification_report(labels, preds)) + + return dict({"accuracy": accuracy}) diff --git a/collections/nemo_nlp/nemo_nlp/callbacks/squad.py b/collections/nemo_nlp/nemo_nlp/callbacks/squad.py new file mode 100644 index 000000000000..0b740b2a13f2 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/callbacks/squad.py @@ -0,0 +1,47 @@ +# Copyright (c) 2019 NVIDIA Corporation + + +def eval_iter_callback(tensors, global_vars): + if "eval_start_logits" not in global_vars.keys(): + global_vars["eval_start_logits"] = [] + if "eval_end_logits" not in global_vars.keys(): + global_vars["eval_end_logits"] = [] + if "eval_unique_ids" not in global_vars.keys(): + global_vars["eval_unique_ids"] = [] + + for kv, v in tensors.items(): + + if 'logits' in 
kv: + logits = [] + for v_tensor in v: + for logit_tensor in v_tensor: + logits.append(logit_tensor.detach().cpu().tolist()) + + if kv.startswith('start_logits'): + global_vars['eval_start_logits'].extend(logits) + elif kv.startswith('end_logits'): + global_vars['eval_end_logits'].extend(logits) + + if 'unique_ids' in kv: + unique_ids = [] + for v_tensor in v: + for id_tensor in v_tensor: + unique_ids.append(id_tensor.detach().cpu().tolist()) + global_vars['eval_unique_ids'].extend(unique_ids) + + +def eval_epochs_done_callback(global_vars, eval_data_layer, do_lower_case): + + exact_match, f1 = eval_data_layer.calculate_exact_match_and_f1( + global_vars["eval_unique_ids"], + global_vars["eval_start_logits"], + global_vars["eval_end_logits"], + do_lower_case=do_lower_case) + + print(f"Exact_match = {exact_match}, f1 = {f1}") + + global_vars["eval_unique_ids"] = [] + global_vars["eval_start_logits"] = [] + global_vars["eval_end_logits"] = [] + + return dict({"exact_match": exact_match, "f1": f1}) diff --git a/collections/nemo_nlp/nemo_nlp/callbacks/token_classification.py b/collections/nemo_nlp/nemo_nlp/callbacks/token_classification.py new file mode 100644 index 000000000000..45b4bf1d22f9 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/callbacks/token_classification.py @@ -0,0 +1,57 @@ +# Copyright (c) 2019 NVIDIA Corporation + + +def eval_iter_callback(tensors, global_vars, eval_data_layer): + if "correct_labels" not in global_vars.keys(): + global_vars["correct_labels"] = 0 + if "incorrect_labels" not in global_vars.keys(): + global_vars["incorrect_labels"] = 0 + if "correct_preds" not in global_vars.keys(): + global_vars["correct_preds"] = 0 + if "total_preds" not in global_vars.keys(): + global_vars["total_preds"] = 0 + if "total_correct" not in global_vars.keys(): + global_vars["total_correct"] = 0 + + logits_lists = [] + seq_ids = [] + + for kv, v in tensors.items(): + if 'logits' in kv: + for v_tensor in v: + for logit_tensor in v_tensor: + logits_lists.append(logit_tensor.detach().cpu().tolist()) + + if 'seq_ids' in kv: + for v_tensor in v: + for seq_id_tensor in v_tensor: + seq_ids.append(seq_id_tensor.detach().cpu().tolist()) + + correct_labels, incorrect_labels, correct_preds, total_preds, \ + total_correct = eval_data_layer.eval_preds(logits_lists, seq_ids) + + global_vars["correct_labels"] += correct_labels + global_vars["incorrect_labels"] += incorrect_labels + global_vars["correct_preds"] += correct_preds + global_vars["total_preds"] += total_preds + global_vars["total_correct"] += total_correct + + +def eval_epochs_done_callback(global_vars): + + correct_labels = global_vars["correct_labels"] + incorrect_labels = global_vars["incorrect_labels"] + correct_preds = global_vars["correct_preds"] + total_preds = global_vars["total_preds"] + total_correct = global_vars["total_correct"] + + accuracy = correct_labels / (correct_labels + incorrect_labels) + + p = correct_preds / total_preds + r = correct_preds / total_correct + f1 = 2 * p * r / (p + r) + + print(f"Accuracy = {accuracy}") + print(f"F1= {f1}") + + return dict({"accuracy": accuracy, "f1": f1}) diff --git a/collections/nemo_nlp/nemo_nlp/callbacks/translation.py b/collections/nemo_nlp/nemo_nlp/callbacks/translation.py new file mode 100644 index 000000000000..924323047bfe --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/callbacks/translation.py @@ -0,0 +1,109 @@ +# Copyright (c) 2019 NVIDIA Corporation +import numpy as np +from ..externals.sacrebleu import corpus_bleu +from nemo_asr.metrics import word_error_rate + + 
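+# [Editor's note] Illustrative aside, not part of the original change: the epoch
+# callback below averages per-batch losses weighted by the number of non-pad
+# target tokens, so short batches do not skew the validation loss. With
+# hypothetical numbers:
+#
+#   losses = np.array([2.10, 1.95])      # mean loss reported per batch
+#   counts = np.array([1800, 950])       # non-pad target tokens per batch
+#   eval_loss = np.sum(losses * counts) / np.sum(counts)   # ~2.048, not 2.025
+#   eval_ppl = np.exp(eval_loss)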
+GLOBAL_KEYS = ["eval_loss", "ref", "sys", "sent_ids", "nonpad_tokens"] + + +def eval_iter_callback(tensors, global_vars, tgt_tokenizer): + + for key in GLOBAL_KEYS: + if key not in global_vars.keys(): + global_vars[key] = [] + + for kv, v in tensors.items(): + + if "output_ids" in kv: + sys = [] + for beam in v: + beam_search_translation = beam.cpu().numpy().tolist() + for sentence in beam_search_translation: + sys.append(tgt_tokenizer.ids_to_text(sentence)) + global_vars["sys"].append(sys) + + if "tgt" in kv: + ref = [] + for tgt in v: + nonpad_tokens = (tgt != tgt_tokenizer.pad_id()).sum().item() + tgt_sentences = tgt.cpu().numpy().tolist() + for sentence in tgt_sentences: + ref.append(tgt_tokenizer.ids_to_text(sentence)) + global_vars["nonpad_tokens"].append(nonpad_tokens) + global_vars["ref"].append(ref) + + if "sent_ids" in kv: + for sent_ids in v: + global_vars["sent_ids"].extend(sent_ids.cpu().numpy().tolist()) + + if "loss" in kv: + for eval_loss in v: + global_vars["eval_loss"].append(eval_loss.item()) + + +def eval_epochs_done_callback(global_vars, validation_dataset=None): + + losses = np.array(global_vars["eval_loss"]) + counts = np.array(global_vars["nonpad_tokens"]) + eval_loss = np.sum(losses * counts) / np.sum(counts) + + all_sys = [j for i in global_vars["sys"] for j in i] + _, indices = np.unique(global_vars["sent_ids"], return_index=True) + all_sys = [all_sys[i] for i in indices] + + if validation_dataset is not None: + all_ref = [open(validation_dataset, "r").readlines()] + # _, *refs = download_test_set("wmt14/full", "en-de") + # all_ref = [smart_open(x).readlines() for x in refs] + else: + all_ref = [[j for i in global_vars["ref"] for j in i]] + + token_bleu = corpus_bleu(all_sys, all_ref, tokenize="fairseq").score + sacre_bleu = corpus_bleu(all_sys, all_ref, tokenize="13a").score + + for i in range(3): + sent_id = np.random.randint(len(all_sys)) + print("Ground truth: {0}\n".format(all_ref[0][sent_id])) + print("Translation: {0}\n".format(all_sys[sent_id])) + + print("------------------------------------------------------------") + print("Validation loss: {0}".format(np.round(eval_loss, 3))) + print("TokenBLEU: {0}".format(np.round(token_bleu, 2))) + print("SacreBLEU: {0}".format(np.round(sacre_bleu, 2))) + print("------------------------------------------------------------") + + for key in GLOBAL_KEYS: + global_vars[key] = [] + + metrics = dict( + {"eval_loss": eval_loss, + "token_bleu": token_bleu, + "sacre_bleu": sacre_bleu}) + + return metrics + + +def eval_epochs_done_callback_wer(global_vars): + eval_loss = np.mean(global_vars["eval_loss"]) + all_ref = [] + for r in global_vars["ref"]: + all_ref += r + all_sys = [] + for s in global_vars["sys"]: + all_sys += s + ref = all_ref + sys = all_sys + eval_wer = word_error_rate(ref, sys) + for i in range(3): + sent_id = np.random.randint(len(sys)) + print("Ground truth: {0}\n".format(ref[sent_id])) + print("Translation: {0}\n".format(sys[sent_id])) + + print("Validation loss: {0}".format(np.round(eval_loss, 3))) + print("Validation WER: {0}".format(eval_wer)) + global_vars["eval_loss"] = [] + global_vars["ref"] = [] + global_vars["sys"] = [] + + return dict({"eval_loss": eval_loss, "eval_wer": eval_wer}) diff --git a/collections/nemo_nlp/nemo_nlp/data/__init__.py b/collections/nemo_nlp/nemo_nlp/data/__init__.py new file mode 100644 index 000000000000..f105e6d1f727 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/__init__.py @@ -0,0 +1,10 @@ +from .translation_data_layer import TranslationDataLayer +from 
.bert_ner_data_layer import BertNERDataLayer +from .bert_pretraining_data_layer import BertPretrainingDataLayer +from .bert_qa_data_layer import BertQuestionAnsweringDataLayer +from .bert_tc_data_layer import BertTokenClassificationDataLayer +from .bert_sc_data_layer import BertSentenceClassificationDataLayer,\ + BertJointIntentSlotDataLayer, \ + BertJointIntentSlotInferDataLayer +from .language_modeling_data_layer import LanguageModelingDataLayer +from .tokenizers import * diff --git a/collections/nemo_nlp/nemo_nlp/data/bert_ner_data_layer.py b/collections/nemo_nlp/nemo_nlp/data/bert_ner_data_layer.py new file mode 100644 index 000000000000..51d4ef73dcdf --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/bert_ner_data_layer.py @@ -0,0 +1,62 @@ +# Copyright (c) 2019 NVIDIA Corporation +# pylint: disable=E0401, E0602, E0611, E1101 + +import torch + +from nemo.backends.pytorch.nm import DataLayerNM +from nemo.core.neural_types import * +from nemo.core import DeviceType +from .datasets import BertNERDataset + + +class BertNERDataLayer(DataLayerNM): + @staticmethod + def create_ports(): + input_ports = {} + output_ports = { + "input_ids": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_type_ids": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "labels": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "seq_ids": NeuralType({0: AxisType(BatchTag)}) + } + return input_ports, output_ports + + def __init__(self, *, tokenizer, path_to_data, max_seq_length, **kwargs): + DataLayerNM.__init__(self, **kwargs) + + self._device = torch.device( + "cuda" if self.placement in [DeviceType.GPU, DeviceType.AllGpu] + else "cpu" + ) + + self._dataset = BertNERDataset( + tokenizer=tokenizer, + input_file=path_to_data, + max_seq_length=max_seq_length) + + def eval_preds(self, logits, seq_ids, tag_ids): + return self._dataset.eval_preds(logits, seq_ids, tag_ids) + + def __len__(self): + return len(self._dataset) + + @property + def dataset(self): + return self._dataset + + @property + def data_iterator(self): + return None diff --git a/collections/nemo_nlp/nemo_nlp/data/bert_pretraining_data_layer.py b/collections/nemo_nlp/nemo_nlp/data/bert_pretraining_data_layer.py new file mode 100644 index 000000000000..2f9fdd97e3f5 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/bert_pretraining_data_layer.py @@ -0,0 +1,73 @@ +# Copyright (c) 2019 NVIDIA Corporation + +from nemo.backends.pytorch.nm import DataLayerNM +from nemo.core.neural_types import * +from nemo.core import DeviceType +import torch +from .datasets import BertPretrainingDataset + + +class BertPretrainingDataLayer(DataLayerNM): + @staticmethod + def create_ports(): + input_ports = {} + output_ports = { + "input_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_type_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "output_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "output_mask": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "labels": + NeuralType({0: AxisType(BatchTag)}), + } + + return input_ports, output_ports + + def __init__(self, *, tokenizer, dataset, name, max_seq_length, + sentence_indices_filename=None, mask_probability=0.15, + **kwargs): + DataLayerNM.__init__(self, **kwargs) + + 
self._device = torch.device( + "cuda" if self.placement in [DeviceType.GPU, DeviceType.AllGpu] + else "cpu" + ) + + self._dataset = BertPretrainingDataset( + tokenizer=tokenizer, + dataset=dataset, + name=name, + sentence_indices_filename=sentence_indices_filename, + max_length=max_seq_length, + mask_probability=mask_probability) + + def __len__(self): + return len(self._dataset) + + @property + def dataset(self): + return self._dataset + + @property + def data_iterator(self): + return None diff --git a/collections/nemo_nlp/nemo_nlp/data/bert_qa_data_layer.py b/collections/nemo_nlp/nemo_nlp/data/bert_qa_data_layer.py new file mode 100644 index 000000000000..380fa1ee920f --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/bert_qa_data_layer.py @@ -0,0 +1,94 @@ +# Copyright (c) 2019 NVIDIA Corporation + +from nemo.backends.pytorch.nm import DataLayerNM +from nemo.core.neural_types import * +from nemo.core import DeviceType +import torch +from .datasets import BertQuestionAnsweringDataset + + +class BertQuestionAnsweringDataLayer(DataLayerNM): + @staticmethod + def create_ports(): + input_ports = {} + output_ports = { + "input_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_type_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(TimeTag) + }), + "start_positions": + NeuralType({0: AxisType(BatchTag)}), + "end_positions": + NeuralType({0: AxisType(BatchTag)}), + "unique_ids": + NeuralType({0: AxisType(BatchTag)}) + } + + return input_ports, output_ports + + def __init__( + self, *, + tokenizer, + path_to_data, + data_format, + features_file_prefix, + max_seq_length, + is_training, + max_query_length, + local_rank, + **kwargs + ): + DataLayerNM.__init__(self, **kwargs) + + self._device = torch.device( + "cuda" if self.placement in [DeviceType.GPU, DeviceType.AllGpu] + else "cpu" + ) + + self._dataset = BertQuestionAnsweringDataset( + tokenizer=tokenizer, + input_file=path_to_data, + data_format=data_format, + features_file_prefix=features_file_prefix, + max_seq_length=max_seq_length, + is_training=is_training, + max_query_length=max_query_length, + local_rank=local_rank) + + def calculate_exact_match_and_f1(self, + unique_ids, + start_logits, + end_logits, + n_best_size=20, + max_answer_length=30, + do_lower_case=False, + version_2_with_negative=False, + null_score_diff_thresold=0.0): + exact_match, f1 = self._dataset.calculate_exact_match_and_f1( + unique_ids, start_logits, end_logits, n_best_size, + max_answer_length, do_lower_case, version_2_with_negative, + null_score_diff_thresold) + return exact_match, f1 + + def __len__(self): + return len(self._dataset) + + @property + def dataset(self): + return self._dataset + + @property + def data_iterator(self): + return None diff --git a/collections/nemo_nlp/nemo_nlp/data/bert_sc_data_layer.py b/collections/nemo_nlp/nemo_nlp/data/bert_sc_data_layer.py new file mode 100644 index 000000000000..d6b6be1473c1 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/bert_sc_data_layer.py @@ -0,0 +1,213 @@ +# Copyright (c) 2019 NVIDIA Corporation +import torch + +import nemo +from nemo.backends.pytorch.nm import DataLayerNM +from nemo.core.neural_types import * +from .datasets import BertSentenceClassificationDataset,\ + BertJointIntentSlotDataset, \ + BertJointIntentSlotInferDataset + + +class BertSentenceClassificationDataLayer(DataLayerNM): + """ + Creates the data layer to use for the task of 
sentence classification + with pretrained model. + + All the data processing is done BertSentenceClassificationDataset. + + Args: + input_file: file to sequence + label. + the first line is header (sentence [tab] label) + each line should be [sentence][tab][label] + max_seq_length: max sequence length (minus 2 for [CLS] and [SEP]) + tokenizer: such as BERT tokenizer. + num_samples: number of samples you want to use for the dataset. + if -1, use all dataset. + useful for testing. + """ + + @staticmethod + def create_ports(): + output_ports = { + "input_ids": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_type_ids": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "labels": NeuralType({ + 0: AxisType(BatchTag), + }), + } + return {}, output_ports + + def __init__(self, + path_to_data, + tokenizer, + max_seq_length, + num_samples=-1, + **kwargs): + DataLayerNM.__init__(self, **kwargs) + self._device = nemo.utils.get_cuda_device(self.placement) + self._dataset = BertSentenceClassificationDataset( + input_file=path_to_data, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + num_samples=num_samples) + + def __len__(self): + return len(self._dataset) + + @property + def dataset(self): + return self._dataset + + @property + def data_iterator(self): + return None + + +class BertJointIntentSlotDataLayer(DataLayerNM): + """ + Creates the data layer to use for the task of joint intent + and slot classification with pretrained model. + + All the data processing is done in BertJointIntentSlotDataset. + + Args: + input_file: file to sequence + label. + the first line is header (sentence [tab] label) + each line should be [sentence][tab][label] + slot_file: file to slot labels, each line corresponding to + slot labels for a sentence in input_file. No header. + max_seq_length: max sequence length (minus 2 for [CLS] and [SEP]) + tokenizer: such as BERT tokenizer. + num_samples: number of samples you want to use for the dataset. + if -1, use all dataset. + useful for testing. + """ + @staticmethod + def create_ports(): + output_ports = { + "input_ids": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_type_ids": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "token_mask": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "intents": NeuralType({ + 0: AxisType(BatchTag), + }), + "slots": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + } + return {}, output_ports + + def __init__(self, + path_to_data, + path_to_slot, + pad_label, + tokenizer, + max_seq_length, + num_samples=-1, + **kwargs): + DataLayerNM.__init__(self, **kwargs) + self._device = nemo.utils.get_cuda_device(self.placement) + self._dataset = BertJointIntentSlotDataset( + input_file=path_to_data, + slot_file=path_to_slot, + pad_label=pad_label, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + num_samples=num_samples) + + def __len__(self): + return len(self._dataset) + + @property + def dataset(self): + return self._dataset + + @property + def data_iterator(self): + return None + + +class BertJointIntentSlotInferDataLayer(DataLayerNM): + """ + Creates the data layer to use for the task of joint intent + and slot classification with pretrained model. This is for + + All the data processing is done in BertJointIntentSlotDataset. 
+ + Args: + input_file: file to sequence + label. + the first line is header (sentence [tab] label) + each line should be [sentence][tab][label] + slot_file: file to slot labels, each line corresponding to + slot labels for a sentence in input_file. No header. + max_seq_length: max sequence length (minus 2 for [CLS] and [SEP]) + tokenizer: such as BERT tokenizer. + num_samples: number of samples you want to use for the dataset. + if -1, use all dataset. + useful for testing. + """ + @staticmethod + def create_ports(): + output_ports = { + "input_ids": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_type_ids": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "token_mask": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }) + } + return {}, output_ports + + def __init__(self, queries, tokenizer, max_seq_length, **kwargs): + DataLayerNM.__init__(self, **kwargs) + self._device = nemo.utils.get_cuda_device(self.placement) + self._dataset = BertJointIntentSlotInferDataset( + queries=queries, + tokenizer=tokenizer, + max_seq_length=max_seq_length) + + def __len__(self): + return len(self._dataset) + + @property + def dataset(self): + return self._dataset + + @property + def data_iterator(self): + return None diff --git a/collections/nemo_nlp/nemo_nlp/data/bert_tc_data_layer.py b/collections/nemo_nlp/nemo_nlp/data/bert_tc_data_layer.py new file mode 100644 index 000000000000..f4709c3673b2 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/bert_tc_data_layer.py @@ -0,0 +1,70 @@ +# Copyright (c) 2019 NVIDIA Corporation + +from nemo.backends.pytorch.nm import DataLayerNM +from nemo.core.neural_types import * +from nemo.core import DeviceType +import torch +from .datasets import BertTokenClassificationDataset + + +class BertTokenClassificationDataLayer(DataLayerNM): + @staticmethod + def create_ports(): + input_ports = {} + output_ports = { + "input_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_type_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(TimeTag) + }), + "labels": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "seq_ids": + NeuralType({0: AxisType(BatchTag)}) + } + + return input_ports, output_ports + + def __init__(self, *, tokenizer, path_to_data, max_seq_length, **kwargs): + DataLayerNM.__init__(self, **kwargs) + + self._device = torch.device( + "cuda" if self.placement in [DeviceType.GPU, DeviceType.AllGpu] + else "cpu" + ) + + self._dataset = BertTokenClassificationDataset( + tokenizer=tokenizer, + input_file=path_to_data, + max_seq_length=max_seq_length) + + def eval_preds(self, logits, seq_ids): + correct_labels, incorrect_labels, correct_preds, total_preds, \ + total_correct = self._dataset.eval_preds(logits, seq_ids) + return correct_labels, incorrect_labels, correct_preds, total_preds, \ + total_correct + + def __len__(self): + return len(self._dataset) + + @property + def dataset(self): + return self._dataset + + @property + def data_iterator(self): + return None diff --git a/collections/nemo_nlp/nemo_nlp/data/datasets/__init__.py b/collections/nemo_nlp/nemo_nlp/data/datasets/__init__.py new file mode 100644 index 000000000000..4d503bfa90ad --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/datasets/__init__.py @@ -0,0 +1,9 @@ +from .translation import 
TranslationDataset +from .bert_pretraining import BertPretrainingDataset +from .ner import BertNERDataset +from .question_answering import BertQuestionAnsweringDataset +from .token_classification import BertTokenClassificationDataset +from .sentence_classification import BertSentenceClassificationDataset +from .joint_intent_slot import BertJointIntentSlotDataset, \ + BertJointIntentSlotInferDataset +from .language_modeling import LanguageModelingDataset diff --git a/collections/nemo_nlp/nemo_nlp/data/datasets/bert_pretraining.py b/collections/nemo_nlp/nemo_nlp/data/datasets/bert_pretraining.py new file mode 100644 index 000000000000..fd1d2cdde3ec --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/datasets/bert_pretraining.py @@ -0,0 +1,240 @@ +# ============================================================================= +# Copyright 2019 AI Applications Design Team at NVIDIA. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================= +"""Pytorch Dataset for training BERT.""" + +import array +import glob +import os +import pickle +import random + +import numpy as np +from torch.utils.data import Dataset +from tqdm import tqdm + + +class BertPretrainingDataset(Dataset): + def __init__(self, + tokenizer, + dataset, + name, + sentence_indices_filename=None, + max_length=128, + mask_probability=0.15): + self.tokenizer = tokenizer + + if sentence_indices_filename is None: + sentence_indices_filename = "{}_sentence_indices.pkl".format(name) + + # Loading enormous datasets into RAM isn't always feasible -- for + # example, the pubmed corpus is 200+ GB, which doesn't fit into RAM on + # most computers. To get around this, we store the indices of newlines + # in each file so we can seek to and retrieve sentences immediately + # from main memory when needed during training. 
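+        # [Editor's note] Concretely, find_newlines() below yields, for each
+        # line long enough to use, the byte offset of the newline *preceding*
+        # it (-1 for the first line); get_document() later calls
+        # f.seek(offset + 1) to jump straight to that sentence on disk.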
+ + if os.path.isfile(sentence_indices_filename): + # If the sentence indices file already exists, load from it + with open(sentence_indices_filename, "rb") as f: + sentence_indices = pickle.load(f) + else: + # Otherwise, generate and store sentence indices + sentence_indices = {} + total_tokens = 0 + used_tokens = 0 + + # Finds all of the newline indices in a string + def find_newlines(contents): + nonlocal used_tokens, total_tokens + start = 0 + + while True: + try: + # index and split are much faster than Python for loops + new_start = contents.index(b"\n", start) + line = contents[start:new_start] \ + .replace(b"\xc2\x99", b" ") \ + .replace(b"\xc2\xa0", b" ") \ + .replace(b"\xa0", b" ") + num_tokens = len(line.split()) + + # Ensure the line has at least max_length tokens + if num_tokens >= max_length: + yield start - 1 + used_tokens += num_tokens + + total_tokens += num_tokens + start = new_start + 1 + except ValueError: + break + + if os.path.isdir(dataset): + dataset_pattern = os.path.join(dataset, "**", "*.txt") + filenames = glob.glob(dataset_pattern, recursive=True) + else: + filenames = [dataset] + + for filename in tqdm(filenames): + with open(filename, "rb") as f: + contents = f.read() + newline_indices = find_newlines(contents) + + if os.path.isdir(dataset): + # Only keep the parts of the filepath that are invariant to + # the dataset's location on disk + filename = os.path.basename(filename) + + # In python, arrays are much more space-efficient than lists + sentence_indices[filename] = array.array("I", newline_indices) + + # Save sentence indices so we don't have to do this again + with open(sentence_indices_filename, "wb") as f: + pickle.dump(sentence_indices, f) + + print("Used {} tokens of total {}".format(used_tokens, + total_tokens)) + + corpus_size = 0 + empty_files = [] + + # Find total number of newlines across entire corpus and remove files + # without any newlines + for filename in sentence_indices: + if len(sentence_indices[filename]) <= 1: + empty_files.append(filename) + else: + corpus_size += len(sentence_indices[filename]) + + for filename in empty_files: + del sentence_indices[filename] + + self.corpus_size = corpus_size + self.dataset = dataset + self.filenames = list(sentence_indices.keys()) + self.mask_probability = mask_probability + self.max_length = max_length + self.sentence_indices = sentence_indices + self.vocab_size = self.tokenizer.vocab_size + + def __len__(self): + return self.corpus_size + + def __getitem__(self, idx): + # Each sequence has three special tokens, as follows: + # [CLS] [SEP] [SEP] + num_special_tokens = 3 + + # TODO: Make seq_length = 512 for the last 10% of epochs, as specified + # in BERT paper + seq_length = self.max_length - num_special_tokens + min_doc_length = 16 + + a_length = random.randrange(min_doc_length, + seq_length - min_doc_length + 1) + b_length = seq_length - a_length + + a_filename = random.choice(self.filenames) + a_line = random.choice(self.sentence_indices[a_filename]) + + def get_document(filepath, line): + # Retrieve a specific line from a file and return as a document + if os.path.isdir(self.dataset): + filepath = os.path.join(self.dataset, filepath) + + with open(filepath, "rb") as f: + # Add one to go to the character after the newline + f.seek(line + 1) + + # Read line, remove newline, and decode as UTF8 + doc_text = f.readline()[:-1].decode("utf-8", errors="ignore") + document = [self.tokenizer.token_to_id("[CLS]")] \ + + self.tokenizer.text_to_ids(doc_text) \ + + [self.tokenizer.token_to_id("[SEP]")] + 
+ assert len(document) >= self.max_length + return document + + a_document = get_document(a_filename, a_line) + + if random.random() < 0.5: + # 50% of the time, B is the sentence that follows A + label = 1 + + a_start_idx = random.randrange(len(a_document) - seq_length) + b_start_idx = a_start_idx + a_length + + b_filename = a_filename + b_line = a_line + b_document = a_document + else: + # The rest of the time, B is a random sentence from the corpus + label = 0 + + a_start_idx = random.randrange(len(a_document) - a_length) + + b_filename = random.choice(self.filenames) + b_line = random.choice(self.sentence_indices[b_filename]) + b_document = get_document(b_filename, b_line) + b_start_idx = random.randrange(len(b_document) - b_length) + + # Process retrieved documents for use in training + a_ids = a_document[a_start_idx:a_start_idx + a_length] + b_ids = b_document[b_start_idx:b_start_idx + b_length] + + output_ids = [self.tokenizer.special_tokens["[CLS]"]] + a_ids + \ + [self.tokenizer.special_tokens["[SEP]"]] + b_ids + \ + [self.tokenizer.special_tokens["[SEP]"]] + + input_ids, output_mask = self.mask_ids(output_ids) + + output_mask = np.array(output_mask, dtype=np.float32) + input_mask = np.ones(self.max_length, dtype=np.float32) + + input_type_ids = np.zeros(self.max_length, dtype=np.int) + input_type_ids[a_length + 2:seq_length + 3] = 1 + + return np.array(input_ids), input_type_ids, input_mask, \ + np.array(output_ids), output_mask, label + + def mask_ids(self, ids): + """ + Args: + tokens: list of tokens representing a chunk of text + + Returns: + masked_tokens: list of input tokens with some of the entries masked + according to the following protocol from the original BERT paper: + each token is masked with a probability of 15% and is replaced with + 1) the [MASK] token 80% of the time, + 2) random token 10% of the time, + 3) the same token 10% of the time. + output_mask: list of binary variables which indicate what tokens has + been masked (to calculate the loss function for these tokens only) + """ + masked_ids = [] + output_mask = [] + for id in ids: + if random.random() < self.mask_probability: + output_mask.append(1) + if random.random() < 0.8: + masked_ids.append(self.tokenizer.special_tokens["[MASK]"]) + elif random.random() < 0.5: + masked_ids.append(random.randrange(self.vocab_size)) + else: + masked_ids.append(id) + else: + masked_ids.append(id) + output_mask.append(0) + return masked_ids, output_mask diff --git a/collections/nemo_nlp/nemo_nlp/data/datasets/joint_intent_slot.py b/collections/nemo_nlp/nemo_nlp/data/datasets/joint_intent_slot.py new file mode 100644 index 000000000000..7b580239dff0 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/datasets/joint_intent_slot.py @@ -0,0 +1,265 @@ +# Copyright 2018 The Google AI Language Team Authors and +# The HuggingFace Inc. team. +# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
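+# [Editor's note] Illustrative sketch, not part of the original change. The
+# get_features() helper below aligns word-level slot labels with WordPiece
+# subtokens: every subtoken inherits its word's slot id, while slot_mask marks
+# only the first subtoken of each word (plus [CLS]/[SEP]). Assuming a tokenizer
+# that splits "alarm" into "al", "##arm", and hypothetical slot ids:
+#
+#   words     = ["set", "alarm", "tomorrow"],  raw_slots = [12, 12, 30]
+#   subtokens = ["[CLS]", "set", "al", "##arm", "tomorrow", "[SEP]"]
+#   slots     = [pad_label, 12, 12, 12, 30, pad_label]
+#   slot_mask = [True, True, True, False, True, True]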
+""" +Utility functions for Token Classification NLP tasks +Some parts of this code were adapted from the HuggingFace library at +https://github.com/huggingface/pytorch-pretrained-BERT +""" + +from collections import Counter +import itertools +import random + +import numpy as np +from torch.utils.data import Dataset + +from nemo.utils.exp_logging import get_logger +from ... import text_data_utils + + +logger = get_logger('') + + +def get_stats(lengths): + lengths = np.asarray(lengths) + logger.info(f'Min: {np.min(lengths)} | \ + Max: {np.max(lengths)} | \ + Mean: {np.mean(lengths)} | \ + Median: {np.median(lengths)}') + logger.info(f'75 percentile: {np.percentile(lengths, 75)} | \ + 99 percentile: {np.percentile(lengths, 99)}') + + +def list2str(l): + return ' '.join([str(x) for x in l]) + + +def get_label_stats(labels, outfile='stats.tsv'): + labels = Counter(labels) + total = sum(labels.values()) + out = open(outfile, 'w') + i = 0 + for k, v in labels.most_common(): + out.write(f'{k}\t{v/total}\n') + if i < 3: + logger.info(f'{i} item: {k}, {v} out of {total}, {v/total}.') + i += 1 + + +def get_features(queries, + max_seq_length, + tokenizer, + pad_label=128, + raw_slots=None): + all_subtokens = [] + all_slot_masks = [] + all_segment_ids = [] + all_input_ids = [] + all_input_masks = [] + sent_lengths = [] + all_slots = [] + with_label = False + if raw_slots is not None: + with_label = True + + for i, query in enumerate(queries): + words = query.strip().split() + subtokens = ['[CLS]'] + slot_mask = [True] # True if a token is the start of a new word + if with_label: + slots = [pad_label] + + for j, word in enumerate(words): + word_tokens = tokenizer.tokenize(word) + subtokens.extend(word_tokens) + slot_mask.append(True) + slot_mask.extend([False] * (len(word_tokens) - 1)) + if with_label: + slots.extend([raw_slots[i][j]] * len(word_tokens)) + + subtokens.append('[SEP]') + slot_mask.append(True) + sent_lengths.append(len(subtokens)) + all_subtokens.append(subtokens) + all_slot_masks.append(slot_mask) + all_input_masks.append([1] * len(subtokens)) + if with_label: + slots.append(pad_label) + all_slots.append(slots) + + max_seq_length = min(max_seq_length, max(sent_lengths)) + logger.info(f'Max length: {max_seq_length}') + get_stats(sent_lengths) + too_long_count = 0 + + for i, subtokens in enumerate(all_subtokens): + if len(subtokens) > max_seq_length: + subtokens = ['[CLS]'] + subtokens[-max_seq_length + 1:] + all_input_masks[i] = [1] + all_input_masks[i][-max_seq_length + 1:] + all_slot_masks[i] = [True] + \ + all_slot_masks[i][-max_seq_length + 1:] + + if with_label: + all_slots[i] = [pad_label] + all_slots[i][-max_seq_length + 1:] + too_long_count += 1 + + all_input_ids.append([tokenizer._convert_token_to_id(t) + for t in subtokens]) + all_input_masks.append([1] * len(subtokens)) + + if len(subtokens) < max_seq_length: + extra = (max_seq_length - len(subtokens)) + all_input_ids[i] = all_input_ids[i] + [0] * extra + all_slot_masks[i] = all_slot_masks[i] + [False] * extra + all_input_masks[i] = all_input_masks[i] + [0] * extra + + if with_label: + all_slots[i] = all_slots[i] + [pad_label] * extra + + all_segment_ids.append([0] * max_seq_length) + + logger.info(f'{too_long_count} are longer than {max_seq_length}') + + return (all_input_ids, + all_segment_ids, + all_input_masks, + all_slot_masks, + all_slots) + + +class BertJointIntentSlotDataset(Dataset): + """ + Creates dataset to use for the task of joint intent + and slot classification with pretrained model. 
+ + Args: + input_file: file to sequence + label. + the first line is header (sentence [tab] label) + each line should be [sentence][tab][label] + slot_file: file to slot labels, each line corresponding to + slot labels for a sentence in input_file. No header. + max_seq_length: max sequence length (minus 2 for [CLS] and [SEP]) + tokenizer: such as BERT tokenizer. + num_samples: number of samples you want to use for the dataset. + if -1, use all dataset. + useful for testing. + shuffle: whether to shuffle + pad_label: pad value use for slot labels. + by default, it's the neural label. + + """ + + def __init__(self, + input_file, + slot_file, + max_seq_length, + tokenizer, + num_samples=-1, + shuffle=True, + pad_label=128): + if num_samples == 0: + raise ValueError("num_samples has to be positive", num_samples) + + with open(slot_file, 'r') as f: + slot_lines = f.readlines() + + with open(input_file, 'r') as f: + input_lines = f.readlines()[1:] + + assert len(slot_lines) == len(input_lines) + + dataset = list(zip(slot_lines, input_lines)) + + if shuffle or num_samples > 0: + random.shuffle(dataset) + if num_samples > 0: + dataset = dataset[:num_samples] + + raw_slots, queries, raw_intents = [], [], [] + for slot_line, input_line in dataset: + raw_slots.append([int(slot) for slot in slot_line.strip().split()]) + parts = input_line.strip().split() + raw_intents.append(int(parts[-1])) + queries.append(' '.join(parts[:-1])) + + features = get_features(queries, + max_seq_length, + tokenizer, + pad_label=pad_label, + raw_slots=raw_slots) + self.all_input_ids = features[0] + self.all_segment_ids = features[1] + self.all_input_masks = features[2] + self.all_slot_masks = features[3] + self.all_slots = features[4] + self.all_intents = raw_intents + + infold = input_file[:input_file.rfind('/')] + logger.info('Three most popular intents') + get_label_stats(self.all_intents, infold + '/intent_stats.tsv') + merged_slots = itertools.chain.from_iterable(self.all_slots) + logger.info('Three most popular slots') + get_label_stats(merged_slots, infold + '/slot_stats.tsv') + + def __len__(self): + return len(self.all_input_ids) + + def __getitem__(self, idx): + return (np.array(self.all_input_ids[idx]), + np.array(self.all_segment_ids[idx]), + np.array(self.all_input_masks[idx], dtype=np.float32), + np.array(self.all_slot_masks[idx]), + self.all_intents[idx], + np.array(self.all_slots[idx])) + + +class BertJointIntentSlotInferDataset(Dataset): + """ + Creates dataset to use for the task of joint intent + and slot classification with pretrained model. + This is to be used during inference only. + + Args: + query: the query to run inference on + max_seq_length: max sequence length (minus 2 for [CLS] and [SEP]) + tokenizer: such as BERT tokenizer. + pad_label: pad value use for slot labels. + by default, it's the neural label. 
+ + """ + + def __init__(self, + queries, + max_seq_length, + tokenizer): + + features = get_features(queries, + max_seq_length, + tokenizer) + + self.all_input_ids = features[0] + self.all_segment_ids = features[1] + self.all_input_masks = features[2] + self.all_slot_masks = features[3] + + def __len__(self): + return len(self.all_input_ids) + + def __getitem__(self, idx): + return (np.array(self.all_input_ids[idx]), + np.array(self.all_segment_ids[idx]), + np.array(self.all_input_masks[idx], dtype=np.float32), + np.array(self.all_slot_masks[idx])) diff --git a/collections/nemo_nlp/nemo_nlp/data/datasets/language_modeling.py b/collections/nemo_nlp/nemo_nlp/data/datasets/language_modeling.py new file mode 100644 index 000000000000..4732da944950 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/datasets/language_modeling.py @@ -0,0 +1,43 @@ +# Copyright 2019 AI Applications Design Team at NVIDIA. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== +"""Pytorch Dataset for training Neural Machine Translation.""" + +import numpy as np +from torch.utils.data import Dataset +from ..utils import dataset_to_ids + + +class LanguageModelingDataset(Dataset): + def __init__(self, + tokenizer, + dataset, + max_sequence_length=512, + batch_step=None): + self.tokenizer = tokenizer + self.max_seq_length = max_sequence_length + self.batch_step = batch_step or self.max_seq_length + ids = dataset_to_ids(dataset, tokenizer, add_bos_eos=False) + self.ids = np.array([j for i in ids for j in i]) + + def __len__(self): + return (len(self.ids) - self.max_seq_length) // self.batch_step + + def __getitem__(self, idx): + left = idx * self.batch_step + right = left + self.max_seq_length + src_ids = self.ids[left:right] + labels = self.ids[left + 1:right + 1] + src_mask = (src_ids != self.tokenizer.pad_id()).astype(np.float32) + return src_ids, src_mask, labels diff --git a/collections/nemo_nlp/nemo_nlp/data/datasets/ner.py b/collections/nemo_nlp/nemo_nlp/data/datasets/ner.py new file mode 100644 index 000000000000..2ec24ad4fcf5 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/datasets/ner.py @@ -0,0 +1,450 @@ +# Copyright 2018 The Google AI Language Team Authors and +# The HuggingFace Inc. team. +# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
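+# [Editor's note] Illustrative sketch, not part of the original change. The
+# eval_preds() logic below scores NER the conlleval way, on whole chunks rather
+# than single tags: a predicted chunk is correct only if both its boundaries
+# and its type match a gold chunk. With hypothetical tags:
+#
+#   gold:      B-PER I-PER O B-LOC
+#   predicted: B-PER I-PER O B-ORG
+#   -> total_chunks = 2, predicted_chunks = 2, correct_chunks = 1
+#   -> precision = recall = 0.5, F1 = 2 * p * r / (p + r) = 0.5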
+ +""" +Utility functions for NER NLP tasks +Some transformer of this code were adapted from the HuggingFace library at +https://github.com/huggingface/pytorch-pretrained-BERT +""" + +# TODO: REFACTOR to minimize code reusing + +import collections +import numpy as np +from torch.utils.data import Dataset + + +class BertNERDataset(Dataset): + def __init__(self, input_file, max_seq_length, tokenizer): + # Read the sentences and group them in sequences up to max_seq_length + with open(input_file, "r") as f: + self.seq_words = [] + self.seq_token_labels = [] + self.seq_subtokens = [] + + new_seq_words = [] + new_seq_token_labels = [] + new_seq_subtokens = [] + new_seq_subtoken_count = 0 + + lines = f.readlines() + + words = [] + tags = [] + tokens = [] + token_tags = [] + token_count = 0 + + def process_sentence(): + nonlocal new_seq_words, new_seq_token_labels, \ + new_seq_subtokens, new_seq_subtoken_count + + # The -1 accounts for [CLS] + max_tokens_for_doc = max_seq_length - 1 + + if (new_seq_subtoken_count + token_count) < max_tokens_for_doc: + new_seq_words.extend(words) + new_seq_token_labels.extend(token_tags) + new_seq_subtokens.append(tokens) + new_seq_subtoken_count += token_count + else: + self.seq_words.append(new_seq_words) + self.seq_token_labels.append(new_seq_token_labels) + self.seq_subtokens.append(new_seq_subtokens) + + new_seq_words = words + new_seq_token_labels = token_tags + new_seq_subtokens = [tokens] + new_seq_subtoken_count = token_count + + all_tags = {} + + # Collect a list of all possible tags + for line in lines: + if line == "\n": + continue + + tag = line.split()[-1] + + if tag not in tags and tag != "O": + tag = tag.split("-")[1] + all_tags[tag] = 0 + + # Create mapping of tags to tag ids that starts with "O" -> 0 and + # then increases in alphabetical order + tag_ids = {"O": 0} + + for tag in sorted(all_tags): + tag_ids[tag] = len(tag_ids) + + # Process all lines in input data + for line in lines: + if line == "\n": + # If we hit a newline, we've reached the end of a sentence + process_sentence() + words = [] + tags = [] + tokens = [] + token_tags = [] + continue + + word = line.split()[0] + tag = line.split()[-1] + + if tag != "O": + tag = tag.split("-")[1] + + word_tokens = tokenizer.text_to_tokens(word) + tag_id = tag_ids[tag] + + words.append(word) + tags.append(tag_id) + tokens.append(word_tokens) + token_tags.extend([tag_id] * len(word_tokens)) + token_count += len(word_tokens) + + self.features = convert_sequences_to_features( + self.seq_words, self.seq_subtokens, self.seq_token_labels, + tokenizer, max_seq_length) + + self.tag_ids = tag_ids + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.vocab_size = self.tokenizer.vocab_size + + def __len__(self): + return len(self.features) + + def __getitem__(self, idx): + feature = self.features[idx] + + return np.array(feature.input_ids), np.array(feature.segment_ids), \ + np.array(feature.input_mask, dtype=np.float32), \ + np.array(feature.labels), np.array(feature.seq_id) + + def eval_preds(self, logits_lists, seq_ids, tag_ids): + correct_chunks = 0 + total_chunks = 0 + predicted_chunks = 0 + + correct_tags = 0 + token_count = 0 + + lines = [] + tag_ids = {tag_ids[k]: k for k in tag_ids} + + for logits, seq_id in zip(logits_lists, seq_ids): + feature = self.features[seq_id] + masks = feature.input_mask + + try: + last_mask_index = masks.index(0) + except ValueError: + last_mask_index = len(masks) + + labels = feature.labels[:last_mask_index] + labels = 
np.array(labels[:last_mask_index]) + logits = logits[:last_mask_index] + preds = np.argmax(logits, axis=1) + + in_correct = False + + last_label = "O" + last_pred = "O" + + last_correct = "O" + last_correct_type = "" + last_guessed = "O" + last_guessed_type = "" + + previous_word_id = -1 + + # Adapted from conlleval.pl (included with CoNLL-2003 dataset) + for token_id, word_id in feature.token_to_orig_map.items(): + if word_id is not previous_word_id: + word = feature.words[word_id] + label = tag_ids[feature.labels[token_id]] + pred = tag_ids[preds[token_id]] + + if pred != "O": + guessed = "B" if last_pred != pred else "I" + guessed_type = pred + else: + guessed = pred + guessed_type = "" + + if label != "O": + correct = "B" if last_label != label else "I" + correct_type = label + else: + correct = label + correct_type = "" + + if in_correct: + if end_of_chunk(last_correct, correct, + last_correct_type, correct_type) and \ + end_of_chunk(last_guessed, guessed, + last_guessed_type, + guessed_type) and \ + last_guessed_type == last_correct_type: + in_correct = False + correct_chunks += 1 + + elif end_of_chunk(last_correct, correct, + last_correct_type, correct_type) != \ + end_of_chunk(last_guessed, guessed, + last_guessed_type, + guessed_type) or \ + guessed_type != correct_type: + in_correct = False + + if start_of_chunk(last_correct, correct, last_correct_type, + correct_type) and \ + start_of_chunk(last_guessed, guessed, + last_guessed_type, + guessed_type) and \ + guessed_type == correct_type: + in_correct = True + + if start_of_chunk(last_correct, correct, last_correct_type, + correct_type): + total_chunks += 1 + + if start_of_chunk(last_guessed, guessed, last_guessed_type, + guessed_type): + predicted_chunks += 1 + + if correct == guessed and guessed_type == correct_type: + correct_tags += 1 + + token_count += 1 + + last_label = label + last_pred = pred + + last_guessed = guessed + last_correct = correct + last_guessed_type = guessed_type + last_correct_type = correct_type + + lines.append({ + "word": word, + "label": feature.labels[token_id], + "prediction": preds[token_id] + }) + + previous_word_id = word_id + + if in_correct: + correct_chunks += 1 + + lines.append({ + "word": "" + }) + + return correct_tags, token_count, correct_chunks, predicted_chunks, \ + total_chunks, lines + + +def start_of_chunk(prev_tag, tag, prev_type, type): + if prev_tag == "B" and tag == "B": + return True + elif prev_tag == "I" and tag == "B": + return True + elif prev_tag == "O" and tag == "B": + return True + elif prev_tag == "O" and tag == "I": + return True + elif prev_tag == "O" and tag == "I": + return True + elif tag != "O" and prev_type != type: + return True + + return False + + +def end_of_chunk(prev_tag, tag, prev_type, type): + if prev_tag == "B" and tag == "B": + return True + elif prev_tag == "B" and tag == "O": + return True + elif prev_tag == "I" and tag == "B": + return True + elif prev_tag == "I" and tag == "O": + return True + elif prev_tag == "I" and tag == "O": + return True + elif prev_tag != "O" and prev_type != type: + return True + + return False + + +def convert_sequences_to_features(seqs_words, seqs_subtokens, + seqs_token_labels, tokenizer, + max_seq_length): + """Loads a data file into a list of `InputBatch`s.""" + + features = [] + for seq_id, (words, seq_subtokens, seq_token_labels) in \ + enumerate(zip(seqs_words, seqs_subtokens, seqs_token_labels)): + + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + + word_count = 0 + for sent_subtokens in 
seq_subtokens: + for word_subtokens in sent_subtokens: + orig_to_tok_index.append(len(all_doc_tokens)) + for sub_token in word_subtokens: + tok_to_orig_index.append(word_count) + all_doc_tokens.append(sub_token) + word_count += 1 + + _DocSpan = collections.namedtuple( # pylint: disable=invalid-name + "DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + length = len(all_doc_tokens) + doc_spans.append(_DocSpan(start=start_offset, length=length)) + + doc_span_index = 0 + doc_span = doc_spans[0] + + tokens = [] + token_labels = [] + token_to_orig_map = {} + token_is_max_context = {} + segment_ids = [] + tokens.append("[CLS]") + token_labels.append(0) + segment_ids.append(0) + + # Ensure that we don't go over the maximum sequence length + for i in range(min(doc_span.length, max_seq_length - 1)): + split_token_index = doc_span.start + i + token_to_orig_map[len(tokens)] = \ + tok_to_orig_index[split_token_index] + + is_max_context = _check_is_max_context(doc_spans, doc_span_index, + split_token_index) + token_is_max_context[len(tokens)] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + segment_ids.append(0) + + for label in seq_token_labels: + if len(token_labels) == len(tokens): + break + + token_labels.append(label) + + input_ids = tokenizer.tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + token_labels.append(0) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + assert len(token_labels) == max_seq_length + + features.append( + InputFeatures( + seq_id=seq_id, + doc_span_index=doc_span_index, + tokens=tokens, + words=words, + labels=token_labels, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids)) + + return features + + +def _check_is_max_context(doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + # Because of the sliding window approach taken to scoring documents, a + # single token can appear in multiple documents. E.g. + # Doc: the man went to the store and bought a gallon of milk + # Span A: the man went to the + # Span B: to the store and bought + # Span C: and bought a gallon of + # ... + # + # Now the word 'bought' will have two scores from spans B and C. We only + # want to consider the score with "maximum context", which we define as + # the *minimum* of its left and right context (the *sum* of left and + # right context will always be the same, of course). + # + # In the example the maximum context for 'bought' would be span C since + # it has 1 left context and 3 right context, while span B has 4 left + # context and 0 right context. 
+ best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + continue + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, num_right_context) + \ + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + + return cur_span_index == best_span_index + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, + seq_id, + doc_span_index, + tokens, + words, + labels, + token_to_orig_map, + token_is_max_context, + input_ids, + input_mask, + segment_ids): + self.seq_id = seq_id + self.doc_span_index = doc_span_index + self.tokens = tokens + self.words = words + self.labels = labels + self.token_to_orig_map = token_to_orig_map + self.token_is_max_context = token_is_max_context + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids diff --git a/collections/nemo_nlp/nemo_nlp/data/datasets/question_answering.py b/collections/nemo_nlp/nemo_nlp/data/datasets/question_answering.py new file mode 100644 index 000000000000..285825bc729b --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/datasets/question_answering.py @@ -0,0 +1,829 @@ +# Copyright 2018 The Google AI Language Team Authors and +# The HuggingFace Inc. team. +# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
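+# [Editor's note] Illustrative sketch, not part of the original change. In
+# get_predictions() below, a candidate answer span is scored by
+# start_logit + end_logit over the top-n indexes of each, keeping only spans
+# that start and end inside the passage, satisfy end >= start, and are at most
+# max_answer_length tokens long. With hypothetical logits over four passage
+# tokens:
+#
+#   start_logits = [0.1, 4.2, 0.3, 1.0]
+#   end_logits   = [0.2, 0.1, 3.9, 2.5]
+#   best valid span: start = 1, end = 2, score = 4.2 + 3.9 = 8.1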
+""" +Utility functions for Question/Answering NLP tasks +Some transformer of this code were adapted from the HuggingFace library at +https://github.com/huggingface/pytorch-pretrained-BERT +""" + +#TODO: REFACTOR to minimize code reusing + + +import pickle +import collections +import json +import numpy as np +from torch.utils.data import Dataset +from ...externals.tokenization import whitespace_tokenize +from ...externals.tokenization import BasicTokenizer +from ...externals.run_squad import InputFeatures, \ + _compute_softmax, _get_best_indexes, _check_is_max_context +import string +import re +import torch + + +class BertQuestionAnsweringDataset(Dataset): + def __init__(self, + input_file, + data_format, + features_file_prefix, + max_seq_length, + is_training, + tokenizer, + local_rank, + max_query_length=64, + doc_stride=128): + + # Read the context/question/answers from file with format data_format + print("Reading examples") + self.qa_examples = read_qa_examples(input_file, data_format, + is_training) + + cached_features_file = input_file + '_{0}_{1}_{2}_{3}'.format( + features_file_prefix, str(max_seq_length), str(doc_stride), + str(max_query_length)) + + print("Trying to open cached_features file:", cached_features_file) + try: + with open(cached_features_file, "rb") as reader: + self.qa_features = pickle.load(reader) + except: + print("Converting examples to features") + self.qa_features = convert_examples_to_features( + self.qa_examples, tokenizer, max_seq_length, doc_stride, + max_query_length, is_training) + + print(" Saving features into cached file %s", + cached_features_file) + if (local_rank == None or torch.distributed.get_rank() == 0): + print(f"Local rank {local_rank} writing cached file") + with open(cached_features_file, "wb") as writer: + pickle.dump(self.qa_features, writer) + # + print("Number of features=", len(self.qa_features)) + + # #TODO: Deal with distributed training + # logger.info(" Saving train features into cached file %s", + # cached_train_features_file) + # with open(cached_train_features_file, "wb") as writer: + # pickle.dump(self.qa_features, writer) + + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.vocab_size = self.tokenizer.vocab_size + self.max_query_length = max_query_length + self.doc_stride = doc_stride + self.is_training = is_training + + def __len__(self): + return len(self.qa_features) + + def __getitem__(self, idx): + + feature = self.qa_features[idx] + + return np.array(feature.input_ids), np.array(feature.segment_ids), \ + np.array(feature.input_mask, dtype=np.float32)[..., None], \ + feature.start_position, feature.end_position, feature.unique_id + + def get_predictions(self, unique_ids, start_logits, end_logits, + n_best_size, max_answer_length, do_lower_case, + version_2_with_negative, null_score_diff_threshold): + + example_index_to_features = collections.defaultdict(list) + + unique_id_to_pos = {} + for index, unique_id in enumerate(unique_ids): + unique_id_to_pos[unique_id] = index + + for feature in self.qa_features: + example_index_to_features[feature.example_index].append(feature) + + _PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", [ + "feature_index", "start_index", "end_index", "start_logit", + "end_logit" + ]) + + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + scores_diff_json = collections.OrderedDict() + + for (example_index, example) in enumerate(self.qa_examples): + + features = 
example_index_to_features[example_index] + + prelim_predictions = [] + # keep track of the minimum score of null start+end of position 0 + score_null = 1000000 # large and positive + min_null_feature_index = 0 # the paragraph slice with min null score + null_start_logit = 0 # start logit at the slice with min null score + null_end_logit = 0 # end logit at the slice with min null score + for (feature_index, feature) in enumerate(features): + pos = unique_id_to_pos[feature.unique_id] + start_indexes = _get_best_indexes(start_logits[pos], + n_best_size) + end_indexes = _get_best_indexes(end_logits[pos], n_best_size) + # if we could have irrelevant answers, + # get the min score of irrelevant + if version_2_with_negative: + feature_null_score = start_logits[pos][0] + end_logits[ + pos][0] + if feature_null_score < score_null: + score_null = feature_null_score + min_null_feature_index = feature_index + null_start_logit = start_logits[pos][0] + null_end_logit = end_logits[pos][0] + for start_index in start_indexes: + for end_index in end_indexes: + # We could hypothetically create invalid predictions, + # e.g., predict that the start of the span is in the + # question. We throw out all invalid predictions. + if start_index >= len(feature.tokens): + continue + if end_index >= len(feature.tokens): + continue + if start_index not in feature.token_to_orig_map: + continue + if end_index not in feature.token_to_orig_map: + continue + if not feature.token_is_max_context.get( + start_index, False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + _PrelimPrediction( + feature_index=feature_index, + start_index=start_index, + end_index=end_index, + start_logit=start_logits[pos][start_index], + end_logit=end_logits[pos][end_index])) + + if version_2_with_negative: + prelim_predictions.append( + _PrelimPrediction(feature_index=min_null_feature_index, + start_index=0, + end_index=0, + start_logit=null_start_logit, + end_logit=null_end_logit)) + prelim_predictions = sorted( + prelim_predictions, + key=lambda x: (x.start_logit + x.end_logit), + reverse=True) + + _NbestPrediction = collections.namedtuple( + "NbestPrediction", ["text", "start_logit", "end_logit"]) + + seen_predictions = {} + nbest = [] + for pred in prelim_predictions: + if len(nbest) >= n_best_size: + break + feature = features[pred.feature_index] + if pred.start_index > 0: # this is a non-null prediction + tok_tokens = feature.tokens[pred.start_index:( + pred.end_index + 1)] + orig_doc_start = feature.token_to_orig_map[ + pred.start_index] + orig_doc_end = feature.token_to_orig_map[pred.end_index] + orig_tokens = example.doc_tokens[orig_doc_start:( + orig_doc_end + 1)] + tok_text = " ".join(tok_tokens) + + # De-tokenize WordPieces that have been split off. 
+ tok_text = tok_text.replace(" ##", "") + tok_text = tok_text.replace("##", "") + + # Clean whitespace + tok_text = tok_text.strip() + tok_text = " ".join(tok_text.split()) + orig_text = " ".join(orig_tokens) + + final_text = get_final_text(tok_text, orig_text, + do_lower_case) + if final_text in seen_predictions: + continue + + seen_predictions[final_text] = True + else: + final_text = "" + seen_predictions[final_text] = True + + nbest.append( + _NbestPrediction(text=final_text, + start_logit=pred.start_logit, + end_logit=pred.end_logit)) + # if we didn't include the empty option in the n-best, include it + if version_2_with_negative: + if "" not in seen_predictions: + nbest.append( + _NbestPrediction(text="", + start_logit=null_start_logit, + end_logit=null_end_logit)) + + # In very rare edge cases we could only have single null pred. + # We just create a nonce prediction in this case to avoid failure. + if len(nbest) == 1: + nbest.insert( + 0, + _NbestPrediction(text="empty", + start_logit=0.0, + end_logit=0.0)) + + # In very rare edge cases we could have no valid predictions. So we + # just create a nonce prediction in this case to avoid failure. + if not nbest: + nbest.append( + _NbestPrediction(text="empty", + start_logit=0.0, + end_logit=0.0)) + + assert len(nbest) >= 1 + + total_scores = [] + best_non_null_entry = None + for entry in nbest: + total_scores.append(entry.start_logit + entry.end_logit) + if not best_non_null_entry: + if entry.text: + best_non_null_entry = entry + + probs = _compute_softmax(total_scores) + + nbest_json = [] + for (i, entry) in enumerate(nbest): + output = collections.OrderedDict() + output["text"] = entry.text + output["probability"] = probs[i] + output["start_logit"] = entry.start_logit + output["end_logit"] = entry.end_logit + nbest_json.append(output) + + assert len(nbest_json) >= 1 + + if not version_2_with_negative: + all_predictions[example.qas_id] = nbest_json[0]["text"] + else: + # predict "" iff the null score - + # the score of best non-null > threshold + score_diff = score_null - best_non_null_entry.start_logit - ( + best_non_null_entry.end_logit) + scores_diff_json[example.qas_id] = score_diff + if score_diff > null_score_diff_threshold: + all_predictions[example.qas_id] = "" + else: + all_predictions[example.qas_id] = best_non_null_entry.text + all_nbest_json[example.qas_id] = nbest_json + + # with open("output_predictions.json", "w") as writer: + # writer.write(json.dumps(all_predictions, indent=4) + "\n") + + return all_predictions, all_nbest_json, scores_diff_json + + def evaluate_predictions(self, all_predictions): + + exact_match = 0. + f1 = 0. 
+ # Loop over all the examples and evaluate the predictions + for example in self.qa_examples: + + qas_id = example.qas_id + if qas_id not in all_predictions: + continue + + ground_truths = example.ground_truths + prediction = all_predictions[qas_id] + + exact_match += metric_max_over_ground_truths( + exact_match_score, prediction, ground_truths) + + f1 += metric_max_over_ground_truths(f1_score, prediction, + ground_truths) + + exact_match = 100.0 * exact_match / len(self.qa_examples) + f1 = 100.0 * f1 / len(self.qa_examples) + + return exact_match, f1 + + def calculate_exact_match_and_f1(self, unique_ids, start_logits, + end_logits, n_best_size, + max_answer_length, do_lower_case, + version_2_with_negative, + null_score_diff_threshold): + + all_predictions, all_nbest_json, scores_diff_json = \ + self.get_predictions(unique_ids, start_logits, end_logits, n_best_size, + max_answer_length, do_lower_case, + version_2_with_negative, null_score_diff_threshold) + + exact_match, f1 = self.evaluate_predictions(all_predictions) + + return exact_match, f1 + + +class SquadExample(object): + """ + A single training/test example for the Squad dataset. + For examples without an answer, the start and end position are -1. + """ + + def __init__(self, + qas_id, + question_text, + ground_truths, + doc_tokens, + orig_answer_text=None, + start_position=None, + end_position=None, + is_impossible=None): + self.qas_id = qas_id + self.question_text = question_text + self.ground_truths = ground_truths + self.doc_tokens = doc_tokens + self.orig_answer_text = orig_answer_text + self.start_position = start_position + self.end_position = end_position + self.is_impossible = is_impossible + + def __str__(self): + return self.__repr__() + + def __repr__(self): + s = "" + s += "qas_id: %s" % (self.qas_id) + s += ", question_text: %s" % (self.question_text) + s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens)) + if self.start_position: + s += ", start_position: %d" % (self.start_position) + if self.end_position: + s += ", end_position: %d" % (self.end_position) + if self.is_impossible: + s += ", is_impossible: %r" % (self.is_impossible) + return s + + +def convert_examples_to_features(examples, tokenizer, max_seq_length, + doc_stride, max_query_length, is_training): + """Loads a data file into a list of `InputBatch`s.""" + + unique_id = 1000000000 + + features = [] + for (example_index, example) in enumerate(examples): + + query_tokens = tokenizer.text_to_tokens(example.question_text) + + if len(query_tokens) > max_query_length: + query_tokens = query_tokens[0:max_query_length] + + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + for (i, token) in enumerate(example.doc_tokens): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = tokenizer.text_to_tokens(token) + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + + tok_start_position = None + tok_end_position = None + if is_training and example.is_impossible: + tok_start_position = -1 + tok_end_position = -1 + if is_training and not example.is_impossible: + tok_start_position = orig_to_tok_index[example.start_position] + if example.end_position < len(example.doc_tokens) - 1: + tok_end_position = orig_to_tok_index[example.end_position + + 1] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = _improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, + tokenizer, example.orig_answer_text) + + # The -3 accounts for [CLS], [SEP] 
and [SEP] + max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 + + # We can have documents that are longer than the maximum sequence length. + # To deal with this we do a sliding window approach, where we take chunks + # of the up to our max length with a stride of `doc_stride`. + _DocSpan = collections.namedtuple( # pylint: disable=invalid-name + "DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = [] + token_to_orig_map = {} + token_is_max_context = {} + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in query_tokens: + tokens.append(token) + segment_ids.append(0) + tokens.append("[SEP]") + segment_ids.append(0) + + for i in range(doc_span.length): + split_token_index = doc_span.start + i + token_to_orig_map[len(tokens)] = \ + tok_to_orig_index[split_token_index] + + is_max_context = _check_is_max_context(doc_spans, + doc_span_index, + split_token_index) + token_is_max_context[len(tokens)] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + input_ids = tokenizer.tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + start_position = 0 + end_position = 0 + if is_training and not example.is_impossible: + # For training, if our document chunk does not contain an + # annotation we throw it out, since there is nothing to predict. 
+ doc_start = doc_span.start + doc_end = doc_span.start + doc_span.length - 1 + out_of_span = False + if not (tok_start_position >= doc_start + and tok_end_position <= doc_end): + out_of_span = True + if out_of_span: + start_position = 0 + end_position = 0 + else: + doc_offset = len(query_tokens) + 2 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + + if is_training and example.is_impossible: + start_position = 0 + end_position = 0 + + if example_index < 1: + print("*** Example ***") + print("unique_id: %s" % unique_id) + print("example_index: %s" % example_index) + print("doc_span_index: %s" % doc_span_index) + print("tokens: %s" % " ".join(tokens)) + print("token_to_orig_map: %s" % " ".join( + ["%d:%d" % (x, y) + for (x, y) in token_to_orig_map.items()])) + print("token_is_max_context: %s" % " ".join([ + "%d:%s" % (x, y) + for (x, y) in token_is_max_context.items() + ])) + print("input_ids: %s" % " ".join([str(x) for x in input_ids])) + print("input_mask: %s" % " ".join([str(x) + for x in input_mask])) + print("segment_ids: %s" % + " ".join([str(x) for x in segment_ids])) + if is_training and example.is_impossible: + print("impossible example") + if is_training and not example.is_impossible: + answer_text = " ".join( + tokens[start_position:(end_position + 1)]) + print("start_position: %d" % start_position) + print("end_position: %d" % end_position) + print("answer: %s" % answer_text) + + features.append( + InputFeatures(unique_id=unique_id, + example_index=example_index, + doc_span_index=doc_span_index, + tokens=tokens, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + start_position=start_position, + end_position=end_position, + is_impossible=example.is_impossible)) + + unique_id += 1 + + return features + + +def read_qa_examples(input_file, data_format, is_training): + if data_format.lower() == "squad_json": + examples = read_squad_examples(input_file, is_training) + else: + raise ValueError(f"Invalid format in QADataLayerForPretrainedModel: " + f"{data_format}") + + return examples + + +def read_squad_examples(input_file, is_training): + version_2_with_negative = False + """Read a SQuAD json file into a list of SquadExample.""" + with open(input_file, "r", encoding='utf-8') as reader: + input_data = json.load(reader)["data"] + + def is_whitespace(c): + if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: + return True + return False + + examples = [] + for entry in input_data: + for paragraph in entry["paragraphs"]: + paragraph_text = paragraph["context"] + doc_tokens = [] + char_to_word_offset = [] + prev_is_whitespace = True + for c in paragraph_text: + if is_whitespace(c): + prev_is_whitespace = True + else: + if prev_is_whitespace: + doc_tokens.append(c) + else: + doc_tokens[-1] += c + prev_is_whitespace = False + char_to_word_offset.append(len(doc_tokens) - 1) + + for qa in paragraph["qas"]: + qas_id = qa["id"] + question_text = qa["question"] + start_position = -1 + end_position = -1 + orig_answer_text = None + is_impossible = False + ground_truths = [] + if is_training: + if version_2_with_negative: + is_impossible = qa["is_impossible"] + if (len(qa["answers"]) != 1) and (not is_impossible): + raise ValueError( + "For training, each question should have " + "exactly 1 answer.") + if not is_impossible: + answer = qa["answers"][0] + orig_answer_text = answer["text"] + 
answer_offset = answer["answer_start"] + answer_length = len(orig_answer_text) + start_position = char_to_word_offset[answer_offset] + end_position = char_to_word_offset[answer_offset + + answer_length - 1] + # Only add answers where the text can be exactly + # recovered from the document. If this CAN'T happen + # it's likely due to weird Unicode + # stuff so we will just skip the example. + # + # Note that this means for training mode, every example + # is NOT guaranteed to be preserved. + actual_text = " ".join( + doc_tokens[start_position:(end_position + 1)]) + cleaned_answer_text = " ".join( + whitespace_tokenize(orig_answer_text)) + if actual_text.find(cleaned_answer_text) == -1: + print("Could not find answer: '%s' vs. '%s'", + actual_text, cleaned_answer_text) + continue + else: + start_position = -1 + end_position = -1 + orig_answer_text = "" + else: # Eval data set + # Store the potential answers in examples + ground_truths = list( + map(lambda x: x['text'], qa['answers'])) + + example = SquadExample(qas_id=qas_id, + question_text=question_text, + ground_truths=ground_truths, + doc_tokens=doc_tokens, + orig_answer_text=orig_answer_text, + start_position=start_position, + end_position=end_position, + is_impossible=is_impossible) + examples.append(example) + + return examples + + +def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, + orig_answer_text): + """Returns tokenized answer spans that better match the + annotated answer.""" + + # The SQuAD annotations are character based. We first project them to + # whitespace-tokenized words. But then after WordPiece tokenization, we + # can often find a "better match". For example: + # + # Question: What year was John Smith born? + # Context: The leader was John Smith (1895-1943). + # Answer: 1895 + # + # The original whitespace-tokenized answer will be "(1895-1943).". However + # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can + # match the exact answer, 1895. + # + # However, this is not always possible. Consider the following: + # + # Question: What country is the top exporter of electornics? + # Context: The Japanese electronics industry is the lagest in the world. + # Answer: Japan + # + # In this case, the annotator chose "Japan" as a character sub-span of + # the word "Japanese". Since our WordPiece tokenizer does not split + # "Japanese", we just use "Japanese" as the annotation. This is fairly rare + # in SQuAD, but does happen. 
+ tok_answer_text = " ".join(tokenizer.text_to_tokens(orig_answer_text)) + + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = " ".join(doc_tokens[new_start:(new_end + 1)]) + if text_span == tok_answer_text: + return new_start, new_end + + return input_start, input_end + + +def normalize_answer(s): + """Lower text and remove punctuation, articles and extra whitespace.""" + + def remove_articles(text): + return re.sub(r'\b(a|an|the)\b', ' ', text) + + def white_space_fix(text): + return ' '.join(text.split()) + + def remove_punc(text): + exclude = set(string.punctuation) + return ''.join(ch for ch in text if ch not in exclude) + + def lower(text): + return text.lower() + + return white_space_fix(remove_articles(remove_punc(lower(s)))) + + +def f1_score(prediction, ground_truth): + prediction_tokens = normalize_answer(prediction).split() + ground_truth_tokens = normalize_answer(ground_truth).split() + common = collections.Counter(prediction_tokens) & \ + collections.Counter(ground_truth_tokens) + num_same = sum(common.values()) + if num_same == 0: + return 0 + precision = 1.0 * num_same / len(prediction_tokens) + recall = 1.0 * num_same / len(ground_truth_tokens) + f1 = (2 * precision * recall) / (precision + recall) + return f1 + + +def exact_match_score(prediction, ground_truth): + return normalize_answer(prediction) == normalize_answer(ground_truth) + + +def metric_max_over_ground_truths(metric_fn, prediction, ground_truths): + scores_for_ground_truths = [] + for ground_truth in ground_truths: + score = metric_fn(prediction, ground_truth) + scores_for_ground_truths.append(score) + return max(scores_for_ground_truths) + + +def get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=False): + """Project the tokenized prediction back to the original text.""" + + # When we created the data, we kept track of the alignment between original + # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So + # now `orig_text` contains the span of our original text corresponding to + # the span that we predicted. + # + # However, `orig_text` may contain extra characters that we don't want in + # our prediction. + # + # For example, let's say: + # pred_text = steve smith + # orig_text = Steve Smith's + # + # We don't want to return `orig_text` because it contains the extra "'s". + # + # We don't want to return `pred_text` because it's already been normalized + # (the SQuAD eval script also does punctuation stripping/lower casing but + # our tokenizer does additional normalization like stripping accent + # characters). + # + # What we really want to return is "Steve Smith". + # + # Therefore, we have to apply a semi-complicated alignment heuristic + # between `pred_text` and `orig_text` to get a character-to-character + # alignment. This can fail in certain cases in which case we just return + # `orig_text`. + + def _strip_spaces(text): + ns_chars = [] + ns_to_s_map = collections.OrderedDict() + for (i, c) in enumerate(text): + if c == " ": + continue + ns_to_s_map[len(ns_chars)] = i + ns_chars.append(c) + ns_text = "".join(ns_chars) + return ns_text, ns_to_s_map + + # We first tokenize `orig_text`, strip whitespace from the result + # and `pred_text`, and check if they are the same length. If they are + # NOT the same length, the heuristic has failed. If they are the same + # length, we assume the characters are one-to-one aligned. 
+ tokenizer = BasicTokenizer(do_lower_case=do_lower_case) + + tok_text = " ".join(tokenizer.tokenize(orig_text)) + + start_position = tok_text.find(pred_text) + if start_position == -1: + if verbose_logging: + print("Unable to find text: '%s' in '%s'" % (pred_text, orig_text)) + return orig_text + end_position = start_position + len(pred_text) - 1 + + (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text) + (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text) + + if len(orig_ns_text) != len(tok_ns_text): + if verbose_logging: + print("Length not equal after stripping spaces: '%s' vs '%s'", + orig_ns_text, tok_ns_text) + return orig_text + + # We then project the characters in `pred_text` back to `orig_text` using + # the character-to-character alignment. + tok_s_to_ns_map = {} + for (i, tok_index) in tok_ns_to_s_map.items(): + tok_s_to_ns_map[tok_index] = i + + orig_start_position = None + if start_position in tok_s_to_ns_map: + ns_start_position = tok_s_to_ns_map[start_position] + if ns_start_position in orig_ns_to_s_map: + orig_start_position = orig_ns_to_s_map[ns_start_position] + + if orig_start_position is None: + if verbose_logging: + print("Couldn't map start position") + return orig_text + + orig_end_position = None + if end_position in tok_s_to_ns_map: + ns_end_position = tok_s_to_ns_map[end_position] + if ns_end_position in orig_ns_to_s_map: + orig_end_position = orig_ns_to_s_map[ns_end_position] + + if orig_end_position is None: + if verbose_logging: + print("Couldn't map end position") + return orig_text + + output_text = orig_text[orig_start_position:(orig_end_position + 1)] + return output_text diff --git a/collections/nemo_nlp/nemo_nlp/data/datasets/sentence_classification.py b/collections/nemo_nlp/nemo_nlp/data/datasets/sentence_classification.py new file mode 100644 index 000000000000..271c553891ee --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/datasets/sentence_classification.py @@ -0,0 +1,181 @@ +# Copyright 2018 The Google AI Language Team Authors and +# The HuggingFace Inc. team. +# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
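As a quick sanity check on the SQuAD-style metrics defined in question_answering.py above (normalize_answer, f1_score, exact_match_score), here is a small usage sketch; the metric functions are re-implemented locally so the snippet is self-contained, and the prediction/ground-truth strings are made up:

import collections
import re
import string

def normalize_answer(s):
    # Lower-case, strip punctuation and articles, collapse whitespace.
    s = "".join(ch for ch in s.lower() if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

ground_truths = ["Eiffel Tower", "the Eiffel Tower"]
prediction = "The Eiffel Tower!"
# Both metrics take the max over all ground truths, as in
# metric_max_over_ground_truths above.
print(max(exact_match_score(prediction, gt) for gt in ground_truths))  # True
print(max(f1_score(prediction, gt) for gt in ground_truths))           # 1.0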
+ +""" +Utility functions for Token Classification NLP tasks +Some parts of this code were adapted from the HuggingFace library at +https://github.com/huggingface/pytorch-pretrained-BERT +""" + +import collections +import logging +import os +import random +import random +import string +import time + +import numpy as np +from torch.utils.data import Dataset + +logger = logging.getLogger('log') + + +def get_stats(lengths): + lengths = np.asarray(lengths) + logger.info(f'Min: {np.min(lengths)} | \ + Max: {np.max(lengths)} | \ + Mean: {np.mean(lengths)} | \ + Median: {np.median(lengths)}') + print(f'75 percentile: {np.percentile(lengths, 75)} | \ + 99 percentile: {np.percentile(lengths, 99)}') + + +def list2str(l): + return ' '.join([str(x) for x in l]) + + +class BertSentenceClassificationDataset(Dataset): + def __init__(self, + input_file, + max_seq_length, + tokenizer, + num_samples=-1, + shuffle=True): + with open(input_file, "r") as f: + sent_labels, all_sent_subtokens = [], [] + sent_lengths = [] + too_long_count = 0 + + lines = f.readlines()[1:] + print(input_file, len(lines)) + + if shuffle or num_samples > -1: + random.seed(0) + random.shuffle(lines) + if num_samples > 0: + lines = lines[:num_samples] + + for index, line in enumerate(lines): + if index % 20000 == 0: + logger.debug(f"Processing line {index}/{len(lines)}") + + sent_label = int(line.split()[-1]) + sent_labels.append(sent_label) + sent_words = line.strip().split()[:-1] + sent_subtokens = ['[CLS]'] + + for word in sent_words: + word_tokens = tokenizer.tokenize(word) + sent_subtokens.extend(word_tokens) + + sent_subtokens.append('[SEP]') + + all_sent_subtokens.append(sent_subtokens) + sent_lengths.append(len(sent_subtokens)) + + get_stats(sent_lengths) + self.max_seq_length = min(max_seq_length, max(sent_lengths)) + + for i in range(len(all_sent_subtokens)): + if len(all_sent_subtokens[i]) > self.max_seq_length: + shorten_sent = all_sent_subtokens[i][-self.max_seq_length+1:] + all_sent_subtokens[i] = ['[CLS]'] + shorten_sent + too_long_count += 1 + + logger.info(f'{too_long_count} out of {len(sent_lengths)} \ + sentencess with more than {max_seq_length} subtokens.') + + self.convert_sequences_to_features(all_sent_subtokens, + sent_labels, + tokenizer, + self.max_seq_length) + + self.tokenizer = tokenizer + self.vocab_size = self.tokenizer.vocab_size + + def __len__(self): + return len(self.features) + + def __getitem__(self, idx): + + feature = self.features[idx] + + return (np.array(feature.input_ids), + np.array(feature.segment_ids), + np.array(feature.input_mask, dtype=np.float32), + feature.sent_label) + + def convert_sequences_to_features(self, + all_sent_subtokens, + sent_labels, + tokenizer, + max_seq_length): + """Loads a data file into a list of `InputBatch`s. + """ + + self.features = [] + for sent_id in range(len(all_sent_subtokens)): + sent_subtokens = all_sent_subtokens[sent_id] + sent_label = sent_labels[sent_id] + word_count = 0 + # input_ids = tokenizer.tokens_to_ids(sent_subtokens) + input_ids = [tokenizer._convert_token_to_id( + t) for t in sent_subtokens] + + # The mask has 1 for real tokens and 0 for padding tokens. + # Only real tokens are attended to. + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. 
+ while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids = [0] * max_seq_length + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + + if sent_id == 0: + logger.info("*** Example ***") + logger.info("example_index: %s" % sent_id) + logger.info("subtokens: %s" % " ".join(sent_subtokens)) + logger.info("sent_label: %s" % sent_label) + logger.info("input_ids: %s" % list2str(input_ids)) + logger.info("input_mask: %s" % list2str(input_mask)) + + self.features.append(InputFeatures( + sent_id=sent_id, + sent_label=sent_label, + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids)) + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, + sent_id, + sent_label, + input_ids, + input_mask, + segment_ids): + self.sent_id = sent_id + self.sent_label = sent_label + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids diff --git a/collections/nemo_nlp/nemo_nlp/data/datasets/token_classification.py b/collections/nemo_nlp/nemo_nlp/data/datasets/token_classification.py new file mode 100644 index 000000000000..032ee77e923f --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/datasets/token_classification.py @@ -0,0 +1,319 @@ +# Copyright 2018 The Google AI Language Team Authors and +# The HuggingFace Inc. team. +# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
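For reference, the padding and masking scheme used by BertSentenceClassificationDataset above boils down to the following minimal sketch; the vocabulary is a toy stand-in for the real BERT WordPiece tokenizer and the sentence is made up:

import numpy as np

# Toy stand-in vocabulary; the real dataset uses a BERT WordPiece tokenizer.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "great": 3, "movie": 4}

def make_features(subtokens, max_seq_length):
    """Pad token ids and build the attention mask, mirroring the dataset above."""
    input_ids = [vocab[t] for t in subtokens]
    input_mask = [1] * len(input_ids)           # 1 for real tokens
    while len(input_ids) < max_seq_length:      # zero-pad up to max_seq_length
        input_ids.append(0)
        input_mask.append(0)
    segment_ids = [0] * max_seq_length          # single-sentence input: all zeros
    return (np.array(input_ids),
            np.array(segment_ids),
            np.array(input_mask, dtype=np.float32))

ids, segs, mask = make_features(["[CLS]", "great", "movie", "[SEP]"], max_seq_length=8)
print(ids)   # [1 3 4 2 0 0 0 0]
print(mask)  # [1. 1. 1. 1. 0. 0. 0. 0.]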
+""" +Utility functions for Token Classification NLP tasks +Some transformer of this code were adapted from the HuggingFace library at +https://github.com/huggingface/pytorch-pretrained-BERT +""" + +#TODO: REFACTOR to minimize code reusing + + +import collections +import numpy as np +from torch.utils.data import Dataset +import string +import re +import random +from ...externals.run_squad import _check_is_max_context + + +def remove_punctuation_from_sentence(sentence): + sentence = re.sub('[' + string.punctuation + ']', '', sentence) + sentence = sentence.lower() + return sentence + + +class BertTokenClassificationDataset(Dataset): + def __init__(self, input_file, max_seq_length, tokenizer): + + # Read the sentences and group them in sequences up to max_seq_length + with open(input_file, "r") as f: + self.seq_words = [] + self.seq_token_labels = [] + self.seq_sentence_labels = [] + self.seq_subtokens = [] + + new_seq_words = [] + new_seq_token_labels = [] + new_seq_sentence_labels = [] + new_seq_subtokens = [] + new_seq_subtoken_count = 0 + + lines = f.readlines() + random.seed(0) + random.shuffle(lines) + + for index, line in enumerate(lines): + + if index % 20000 == 0: + print(f"processing line {index}/{len(lines)}") + + sentence_label = line.split()[0] + sentence = line.split()[2:] + sentence = " ".join(sentence) + # Remove punctuation + sentence = remove_punctuation_from_sentence(sentence) + sentence_words = sentence.split() + + sentence_subtoken_count = 0 + sentence_subtokens = [] + for word in sentence_words: + word_tokens = tokenizer.text_to_tokens(word) + sentence_subtokens.append(word_tokens) + sentence_subtoken_count += len(word_tokens) + + sentence_token_labels = [0] * sentence_subtoken_count + sentence_token_labels[0] = 1 + + # The -1 accounts for [CLS] + max_tokens_for_doc = max_seq_length - 1 + + if (new_seq_subtoken_count + sentence_subtoken_count) < \ + max_tokens_for_doc: + + new_seq_words.extend(sentence_words) + new_seq_token_labels.extend(sentence_token_labels) + new_seq_sentence_labels.append(sentence_label) + new_seq_subtokens.append(sentence_subtokens) + new_seq_subtoken_count += sentence_subtoken_count + + else: + self.seq_words.append(new_seq_words) + self.seq_token_labels.append(new_seq_token_labels) + self.seq_sentence_labels.append(new_seq_sentence_labels) + self.seq_subtokens.append(new_seq_subtokens) + + new_seq_words = sentence_words + new_seq_token_labels = sentence_token_labels + new_seq_sentence_labels = [sentence_label] + new_seq_subtokens = [sentence_subtokens] + new_seq_subtoken_count = sentence_subtoken_count + + self.features = convert_sequences_to_features( + self.seq_words, self.seq_subtokens, self.seq_token_labels, + self.seq_sentence_labels, tokenizer, max_seq_length) + + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.vocab_size = self.tokenizer.vocab_size + + def __len__(self): + return len(self.features) + + def __getitem__(self, idx): + + feature = self.features[idx] + + return np.array(feature.input_ids), np.array(feature.segment_ids), \ + np.array(feature.input_mask, dtype=np.float32)[..., None], \ + np.array(feature.labels), np.array(feature.seq_id) + + def eval_preds(self, logits_lists, seq_ids): + + # Count the number of correct and incorrect predictions + correct_labels = 0 + incorrect_labels = 0 + + correct_preds = 0 + total_preds = 0 + total_correct = 0 + + for logits, seq_id in zip(logits_lists, seq_ids): + + feature = self.features[seq_id] + + masks = feature.input_mask + last_mask_index = masks.index(0) + 
labels = feature.labels[:last_mask_index] + labels = labels[:last_mask_index] + logits = logits[:last_mask_index] + + preds = [1 if (a[1] > a[0]) else 0 for a in logits] + + correct_preds = 0 + correct_labels = 0 + for label, pred in zip(labels, preds): + if pred == label: + correct_labels += 1 + if pred == 1: + correct_preds += 1 + + total_preds = preds.count(1) + total_correct = labels.count(1) + incorrect_labels = len(labels) - correct_labels + + if seq_id < 1: + previous_word_id = -1 + predicted_seq = "" + correct_seq = "" + unpunctuated_seq = "" + + for token_id, word_id in feature.token_to_orig_map.items(): + + word = feature.words[word_id] + + if word_id is not previous_word_id: + # New words has been found, handle it + if feature.labels[token_id] is 1: + if previous_word_id is not -1: + correct_seq += ". " + correct_seq += word.capitalize() + else: + correct_seq += " " + word + + if preds[token_id] is 1: + if previous_word_id is not -1: + predicted_seq += ". " + predicted_seq += word.capitalize() + else: + predicted_seq += " " + word + + unpunctuated_seq += " " + word + + previous_word_id = word_id + + print("unpunctuated_seq:\n", unpunctuated_seq) + print("correct_seq:\n", correct_seq) + print("pred_seq:\n", predicted_seq) + + return correct_labels, incorrect_labels, correct_preds, total_preds, \ + total_correct + + +def convert_sequences_to_features(seqs_words, seqs_subtokens, + seqs_token_labels, seqs_sentence_labels, + tokenizer, max_seq_length): + """Loads a data file into a list of `InputBatch`s.""" + + features = [] + for seq_id, (words, seq_subtokens, seq_token_labels, sentence_labels) in \ + enumerate(zip(seqs_words, seqs_subtokens, seqs_token_labels, + seqs_sentence_labels)): + + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + + word_count = 0 + for sent_subtokens in seq_subtokens: + for word_subtokens in sent_subtokens: + orig_to_tok_index.append(len(all_doc_tokens)) + for sub_token in word_subtokens: + tok_to_orig_index.append(word_count) + all_doc_tokens.append(sub_token) + word_count += 1 + + _DocSpan = collections.namedtuple( # pylint: disable=invalid-name + "DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + length = len(all_doc_tokens) + doc_spans.append(_DocSpan(start=start_offset, length=length)) + + doc_span_index = 0 + doc_span = doc_spans[0] + + tokens = [] + token_labels = [] + token_to_orig_map = {} + token_is_max_context = {} + segment_ids = [] + tokens.append("[CLS]") + token_labels.append(0) + segment_ids.append(0) + + for i in range(doc_span.length): + split_token_index = doc_span.start + i + token_to_orig_map[len( + tokens)] = tok_to_orig_index[split_token_index] + + is_max_context = _check_is_max_context(doc_spans, doc_span_index, + split_token_index) + token_is_max_context[len(tokens)] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + segment_ids.append(0) + + for label in seq_token_labels: + token_labels.append(label) + + input_ids = tokenizer.tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. 
+ while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + token_labels.append(0) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + assert len(token_labels) == max_seq_length + + if seq_id < 1: + print("*** Example ***") + print("example_index: %s" % seq_id) + print("doc_span_index: %s" % doc_span_index) + print("tokens: %s" % " ".join(tokens)) + print("words: %s" % " ".join(words)) + print("labels: %s" % " ".join(str(token_labels))) + print("sentence_labels: %s" % " ".join(sentence_labels)) + print("token_to_orig_map: %s" % " ".join( + ["%d:%d" % (x, y) for (x, y) in token_to_orig_map.items()])) + print("token_is_max_context: %s" % " ".join( + ["%d:%s" % (x, y) for (x, y) in token_is_max_context.items()])) + print("input_ids: %s" % " ".join([str(x) for x in input_ids])) + print("input_mask: %s" % " ".join([str(x) for x in input_mask])) + print("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) + + features.append( + InputFeatures(seq_id=seq_id, + doc_span_index=doc_span_index, + tokens=tokens, + words=words, + labels=token_labels, + sentence_labels=sentence_labels, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids)) + + return features + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, seq_id, doc_span_index, tokens, words, labels, + sentence_labels, token_to_orig_map, token_is_max_context, + input_ids, input_mask, segment_ids): + self.seq_id = seq_id + self.doc_span_index = doc_span_index + self.tokens = tokens + self.words = words + self.labels = labels + self.sentence_labels = sentence_labels + self.token_to_orig_map = token_to_orig_map + self.token_is_max_context = token_is_max_context + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids diff --git a/collections/nemo_nlp/nemo_nlp/data/datasets/translation.py b/collections/nemo_nlp/nemo_nlp/data/datasets/translation.py new file mode 100644 index 000000000000..29a31a177014 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/datasets/translation.py @@ -0,0 +1,163 @@ +# Copyright 2019 AI Applications Design Team at NVIDIA. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
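Similarly, the labeling scheme used by BertTokenClassificationDataset above marks only the first subtoken of each (lower-cased, punctuation-stripped) sentence with label 1. A minimal sketch, with whitespace splitting standing in for WordPiece tokenization and made-up sentences:

# Each subtoken gets label 1 only if it starts a sentence (a boundary marker);
# all other subtokens get label 0.
sentences = ["hello there", "how are you"]

def label_sentence_starts(sentences, tokenize=str.split):
    tokens, labels = [], []
    for sent in sentences:
        subtokens = tokenize(sent)           # stand-in for WordPiece tokenization
        sent_labels = [0] * len(subtokens)
        sent_labels[0] = 1                   # first subtoken of the sentence
        tokens.extend(subtokens)
        labels.extend(sent_labels)
    return tokens, labels

tokens, labels = label_sentence_starts(sentences)
print(tokens)  # ['hello', 'there', 'how', 'are', 'you']
print(labels)  # [1, 0, 1, 0, 0]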
+# ============================================================================== +"""Pytorch Dataset for training Neural Machine Translation.""" + +import numpy as np +from torch.utils.data import Dataset +from collections import OrderedDict +from ..utils import dataset_to_ids, clean_src_and_target + + +class TranslationDataset(Dataset): + def __init__(self, + tokenizer_src, + tokenizer_tgt, + dataset_src, + dataset_tgt, + tokens_in_batch=1024, + clean=False): + + self.src_tokenizer = tokenizer_src + self.tgt_tokenizer = tokenizer_tgt + self.tokens_in_batch = tokens_in_batch + + src_ids = dataset_to_ids(dataset_src, tokenizer_src) + tgt_ids = dataset_to_ids(dataset_tgt, tokenizer_tgt) + if clean: + src_ids, tgt_ids = clean_src_and_target(src_ids, tgt_ids) + self.batch_indices = self.pack_data_into_batches(src_ids, tgt_ids) + self.batches = self.pad_batches(src_ids, tgt_ids, self.batch_indices) + + def __len__(self): + return len(self.batches) + + def __getitem__(self, idx): + src_ids = self.batches[idx]["src"] + tgt = self.batches[idx]["tgt"] + labels = tgt[:, 1:] + tgt_ids = tgt[:, :-1] + src_mask = (src_ids != self.src_tokenizer.pad_id()).astype(np.int32) + tgt_mask = (tgt_ids != self.tgt_tokenizer.pad_id()).astype(np.int32) + sent_ids = self.batch_indices[idx] + return src_ids, src_mask, tgt_ids, tgt_mask, labels, sent_ids + + def pad_batches(self, src_ids, tgt_ids, batch_indices): + """ + Augments source and target ids in the batches with padding symbol + to make the lengths of all sentences in the batches equal. + """ + + batches = {} + for batch_idx, b in enumerate(batch_indices): + src_len = max([len(src_ids[i]) for i in b]) + tgt_len = max([len(tgt_ids[i]) for i in b]) + src_ids_ = self.src_tokenizer.pad_id() * np.ones( + (len(b), src_len), dtype=np.int) + tgt_ids_ = self.tgt_tokenizer.pad_id() * np.ones( + (len(b), tgt_len), dtype=np.int) + for i, sentence_idx in enumerate(b): + src_ids_[i][:len(src_ids[sentence_idx] + )] = src_ids[sentence_idx] + tgt_ids_[i][:len(tgt_ids[sentence_idx] + )] = tgt_ids[sentence_idx] + batches[batch_idx] = {"src": src_ids_, "tgt": tgt_ids_} + return batches + + def pack_data_into_batches(self, src_ids, tgt_ids): + """ + Takes two lists of source and target sentences, sorts them, and packs + into batches to minimize the use of padding tokens. 
Returns a list of + batches where each batch contains indices of sentences included into it + """ + + # create buckets sorted by the number of src tokens + # each bucket is also sorted by the number of tgt tokens + buckets = {} + for i in range(len(src_ids)): + src_len, tgt_len = len(src_ids[i]), len(tgt_ids[i]) + if src_len not in buckets.keys(): + buckets[src_len] = [(tgt_len, i)] + else: + buckets[src_len].append((tgt_len, i)) + for b in buckets: + buckets[b] = sorted(buckets[b]) + buckets = OrderedDict(sorted(buckets.items())) + + indices = list(buckets.keys()) + + batches = [[]] + num_batches = 0 + batch_size = 0 + i = 0 + src_len = 0 + tgt_len = 0 + + while i < len(buckets.keys()): + + while buckets[indices[i]]: + + i_src = max(src_len, indices[i]) + i_tgt = max(tgt_len, buckets[indices[i]][0][0]) + + try: + ip1_src = max(src_len, indices[i + 1]) + ip1_tgt = max(tgt_len, buckets[indices[i + 1]][0][0]) + except IndexError: + ip1_src = i_src + 1 + ip1_tgt = i_tgt + 1 + + if i_src + i_tgt <= ip1_src + ip1_tgt: + src_len = i_src + tgt_len = i_tgt + _, idx = buckets[indices[i]].pop(0) + else: + src_len = ip1_src + tgt_len = ip1_tgt + _, idx = buckets[indices[i + 1]].pop(0) + + batches[num_batches].append(idx) + batch_size += 1 + + if (batch_size * (src_len + tgt_len) > self.tokens_in_batch): + + num_examples_to_split = len(batches[num_batches]) + batches_to_evict = 8 * ((num_examples_to_split - 1) // 8) + + if batches_to_evict == 0: + batches_to_evict = num_examples_to_split + + batches.append(batches[num_batches][batches_to_evict:]) + batches[num_batches] = \ + batches[num_batches][:batches_to_evict] + batch_size = num_examples_to_split - batches_to_evict + + num_batches += 1 + if batch_size > 0: + src_len = max( + [len(src_ids[j]) for j in batches[num_batches]]) + tgt_len = max( + [len(tgt_ids[j]) for j in batches[num_batches]]) + else: + src_len = 0 + tgt_len = 0 + break + + if not buckets[indices[i]]: + i = i + 1 + + if not batches[-1]: + batches.pop(-1) + + return batches diff --git a/collections/nemo_nlp/nemo_nlp/data/language_modeling_data_layer.py b/collections/nemo_nlp/nemo_nlp/data/language_modeling_data_layer.py new file mode 100644 index 000000000000..382583fb5386 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/language_modeling_data_layer.py @@ -0,0 +1,58 @@ +# Copyright (c) 2019 NVIDIA Corporation + +from nemo.backends.pytorch.nm import DataLayerNM +from nemo.core.neural_types import * +from nemo.core import DeviceType +import torch +from .datasets import LanguageModelingDataset + + +class LanguageModelingDataLayer(DataLayerNM): + @staticmethod + def create_ports(): + input_ports = {} + output_ports = { + "input_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "labels": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }) + } + + return input_ports, output_ports + + def __init__(self, *, tokenizer, dataset, max_seq_length, **kwargs): + DataLayerNM.__init__(self, **kwargs) + + self._device = torch.device( + "cuda" if self.placement in [DeviceType.GPU, DeviceType.AllGpu] + else "cpu" + ) + + self._dataset = LanguageModelingDataset( + tokenizer=tokenizer, + dataset=dataset, + max_sequence_length=max_seq_length, + batch_step=self.local_parameters.get("batch_step", None) + ) + + def __len__(self): + return len(self._dataset) + + @property + def dataset(self): + return self._dataset + + @property + def data_iterator(self): + return None diff --git 
a/collections/nemo_nlp/nemo_nlp/data/tokenizers/__init__.py b/collections/nemo_nlp/nemo_nlp/data/tokenizers/__init__.py new file mode 100644 index 000000000000..41c6646c77e6 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/tokenizers/__init__.py @@ -0,0 +1,6 @@ +from .spc_tokenizer import SentencePieceTokenizer +from .bert_tokenizer import NemoBertTokenizer +from .yttm_tokenizer import YouTokenToMeTokenizer +from .gpt2_tokenizer import NemoGPT2Tokenizer +from .word_tokenizer import WordTokenizer +from .char_tokenizer import CharTokenizer diff --git a/collections/nemo_nlp/nemo_nlp/data/tokenizers/bert_tokenizer.py b/collections/nemo_nlp/nemo_nlp/data/tokenizers/bert_tokenizer.py new file mode 100644 index 000000000000..e145cdaa3189 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/tokenizers/bert_tokenizer.py @@ -0,0 +1,100 @@ +from .tokenizer_spec import TokenizerSpec +from pytorch_transformers import BertTokenizer +import re + + +def handle_quotes(text): + text_ = "" + quote = 0 + i = 0 + while i < len(text): + if text[i] == "\"": + if quote % 2: + text_ = text_[:-1] + "\"" + else: + text_ += "\"" + i += 1 + quote += 1 + else: + text_ += text[i] + i += 1 + return text_ + + +def remove_spaces(text): + text = text.replace("( ", "(") + text = text.replace(" )", ")") + text = text.replace("[ ", "[") + text = text.replace(" ]", "]") + text = text.replace(" / ", "/") + text = text.replace("„ ", "„") + text = text.replace(" - ", "-") + text = text.replace(" ' ", "'") + text = re.sub(r'([0-9])( )([\.,])', '\\1\\3', text) + text = re.sub(r'([\.,])( )([0-9])', '\\1\\3', text) + text = re.sub(r'([0-9])(:)( )([0-9])', '\\1\\2\\4', text) + text = text.replace(" %", "%") + text = text.replace("$ ", "$") + text = text.replace("\xa0", " ") + text = re.sub(r'([^0-9])(,)([0-9])', '\\1\\2 \\3', text) + return text + + +class NemoBertTokenizer(TokenizerSpec): + def __init__(self, pretrained_model=None, + vocab_file=None, + do_lower_case=True, + max_len=None, + do_basic_tokenize=True, + never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): + if pretrained_model: + self.tokenizer = BertTokenizer.from_pretrained(pretrained_model) + if "uncased" not in pretrained_model: + self.tokenizer.basic_tokenizer.do_lower_case = False + else: + self.tokenizer = BertTokenizer(vocab_file, + do_lower_case, + max_len, + do_basic_tokenize, + never_split) + self.vocab_size = len(self.tokenizer.vocab) + self.never_split = never_split + + def text_to_tokens(self, text): + tokens = self.tokenizer.tokenize(text) + return tokens + + def tokens_to_text(self, tokens): + text = self.tokenizer.convert_tokens_to_string(tokens) + return remove_spaces(handle_quotes(text.strip())) + + def token_to_id(self, token): + return self.tokens_to_ids([token])[0] + + def tokens_to_ids(self, tokens): + ids = self.tokenizer.convert_tokens_to_ids(tokens) + return ids + + def ids_to_tokens(self, ids): + tokens = self.tokenizer.convert_ids_to_tokens(ids) + return tokens + + def text_to_ids(self, text): + tokens = self.text_to_tokens(text) + ids = self.tokens_to_ids(tokens) + return ids + + def ids_to_text(self, ids): + tokens = self.ids_to_tokens(ids) + tokens_clean = [t for t in tokens if t not in self.never_split] + text = self.tokens_to_text(tokens_clean) + return text + + def pad_id(self): + return self.tokens_to_ids(["[PAD]"])[0] + + def bos_id(self): + return self.tokens_to_ids(["[CLS]"])[0] + + def eos_id(self): + return self.tokens_to_ids(["[SEP]"])[0] diff --git a/collections/nemo_nlp/nemo_nlp/data/tokenizers/char_tokenizer.py 
b/collections/nemo_nlp/nemo_nlp/data/tokenizers/char_tokenizer.py new file mode 100644 index 000000000000..64c6b8cacc6b --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/tokenizers/char_tokenizer.py @@ -0,0 +1,50 @@ +from .tokenizer_spec import TokenizerSpec + + +class CharTokenizer(TokenizerSpec): + def __init__(self, vocab_path): + + vocab_list = open(vocab_path, "r").readlines() + self.vocab = {vocab_list[i][0]: i for i in range(len(vocab_list))} + for special_token in ["", "", "", ""]: + if special_token not in self.vocab: + self.vocab[special_token] = len(self.vocab) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.vocab_size = len(self.vocab) + self.special_tokens = self.tokens_to_ids( + ["", "", "", ""]) + + def text_to_tokens(self, text): + token_candidates = [char for char in text] + tokens = [] + for token in token_candidates: + if token in self.vocab: + tokens.append(token) + else: + tokens.append("") + return tokens + + def tokens_to_text(self, tokens): + return self.ids_to_text(self.tokens_to_ids(tokens)) + + def text_to_ids(self, text): + return [self.vocab[token] for token in self.text_to_tokens(text)] + + def ids_to_text(self, ids): + ids_ = [id_ for id_ in ids if id_ not in self.special_tokens] + return "".join(self.ids_to_tokens(ids_)) + + def tokens_to_ids(self, tokens): + return [self.vocab[token] for token in tokens] + + def ids_to_tokens(self, ids): + return [self.inv_vocab[id] for id in ids] + + def pad_id(self): + return self.vocab[""] + + def bos_id(self): + return self.vocab[""] + + def eos_id(self): + return self.vocab[""] diff --git a/collections/nemo_nlp/nemo_nlp/data/tokenizers/gpt2_tokenizer.py b/collections/nemo_nlp/nemo_nlp/data/tokenizers/gpt2_tokenizer.py new file mode 100644 index 000000000000..cf5ef1d39fa1 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/tokenizers/gpt2_tokenizer.py @@ -0,0 +1,59 @@ +from .tokenizer_spec import TokenizerSpec +from pytorch_transformers import GPT2Tokenizer + + +class NemoGPT2Tokenizer(TokenizerSpec): + def __init__(self, pretrained_model=None, + vocab_file=None, merges_file=None, errors='replace', + bos_token="<|endoftext|>", + eos_token="<|endoftext|>", + **kwargs): + if pretrained_model: + self.tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model) + self.vocab_size = self.tokenizer.vocab_size + special_tokens_dict = {} + if self.tokenizer.unk_token is None: + self.tokenizer.unk_token = "<|unk|>" + special_tokens_dict["unk_token"] = "<|unk|>" + if self.tokenizer.bos_token is None: + special_tokens_dict["bos_token"] = bos_token + if self.tokenizer.eos_token is None: + special_tokens_dict["eos_token"] = eos_token + if self.tokenizer.pad_token is None: + special_tokens_dict["pad_token"] = "<|pad|>" + self.tokenizer.add_special_tokens(special_tokens_dict) + + def text_to_tokens(self, text): + tokens = self.tokenizer.tokenize(text) + return tokens + + def tokens_to_text(self, tokens): + text = self.tokenizer.convert_tokens_to_string(tokens) + return text + + def tokens_to_ids(self, tokens): + ids = self.tokenizer.convert_tokens_to_ids(tokens) + return ids + + def ids_to_tokens(self, ids): + tokens = self.tokenizer.convert_ids_to_tokens(ids) + return tokens + + def text_to_ids(self, text): + tokens = self.text_to_tokens(text) + ids = self.tokens_to_ids(tokens) + return ids + + def ids_to_text(self, ids): + tokens = self.ids_to_tokens(ids) + text = self.tokens_to_text(tokens) + return text + + def pad_id(self): + return self.tokens_to_ids([self.tokenizer.pad_token])[0] + + def bos_id(self): + 
return self.tokens_to_ids([self.tokenizer.bos_token])[0] + + def eos_id(self): + return self.tokens_to_ids([self.tokenizer.eos_token])[0] diff --git a/collections/nemo_nlp/nemo_nlp/data/tokenizers/spc_tokenizer.py b/collections/nemo_nlp/nemo_nlp/data/tokenizers/spc_tokenizer.py new file mode 100644 index 000000000000..7dc21f1efe9a --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/tokenizers/spc_tokenizer.py @@ -0,0 +1,117 @@ +import sentencepiece as spm +from .tokenizer_spec import TokenizerSpec + + +class SentencePieceTokenizer(TokenizerSpec): + def __init__(self, model_path): + self.tokenizer = spm.SentencePieceProcessor() + self.tokenizer.Load(model_path) + self.original_vocab_size = self.tokenizer.get_piece_size() + self.vocab_size = self.tokenizer.get_piece_size() + self.special_tokens = {} + self.special_token_ids = {} + + def text_to_tokens(self, text): + tokens = [] + idx = 0 + last_idx = 0 + + while 1: + indices = {} + + for token in self.special_tokens: + try: + indices[token] = text[idx:].index(token) + except ValueError: + continue + + if len(indices) == 0: + break + + next_token = min(indices, key=indices.get) + next_idx = idx + indices[next_token] + + tokens.extend(self.tokenizer.encode_as_pieces(text[idx:next_idx])) + tokens.append(next_token) + idx = next_idx + len(next_token) + + tokens.extend(self.tokenizer.encode_as_pieces(text[idx:])) + return tokens + + def tokens_to_text(self, tokens): + return self.tokenizer.decode_pieces(tokens) + + def text_to_ids(self, text): + ids = [] + idx = 0 + last_idx = 0 + + while 1: + indices = {} + + for token in self.special_tokens: + try: + indices[token] = text[idx:].index(token) + except ValueError: + continue + + if len(indices) == 0: + break + + next_token = min(indices, key=indices.get) + next_idx = idx + indices[next_token] + + ids.extend(self.tokenizer.encode_as_ids(text[idx:next_idx])) + ids.append(self.special_tokens[next_token]) + idx = next_idx + len(next_token) + + ids.extend(self.tokenizer.encode_as_ids(text[idx:])) + return ids + + def ids_to_text(self, ids): + text = "" + last_i = 0 + + for i, id in enumerate(ids): + if id in self.special_token_ids: + text += self.tokenizer.decode_ids(ids[last_i:i]) + " " + text += self.special_token_ids[id] + " " + last_i = i + 1 + + text += self.tokenizer.decode_ids(ids[last_i:]) + return text.strip() + + def token_to_id(self, token): + if token in self.special_tokens: + return self.special_tokens[token] + + return self.tokenizer.piece_to_id(token) + + def tokens_to_ids(self, tokens): + ids = [] + + for token in tokens: + if token in self.special_tokens: + ids.append(self.special_tokens[token]) + else: + ids.append(self.tokenizer.piece_to_id(token)) + + return ids + + def ids_to_tokens(self, ids): + tokens = [] + + for id in ids: + if id >= self.original_vocab_size: + tokens.append(self.special_token_ids[id]) + else: + tokens.append(self.tokenizer.id_to_piece(id)) + + return tokens + + def add_special_tokens(self, special_tokens): + for token in special_tokens: + if self.tokenizer.piece_to_id(token) == self.tokenizer.unk_id(): + self.special_tokens[token] = self.vocab_size + self.special_token_ids[self.vocab_size] = token + self.vocab_size += 1 diff --git a/collections/nemo_nlp/nemo_nlp/data/tokenizers/tokenizer_spec.py b/collections/nemo_nlp/nemo_nlp/data/tokenizers/tokenizer_spec.py new file mode 100644 index 000000000000..687df89b2650 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/tokenizers/tokenizer_spec.py @@ -0,0 +1,31 @@ +from abc import abstractmethod, ABC +from 
typing import List + + +class TokenizerSpec(ABC): + @abstractmethod + def text_to_tokens(self, text): + pass + + @abstractmethod + def tokens_to_text(self, tokens): + pass + + @abstractmethod + def tokens_to_ids(self, tokens): + pass + + @abstractmethod + def ids_to_tokens(self, ids): + pass + + @abstractmethod + def text_to_ids(self, text): + pass + + @abstractmethod + def ids_to_text(self, ids): + pass + + def add_special_tokens(self, special_tokens: List[str]): + pass diff --git a/collections/nemo_nlp/nemo_nlp/data/tokenizers/word_tokenizer.py b/collections/nemo_nlp/nemo_nlp/data/tokenizers/word_tokenizer.py new file mode 100644 index 000000000000..04026454abe7 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/tokenizers/word_tokenizer.py @@ -0,0 +1,50 @@ +from .tokenizer_spec import TokenizerSpec + + +class WordTokenizer(TokenizerSpec): + def __init__(self, vocab_path): + + vocab_list = open(vocab_path, "r").readlines() + self.vocab = {vocab_list[i].strip(): i for i in range(len(vocab_list))} + for special_token in ["", "", "", ""]: + if special_token not in self.vocab: + self.vocab[special_token] = len(self.vocab) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.vocab_size = len(self.vocab) + self.special_tokens = self.tokens_to_ids( + ["", "", "", ""]) + + def text_to_tokens(self, text): + token_candidates = text.strip().split() + tokens = [] + for token in token_candidates: + if token in self.vocab: + tokens.append(token) + else: + tokens.append("") + return tokens + + def tokens_to_text(self, tokens): + return self.ids_to_text(self.tokens_to_ids(tokens)) + + def text_to_ids(self, text): + return [self.vocab[token] for token in self.text_to_tokens(text)] + + def ids_to_text(self, ids): + ids_ = [id_ for id_ in ids if id_ not in self.special_tokens] + return " ".join(self.ids_to_tokens(ids_)) + + def tokens_to_ids(self, tokens): + return [self.vocab[token] for token in tokens] + + def ids_to_tokens(self, ids): + return [self.inv_vocab[id] for id in ids] + + def pad_id(self): + return self.vocab[""] + + def bos_id(self): + return self.vocab[""] + + def eos_id(self): + return self.vocab[""] diff --git a/collections/nemo_nlp/nemo_nlp/data/tokenizers/yttm_tokenizer.py b/collections/nemo_nlp/nemo_nlp/data/tokenizers/yttm_tokenizer.py new file mode 100644 index 000000000000..612aada2f76d --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/tokenizers/yttm_tokenizer.py @@ -0,0 +1,39 @@ +import youtokentome as yttm +from .tokenizer_spec import TokenizerSpec + + +class YouTokenToMeTokenizer(TokenizerSpec): + def __init__(self, model_path): + self.tokenizer = yttm.BPE(model=model_path) + self.vocab_size = len(self.tokenizer.vocab()) + self.special_tokens = self.tokens_to_ids( + ["", "", "", ""]) + + def text_to_tokens(self, text): + return self.tokenizer.encode(text, output_type=yttm.OutputType.SUBWORD) + + def tokens_to_text(self, tokens): + return self.ids_to_text(self.tokens_to_ids(tokens)) + + def text_to_ids(self, text): + return self.tokenizer.encode(text, output_type=yttm.OutputType.ID) + + def ids_to_text(self, ids): + ids_ = [id_ for id_ in ids if id_ not in self.special_tokens] + return self.tokenizer.decode([ids_])[0] + + def tokens_to_ids(self, tokens): + return [self.tokenizer.subword_to_id(token) for token in tokens] + + def ids_to_tokens(self, ids): + ids_ = [id_ for id_ in ids if id_ not in self.special_tokens] + return [self.tokenizer.id_to_subword(id_) for id_ in ids_] + + def pad_id(self): + return self.tokenizer.subword_to_id("") + + def bos_id(self): 
+ def bos_id(self): + return self.tokenizer.subword_to_id("<BOS>") + + def eos_id(self): + return self.tokenizer.subword_to_id("<EOS>")
diff --git a/collections/nemo_nlp/nemo_nlp/data/translation_data_layer.py b/collections/nemo_nlp/nemo_nlp/data/translation_data_layer.py new file mode 100644 index 000000000000..d6fcf8843f59 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/translation_data_layer.py @@ -0,0 +1,105 @@ +# Copyright (c) 2019 NVIDIA Corporation + +from nemo.backends.pytorch.nm import DataLayerNM +from nemo.core.neural_types import * +from nemo.core import DeviceType +import torch +from torch.utils.data import DataLoader +from torch.utils.data.distributed import DistributedSampler +from ..data.datasets.translation import TranslationDataset + + +class TranslationDataLayer(DataLayerNM): + @staticmethod + def create_ports(): + input_ports = {} + output_ports = { + "src_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "src_mask": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "tgt_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "tgt_mask": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "labels": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "sent_ids": NeuralType({ + 0: AxisType(BatchTag) + }) + } + + return input_ports, output_ports + + def __init__( + self, *, + tokenizer_src, + tokenizer_tgt, + dataset_src, + dataset_tgt, + tokens_in_batch, + clean=False, + **kwargs + ): + DataLayerNM.__init__(self, **kwargs) + + self._device = torch.device( + "cuda" if self.placement in [DeviceType.GPU, DeviceType.AllGpu] + else "cpu" + ) + + self.translation_dataset = TranslationDataset( + tokenizer_src=tokenizer_src, + tokenizer_tgt=tokenizer_tgt, + dataset_src=dataset_src, + dataset_tgt=dataset_tgt, + tokens_in_batch=tokens_in_batch, + clean=clean) + + if self._placement == DeviceType.AllGpu: + sampler = DistributedSampler(self.translation_dataset) + else: + sampler = None + + self._dataloader = DataLoader( + dataset=self.translation_dataset, + batch_size=1, + collate_fn=lambda x: self._collate_fn(x), + shuffle=True if sampler is None else False, + sampler=sampler) + + def _collate_fn(self, x): + src_ids, src_mask, tgt_ids, tgt_mask, labels, sent_ids = x[0] + src_ids = torch.Tensor(src_ids).long().to(self._device) + src_mask = torch.Tensor(src_mask).float().to(self._device) + tgt_ids = torch.Tensor(tgt_ids).long().to(self._device) + tgt_mask = torch.Tensor(tgt_mask).float().to(self._device) + labels = torch.Tensor(labels).long().to(self._device) + sent_ids = torch.Tensor(sent_ids).long().to(self._device) + return src_ids, src_mask, tgt_ids, tgt_mask, labels, sent_ids + + def __len__(self): + return len(self.translation_dataset) + + @property + def dataset(self): + return None + + @property + def data_iterator(self): + return self._dataloader
diff --git a/collections/nemo_nlp/nemo_nlp/data/utils.py b/collections/nemo_nlp/nemo_nlp/data/utils.py new file mode 100644 index 000000000000..9b894bbf133b --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/data/utils.py @@ -0,0 +1,68 @@ +import os +import pickle +import numpy as np + +
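dataset_to_ids below caches the tokenized ids in a pickle file written next to the source text (data.txt --> data.txt.pkl). Here is a standalone sketch of that load-or-compute pattern, assuming only the standard library; the file name and compute function are illustrative, not part of the patch:

```python
import os
import pickle

def load_or_compute(path, compute):
    # Return cached results from `path + ".pkl"` if present; otherwise compute
    # them from the raw file and cache them for the next run.
    cache_path = path + ".pkl"
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute(path)
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

# Hypothetical usage: whitespace-"tokenize" every line once, reuse the cache later.
# ids = load_or_compute("train.en", lambda p: [line.split() for line in open(p)])
```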
+def dataset_to_ids(dataset, tokenizer, cache_ids=False, add_bos_eos=True): + """ + Reads a dataset from file line by line, tokenizes each line with the given + tokenizer, and returns a list of lists with the ids of each tokenized line. + + Args: + dataset: path to dataset + tokenizer: tokenizer to convert text into ids + cache_ids: if True, ids are saved to disk as a pickle file + with the same name plus a .pkl extension (e.g., data.txt --> data.txt.pkl) + add_bos_eos: bool, whether to add <BOS> and <EOS> symbols (e.g., for NMT) + Returns: + ids: list of ids which correspond to tokenized strings of the dataset + """ + + cached_ids_dataset = dataset + str(".pkl") + if os.path.isfile(cached_ids_dataset): + print("Loading cached tokenized dataset ...") + ids = pickle.load(open(cached_ids_dataset, "rb")) + else: + print("Tokenizing dataset ...") + data = open(dataset, "rb").readlines() + ids = [] + for sentence in data: + sent_ids = tokenizer.text_to_ids(sentence.decode("utf-8")) + if add_bos_eos: + sent_ids = [tokenizer.bos_id()] + sent_ids + \ + [tokenizer.eos_id()] + ids.append(sent_ids) + if cache_ids: + print("Caching tokenized dataset ...") + pickle.dump(ids, open(cached_ids_dataset, "wb")) + return ids + + +def clean_src_and_target(src_ids, tgt_ids, max_tokens=128, min_tokens=3, + max_tokens_diff=25, max_tokens_ratio=2.5): + """ + Cleans source and target sentences to get rid of noisy data. + Specifically, a pair of sentences is removed if + -- either source or target is longer than *max_tokens* + -- either source or target is shorter than *min_tokens* + -- absolute difference in length between source and target is larger than + *max_tokens_diff* + -- one sentence is more than *max_tokens_ratio* times longer than the other + -- source and target are identical + """ + + if len(src_ids) != len(tgt_ids): + raise ValueError("Source and target corpora have different lengths!") + src_ids_, tgt_ids_ = [], [] + for i in range(len(src_ids)): + src_len, tgt_len = len(src_ids[i]), len(tgt_ids[i]) + if src_len > max_tokens or tgt_len > max_tokens or \ + src_len < min_tokens or tgt_len < min_tokens or \ + (src_ids[i] == tgt_ids[i]) or \ + np.abs(src_len - tgt_len) > max_tokens_diff: + continue + ratio = max(src_len - 2, 1) / max(tgt_len - 2, 1) + if ratio > max_tokens_ratio or ratio < (1 / max_tokens_ratio): + continue + src_ids_.append(src_ids[i]) + tgt_ids_.append(tgt_ids[i]) + return src_ids_, tgt_ids_
diff --git a/collections/nemo_nlp/nemo_nlp/externals/__init__.py b/collections/nemo_nlp/nemo_nlp/externals/__init__.py new file mode 100644 index 000000000000..e69de29bb2d1
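The vendored BLEU code that follows scores translations by clipped n-gram overlap. As a quick, standalone illustration of the counting step it relies on (this mirrors its _get_ngrams helper rather than importing it, and the sentences are invented):

```python
import collections

def ngram_counts(tokens, max_order=2):
    # Count every n-gram up to max_order, as the BLEU code does per segment.
    counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
overlap = ngram_counts(hyp) & ngram_counts(ref)  # clipped matches, as in BLEU
print(sum(overlap.values()))  # 8
```

diff --git a/collections/nemo_nlp/nemo_nlp/externals/bleu.py b/collections/nemo_nlp/nemo_nlp/externals/bleu.py new file mode 100644 index 000000000000..6786fa2585b0 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/externals/bleu.py @@ -0,0 +1,123 @@ +# Copyright 2017 Google Inc. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== +"""Python implementation of BLEU and smooth-BLEU. +This module provides a Python implementation of BLEU and smooth-BLEU. +Smooth BLEU is computed following the method outlined in the paper: +Chin-Yew Lin, Franz Josef Och. ORANGE: a method for evaluating automatic +evaluation metrics for machine translation. COLING 2004.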
+""" + +import collections +import math + + +def compound_split(segment): + segment = segment.replace(".", " . ") + segment = segment.replace(",", " , ") + segment = segment.replace(":", " : ") + segment = segment.replace("!", " ! ") + segment = segment.replace("?", " ? ") + segment = segment.replace("-", " ##AT##-##AT## ") + segment = segment.replace("\"", " "e ") + segment = segment.replace("%", " % ") + return segment.split() + + +def _get_ngrams(segment, max_order): + """Extracts all n-grams upto a given maximum order from an input segment. + Args: + segment: text segment from which n-grams will be extracted. + max_order: maximum length in tokens of the n-grams returned by this + methods. + Returns: + The Counter containing all n-grams upto max_order in segment + with a count of how many times each n-gram occurred. + """ + + ngram_counts = collections.Counter() + for order in range(1, max_order + 1): + for i in range(0, len(segment) - order + 1): + ngram = tuple(segment[i:i + order]) + ngram_counts[ngram] += 1 + return ngram_counts + + +def compute_bleu(reference_corpus, + translation_corpus, + max_order=4, + smooth=False): + """Computes BLEU score of translated segments against one or more references. + Args: + reference_corpus: list of lists of references for each translation. Each + reference should be tokenized into a list of tokens. + translation_corpus: list of translations to score. Each translation + should be tokenized into a list of tokens. + max_order: Maximum n-gram order to use when computing BLEU score. + smooth: Whether or not to apply Lin et al. 2004 smoothing. + Returns: + 3-Tuple with the BLEU score, n-gram precisions, geometric mean of n-gram + precisions and brevity penalty. + """ + matches_by_order = [0] * max_order + possible_matches_by_order = [0] * max_order + reference_length = 0 + translation_length = 0 + for (references, translation) in zip(reference_corpus, translation_corpus): + reference_length += min(len(r) for r in references) + translation_length += len(translation) + + merged_ref_ngram_counts = collections.Counter() + for reference in references: + merged_ref_ngram_counts |= _get_ngrams(reference, max_order) + translation_ngram_counts = _get_ngrams(translation, max_order) + overlap = translation_ngram_counts & merged_ref_ngram_counts + for ngram in overlap: + matches_by_order[len(ngram) - 1] += overlap[ngram] + for order in range(1, max_order + 1): + possible_matches = len(translation) - order + 1 + if possible_matches > 0: + possible_matches_by_order[order - 1] += possible_matches + + precisions = [0] * max_order + for i in range(0, max_order): + if smooth: + precisions[i] = ((matches_by_order[i] + 1.) / + (possible_matches_by_order[i] + 1.)) + else: + if possible_matches_by_order[i] > 0: + precisions[i] = (float(matches_by_order[i]) / + possible_matches_by_order[i]) + else: + precisions[i] = 0.0 + + if min(precisions) > 0: + p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions) + geo_mean = math.exp(p_log_sum) + else: + geo_mean = 0 + + ratio = float(translation_length) / reference_length + + if ratio > 1.0: + bp = 1. + else: + bp = math.exp(1 - 1. 
/ (ratio + 1e-6)) + + bleu = geo_mean * bp + + precisions = [p * 100 for p in precisions] + + return (bleu * 100, precisions, bp, ratio, translation_length, + reference_length) diff --git a/collections/nemo_nlp/nemo_nlp/externals/fairseq_tokenizer.py b/collections/nemo_nlp/nemo_nlp/externals/fairseq_tokenizer.py new file mode 100644 index 000000000000..f3ec52d7bab5 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/externals/fairseq_tokenizer.py @@ -0,0 +1,102 @@ +import sys +import re +import unicodedata +from collections import defaultdict + + +def get_unicode_categories(): + cats = defaultdict(list) + for c in map(chr, range(sys.maxunicode + 1)): + cats[unicodedata.category(c)].append(c) + return cats + + +NUMERICS = ''.join(get_unicode_categories()['No']) + + +def tokenize_en(line): + line = line.strip() + line = ' ' + line + ' ' + # remove ASCII junk + line = re.sub(r'\s+', ' ', line) + line = re.sub(r'[\x00-\x1F]', '', line) + #fix whitespaces + line = re.sub('\ +', ' ', line) + line = re.sub('^ ', '', line) + line = re.sub(' $', '', line) + #separate other special characters + line = re.sub(r'([^\s\.\'\`\,\-\w]|[_'+NUMERICS+'])', r' \g<1> ', line) + line = re.sub(r'(\w)\-(?=\w)', r'\g<1> @-@ ', line) + + #multidots stay together + line = re.sub(r'\.([\.]+)', r' DOTMULTI\g<1>', line) + while re.search(r'DOTMULTI\.', line): + line = re.sub(r'DOTMULTI\.([^\.])', r'DOTDOTMULTI \g<1>', line) + line = re.sub(r'DOTMULTI\.', r'DOTDOTMULTI', line) + + # separate out "," except if within numbers (5,300) + line = re.sub(r'([\D])[,]', r'\g<1> , ', line) + line = re.sub(r'[,]([\D])', r' , \g<1>', line) + + # separate "," after a number if it's the end of sentence + line = re.sub(r'(\d)[,]$', r'\g<1> ,', line) + + # split contractions right + line = re.sub(r'([\W\d])[\']([\W\d])', '\g<1> \' \g<2>', line) + line = re.sub(r'(\W)[\']([\w\D])', '\g<1> \' \g<2>', line) + line = re.sub(r'([\w\D])[\']([\W\d])', '\g<1> \' \g<2>', line) + line = re.sub(r'([\w\D])[\']([\w\D])', '\g<1> \'\g<2>', line) + # special case for "1990's" + line = re.sub(r'([\W\d])[\']([s])', '\g<1> \'\g<2>', line) + + # apply nonbreaking prefixes + words = line.split() + line = '' + for i in range(len(words)): + word = words[i] + match = re.search(r'^(\S+)\.$', word) + if match: + pre = match.group(1) + if i==len(words)-1: + # split last words independently as they are unlikely to be non-breaking prefixes + word = pre+' .' + # elif ((re.search(r'\.', pre) and re.search(r'[^\.\W\d]', pre)) + # or (pre in prefixes and prefixes[pre]==1) + # or re.search(r'^[a-z]', words[i+1]) + # or (pre in prefixes and prefixes[pre]==2 and re.search(r'^[0-9]+', words[i+1]))): + # pass + else: + word = pre+' .' + + word +=' ' + line += word + + # clean up extraneous spaces + line = re.sub(' +', ' ', line) + line = re.sub('^ ', '', line) + line = re.sub(' $', '', line) + + # .' at end of sentence is missed + line = re.sub(r'\.\' ?$', ' . 
\' ', line) + + #restore multi-dots + while re.search('DOTDOTMULTI', line): + line = re.sub('DOTDOTMULTI', 'DOTMULTI.', line) + + line = re.sub('DOTMULTI', '.', line) + + # escape special characters + line = re.sub(r'\&', r'&', line) + line = re.sub(r'\|', r'|', line) + line = re.sub(r'\<', r'<', line) + line = re.sub(r'\>', r'>', line) + line = re.sub(r'\'', r''', line) + line = re.sub(r'\"', r'"', line) + line = re.sub(r'\[', r'[', line) + line = re.sub(r'\]', r']', line) + + #ensure final line breaks + # if line[-1] is not '\n': + # line += '\n' + + return line diff --git a/collections/nemo_nlp/nemo_nlp/externals/file_utils.py b/collections/nemo_nlp/nemo_nlp/externals/file_utils.py new file mode 100644 index 000000000000..709f0fdc4ce1 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/externals/file_utils.py @@ -0,0 +1,275 @@ +""" +Utilities for working with the local dataset cache. +This file is adapted from the AllenNLP library at https://github.com/allenai/allennlp +Copyright by the AllenNLP authors. +""" +from __future__ import (absolute_import, division, print_function, + unicode_literals) + +import json +import logging +import os +import shutil +import tempfile +import fnmatch +from functools import wraps +from hashlib import sha256 +import sys +from io import open + +import boto3 +import requests +from botocore.exceptions import ClientError +from tqdm import tqdm + +try: + from urllib.parse import urlparse +except ImportError: + from urlparse import urlparse + +try: + from pathlib import Path + PYTORCH_PRETRAINED_BERT_CACHE = Path( + os.getenv('PYTORCH_PRETRAINED_BERT_CACHE', + Path.home() / '.pytorch_pretrained_bert')) +except (AttributeError, ImportError): + PYTORCH_PRETRAINED_BERT_CACHE = os.getenv( + 'PYTORCH_PRETRAINED_BERT_CACHE', + os.path.join(os.path.expanduser("~"), '.pytorch_pretrained_bert')) + +CONFIG_NAME = "config.json" +WEIGHTS_NAME = "pytorch_model.bin" + + +def url_to_filename(url, etag=None): + """ + Convert `url` into a hashed filename in a repeatable way. + If `etag` is specified, append its hash to the url's, delimited + by a period. + """ + url_bytes = url.encode('utf-8') + url_hash = sha256(url_bytes) + filename = url_hash.hexdigest() + + if etag: + etag_bytes = etag.encode('utf-8') + etag_hash = sha256(etag_bytes) + filename += '.' + etag_hash.hexdigest() + + return filename + + +def filename_to_url(filename, cache_dir=None): + """ + Return the url and etag (which may be ``None``) stored for `filename`. + Raise ``EnvironmentError`` if `filename` or its stored metadata do not exist. + """ + if cache_dir is None: + cache_dir = PYTORCH_PRETRAINED_BERT_CACHE + if sys.version_info[0] == 3 and isinstance(cache_dir, Path): + cache_dir = str(cache_dir) + + cache_path = os.path.join(cache_dir, filename) + if not os.path.exists(cache_path): + raise EnvironmentError("file {} not found".format(cache_path)) + + meta_path = cache_path + '.json' + if not os.path.exists(meta_path): + raise EnvironmentError("file {} not found".format(meta_path)) + + with open(meta_path, encoding="utf-8") as meta_file: + metadata = json.load(meta_file) + url = metadata['url'] + etag = metadata['etag'] + + return url, etag + + +def cached_path(url_or_filename, cache_dir=None): + """ + Given something that might be a URL (or might be a local path), + determine which. If it's a URL, download the file and cache it, and + return the path to the cached file. If it's already a local path, + make sure the file exists and then return the path. 
+ """ + if cache_dir is None: + cache_dir = PYTORCH_PRETRAINED_BERT_CACHE + if sys.version_info[0] == 3 and isinstance(url_or_filename, Path): + url_or_filename = str(url_or_filename) + if sys.version_info[0] == 3 and isinstance(cache_dir, Path): + cache_dir = str(cache_dir) + + parsed = urlparse(url_or_filename) + + if parsed.scheme in ('http', 'https', 's3'): + # URL, so get it from the cache (downloading if necessary) + return get_from_cache(url_or_filename, cache_dir) + elif os.path.exists(url_or_filename): + # File, and it exists. + return url_or_filename + elif parsed.scheme == '': + # File, but it doesn't exist. + raise EnvironmentError("file {} not found".format(url_or_filename)) + else: + # Something unknown + raise ValueError( + "unable to parse {} as a URL or as a local path".format( + url_or_filename)) + + +def split_s3_path(url): + """Split a full s3 path into the bucket name and path.""" + parsed = urlparse(url) + if not parsed.netloc or not parsed.path: + raise ValueError("bad s3 path {}".format(url)) + bucket_name = parsed.netloc + s3_path = parsed.path + # Remove '/' at beginning of path. + if s3_path.startswith("/"): + s3_path = s3_path[1:] + return bucket_name, s3_path + + +def s3_request(func): + """ + Wrapper function for s3 requests in order to create more helpful error + messages. + """ + + @wraps(func) + def wrapper(url, *args, **kwargs): + try: + return func(url, *args, **kwargs) + except ClientError as exc: + if int(exc.response["Error"]["Code"]) == 404: + raise EnvironmentError("file {} not found".format(url)) + else: + raise + + return wrapper + + +@s3_request +def s3_etag(url): + """Check ETag on S3 object.""" + s3_resource = boto3.resource("s3") + bucket_name, s3_path = split_s3_path(url) + s3_object = s3_resource.Object(bucket_name, s3_path) + return s3_object.e_tag + + +@s3_request +def s3_get(url, temp_file): + """Pull a file directly from S3.""" + s3_resource = boto3.resource("s3") + bucket_name, s3_path = split_s3_path(url) + s3_resource.Bucket(bucket_name).download_fileobj(s3_path, temp_file) + + +def http_get(url, temp_file): + req = requests.get(url, stream=True) + content_length = req.headers.get('Content-Length') + total = int(content_length) if content_length is not None else None + progress = tqdm(unit="B", total=total) + for chunk in req.iter_content(chunk_size=1024): + if chunk: # filter out keep-alive new chunks + progress.update(len(chunk)) + temp_file.write(chunk) + progress.close() + + +def get_from_cache(url, cache_dir=None): + """ + Given a URL, look for the corresponding dataset in the local cache. + If it's not there, download it. Then return the path to the cached file. + """ + if cache_dir is None: + cache_dir = PYTORCH_PRETRAINED_BERT_CACHE + if sys.version_info[0] == 3 and isinstance(cache_dir, Path): + cache_dir = str(cache_dir) + + if not os.path.exists(cache_dir): + os.makedirs(cache_dir) + + # Get eTag to add to filename, if it exists. 
+ if url.startswith("s3://"): + etag = s3_etag(url) + else: + try: + response = requests.head(url, allow_redirects=True) + if response.status_code != 200: + etag = None + else: + etag = response.headers.get("ETag") + except EnvironmentError: + etag = None + + if sys.version_info[0] == 2 and etag is not None: + etag = etag.decode('utf-8') + filename = url_to_filename(url, etag) + + # get cache path to put the file + cache_path = os.path.join(cache_dir, filename) + + # If we don't have a connection (etag is None) and can't identify the file + # try to get the last downloaded one + if not os.path.exists(cache_path) and etag is None: + matching_files = fnmatch.filter(os.listdir(cache_dir), filename + '.*') + matching_files = list( + filter(lambda s: not s.endswith('.json'), matching_files)) + if matching_files: + cache_path = os.path.join(cache_dir, matching_files[-1]) + + if not os.path.exists(cache_path): + # Download to temporary file, then copy to cache dir once finished. + # Otherwise you get corrupt cache entries if the download gets interrupted. + with tempfile.NamedTemporaryFile() as temp_file: + print("%s not found in cache, downloading to %s", url, + temp_file.name) + + # GET file object + if url.startswith("s3://"): + s3_get(url, temp_file) + else: + http_get(url, temp_file) + + # we are copying the file before closing it, so flush to avoid truncation + temp_file.flush() + # shutil.copyfileobj() starts at the current position, so go to the start + temp_file.seek(0) + + print("copying %s to cache at %s", temp_file.name, cache_path) + with open(cache_path, 'wb') as cache_file: + shutil.copyfileobj(temp_file, cache_file) + + print("creating metadata file for %s", cache_path) + meta = {'url': url, 'etag': etag} + meta_path = cache_path + '.json' + with open(meta_path, 'w') as meta_file: + output_string = json.dumps(meta) + if sys.version_info[0] == 2 and isinstance(output_string, str): + output_string = unicode(output_string, + 'utf-8') # The beauty of python 2 + meta_file.write(output_string) + + print("removing temp file %s", temp_file.name) + + return cache_path + + +def read_set_from_file(filename): + ''' + Extract a de-duped collection (set) of text from a file. + Expected file format is one item per line. + ''' + collection = set() + with open(filename, 'r', encoding='utf-8') as file_: + for line in file_: + collection.add(line.rstrip()) + return collection + + +def get_file_extension(path, dot=True, lower=True): + ext = os.path.splitext(path)[1] + ext = ext if dot else ext[1:] + return ext.lower() if lower else ext diff --git a/collections/nemo_nlp/nemo_nlp/externals/run_squad.py b/collections/nemo_nlp/nemo_nlp/externals/run_squad.py new file mode 100644 index 000000000000..7d15af1f2d7f --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/externals/run_squad.py @@ -0,0 +1,1276 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +"""Run BERT on SQuAD.""" + +from __future__ import absolute_import, division, print_function + +import argparse +import collections +import json +import logging +import math +import os +import random +import sys +from io import open + +import numpy as np +import torch +from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, + TensorDataset) +from torch.utils.data.distributed import DistributedSampler +from tqdm import tqdm, trange + +from pytorch_transformers.file_utils import PYTORCH_PRETRAINED_BERT_CACHE +from pytorch_transformers.modeling_utils import WEIGHTS_NAME, CONFIG_NAME +from pytorch_transformers.modeling_bert import BertForQuestionAnswering, BertConfig +from pytorch_transformers.optimization import AdamW, WarmupLinearSchedule +from pytorch_transformers.tokenization_bert import (BasicTokenizer, + BertTokenizer, + whitespace_tokenize) + +if sys.version_info[0] == 2: + import cPickle as pickle +else: + import pickle + +logger = logging.getLogger(__name__) + + +class SquadExample(object): + """ + A single training/test example for the Squad dataset. + For examples without an answer, the start and end position are -1. + """ + + def __init__(self, + qas_id, + question_text, + doc_tokens, + orig_answer_text=None, + start_position=None, + end_position=None, + is_impossible=None): + self.qas_id = qas_id + self.question_text = question_text + self.doc_tokens = doc_tokens + self.orig_answer_text = orig_answer_text + self.start_position = start_position + self.end_position = end_position + self.is_impossible = is_impossible + + def __str__(self): + return self.__repr__() + + def __repr__(self): + s = "" + s += "qas_id: %s" % (self.qas_id) + s += ", question_text: %s" % (self.question_text) + s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens)) + if self.start_position: + s += ", start_position: %d" % (self.start_position) + if self.end_position: + s += ", end_position: %d" % (self.end_position) + if self.is_impossible: + s += ", is_impossible: %r" % (self.is_impossible) + return s + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, + unique_id, + example_index, + doc_span_index, + tokens, + token_to_orig_map, + token_is_max_context, + input_ids, + input_mask, + segment_ids, + start_position=None, + end_position=None, + is_impossible=None): + self.unique_id = unique_id + self.example_index = example_index + self.doc_span_index = doc_span_index + self.tokens = tokens + self.token_to_orig_map = token_to_orig_map + self.token_is_max_context = token_is_max_context + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.start_position = start_position + self.end_position = end_position + self.is_impossible = is_impossible + + +def read_squad_examples(input_file, is_training, version_2_with_negative): + """Read a SQuAD json file into a list of SquadExample.""" + with open(input_file, "r", encoding='utf-8') as reader: + input_data = json.load(reader)["data"] + + def is_whitespace(c): + if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: + return True + return False + + examples = [] + for entry in input_data: + for paragraph in entry["paragraphs"]: + paragraph_text = paragraph["context"] + doc_tokens = [] + char_to_word_offset = [] + prev_is_whitespace = True + for c in paragraph_text: + if is_whitespace(c): + prev_is_whitespace = True + else: + if 
prev_is_whitespace: + doc_tokens.append(c) + else: + doc_tokens[-1] += c + prev_is_whitespace = False + char_to_word_offset.append(len(doc_tokens) - 1) + + for qa in paragraph["qas"]: + qas_id = qa["id"] + question_text = qa["question"] + start_position = None + end_position = None + orig_answer_text = None + is_impossible = False + if is_training: + if version_2_with_negative: + is_impossible = qa["is_impossible"] + if (len(qa["answers"]) != 1) and (not is_impossible): + raise ValueError( + "For training, each question should have exactly 1 answer." + ) + if not is_impossible: + answer = qa["answers"][0] + orig_answer_text = answer["text"] + answer_offset = answer["answer_start"] + answer_length = len(orig_answer_text) + start_position = char_to_word_offset[answer_offset] + end_position = char_to_word_offset[answer_offset + + answer_length - 1] + # Only add answers where the text can be exactly recovered from the + # document. If this CAN'T happen it's likely due to weird Unicode + # stuff so we will just skip the example. + # + # Note that this means for training mode, every example is NOT + # guaranteed to be preserved. + actual_text = " ".join( + doc_tokens[start_position:(end_position + 1)]) + cleaned_answer_text = " ".join( + whitespace_tokenize(orig_answer_text)) + if actual_text.find(cleaned_answer_text) == -1: + logger.warning( + "Could not find answer: '%s' vs. '%s'", + actual_text, cleaned_answer_text) + continue + else: + start_position = -1 + end_position = -1 + orig_answer_text = "" + + example = SquadExample(qas_id=qas_id, + question_text=question_text, + doc_tokens=doc_tokens, + orig_answer_text=orig_answer_text, + start_position=start_position, + end_position=end_position, + is_impossible=is_impossible) + examples.append(example) + return examples + + +def convert_examples_to_features(examples, tokenizer, max_seq_length, + doc_stride, max_query_length, is_training): + """Loads a data file into a list of `InputBatch`s.""" + + unique_id = 1000000000 + + features = [] + for (example_index, example) in enumerate(examples): + query_tokens = tokenizer.tokenize(example.question_text) + + if len(query_tokens) > max_query_length: + query_tokens = query_tokens[0:max_query_length] + + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + for (i, token) in enumerate(example.doc_tokens): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = tokenizer.tokenize(token) + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + + tok_start_position = None + tok_end_position = None + if is_training and example.is_impossible: + tok_start_position = -1 + tok_end_position = -1 + if is_training and not example.is_impossible: + tok_start_position = orig_to_tok_index[example.start_position] + if example.end_position < len(example.doc_tokens) - 1: + tok_end_position = orig_to_tok_index[example.end_position + + 1] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = _improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, + tokenizer, example.orig_answer_text) + + # The -3 accounts for [CLS], [SEP] and [SEP] + max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 + + # We can have documents that are longer than the maximum sequence length. + # To deal with this we do a sliding window approach, where we take chunks + # of the up to our max length with a stride of `doc_stride`. 
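The sliding-window chunking described in the comment above can be sketched on its own; this mirrors the doc_spans loop that follows (not part of the patch), and the numbers are illustrative:

```python
# Break a long token sequence into overlapping spans of at most
# max_tokens_for_doc tokens, advancing by doc_stride each time.
def sliding_spans(n_tokens, max_tokens_for_doc, doc_stride):
    spans, start = [], 0
    while start < n_tokens:
        length = min(max_tokens_for_doc, n_tokens - start)
        spans.append((start, length))
        if start + length == n_tokens:
            break
        start += min(length, doc_stride)
    return spans

print(sliding_spans(n_tokens=500, max_tokens_for_doc=384, doc_stride=128))
# [(0, 384), (128, 372)]
```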
+ _DocSpan = collections.namedtuple( # pylint: disable=invalid-name + "DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = [] + token_to_orig_map = {} + token_is_max_context = {} + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in query_tokens: + tokens.append(token) + segment_ids.append(0) + tokens.append("[SEP]") + segment_ids.append(0) + + for i in range(doc_span.length): + split_token_index = doc_span.start + i + token_to_orig_map[len( + tokens)] = tok_to_orig_index[split_token_index] + + is_max_context = _check_is_max_context(doc_spans, + doc_span_index, + split_token_index) + token_is_max_context[len(tokens)] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + start_position = None + end_position = None + if is_training and not example.is_impossible: + # For training, if our document chunk does not contain an annotation + # we throw it out, since there is nothing to predict. 
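The start/end bookkeeping that follows can be illustrated standalone: keep the answer span only if it falls inside the current chunk, then shift token positions by the "[CLS] + query + [SEP]" prefix length (doc_offset). The numbers below are invented for the example:

```python
def map_positions(tok_start, tok_end, doc_start, doc_end, doc_offset):
    # Out-of-span answers point both positions at [CLS] (index 0).
    if not (tok_start >= doc_start and tok_end <= doc_end):
        return 0, 0
    return tok_start - doc_start + doc_offset, tok_end - doc_start + doc_offset

print(map_positions(10, 12, doc_start=0, doc_end=383, doc_offset=14))    # (24, 26)
print(map_positions(400, 402, doc_start=0, doc_end=383, doc_offset=14))  # (0, 0)
```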
+ doc_start = doc_span.start + doc_end = doc_span.start + doc_span.length - 1 + out_of_span = False + if not (tok_start_position >= doc_start + and tok_end_position <= doc_end): + out_of_span = True + if out_of_span: + start_position = 0 + end_position = 0 + else: + doc_offset = len(query_tokens) + 2 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + if is_training and example.is_impossible: + start_position = 0 + end_position = 0 + if example_index < 20: + logger.info("*** Example ***") + logger.info("unique_id: %s" % (unique_id)) + logger.info("example_index: %s" % (example_index)) + logger.info("doc_span_index: %s" % (doc_span_index)) + logger.info("tokens: %s" % " ".join(tokens)) + logger.info("token_to_orig_map: %s" % " ".join( + ["%d:%d" % (x, y) + for (x, y) in token_to_orig_map.items()])) + logger.info("token_is_max_context: %s" % " ".join([ + "%d:%s" % (x, y) + for (x, y) in token_is_max_context.items() + ])) + logger.info("input_ids: %s" % + " ".join([str(x) for x in input_ids])) + logger.info("input_mask: %s" % + " ".join([str(x) for x in input_mask])) + logger.info("segment_ids: %s" % + " ".join([str(x) for x in segment_ids])) + if is_training and example.is_impossible: + logger.info("impossible example") + if is_training and not example.is_impossible: + answer_text = " ".join( + tokens[start_position:(end_position + 1)]) + logger.info("start_position: %d" % (start_position)) + logger.info("end_position: %d" % (end_position)) + logger.info("answer: %s" % (answer_text)) + + features.append( + InputFeatures(unique_id=unique_id, + example_index=example_index, + doc_span_index=doc_span_index, + tokens=tokens, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + start_position=start_position, + end_position=end_position, + is_impossible=example.is_impossible)) + unique_id += 1 + + return features + + +def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, + orig_answer_text): + """Returns tokenized answer spans that better match the annotated answer.""" + + # The SQuAD annotations are character based. We first project them to + # whitespace-tokenized words. But then after WordPiece tokenization, we can + # often find a "better match". For example: + # + # Question: What year was John Smith born? + # Context: The leader was John Smith (1895-1943). + # Answer: 1895 + # + # The original whitespace-tokenized answer will be "(1895-1943).". However + # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match + # the exact answer, 1895. + # + # However, this is not always possible. Consider the following: + # + # Question: What country is the top exporter of electornics? + # Context: The Japanese electronics industry is the lagest in the world. + # Answer: Japan + # + # In this case, the annotator chose "Japan" as a character sub-span of + # the word "Japanese". Since our WordPiece tokenizer does not split + # "Japanese", we just use "Japanese" as the annotation. This is fairly rare + # in SQuAD, but does happen. 
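The span-tightening idea described above amounts to searching sub-spans of the subword tokens for one whose join exactly matches the tokenized answer. A standalone sketch (it mirrors _improve_answer_span below; the tokens are invented):

```python
def improve_span(doc_tokens, start, end, answer_tokens):
    # Prefer the narrowest sub-span of [start, end] that matches the answer.
    target = " ".join(answer_tokens)
    for new_start in range(start, end + 1):
        for new_end in range(end, new_start - 1, -1):
            if " ".join(doc_tokens[new_start:new_end + 1]) == target:
                return new_start, new_end
    return start, end

doc = ["john", "smith", "(", "1895", "-", "1943", ")", "."]
print(improve_span(doc, start=2, end=7, answer_tokens=["1895"]))  # (3, 3)
```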
+ tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text)) + + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = " ".join(doc_tokens[new_start:(new_end + 1)]) + if text_span == tok_answer_text: + return (new_start, new_end) + + return (input_start, input_end) + + +def _check_is_max_context(doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + # Because of the sliding window approach taken to scoring documents, a single + # token can appear in multiple documents. E.g. + # Doc: the man went to the store and bought a gallon of milk + # Span A: the man went to the + # Span B: to the store and bought + # Span C: and bought a gallon of + # ... + # + # Now the word 'bought' will have two scores from spans B and C. We only + # want to consider the score with "maximum context", which we define as + # the *minimum* of its left and right context (the *sum* of left and + # right context will always be the same, of course). + # + # In the example the maximum context for 'bought' would be span C since + # it has 1 left context and 3 right context, while span B has 4 left context + # and 0 right context. + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + continue + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, + num_right_context) + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + + return cur_span_index == best_span_index + + +RawResult = collections.namedtuple("RawResult", + ["unique_id", "start_logits", "end_logits"]) + + +def write_predictions(all_examples, all_features, all_results, n_best_size, + max_answer_length, do_lower_case, output_prediction_file, + output_nbest_file, output_null_log_odds_file, + verbose_logging, version_2_with_negative, + null_score_diff_threshold): + """Write final predictions to the json file and log-odds of null if needed.""" + logger.info("Writing predictions to: %s" % (output_prediction_file)) + logger.info("Writing nbest to: %s" % (output_nbest_file)) + + example_index_to_features = collections.defaultdict(list) + for feature in all_features: + example_index_to_features[feature.example_index].append(feature) + + unique_id_to_result = {} + for result in all_results: + unique_id_to_result[result.unique_id] = result + + _PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", [ + "feature_index", "start_index", "end_index", "start_logit", + "end_logit" + ]) + + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + scores_diff_json = collections.OrderedDict() + + for (example_index, example) in enumerate(all_examples): + features = example_index_to_features[example_index] + + prelim_predictions = [] + # keep track of the minimum score of null start+end of position 0 + score_null = 1000000 # large and positive + min_null_feature_index = 0 # the paragraph slice with min null score + null_start_logit = 0 # the start logit at the slice with min null score + null_end_logit = 0 # the end logit at the slice with min null score + for (feature_index, feature) in enumerate(features): + result = unique_id_to_result[feature.unique_id] + start_indexes = 
_get_best_indexes(result.start_logits, n_best_size) + end_indexes = _get_best_indexes(result.end_logits, n_best_size) + # if we could have irrelevant answers, get the min score of irrelevant + if version_2_with_negative: + feature_null_score = result.start_logits[ + 0] + result.end_logits[0] + if feature_null_score < score_null: + score_null = feature_null_score + min_null_feature_index = feature_index + null_start_logit = result.start_logits[0] + null_end_logit = result.end_logits[0] + for start_index in start_indexes: + for end_index in end_indexes: + # We could hypothetically create invalid predictions, e.g., predict + # that the start of the span is in the question. We throw out all + # invalid predictions. + if start_index >= len(feature.tokens): + continue + if end_index >= len(feature.tokens): + continue + if start_index not in feature.token_to_orig_map: + continue + if end_index not in feature.token_to_orig_map: + continue + if not feature.token_is_max_context.get( + start_index, False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + _PrelimPrediction( + feature_index=feature_index, + start_index=start_index, + end_index=end_index, + start_logit=result.start_logits[start_index], + end_logit=result.end_logits[end_index])) + if version_2_with_negative: + prelim_predictions.append( + _PrelimPrediction(feature_index=min_null_feature_index, + start_index=0, + end_index=0, + start_logit=null_start_logit, + end_logit=null_end_logit)) + prelim_predictions = sorted( + prelim_predictions, + key=lambda x: (x.start_logit + x.end_logit), + reverse=True) + + _NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name + "NbestPrediction", ["text", "start_logit", "end_logit"]) + + seen_predictions = {} + nbest = [] + for pred in prelim_predictions: + if len(nbest) >= n_best_size: + break + feature = features[pred.feature_index] + if pred.start_index > 0: # this is a non-null prediction + tok_tokens = feature.tokens[pred.start_index:(pred.end_index + + 1)] + orig_doc_start = feature.token_to_orig_map[pred.start_index] + orig_doc_end = feature.token_to_orig_map[pred.end_index] + orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end + + 1)] + tok_text = " ".join(tok_tokens) + + # De-tokenize WordPieces that have been split off. + tok_text = tok_text.replace(" ##", "") + tok_text = tok_text.replace("##", "") + + # Clean whitespace + tok_text = tok_text.strip() + tok_text = " ".join(tok_text.split()) + orig_text = " ".join(orig_tokens) + + final_text = get_final_text(tok_text, orig_text, do_lower_case, + verbose_logging) + if final_text in seen_predictions: + continue + + seen_predictions[final_text] = True + else: + final_text = "" + seen_predictions[final_text] = True + + nbest.append( + _NbestPrediction(text=final_text, + start_logit=pred.start_logit, + end_logit=pred.end_logit)) + # if we didn't include the empty option in the n-best, include it + if version_2_with_negative: + if "" not in seen_predictions: + nbest.append( + _NbestPrediction(text="", + start_logit=null_start_logit, + end_logit=null_end_logit)) + + # In very rare edge cases we could only have single null prediction. + # So we just create a nonce prediction in this case to avoid failure. + if len(nbest) == 1: + nbest.insert( + 0, + _NbestPrediction(text="empty", + start_logit=0.0, + end_logit=0.0)) + + # In very rare edge cases we could have no valid predictions. 
So we + # just create a nonce prediction in this case to avoid failure. + if not nbest: + nbest.append( + _NbestPrediction(text="empty", start_logit=0.0, end_logit=0.0)) + + assert len(nbest) >= 1 + + total_scores = [] + best_non_null_entry = None + for entry in nbest: + total_scores.append(entry.start_logit + entry.end_logit) + if not best_non_null_entry: + if entry.text: + best_non_null_entry = entry + + probs = _compute_softmax(total_scores) + + nbest_json = [] + for (i, entry) in enumerate(nbest): + output = collections.OrderedDict() + output["text"] = entry.text + output["probability"] = probs[i] + output["start_logit"] = entry.start_logit + output["end_logit"] = entry.end_logit + nbest_json.append(output) + + assert len(nbest_json) >= 1 + + if not version_2_with_negative: + all_predictions[example.qas_id] = nbest_json[0]["text"] + else: + # predict "" iff the null score - the score of best non-null > threshold + score_diff = score_null - best_non_null_entry.start_logit - ( + best_non_null_entry.end_logit) + scores_diff_json[example.qas_id] = score_diff + if score_diff > null_score_diff_threshold: + all_predictions[example.qas_id] = "" + else: + all_predictions[example.qas_id] = best_non_null_entry.text + all_nbest_json[example.qas_id] = nbest_json + + with open(output_prediction_file, "w") as writer: + writer.write(json.dumps(all_predictions, indent=4) + "\n") + + with open(output_nbest_file, "w") as writer: + writer.write(json.dumps(all_nbest_json, indent=4) + "\n") + + if version_2_with_negative: + with open(output_null_log_odds_file, "w") as writer: + writer.write(json.dumps(scores_diff_json, indent=4) + "\n") + + +def get_final_text(pred_text, orig_text, do_lower_case, verbose_logging=False): + """Project the tokenized prediction back to the original text.""" + + # When we created the data, we kept track of the alignment between original + # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So + # now `orig_text` contains the span of our original text corresponding to the + # span that we predicted. + # + # However, `orig_text` may contain extra characters that we don't want in + # our prediction. + # + # For example, let's say: + # pred_text = steve smith + # orig_text = Steve Smith's + # + # We don't want to return `orig_text` because it contains the extra "'s". + # + # We don't want to return `pred_text` because it's already been normalized + # (the SQuAD eval script also does punctuation stripping/lower casing but + # our tokenizer does additional normalization like stripping accent + # characters). + # + # What we really want to return is "Steve Smith". + # + # Therefore, we have to apply a semi-complicated alignment heuristic between + # `pred_text` and `orig_text` to get a character-to-character alignment. This + # can fail in certain cases in which case we just return `orig_text`. + + def _strip_spaces(text): + ns_chars = [] + ns_to_s_map = collections.OrderedDict() + for (i, c) in enumerate(text): + if c == " ": + continue + ns_to_s_map[len(ns_chars)] = i + ns_chars.append(c) + ns_text = "".join(ns_chars) + return (ns_text, ns_to_s_map) + + # We first tokenize `orig_text`, strip whitespace from the result + # and `pred_text`, and check if they are the same length. If they are + # NOT the same length, the heuristic has failed. If they are the same + # length, we assume the characters are one-to-one aligned. 
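get_final_text below aligns the prediction back to the original text by stripping spaces and recording, for every kept character, its index in the original string. A standalone sketch of that mapping, mirroring the inner _strip_spaces helper (the example string is illustrative):

```python
import collections

def strip_spaces(text):
    # Drop spaces and remember, for each kept character, its original index.
    ns_chars, ns_to_s = [], collections.OrderedDict()
    for i, c in enumerate(text):
        if c == " ":
            continue
        ns_to_s[len(ns_chars)] = i
        ns_chars.append(c)
    return "".join(ns_chars), ns_to_s

ns_text, mapping = strip_spaces("Steve Smith's")
print(ns_text)     # SteveSmith's
print(mapping[5])  # 6 -> index of the 'S' in "Smith" within the original string
```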
+ tokenizer = BasicTokenizer(do_lower_case=do_lower_case) + + tok_text = " ".join(tokenizer.tokenize(orig_text)) + + start_position = tok_text.find(pred_text) + if start_position == -1: + if verbose_logging: + logger.info("Unable to find text: '%s' in '%s'" % + (pred_text, orig_text)) + return orig_text + end_position = start_position + len(pred_text) - 1 + + (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text) + (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text) + + if len(orig_ns_text) != len(tok_ns_text): + if verbose_logging: + logger.info( + "Length not equal after stripping spaces: '%s' vs '%s'", + orig_ns_text, tok_ns_text) + return orig_text + + # We then project the characters in `pred_text` back to `orig_text` using + # the character-to-character alignment. + tok_s_to_ns_map = {} + for (i, tok_index) in tok_ns_to_s_map.items(): + tok_s_to_ns_map[tok_index] = i + + orig_start_position = None + if start_position in tok_s_to_ns_map: + ns_start_position = tok_s_to_ns_map[start_position] + if ns_start_position in orig_ns_to_s_map: + orig_start_position = orig_ns_to_s_map[ns_start_position] + + if orig_start_position is None: + if verbose_logging: + logger.info("Couldn't map start position") + return orig_text + + orig_end_position = None + if end_position in tok_s_to_ns_map: + ns_end_position = tok_s_to_ns_map[end_position] + if ns_end_position in orig_ns_to_s_map: + orig_end_position = orig_ns_to_s_map[ns_end_position] + + if orig_end_position is None: + if verbose_logging: + logger.info("Couldn't map end position") + return orig_text + + output_text = orig_text[orig_start_position:(orig_end_position + 1)] + return output_text + + +def _get_best_indexes(logits, n_best_size): + """Get the n-best logits from a list.""" + index_and_score = sorted(enumerate(logits), + key=lambda x: x[1], + reverse=True) + + best_indexes = [] + for i in range(len(index_and_score)): + if i >= n_best_size: + break + best_indexes.append(index_and_score[i][0]) + return best_indexes + + +def _compute_softmax(scores): + """Compute softmax probability over raw logits.""" + if not scores: + return [] + + max_score = None + for score in scores: + if max_score is None or score > max_score: + max_score = score + + exp_scores = [] + total_sum = 0.0 + for score in scores: + x = math.exp(score - max_score) + exp_scores.append(x) + total_sum += x + + probs = [] + for score in exp_scores: + probs.append(score / total_sum) + return probs + + +def main(): + parser = argparse.ArgumentParser() + + ## Required parameters + parser.add_argument( + "--bert_model", + default=None, + type=str, + required=True, + help="Bert pre-trained model selected in the list: bert-base-uncased, " + "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, " + "bert-base-multilingual-cased, bert-base-chinese.") + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help= + "The output directory where the model checkpoints and predictions will be written." + ) + + ## Other parameters + parser.add_argument("--train_file", + default=None, + type=str, + help="SQuAD json for training. E.g., train-v1.1.json") + parser.add_argument( + "--predict_file", + default=None, + type=str, + help="SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json" + ) + parser.add_argument( + "--max_seq_length", + default=384, + type=int, + help= + "The maximum total input sequence length after WordPiece tokenization. 
Sequences " + "longer than this will be truncated, and sequences shorter than this will be padded." + ) + parser.add_argument( + "--doc_stride", + default=128, + type=int, + help= + "When splitting up a long document into chunks, how much stride to take between chunks." + ) + parser.add_argument( + "--max_query_length", + default=64, + type=int, + help= + "The maximum number of tokens for the question. Questions longer than this will " + "be truncated to this length.") + parser.add_argument("--do_train", + action='store_true', + help="Whether to run training.") + parser.add_argument("--do_predict", + action='store_true', + help="Whether to run eval on the dev set.") + parser.add_argument("--train_batch_size", + default=32, + type=int, + help="Total batch size for training.") + parser.add_argument("--predict_batch_size", + default=8, + type=int, + help="Total batch size for predictions.") + parser.add_argument("--learning_rate", + default=5e-5, + type=float, + help="The initial learning rate for Adam.") + parser.add_argument("--num_train_epochs", + default=3.0, + type=float, + help="Total number of training epochs to perform.") + parser.add_argument( + "--warmup_proportion", + default=0.1, + type=float, + help= + "Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10%% " + "of training.") + parser.add_argument( + "--n_best_size", + default=20, + type=int, + help= + "The total number of n-best predictions to generate in the nbest_predictions.json " + "output file.") + parser.add_argument( + "--max_answer_length", + default=30, + type=int, + help= + "The maximum length of an answer that can be generated. This is needed because the start " + "and end predictions are not conditioned on one another.") + parser.add_argument( + "--verbose_logging", + action='store_true', + help= + "If true, all of the warnings related to data processing will be printed. " + "A number of warnings are expected for a normal SQuAD evaluation.") + parser.add_argument("--no_cuda", + action='store_true', + help="Whether not to use CUDA when available") + parser.add_argument('--seed', + type=int, + default=42, + help="random seed for initialization") + parser.add_argument( + '--gradient_accumulation_steps', + type=int, + default=1, + help= + "Number of updates steps to accumulate before performing a backward/update pass." + ) + parser.add_argument( + "--do_lower_case", + action='store_true', + help= + "Whether to lower case the input text. True for uncased models, False for cased models." + ) + parser.add_argument("--local_rank", + type=int, + default=-1, + help="local_rank for distributed training on gpus") + parser.add_argument( + '--fp16', + action='store_true', + help="Whether to use 16-bit float precision instead of 32-bit") + parser.add_argument( + '--loss_scale', + type=float, + default=0, + help= + "Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n" + "0 (default value): dynamic loss scaling.\n" + "Positive power of 2: static loss scaling value.\n") + parser.add_argument( + '--version_2_with_negative', + action='store_true', + help= + 'If true, the SQuAD examples contain some that do not have an answer.') + parser.add_argument( + '--null_score_diff_threshold', + type=float, + default=0.0, + help= + "If null_score - best_non_null is greater than the threshold predict null." 
+ ) + parser.add_argument('--server_ip', + type=str, + default='', + help="Can be used for distant debugging.") + parser.add_argument('--server_port', + type=str, + default='', + help="Can be used for distant debugging.") + args = parser.parse_args() + print(args) + + if args.server_ip and args.server_port: + # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script + import ptvsd + print("Waiting for debugger attach") + ptvsd.enable_attach(address=(args.server_ip, args.server_port), + redirect_output=True) + ptvsd.wait_for_attach() + + if args.local_rank == -1 or args.no_cuda: + device = torch.device("cuda" if torch.cuda.is_available() + and not args.no_cuda else "cpu") + n_gpu = torch.cuda.device_count() + else: + torch.cuda.set_device(args.local_rank) + device = torch.device("cuda", args.local_rank) + n_gpu = 1 + # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + torch.distributed.init_process_group(backend='nccl') + + logging.basicConfig( + format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', + datefmt='%m/%d/%Y %H:%M:%S', + level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN) + + logger.info( + "device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}". + format(device, n_gpu, bool(args.local_rank != -1), args.fp16)) + + if args.gradient_accumulation_steps < 1: + raise ValueError( + "Invalid gradient_accumulation_steps parameter: {}, should be >= 1" + .format(args.gradient_accumulation_steps)) + + args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps + + random.seed(args.seed) + np.random.seed(args.seed) + torch.manual_seed(args.seed) + if n_gpu > 0: + torch.cuda.manual_seed_all(args.seed) + + if not args.do_train and not args.do_predict: + raise ValueError( + "At least one of `do_train` or `do_predict` must be True.") + + if args.do_train: + if not args.train_file: + raise ValueError( + "If `do_train` is True, then `train_file` must be specified.") + if args.do_predict: + if not args.predict_file: + raise ValueError( + "If `do_predict` is True, then `predict_file` must be specified." + ) + + if os.path.exists(args.output_dir) and os.listdir( + args.output_dir) and args.do_train: + raise ValueError( + "Output directory () already exists and is not empty.") + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + + tokenizer = BertTokenizer.from_pretrained(args.bert_model, + do_lower_case=args.do_lower_case) + + train_examples = None + num_train_optimization_steps = None + if args.do_train: + train_examples = read_squad_examples( + input_file=args.train_file, + is_training=True, + version_2_with_negative=args.version_2_with_negative) + num_train_optimization_steps = int( + len(train_examples) / args.train_batch_size / + args.gradient_accumulation_steps) * args.num_train_epochs + if args.local_rank != -1: + num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size( + ) + + # Prepare model + model = BertForQuestionAnswering.from_pretrained( + args.bert_model, + cache_dir=os.path.join(str(PYTORCH_PRETRAINED_BERT_CACHE), + 'distributed_{}'.format(args.local_rank))) + + if args.fp16: + model.half() + model.to(device) + if args.local_rank != -1: + try: + from apex.parallel import DistributedDataParallel as DDP + except ImportError: + raise ImportError( + "Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training." 
+ ) + + model = DDP(model) + elif n_gpu > 1: + model = torch.nn.DataParallel(model) + + # Prepare optimizer + if args.do_train: + param_optimizer = list(model.named_parameters()) + + # hack to remove pooler, which is not used + # thus it produce None grad that break apex + param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]] + + no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [{ + 'params': [ + p for n, p in param_optimizer + if not any(nd in n for nd in no_decay) + ], + 'weight_decay': + 0.01 + }, { + 'params': + [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], + 'weight_decay': + 0.0 + }] + + if args.fp16: + try: + from apex.optimizers import FP16_Optimizer + from apex.optimizers import FusedAdam + except ImportError: + raise ImportError( + "Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training." + ) + + optimizer = FusedAdam(optimizer_grouped_parameters, + lr=args.learning_rate, + bias_correction=False, + max_grad_norm=1.0) + if args.loss_scale == 0: + optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True) + else: + optimizer = FP16_Optimizer(optimizer, + static_loss_scale=args.loss_scale) + warmup_linear = WarmupLinearSchedule( + warmup=args.warmup_proportion, + t_total=num_train_optimization_steps) + else: + optimizer = AdamW(optimizer_grouped_parameters, + lr=args.learning_rate, + warmup=args.warmup_proportion, + t_total=num_train_optimization_steps) + + global_step = 0 + if args.do_train: + cached_train_features_file = args.train_file + '_{0}_{1}_{2}_{3}'.format( + list(filter(None, args.bert_model.split('/'))).pop(), + str(args.max_seq_length), str(args.doc_stride), + str(args.max_query_length)) + train_features = None + try: + with open(cached_train_features_file, "rb") as reader: + train_features = pickle.load(reader) + except: + train_features = convert_examples_to_features( + examples=train_examples, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + doc_stride=args.doc_stride, + max_query_length=args.max_query_length, + is_training=True) + if args.local_rank == -1 or torch.distributed.get_rank() == 0: + logger.info(" Saving train features into cached file %s", + cached_train_features_file) + with open(cached_train_features_file, "wb") as writer: + pickle.dump(train_features, writer) + logger.info("***** Running training *****") + logger.info(" Num orig examples = %d", len(train_examples)) + logger.info(" Num split examples = %d", len(train_features)) + logger.info(" Batch size = %d", args.train_batch_size) + logger.info(" Num steps = %d", num_train_optimization_steps) + all_input_ids = torch.tensor([f.input_ids for f in train_features], + dtype=torch.long) + all_input_mask = torch.tensor([f.input_mask for f in train_features], + dtype=torch.long) + all_segment_ids = torch.tensor([f.segment_ids for f in train_features], + dtype=torch.long) + all_start_positions = torch.tensor( + [f.start_position for f in train_features], dtype=torch.long) + all_end_positions = torch.tensor( + [f.end_position for f in train_features], dtype=torch.long) + train_data = TensorDataset(all_input_ids, all_input_mask, + all_segment_ids, all_start_positions, + all_end_positions) + if args.local_rank == -1: + train_sampler = RandomSampler(train_data) + else: + train_sampler = DistributedSampler(train_data) + train_dataloader = DataLoader(train_data, + sampler=train_sampler, + batch_size=args.train_batch_size) + + model.train() + for _ in 
trange(int(args.num_train_epochs), desc="Epoch"): + for step, batch in enumerate( + tqdm(train_dataloader, + desc="Iteration", + disable=args.local_rank not in [-1, 0])): + if n_gpu == 1: + batch = tuple( + t.to(device) + for t in batch) # multi-gpu does scattering it-self + input_ids, input_mask, segment_ids, start_positions, end_positions = batch + loss = model(input_ids, segment_ids, input_mask, + start_positions, end_positions) + if n_gpu > 1: + loss = loss.mean() # mean() to average on multi-gpu. + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + + if args.fp16: + optimizer.backward(loss) + else: + loss.backward() + if (step + 1) % args.gradient_accumulation_steps == 0: + if args.fp16: + # modify learning rate with special warm up BERT uses + # if args.fp16 is False, AdamW is used and handles this automatically + lr_this_step = args.learning_rate * warmup_linear.get_lr( + global_step, args.warmup_proportion) + for param_group in optimizer.param_groups: + param_group['lr'] = lr_this_step + optimizer.step() + optimizer.zero_grad() + global_step += 1 + + if args.do_train and (args.local_rank == -1 + or torch.distributed.get_rank() == 0): + # Save a trained model, configuration and tokenizer + model_to_save = model.module if hasattr( + model, 'module') else model # Only save the model it-self + + # If we save using the predefined names, we can load using `from_pretrained` + output_model_file = os.path.join(args.output_dir, WEIGHTS_NAME) + output_config_file = os.path.join(args.output_dir, CONFIG_NAME) + + torch.save(model_to_save.state_dict(), output_model_file) + model_to_save.config.to_json_file(output_config_file) + tokenizer.save_vocabulary(args.output_dir) + + # Load a trained model and vocabulary that you have fine-tuned + model = BertForQuestionAnswering.from_pretrained(args.output_dir) + tokenizer = BertTokenizer.from_pretrained( + args.output_dir, do_lower_case=args.do_lower_case) + else: + model = BertForQuestionAnswering.from_pretrained(args.bert_model) + + model.to(device) + + if args.do_predict and (args.local_rank == -1 + or torch.distributed.get_rank() == 0): + eval_examples = read_squad_examples( + input_file=args.predict_file, + is_training=False, + version_2_with_negative=args.version_2_with_negative) + eval_features = convert_examples_to_features( + examples=eval_examples, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + doc_stride=args.doc_stride, + max_query_length=args.max_query_length, + is_training=False) + + logger.info("***** Running predictions *****") + logger.info(" Num orig examples = %d", len(eval_examples)) + logger.info(" Num split examples = %d", len(eval_features)) + logger.info(" Batch size = %d", args.predict_batch_size) + + all_input_ids = torch.tensor([f.input_ids for f in eval_features], + dtype=torch.long) + all_input_mask = torch.tensor([f.input_mask for f in eval_features], + dtype=torch.long) + all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], + dtype=torch.long) + all_example_index = torch.arange(all_input_ids.size(0), + dtype=torch.long) + eval_data = TensorDataset(all_input_ids, all_input_mask, + all_segment_ids, all_example_index) + # Run prediction for full data + eval_sampler = SequentialSampler(eval_data) + eval_dataloader = DataLoader(eval_data, + sampler=eval_sampler, + batch_size=args.predict_batch_size) + + model.eval() + all_results = [] + logger.info("Start evaluating") + for input_ids, input_mask, segment_ids, example_indices in tqdm( + eval_dataloader, + 
desc="Evaluating", + disable=args.local_rank not in [-1, 0]): + if len(all_results) % 1000 == 0: + logger.info("Processing example: %d" % (len(all_results))) + input_ids = input_ids.to(device) + input_mask = input_mask.to(device) + segment_ids = segment_ids.to(device) + with torch.no_grad(): + batch_start_logits, batch_end_logits = model( + input_ids, segment_ids, input_mask) + for i, example_index in enumerate(example_indices): + start_logits = batch_start_logits[i].detach().cpu().tolist() + end_logits = batch_end_logits[i].detach().cpu().tolist() + eval_feature = eval_features[example_index.item()] + unique_id = int(eval_feature.unique_id) + all_results.append( + RawResult(unique_id=unique_id, + start_logits=start_logits, + end_logits=end_logits)) + output_prediction_file = os.path.join(args.output_dir, + "predictions.json") + output_nbest_file = os.path.join(args.output_dir, + "nbest_predictions.json") + output_null_log_odds_file = os.path.join(args.output_dir, + "null_odds.json") + write_predictions(eval_examples, eval_features, all_results, + args.n_best_size, args.max_answer_length, + args.do_lower_case, output_prediction_file, + output_nbest_file, output_null_log_odds_file, + args.verbose_logging, args.version_2_with_negative, + args.null_score_diff_threshold) + + +if __name__ == "__main__": + main() diff --git a/collections/nemo_nlp/nemo_nlp/externals/sacrebleu.py b/collections/nemo_nlp/nemo_nlp/externals/sacrebleu.py new file mode 100755 index 000000000000..7d14c91b46c2 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/externals/sacrebleu.py @@ -0,0 +1,2403 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- + +# Copyright 2017--2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"). You may not +# use this file except in compliance with the License. A copy of the License +# is located at +# +# http://aws.amazon.com/apache2.0/ +# +# or in the "license" file accompanying this file. This file is distributed on +# an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either +# express or implied. See the License for the specific language governing +# permissions and limitations under the License. +""" +SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. +Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text. +It also knows all the standard test sets and handles downloading, processing, and tokenization for you. + +See the [README.md] file for more information. +""" + +import argparse +import gzip +import hashlib +import io +import logging +import math +import os +import re +import sys +import unicodedata +import urllib.request + +from collections import Counter, namedtuple +from itertools import zip_longest +from typing import List, Iterable, Tuple, Union + +VERSION = '1.3.5' + +try: + # SIGPIPE is not available on Windows machines, throwing an exception. + from signal import SIGPIPE + + # If SIGPIPE is available, change behaviour to default instead of ignore. + from signal import signal, SIG_DFL + signal(SIGPIPE, SIG_DFL) + +except ImportError: + logging.warning( + 'Could not import signal.SIGPIPE (this is expected on Windows machines)' + ) + +# Where to store downloaded test sets. +# Define the environment variable $SACREBLEU, or use the default of ~/.sacrebleu. 
+# +# Querying for a HOME environment variable can result in None (e.g., on Windows) +# in which case the os.path.join() throws a TypeError. Using expanduser() is +# a safe way to get the user's home folder. +USERHOME = os.path.expanduser("~") +SACREBLEU_DIR = os.environ.get('SACREBLEU', + os.path.join(USERHOME, '.sacrebleu')) + +# n-gram order. Don't change this. +NGRAM_ORDER = 4 + +# Default values for CHRF +CHRF_ORDER = 6 +# default to 2 (per http://www.aclweb.org/anthology/W16-2341) +CHRF_BETA = 2 + +# The default floor value to use with `--smooth floor` +SMOOTH_VALUE_DEFAULT = 0.0 + +# This defines data locations. +# At the top level are test sets. +# Beneath each test set, we define the location to download the test data. +# The other keys are each language pair contained in the tarball, and the respective locations of the source and reference data within each. +# Many of these are *.sgm files, which are processed to produced plain text that can be used by this script. +# The canonical location of unpacked, processed data is $SACREBLEU_DIR/$TEST/$SOURCE-$TARGET.{$SOURCE,$TARGET} +DATASETS = { + 'mtnt2019': { + 'data': ['http://www.cs.cmu.edu/~pmichel1/hosting/MTNT2019.tar.gz'], + 'description': 'Test set for the WMT 19 robustness shared task', + 'md5': ['78a672e1931f106a8549023c0e8af8f6'], + 'en-fr': ['2:MTNT2019/en-fr.final.tsv', '3:MTNT2019/en-fr.final.tsv'], + 'fr-en': ['2:MTNT2019/fr-en.final.tsv', '3:MTNT2019/fr-en.final.tsv'], + 'en-ja': ['2:MTNT2019/en-ja.final.tsv', '3:MTNT2019/en-ja.final.tsv'], + 'ja-en': ['2:MTNT2019/ja-en.final.tsv', '3:MTNT2019/ja-en.final.tsv'], + }, + 'mtnt1.1/test': { + 'data': [ + 'https://github.com/pmichel31415/mtnt/releases/download/v1.1/MTNT.1.1.tar.gz' + ], + 'description': + 'Test data for the Machine Translation of Noisy Text task: http://www.cs.cmu.edu/~pmichel1/mtnt/', + 'citation': + '@InProceedings{michel2018a:mtnt,\n author = "Michel, Paul and Neubig, Graham",\n title = "MTNT: A Testbed for Machine Translation of Noisy Text",\n booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",\n year = "2018",\n publisher = "Association for Computational Linguistics",\n pages = "543--553",\n location = "Brussels, Belgium",\n url = "http://aclweb.org/anthology/D18-1050"\n}', + 'md5': ['8ce1831ac584979ba8cdcd9d4be43e1d'], + 'en-fr': ['1:MTNT/test/test.en-fr.tsv', '2:MTNT/test/test.en-fr.tsv'], + 'fr-en': ['1:MTNT/test/test.fr-en.tsv', '2:MTNT/test/test.fr-en.tsv'], + 'en-ja': ['1:MTNT/test/test.en-ja.tsv', '2:MTNT/test/test.en-ja.tsv'], + 'ja-en': ['1:MTNT/test/test.ja-en.tsv', '2:MTNT/test/test.ja-en.tsv'], + }, + 'mtnt1.1/valid': { + 'data': [ + 'https://github.com/pmichel31415/mtnt/releases/download/v1.1/MTNT.1.1.tar.gz' + ], + 'description': + 'Validation data for the Machine Translation of Noisy Text task: http://www.cs.cmu.edu/~pmichel1/mtnt/', + 'citation': + '@InProceedings{michel2018a:mtnt,\n author = "Michel, Paul and Neubig, Graham",\n title = "MTNT: A Testbed for Machine Translation of Noisy Text",\n booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",\n year = "2018",\n publisher = "Association for Computational Linguistics",\n pages = "543--553",\n location = "Brussels, Belgium",\n url = "http://aclweb.org/anthology/D18-1050"\n}', + 'md5': ['8ce1831ac584979ba8cdcd9d4be43e1d'], + 'en-fr': + ['1:MTNT/valid/valid.en-fr.tsv', '2:MTNT/valid/valid.en-fr.tsv'], + 'fr-en': + ['1:MTNT/valid/valid.fr-en.tsv', '2:MTNT/valid/valid.fr-en.tsv'], + 'en-ja': + 
['1:MTNT/valid/valid.en-ja.tsv', '2:MTNT/valid/valid.en-ja.tsv'], + 'ja-en': + ['1:MTNT/valid/valid.ja-en.tsv', '2:MTNT/valid/valid.ja-en.tsv'], + }, + 'mtnt1.1/train': { + 'data': [ + 'https://github.com/pmichel31415/mtnt/releases/download/v1.1/MTNT.1.1.tar.gz' + ], + 'description': + 'Training data for the Machine Translation of Noisy Text task: http://www.cs.cmu.edu/~pmichel1/mtnt/', + 'citation': + '@InProceedings{michel2018a:mtnt,\n author = "Michel, Paul and Neubig, Graham",\n title = "MTNT: A Testbed for Machine Translation of Noisy Text",\n booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",\n year = "2018",\n publisher = "Association for Computational Linguistics",\n pages = "543--553",\n location = "Brussels, Belgium",\n url = "http://aclweb.org/anthology/D18-1050"\n}', + 'md5': ['8ce1831ac584979ba8cdcd9d4be43e1d'], + 'en-fr': + ['1:MTNT/train/train.en-fr.tsv', '2:MTNT/train/train.en-fr.tsv'], + 'fr-en': + ['1:MTNT/train/train.fr-en.tsv', '2:MTNT/train/train.fr-en.tsv'], + 'en-ja': + ['1:MTNT/train/train.en-ja.tsv', '2:MTNT/train/train.en-ja.tsv'], + 'ja-en': + ['1:MTNT/train/train.ja-en.tsv', '2:MTNT/train/train.ja-en.tsv'], + }, + 'wmt19': { + 'data': ['http://data.statmt.org/wmt19/translation-task/test.tgz'], + 'md5': ['84de7162d158e28403103b01aeefc39a'], + 'cs-de': [ + 'sgm/newstest2019-csde-src.cs.sgm', + 'sgm/newstest2019-csde-ref.de.sgm' + ], + 'de-cs': [ + 'sgm/newstest2019-decs-src.de.sgm', + 'sgm/newstest2019-decs-ref.cs.sgm' + ], + 'de-en': [ + 'sgm/newstest2019-deen-src.de.sgm', + 'sgm/newstest2019-deen-ref.en.sgm' + ], + 'de-fr': [ + 'sgm/newstest2019-defr-src.de.sgm', + 'sgm/newstest2019-defr-ref.fr.sgm' + ], + 'en-cs': [ + 'sgm/newstest2019-encs-src.en.sgm', + 'sgm/newstest2019-encs-ref.cs.sgm' + ], + 'en-de': [ + 'sgm/newstest2019-ende-src.en.sgm', + 'sgm/newstest2019-ende-ref.de.sgm' + ], + 'en-fi': [ + 'sgm/newstest2019-enfi-src.en.sgm', + 'sgm/newstest2019-enfi-ref.fi.sgm' + ], + 'en-gu': [ + 'sgm/newstest2019-engu-src.en.sgm', + 'sgm/newstest2019-engu-ref.gu.sgm' + ], + 'en-kk': [ + 'sgm/newstest2019-enkk-src.en.sgm', + 'sgm/newstest2019-enkk-ref.kk.sgm' + ], + 'en-lt': [ + 'sgm/newstest2019-enlt-src.en.sgm', + 'sgm/newstest2019-enlt-ref.lt.sgm' + ], + 'en-ru': [ + 'sgm/newstest2019-enru-src.en.sgm', + 'sgm/newstest2019-enru-ref.ru.sgm' + ], + 'en-zh': [ + 'sgm/newstest2019-enzh-src.en.sgm', + 'sgm/newstest2019-enzh-ref.zh.sgm' + ], + 'fi-en': [ + 'sgm/newstest2019-fien-src.fi.sgm', + 'sgm/newstest2019-fien-ref.en.sgm' + ], + 'fr-de': [ + 'sgm/newstest2019-frde-src.fr.sgm', + 'sgm/newstest2019-frde-ref.de.sgm' + ], + 'gu-en': [ + 'sgm/newstest2019-guen-src.gu.sgm', + 'sgm/newstest2019-guen-ref.en.sgm' + ], + 'kk-en': [ + 'sgm/newstest2019-kken-src.kk.sgm', + 'sgm/newstest2019-kken-ref.en.sgm' + ], + 'lt-en': [ + 'sgm/newstest2019-lten-src.lt.sgm', + 'sgm/newstest2019-lten-ref.en.sgm' + ], + 'ru-en': [ + 'sgm/newstest2019-ruen-src.ru.sgm', + 'sgm/newstest2019-ruen-ref.en.sgm' + ], + 'zh-en': [ + 'sgm/newstest2019-zhen-src.zh.sgm', + 'sgm/newstest2019-zhen-ref.en.sgm' + ], + }, + 'wmt19/dev': { + 'data': ['http://data.statmt.org/wmt19/translation-task/dev.tgz'], + 'description': + 'Development data for tasks new to 2019.', + 'md5': ['f2ec7af5947c19e0cacb3882eb208002'], + 'lt-en': + ['dev/newsdev2019-lten-src.lt.sgm', 'dev/newsdev2019-lten-ref.en.sgm'], + 'en-lt': + ['dev/newsdev2019-enlt-src.en.sgm', 'dev/newsdev2019-enlt-ref.lt.sgm'], + 'gu-en': + ['dev/newsdev2019-guen-src.gu.sgm', 
'dev/newsdev2019-guen-ref.en.sgm'], + 'en-gu': + ['dev/newsdev2019-engu-src.en.sgm', 'dev/newsdev2019-engu-ref.gu.sgm'], + 'kk-en': + ['dev/newsdev2019-kken-src.kk.sgm', 'dev/newsdev2019-kken-ref.en.sgm'], + 'en-kk': + ['dev/newsdev2019-enkk-src.en.sgm', 'dev/newsdev2019-enkk-ref.kk.sgm'], + }, + 'wmt18': { + 'data': ['http://data.statmt.org/wmt18/translation-task/test.tgz'], + 'md5': ['f996c245ecffea23d0006fa4c34e9064'], + 'description': + 'Official evaluation data.', + 'citation': + '@inproceedings{bojar-etal-2018-findings,\n title = "Findings of the 2018 Conference on Machine Translation ({WMT}18)",\n author = "Bojar, Ond{\v{r}}ej and\n Federmann, Christian and\n Fishel, Mark and\n Graham, Yvette and\n Haddow, Barry and\n Koehn, Philipp and\n Monz, Christof",\n booktitle = "Proceedings of the Third Conference on Machine Translation: Shared Task Papers",\n month = oct,\n year = "2018",\n address = "Belgium, Brussels",\n publisher = "Association for Computational Linguistics",\n url = "https://www.aclweb.org/anthology/W18-6401",\n pages = "272--303",\n}', + 'cs-en': [ + 'test/newstest2018-csen-src.cs.sgm', + 'test/newstest2018-csen-ref.en.sgm' + ], + 'de-en': [ + 'test/newstest2018-deen-src.de.sgm', + 'test/newstest2018-deen-ref.en.sgm' + ], + 'en-cs': [ + 'test/newstest2018-encs-src.en.sgm', + 'test/newstest2018-encs-ref.cs.sgm' + ], + 'en-de': [ + 'test/newstest2018-ende-src.en.sgm', + 'test/newstest2018-ende-ref.de.sgm' + ], + 'en-et': [ + 'test/newstest2018-enet-src.en.sgm', + 'test/newstest2018-enet-ref.et.sgm' + ], + 'en-fi': [ + 'test/newstest2018-enfi-src.en.sgm', + 'test/newstest2018-enfi-ref.fi.sgm' + ], + 'en-ru': [ + 'test/newstest2018-enru-src.en.sgm', + 'test/newstest2018-enru-ref.ru.sgm' + ], + 'et-en': [ + 'test/newstest2018-eten-src.et.sgm', + 'test/newstest2018-eten-ref.en.sgm' + ], + 'fi-en': [ + 'test/newstest2018-fien-src.fi.sgm', + 'test/newstest2018-fien-ref.en.sgm' + ], + 'ru-en': [ + 'test/newstest2018-ruen-src.ru.sgm', + 'test/newstest2018-ruen-ref.en.sgm' + ], + 'en-tr': [ + 'test/newstest2018-entr-src.en.sgm', + 'test/newstest2018-entr-ref.tr.sgm' + ], + 'tr-en': [ + 'test/newstest2018-tren-src.tr.sgm', + 'test/newstest2018-tren-ref.en.sgm' + ], + 'en-zh': [ + 'test/newstest2018-enzh-src.en.sgm', + 'test/newstest2018-enzh-ref.zh.sgm' + ], + 'zh-en': [ + 'test/newstest2018-zhen-src.zh.sgm', + 'test/newstest2018-zhen-ref.en.sgm' + ], + }, + 'wmt18/test-ts': { + 'data': ['http://data.statmt.org/wmt18/translation-task/test-ts.tgz'], + 'md5': ['5c621a34d512cc2dd74162ae7d00b320'], + 'description': + 'Official evaluation sources with extra test sets interleaved.', + 'cs-en': ['test/newstest2018-csen-src-ts.cs.sgm'], + 'de-en': ['test/newstest2018-deen-src-ts.de.sgm'], + 'en-cs': ['test/newstest2018-encs-src-ts.en.sgm'], + 'en-de': ['test/newstest2018-ende-src-ts.en.sgm'], + 'en-et': ['test/newstest2018-enet-src-ts.en.sgm'], + 'en-fi': ['test/newstest2018-enfi-src-ts.en.sgm'], + 'en-ru': ['test/newstest2018-enru-src-ts.en.sgm'], + 'et-en': ['test/newstest2018-eten-src-ts.et.sgm'], + 'fi-en': ['test/newstest2018-fien-src-ts.fi.sgm'], + 'ru-en': ['test/newstest2018-ruen-src-ts.ru.sgm'], + 'en-tr': ['test/newstest2018-entr-src-ts.en.sgm'], + 'tr-en': ['test/newstest2018-tren-src-ts.tr.sgm'], + 'en-zh': ['test/newstest2018-enzh-src-ts.en.sgm'], + 'zh-en': ['test/newstest2018-zhen-src-ts.zh.sgm'], + }, + 'wmt18/dev': { + 'data': ['http://data.statmt.org/wmt18/translation-task/dev.tgz'], + 'md5': ['486f391da54a7a3247f02ebd25996f24'], + 'description': + 'Development data 
(Estonian<>English).', + 'et-en': + ['dev/newsdev2018-eten-src.et.sgm', 'dev/newsdev2018-eten-ref.en.sgm'], + 'en-et': + ['dev/newsdev2018-enet-src.en.sgm', 'dev/newsdev2018-enet-ref.et.sgm'], + }, + 'wmt17': { + 'data': ['http://data.statmt.org/wmt17/translation-task/test.tgz'], + 'md5': ['86a1724c276004aa25455ae2a04cef26'], + 'description': + 'Official evaluation data.', + 'citation': + '@InProceedings{bojar-EtAl:2017:WMT1,\n author = {Bojar, Ond\\v{r}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huang, Shujian and Huck, Matthias and Koehn, Philipp and Liu, Qun and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Post, Matt and Rubino, Raphael and Specia, Lucia and Turchi, Marco},\n title = {Findings of the 2017 Conference on Machine Translation (WMT17)},\n booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},\n month = {September},\n year = {2017},\n address = {Copenhagen, Denmark},\n publisher = {Association for Computational Linguistics},\n pages = {169--214},\n url = {http://www.aclweb.org/anthology/W17-4717}\n}', + 'cs-en': [ + 'test/newstest2017-csen-src.cs.sgm', + 'test/newstest2017-csen-ref.en.sgm' + ], + 'de-en': [ + 'test/newstest2017-deen-src.de.sgm', + 'test/newstest2017-deen-ref.en.sgm' + ], + 'en-cs': [ + 'test/newstest2017-encs-src.en.sgm', + 'test/newstest2017-encs-ref.cs.sgm' + ], + 'en-de': [ + 'test/newstest2017-ende-src.en.sgm', + 'test/newstest2017-ende-ref.de.sgm' + ], + 'en-fi': [ + 'test/newstest2017-enfi-src.en.sgm', + 'test/newstest2017-enfi-ref.fi.sgm' + ], + 'en-lv': [ + 'test/newstest2017-enlv-src.en.sgm', + 'test/newstest2017-enlv-ref.lv.sgm' + ], + 'en-ru': [ + 'test/newstest2017-enru-src.en.sgm', + 'test/newstest2017-enru-ref.ru.sgm' + ], + 'en-tr': [ + 'test/newstest2017-entr-src.en.sgm', + 'test/newstest2017-entr-ref.tr.sgm' + ], + 'en-zh': [ + 'test/newstest2017-enzh-src.en.sgm', + 'test/newstest2017-enzh-ref.zh.sgm' + ], + 'fi-en': [ + 'test/newstest2017-fien-src.fi.sgm', + 'test/newstest2017-fien-ref.en.sgm' + ], + 'lv-en': [ + 'test/newstest2017-lven-src.lv.sgm', + 'test/newstest2017-lven-ref.en.sgm' + ], + 'ru-en': [ + 'test/newstest2017-ruen-src.ru.sgm', + 'test/newstest2017-ruen-ref.en.sgm' + ], + 'tr-en': [ + 'test/newstest2017-tren-src.tr.sgm', + 'test/newstest2017-tren-ref.en.sgm' + ], + 'zh-en': [ + 'test/newstest2017-zhen-src.zh.sgm', + 'test/newstest2017-zhen-ref.en.sgm' + ], + }, + 'wmt17/B': { + 'data': ['http://data.statmt.org/wmt17/translation-task/test.tgz'], + 'md5': ['86a1724c276004aa25455ae2a04cef26'], + 'description': + 'Additional reference for EN-FI and FI-EN.', + 'en-fi': [ + 'test/newstestB2017-enfi-src.en.sgm', + 'test/newstestB2017-enfi-ref.fi.sgm' + ], + }, + 'wmt17/tworefs': { + 'data': ['http://data.statmt.org/wmt17/translation-task/test.tgz'], + 'md5': ['86a1724c276004aa25455ae2a04cef26'], + 'description': + 'Systems with two references.', + 'en-fi': [ + 'test/newstest2017-enfi-src.en.sgm', + 'test/newstest2017-enfi-ref.fi.sgm', + 'test/newstestB2017-enfi-ref.fi.sgm' + ], + }, + 'wmt17/improved': { + 'data': + ['http://data.statmt.org/wmt17/translation-task/test-update-1.tgz'], + 'md5': ['91dbfd5af99bc6891a637a68e04dfd41'], + 'description': 'Improved zh-en and en-zh translations.', + 'en-zh': + ['newstest2017-enzh-src.en.sgm', 'newstest2017-enzh-ref.zh.sgm'], + 'zh-en': + ['newstest2017-zhen-src.zh.sgm', 'newstest2017-zhen-ref.en.sgm'], + }, + 'wmt17/dev': { + 'data': 
['http://data.statmt.org/wmt17/translation-task/dev.tgz'], + 'md5': ['9b1aa63c1cf49dccdd20b962fe313989'], + 'description': + 'Development sets released for new languages in 2017.', + 'en-lv': + ['dev/newsdev2017-enlv-src.en.sgm', 'dev/newsdev2017-enlv-ref.lv.sgm'], + 'en-zh': + ['dev/newsdev2017-enzh-src.en.sgm', 'dev/newsdev2017-enzh-ref.zh.sgm'], + 'lv-en': + ['dev/newsdev2017-lven-src.lv.sgm', 'dev/newsdev2017-lven-ref.en.sgm'], + 'zh-en': + ['dev/newsdev2017-zhen-src.zh.sgm', 'dev/newsdev2017-zhen-ref.en.sgm'], + }, + 'wmt17/ms': { + 'data': [ + 'https://github.com/MicrosoftTranslator/Translator-HumanParityData/archive/master.zip', + 'http://data.statmt.org/wmt17/translation-task/test-update-1.tgz' + ], + 'md5': [ + '18fdaa7a3c84cf6ef688da1f6a5fa96f', + '91dbfd5af99bc6891a637a68e04dfd41' + ], + 'description': + 'Additional Chinese-English references from Microsoft Research.', + 'citation': + '@inproceedings{achieving-human-parity-on-automatic-chinese-to-english-news-translation,\n author = {Hassan Awadalla, Hany and Aue, Anthony and Chen, Chang and Chowdhary, Vishal and Clark, Jonathan and Federmann, Christian and Huang, Xuedong and Junczys-Dowmunt, Marcin and Lewis, Will and Li, Mu and Liu, Shujie and Liu, Tie-Yan and Luo, Renqian and Menezes, Arul and Qin, Tao and Seide, Frank and Tan, Xu and Tian, Fei and Wu, Lijun and Wu, Shuangzhi and Xia, Yingce and Zhang, Dongdong and Zhang, Zhirui and Zhou, Ming},\n title = {Achieving Human Parity on Automatic Chinese to English News Translation},\n booktitle = {},\n year = {2018},\n month = {March},\n abstract = {Machine translation has made rapid advances in recent years. Millions of people are using it today in online translation systems and mobile applications in order to communicate across language barriers. The question naturally arises whether such systems can approach or achieve parity with human translations. In this paper, we first address the problem of how to define and accurately measure human parity in translation. We then describe Microsoft’s machine translation system and measure the quality of its translations on the widely used WMT 2017 news translation task from Chinese to English. We find that our latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations. 
We also find that it significantly exceeds the quality of crowd-sourced non-professional translations.},\n publisher = {},\n url = {https://www.microsoft.com/en-us/research/publication/achieving-human-parity-on-automatic-chinese-to-english-news-translation/},\n address = {},\n pages = {},\n journal = {},\n volume = {},\n chapter = {},\n isbn = {},\n}', + 'zh-en': [ + 'newstest2017-zhen-src.zh.sgm', 'newstest2017-zhen-ref.en.sgm', + 'Translator-HumanParityData-master/Translator-HumanParityData/References/Translator-HumanParityData-Reference-HT.txt', + 'Translator-HumanParityData-master/Translator-HumanParityData/References/Translator-HumanParityData-Reference-PE.txt' + ], + }, + 'wmt16': { + 'data': ['http://data.statmt.org/wmt16/translation-task/test.tgz'], + 'md5': ['3d809cd0c2c86adb2c67034d15c4e446'], + 'description': + 'Official evaluation data.', + 'citation': + '@InProceedings{bojar-EtAl:2016:WMT1,\n author = {Bojar, Ond\\v{r}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huck, Matthias and Jimeno Yepes, Antonio and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Neveol, Aurelie and Neves, Mariana and Popel, Martin and Post, Matt and Rubino, Raphael and Scarton, Carolina and Specia, Lucia and Turchi, Marco and Verspoor, Karin and Zampieri, Marcos},\n title = {Findings of the 2016 Conference on Machine Translation},\n booktitle = {Proceedings of the First Conference on Machine Translation},\n month = {August},\n year = {2016},\n address = {Berlin, Germany},\n publisher = {Association for Computational Linguistics},\n pages = {131--198},\n url = {http://www.aclweb.org/anthology/W/W16/W16-2301}\n}', + 'cs-en': [ + 'test/newstest2016-csen-src.cs.sgm', + 'test/newstest2016-csen-ref.en.sgm' + ], + 'de-en': [ + 'test/newstest2016-deen-src.de.sgm', + 'test/newstest2016-deen-ref.en.sgm' + ], + 'en-cs': [ + 'test/newstest2016-encs-src.en.sgm', + 'test/newstest2016-encs-ref.cs.sgm' + ], + 'en-de': [ + 'test/newstest2016-ende-src.en.sgm', + 'test/newstest2016-ende-ref.de.sgm' + ], + 'en-fi': [ + 'test/newstest2016-enfi-src.en.sgm', + 'test/newstest2016-enfi-ref.fi.sgm' + ], + 'en-ro': [ + 'test/newstest2016-enro-src.en.sgm', + 'test/newstest2016-enro-ref.ro.sgm' + ], + 'en-ru': [ + 'test/newstest2016-enru-src.en.sgm', + 'test/newstest2016-enru-ref.ru.sgm' + ], + 'en-tr': [ + 'test/newstest2016-entr-src.en.sgm', + 'test/newstest2016-entr-ref.tr.sgm' + ], + 'fi-en': [ + 'test/newstest2016-fien-src.fi.sgm', + 'test/newstest2016-fien-ref.en.sgm' + ], + 'ro-en': [ + 'test/newstest2016-roen-src.ro.sgm', + 'test/newstest2016-roen-ref.en.sgm' + ], + 'ru-en': [ + 'test/newstest2016-ruen-src.ru.sgm', + 'test/newstest2016-ruen-ref.en.sgm' + ], + 'tr-en': [ + 'test/newstest2016-tren-src.tr.sgm', + 'test/newstest2016-tren-ref.en.sgm' + ], + }, + 'wmt16/B': { + 'data': ['http://data.statmt.org/wmt16/translation-task/test.tgz'], + 'md5': ['3d809cd0c2c86adb2c67034d15c4e446'], + 'description': + 'Additional reference for EN-FI.', + 'en-fi': [ + 'test/newstest2016-enfi-src.en.sgm', + 'test/newstestB2016-enfi-ref.fi.sgm' + ], + }, + 'wmt16/tworefs': { + 'data': ['http://data.statmt.org/wmt16/translation-task/test.tgz'], + 'md5': ['3d809cd0c2c86adb2c67034d15c4e446'], + 'description': + 'EN-FI with two references.', + 'en-fi': [ + 'test/newstest2016-enfi-src.en.sgm', + 'test/newstest2016-enfi-ref.fi.sgm', + 'test/newstestB2016-enfi-ref.fi.sgm' + ], + }, + 'wmt16/dev': { + 'data': ['http://data.statmt.org/wmt16/translation-task/dev.tgz'], + 
'md5': ['4a3dc2760bb077f4308cce96b06e6af6'], + 'description': + 'Development sets released for new languages in 2016.', + 'en-ro': + ['dev/newsdev2016-enro-src.en.sgm', 'dev/newsdev2016-enro-ref.ro.sgm'], + 'en-tr': + ['dev/newsdev2016-entr-src.en.sgm', 'dev/newsdev2016-entr-ref.tr.sgm'], + 'ro-en': + ['dev/newsdev2016-roen-src.ro.sgm', 'dev/newsdev2016-roen-ref.en.sgm'], + 'tr-en': + ['dev/newsdev2016-tren-src.tr.sgm', 'dev/newsdev2016-tren-ref.en.sgm'] + }, + 'wmt15': { + 'data': ['http://statmt.org/wmt15/test.tgz'], + 'md5': ['67e3beca15e69fe3d36de149da0a96df'], + 'description': + 'Official evaluation data.', + 'citation': + '@InProceedings{bojar-EtAl:2015:WMT,\n author = {Bojar, Ond\\v{r}ej and Chatterjee, Rajen and Federmann, Christian and Haddow, Barry and Huck, Matthias and Hokamp, Chris and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Post, Matt and Scarton, Carolina and Specia, Lucia and Turchi, Marco},\n title = {Findings of the 2015 Workshop on Statistical Machine Translation},\n booktitle = {Proceedings of the Tenth Workshop on Statistical Machine Translation},\n month = {September},\n year = {2015},\n address = {Lisbon, Portugal},\n publisher = {Association for Computational Linguistics},\n pages = {1--46},\n url = {http://aclweb.org/anthology/W15-3001}\n}', + 'en-fr': [ + 'test/newsdiscusstest2015-enfr-src.en.sgm', + 'test/newsdiscusstest2015-enfr-ref.fr.sgm' + ], + 'fr-en': [ + 'test/newsdiscusstest2015-fren-src.fr.sgm', + 'test/newsdiscusstest2015-fren-ref.en.sgm' + ], + 'cs-en': [ + 'test/newstest2015-csen-src.cs.sgm', + 'test/newstest2015-csen-ref.en.sgm' + ], + 'de-en': [ + 'test/newstest2015-deen-src.de.sgm', + 'test/newstest2015-deen-ref.en.sgm' + ], + 'en-cs': [ + 'test/newstest2015-encs-src.en.sgm', + 'test/newstest2015-encs-ref.cs.sgm' + ], + 'en-de': [ + 'test/newstest2015-ende-src.en.sgm', + 'test/newstest2015-ende-ref.de.sgm' + ], + 'en-fi': [ + 'test/newstest2015-enfi-src.en.sgm', + 'test/newstest2015-enfi-ref.fi.sgm' + ], + 'en-ru': [ + 'test/newstest2015-enru-src.en.sgm', + 'test/newstest2015-enru-ref.ru.sgm' + ], + 'fi-en': [ + 'test/newstest2015-fien-src.fi.sgm', + 'test/newstest2015-fien-ref.en.sgm' + ], + 'ru-en': [ + 'test/newstest2015-ruen-src.ru.sgm', + 'test/newstest2015-ruen-ref.en.sgm' + ] + }, + 'wmt14': { + 'data': ['http://statmt.org/wmt14/test-filtered.tgz'], + 'md5': ['84c597844c1542e29c2aff23aaee4310'], + 'description': + 'Official evaluation data.', + 'citation': + '@InProceedings{bojar-EtAl:2014:W14-33,\n author = {Bojar, Ondrej and Buck, Christian and Federmann, Christian and Haddow, Barry and Koehn, Philipp and Leveling, Johannes and Monz, Christof and Pecina, Pavel and Post, Matt and Saint-Amand, Herve and Soricut, Radu and Specia, Lucia and Tamchyna, Ale\\v{s}},\n title = {Findings of the 2014 Workshop on Statistical Machine Translation},\n booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},\n month = {June},\n year = {2014},\n address = {Baltimore, Maryland, USA},\n publisher = {Association for Computational Linguistics},\n pages = {12--58},\n url = {http://www.aclweb.org/anthology/W/W14/W14-3302}\n}', + 'cs-en': [ + 'test/newstest2014-csen-src.cs.sgm', + 'test/newstest2014-csen-ref.en.sgm' + ], + 'en-cs': [ + 'test/newstest2014-csen-src.en.sgm', + 'test/newstest2014-csen-ref.cs.sgm' + ], + 'de-en': [ + 'test/newstest2014-deen-src.de.sgm', + 'test/newstest2014-deen-ref.en.sgm' + ], + 'en-de': [ + 'test/newstest2014-deen-src.en.sgm', + 'test/newstest2014-deen-ref.de.sgm' + 
], + 'en-fr': [ + 'test/newstest2014-fren-src.en.sgm', + 'test/newstest2014-fren-ref.fr.sgm' + ], + 'fr-en': [ + 'test/newstest2014-fren-src.fr.sgm', + 'test/newstest2014-fren-ref.en.sgm' + ], + 'en-hi': [ + 'test/newstest2014-hien-src.en.sgm', + 'test/newstest2014-hien-ref.hi.sgm' + ], + 'hi-en': [ + 'test/newstest2014-hien-src.hi.sgm', + 'test/newstest2014-hien-ref.en.sgm' + ], + 'en-ru': [ + 'test/newstest2014-ruen-src.en.sgm', + 'test/newstest2014-ruen-ref.ru.sgm' + ], + 'ru-en': [ + 'test/newstest2014-ruen-src.ru.sgm', + 'test/newstest2014-ruen-ref.en.sgm' + ] + }, + 'wmt14/full': { + 'data': ['http://statmt.org/wmt14/test-full.tgz'], + 'md5': ['a8cd784e006feb32ac6f3d9ec7eb389a'], + 'description': + 'Evaluation data released after official evaluation for further research.', + 'cs-en': [ + 'test-full/newstest2014-csen-src.cs.sgm', + 'test-full/newstest2014-csen-ref.en.sgm' + ], + 'en-cs': [ + 'test-full/newstest2014-csen-src.en.sgm', + 'test-full/newstest2014-csen-ref.cs.sgm' + ], + 'de-en': [ + 'test-full/newstest2014-deen-src.de.sgm', + 'test-full/newstest2014-deen-ref.en.sgm' + ], + 'en-de': [ + 'test-full/newstest2014-deen-src.en.sgm', + 'test-full/newstest2014-deen-ref.de.sgm' + ], + 'en-fr': [ + 'test-full/newstest2014-fren-src.en.sgm', + 'test-full/newstest2014-fren-ref.fr.sgm' + ], + 'fr-en': [ + 'test-full/newstest2014-fren-src.fr.sgm', + 'test-full/newstest2014-fren-ref.en.sgm' + ], + 'en-hi': [ + 'test-full/newstest2014-hien-src.en.sgm', + 'test-full/newstest2014-hien-ref.hi.sgm' + ], + 'hi-en': [ + 'test-full/newstest2014-hien-src.hi.sgm', + 'test-full/newstest2014-hien-ref.en.sgm' + ], + 'en-ru': [ + 'test-full/newstest2014-ruen-src.en.sgm', + 'test-full/newstest2014-ruen-ref.ru.sgm' + ], + 'ru-en': [ + 'test-full/newstest2014-ruen-src.ru.sgm', + 'test-full/newstest2014-ruen-ref.en.sgm' + ] + }, + 'wmt13': { + 'data': ['http://statmt.org/wmt13/test.tgz'], + 'md5': ['48eca5d02f637af44e85186847141f67'], + 'description': 'Official evaluation data.', + 'citation': + '@InProceedings{bojar-EtAl:2013:WMT,\n author = {Bojar, Ond\\v{r}ej and Buck, Christian and Callison-Burch, Chris and Federmann, Christian and Haddow, Barry and Koehn, Philipp and Monz, Christof and Post, Matt and Soricut, Radu and Specia, Lucia},\n title = {Findings of the 2013 {Workshop on Statistical Machine Translation}},\n booktitle = {Proceedings of the Eighth Workshop on Statistical Machine Translation},\n month = {August},\n year = {2013},\n address = {Sofia, Bulgaria},\n publisher = {Association for Computational Linguistics},\n pages = {1--44},\n url = {http://www.aclweb.org/anthology/W13-2201}\n}', + 'cs-en': + ['test/newstest2013-src.cs.sgm', 'test/newstest2013-src.en.sgm'], + 'en-cs': + ['test/newstest2013-src.en.sgm', 'test/newstest2013-src.cs.sgm'], + 'de-en': + ['test/newstest2013-src.de.sgm', 'test/newstest2013-src.en.sgm'], + 'en-de': + ['test/newstest2013-src.en.sgm', 'test/newstest2013-src.de.sgm'], + 'es-en': + ['test/newstest2013-src.es.sgm', 'test/newstest2013-src.en.sgm'], + 'en-es': + ['test/newstest2013-src.en.sgm', 'test/newstest2013-src.es.sgm'], + 'fr-en': + ['test/newstest2013-src.fr.sgm', 'test/newstest2013-src.en.sgm'], + 'en-fr': + ['test/newstest2013-src.en.sgm', 'test/newstest2013-src.fr.sgm'], + 'ru-en': + ['test/newstest2013-src.ru.sgm', 'test/newstest2013-src.en.sgm'], + 'en-ru': + ['test/newstest2013-src.en.sgm', 'test/newstest2013-src.ru.sgm'] + }, + 'wmt12': { + 'data': ['http://statmt.org/wmt12/test.tgz'], + 'md5': ['608232d34ebc4ba2ff70fead45674e47'], + 'description': 
'Official evaluation data.', + 'citation': + '@InProceedings{callisonburch-EtAl:2012:WMT,\n author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Post, Matt and Soricut, Radu and Specia, Lucia},\n title = {Findings of the 2012 Workshop on Statistical Machine Translation},\n booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},\n month = {June},\n year = {2012},\n address = {Montr{\'e}al, Canada},\n publisher = {Association for Computational Linguistics},\n pages = {10--51},\n url = {http://www.aclweb.org/anthology/W12-3102}\n}', + 'cs-en': + ['test/newstest2012-src.cs.sgm', 'test/newstest2012-src.en.sgm'], + 'en-cs': + ['test/newstest2012-src.en.sgm', 'test/newstest2012-src.cs.sgm'], + 'de-en': + ['test/newstest2012-src.de.sgm', 'test/newstest2012-src.en.sgm'], + 'en-de': + ['test/newstest2012-src.en.sgm', 'test/newstest2012-src.de.sgm'], + 'es-en': + ['test/newstest2012-src.es.sgm', 'test/newstest2012-src.en.sgm'], + 'en-es': + ['test/newstest2012-src.en.sgm', 'test/newstest2012-src.es.sgm'], + 'fr-en': + ['test/newstest2012-src.fr.sgm', 'test/newstest2012-src.en.sgm'], + 'en-fr': + ['test/newstest2012-src.en.sgm', 'test/newstest2012-src.fr.sgm'] + }, + 'wmt11': { + 'data': ['http://statmt.org/wmt11/test.tgz'], + 'md5': ['b0c9680adf32d394aefc2b24e3a5937e'], + 'description': 'Official evaluation data.', + 'citation': + '@InProceedings{callisonburch-EtAl:2011:WMT,\n author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Zaidan, Omar},\n title = {Findings of the 2011 Workshop on Statistical Machine Translation},\n booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},\n month = {July},\n year = {2011},\n address = {Edinburgh, Scotland},\n publisher = {Association for Computational Linguistics},\n pages = {22--64},\n url = {http://www.aclweb.org/anthology/W11-2103}\n}', + 'cs-en': ['newstest2011-src.cs.sgm', 'newstest2011-src.en.sgm'], + 'en-cs': ['newstest2011-src.en.sgm', 'newstest2011-src.cs.sgm'], + 'de-en': ['newstest2011-src.de.sgm', 'newstest2011-src.en.sgm'], + 'en-de': ['newstest2011-src.en.sgm', 'newstest2011-src.de.sgm'], + 'fr-en': ['newstest2011-src.fr.sgm', 'newstest2011-src.en.sgm'], + 'en-fr': ['newstest2011-src.en.sgm', 'newstest2011-src.fr.sgm'], + 'es-en': ['newstest2011-src.es.sgm', 'newstest2011-src.en.sgm'], + 'en-es': ['newstest2011-src.en.sgm', 'newstest2011-src.es.sgm'] + }, + 'wmt10': { + 'data': ['http://statmt.org/wmt10/test.tgz'], + 'md5': ['491cb885a355da5a23ea66e7b3024d5c'], + 'description': 'Official evaluation data.', + 'citation': + '@InProceedings{callisonburch-EtAl:2010:WMT,\n author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Peterson, Kay and Przybocki, Mark and Zaidan, Omar},\n title = {Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation},\n booktitle = {Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR},\n month = {July},\n year = {2010},\n address = {Uppsala, Sweden},\n publisher = {Association for Computational Linguistics},\n pages = {17--53},\n note = {Revised August 2010},\n url = {http://www.aclweb.org/anthology/W10-1703}\n}', + 'cs-en': + ['test/newstest2010-src.cz.sgm', 'test/newstest2010-src.en.sgm'], + 'en-cs': + ['test/newstest2010-src.en.sgm', 'test/newstest2010-src.cz.sgm'], + 'de-en': + ['test/newstest2010-src.de.sgm', 'test/newstest2010-src.en.sgm'], + 'en-de': + ['test/newstest2010-src.en.sgm', 
'test/newstest2010-src.de.sgm'], + 'es-en': + ['test/newstest2010-src.es.sgm', 'test/newstest2010-src.en.sgm'], + 'en-es': + ['test/newstest2010-src.en.sgm', 'test/newstest2010-src.es.sgm'], + 'fr-en': + ['test/newstest2010-src.fr.sgm', 'test/newstest2010-src.en.sgm'], + 'en-fr': + ['test/newstest2010-src.en.sgm', 'test/newstest2010-src.fr.sgm'] + }, + 'wmt09': { + 'data': ['http://statmt.org/wmt09/test.tgz'], + 'md5': ['da227abfbd7b666ec175b742a0d27b37'], + 'description': 'Official evaluation data.', + 'citation': + '@InProceedings{callisonburch-EtAl:2009:WMT-09,\n author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Schroeder, Josh},\n title = {Findings of the 2009 {W}orkshop on {S}tatistical {M}achine {T}ranslation},\n booktitle = {Proceedings of the Fourth Workshop on Statistical Machine Translation},\n month = {March},\n year = {2009},\n address = {Athens, Greece},\n publisher = {Association for Computational Linguistics},\n pages = {1--28},\n url = {http://www.aclweb.org/anthology/W/W09/W09-0401}\n}', + 'cs-en': + ['test/newstest2009-src.cz.sgm', 'test/newstest2009-src.en.sgm'], + 'en-cs': + ['test/newstest2009-src.en.sgm', 'test/newstest2009-src.cz.sgm'], + 'de-en': + ['test/newstest2009-src.de.sgm', 'test/newstest2009-src.en.sgm'], + 'en-de': + ['test/newstest2009-src.en.sgm', 'test/newstest2009-src.de.sgm'], + 'es-en': + ['test/newstest2009-src.es.sgm', 'test/newstest2009-src.en.sgm'], + 'en-es': + ['test/newstest2009-src.en.sgm', 'test/newstest2009-src.es.sgm'], + 'fr-en': + ['test/newstest2009-src.fr.sgm', 'test/newstest2009-src.en.sgm'], + 'en-fr': + ['test/newstest2009-src.en.sgm', 'test/newstest2009-src.fr.sgm'], + 'hu-en': + ['test/newstest2009-src.hu.sgm', 'test/newstest2009-src.en.sgm'], + 'en-hu': + ['test/newstest2009-src.en.sgm', 'test/newstest2009-src.hu.sgm'], + 'it-en': + ['test/newstest2009-src.it.sgm', 'test/newstest2009-src.en.sgm'], + 'en-it': [ + 'test/newstest2009-src.en.sgm', 'test/newstest2009-src.it.sgm' + ] + }, + 'wmt08': { + 'data': ['http://statmt.org/wmt08/test.tgz'], + 'md5': ['0582e4e894a3342044059c894e1aea3d'], + 'description': 'Official evaluation data.', + 'citation': + '@InProceedings{callisonburch-EtAl:2008:WMT,\n author = {Callison-Burch, Chris and Fordyce, Cameron and Koehn, Philipp and Monz, Christof and Schroeder, Josh},\n title = {Further Meta-Evaluation of Machine Translation},\n booktitle = {Proceedings of the Third Workshop on Statistical Machine Translation},\n month = {June},\n year = {2008},\n address = {Columbus, Ohio},\n publisher = {Association for Computational Linguistics},\n pages = {70--106},\n url = {http://www.aclweb.org/anthology/W/W08/W08-0309}\n}', + 'cs-en': + ['test/newstest2008-src.cz.sgm', 'test/newstest2008-src.en.sgm'], + 'en-cs': + ['test/newstest2008-src.en.sgm', 'test/newstest2008-src.cz.sgm'], + 'de-en': + ['test/newstest2008-src.de.sgm', 'test/newstest2008-src.en.sgm'], + 'en-de': + ['test/newstest2008-src.en.sgm', 'test/newstest2008-src.de.sgm'], + 'es-en': + ['test/newstest2008-src.es.sgm', 'test/newstest2008-src.en.sgm'], + 'en-es': + ['test/newstest2008-src.en.sgm', 'test/newstest2008-src.es.sgm'], + 'fr-en': + ['test/newstest2008-src.fr.sgm', 'test/newstest2008-src.en.sgm'], + 'en-fr': + ['test/newstest2008-src.en.sgm', 'test/newstest2008-src.fr.sgm'], + 'hu-en': + ['test/newstest2008-src.hu.sgm', 'test/newstest2008-src.en.sgm'], + 'en-hu': + ['test/newstest2008-src.en.sgm', 'test/newstest2008-src.hu.sgm'] + }, + 'wmt08/nc': { + 'data': ['http://statmt.org/wmt08/test.tgz'], + 'md5': 
['0582e4e894a3342044059c894e1aea3d'], + 'description': 'Official evaluation data (news commentary).', + 'cs-en': + ['test/nc-test2008-src.cz.sgm', 'test/nc-test2008-src.en.sgm'], + 'en-cs': + ['test/nc-test2008-src.en.sgm', 'test/nc-test2008-src.cz.sgm'] + }, + 'wmt08/europarl': { + 'data': ['http://statmt.org/wmt08/test.tgz'], + 'md5': ['0582e4e894a3342044059c894e1aea3d'], + 'description': 'Official evaluation data (Europarl).', + 'de-en': ['test/test2008-src.de.sgm', 'test/test2008-src.en.sgm'], + 'en-de': ['test/test2008-src.en.sgm', 'test/test2008-src.de.sgm'], + 'es-en': ['test/test2008-src.es.sgm', 'test/test2008-src.en.sgm'], + 'en-es': ['test/test2008-src.en.sgm', 'test/test2008-src.es.sgm'], + 'fr-en': ['test/test2008-src.fr.sgm', 'test/test2008-src.en.sgm'], + 'en-fr': ['test/test2008-src.en.sgm', 'test/test2008-src.fr.sgm'] + }, + 'iwslt17': { + 'data': [ + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/en/fr/en-fr.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/fr/en/fr-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/en/de/en-de.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/de/en/de-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/en/ar/en-ar.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/ar/en/ar-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/en/ja/en-ja.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/ja/en/ja-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/en/ko/en-ko.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/ko/en/ko-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/en/zh/en-zh.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/zh/en/zh-en.tgz' + ], + 'md5': [ + "1849bcc3b006dc0642a8843b11aa7192", + "79bf7a2ef02d226875f55fb076e7e473", + "b68e7097b179491f6c466ef41ad72b9b", + "e3f5b2a075a2da1a395c8b60bf1e9be1", + "ecdc6bc4ab4c8984e919444f3c05183a", + "4b5141d14b98706c081371e2f8afe0ca", + "d957ee79de1f33c89077d37c5a2c5b06", + "c213e8bb918ebf843543fe9fd2e33db2", + "59f6a81c707378176e9ad8bb8d811f5f", + "7e580af973bb389ec1d1378a1850742f", + "975a858783a0ebec8c57d83ddd5bd381", + "cc51d9b7fe1ff2af858c6a0dd80b8815" + ], + 'description': + 'Official evaluation data for IWSLT.', + 'citation': + '@InProceedings{iwslt2017,\n author = {Cettolo, Mauro and Federico, Marcello and Bentivogli, Luisa and Niehues, Jan and Stüker, Sebastian and Sudoh, Katsuitho and Yoshino, Koichiro and Federmann, Christian},\n title = {Overview of the IWSLT 2017 Evaluation Campaign},\n booktitle = {14th International Workshop on Spoken Language Translation},\n month = {December},\n year = {2017},\n address = {Tokyo, Japan},\n pages = {2--14},\n url = {http://workshop2017.iwslt.org/downloads/iwslt2017_proceeding_v2.pdf}\n}', + 'en-fr': [ + 'en-fr/IWSLT17.TED.tst2017.en-fr.en.xml', + 'fr-en/IWSLT17.TED.tst2017.fr-en.fr.xml' + ], + 'fr-en': [ + 'fr-en/IWSLT17.TED.tst2017.fr-en.fr.xml', + 'en-fr/IWSLT17.TED.tst2017.en-fr.en.xml' + ], + 'en-de': [ + 'en-de/IWSLT17.TED.tst2017.en-de.en.xml', + 'de-en/IWSLT17.TED.tst2017.de-en.de.xml' + ], + 'de-en': [ + 'de-en/IWSLT17.TED.tst2017.de-en.de.xml', + 'en-de/IWSLT17.TED.tst2017.en-de.en.xml' + ], + 'en-zh': [ + 'en-zh/IWSLT17.TED.tst2017.en-zh.en.xml', + 'zh-en/IWSLT17.TED.tst2017.zh-en.zh.xml' + ], + 'zh-en': [ + 'zh-en/IWSLT17.TED.tst2017.zh-en.zh.xml', + 'en-zh/IWSLT17.TED.tst2017.en-zh.en.xml' + ], + }, + 'iwslt17/tst2016': { + 'data': [ + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/en/fr/en-fr.tgz', + 
'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/fr/en/fr-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/en/de/en-de.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/de/en/de-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/en/zh/en-zh.tgz', + 'https://wit3.fbk.eu/archive/2017-01-ted-test/texts/zh/en/zh-en.tgz' + ], + "md5": [ + "1849bcc3b006dc0642a8843b11aa7192", + "79bf7a2ef02d226875f55fb076e7e473", + "b68e7097b179491f6c466ef41ad72b9b", + "e3f5b2a075a2da1a395c8b60bf1e9be1", + "975a858783a0ebec8c57d83ddd5bd381", + "cc51d9b7fe1ff2af858c6a0dd80b8815" + ], + 'description': + 'Development data for IWSLT 2017.', + 'en-fr': [ + 'en-fr/IWSLT17.TED.tst2016.en-fr.en.xml', + 'fr-en/IWSLT17.TED.tst2016.fr-en.fr.xml' + ], + 'fr-en': [ + 'fr-en/IWSLT17.TED.tst2016.fr-en.fr.xml', + 'en-fr/IWSLT17.TED.tst2016.en-fr.en.xml' + ], + 'en-de': [ + 'en-de/IWSLT17.TED.tst2016.en-de.en.xml', + 'de-en/IWSLT17.TED.tst2016.de-en.de.xml' + ], + 'de-en': [ + 'de-en/IWSLT17.TED.tst2016.de-en.de.xml', + 'en-de/IWSLT17.TED.tst2016.en-de.en.xml' + ], + 'en-zh': [ + 'en-zh/IWSLT17.TED.tst2016.en-zh.en.xml', + 'zh-en/IWSLT17.TED.tst2016.zh-en.zh.xml' + ], + 'zh-en': [ + 'zh-en/IWSLT17.TED.tst2016.zh-en.zh.xml', + 'en-zh/IWSLT17.TED.tst2016.en-zh.en.xml' + ], + }, + 'iwslt17/tst2015': { + 'data': [ + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/de/en-de.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/de/en/de-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/fr/en-fr.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/zh/en-zh.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/zh/en/zh-en.tgz' + ], + "md5": [ + "d8a32cfc002a4f12b17429cfa78050e6", + "ca2b94d694150d4d6c5dc64c200fa589", + "3cf07ebe305312b12f7f1a4d5f8f8377", + "19927da9de0f40348cad9c0fc61642ac", + "575b788dad6c5b9c5cee636f9ac1094a", + "1c0ae40171d52593df8a6963d3828116" + ], + 'description': + 'Development data for IWSLT 2017.', + 'en-fr': [ + 'en-fr/IWSLT17.TED.tst2015.en-fr.en.xml', + 'fr-en/IWSLT17.TED.tst2015.fr-en.fr.xml' + ], + 'fr-en': [ + 'fr-en/IWSLT17.TED.tst2015.fr-en.fr.xml', + 'en-fr/IWSLT17.TED.tst2015.en-fr.en.xml' + ], + 'en-de': [ + 'en-de/IWSLT17.TED.tst2015.en-de.en.xml', + 'de-en/IWSLT17.TED.tst2015.de-en.de.xml' + ], + 'de-en': [ + 'de-en/IWSLT17.TED.tst2015.de-en.de.xml', + 'en-de/IWSLT17.TED.tst2015.en-de.en.xml' + ], + 'en-zh': [ + 'en-zh/IWSLT17.TED.tst2015.en-zh.en.xml', + 'zh-en/IWSLT17.TED.tst2015.zh-en.zh.xml' + ], + 'zh-en': [ + 'zh-en/IWSLT17.TED.tst2015.zh-en.zh.xml', + 'en-zh/IWSLT17.TED.tst2015.en-zh.en.xml' + ], + }, + 'iwslt17/tst2014': { + 'data': [ + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/de/en-de.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/de/en/de-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/fr/en-fr.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/zh/en-zh.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/zh/en/zh-en.tgz' + ], + "md5": [ + "d8a32cfc002a4f12b17429cfa78050e6", + "ca2b94d694150d4d6c5dc64c200fa589", + "3cf07ebe305312b12f7f1a4d5f8f8377", + "19927da9de0f40348cad9c0fc61642ac", + "575b788dad6c5b9c5cee636f9ac1094a", + "1c0ae40171d52593df8a6963d3828116" + ], + 'description': + 'Development data for IWSLT 2017.', + 'en-fr': [ + 'en-fr/IWSLT17.TED.tst2014.en-fr.en.xml', + 'fr-en/IWSLT17.TED.tst2014.fr-en.fr.xml' + ], + 
'fr-en': [ + 'fr-en/IWSLT17.TED.tst2014.fr-en.fr.xml', + 'en-fr/IWSLT17.TED.tst2014.en-fr.en.xml' + ], + 'en-de': [ + 'en-de/IWSLT17.TED.tst2014.en-de.en.xml', + 'de-en/IWSLT17.TED.tst2014.de-en.de.xml' + ], + 'de-en': [ + 'de-en/IWSLT17.TED.tst2014.de-en.de.xml', + 'en-de/IWSLT17.TED.tst2014.en-de.en.xml' + ], + 'en-zh': [ + 'en-zh/IWSLT17.TED.tst2014.en-zh.en.xml', + 'zh-en/IWSLT17.TED.tst2014.zh-en.zh.xml' + ], + 'zh-en': [ + 'zh-en/IWSLT17.TED.tst2014.zh-en.zh.xml', + 'en-zh/IWSLT17.TED.tst2014.en-zh.en.xml' + ], + }, + 'iwslt17/tst2013': { + 'data': [ + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/de/en-de.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/de/en/de-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/fr/en-fr.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/zh/en-zh.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/zh/en/zh-en.tgz' + ], + "md5": [ + "d8a32cfc002a4f12b17429cfa78050e6", + "ca2b94d694150d4d6c5dc64c200fa589", + "3cf07ebe305312b12f7f1a4d5f8f8377", + "19927da9de0f40348cad9c0fc61642ac", + "575b788dad6c5b9c5cee636f9ac1094a", + "1c0ae40171d52593df8a6963d3828116" + ], + 'description': + 'Development data for IWSLT 2017.', + 'en-fr': [ + 'en-fr/IWSLT17.TED.tst2013.en-fr.en.xml', + 'fr-en/IWSLT17.TED.tst2013.fr-en.fr.xml' + ], + 'fr-en': [ + 'fr-en/IWSLT17.TED.tst2013.fr-en.fr.xml', + 'en-fr/IWSLT17.TED.tst2013.en-fr.en.xml' + ], + 'en-de': [ + 'en-de/IWSLT17.TED.tst2013.en-de.en.xml', + 'de-en/IWSLT17.TED.tst2013.de-en.de.xml' + ], + 'de-en': [ + 'de-en/IWSLT17.TED.tst2013.de-en.de.xml', + 'en-de/IWSLT17.TED.tst2013.en-de.en.xml' + ], + 'en-zh': [ + 'en-zh/IWSLT17.TED.tst2013.en-zh.en.xml', + 'zh-en/IWSLT17.TED.tst2013.zh-en.zh.xml' + ], + 'zh-en': [ + 'zh-en/IWSLT17.TED.tst2013.zh-en.zh.xml', + 'en-zh/IWSLT17.TED.tst2013.en-zh.en.xml' + ], + }, + 'iwslt17/tst2012': { + 'data': [ + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/de/en-de.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/de/en/de-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/fr/en-fr.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/zh/en-zh.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/zh/en/zh-en.tgz' + ], + "md5": [ + "d8a32cfc002a4f12b17429cfa78050e6", + "ca2b94d694150d4d6c5dc64c200fa589", + "3cf07ebe305312b12f7f1a4d5f8f8377", + "19927da9de0f40348cad9c0fc61642ac", + "575b788dad6c5b9c5cee636f9ac1094a", + "1c0ae40171d52593df8a6963d3828116" + ], + 'description': + 'Development data for IWSLT 2017.', + 'en-fr': [ + 'en-fr/IWSLT17.TED.tst2012.en-fr.en.xml', + 'fr-en/IWSLT17.TED.tst2012.fr-en.fr.xml' + ], + 'fr-en': [ + 'fr-en/IWSLT17.TED.tst2012.fr-en.fr.xml', + 'en-fr/IWSLT17.TED.tst2012.en-fr.en.xml' + ], + 'en-de': [ + 'en-de/IWSLT17.TED.tst2012.en-de.en.xml', + 'de-en/IWSLT17.TED.tst2012.de-en.de.xml' + ], + 'de-en': [ + 'de-en/IWSLT17.TED.tst2012.de-en.de.xml', + 'en-de/IWSLT17.TED.tst2012.en-de.en.xml' + ], + 'en-zh': [ + 'en-zh/IWSLT17.TED.tst2012.en-zh.en.xml', + 'zh-en/IWSLT17.TED.tst2012.zh-en.zh.xml' + ], + 'zh-en': [ + 'zh-en/IWSLT17.TED.tst2012.zh-en.zh.xml', + 'en-zh/IWSLT17.TED.tst2012.en-zh.en.xml' + ], + }, + 'iwslt17/tst2011': { + 'data': [ + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/de/en-de.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/de/en/de-en.tgz', + 
'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/fr/en-fr.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/zh/en-zh.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/zh/en/zh-en.tgz' + ], + "md5": [ + "d8a32cfc002a4f12b17429cfa78050e6", + "ca2b94d694150d4d6c5dc64c200fa589", + "3cf07ebe305312b12f7f1a4d5f8f8377", + "19927da9de0f40348cad9c0fc61642ac", + "575b788dad6c5b9c5cee636f9ac1094a", + "1c0ae40171d52593df8a6963d3828116" + ], + 'description': + 'Development data for IWSLT 2017.', + 'en-fr': [ + 'en-fr/IWSLT17.TED.tst2011.en-fr.en.xml', + 'fr-en/IWSLT17.TED.tst2011.fr-en.fr.xml' + ], + 'fr-en': [ + 'fr-en/IWSLT17.TED.tst2011.fr-en.fr.xml', + 'en-fr/IWSLT17.TED.tst2011.en-fr.en.xml' + ], + 'en-de': [ + 'en-de/IWSLT17.TED.tst2011.en-de.en.xml', + 'de-en/IWSLT17.TED.tst2011.de-en.de.xml' + ], + 'de-en': [ + 'de-en/IWSLT17.TED.tst2011.de-en.de.xml', + 'en-de/IWSLT17.TED.tst2011.en-de.en.xml' + ], + 'en-zh': [ + 'en-zh/IWSLT17.TED.tst2011.en-zh.en.xml', + 'zh-en/IWSLT17.TED.tst2011.zh-en.zh.xml' + ], + 'zh-en': [ + 'zh-en/IWSLT17.TED.tst2011.zh-en.zh.xml', + 'en-zh/IWSLT17.TED.tst2011.en-zh.en.xml' + ], + }, + 'iwslt17/tst2010': { + 'data': [ + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/de/en-de.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/de/en/de-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/fr/en-fr.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/zh/en-zh.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/zh/en/zh-en.tgz' + ], + "md5": [ + "d8a32cfc002a4f12b17429cfa78050e6", + "ca2b94d694150d4d6c5dc64c200fa589", + "3cf07ebe305312b12f7f1a4d5f8f8377", + "19927da9de0f40348cad9c0fc61642ac", + "575b788dad6c5b9c5cee636f9ac1094a", + "1c0ae40171d52593df8a6963d3828116" + ], + 'description': + 'Development data for IWSLT 2017.', + 'en-fr': [ + 'en-fr/IWSLT17.TED.tst2010.en-fr.en.xml', + 'fr-en/IWSLT17.TED.tst2010.fr-en.fr.xml' + ], + 'fr-en': [ + 'fr-en/IWSLT17.TED.tst2010.fr-en.fr.xml', + 'en-fr/IWSLT17.TED.tst2010.en-fr.en.xml' + ], + 'en-de': [ + 'en-de/IWSLT17.TED.tst2010.en-de.en.xml', + 'de-en/IWSLT17.TED.tst2010.de-en.de.xml' + ], + 'de-en': [ + 'de-en/IWSLT17.TED.tst2010.de-en.de.xml', + 'en-de/IWSLT17.TED.tst2010.en-de.en.xml' + ], + 'en-zh': [ + 'en-zh/IWSLT17.TED.tst2010.en-zh.en.xml', + 'zh-en/IWSLT17.TED.tst2010.zh-en.zh.xml' + ], + 'zh-en': [ + 'zh-en/IWSLT17.TED.tst2010.zh-en.zh.xml', + 'en-zh/IWSLT17.TED.tst2010.en-zh.en.xml' + ], + }, + 'iwslt17/dev2010': { + 'data': [ + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/de/en-de.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/de/en/de-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/fr/en-fr.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/en/zh/en-zh.tgz', + 'https://wit3.fbk.eu/archive/2017-01-trnted/texts/zh/en/zh-en.tgz' + ], + "md5": [ + "d8a32cfc002a4f12b17429cfa78050e6", + "ca2b94d694150d4d6c5dc64c200fa589", + "3cf07ebe305312b12f7f1a4d5f8f8377", + "19927da9de0f40348cad9c0fc61642ac", + "575b788dad6c5b9c5cee636f9ac1094a", + "1c0ae40171d52593df8a6963d3828116" + ], + 'description': + 'Development data for IWSLT 2017.', + 'en-fr': [ + 'en-fr/IWSLT17.TED.dev2010.en-fr.en.xml', + 'fr-en/IWSLT17.TED.dev2010.fr-en.fr.xml' + ], + 'fr-en': [ + 'fr-en/IWSLT17.TED.dev2010.fr-en.fr.xml', + 
'en-fr/IWSLT17.TED.dev2010.en-fr.en.xml' + ], + 'en-de': [ + 'en-de/IWSLT17.TED.dev2010.en-de.en.xml', + 'de-en/IWSLT17.TED.dev2010.de-en.de.xml' + ], + 'de-en': [ + 'de-en/IWSLT17.TED.dev2010.de-en.de.xml', + 'en-de/IWSLT17.TED.dev2010.en-de.en.xml' + ], + 'en-zh': [ + 'en-zh/IWSLT17.TED.dev2010.en-zh.en.xml', + 'zh-en/IWSLT17.TED.dev2010.zh-en.zh.xml' + ], + 'zh-en': [ + 'zh-en/IWSLT17.TED.dev2010.zh-en.zh.xml', + 'en-zh/IWSLT17.TED.dev2010.en-zh.en.xml' + ], + }, +} + + +def tokenize_13a(line): + """ + Tokenizes an input line using a relatively minimal tokenization that is however equivalent to mteval-v13a, used by WMT. + + :param line: a segment to tokenize + :return: the tokenized line + """ + + norm = line + + # language-independent part: + norm = norm.replace('<skipped>', '') + norm = norm.replace('-\n', '') + norm = norm.replace('\n', ' ') + norm = norm.replace('&quot;', '"') + norm = norm.replace('&amp;', '&') + norm = norm.replace('&lt;', '<') + norm = norm.replace('&gt;', '>') + + # language-dependent part (assuming Western languages): + norm = " {} ".format(norm) + norm = re.sub(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])', ' \\1 ', norm) + norm = re.sub(r'([^0-9])([\.,])', '\\1 \\2 ', + norm) # tokenize period and comma unless preceded by a digit + norm = re.sub(r'([\.,])([^0-9])', ' \\1 \\2', + norm) # tokenize period and comma unless followed by a digit + norm = re.sub(r'([0-9])(-)', '\\1 \\2 ', + norm) # tokenize dash when preceded by a digit + norm = re.sub(r'\s+', ' ', norm) # one space only between words + norm = re.sub(r'^\s+', '', norm) # no leading space + norm = re.sub(r'\s+$', '', norm) # no trailing space + + return norm + + +class UnicodeRegex: + """Ad-hoc hack to recognize all punctuation and symbols. + + without depending on https://pypi.python.org/pypi/regex/.""" + + def _property_chars(prefix): + return ''.join( + chr(x) for x in range(sys.maxunicode) + if unicodedata.category(chr(x)).startswith(prefix)) + + punctuation = _property_chars('P') + nondigit_punct_re = re.compile(r'([^\d])([' + punctuation + r'])') + punct_nondigit_re = re.compile(r'([' + punctuation + r'])([^\d])') + symbol_re = re.compile('([' + _property_chars('S') + '])') + + +def tokenize_v14_international(string): + r"""Tokenize a string following the official BLEU implementation. + + See https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl#L954-L983 + In our case, the input string is expected to be just one line + and no HTML entities de-escaping is needed. + So we just tokenize on punctuation and symbols, + except when a punctuation is preceded and followed by a digit + (e.g. a comma/dot as a thousand/decimal separator). + + Note that a number (e.g., a year) followed by a dot at the end of sentence is NOT tokenized, + i.e. the dot stays with the number because `s/(\p{P})(\P{N})/ $1 $2/g` + does not match this case (unless we add a space after each sentence). + However, this error is already in the original mteval-v14.pl + and we want to be consistent with it. + The error is not present in the non-international version, + which uses `$norm_text = " $norm_text "` (or `norm = " {} ".format(norm)` in Python).
+ + :param string: the input string + :return: a list of tokens + """ + string = UnicodeRegex.nondigit_punct_re.sub(r'\1 \2 ', string) + string = UnicodeRegex.punct_nondigit_re.sub(r' \1 \2', string) + string = UnicodeRegex.symbol_re.sub(r' \1 ', string) + return string.strip() + + +def tokenize_zh(sentence): + """MIT License + Copyright (c) 2017 - Shujian Huang + + Permission is hereby granted, free of charge, to any person obtaining a copy + of this software and associated documentation files (the "Software"), to deal + in the Software without restriction, including without limitation the rights + to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + copies of the Software, and to permit persons to whom the Software is + furnished to do so, subject to the following conditions: + + The above copyright notice and this permission notice shall be included in all + copies or substantial portions of the Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE + AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + The tokenization of Chinese text in this script contains two steps: separate each Chinese + characters (by utf-8 encoding); tokenize the non Chinese part (following the mteval script). + Author: Shujian Huang huangsj@nju.edu.cn + + :param sentence: input sentence + :return: tokenized sentence + """ + + def is_chinese_char(uchar): + """ + :param uchar: input char in unicode + :return: whether the input char is a Chinese character. 
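+        For example, is_chinese_char('中') returns True and
+        is_chinese_char('a') returns False.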
+ """ + if uchar >= u'\u3400' and uchar <= u'\u4db5': # CJK Unified Ideographs Extension A, release 3.0 + return True + elif uchar >= u'\u4e00' and uchar <= u'\u9fa5': # CJK Unified Ideographs, release 1.1 + return True + elif uchar >= u'\u9fa6' and uchar <= u'\u9fbb': # CJK Unified Ideographs, release 4.1 + return True + elif uchar >= u'\uf900' and uchar <= u'\ufa2d': # CJK Compatibility Ideographs, release 1.1 + return True + elif uchar >= u'\ufa30' and uchar <= u'\ufa6a': # CJK Compatibility Ideographs, release 3.2 + return True + elif uchar >= u'\ufa70' and uchar <= u'\ufad9': # CJK Compatibility Ideographs, release 4.1 + return True + elif uchar >= u'\u20000' and uchar <= u'\u2a6d6': # CJK Unified Ideographs Extension B, release 3.1 + return True + elif uchar >= u'\u2f800' and uchar <= u'\u2fa1d': # CJK Compatibility Supplement, release 3.1 + return True + elif uchar >= u'\uff00' and uchar <= u'\uffef': # Full width ASCII, full width of English punctuation, half width Katakana, half wide half width kana, Korean alphabet + return True + elif uchar >= u'\u2e80' and uchar <= u'\u2eff': # CJK Radicals Supplement + return True + elif uchar >= u'\u3000' and uchar <= u'\u303f': # CJK punctuation mark + return True + elif uchar >= u'\u31c0' and uchar <= u'\u31ef': # CJK stroke + return True + elif uchar >= u'\u2f00' and uchar <= u'\u2fdf': # Kangxi Radicals + return True + elif uchar >= u'\u2ff0' and uchar <= u'\u2fff': # Chinese character structure + return True + elif uchar >= u'\u3100' and uchar <= u'\u312f': # Phonetic symbols + return True + elif uchar >= u'\u31a0' and uchar <= u'\u31bf': # Phonetic symbols (Taiwanese and Hakka expansion) + return True + elif uchar >= u'\ufe10' and uchar <= u'\ufe1f': + return True + elif uchar >= u'\ufe30' and uchar <= u'\ufe4f': + return True + elif uchar >= u'\u2600' and uchar <= u'\u26ff': + return True + elif uchar >= u'\u2700' and uchar <= u'\u27bf': + return True + elif uchar >= u'\u3200' and uchar <= u'\u32ff': + return True + elif uchar >= u'\u3300' and uchar <= u'\u33ff': + return True + + return False + + sentence = sentence.strip() + sentence_in_chars = "" + for char in sentence: + if is_chinese_char(char): + sentence_in_chars += " " + sentence_in_chars += char + sentence_in_chars += " " + else: + sentence_in_chars += char + sentence = sentence_in_chars + + # TODO: the code above could probably be replaced with the following line: + # import regex + # sentence = regex.sub(r'(\p{Han})', r' \1 ', sentence) + + # tokenize punctuation + sentence = re.sub(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])', r' \1 ', sentence) + + # tokenize period and comma unless preceded by a digit + sentence = re.sub(r'([^0-9])([\.,])', r'\1 \2 ', sentence) + + # tokenize period and comma unless followed by a digit + sentence = re.sub(r'([\.,])([^0-9])', r' \1 \2', sentence) + + # tokenize dash when preceded by a digit + sentence = re.sub(r'([0-9])(-)', r'\1 \2 ', sentence) + + # one space only between words + sentence = re.sub(r'\s+', r' ', sentence) + + # no leading or trailing spaces + sentence = sentence.strip() + + return sentence + + +from .fairseq_tokenizer import tokenize_en + + +TOKENIZERS = { + '13a': tokenize_13a, + 'intl': tokenize_v14_international, + 'zh': tokenize_zh, + 'fairseq': tokenize_en, + 'none': lambda x: x, +} +DEFAULT_TOKENIZER = '13a' + + +def smart_open(file, mode='rt', encoding='utf-8'): + """Convenience function for reading compressed or plain text files. + :param file: The file to read. + :param encoding: The file encoding. 
+ """ + if file.endswith('.gz'): + return gzip.open(file, mode=mode, encoding=encoding, newline="\n") + return open(file, mode=mode, encoding=encoding, newline="\n") + + +def my_log(num): + """ + Floors the log function + + :param num: the number + :return: log(num) floored to a very low number + """ + + if num == 0.0: + return -9999999999 + return math.log(num) + + +def bleu_signature(args, numrefs): + """ + Builds a signature that uniquely identifies the scoring parameters used. + :param args: the arguments passed into the script + :return: the signature + """ + + # Abbreviations for the signature + abbr = { + 'test': 't', + 'lang': 'l', + 'smooth': 's', + 'case': 'c', + 'tok': 'tok', + 'numrefs': '#', + 'version': 'v' + } + + signature = { + 'tok': args.tokenize, + 'version': VERSION, + 'smooth': args.smooth, + 'numrefs': numrefs, + 'case': 'lc' if args.lc else 'mixed' + } + + if args.test_set is not None: + signature['test'] = args.test_set + + if args.langpair is not None: + signature['lang'] = args.langpair + + sigstr = '+'.join([ + '{}.{}'.format(abbr[x] if args.short else x, signature[x]) + for x in sorted(signature.keys()) + ]) + + return sigstr + + +def chrf_signature(args, numrefs): + """ + Builds a signature that uniquely identifies the scoring parameters used. + :param args: the arguments passed into the script + :return: the chrF signature + """ + + # Abbreviations for the signature + abbr = { + 'test': 't', + 'lang': 'l', + 'numchars': 'n', + 'space': 's', + 'case': 'c', + 'numrefs': '#', + 'version': 'v' + } + + signature = { + 'tok': args.tokenize, + 'version': VERSION, + 'space': args.chrf_whitespace, + 'numchars': args.chrf_order, + 'numrefs': numrefs, + 'case': 'lc' if args.lc else 'mixed' + } + + if args.test_set is not None: + signature['test'] = args.test_set + + if args.langpair is not None: + signature['lang'] = args.langpair + + sigstr = '+'.join([ + '{}.{}'.format(abbr[x] if args.short else x, signature[x]) + for x in sorted(signature.keys()) + ]) + + return sigstr + + +def extract_ngrams(line, min_order=1, max_order=NGRAM_ORDER) -> Counter: + """Extracts all the ngrams (1 <= n <= NGRAM_ORDER) from a sequence of tokens. + + :param line: a segment containing a sequence of words + :param max_order: collect n-grams from 1<=n<=max + :return: a dictionary containing ngrams and counts + """ + + ngrams = Counter() + tokens = line.split() + for n in range(min_order, max_order + 1): + for i in range(0, len(tokens) - n + 1): + ngram = ' '.join(tokens[i:i + n]) + ngrams[ngram] += 1 + + return ngrams + + +def extract_char_ngrams(s: str, n: int) -> Counter: + """ + Yields counts of character n-grams from string s of order n. + """ + return Counter([s[i:i + n] for i in range(len(s) - n + 1)]) + + +def ref_stats(output, refs): + ngrams = Counter() + closest_diff = None + closest_len = None + for ref in refs: + tokens = ref.split() + reflen = len(tokens) + diff = abs(len(output.split()) - reflen) + if closest_diff is None or diff < closest_diff: + closest_diff = diff + closest_len = reflen + elif diff == closest_diff: + if reflen < closest_len: + closest_len = reflen + + ngrams_ref = extract_ngrams(ref) + for ngram in ngrams_ref.keys(): + ngrams[ngram] = max(ngrams[ngram], ngrams_ref[ngram]) + + return ngrams, closest_diff, closest_len + + +def _clean(s): + """ + Removes trailing and leading spaces and collapses multiple consecutive internal spaces to a single one. + + :param s: The string. + :return: A cleaned-up string. 
+ """ + return re.sub(r'\s+', ' ', s.strip()) + + +def process_to_text(rawfile, txtfile, field: int = None): + """Processes raw files to plain text files. + :param rawfile: the input file (possibly SGML) + :param txtfile: the plaintext file + :param field: For TSV files, which field to extract. + """ + + if not os.path.exists(txtfile) or os.path.getsize(txtfile) == 0: + logging.info("Processing %s to %s", rawfile, txtfile) + if rawfile.endswith('.sgm') or rawfile.endswith('.sgml'): + with smart_open(rawfile) as fin, smart_open(txtfile, 'wt') as fout: + for line in fin: + if line.startswith('(.*).*?', '\\1', line)), + file=fout) + elif rawfile.endswith('.xml'): # IWSLT + with smart_open(rawfile) as fin, smart_open(txtfile, 'wt') as fout: + for line in fin: + if line.startswith('(.*).*?', '\\1', line)), + file=fout) + elif rawfile.endswith('.txt'): # wmt17/ms + with smart_open(rawfile) as fin, smart_open(txtfile, 'wt') as fout: + for line in fin: + print(line.rstrip(), file=fout) + elif rawfile.endswith('.tsv'): # MTNT + with smart_open(rawfile) as fin, smart_open(txtfile, 'wt') as fout: + for line in fin: + print(line.rstrip().split('\t')[field], file=fout) + + +def print_test_set(test_set, langpair, side): + """Prints to STDOUT the specified side of the specified test set + :param test_set: the test set to print + :param langpair: the language pair + :param side: 'src' for source, 'ref' for reference + """ + + files = download_test_set(test_set, langpair) + if side == 'src': + files = [files[0]] + elif side == 'ref': + files.pop(0) + + streams = [smart_open(file) for file in files] + for lines in zip(*streams): + print('\t'.join(map(lambda x: x.rstrip(), lines))) + + +def download_test_set(test_set, langpair=None): + """Downloads the specified test to the system location specified by the SACREBLEU environment variable. + :param test_set: the test set to download + :param langpair: the language pair (needed for some datasets) + :return: the set of processed files + """ + + outdir = os.path.join(SACREBLEU_DIR, test_set) + if not os.path.exists(outdir): + logging.info('Creating %s', outdir) + os.makedirs(outdir) + + expected_checksums = DATASETS[test_set].get('md5', [None] * + len(DATASETS[test_set])) + for dataset, expected_md5 in zip(DATASETS[test_set]['data'], + expected_checksums): + tarball = os.path.join(outdir, os.path.basename(dataset)) + rawdir = os.path.join(outdir, 'raw') + if not os.path.exists(tarball) or os.path.getsize(tarball) == 0: + logging.info("Downloading %s to %s", dataset, tarball) + try: + with urllib.request.urlopen(dataset) as f, open(tarball, + 'wb') as out: + out.write(f.read()) + except ssl.SSLError: + logging.warning( + 'An SSL error was encountered in downloading the files. If you\'re on a Mac, ' + 'you may need to run the "Install Certificates.command" file located in the ' + '"Python 3" folder, often found under /Applications') + sys.exit(1) + + # Check md5sum + if expected_md5 is not None: + md5 = hashlib.md5() + with open(tarball, 'rb') as infile: + for line in infile: + md5.update(line) + if md5.hexdigest() != expected_md5: + logging.error( + 'Fatal: MD5 sum of downloaded file was incorrect (got {}, expected {}).' + .format(md5.hexdigest(), expected_md5)) + logging.error( + 'Please manually delete "{}" and rerun the command.'. + format(tarball)) + logging.error( + 'If the problem persists, the tarball may have changed, in which case, please contact the SacreBLEU maintainer.' 
+ ) + sys.exit(1) + else: + logging.info('Checksum passed: {}'.format(md5.hexdigest())) + + # Extract the tarball + logging.info('Extracting %s', tarball) + if tarball.endswith('.tar.gz') or tarball.endswith('.tgz'): + import tarfile + tar = tarfile.open(tarball) + tar.extractall(path=rawdir) + elif tarball.endswith('.zip'): + import zipfile + zipfile = zipfile.ZipFile(tarball, 'r') + zipfile.extractall(path=rawdir) + zipfile.close() + + found = [] + + # Process the files into plain text + languages = DATASETS[test_set].keys() if langpair is None else [langpair] + for pair in languages: + if '-' not in pair: + continue + src, tgt = pair.split('-') + rawfile = DATASETS[test_set][pair][0] + field = None # used for TSV files + if rawfile.endswith('.tsv'): + field, rawfile = rawfile.split(':', maxsplit=1) + field = int(field) + rawpath = os.path.join(rawdir, rawfile) + outpath = os.path.join(outdir, '{}.{}'.format(pair, src)) + process_to_text(rawpath, outpath, field=field) + found.append(outpath) + + refs = DATASETS[test_set][pair][1:] + for i, ref in enumerate(refs): + field = None + if ref.endswith('.tsv'): + field, ref = ref.split(':', maxsplit=1) + field = int(field) + rawpath = os.path.join(rawdir, ref) + if len(refs) >= 2: + outpath = os.path.join(outdir, '{}.{}.{}'.format(pair, tgt, i)) + else: + outpath = os.path.join(outdir, '{}.{}'.format(pair, tgt)) + process_to_text(rawpath, outpath, field=field) + found.append(outpath) + + return found + + +class BLEU( + namedtuple('BaseBLEU', + 'score, counts, totals, precisions, bp, sys_len, ref_len')): + def format(self, width=2): + precisions = "/".join(["{:.1f}".format(p) for p in self.precisions]) + return f'BLEU = {self.score:.{width}f} {precisions} (BP = {self.bp:.3f}' \ + f' ratio = {(self.sys_len / self.ref_len):.3f} hyp_len = {self.sys_len:d}' \ + f' ref_len = {self.ref_len:d})' + + def __str__(self): + return self.format() + + +def compute_bleu(correct: List[int], + total: List[int], + sys_len: int, + ref_len: int, + smooth_method='none', + smooth_value=SMOOTH_VALUE_DEFAULT, + use_effective_order=False) -> BLEU: + """Computes BLEU score from its sufficient statistics. Adds smoothing. + + Smoothing methods (citing "A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU", + Boxing Chen and Colin Cherry, WMT 2014: http://aclweb.org/anthology/W14-3346) + + - exp: NIST smoothing method (Method 3) + - floor: Method 1 + - add-k: Method 2 (generalizing Lin and Och, 2004) + - none: do nothing. + + :param correct: List of counts of correct ngrams, 1 <= n <= NGRAM_ORDER + :param total: List of counts of total ngrams, 1 <= n <= NGRAM_ORDER + :param sys_len: The cumulative system length + :param ref_len: The cumulative reference length + :param smooth: The smoothing method to use + :param smooth_value: The smoothing value added, if smooth method 'floor' is used + :param use_effective_order: Use effective order. + :return: A BLEU object with the score (100-based) and other statistics. + """ + + precisions = [0 for x in range(NGRAM_ORDER)] + + smooth_mteval = 1. + effective_order = NGRAM_ORDER + for n in range(NGRAM_ORDER): + if smooth_method == 'add-k' and n > 1: + correct[n] += smooth_value + total[n] += smooth_value + if total[n] == 0: + break + + if use_effective_order: + effective_order = n + 1 + + if correct[n] == 0: + if smooth_method == 'exp': + smooth_mteval *= 2 + precisions[n] = 100. / (smooth_mteval * total[n]) + elif smooth_method == 'floor': + precisions[n] = 100. 
* smooth_value / total[n] + else: + precisions[n] = 100. * correct[n] / total[n] + + # If the system guesses no i-grams, 1 <= i <= NGRAM_ORDER, the BLEU score is 0 (technically undefined). + # This is a problem for sentence-level BLEU or a corpus of short sentences, where systems will get no credit + # if sentence lengths fall under the NGRAM_ORDER threshold. This fix scales NGRAM_ORDER to the observed + # maximum order. It is only available through the API and off by default + + brevity_penalty = 1.0 + if sys_len < ref_len: + brevity_penalty = math.exp(1 - + ref_len / sys_len) if sys_len > 0 else 0.0 + + bleu = brevity_penalty * math.exp( + sum(map(my_log, precisions[:effective_order])) / effective_order) + + return BLEU._make( + [bleu, correct, total, precisions, brevity_penalty, sys_len, ref_len]) + + +def sentence_bleu(hypothesis: str, + reference: str, + smooth_method: str = 'floor', + smooth_value: float = SMOOTH_VALUE_DEFAULT, + use_effective_order: bool = True): + """ + Computes BLEU on a single sentence pair. + + Disclaimer: computing BLEU on the sentence level is not its intended use, + BLEU is a corpus-level metric. + + :param hypothesis: Hypothesis string. + :param reference: Reference string. + :param smooth_value: For 'floor' smoothing, the floor value to use. + :param use_effective_order: Account for references that are shorter than the largest n-gram. + :return: Returns a single BLEU score as a float. + """ + bleu = corpus_bleu(hypothesis, + reference, + smooth_method=smooth_method, + smooth_value=smooth_value, + use_effective_order=use_effective_order) + return bleu.score + + +def corpus_bleu(sys_stream: Union[str, Iterable[str]], + ref_streams: Union[str, List[Iterable[str]]], + smooth_method='exp', + smooth_value=SMOOTH_VALUE_DEFAULT, + force=False, + lowercase=False, + tokenize=DEFAULT_TOKENIZER, + use_effective_order=False) -> BLEU: + """Produces BLEU scores along with its sufficient statistics from a source against one or more references. + + :param sys_stream: The system stream (a sequence of segments) + :param ref_streams: A list of one or more reference streams (each a sequence of segments) + :param smooth: The smoothing method to use + :param smooth_value: For 'floor' smoothing, the floor to use + :param force: Ignore data that looks already tokenized + :param lowercase: Lowercase the data + :param tokenize: The tokenizer to use + :return: a BLEU object containing everything you'd want + """ + + # Add some robustness to the input arguments + if isinstance(sys_stream, str): + sys_stream = [sys_stream] + if isinstance(ref_streams, str): + ref_streams = [[ref_streams]] + + sys_len = 0 + ref_len = 0 + + correct = [0 for n in range(NGRAM_ORDER)] + total = [0 for n in range(NGRAM_ORDER)] + + # look for already-tokenized sentences + tokenized_count = 0 + + fhs = [sys_stream] + ref_streams + for lines in zip_longest(*fhs): + if None in lines: + raise EOFError( + "Source and reference streams have different lengths!") + + if lowercase: + lines = [x.lower() for x in lines] + + if not (force + or tokenize == 'none') and lines[0].rstrip().endswith(' .'): + tokenized_count += 1 + + if tokenized_count == 100: + logging.warning( + 'That\'s 100 lines that end in a tokenized period (\'.\')') + logging.warning( + 'It looks like you forgot to detokenize your test data, which may hurt your score.' + ) + logging.warning( + 'If you insist your data is detokenized, or don\'t care, you can suppress this message with \'--force\'.' 
+ ) + + output, *refs = [TOKENIZERS[tokenize](x.rstrip()) for x in lines] + + ref_ngrams, closest_diff, closest_len = ref_stats(output, refs) + + sys_len += len(output.split()) + ref_len += closest_len + + sys_ngrams = extract_ngrams(output) + for ngram in sys_ngrams.keys(): + n = len(ngram.split()) + correct[n - 1] += min(sys_ngrams[ngram], ref_ngrams.get(ngram, 0)) + total[n - 1] += sys_ngrams[ngram] + + return compute_bleu(correct, + total, + sys_len, + ref_len, + smooth_method=smooth_method, + smooth_value=smooth_value, + use_effective_order=use_effective_order) + + +def raw_corpus_bleu(sys_stream, ref_streams, + smooth_value=SMOOTH_VALUE_DEFAULT) -> BLEU: + """Convenience function that wraps corpus_bleu(). + This is convenient if you're using sacrebleu as a library, say for scoring on dev. + It uses no tokenization and 'floor' smoothing, with the floor default to 0 (no smoothing). + + :param sys_stream: the system stream (a sequence of segments) + :param ref_streams: a list of one or more reference streams (each a sequence of segments) + """ + return corpus_bleu(sys_stream, + ref_streams, + smooth_method='floor', + smooth_value=smooth_value, + force=True, + tokenize='none', + use_effective_order=True) + + +def delete_whitespace(text: str) -> str: + """ + Removes whitespaces from text. + """ + return re.sub(r'\s+', '', text).strip() + + +def get_sentence_statistics(hypothesis: str, + reference: str, + order: int = CHRF_ORDER, + remove_whitespace: bool = True) -> List[float]: + hypothesis = delete_whitespace( + hypothesis) if remove_whitespace else hypothesis + reference = delete_whitespace( + reference) if remove_whitespace else reference + statistics = [0] * (order * 3) + for i in range(order): + n = i + 1 + hypothesis_ngrams = extract_char_ngrams(hypothesis, n) + reference_ngrams = extract_char_ngrams(reference, n) + common_ngrams = hypothesis_ngrams & reference_ngrams + statistics[3 * i + 0] = sum(hypothesis_ngrams.values()) + statistics[3 * i + 1] = sum(reference_ngrams.values()) + statistics[3 * i + 2] = sum(common_ngrams.values()) + return statistics + + +def get_corpus_statistics(hypotheses: Iterable[str], + references: Iterable[str], + order: int = CHRF_ORDER, + remove_whitespace: bool = True) -> List[float]: + corpus_statistics = [0] * (order * 3) + for hypothesis, reference in zip(hypotheses, references): + statistics = get_sentence_statistics( + hypothesis, + reference, + order=order, + remove_whitespace=remove_whitespace) + for i in range(len(statistics)): + corpus_statistics[i] += statistics[i] + return corpus_statistics + + +def _avg_precision_and_recall(statistics: List[float], + order: int) -> Tuple[float, float]: + avg_precision = 0.0 + avg_recall = 0.0 + effective_order = 0 + for i in range(order): + hypotheses_ngrams = statistics[3 * i + 0] + references_ngrams = statistics[3 * i + 1] + common_ngrams = statistics[3 * i + 2] + if hypotheses_ngrams > 0 and references_ngrams > 0: + avg_precision += common_ngrams / hypotheses_ngrams + avg_recall += common_ngrams / references_ngrams + effective_order += 1 + if effective_order == 0: + return 0.0, 0.0 + avg_precision /= effective_order + avg_recall /= effective_order + return avg_precision, avg_recall + + +def _chrf(avg_precision, avg_recall, beta: int = CHRF_BETA) -> float: + if avg_precision + avg_recall == 0: + return 0.0 + beta_square = beta**2 + score = (1 + beta_square) * (avg_precision * avg_recall) / ( + (beta_square * avg_precision) + avg_recall) + return score + + +def corpus_chrf(hypotheses: Iterable[str], + 
references: Iterable[str], + order: int = CHRF_ORDER, + beta: float = CHRF_BETA, + remove_whitespace: bool = True) -> float: + """ + Computes Chrf on a corpus. + + :param hypotheses: Stream of hypotheses. + :param references: Stream of references + :param order: Maximum n-gram order. + :param remove_whitespace: Whether to delete all whitespace from hypothesis and reference strings. + :param beta: Defines importance of recall w.r.t precision. If beta=1, same importance. + :return: Chrf score. + """ + corpus_statistics = get_corpus_statistics( + hypotheses, + references, + order=order, + remove_whitespace=remove_whitespace) + avg_precision, avg_recall = _avg_precision_and_recall( + corpus_statistics, order) + return _chrf(avg_precision, avg_recall, beta=beta) + + +def sentence_chrf(hypothesis: str, + reference: str, + order: int = CHRF_ORDER, + beta: float = CHRF_BETA, + remove_whitespace: bool = True) -> float: + """ + Computes ChrF on a single sentence pair. + + :param hypothesis: Hypothesis string. + :param reference: Reference string. + :param order: Maximum n-gram order. + :param remove_whitespace: Whether to delete whitespaces from hypothesis and reference strings. + :param beta: Defines importance of recall w.r.t precision. If beta=1, same importance. + :return: Chrf score. + """ + statistics = get_sentence_statistics(hypothesis, + reference, + order=order, + remove_whitespace=remove_whitespace) + avg_precision, avg_recall = _avg_precision_and_recall(statistics, order) + return _chrf(avg_precision, avg_recall, beta=beta) + + +def main(): + arg_parser = argparse.ArgumentParser( + description= + 'sacreBLEU: Hassle-free computation of shareable BLEU scores.' + 'Quick usage: score your detokenized output against WMT\'14 EN-DE:' + ' cat output.detok.de | ./sacreBLEU -t wmt14 -l en-de') + arg_parser.add_argument('--test-set', + '-t', + type=str, + default=None, + choices=DATASETS.keys(), + help='the test set to use') + arg_parser.add_argument( + '-lc', + action='store_true', + default=False, + help='use case-insensitive BLEU (default: actual case)') + arg_parser.add_argument( + '--smooth', + '-s', + choices=['exp', 'floor', 'add-n', 'none'], + default='exp', + help= + 'smoothing method: exponential decay (default), floor (increment zero counts), add-k (increment num/denom by k for n>1), or none' + ) + arg_parser.add_argument( + '--smooth-value', + '-sv', + type=float, + default=SMOOTH_VALUE_DEFAULT, + help= + 'The value to pass to the smoothing technique, when relevant. Default: %(default)s.' 
+ ) + arg_parser.add_argument('--tokenize', + '-tok', + choices=TOKENIZERS.keys(), + default=None, + help='tokenization method to use') + arg_parser.add_argument( + '--language-pair', + '-l', + dest='langpair', + default=None, + help='source-target language pair (2-char ISO639-1 codes)') + arg_parser.add_argument('--download', + type=str, + default=None, + help='download a test set and quit') + arg_parser.add_argument( + '--echo', + choices=['src', 'ref', 'both'], + type=str, + default=None, + help= + 'output the source (src), reference (ref), or both (both, pasted) to STDOUT and quit' + ) + arg_parser.add_argument('--input', + '-i', + type=str, + default='-', + help='Read input from a file instead of STDIN') + arg_parser.add_argument( + 'refs', + nargs='*', + default=[], + help= + 'optional list of references (for backwards-compatibility with older scripts)' + ) + arg_parser.add_argument('--metrics', + '-m', + choices=['bleu', 'chrf'], + nargs='+', + default=['bleu'], + help='metrics to compute (default: bleu)') + arg_parser.add_argument('--chrf-order', + type=int, + default=CHRF_ORDER, + help='chrf character order (default: %(default)s)') + arg_parser.add_argument('--chrf-beta', + type=int, + default=CHRF_BETA, + help='chrf BETA parameter (default: %(default)s)') + arg_parser.add_argument( + '--chrf-whitespace', + action='store_true', + default=False, + help='include whitespace in chrF calculation (default: %(default)s)') + arg_parser.add_argument( + '--short', + default=False, + action='store_true', + help='produce a shorter (less human readable) signature') + arg_parser.add_argument('--score-only', + '-b', + default=False, + action='store_true', + help='output only the BLEU score') + arg_parser.add_argument( + '--force', + default=False, + action='store_true', + help='insist that your tokenized input is actually detokenized') + arg_parser.add_argument('--quiet', + '-q', + default=False, + action='store_true', + help='suppress informative output') + arg_parser.add_argument( + '--encoding', + '-e', + type=str, + default='utf-8', + help='open text files with specified encoding (default: %(default)s)') + arg_parser.add_argument('--citation', + '--cite', + default=False, + action='store_true', + help='dump the bibtex citation and quit.') + arg_parser.add_argument('--width', + '-w', + type=int, + default=1, + help='floating point width (default: %(default)s)') + arg_parser.add_argument('-V', + '--version', + action='version', + version='%(prog)s {}'.format(VERSION)) + args = arg_parser.parse_args() + + # Explicitly set the encoding + sys.stdin = open(sys.stdin.fileno(), + mode='r', + encoding='utf-8', + buffering=True, + newline="\n") + sys.stdout = open(sys.stdout.fileno(), + mode='w', + encoding='utf-8', + buffering=True) + + if not args.quiet: + logging.basicConfig(level=logging.INFO, + format='sacreBLEU: %(message)s') + + if args.download: + download_test_set(args.download, args.langpair) + sys.exit(0) + + if args.citation: + if not args.test_set: + logging.error('I need a test set (-t).') + sys.exit(1) + elif 'citation' not in DATASETS[args.test_set]: + logging.error('No citation found for %s', args.test_set) + sys.exit(1) + + print(DATASETS[args.test_set]['citation']) + sys.exit(0) + + if args.test_set is not None and args.test_set not in DATASETS: + logging.error('The available test sets are: ') + for testset in sorted(DATASETS.keys(), reverse=True): + logging.error(' %s: %s', testset, + DATASETS[testset].get('description', '')) + sys.exit(1) + + if args.test_set and (args.langpair is 
None + or args.langpair not in DATASETS[args.test_set]): + if args.langpair is None: + logging.error('I need a language pair (-l).') + elif args.langpair not in DATASETS[args.test_set]: + logging.error('No such language pair "%s"', args.langpair) + logging.error( + 'Available language pairs for test set "%s": %s', args.test_set, + ', '.join( + filter(lambda x: '-' in x, DATASETS[args.test_set].keys()))) + sys.exit(1) + + if args.echo: + if args.langpair is None or args.test_set is None: + logging.warning( + "--echo requires a test set (--t) and a language pair (-l)") + sys.exit(1) + print_test_set(args.test_set, args.langpair, args.echo) + sys.exit(0) + + if args.test_set is None and len(args.refs) == 0: + logging.error( + 'I need either a predefined test set (-t) or a list of references') + logging.error('The available test sets are: ') + for testset in sorted(DATASETS.keys(), reverse=True): + logging.error(' %s: %s', testset, + DATASETS[testset].get('description', '')) + sys.exit(1) + elif args.test_set is not None and len(args.refs) > 0: + logging.error( + 'I need exactly one of (a) a predefined test set (-t) or (b) a list of references' + ) + sys.exit(1) + + if args.test_set is not None and args.tokenize == 'none': + logging.warning( + "You are turning off sacrebleu's internal tokenization ('--tokenize none'), presumably to supply\n" + "your own reference tokenization. Published numbers will not be comparable with other papers.\n" + ) + + # Internal tokenizer settings. Set to 'zh' for Chinese DEFAULT_TOKENIZER ( + if args.tokenize is None: + # set default + if args.langpair is not None and args.langpair.split('-')[1] == 'zh': + args.tokenize = 'zh' + else: + args.tokenize = DEFAULT_TOKENIZER + + if args.langpair is not None and args.langpair.split('-')[ + 1] == 'zh' and 'bleu' in args.metrics and args.tokenize != 'zh': + logging.warning( + 'You should also pass "--tok zh" when scoring Chinese...') + + if args.test_set: + _, *refs = download_test_set(args.test_set, args.langpair) + if len(refs) == 0: + print('No references found for test set {}/{}.'.format( + args.test_set, args.langpair)) + sys.exit(1) + else: + refs = args.refs + + inputfh = io.TextIOWrapper( + sys.stdin.buffer, + encoding=args.encoding) if args.input == '-' else smart_open( + args.input, encoding=args.encoding) + system = inputfh.readlines() + + # Read references + refs = [smart_open(x, encoding=args.encoding).readlines() for x in refs] + + try: + if 'bleu' in args.metrics: + bleu = corpus_bleu(system, + refs, + smooth_method=args.smooth, + smooth_value=args.smooth_value, + force=args.force, + lowercase=args.lc, + tokenize=args.tokenize) + if 'chrf' in args.metrics: + chrf = corpus_chrf(system, + refs[0], + beta=args.chrf_beta, + order=args.chrf_order, + remove_whitespace=not args.chrf_whitespace) + except EOFError: + logging.error( + 'The input and reference stream(s) were of different lengths.\n') + if args.test_set is not None: + logging.error( + 'This could be a problem with your system output or with sacreBLEU\'s reference database.\n' + 'If the latter, you can clean out the references cache by typing:\n' + '\n' + ' rm -r %s/%s\n' + '\n' + 'They will be downloaded automatically again the next time you run sacreBLEU.', + SACREBLEU_DIR, args.test_set) + sys.exit(1) + + width = args.width + for metric in args.metrics: + if metric == 'bleu': + if args.score_only: + print('{0:.{1}f}'.format(bleu.score, width)) + else: + version_str = bleu_signature(args, len(refs)) + print( + bleu.format(width).replace('BLEU', 
'BLEU+' + version_str)) + + elif metric == 'chrf': + if args.score_only: + print('{0:.{1}f}'.format(chrf, width)) + else: + version_str = chrf_signature(args, len(refs)) + print('chrF{0:d}+{1} = {2:.{3}f}'.format( + args.chrf_beta, version_str, chrf, width)) + + +if __name__ == '__main__': + main() diff --git a/collections/nemo_nlp/nemo_nlp/externals/tokenization.py b/collections/nemo_nlp/nemo_nlp/externals/tokenization.py new file mode 100644 index 000000000000..653f090f1595 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/externals/tokenization.py @@ -0,0 +1,405 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +from .file_utils import cached_path +import collections +import logging +import os +import unicodedata +from io import open + +logger = logging.getLogger(__name__) + +PRETRAINED_VOCAB_ARCHIVE_MAP = { + 'bert-base-uncased': + "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt", + 'bert-large-uncased': + "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt", + 'bert-base-cased': + "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt", + 'bert-large-cased': + "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt", + 'bert-base-multilingual-uncased': + "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt", + 'bert-base-multilingual-cased': + "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt", + 'bert-base-chinese': + "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt", +} +PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP = { + 'bert-base-uncased': 512, + 'bert-large-uncased': 512, + 'bert-base-cased': 512, + 'bert-large-cased': 512, + 'bert-base-multilingual-uncased': 512, + 'bert-base-multilingual-cased': 512, + 'bert-base-chinese': 512, +} +VOCAB_NAME = 'vocab.txt' + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + index = 0 + with open(vocab_file, "r") as reader: + for token in reader: + token = token.strip() + vocab[token] = index + index += 1 + return vocab + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class BertTokenizer(object): + """Runs end-to-end tokenization: punctuation splitting + wordpiece""" + + def __init__(self, + vocab_file, + do_lower_case=True, + max_len=None, + do_basic_tokenize=True, + never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): + """Constructs a BertTokenizer. 
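+
+        A minimal usage sketch (the vocabulary path below is hypothetical):
+
+            tokenizer = BertTokenizer('/path/to/vocab.txt', do_lower_case=True)
+            tokens = tokenizer.tokenize("unaffable rocks")
+            ids = tokenizer.convert_tokens_to_ids(tokens)
+            words = tokenizer.convert_ids_to_tokens(ids)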
+ + Args: + vocab_file: Path to a one-wordpiece-per-line vocabulary file + do_lower_case: Whether to lower case the input + Only has an effect when do_wordpiece_only=False + do_basic_tokenize: Whether to do basic tokenization before wordpiece. + max_len: An artificial maximum length to truncate tokenized sequences to; + Effective maximum length is always the minimum of this + value (if specified) and the underlying BERT model's + sequence length. + never_split: List of tokens which will never be split during tokenization. + Only has an effect when do_wordpiece_only=False + """ + if not os.path.isfile(vocab_file): + raise ValueError( + "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained " + "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`" + .format(vocab_file)) + self.vocab = load_vocab(vocab_file) + self.ids_to_tokens = collections.OrderedDict([ + (ids, tok) for tok, ids in self.vocab.items() + ]) + self.do_basic_tokenize = do_basic_tokenize + if do_basic_tokenize: + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case, + never_split=never_split) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + self.max_len = max_len if max_len is not None else int(1e12) + + def tokenize(self, text): + if self.do_basic_tokenize: + split_tokens = [] + for token in self.basic_tokenizer.tokenize(text): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + else: + split_tokens = self.wordpiece_tokenizer.tokenize(text) + return split_tokens + + def convert_tokens_to_ids(self, tokens): + """Converts a sequence of tokens into ids using the vocab.""" + ids = [] + for token in tokens: + ids.append(self.vocab[token]) + if len(ids) > self.max_len: + logger.warning( + "Token indices sequence length is longer than the specified maximum " + " sequence length for this BERT model ({} > {}). Running this" + " sequence through BERT will result in indexing errors".format( + len(ids), self.max_len)) + return ids + + def convert_ids_to_tokens(self, ids): + """Converts a sequence of ids in wordpiece tokens using the vocab.""" + tokens = [] + for i in ids: + tokens.append(self.ids_to_tokens[i]) + return tokens + + @classmethod + def from_pretrained(cls, + pretrained_model_name_or_path, + cache_dir=None, + *inputs, + **kwargs): + """ + Instantiate a PreTrainedBertModel from a pre-trained model file. + Download and cache the pre-trained model file if needed. + """ + if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP: + vocab_file = PRETRAINED_VOCAB_ARCHIVE_MAP[ + pretrained_model_name_or_path] + else: + vocab_file = pretrained_model_name_or_path + if os.path.isdir(vocab_file): + vocab_file = os.path.join(vocab_file, VOCAB_NAME) + # redirect to the cache, if necessary + try: + resolved_vocab_file = cached_path(vocab_file, cache_dir=cache_dir) + except EnvironmentError: + logger.error( + "Model name '{}' was not found in model name list ({}). 
" + "We assumed '{}' was a path or url but couldn't find any file " + "associated to this path or url.".format( + pretrained_model_name_or_path, + ', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()), + vocab_file)) + return None + if resolved_vocab_file == vocab_file: + logger.info("loading vocabulary file {}".format(vocab_file)) + else: + logger.info("loading vocabulary file {} from cache at {}".format( + vocab_file, resolved_vocab_file)) + if pretrained_model_name_or_path in PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP: + # if we're using a pretrained model, ensure the tokenizer wont index sequences longer + # than the number of positional embeddings + max_len = PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP[ + pretrained_model_name_or_path] + kwargs['max_len'] = min(kwargs.get('max_len', int(1e12)), max_len) + # Instantiate tokenizer. + tokenizer = cls(resolved_vocab_file, *inputs, **kwargs) + return tokenizer + + +class BasicTokenizer(object): + """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" + + def __init__(self, + do_lower_case=True, + never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): + """Constructs a BasicTokenizer. + + Args: + do_lower_case: Whether to lower case the input. + """ + self.do_lower_case = do_lower_case + self.never_split = never_split + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = self._clean_text(text) + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). 
+ text = self._tokenize_chinese_chars(text) + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if self.do_lower_case and token not in self.never_split: + token = token.lower() + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text): + """Splits punctuation on a piece of text.""" + if text in self.never_split: + return [text] + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ((cp >= 0x4E00 and cp <= 0x9FFF) or # + (cp >= 0x3400 and cp <= 0x4DBF) or # + (cp >= 0x20000 and cp <= 0x2A6DF) or # + (cp >= 0x2A700 and cp <= 0x2B73F) or # + (cp >= 0x2B740 and cp <= 0x2B81F) or # + (cp >= 0x2B820 and cp <= 0x2CEAF) or + (cp >= 0xF900 and cp <= 0xFAFF) or # + (cp >= 0x2F800 and cp <= 0x2FA1F)): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xfffd or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenization.""" + + def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """Tokenizes a piece of text into its word pieces. + + This uses a greedy longest-match-first algorithm to perform tokenization + using the given vocabulary. + + For example: + input = "unaffable" + output = ["un", "##aff", "##able"] + + Args: + text: A single token or whitespace separated tokens. This should have + already been passed through `BasicTokenizer`. + + Returns: + A list of wordpiece tokens. 
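+
+        If a token cannot be decomposed into in-vocabulary pieces, or is longer
+        than max_input_chars_per_word, it is replaced by unk_token ("[UNK]" by
+        default).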
+ """ + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically contorl characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False + + +def _is_punctuation(char): + """Checks whether `chars` is a punctuation character.""" + cp = ord(char) + # We treat all non-letter/number ASCII as punctuation. + # Characters such as "^", "$", and "`" are not in the Unicode + # Punctuation class but we treat them as punctuation anyways, for + # consistency. + if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) + or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): + return True + cat = unicodedata.category(char) + if cat.startswith("P"): + return True + return False diff --git a/collections/nemo_nlp/nemo_nlp/huggingface/__init__.py b/collections/nemo_nlp/nemo_nlp/huggingface/__init__.py new file mode 100644 index 000000000000..5074307bd60a --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/huggingface/__init__.py @@ -0,0 +1 @@ +from .bert import BERT diff --git a/collections/nemo_nlp/nemo_nlp/huggingface/bert.py b/collections/nemo_nlp/nemo_nlp/huggingface/bert.py new file mode 100644 index 000000000000..6b62d75ac670 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/huggingface/bert.py @@ -0,0 +1,125 @@ +# Copyright (c) 2019 NVIDIA Corporation + +from pytorch_transformers import BertConfig, BertModel, \ + BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP +from typing import Optional, List +from nemo.backends.pytorch.nm import TrainableNM +from nemo.core.neural_types import AxisType, BatchTag, ChannelTag, \ + NeuralType, TimeTag +from nemo.core.neural_modules import PretrainedModelInfo + + +class BERT(TrainableNM): + """ + BERT wraps around the Huggingface implementation of BERT from their + pytorch-transformers repository for easy use within NeMo. + + Args: + pretrained_model_name (str): If using a pretrained model, this should + be the model's name. Otherwise, should be left as None. + vocab_size (int): Size of the vocabulary file, if not using a + pretrained model. + hidden_size (int): Size of the encoder and pooler layers. + num_hidden_layers (int): Number of hidden layers in the encoder. + num_attention_heads (int): Number of attention heads for each layer. 
+ intermediate_size (int): Size of intermediate layers in the encoder. + hidden_act (str): Activation function for encoder and pooler layers; + "gelu", "relu", and "swish" are supported. + max_position_embeddings (int): The maximum number of tokens in a + sequence. + """ + + @staticmethod + def create_ports(): + input_ports = { + "input_ids": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "token_type_ids": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "attention_mask": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }) + } + + output_ports = { + "hidden_states": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }) + } + + return input_ports, output_ports + + def __init__(self, *, + pretrained_model_name=None, + config_filename=None, + vocab_size=None, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + max_position_embeddings=512, + **kwargs): + TrainableNM.__init__(self, **kwargs) + + # Check that only one of pretrained_model_name, config_filename, and + # vocab_size was passed in + total = 0 + + if pretrained_model_name is not None: + total += 1 + if config_filename is not None: + total += 1 + if vocab_size is not None: + total += 1 + + if total != 1: + raise ValueError("Only one of pretrained_model_name, vocab_size, " + + "or config_filename should be passed into the " + + "BERT constructor.") + + if vocab_size is not None: + config = BertConfig( + vocab_size_or_config_json_file=vocab_size, + hidden_size=hidden_size, + num_hidden_layers=num_hidden_layers, + num_attention_heads=num_attention_heads, + intermediate_size=intermediate_size, + hidden_act=hidden_act, + max_position_embeddings=max_position_embeddings) + model = BertModel(config) + elif pretrained_model_name is not None: + model = BertModel.from_pretrained(pretrained_model_name) + elif config_filename is not None: + config = BertConfig.from_json_file(config_filename) + model = BertModel(config) + else: + raise ValueError("Either pretrained_model_name or vocab_size must" + + "be passed into the BERT constructor") + + model.to(self._device) + + self.add_module("bert", model) + self.config = model.config + + @staticmethod + def list_pretrained_models() -> Optional[List[PretrainedModelInfo]]: + pretrained_models = [] + for key, value in BERT_PRETRAINED_MODEL_ARCHIVE_MAP.items(): + model_info = PretrainedModelInfo( + pretrained_model_name=key, + description="weights by HuggingFace", + parameters=BERT_PRETRAINED_CONFIG_ARCHIVE_MAP[key], + location=value) + pretrained_models.append(model_info) + return pretrained_models + + def forward(self, input_ids, token_type_ids, attention_mask): + return self.bert(input_ids, token_type_ids, attention_mask)[0] diff --git a/collections/nemo_nlp/nemo_nlp/nlp_utils.py b/collections/nemo_nlp/nemo_nlp/nlp_utils.py new file mode 100644 index 000000000000..1d8f2b695f20 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/nlp_utils.py @@ -0,0 +1,62 @@ +import numpy as np +from sklearn.metrics import confusion_matrix, classification_report + +from nemo.utils.exp_logging import get_logger + +logger = get_logger('') + + +def read_intent_slot_outputs(queries, + intent_file, + slot_file, + intent_logits, + slot_logits, + slot_masks, + intents=None, + slots=None): + intent_dict = get_vocab(intent_file) + slot_dict = get_vocab(slot_file) + pred_intents = np.argmax(intent_logits, 1) + pred_slots = np.argmax(slot_logits, axis=2) + for i, query in 
enumerate(queries): + logger.info(f'Query: {query}') + pred = pred_intents[i] + logger.info(f'Predicted intent:\t{pred}\t{intent_dict[pred]}') + if intents is not None: + logger.info( + f'True intent:\t{intents[i]}\t{intent_dict[intents[i]]}') + + pred_slot = pred_slots[i][slot_masks[i]][1:-1] + tokens = query.strip().split() + + if len(pred_slot) != len(tokens): + raise ValueError('Pred_slot and tokens must be of the same length') + + for j, token in enumerate(tokens): + output = f'{token}\t{slot_dict[pred_slot[j]]}' + if slots is not None: + output = f'{output}\t{slot_dict[slots[i][j]]}' + logger.info(output) + + +def get_vocab(file): + lines = open(file, 'r').readlines() + labels = {i: lines[i].strip() for i in range(len(lines))} + return labels + + +def write_vocab(items, outfile): + vocab = {} + idx = 0 + with open(outfile, 'w') as f: + for item in items: + f.write(item + '\n') + vocab[item] = idx + idx += 1 + return vocab + + +def write_vocab_in_order(vocab, outfile): + with open(outfile, 'w') as f: + for key in sorted(vocab.keys()): + f.write(f'{vocab[key]}\n') diff --git a/collections/nemo_nlp/nemo_nlp/text_data_utils.py b/collections/nemo_nlp/nemo_nlp/text_data_utils.py new file mode 100644 index 000000000000..88c6f5e4f95c --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/text_data_utils.py @@ -0,0 +1,507 @@ +import glob +import json +import os +import random +import shutil + +from nemo.utils.exp_logging import get_logger +from nemo_nlp.nlp_utils import get_vocab, write_vocab, write_vocab_in_order + + +logger = get_logger('') +LOGGING_TMP = '{} dataset has already been processed and stored at {}' + + +def if_exist(outfold, modes): + if not os.path.exists(outfold): + return False + for mode in modes: + if not os.path.exists(os.path.join(outfold, mode + '.tsv')): + return False + return True + + +def process_sst_2(data_dir): + if not os.path.exists(data_dir): + link = 'https://gluebenchmark.com/tasks' + raise ValueError(f'Data not found at {data_dir}. ' + 'Please download SST-2 from {link}.') + logger.info('Keep in mind that SST-2 is only available in lower case.') + return data_dir + + +def process_imdb(data_dir, uncased, modes=['train', 'test']): + if not os.path.exists(data_dir): + link = 'www.kaggle.com/iarunava/imdb-movie-reviews-dataset' + raise ValueError(f'Data not found at {data_dir}. ' + 'Please download IMDB from {link}.') + + outfold = f'{data_dir}/nemo-processed' + + if uncased: + outfold = f'{outfold}_uncased' + + if if_exist(outfold, modes): + logger.info(LOGGING_TMP.format('IMDB', outfold)) + return outfold + logger.info(f'Processing IMDB dataset and store at {outfold}') + + os.makedirs(outfold, exist_ok=True) + + outfiles = {} + + for mode in modes: + outfiles[mode] = open(os.path.join(outfold, mode + '.tsv'), 'w') + outfiles[mode].write('sentence\tlabel\n') + for sent in ['neg', 'pos']: + if sent == 'neg': + label = 0 + else: + label = 1 + files = glob.glob(f'{data_dir}/{mode}/{sent}/*.txt') + for file in files: + with open(file, 'r') as f: + review = f.read().strip() + if uncased: + review = review.lower() + review = review.replace("
", "") + outfiles[mode].write(f'{review}\t{label}\n') + + return outfold + + +def process_nlu(filename, + uncased, + modes=['train', 'test'], + dataset_name='nlu-ubuntu'): + """ Dataset has to be of: + - ubuntu + - chat + - web + """ + + if not os.path.exists(filename): + link = 'https://github.com/sebischair/NLU-Evaluation-Corpora' + raise ValueError(f'Data not found at {filename}. ' + 'Please download IMDB from {link}.') + + if dataset_name == 'nlu-ubuntu': + INTENT = {'makeupdate': 1, + 'setupprinter': 2, + 'shutdowncomputer': 3, + 'softwarerecommendation': 4, + 'none': 0} + elif dataset_name == 'nlu-chat': + INTENT = {'departuretime': 0, 'findconnection': 1} + elif dataset_name == 'nlu-web': + INTENT = {'changepassword': 1, + 'deleteaccount': 2, + 'downloadvideo': 3, + 'exportdata': 4, + 'filterspam': 5, + 'findalternative': 6, + 'syncaccounts': 7, + 'none': 0} + else: + raise ValueError(f'{dataset_name}: Invalid dataset name') + + infold = filename[:filename.rfind('/')] + outfold = f'{infold}/{dataset_name}-nemo-processed' + + if uncased: + outfold = f'{outfold}_uncased' + + if if_exist(outfold, modes): + logger.info(LOGGING_TMP.format(dataset_name.upper(), outfold)) + return outfold + logger.info(f'Processing data and store at {outfold}') + + os.makedirs(outfold, exist_ok=True) + + outfiles = {} + + for mode in modes: + outfiles[mode] = open(os.path.join(outfold, mode + '.tsv'), 'w') + outfiles[mode].write('sentence\tlabel\n') + + with open(filename, 'r') as f: + data = json.load(f) + + for obj in data['sentences']: + sentence = obj['text'].strip() + if uncased: + sentence = sentence.lower() + intent = obj['intent'].lower().replace(' ', '') + label = INTENT[intent] + txt = f'{sentence}\t{label}\n' + if obj['training']: + outfiles['train'].write(txt) + else: + outfiles['test'].write(txt) + return outfold + + +def get_car_labels(intent_file): + labels = {} + with open(intent_file, 'r') as f: + for line in f: + intent, label = line.strip().split('\t') + labels[intent] = int(label) + return labels + + +def process_nvidia_car(infold, + uncased, + modes=['train', 'test'], + test_ratio=0.02): + infiles = {'train': f'{infold}/pytextTrainDataPOI_1_0.tsv', + 'test': f'{infold}/test.tsv'} + outfold = f'{infold}/nvidia-car-nemo-processed' + intent_file = f'{outfold}/intent_labels.tsv' + + if uncased: + outfold = f'{outfold}_uncased' + + if if_exist(outfold, modes): + logger.info(LOGGING_TMP.format('NVIDIA-CAR', outfold)) + labels = get_car_labels(intent_file) + return outfold, labels + logger.info(f'Processing this dataset and store at {outfold}') + + os.makedirs(outfold, exist_ok=True) + + outfiles = {} + + for mode in modes: + outfiles[mode] = open(os.path.join(outfold, mode + '.tsv'), 'w') + outfiles[mode].write('sentence\tlabel\n') + intents, sentences = [], [] + start_index = 1 + + if mode == 'train': + all_intents = set() + start_index = 2 + + with open(infiles[mode], 'r') as f: + for line in f: + intent, _, sentence = line.strip().split('\t') + if uncased: + sentence = sentence.lower() + + if mode == 'train': + all_intents.add(intent) + intents.append(intent) + sentences.append(' '.join(sentence.split()[start_index:-1])) + + if mode == 'train': + i = 0 + labels = {} + intent_out = open(intent_file, 'w') + for intent in all_intents: + labels[intent] = i + logger.info(f'{intent}\t{i}') + intent_out.write(f'{intent}\t{i}\n') + i += 1 + + seen, repeat = set(), 0 + for intent, sentence in zip(intents, sentences): + if sentence in seen: + if mode == 'test': + print(sentence) + repeat += 1 + 
continue + text = f'{sentence}\t{labels[intent]}\n' + outfiles[mode].write(text) + seen.add(sentence) + logger.info(f'{repeat} repeated sentences in {mode}') + + return outfold, labels + + +def process_twitter_airline(filename, uncased, modes=['train', 'test']): + """ Dataset from Kaggle: + https://www.kaggle.com/crowdflower/twitter-airline-sentiment + """ + pass + + +def ids2text(ids, vocab): + return ' '.join([vocab[int(id_)] for id_ in ids]) + + +def process_atis(infold, uncased, modes=['train', 'test'], dev_split=0): + """ MSFT's dataset, processed by Kaggle + https://www.kaggle.com/siddhadev/atis-dataset-from-ms-cntk + """ + outfold = f'{infold}/nemo-processed' + infold = f'{infold}/data/raw_data/ms-cntk-atis' + vocab = get_vocab(f'{infold}/atis.dict.vocab.csv') + + if uncased: + outfold = f'{outfold}-uncased' + + if if_exist(outfold, modes): + logger.info(LOGGING_TMP.format('ATIS', outfold)) + return outfold + logger.info(f'Processing ATIS dataset and store at {outfold}') + + os.makedirs(outfold, exist_ok=True) + + outfiles = {} + + for mode in modes: + outfiles[mode] = open(os.path.join(outfold, mode + '.tsv'), 'w') + outfiles[mode].write('sentence\tlabel\n') + outfiles[mode + '_slots'] = open(f'{outfold}/{mode}_slots.tsv', 'w') + + queries = open(f'{infold}/atis.{mode}.query.csv', 'r').readlines() + intents = open(f'{infold}/atis.{mode}.intent.csv', 'r').readlines() + slots = open(f'{infold}/atis.{mode}.slots.csv', 'r').readlines() + + for i, query in enumerate(queries): + sentence = ids2text(query.strip().split()[1:-1], vocab) + outfiles[mode].write(f'{sentence}\t{intents[i].strip()}\n') + slot = ' '.join(slots[i].strip().split()[1:-1]) + outfiles[mode + '_slots'].write(slot + '\n') + + shutil.copyfile(f'{infold}/atis.dict.intent.csv', + f'{outfold}/dict.intents.csv') + shutil.copyfile(f'{infold}/atis.dict.slots.csv', + f'{outfold}/dict.slots.csv') + + return outfold + + +def reverse_dict(entity2value): + value2entity = {} + for entity in entity2value: + for value in entity2value[entity]: + value2entity[value] = entity + return value2entity + + +def map_entities(entity2value, entities): + for key in entities: + if 'data' in entities[key]: + if key not in entity2value: + entity2value[key] = set([]) + + values = [] + for value in entities[key]['data']: + values.append(value['value']) + values.extend(value['synonyms']) + entity2value[key] = entity2value[key] | set(values) + + return entity2value + + +def get_entities(files): + entity2value = {} + for file in files: + with open(file, 'r') as json_file: + data = json.load(json_file) + entity2value = map_entities(entity2value, data['entities']) + + value2entity = reverse_dict(entity2value) + return entity2value, value2entity + + +def get_data(files, entity2value, value2entity): + all_data, all_slots, all_intents = [], set(['O']), set() + for file in files: + file_data = [] + with open(file, 'r') as json_file: + data = json.load(json_file) + for intent in data['intents']: + all_intents.add(intent) + utterances = data['intents'][intent]['utterances'] + for utterance in utterances: + tokens, slots = [], [] + for frag in utterance['data']: + frag_tokens = frag['text'].strip().split() + tokens.extend(frag_tokens) + if 'slot_name' not in frag: + slot = 'O' + else: + slot = frag['slot_name'] + all_slots.add(slot) + slots.extend([slot] * len(frag_tokens)) + file_data.append((tokens, slots, intent)) + all_data.append(file_data) + return all_data, all_slots, all_intents + + +def get_dataset(files, dev_split=0.1): + entity2value, value2entity = 
get_entities(files) + data, slots, intents = get_data(files, entity2value, value2entity) + if len(data) == 1: + train, dev = partition(data[0], split=dev_split) + else: + train, dev = data[0], data[1] + return train, dev, slots, intents + + +def partition(data, split=0.1): + n = len(data) + n_dev = int(n * split) + dev_idx = set(random.sample(range(n), n_dev)) + dev, train = [], [] + + for i, item in enumerate(data): + if i in dev_idx: + dev.append(item) + else: + train.append(item) + return train, dev + + +def write_data(data, slot_dict, intent_dict, outfold, mode, uncased): + intent_file = open(f'{outfold}/{mode}.tsv', 'w') + intent_file.write('sentence\tlabel\n') + slot_file = open(f'{outfold}/{mode}_slots.tsv', 'w') + for tokens, slots, intent in data: + text = ' '.join(tokens) + if uncased: + text = text.lower() + intent_file.write(f'{text}\t{intent_dict[intent]}\n') + slots = [str(slot_dict[slot]) for slot in slots] + slot_file.write(' '.join(slots) + '\n') + intent_file.close() + slot_file.close() + + +def create_dataset(train, dev, slots, intents, uncased, outfold): + os.makedirs(outfold, exist_ok=True) + if 'O' in slots: + slots.remove('O') + slots = sorted(list(slots)) + ['O'] + intents = sorted(list(intents)) + slots = write_vocab(slots, f'{outfold}/dict.slots.csv') + intents = write_vocab(intents, f'{outfold}/dict.intents.csv') + write_data(train, slots, intents, outfold, 'train', uncased) + write_data(dev, slots, intents, outfold, 'test', uncased) + + +def process_snips(data_dir, uncased, modes=['train', 'test'], dev_split=0.1): + if not os.path.exists(data_dir): + link = 'www.github.com/snipsco/spoken-language' + '-understanding-research-datasets' + raise ValueError(f'Data not found at {data_dir}. ' + 'Resquest to download the SNIPS dataset from {link}.') + + outfold = f'{data_dir}/nemo-processed' + + if uncased: + outfold = f'{outfold}-uncased' + + exist = True + for dataset in ['light', 'speak', 'all']: + if if_exist(f'{outfold}/{dataset}', modes): + logger.info(LOGGING_TMP.format( + 'SNIPS-' + dataset.upper(), outfold)) + else: + exist = False + if exist: + return outfold + + logger.info(f'Processing SNIPS dataset and store at {outfold}') + + os.makedirs(outfold, exist_ok=True) + + speak_dir = 'smart-speaker-en-close-field' + light_dir = 'smart-lights-en-close-field' + + light_files = [f'{data_dir}/{light_dir}/dataset.json'] + speak_files = [f'{data_dir}/{speak_dir}/training_dataset.json'] + speak_files.append(f'{data_dir}/{speak_dir}/test_dataset.json') + + light_train, light_dev, light_slots, light_intents = get_dataset( + light_files, dev_split) + speak_train, speak_dev, speak_slots, speak_intents = get_dataset( + speak_files) + + create_dataset(light_train, light_dev, light_slots, + light_intents, uncased, f'{outfold}/light') + create_dataset(speak_train, speak_dev, speak_slots, + speak_intents, uncased, f'{outfold}/speak') + create_dataset(light_train + speak_train, light_dev + speak_dev, + light_slots | speak_slots, light_intents | speak_intents, + uncased, f'{outfold}/all') + + return outfold + + +def list2str(nums): + return ' '.join([str(num) for num in nums]) + + +def merge(data_dir, subdirs, dataset_name, modes=['train', 'test']): + outfold = f'{data_dir}/{dataset_name}' + if if_exist(outfold, modes): + logger.info(LOGGING_TMP.format('SNIPS-ATIS', outfold)) + slots = get_vocab(f'{outfold}/dict.slots.csv') + none_slot = 0 + for key in slots: + if slots[key] == 'O': + none_slot = key + break + return outfold, int(none_slot) + + os.makedirs(outfold, exist_ok=True) 
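# A minimal usage sketch of the vocabulary helpers relied on here: the
# dict.*.csv files that merge() reads via get_vocab() above hold one label per
# line, so get_vocab() maps line index -> label, write_vocab() maps
# label -> index, and write_vocab_in_order() writes the labels back sorted by
# index. Paths and labels below are hypothetical placeholders.
from nemo_nlp.nlp_utils import get_vocab, write_vocab, write_vocab_in_order

slot2id = write_vocab(['flight', 'city_name', 'O'], '/tmp/dict.slots.csv')
# slot2id == {'flight': 0, 'city_name': 1, 'O': 2}
id2slot = get_vocab('/tmp/dict.slots.csv')
# id2slot == {0: 'flight', 1: 'city_name', 2: 'O'}
write_vocab_in_order(id2slot, '/tmp/dict.slots.copy.csv')  # identical contents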
+ + data_files, slot_files = {}, {} + for mode in modes: + data_files[mode] = open(f'{outfold}/{mode}.tsv', 'w') + data_files[mode].write('sentence\tlabel\n') + slot_files[mode] = open(f'{outfold}/{mode}_slots.tsv', 'w') + + intents, slots = {}, {} + intent_shift, slot_shift = 0, 0 + none_intent, none_slot = -1, -1 + + for subdir in subdirs: + curr_intents = get_vocab(f'{data_dir}/{subdir}/dict.intents.csv') + curr_slots = get_vocab(f'{data_dir}/{subdir}/dict.slots.csv') + + for key in curr_intents: + if intent_shift > 0 and curr_intents[key] == 'O': + continue + if curr_intents[key] == 'O' and intent_shift == 0: + none_intent = int(key) + intents[int(key) + intent_shift] = curr_intents[key] + + for key in curr_slots: + if slot_shift > 0 and curr_slots[key] == 'O': + continue + if slot_shift == 0 and curr_slots[key] == 'O': + none_slot = int(key) + slots[int(key) + slot_shift] = curr_slots[key] + + for mode in modes: + with open(f'{data_dir}/{subdir}/{mode}.tsv', 'r') as f: + for line in f.readlines()[1:]: + text, label = line.strip().split('\t') + label = int(label) + if curr_intents[label] == 'O': + label = none_intent + else: + label = label + intent_shift + data_files[mode].write(f'{text}\t{label}\n') + + with open(f'{data_dir}/{subdir}/{mode}_slots.tsv', 'r') as f: + for line in f.readlines(): + labels = [int(label) for label in line.strip().split()] + shifted_labels = [] + for label in labels: + if curr_slots[label] == 'O': + shifted_labels.append(none_slot) + else: + shifted_labels.append(label + slot_shift) + slot_files[mode].write(list2str(shifted_labels) + '\n') + + intent_shift += len(curr_intents) + slot_shift += len(curr_slots) + + write_vocab_in_order(intents, f'{outfold}/dict.intents.csv') + write_vocab_in_order(slots, f'{outfold}/dict.slots.csv') + return outfold, none_slot diff --git a/collections/nemo_nlp/nemo_nlp/transformer/__init__.py b/collections/nemo_nlp/nemo_nlp/transformer/__init__.py new file mode 100644 index 000000000000..1bd002d13f87 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/transformer/__init__.py @@ -0,0 +1,7 @@ +# Copyright (c) 2019 NVIDIA Corporation +from .modules import * +from .encoders import * +from .decoders import * +from .softmax_layers import * +from .losses import * +from .generators import * diff --git a/collections/nemo_nlp/nemo_nlp/transformer/decoders.py b/collections/nemo_nlp/nemo_nlp/transformer/decoders.py new file mode 100644 index 000000000000..8de7469448e3 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/transformer/decoders.py @@ -0,0 +1,98 @@ +import copy +import torch +import torch.nn as nn +from .modules import MultiHeadAttention, PositionWiseFF +from .utils import form_attention_mask + + +class TransformerDecoderBlock(nn.Module): + """ + Building block of Transformer decoder. 
+ + Args: + hidden_size: size of the embeddings in the model, also known as d_model + inner_size: number of neurons in the intermediate part of feed-forward + net, usually is (4-8 x hidden_size) in the papers + num_attention_heads: number of heads in multi-head attention + attn_score_dropout: probability of dropout applied to attention scores + attn_layer_dropout: probability of dropout applied to the output of the + attention layers, but before layer normalization + ffn_dropout: probability of dropout applied to FFN output + hidden_act: activation function used between two linear layers in FFN + """ + + def __init__(self, hidden_size, inner_size, num_attention_heads=1, + attn_score_dropout=0, attn_layer_dropout=0, ffn_dropout=0, + hidden_act="relu"): + super().__init__() + + self.first_sub_layer = MultiHeadAttention( + hidden_size, num_attention_heads, + attn_score_dropout, attn_layer_dropout) + self.second_sub_layer = MultiHeadAttention( + hidden_size, num_attention_heads, + attn_score_dropout, attn_layer_dropout) + self.third_sub_layer = PositionWiseFF( + hidden_size, inner_size, ffn_dropout, hidden_act) + + def forward(self, decoder_query, decoder_mask, decoder_keys, + encoder_states, encoder_mask): + self_attn_output = self.first_sub_layer( + decoder_query, decoder_keys, decoder_keys, decoder_mask) + enc_dec_attn_output = self.second_sub_layer( + self_attn_output, encoder_states, encoder_states, encoder_mask) + output_states = self.third_sub_layer(enc_dec_attn_output) + return output_states + + +class TransformerDecoder(nn.Module): + + def __init__(self, num_layers, hidden_size, **kwargs): + super().__init__() + + layer = TransformerDecoderBlock(hidden_size, **kwargs) + self.layers = nn.ModuleList( + [copy.deepcopy(layer) for _ in range(num_layers)]) + + def _get_memory_states(self, decoder_states, decoder_mems_list=None, i=0): + if decoder_mems_list is not None: + memory_states = torch.cat( + (decoder_mems_list[i], decoder_states), dim=1) + else: + memory_states = decoder_states + return memory_states + + def forward(self, decoder_states, decoder_mask, encoder_states, + encoder_mask, decoder_mems_list=None, return_mems=False): + """ + Args: + decoder_states: output of the embedding layer (B x L_dec x H) + decoder_mask: decoder inputs mask (B x L_dec) + encoder_states: output of the encoder (B x L_enc x H) + encoder_mask: encoder inputs mask (B x L_enc) + decoder_mems_list: list of the cached decoder hidden states + for fast autoregressive generation which will be used instead + of decoder_states as keys and values if not None + return_mems: bool, whether to return outputs of all decoder layers + or the last layer only + """ + + decoder_attn_mask = form_attention_mask(decoder_mask, diagonal=0) + encoder_attn_mask = form_attention_mask(encoder_mask) + + memory_states = self._get_memory_states( + decoder_states, decoder_mems_list, 0) + cached_mems_list = [memory_states] + + for i, layer in enumerate(self.layers): + decoder_states = layer( + decoder_states, decoder_attn_mask, memory_states, + encoder_states, encoder_attn_mask) + memory_states = self._get_memory_states(decoder_states, + decoder_mems_list, i + 1) + cached_mems_list.append(memory_states) + + if return_mems: + return cached_mems_list + else: + return cached_mems_list[-1] diff --git a/collections/nemo_nlp/nemo_nlp/transformer/encoders.py b/collections/nemo_nlp/nemo_nlp/transformer/encoders.py new file mode 100644 index 000000000000..44dcca895081 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/transformer/encoders.py @@ -0,0 
+1,128 @@ +import copy +import torch +import torch.nn as nn +from .modules import MultiHeadAttention, PositionWiseFF, TwoStreamSelfAttention +from .utils import form_attention_mask + + +class TransformerEncoderBlock(nn.Module): + """ + Building block of Transformer encoder. + + Args: + hidden_size: size of the embeddings in the model, also known as d_model + inner_size: number of neurons in the intermediate part of feed-forward + net, usually is (4-8 x hidden_size) in the papers + num_attention_heads: number of heads in multi-head attention + attn_score_dropout: probability of dropout applied to attention scores + attn_layer_dropout: probability of dropout applied to the output of the + attention layers, but before layer normalization + ffn_dropout: probability of dropout applied to FFN output + hidden_act: activation function used between two linear layers in FFN + """ + + def __init__(self, hidden_size, inner_size, num_attention_heads=1, + attn_score_dropout=0, attn_layer_dropout=0, ffn_dropout=0, + hidden_act="relu"): + super().__init__() + + self.first_sub_layer = MultiHeadAttention( + hidden_size, num_attention_heads, + attn_score_dropout, attn_layer_dropout) + self.second_sub_layer = PositionWiseFF( + hidden_size, inner_size, ffn_dropout, hidden_act) + + def forward(self, encoder_query, encoder_mask, encoder_keys): + self_attn_output = self.first_sub_layer( + encoder_query, encoder_keys, encoder_keys, encoder_mask) + output_states = self.second_sub_layer(self_attn_output) + return output_states + + +class TransformerEncoder(nn.Module): + def __init__(self, num_layers, hidden_size, mask_future=False, **kwargs): + super().__init__() + + layer = TransformerEncoderBlock(hidden_size, **kwargs) + self.layers = nn.ModuleList( + [copy.deepcopy(layer) for _ in range(num_layers)]) + self.diag = 0 if mask_future else None + + def _get_memory_states(self, encoder_states, encoder_mems_list=None, i=0): + if encoder_mems_list is not None: + memory_states = torch.cat( + (encoder_mems_list[i], encoder_states), dim=1) + else: + memory_states = encoder_states + return memory_states + + def forward(self, encoder_states, encoder_mask, encoder_mems_list=None, + return_mems=False): + """ + Args: + encoder_states: output of the embedding_layer (B x L_enc x H) + encoder_mask: encoder inputs mask (B x L_enc) + encoder_mems_list: list of the cached encoder hidden states + for fast autoregressive generation which will be used instead + of encoder_states as keys and values if not None + return_mems: bool, whether to return outputs of all encoder layers + or the last layer only + """ + + encoder_attn_mask = form_attention_mask(encoder_mask, self.diag) + + memory_states = self._get_memory_states( + encoder_states, encoder_mems_list, 0) + cached_mems_list = [memory_states] + + for i, layer in enumerate(self.layers): + encoder_states = layer( + encoder_states, encoder_attn_mask, memory_states) + memory_states = self._get_memory_states( + encoder_states, encoder_mems_list, i + 1) + cached_mems_list.append(memory_states) + + if return_mems: + return cached_mems_list + else: + return cached_mems_list[-1] + + +class XLNetEncoderBlock(nn.Module): + + def __init__(self, hidden_size, inner_size, num_attention_heads=1, + attn_score_dropout=0, attn_layer_dropout=0, ffn_dropout=0, + hidden_act="relu"): + super().__init__() + + self.first_sub_layer = TwoStreamSelfAttention( + hidden_size, num_attention_heads, + attn_score_dropout, attn_layer_dropout) + self.second_sub_layer = PositionWiseFF( + hidden_size, inner_size, 
ffn_dropout, hidden_act) + + def forward(self, query_states, content_states, query_attn_mask, + content_attn_mask): + output_query_states, output_content_states = self.first_sub_layer( + query_states, content_states, query_attn_mask, content_attn_mask) + output_content_states = self.second_sub_layer(output_content_states) + return output_query_states, output_content_states + + +class XLNetEncoder(nn.Module): + + def __init__(self, num_layers, hidden_size, **kwargs): + super().__init__() + + layer = XLNetEncoderBlock(hidden_size, **kwargs) + self.layers = nn.ModuleList( + [copy.deepcopy(layer) for _ in range(num_layers)]) + + def forward(self, query_states, content_states, input_mask): + query_attn_mask = form_attention_mask(input_mask, diagonal=-1) + content_attn_mask = form_attention_mask(input_mask, diagonal=0) + for layer in self.layers: + query_states, content_states = layer( + query_states, content_states, + query_attn_mask, content_attn_mask) + return query_states, content_states diff --git a/collections/nemo_nlp/nemo_nlp/transformer/generators.py b/collections/nemo_nlp/nemo_nlp/transformer/generators.py new file mode 100644 index 000000000000..af09c8eed985 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/transformer/generators.py @@ -0,0 +1,301 @@ +import torch +import torch.nn as nn +from .utils import mask_padded_tokens, NEG_INF + + +class GreedySequenceGenerator(nn.Module): + def __init__(self, embedding, decoder, log_softmax, pad=0, bos=1, eos=2, + max_sequence_length=512, max_delta_length=20, batch_size=1): + """ + Greedy sequence generator based on the decoder followed by log_softmax. + + Args: + embedding: nn.Module, transforms input_ids into vector embeddings + decoder: nn.Module, takes embeddings and produces hidden_states + log_softmax: nn.Module, takes hidden_states and produces log_probs + which correspond to probability distribution of tokens (ids) + pad: index of padding token in the vocabulary + bos: index of beginning of sequence token in the vocabulary + eos: index of end of sequence token in the vocabulary + max_sequence_length: maximum allowed length for generated sequences + max_delta_length: in case of encoder-decoder generation (e.g. NMT), + forbids generated sequences to be longer than the length of + source sequences plus max_delta_length + batch_size: size of the batch of generated sequences if neither + source nor target starting sequences are provided + """ + + super().__init__() + self.embedding = embedding + self.decoder = decoder + self.log_softmax = log_softmax + self.pad, self.bos, self.eos = pad, bos, eos + self.max_seq_length = max_sequence_length + self.max_delta_len = max_delta_length + self.batch_size = batch_size + self.device = next(self.decoder.parameters()).device + + @torch.no_grad() + def _forward(self, decoder_input_ids=None, encoder_hidden_states=None, + encoder_input_mask=None, decoder_mems_list=None, pos=0): + """ + One step of autoregressive output generation. 
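# A brief aside on the XLNetEncoder above: the two masks built in its forward()
# differ only in the diagonal passed to form_attention_mask(). With diagonal=-1
# the query stream may attend to strictly earlier positions (never its own
# content), while diagonal=0 lets the content stream attend up to and including
# itself. A small inspection sketch with a hypothetical unpadded length-4 sequence:
import torch
from nemo_nlp.transformer.utils import form_attention_mask

input_mask = torch.ones(1, 4)
query_mask = form_attention_mask(input_mask, diagonal=-1)   # diagonal blocked (-10000)
content_mask = form_attention_mask(input_mask, diagonal=0)  # diagonal allowed (0)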
+ + Args: + decoder_input_ids: starting sequence of tokens to generate from; + if None, generation will start from a batch of tokens + encoder_hidden_states: output of the encoder for conditional + sequence generation; if None, generator will use unconditional + mode (e.g., language modeling) + encoder_input_mask: input mask used in the encoder + decoder_mems_list: list of size num_layers with cached activations + of sequence (x[1], ..., x[k-1]) for fast generation of x[k] + pos: starting position in positional encoding + """ + + decoder_hidden_states = self.embedding.forward( + decoder_input_ids, start_pos=pos) + decoder_input_mask = mask_padded_tokens( + decoder_input_ids, self.pad).float() + + if encoder_hidden_states is not None: + decoder_mems_list = self.decoder.forward( + decoder_hidden_states, decoder_input_mask, + encoder_hidden_states, encoder_input_mask, decoder_mems_list, + return_mems=True) + else: + decoder_mems_list = self.decoder.forward( + decoder_hidden_states, decoder_input_mask, + decoder_mems_list, return_mems=True) + log_probs = self.log_softmax.forward(decoder_mems_list[-1]) + return log_probs, decoder_mems_list + + def _prepare_for_search(self, decoder_input_ids=None, + encoder_hidden_states=None): + """ + Helper function which defines starting sequence to begin generating + with and maximum allowed number of tokens to be generated. + """ + + batch_size = self.batch_size + + # for encoder-decoder generation, maximum length of generated sequence + # is min(max_sequence_length, src_len + max_delta_length) + if encoder_hidden_states is not None: + batch_size, src_len, _ = encoder_hidden_states.size() + max_seq_length = min( + self.max_seq_length, src_len + self.max_delta_len) + else: + max_seq_length = self.max_seq_length + + # if no input is provided, start with the batch of tokens + if decoder_input_ids is not None: + tgt = decoder_input_ids + batch_size, tgt_len = decoder_input_ids.size() + else: + tgt = torch.zeros( + batch_size, 1).long().fill_(self.bos).to(self.device) + tgt_len = 1 + max_generation_length = max_seq_length - tgt_len + + return tgt, batch_size, max_generation_length + + def forward(self, decoder_input_ids=None, encoder_hidden_states=None, + encoder_input_mask=None): + + tgt, batch_size, max_generation_length = self._prepare_for_search( + decoder_input_ids, encoder_hidden_states) + + # pad profile tracks sequences ending with token to replace + # everything after with token + pad_profile = torch.zeros(batch_size, 1).long().to(self.device) + + decoder_mems_list = None + for i in range(max_generation_length): + + log_probs, decoder_mems_list = self._forward( + tgt[:, -1:], encoder_hidden_states, encoder_input_mask, + decoder_mems_list, i) + + next_tokens = torch.argmax(log_probs[:, -1], dim=-1, keepdim=True) + next_tokens = self.pad * pad_profile + \ + next_tokens * (1 - pad_profile) + pad_profile = torch.max( + pad_profile, (next_tokens == self.eos).long()) + tgt = torch.cat((tgt, next_tokens), dim=-1) + + # abort generation if all sequences end with + if pad_profile.sum() == batch_size: + break + + return tgt + + +class TopKSequenceGenerator(GreedySequenceGenerator): + def __init__(self, embedding, decoder, log_softmax, beam_size=1, + temperature=1.0, **kwargs): + """ + Top-k sequence generator based on the decoder followed by log_softmax. + + Args: + *all args of GreedySequenceGenerator class + beam_size: size of the beam (parameter k in top-k) + temperature: temperature of top-k sampling, all logits are divided + by temperature before rescaling. 
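# A hypothetical end-to-end sketch of the greedy generator above in its
# unconditional (language-modeling) mode. The "decoder" role is played by a
# future-masked TransformerEncoder, whose forward signature matches the call made
# when encoder_hidden_states is None; all sizes are placeholder values and apex
# is required for the underlying FusedLayerNorm.
from nemo_nlp.transformer import (TransformerEmbedding, TransformerEncoder,
                                  TransformerLogSoftmax, GreedySequenceGenerator)

embedding = TransformerEmbedding(vocab_size=1000, hidden_size=256)
lm_body = TransformerEncoder(num_layers=2, hidden_size=256, mask_future=True,
                             inner_size=1024, num_attention_heads=4)
log_softmax = TransformerLogSoftmax(vocab_size=1000, hidden_size=256)
generator = GreedySequenceGenerator(embedding, lm_body, log_softmax,
                                    pad=0, bos=1, eos=2,
                                    max_sequence_length=64, batch_size=4)
# output_ids = generator()  # starts each sequence from bos, stops at eos or max length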
High temperature leads to + uniform distribution, low leads to delta-like distribution. + Kwargs: + all remaining parameters of GreedySequenceGenerator class + """ + + super().__init__(embedding, decoder, log_softmax, **kwargs) + self.beam_size = beam_size + self.temp = temperature + + @torch.no_grad() + def _forward(self, decoder_input_ids=None, encoder_hidden_states=None, + encoder_input_mask=None, decoder_mems_list=None, pos=0): + + log_probs, decoder_mems_list = super()._forward( + decoder_input_ids, encoder_hidden_states, encoder_input_mask, + decoder_mems_list, pos) + + batch_size, seq_len, vocab_size = log_probs.size() + scores, indices = torch.topk(log_probs, self.beam_size, dim=-1) + + rescaled_logexp = torch.zeros_like( + log_probs).scatter(-1, indices, scores.div(self.temp).exp()) + probs = rescaled_logexp / rescaled_logexp.norm(1, -1, keepdim=True) + + # We randomly sample next tokens from rescaled probability distribution + # over top-k candidates and return a binary tensor which indicates + # candidates that have been selected. We call this object + # `pseudo_log_probs` as genuine log_probs should have -infs instead of + # 0s and 0s instead of 1s. + ids = torch.multinomial( + probs.view(-1, vocab_size), 1).view(-1, seq_len, 1) + pseudo_log_probs = torch.zeros_like(log_probs).scatter(-1, ids, 1.0) + + return pseudo_log_probs, decoder_mems_list + + +class BeamSearchSequenceGenerator(GreedySequenceGenerator): + def __init__(self, embedding, decoder, log_softmax, beam_size=1, + len_pen=0, **kwargs): + """ + Beam Search sequence generator based on the decoder followed by + log_softmax. + + Args: + *all args of GreedySequenceGenerator class + beam_size: size of the beam + len_pen: length penalty parameter + Kwargs: + all remaining parameters of GreedySequenceGenerator class + """ + + super().__init__(embedding, decoder, log_softmax, **kwargs) + self.beam_size = beam_size + self.len_pen = len_pen + + def forward(self, decoder_input_ids=None, encoder_hidden_states=None, + encoder_input_mask=None): + + tgt, batch_size, max_generation_length = self._prepare_for_search( + decoder_input_ids, encoder_hidden_states) + + # generate initial buffer of beam_size prefixes-hypotheses + log_probs, decoder_mems_list = self._forward( + tgt, encoder_hidden_states, encoder_input_mask, None, 0) + scores, prefixes = torch.topk( + log_probs.permute(0, 2, 1), self.beam_size, dim=1) + scores, prefixes = scores.view(-1, 1), prefixes.view(-1, 1) + + # repeat init target prefixes and cached memory states beam_size times + prefixes = torch.cat( + (tgt.repeat(1, self.beam_size).view(-1, 1), prefixes), dim=1) + for j in range(len(decoder_mems_list)): + decoder_mems_list[j] = \ + decoder_mems_list[j].repeat(self.beam_size, 1, 1) + + # repeat source sequence beam_size times for beam search + if encoder_hidden_states is not None: + _, src_length, hidden_size = encoder_hidden_states.size() + encoder_input_mask = encoder_input_mask.repeat( + 1, self.beam_size).view(-1, src_length) + encoder_hidden_states = encoder_hidden_states.repeat( + 1, self.beam_size, 1).view(-1, src_length, hidden_size) + else: + hidden_size = decoder_mems_list[0].size(2) + + # pad_profile tracks finished hypotheses to generate only tokens + # if or has been generated + pad_profile = torch.zeros_like(scores).long() + + # prefixes_len tracks lengths of generated hypotheses to perform + # length penalty correction + prefixes_len = torch.zeros_like(scores).fill_(prefixes.size(1) + 1) + + for i in range(max_generation_length): + + # mask all 
finished hypotheses to exclude them from beam + pad_mask = pad_profile.repeat(1, self.beam_size) + + # generate and score candidates for prefixes continuation + log_probs, decoder_mems_list = self._forward( + prefixes[:, -1:], encoder_hidden_states, encoder_input_mask, + decoder_mems_list, i+1) + scores_i, prefixes_i = torch.topk( + log_probs[:, -1, :], self.beam_size, dim=-1) + + # for all prefixes ending with or replace generated + # continuations with + prefixes_i = self.pad * pad_mask + prefixes_i * (1 - pad_mask) + + # force all hypotheses but one generated from already finished + # hypotheses to have extremely low score, so they will not be + # considered during beam re-ranking + pad_mask[:, 1:] = pad_mask[:, 1:] * NEG_INF + scores = scores + scores_i * (1 - pad_mask).to(scores.dtype) + + # choose top-k hypotheses with length penalty applied + scores = scores / prefixes_len.pow(self.len_pen) + scores, indices_i = torch.topk(scores.view( + -1, self.beam_size**2), self.beam_size, dim=1) + scores = scores.view(-1, 1) * prefixes_len.pow(self.len_pen) + + # select prefixes which correspond to the chosen hypotheses + prefixes = prefixes.unsqueeze(1).repeat(1, self.beam_size, 1) + prefixes = torch.cat((prefixes, prefixes_i.unsqueeze(2)), dim=2) + prefixes = prefixes.view(batch_size, self.beam_size**2, -1) + p_len = prefixes.size(2) + prefixes_ids = indices_i.unsqueeze(2).repeat(1, 1, p_len) + prefixes = prefixes.gather(1, prefixes_ids).view(-1, p_len) + + # reshuffle cached decoder memory states to restore the order + # of hypotheses broken after top-k selection + mems_ids = indices_i.unsqueeze(2).unsqueeze(3).repeat( + 1, 1, p_len-1, hidden_size) // self.beam_size + for j in range(len(decoder_mems_list)): + decoder_mems_list[j] = decoder_mems_list[j].view( + -1, self.beam_size, p_len-1, hidden_size).gather( + 1, mems_ids).view(-1, p_len-1, hidden_size) + + # update prefixes_len and pad_profile + not_eos_pad = prefixes.ne(self.eos) & prefixes.ne(self.pad) + prefixes_len = 1 + not_eos_pad.sum( + dim=1, keepdim=True).to(scores.dtype) + pad_profile = (1 - not_eos_pad[:, -1:]).long() + + # if all hypotheses end with or , interrupt search + if pad_profile.sum() == batch_size * self.beam_size: + break + + # select best performing hypotheses in each element of the batch + scores = scores / prefixes_len.pow(self.len_pen) + best_guesses = torch.argmax( + scores.view(-1, self.beam_size), dim=1, keepdim=True).repeat( + 1, prefixes.size(1)).unsqueeze(1) + tgt = prefixes.view( + batch_size, self.beam_size, -1).gather(1, best_guesses) + + return tgt.squeeze(1) diff --git a/collections/nemo_nlp/nemo_nlp/transformer/losses.py b/collections/nemo_nlp/nemo_nlp/transformer/losses.py new file mode 100644 index 000000000000..f676a4de2294 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/transformer/losses.py @@ -0,0 +1,63 @@ +import torch +from torch import nn + + +class SmoothedCrossEntropyLoss(nn.Module): + """ + Cross-entropy loss with label smoothing for a batch of sequences. 
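# A small numeric check of the smoothing formulation implemented below: with
# vocab_size=5 and label_smoothing=0.1, the effective coefficient is
# 5 * 0.1 / (5 - 1) = 0.125, so each token's loss is
# -(0.875 * log_prob[target] + 0.125 * mean(log_prob over the vocabulary)).
# The tensor shapes here are hypothetical placeholders.
import torch
from nemo_nlp.transformer.losses import SmoothedCrossEntropyLoss

log_probs = torch.log_softmax(torch.randn(1, 1, 5), dim=-1)   # B=1, L=1, V=5
target_ids = torch.tensor([[2]])
output_mask = torch.ones(1, 1)
loss = SmoothedCrossEntropyLoss(label_smoothing=0.1)(log_probs, target_ids, output_mask)
by_hand = -(0.875 * log_probs[0, 0, 2] + 0.125 * log_probs[0, 0].mean())
# loss matches by_hand up to the 1e-6 added to the mask sum in the denominator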
+ + Args: + label_smoothing: label smoothing coefficient, usually set between 0.0 + and 0.1 in language modeling and translation pipelines + predict_last_k: int parameter which sets the number of last tokens to + calculate the loss for, for example + 0: (default) calculate loss on the entire sequence (e.g., NMT) + 1: calculate loss on the last token only (e.g., LM evaluation) + Intermediate values allow to control the trade-off between eval + time (proportional to the number of batches) and eval performance + (proportional to the number of context tokens). + """ + + def __init__(self, label_smoothing=0.0, predict_last_k=0): + super().__init__() + self._smoothing = label_smoothing + self._predict_last_k = predict_last_k + + def forward(self, log_probs, output_ids, output_mask): + """ + Args: + log_probs: float tensor of shape batch_size x seq_len x vocab_size + output_ids: int tensor of shape batch_size x seq_len + output_mask: binary tensor of shape batch_size x seq_len + """ + batch_size, seq_len, vocab_size = log_probs.size() + smoothing = vocab_size * self._smoothing / (vocab_size - 1) + target_log_probs = log_probs.gather( + 2, output_ids.unsqueeze(2)).squeeze(2) + smoothing_log_probs = log_probs.mean(dim=-1) + neg_log_likelihood = (1.0 - smoothing) * target_log_probs + \ + smoothing * smoothing_log_probs + neg_log_likelihood = neg_log_likelihood[:, -self._predict_last_k:] + output_mask = output_mask[:, -self._predict_last_k:] + neg_log_likelihood = -torch.sum(neg_log_likelihood * output_mask) + neg_log_likelihood = neg_log_likelihood / (output_mask.sum() + 1e-6) + return neg_log_likelihood + + +class SequenceClassificationLoss(nn.Module): + """ + Sequence classification loss. + """ + + def __init__(self): + super().__init__() + + def forward(self, log_probs, labels): + """ + Args: + log_probs: float tensor of shape batch_size x num_classes + labels: int tensor of shape batch_size + """ + log_probs_target = log_probs.gather(1, labels.unsqueeze(1)) + neg_log_likelihood = -torch.mean(log_probs_target) + return neg_log_likelihood diff --git a/collections/nemo_nlp/nemo_nlp/transformer/modules.py b/collections/nemo_nlp/nemo_nlp/transformer/modules.py new file mode 100644 index 000000000000..0b82b17f97e5 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/transformer/modules.py @@ -0,0 +1,304 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. +# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Various parts of Transformer architecture implemented as Pytorch nn.Modules. +Some parts of this code were adapted from the HuggingFace library at +https://github.com/huggingface/pytorch-pretrained-BERT +Some parts of this code were adapted from the Annotated Transformer at +http://nlp.seas.harvard.edu/2018/04/03/attention.html +Copyright by the HuggingFace and Annotated Transformer authors. 
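# For orientation, the FixedPositionalEncoding defined below follows the standard
# sinusoidal table from "Attention Is All You Need",
#   PE[pos, 2i]   = sin(pos / 10000^(2i / hidden_size))
#   PE[pos, 2i+1] = cos(pos / 10000^(2i / hidden_size)),
# with one detail worth noting: the whole table is divided by sqrt(hidden_size),
# so the positional signal is deliberately small relative to the token embeddings
# it is added to.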
+""" + + +import math +import torch +from torch import nn +from apex.normalization import FusedLayerNorm +from .utils import gelu + + +class FixedPositionalEncoding(nn.Module): + """ + Fixed positional encoding (embedding layer) from sine and cosine functions + of different frequencies according to https://arxiv.org/abs/1706.03762 + + Args: + hidden_size: size of the embeddings in the model, also known as d_model + max_sequence_length: maximum allowed length of the input sequence + """ + + def __init__(self, hidden_size, max_sequence_length=512): + super().__init__() + + pos_enc = torch.zeros(max_sequence_length, hidden_size) + position = torch.arange(0.0, max_sequence_length).unsqueeze(1) + coef = -math.log(10000.0) / hidden_size + div_term = torch.exp(coef * torch.arange(0.0, hidden_size, 2)) + pos_enc[:, 0::2] = torch.sin(position * div_term) + pos_enc[:, 1::2] = torch.cos(position * div_term) + pos_enc.div_(math.sqrt(hidden_size)) + self.register_buffer('pos_enc', pos_enc) + + def forward(self, position_ids): + return torch.embedding(self.pos_enc, position_ids) + + +class TransformerEmbedding(nn.Module): + """ + Embedding from token and position embeddings. + Optionally add token_type embedding (e.g. type of the sentence in BERT). + + Args: + vocab_size: size of the vocabulary + hidden_size: size of the embeddings in the model, also known as d_model + max_sequence_length: maximum allowed length of the input sequence + num_token_types: number of different token types + (e.g. tokens of sentence A and tokens of sentence B in BERT) + embedding_dropout: probability of dropout applied to embeddings + """ + + def __init__(self, vocab_size, hidden_size, max_sequence_length=512, + num_token_types=2, embedding_dropout=0.0, + learn_positional_encodings=False): + super().__init__() + + self.max_sequence_length = max_sequence_length + self.token_embedding = nn.Embedding( + vocab_size, hidden_size, padding_idx=0) + if learn_positional_encodings: + self.position_embedding = nn.Embedding( + max_sequence_length, hidden_size) + else: + self.position_embedding = FixedPositionalEncoding( + hidden_size, max_sequence_length) + self.token_type_embedding = nn.Embedding(num_token_types, hidden_size) + self.layer_norm = FusedLayerNorm(hidden_size, eps=1e-5) + self.dropout = nn.Dropout(embedding_dropout) + + def forward(self, input_ids, token_type_ids=None, start_pos=0): + seq_length = input_ids.size(1) + if seq_length > self.max_sequence_length: + raise ValueError("Input sequence is longer than maximum allowed" + " sequence length for positional encoding") + position_ids = torch.arange( + start=start_pos, end=start_pos+seq_length, + dtype=torch.long, device=input_ids.device) + position_ids = position_ids.unsqueeze(0).expand_as(input_ids) + + token_embeddings = self.token_embedding(input_ids) + position_embeddings = self.position_embedding(position_ids) + embeddings = token_embeddings + position_embeddings + + if token_type_ids is not None: + token_type_embeddings = self.token_type_embedding(token_type_ids) + embeddings = embeddings + token_type_embeddings + + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + + return embeddings + + +class MultiHeadAttention(nn.Module): + """ + Multi-head scaled dot-product attention layer. 
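# A shape-level sketch of the TransformerEmbedding defined above (placeholder
# sizes; apex is required for FusedLayerNorm). start_pos shifts the positional
# encoding, which is what the sequence generators rely on for cached decoding.
import torch
from nemo_nlp.transformer.modules import TransformerEmbedding

embedding = TransformerEmbedding(vocab_size=1000, hidden_size=256,
                                 max_sequence_length=128)
input_ids = torch.randint(1, 1000, (4, 10))           # B=4, L=10
states = embedding(input_ids)                         # -> (4, 10, 256)
next_step = embedding(input_ids[:, :1], start_pos=5)  # same tokens at position 5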
+ + Args: + hidden_size: size of the embeddings in the model, also known as d_model + num_attention_heads: number of heads in multi-head attention + attn_score_dropout: probability of dropout applied to attention scores + attn_layer_dropout: probability of dropout applied to the output of the + whole layer, but before layer normalization + """ + + def __init__(self, hidden_size, num_attention_heads, + attn_score_dropout=0.0, attn_layer_dropout=0.0): + super().__init__() + if hidden_size % num_attention_heads != 0: + raise ValueError( + "The hidden size (%d) is not a multiple of the number " + "of attention heads (%d)" % (hidden_size, num_attention_heads)) + self.hidden_size = hidden_size + self.num_attention_heads = num_attention_heads + self.attn_head_size = int(hidden_size / num_attention_heads) + self.attn_scale = math.sqrt(math.sqrt(self.attn_head_size)) + + self.query_net = nn.Linear(hidden_size, hidden_size) + self.key_net = nn.Linear(hidden_size, hidden_size) + self.value_net = nn.Linear(hidden_size, hidden_size) + self.out_projection = nn.Linear(hidden_size, hidden_size) + + self.attn_dropout = nn.Dropout(attn_score_dropout) + self.layer_dropout = nn.Dropout(attn_layer_dropout) + self.layer_norm = FusedLayerNorm(hidden_size, eps=1e-5) + + def transpose_for_scores(self, x): + new_x_shape = x.size()[:-1] + \ + (self.num_attention_heads, self.attn_head_size) + x = x.view(*new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward(self, queries, keys, values, attention_mask): + + # attention_mask is needed to hide the tokens which correspond to [PAD] + # in the case of BERT, or to hide the future tokens in the case of + # vanilla language modeling and translation + query = self.query_net(queries) + key = self.key_net(keys) + value = self.value_net(values) + query = self.transpose_for_scores(query) / self.attn_scale + key = self.transpose_for_scores(key) / self.attn_scale + value = self.transpose_for_scores(value) + + # for numerical stability we pre-divide query and key by sqrt(sqrt(d)) + # and perform attention probs computation in float32 + attention_scores = torch.matmul(query, key.transpose(-1, -2)).float() + if attention_mask is not None: + attention_scores = attention_scores + attention_mask.float() + attention_probs = torch.softmax(attention_scores, dim=-1).to(key.dtype) + attention_probs = self.attn_dropout(attention_probs) + + context = torch.matmul(attention_probs, value) + context = context.permute(0, 2, 1, 3).contiguous() + new_context_shape = context.size()[:-2] + (self.hidden_size, ) + context = context.view(*new_context_shape) + + # output projection + output_states = self.out_projection(context) + output_states = self.layer_dropout(output_states) + output_states = self.layer_norm(queries + output_states) + + return output_states + + +class LightweightConv1d(nn.Module): + """ + Lightweight convolution layer from https://arxiv.org/abs/1901.10430 + + Args: + hidden_size: size of the embeddings in the model, also known as d_model + num_heads: number of heads in lightweight convolution + kernel_size: convolution kernel size + conv_weight_dropout: probability of dropout applied to the convolution + kernel (strictly speaking, DropConnect) + conv_layer_dropout: probability of dropout applied to the output of the + whole layer, but before layer normalization + """ + + def __init__(self, hidden_size, num_attention_heads, kernel_size, + conv_weight_dropout=0.0, conv_layer_dropout=0.0): + super().__init__() + self.num_heads = num_attention_heads + self.kernel_size = kernel_size + 
self.weight = nn.Parameter( + torch.Tensor(num_attention_heads, 1, kernel_size)) + self.in_projection = nn.Linear(hidden_size, hidden_size) + self.out_projection = nn.Linear(hidden_size, hidden_size) + + self.conv_weight_dropout = nn.Dropout(conv_weight_dropout) + self.conv_layer_dropout = nn.Dropout(conv_layer_dropout) + self.layer_norm = FusedLayerNorm(hidden_size, eps=1e-5) + + def forward(self, hidden_states, attention_mask): + batch_size, seq_len, hidden_size = hidden_states.size() + output_states = self.in_projection(hidden_states) + output_states = output_states.permute(0, 2, 1) + + weight = torch.softmax(self.weight, dim=-1) + weight = self.conv_weight_dropout(weight) + + if attention_mask: + pivot = self.kernel_size // 2 + 1 + weight[:, :, pivot:] = 0 + + output_states = output_states.contiguous().view( + -1, self.num_heads, seq_len) + output_states = torch.conv1d(output_states, + weight, + padding=self.kernel_size // 2, + groups=self.num_heads) + output_states = output_states.view(batch_size, hidden_size, seq_len) + output_states = output_states.permute(0, 2, 1) + + # output projection + output_states = self.out_projection(output_states) + output_states = self.conv_layer_dropout(output_states) + output_states = self.layer_norm(hidden_states + output_states) + + return output_states + + +class TwoStreamSelfAttention(nn.Module): + """ + Two-Stream Self-Attention layer from https://arxiv.org/abs/1906.08237 + + Args: + hidden_size: size of the embeddings in the model, also known as d_model + num_attention_heads: number of heads in multi-head attention + attn_score_dropout: probability of dropout applied to attention scores + attn_layer_dropout: probability of dropout applied to the output of the + whole layer, but before layer normalization + """ + + def __init__(self, hidden_size, num_attention_heads, + attn_score_dropout=0.0, attn_layer_dropout=0.0): + super().__init__() + self.query_stream = MultiHeadAttention( + hidden_size, num_attention_heads, + attn_score_dropout, attn_layer_dropout) + self.content_stream = MultiHeadAttention( + hidden_size, num_attention_heads, + attn_score_dropout, attn_layer_dropout) + + def forward(self, query_states, content_states, + query_attention_mask, content_attention_mask): + output_query_states = self.query_stream( + query_states, content_states, content_states, query_attention_mask) + output_content_states = self.content_stream( + query_states, content_states, + content_states, content_attention_mask) + return output_query_states, output_content_states + + +class PositionWiseFF(nn.Module): + """ + Position-wise feed-forward network of Transformer block. 
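# A self-attention sketch for the MultiHeadAttention layer above. With
# hidden_size=256 and 8 heads, each head works on 256 / 8 = 32 channels, and both
# query and key are pre-scaled by 32**0.25 so the float32 score computation stays
# numerically stable. Sizes are placeholders; apex is required for FusedLayerNorm.
import torch
from nemo_nlp.transformer.modules import MultiHeadAttention
from nemo_nlp.transformer.utils import form_attention_mask

attention = MultiHeadAttention(hidden_size=256, num_attention_heads=8)
states = torch.randn(2, 7, 256)                        # B=2, L=7, H=256
attn_mask = form_attention_mask(torch.ones(2, 7))      # no padding, no future masking
output = attention(states, states, states, attn_mask)  # -> (2, 7, 256)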
+ + Args: + hidden_size: size of the embeddings in the model, also known as d_model + inner_size: number of neurons in the intermediate part of feed-forward + net, usually is (4-8 x hidden_size) in the papers + fully_connected_dropout: probability of dropout applied to net output + hidden_act: activation function used between two linear layers + """ + + def __init__(self, hidden_size, inner_size, + fully_connected_dropout=0.0, hidden_act="relu"): + super().__init__() + self.dense_in = nn.Linear(hidden_size, inner_size) + self.dense_out = nn.Linear(inner_size, hidden_size) + self.layer_dropout = nn.Dropout(fully_connected_dropout) + self.layer_norm = FusedLayerNorm(hidden_size, eps=1e-5) + ACT2FN = {"gelu": gelu, "relu": torch.relu} + self.act_fn = ACT2FN[hidden_act] + + def forward(self, hidden_states): + output_states = self.dense_in(hidden_states) + output_states = self.act_fn(output_states) + output_states = self.dense_out(output_states) + output_states = self.layer_dropout(output_states) + output_states = self.layer_norm(hidden_states + output_states) + return output_states diff --git a/collections/nemo_nlp/nemo_nlp/transformer/softmax_layers.py b/collections/nemo_nlp/nemo_nlp/transformer/softmax_layers.py new file mode 100644 index 000000000000..9f6a1ea2e987 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/transformer/softmax_layers.py @@ -0,0 +1,39 @@ +import torch +from torch import nn + + +class TransformerLogSoftmax(nn.Module): + """ + Output layer of Transformer architecture which approximates probability + distribution over *vocab_size* output tokens. + """ + + def __init__(self, vocab_size, hidden_size): + super().__init__() + self.dense = nn.Linear(hidden_size, vocab_size) + + def forward(self, hidden_states): + output_states = self.dense(hidden_states).float() + log_probs = torch.log_softmax( + output_states, dim=-1).to(hidden_states.dtype) + return log_probs + + +class ClassificationLogSoftmax(nn.Module): + """ + Classifier on top of the hidden representation of the first token, which + is usually [CLS] token in BERT-like architectures. + """ + + def __init__(self, hidden_size, num_classes): + super().__init__() + self.dense1 = nn.Linear(hidden_size, hidden_size) + self.dense2 = nn.Linear(hidden_size, num_classes) + + def forward(self, hidden_states): + output_states = self.dense1(hidden_states[:, 0]) + output_states = torch.tanh(output_states) + output_states = self.dense2(output_states).float() + log_probs = torch.log_softmax( + output_states, dim=-1).to(hidden_states.dtype) + return log_probs diff --git a/collections/nemo_nlp/nemo_nlp/transformer/utils.py b/collections/nemo_nlp/nemo_nlp/transformer/utils.py new file mode 100644 index 000000000000..86a33d1614a4 --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/transformer/utils.py @@ -0,0 +1,69 @@ +import math +import torch +import torch.nn as nn + +NEG_INF = -10000.0 + + +def gelu(x): + return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0))) + + +def mask_padded_tokens(tokens, pad_id): + mask = (tokens != pad_id) + return mask + + +def form_attention_mask(input_mask, diagonal=None): + """ + Build attention mask with optional masking of future tokens we forbid + to attend to (e.g. as it is in Transformer decoder). 
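# A worked example of the mask built below: for a single sequence with
# input_mask = [1, 1, 0] (last token is padding) and diagonal=0, the returned
# 1 x 1 x 3 x 3 additive mask has [query, key] entries
#     [[     0, -10000, -10000],
#      [     0,      0, -10000],
#      [     0,      0, -10000]],
# i.e. 0 wherever attention is allowed (earlier, non-padding keys) and -10000
# wherever it is forbidden; the mask is simply added to the raw attention scores.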
+ + Args: + input_mask: binary mask of size B x L with 1s corresponding to valid + tokens and 0s corresponding to padding tokens + diagonal: diagonal where triangular future mask starts + None -- do not mask anything + 0 -- regular translation or language modeling future masking + 1 -- query stream masking as in XLNet architecture + Returns: + attention_mask: mask of size B x 1 x L x L with 0s corresponding to + tokens we plan to attend to and -10000 otherwise + """ + + if input_mask is None: + return None + attn_shape = (1, input_mask.shape[1], input_mask.shape[1]) + attn_mask = input_mask.byte().unsqueeze(1) + if diagonal is not None: + future_mask = torch.tril( + torch.ones(attn_shape).byte().to(input_mask.device), diagonal) + attn_mask = attn_mask & future_mask + attention_mask = (1 - attn_mask.to(input_mask.dtype)) * NEG_INF + return attention_mask.unsqueeze(1) + + +def transformer_weights_init(module, std_init_range=0.02, xavier=True): + """ + Initialize different weights in Transformer model. + + Args: + module: torch.nn.Module to be initialized + std_init_range: standard deviation of normal initializer + xavier: if True, xavier initializer will be used in Linear layers + as was proposed in AIAYN paper, otherwise normal initializer + will be used (like in BERT paper) + """ + + if isinstance(module, nn.Linear): + if xavier: + nn.init.xavier_uniform_(module.weight) + else: + nn.init.normal_(module.weight, mean=0.0, std=std_init_range) + if module.bias is not None: + nn.init.constant_(module.bias, 0.0) + elif isinstance(module, nn.Embedding): + nn.init.normal_(module.weight, mean=0.0, std=std_init_range) + elif isinstance(module, nn.LayerNorm): + nn.init.constant_(module.weight, 1.0) + nn.init.constant_(module.bias, 0.0) diff --git a/collections/nemo_nlp/nemo_nlp/transformer_nm.py b/collections/nemo_nlp/nemo_nlp/transformer_nm.py new file mode 100644 index 000000000000..b482e768c0db --- /dev/null +++ b/collections/nemo_nlp/nemo_nlp/transformer_nm.py @@ -0,0 +1,394 @@ +# Copyright (c) 2019 NVIDIA Corporation +""" +This package contains Transformer for translation Neural Module +""" +import math +from nemo.backends.pytorch.nm import TrainableNM, LossNM +from nemo.core.neural_types import * +from .transformer import TransformerEmbedding, TransformerEncoder, \ + TransformerDecoder, TransformerLogSoftmax, SmoothedCrossEntropyLoss, \ + GreedySequenceGenerator, BeamSearchSequenceGenerator +from .transformer.utils import mask_padded_tokens, transformer_weights_init + + +class TransformerEncoderNM(TrainableNM): + """ + Neural module which consists of embedding layer followed by Transformer + encoder. 
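# A short sketch of how transformer_weights_init above is meant to be used: it is
# applied recursively through nn.Module.apply, which is exactly what the neural
# modules below do with std_init_range = 1 / sqrt(d_model). Sizes are placeholders.
import math
from nemo_nlp.transformer import TransformerEncoder
from nemo_nlp.transformer.utils import transformer_weights_init

d_model = 256
encoder = TransformerEncoder(num_layers=2, hidden_size=d_model,
                             inner_size=1024, num_attention_heads=4)
encoder.apply(lambda module: transformer_weights_init(
    module, std_init_range=1 / math.sqrt(d_model)))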
+ + Args: + vocab_size: size of the vocabulary (number of tokens) + hidden_size: hidden size (d_model) of the Transformer + max_sequence_length: maximum allowed length of input sequences, feeding + longer sequences will cause an error + embedding_dropout: dropout ratio applied to embeddings + learn_positional_encodings: bool, whether to learn positional encoding + or use fixed sinusoidal encodings + num_layers: number of layers in Transformer encoder + mask_future: bool, whether to apply triangular future masking to the + sequence of hidden states (which allows to use it for LM) + first_sub_layer: type of the first sublayer, surrently only + self_attention and lightweight_conv are supported + num_attention_heads: number of attention heads + inner_size: number of neurons in the intermediate part of + fully-connected network (second_sub_layer) + fully_connected_dropout: dropout ratio applied to FFN + attn_score_dropout: dropout ratio applied to attention scores + attn_layer_dropout: dropout ratio applied to the output of attn layer + conv_kernel_size: convolution kernel size in lightweight_conv + conv_weight_dropout: dropout ratio applied to the convolution kernel + conv_layer_dropout: dropout ratio applied to the output of conv layer + """ + + @staticmethod + def create_ports(): + input_ports = { + "input_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask_src": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + } + + output_ports = { + "hidden_states": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }) + } + return input_ports, output_ports + + def __init__(self, **kwargs): + TrainableNM.__init__(self, **kwargs) + + params = self.local_parameters + embedding_params = { + "vocab_size": params["vocab_size"], + "hidden_size": params["d_model"], + "max_sequence_length": params["max_seq_length"], + "embedding_dropout": params.get("embedding_dropout", 0), + "learn_positional_encodings": + params.get("learn_positional_encodings", False) + } + backbone_params = { + "num_layers": params["num_layers"], + "hidden_size": params["d_model"], + "mask_future": params.get("mask_future", False), + "num_attention_heads": params["num_attn_heads"], + "inner_size": params["d_inner"], + "ffn_dropout": params.get("fully_connected_dropout", 0), + "hidden_act": params.get("hidden_act", "relu"), + "attn_score_dropout": params.get("attn_score_dropout", 0), + "attn_layer_dropout": params.get("attn_layer_dropout", 0) + } + + self.embedding_layer = TransformerEmbedding(**embedding_params) + self.encoder = TransformerEncoder(**backbone_params) + + std_init_range = 1 / math.sqrt(params["d_model"]) + self.apply( + lambda module: transformer_weights_init(module, std_init_range)) + self.to(self._device) + + def forward(self, input_ids, input_mask_src): + hidden_states = self.embedding_layer(input_ids) + hidden_states = self.encoder(hidden_states, input_mask_src) + return hidden_states + + +class TransformerDecoderNM(TrainableNM): + @staticmethod + def create_ports(): + input_ports = { + "input_ids_tgt": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "hidden_states_src": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + "input_mask_src": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + "input_mask_tgt": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + } + + output_ports = { + "hidden_states": + NeuralType({ + 0: 
AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }) + } + return input_ports, output_ports + + def __init__(self, **kwargs): + TrainableNM.__init__(self, **kwargs) + + params = self.local_parameters + embedding_params = { + "vocab_size": params["vocab_size"], + "hidden_size": params["d_model"], + "max_sequence_length": params["max_seq_length"], + "embedding_dropout": params.get("embedding_dropout", 0), + "learn_positional_encodings": + params.get("learn_positional_encodings", False) + } + backbone_params = { + "num_layers": params["num_layers"], + "hidden_size": params["d_model"], + "num_attention_heads": params["num_attn_heads"], + "inner_size": params["d_inner"], + "ffn_dropout": params.get("fully_connected_dropout", 0), + "hidden_act": params.get("hidden_act", "relu"), + "attn_score_dropout": params.get("attn_score_dropout", 0), + "attn_layer_dropout": params.get("attn_layer_dropout", 0) + } + + self.embedding_layer = TransformerEmbedding(**embedding_params) + self.decoder = TransformerDecoder(**backbone_params) + + std_init_range = 1 / math.sqrt(params["d_model"]) + self.apply( + lambda module: transformer_weights_init(module, std_init_range)) + self.to(self._device) + + def forward(self, input_ids_tgt, hidden_states_src, input_mask_src, + input_mask_tgt): + hidden_states_tgt = self.embedding_layer(input_ids_tgt) + hidden_states = self.decoder( + hidden_states_tgt, input_mask_tgt, + hidden_states_src, input_mask_src) + return hidden_states + + +class TransformerLogSoftmaxNM(TrainableNM): + @staticmethod + def create_ports(): + input_ports = { + "hidden_states": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + } + + output_ports = { + "log_probs": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + } + return input_ports, output_ports + + def __init__(self, *, vocab_size, d_model, **kwargs): + TrainableNM.__init__(self, **kwargs) + + self.log_softmax = TransformerLogSoftmax( + vocab_size=vocab_size, + hidden_size=d_model) + + self.log_softmax.apply(transformer_weights_init) + self.log_softmax.to(self._device) + + def forward(self, hidden_states): + log_probs = self.log_softmax(hidden_states) + return log_probs + + +class GreedyLanguageGeneratorNM(TrainableNM): + """ + Neural module for greedy text generation with language model + + Args: + decoder: module which maps input_ids into hidden_states + log_softmax: module which maps hidden_states into log_probs + max_sequence_length: maximum allowed length of generated sequences + pad: index of padding token in the vocabulary + bos: index of beginning of sequence token in the vocabulary + eos: index of end of sequence token in the vocabulary + device: torch.device to conduct generation on + batch_size: size of the batch of generated sequences if no starting + tokens are provided + """ + + @staticmethod + def create_ports(): + input_ports = { + "input_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }) + } + + output_ports = { + "output_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }) + } + return input_ports, output_ports + + def __init__(self, decoder, log_softmax, **kwargs): + TrainableNM.__init__(self, **kwargs) + + generator_params = { + "max_sequence_length": self.local_parameters["max_seq_length"], + "pad": self.local_parameters["pad_token"], + "bos": self.local_parameters["bos_token"], + "eos": self.local_parameters["eos_token"], + "batch_size": 
self.local_parameters.get("batch_size", 1) + } + self.generator = GreedySequenceGenerator( + decoder, log_softmax, **generator_params) + + @property + def num_weights(self): + return 0 + + def forward(self, input_ids): + output_ids = self.generator(decoder_input_ids=input_ids) + return output_ids + + +class BeamSearchTranslatorNM(TrainableNM): + """ + Neural module for beam search translation generation + + Args: + decoder: module which maps input_ids into hidden_states + log_softmax: module which maps hidden_states into log_probs + max_sequence_length: maximum allowed length of generated sequences + pad: index of padding token in the vocabulary + bos: index of beginning of sequence token in the vocabulary + eos: index of end of sequence token in the vocabulary + device: torch.device to conduct generation on + batch_size: size of the batch of generated sequences if no starting + tokens are provided + beam_size: size of the beam + len_pen: parameter which penalizes shorter sequences + """ + + @staticmethod + def create_ports(): + input_ports = { + "hidden_states_src": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + "input_mask_src": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }) + } + + output_ports = { + "output_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }) + } + return input_ports, output_ports + + @property + def num_weights(self): + return 0 + + def __init__(self, decoder, log_softmax, **kwargs): + TrainableNM.__init__(self, **kwargs) + + params = self.local_parameters + generator_params = { + "max_sequence_length": params["max_seq_length"], + "max_delta_length": params.get("max_delta_length", 50), + "pad": params["pad_token"], + "bos": params["bos_token"], + "eos": params["eos_token"], + "batch_size": params.get("batch_size", 1), + "beam_size": params.get("beam_size", 4), + "len_pen": params.get("length_penalty", 0) + } + self.generator = BeamSearchSequenceGenerator( + decoder.embedding_layer, decoder.decoder, log_softmax, + **generator_params) + + def forward(self, hidden_states_src, input_mask_src): + output_ids = self.generator( + encoder_hidden_states=hidden_states_src, + encoder_input_mask=input_mask_src) + return output_ids + + +class PaddedSmoothedCrossEntropyLossNM(LossNM): + """ + Neural module which calculates CrossEntropyLoss and + 1) excludes padding tokens from loss calculation + 2) allows to use label smoothing regularization + 3) allows to calculate loss for the desired number of last tokens + + Args: + label_smoothing: label smoothing regularization coefficient + predict_last_k: how many last tokens to use for the loss calculation + """ + + @staticmethod + def create_ports(): + input_ports = { + "log_probs": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag), + 2: AxisType(ChannelTag) + }), + "target_ids": + NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(TimeTag) + }), + } + + output_ports = {"loss": NeuralType(None)} + return input_ports, output_ports + + def __init__(self, **kwargs): + LossNM.__init__(self, **kwargs) + + loss_params = { + "label_smoothing": self.local_parameters.get("label_smoothing", 0), + "predict_last_k": self.local_parameters.get("predict_last_k", 0) + } + self._loss_fn = SmoothedCrossEntropyLoss(**loss_params) + self._pad_id = self.local_parameters['pad_id'] + + def _loss_function(self, log_probs, target_ids): + target_mask = mask_padded_tokens( + target_ids, self._pad_id).to(log_probs.dtype) + loss = self._loss_fn(log_probs, 
target_ids, target_mask) + return loss diff --git a/collections/nemo_nlp/setup.py b/collections/nemo_nlp/setup.py new file mode 100644 index 000000000000..19c3d08a59a4 --- /dev/null +++ b/collections/nemo_nlp/setup.py @@ -0,0 +1,31 @@ +import setuptools + +with open("README.md", "r") as fh: + long_description = fh.read() + +setuptools.setup( + name="nemo_nlp", + version="0.3", + author="NVIDIA", + author_email="okuchaiev@nvidia.com", + description="Collection of Neural Modules for Natural Language Processing", + long_description=long_description, + long_description_content_type="text/markdown", + url="https://github.com/nvidia/nemo", + packages=setuptools.find_packages(), + classifiers=[ + "Programming Language :: Python :: 3", + "Operating System :: OS Independent", + "License :: OSI Approved :: Apache License 2.0" + ], + install_requires=[ + 'nemo_toolkit', + 'torchtext', + 'sentencepiece', + 'boto3', + 'unidecode', + 'pytorch-transformers', + 'matplotlib', + 'youtokentome' + ] +) diff --git a/collections/nemo_simple_gan/LICENSE b/collections/nemo_simple_gan/LICENSE new file mode 100644 index 000000000000..261eeb9e9f8b --- /dev/null +++ b/collections/nemo_simple_gan/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
diff --git a/collections/nemo_simple_gan/README.md b/collections/nemo_simple_gan/README.md new file mode 100644 index 000000000000..9dad44dd3e04 --- /dev/null +++ b/collections/nemo_simple_gan/README.md @@ -0,0 +1 @@ +write me \ No newline at end of file diff --git a/collections/nemo_simple_gan/nemo_simple_gan/__init__.py b/collections/nemo_simple_gan/nemo_simple_gan/__init__.py new file mode 100644 index 000000000000..8a194e30fbc2 --- /dev/null +++ b/collections/nemo_simple_gan/nemo_simple_gan/__init__.py @@ -0,0 +1,7 @@ +# Copyright (c) 2019 NVIDIA Corporation +from .gan import * + +from nemo.core import Backend + +name = "nemo_simple_gan" +backend = Backend.PyTorch diff --git a/collections/nemo_simple_gan/nemo_simple_gan/gan.py b/collections/nemo_simple_gan/nemo_simple_gan/gan.py new file mode 100644 index 000000000000..41cd1720bfe1 --- /dev/null +++ b/collections/nemo_simple_gan/nemo_simple_gan/gan.py @@ -0,0 +1,322 @@ +# Copyright (c) 2019 NVIDIA Corporation +"""A collection of Neural Modules to be used for training a WGAN-GP on MNIST""" +import torch +from torch.utils.data import Dataset +from torchvision import transforms, datasets + +from nemo.backends.pytorch.nm import TrainableNM, NonTrainableNM, LossNM,\ + DataLayerNM +from nemo.core import NeuralType, BatchTag, ChannelTag, HeightTag, WidthTag,\ + AxisType, DeviceType + + +class SimpleDiscriminator(TrainableNM): + """Simple convolutional discrimnator that takes in a 28x28 greyscale image + and assigns a score to it. + """ + @staticmethod + def create_ports(): + input_ports = { + "image": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag), + 2: AxisType(HeightTag, 28), + 3: AxisType(WidthTag, 28)}) + } + output_ports = { + "decision": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag, 1)}) + } + return input_ports, output_ports + + def __init__(self, **kwargs): + super().__init__(**kwargs) + self.layers = torch.nn.Sequential( + torch.nn.Conv2d(1, 64, 3, padding=1), + torch.nn.ReLU(), + torch.nn.Conv2d(64, 128, 3, stride=2, padding=1), + torch.nn.ReLU(), + torch.nn.Conv2d(128, 128, 3, stride=2, padding=1), + torch.nn.ReLU(), + torch.nn.Conv2d(128, 256, 3, stride=2, padding=1), + torch.nn.ReLU(), + ) + self.fc_layer = torch.nn.Linear(256*4*4, 1) + self.to(self._device) + + def forward(self, image): + decision = self.layers(image) + decision = decision.view(-1, 256*4*4) + decision = self.fc_layer(decision) + return decision + + +class SimpleGenerator(TrainableNM): + """Simple convolutional generator that takes a random variable of size + (64, 4, 4) and produces a 28x28 greyscale image. 
+ """ + @staticmethod + def create_ports(): + input_ports = { + "latents": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag, 64), + 2: AxisType(HeightTag, 4), + 3: AxisType(WidthTag, 4)}) + } + output_ports = { + "image": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag), + 2: AxisType(HeightTag, 28), + 3: AxisType(WidthTag, 28)}) + } + return input_ports, output_ports + + def __init__(self, **kwargs): + super().__init__(**kwargs) + self.layers = torch.nn.Sequential( + torch.nn.ConvTranspose2d(64, 128, 3, stride=2), + torch.nn.ReLU(), + torch.nn.ConvTranspose2d(128, 128, 3, stride=2), + torch.nn.ReLU(), + torch.nn.ConvTranspose2d(128, 128, 3, stride=2), + torch.nn.ReLU(), + torch.nn.Conv2d(128, 1, 12), + torch.nn.Sigmoid(), + ) + self.to(self._device) + + def forward(self, latents): + image = latents + for layer in self.layers: + image = layer(image) + return image + + +class DiscriminatorLoss(LossNM): + """Computes the loss from a disciminator score by simply taking the mean + of all scores in a batch. + + Args: + neg (bool): Whether to negate the final loss + """ + @staticmethod + def create_ports(): + input_ports = { + "decision": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag, 1)}), + } + + output_ports = {"loss": NeuralType(None)} + return input_ports, output_ports + + def __init__(self, neg=False, **kwargs): + super().__init__(**kwargs) + self.neg = neg + + def _loss(self, decision): + if self.neg: + return -torch.mean(decision) + return torch.mean(decision) + + def _loss_function(self, **kwargs): + return self._loss(*(kwargs.values())) + + +class GradientPenalty(LossNM): + """Compute the gradient penalty of the disciminator + + Args: + lambda_ (float): lambda parameter indicating the weight of the loss. + """ + @staticmethod + def create_ports(): + input_ports = { + "interpolated_image": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag), + 2: AxisType(HeightTag, 28), + 3: AxisType(WidthTag, 28)}), + "interpolated_decision": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag, 1)}), + } + + output_ports = {"loss": NeuralType(None)} + return input_ports, output_ports + + def __init__(self, lambda_, **kwargs): + super().__init__(**kwargs) + self.lambda_ = lambda_ + + def _loss(self, interpolated_image, interpolated_decision): + grad_outputs = torch.ones( + interpolated_decision.size(), dtype=interpolated_image.dtype) + if self.placement != DeviceType.CPU: + grad_outputs = grad_outputs.cuda() + gradients = torch.autograd.grad( + outputs=interpolated_decision, inputs=interpolated_image, + grad_outputs=grad_outputs, + create_graph=True, retain_graph=True, only_inputs=True)[0] + gradients = gradients.view(gradients.size(0), -1) + + gradient_penalty = ((gradients.norm(2, dim=1) - 1) ** 2).mean() + return self.lambda_*gradient_penalty + + def _loss_function(self, **kwargs): + return self._loss(**kwargs) + + +class InterpolateImage(NonTrainableNM): + """Linearly interpolates an image between image1 and image2 + """ + @staticmethod + def create_ports(): + input_ports = { + "image1": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag), + 2: AxisType(HeightTag, 28), + 3: AxisType(WidthTag, 28)}), + "image2": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag), + 2: AxisType(HeightTag, 28), + 3: AxisType(WidthTag, 28)}) + } + + output_ports = { + "interpolated_image": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag), + 2: AxisType(HeightTag, 28), + 3: AxisType(WidthTag, 28)}) + } + return input_ports, output_ports 
+ + def __init__(self, **kwargs): + super().__init__(**kwargs) + + def forward(self, image1, image2): + alpha = torch.rand(image1.shape[0], 1).unsqueeze(-1).unsqueeze(-1) + alpha = alpha.to(self._device) + interpolated_image = alpha * image1 + ((1 - alpha) * image2) + return torch.autograd.Variable(interpolated_image, requires_grad=True) + + +class RandomDataLayer(DataLayerNM): + """Dummy data layer for return random variables to be used in the generator + + Args: + batch_size (int) + """ + + @staticmethod + def create_ports(): + input_ports = {} + output_ports = { + "latent": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag, 64), + 2: AxisType(HeightTag, 4), + 3: AxisType(WidthTag, 4)}) + } + return input_ports, output_ports + + def __init__( + self, *, + batch_size, + **kwargs + ): + DataLayerNM.__init__(self, **kwargs) + self._batch_size = batch_size + + class DummyDataset(torch.utils.data.Dataset): + def __init__(self, batch_size): + super().__init__() + self._batch_size = batch_size + + def __getitem__(self, i): + return torch.randn(64, 4, 4) + + def __len__(self): + return self._batch_size*2 + self._dataset = DummyDataset(batch_size) + + def __len__(self): + return self._dataset.__len__() + + @property + def dataset(self): + return self._dataset + + @property + def data_iterator(self): + return None + + +class MnistGanDataLayer(DataLayerNM): + """Wrapper around torchvision's MNIST dataset. Additionally, it returns a + random variable to be used in the generator. + + Args: + batch_size (int) + root (str): Where to store the dataset + train (bool) + shuffle (bool) + """ + + @staticmethod + def create_ports(input_size=(32, 32)): + input_ports = {} + output_ports = { + "latent": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag, 64), + 2: AxisType(HeightTag, 4), + 3: AxisType(WidthTag, 4)}), + "image": NeuralType({0: AxisType(BatchTag), + 1: AxisType(ChannelTag), + 2: AxisType(HeightTag, input_size[1]), + 3: AxisType(WidthTag, input_size[0])}), + "label": NeuralType({0: AxisType(BatchTag)}) + } + return input_ports, output_ports + + def __init__( + self, *, + batch_size, + root, + train=True, + shuffle=True, + **kwargs + ): + self._input_size = (28, 28) + create_port_args = {"input_size": self._input_size} + DataLayerNM.__init__(self, create_port_args=create_port_args, **kwargs) + + self._batch_size = batch_size + self._train = train + self._shuffle = shuffle + self._root = root + self._transforms = transforms.Compose([transforms.ToTensor()]) + + self._dataset = datasets.MNIST(root=self._root, train=self._train, + download=True, + transform=self._transforms) + + class DatasetWrapper(Dataset): + def __init__(self, dataset): + super().__init__() + self._dataset = dataset + + def __getitem__(self, index): + latents = torch.randn(64, 4, 4) + items = self._dataset.__getitem__(index) + return latents, items[0], items[1] + + def __len__(self): + return self._dataset.__len__() + self._dataset = DatasetWrapper(self._dataset) + + def __len__(self): + return len(self._dataset) + + @property + def dataset(self): + return self._dataset + + @property + def data_iterator(self): + return None diff --git a/collections/nemo_simple_gan/setup.py b/collections/nemo_simple_gan/setup.py new file mode 100644 index 000000000000..e9d3ceefe98f --- /dev/null +++ b/collections/nemo_simple_gan/setup.py @@ -0,0 +1,26 @@ +import setuptools + +with open("README.md", "r") as fh: + long_description = fh.read() + +setuptools.setup( + name="nemo_simple_gan", + version="0.3", + author="NVIDIA", + 
author_email="jasoli@nvidia.com", + description="Collection of Neural Modules for GANs", + long_description=long_description, + long_description_content_type="text/markdown", + url="https://github.com/nvidia/nemo", + packages=setuptools.find_packages(), + classifiers=[ + "Programming Language :: Python :: 3", + "Operating System :: OS Independent", + "License :: OSI Approved :: Apache License 2.0" + ], + install_requires=[ + 'nemo_toolkit', + 'torchvision', + 'matplotlib' + ] +) diff --git a/docs/_images/interactive_translation.png b/docs/_images/interactive_translation.png new file mode 100644 index 000000000000..edda929388cf Binary files /dev/null and b/docs/_images/interactive_translation.png differ diff --git a/docs/_modules/index.html b/docs/_modules/index.html index 52ce6c47fbd6..6d8993ce6b91 100644 --- a/docs/_modules/index.html +++ b/docs/_modules/index.html @@ -88,8 +88,9 @@
  • Getting started
  • Fast Training
  • Speech Recognition
  • -
  • NEMO Collections API
  • -
  • NEMO API
  • +
  • Natural Language Processing
  • +
  • NeMo Collections API
  • +
  • NeMo API
  • Frequently Asked Questions (FAQ)
  • diff --git a/docs/_modules/nemo/backends/pytorch/actions.html b/docs/_modules/nemo/backends/pytorch/actions.html index 45a61a6e1fea..55aab2a78a42 100644 --- a/docs/_modules/nemo/backends/pytorch/actions.html +++ b/docs/_modules/nemo/backends/pytorch/actions.html @@ -88,8 +88,9 @@
  • Getting started
  • Fast Training
  • Speech Recognition
  • -
  • NEMO Collections API
  • -
  • NEMO API
  • +
  • Natural Language Processing
  • +
  • NeMo Collections API
  • +
  • NeMo API
  • Frequently Asked Questions (FAQ)
  • @@ -156,28 +157,27 @@

    Source code for nemo.backends.pytorch.actions

     # Copyright (c) 2019 NVIDIA Corporation
     import itertools
    -import os
     import logging
    +import os
    +from typing import List, Optional
    +
     import torch
     import torch.distributed as dist
     import torch.nn as nn
     import torch.optim as optim
    +from nemo.backends.pytorch.nm import TrainableNM
     
    -from typing import List, Optional, Dict, Set
     from .module_wrapper import TrainableNeuralModuleWrapper
     from .nm import DataLayerNM
     from .optimizers import Novograd, AdamW, Lamb
    -from ...core import NmTensor, DeviceType
    +from ...core import NmTensor, DeviceType, NeuralModule
     from ...core.callbacks import (
         ActionCallback,
         EvaluatorCallback,
         SimpleLossLoggerCallback,
    -    ModuleSaverCallback,
    -    CheckpointCallback,
     )
     from ...core.neural_factory import Actions, ModelMode, Optimization
     from ...utils.helpers import get_checkpoint_from_dir
    -from nemo.core.callbacks import ValueSetterCallback
     
     try:
         import apex
    @@ -195,36 +195,29 @@ 

    Source code for nemo.backends.pytorch.actions

    Optimization.mxprO3: "O3", } -_float_2_half_req = {Optimization.mxprO1, Optimization.mxprO2, +_float_2_half_req = {Optimization.mxprO1, + Optimization.mxprO2, Optimization.mxprO3} -def _add_uuid_2_name(name, uuid): - return name + "~~~" + uuid - - -def _remove_uuid_from_name(name): - return name[: name.index("~~~")] - - -def _filter_dict(d: Dict, keys: Set) -> Set: - res = {} - for k, v in d.items(): - if k in keys: - res[_remove_uuid_from_name(k)] = v - return res - -
    [docs]class PtActions(Actions): - def __init__(self, params, local_rank=None, tb_writer=None): - super(PtActions, self).__init__(params=params, local_rank=local_rank) + def __init__(self, local_rank=None, tb_writer=None, + optimization_level=Optimization.mxprO0): + super(PtActions, self).__init__( + local_rank=local_rank, + optimization_level=optimization_level) # will be [unique_instance_id -> (NMModule, PTModule)] self.module_reference_table = {} self.step = 0 self.epoch_num = 0 - self.optimizer = None + self.optimizers = [] self.tb_writer = tb_writer + self._modules = set() + + @property + def modules(self): + return self._modules def __get_top_sorted_modules_and_dataloader(self, hook): """ @@ -235,7 +228,7 @@

    Source code for nemo.backends.pytorch.actions

    in DAG Returns: - list of modules with their call arguments and dataset + list of modules with their call arguments and outputs, and dataset """ def create_node(producer, producer_args): @@ -263,41 +256,94 @@

    Source code for nemo.backends.pytorch.actions

    else: hooks = hook + # ensures that no tensors are processed twice + processed_nmtensors = set() + + indices_to_remove = [] + # Check for duplicates in hook + for i, nmtensor in enumerate(hook): + if nmtensor in processed_nmtensors: + indices_to_remove.append(i) + else: + processed_nmtensors.add(nmtensor) + + for i in reversed(indices_to_remove): + hook.pop(i) + _top_sorted_modules = [] - all_nodes = set() + all_nodes = {} # extract all nodes to all_nodes set hooks_lst = list(hooks) while len(hooks_lst) > 0: - # take hook from the end of the list - hook = hooks_lst.pop() - node = create_node(hook.producer, hook.producer_args) - all_nodes.add(node) - if hook.producer_args is not None and hook.producer_args != {}: - for _, nmtensor in hook.producer_args.items(): - hooks_lst.insert(0, nmtensor) - - while len(all_nodes) > 0: - for node in all_nodes.copy(): + # take nmtensor from the end of the list + nmtensor = hooks_lst.pop() + node = create_node(nmtensor.producer, nmtensor.producer_args) + # Store nmtensor as an output of its producer + # first make sure all keys are present per output port + # and nm is inside all_nodes + if node not in all_nodes: + all_nodes[node] = { + k: None for k in nmtensor.producer._output_ports} + # second, populate output port with current nmtensor + # where applicable + all_nodes[node][nmtensor.name] = nmtensor + processed_nmtensors.add(nmtensor) + if (nmtensor.producer_args is not None + and nmtensor.producer_args != {}): + for _, new_nmtensor in nmtensor.producer_args.items(): + if new_nmtensor not in processed_nmtensors: + # put in the start of list + hooks_lst.insert(0, new_nmtensor) + + all_node_with_output = [] + # Iterate over all_nodes to create new nodes that include its output + # now all nodes have (module, input tensors, output tensors) + for node in all_nodes: + all_node_with_output.append(tuple(( + node[0], + node[1], + all_nodes[node] + ))) + + processed_nodes = [] + while len(all_node_with_output) > 0: + for node in all_node_with_output.copy(): # if node's in_degree is zero it can be added to # _top_sorted_modules # this will also reduce in_degree of its children - if is_in_degree_zero(node, _top_sorted_modules): + if is_in_degree_zero(node, processed_nodes): _top_sorted_modules.append(node) - all_nodes.remove(node) + processed_nodes.append((node[0], node[1])) + all_node_with_output.remove(node) + + # Create top_sorted_modules aka callchain + top_sorted_modules = [] + for i, m in enumerate(_top_sorted_modules): + top_sorted_modules.append((m[0], dict(m[1]), m[2])) + # Ensure that there is only one dataset in callchain + if i > 0 and isinstance(m[0], DataLayerNM): + raise ValueError( + "There were more than one DataLayer NeuralModule inside " + "your DAG.") - tdataset = _top_sorted_modules[0][0].dataset - top_sorted_modules = [(m[0], dict(m[1])) for m in _top_sorted_modules] + if not isinstance(top_sorted_modules[0][0], DataLayerNM): + raise ValueError( + "The first module in your DAG was not a DataLayer " + "NeuralModule.") + + tdataset = top_sorted_modules[0][0].dataset # populate self.module_reference_table - for m in _top_sorted_modules: + for m in top_sorted_modules: if m[0].factory is None and self._local_rank is not None: raise ValueError("Neural module {0} was created without " "NeuralModuleFactory, but you are trying to" "run in distributed mode. 
Please instantiate" - "NeuralModuleFactory first and pass it's " + "NeuralModuleFactory first and pass its " "instance as `factory` parameter to all your" - "Neural Module objects.") + "Neural Module objects." + "".format(m[0].__class__.__name__)) key = m[0].unique_instance_id if key not in self.module_reference_table: if isinstance(m[0], TrainableNeuralModuleWrapper): @@ -307,18 +353,84 @@

    Source code for nemo.backends.pytorch.actions

    return top_sorted_modules, tdataset +
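
The method above walks backwards from the requested NmTensors, records each producer module with its input and output tensors, and then repeatedly emits nodes whose dependencies have all been emitted. A framework-free sketch of that Kahn-style ordering (data structures hypothetical, for illustration only):

    def topo_sort(nodes):
        """nodes: dict mapping node -> set of nodes it depends on."""
        ordered = []
        remaining = dict(nodes)
        while remaining:
            # take every node whose dependencies are already in the output
            ready = [n for n, deps in remaining.items()
                     if deps.issubset(ordered)]
            if not ready:
                raise ValueError("cycle detected in DAG")
            for n in ready:
                ordered.append(n)
                del remaining[n]
        return ordered
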
    [docs] def create_optimizer( + self, + optimizer, + things_to_optimize, + optimizer_params=None, + ): + """ + Wrapper function around __setup_optimizer() + + Args: + optimizer : A instantiated PyTorch optimizer or string. For + currently supported strings, see __setup_optimizer(). + things_to_optimize (list): Must be a list of Neural Modules and/or + parameters. If a Neural Module is passed, all trainable + parameters are extracted and passed to the optimizer. + optimizer_params (dict): Optional parameters dictionary. + + Returns: + Optimizer + """ + + optimizer_instance = None + optimizer_class = None + if isinstance(optimizer, str): + optimizer_class = optimizer + elif isinstance(optimizer, torch.optim.Optimizer): + optimizer_instance = optimizer + else: + raise ValueError("`optimizer` must be a string or an instance " + "of torch.optim.Optimizer") + + modules_to_optimize = [] + tensors_to_optimize = [] + if not isinstance(things_to_optimize, list): + things_to_optimize = [things_to_optimize] + for thing in things_to_optimize: + if isinstance(thing, NeuralModule): + modules_to_optimize.append(thing) + elif isinstance(thing, NmTensor): + tensors_to_optimize.append(thing) + else: + raise ValueError("{} passed to create_optimizer() was neither " + "a neural module nor a neural module tensor") + + if tensors_to_optimize: + call_chain, _ = self.__get_top_sorted_modules_and_dataloader( + tensors_to_optimize) + + for module in call_chain: + if module[0] not in modules_to_optimize: + modules_to_optimize.append(module[0]) + + # Extract trainable weights which will be optimized + params_list = [ + p.parameters() for p in modules_to_optimize + if isinstance(p, TrainableNM) or p.is_trainable() + ] + params_to_optimize = itertools.chain(*params_list) + + if optimizer_params is None: + optimizer_params = {} + # Init amp + optimizer = self.__setup_optimizer( + optimizer_instance=optimizer_instance, + optimizer_class=optimizer_class, + optimization_params=optimizer_params, + params_to_optimize=params_to_optimize) + + self.optimizers.append(optimizer) + return optimizer
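
Assuming an actions/factory object that exposes this method, a typical call passes an optimizer name plus the modules or tensors whose weights should be updated; the sketch below is illustrative only, and `actions`, `encoder`, `decoder` are placeholder names for objects created elsewhere in a user script:

    # "adam" is one of the strings handled by __setup_optimizer(); an already
    # instantiated torch.optim.Optimizer may be passed instead of a string.
    optimizer = actions.create_optimizer(
        optimizer="adam",
        things_to_optimize=[encoder, decoder],   # NeuralModules and/or NmTensors
        optimizer_params={"lr": 1e-3, "betas": (0.9, 0.999)})
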
    + @staticmethod def __setup_optimizer( optimizer_instance, optimizer_class, optimization_params, params_to_optimize, - call_chain, - optim_level=Optimization.nothing, ): - amp_min_loss_scale = 1.0 - if optimization_params is not None: - amp_min_loss_scale = optimization_params.get('min_loss_scale', 1.0) if optimizer_instance is None: # Setup optimizer instance, by default it is SGD lr = optimization_params["lr"] @@ -330,8 +442,10 @@

    Source code for nemo.backends.pytorch.actions

    weight_decay=optimization_params.get("weight_decay", 0.0), ) elif optimizer_class.lower() == "adam": - optimizer = optim.Adam(params=params_to_optimize, lr=lr) - elif optimizer_class.lower() == "fuzed_adam": + optimizer = optim.Adam( + params=params_to_optimize, lr=lr, + betas=optimization_params.get("betas", (0.9, 0.999))) + elif optimizer_class.lower() == "fused_adam": optimizer = apex.optimizers.FusedAdam( params=params_to_optimize, lr=lr) @@ -348,6 +462,7 @@

    Source code for nemo.backends.pytorch.actions

    weight_decay=optimization_params.get("weight_decay", 0.0), luc=optimization_params.get("luc", False), luc_trust=optimization_params.get("luc_eta", 1e-3), + betas=optimization_params.get("betas", (0.95, 0.98)), ) elif optimizer_class.lower() == "lamb": optimizer = Lamb( @@ -374,48 +489,52 @@

    Source code for nemo.backends.pytorch.actions

    "optimizer because `optimizer_instance` " "is provided") optimizer = optimizer_instance + return optimizer - if optim_level in AmpOptimizations: - inds = [] - pt_modules = [] - for i in range(len(call_chain)): - if isinstance(call_chain[i][0], nn.Module): - inds.append([i, False]) - pt_modules.append(call_chain[i][0]) - elif isinstance(call_chain[i][0], - TrainableNeuralModuleWrapper): - inds.append([i, True]) - pt_modules.append(call_chain[i][0]._pt_module) - - pt_modules, optimizer = amp.initialize( - min_loss_scale=amp_min_loss_scale, - max_loss_scale=32768.0, - models=pt_modules, - optimizers=optimizer, - opt_level=AmpOptimizations[optim_level], - ) - - for ind in range(len(pt_modules)): - if inds[ind][1]: - call_chain[inds[ind][0]][0]._pt_module = pt_modules[ind] - else: - call_chain[inds[ind][0]] = ( - pt_modules[ind], - call_chain[inds[ind][0]][1], - ) - else: - return optimizer, call_chain - return optimizer, call_chain + def __initialize_amp( + self, optimizer, optim_level, amp_min_loss_scale=1.0 + ): + if optim_level not in AmpOptimizations: + raise ValueError("__initialize_amp() was called but optim_level " + "was set to float32.") + if len(self.modules) < 1: + raise ValueError("There were no modules to initialize") + pt_modules = [] + for module in self.modules: + if isinstance(module, nn.Module): + pt_modules.append(module) + elif isinstance(module, + TrainableNeuralModuleWrapper): + pt_modules.append(module._pt_module) + + _, optimizer = amp.initialize( + min_loss_scale=amp_min_loss_scale, + models=pt_modules, + optimizers=optimizer, + opt_level=AmpOptimizations[optim_level], + ) + return optimizer def __nm_graph_forward_pass( - self, call_chain, registered_tensors, mode=ModelMode.train + self, + call_chain, + registered_tensors, + mode=ModelMode.train, + disable_allreduce=False ): for ind in range(1, len(call_chain)): call_args = call_chain[ind][1] # module = call_chain[ind][0] m_id = call_chain[ind][0].unique_instance_id pmodule = self.module_reference_table[m_id][1] - module_output_port_names = call_chain[ind][0]._output_ports.keys() + + if isinstance(pmodule, DDP): + if disable_allreduce: + pmodule.disable_allreduce() + pmodule.delay_allreduce = True + else: + pmodule.enable_allreduce() + pmodule.delay_allreduce = False if mode == ModelMode.train: # if module.is_trainable(): @@ -447,20 +566,17 @@

    Source code for nemo.backends.pytorch.actions

    new_tensors = [new_tensors] else: new_tensors = list(new_tensors) - # module_output_port_names = module._output_ports.keys() - # now pack it according module's output port names - new_tensors_packed = dict( - zip( - [ - _add_uuid_2_name(port_name, m_id) - for port_name in module_output_port_names - ], - new_tensors, - ) - ) - for t_name, t_tensor in new_tensors_packed.items(): + for t_tensor, nm_tensor in zip( + new_tensors, call_chain[ind][2].values()): + if nm_tensor is None: + continue + t_name = nm_tensor.unique_name if t_name not in registered_tensors: registered_tensors[t_name] = t_tensor + else: + raise ValueError( + "A NMTensor was produced twice in the same DAG. " + "{}".format(t_name))
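
The forward pass above also toggles allreduce off on DDP-wrapped modules while gradients are being accumulated across micro-batches. The same idea in plain PyTorch, using torch.nn.parallel.DistributedDataParallel's no_sync() rather than the apex-style disable_allreduce shown here (all names hypothetical, a sketch rather than NeMo's implementation):

    import contextlib

    def accumulate(model, loader, optimizer, batches_per_step):
        # `model` is assumed to be wrapped in DistributedDataParallel and to
        # return a scalar loss; only the last micro-batch synchronizes grads.
        for step, (inputs, targets) in enumerate(loader):
            last_micro_batch = (step + 1) % batches_per_step == 0
            ctx = contextlib.nullcontext() if last_micro_batch else model.no_sync()
            with ctx:
                loss = model(inputs, targets) / batches_per_step
                loss.backward()
            if last_micro_batch:
                optimizer.step()
                optimizer.zero_grad()
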
    [docs] @staticmethod def pad_tensor(t: torch.Tensor, target_size: torch.Size): @@ -519,11 +635,6 @@

    Source code for nemo.backends.pytorch.actions

    ) dl_nm = call_chain[0][0] - if not isinstance(dl_nm, DataLayerNM): - raise ValueError( - "The evaluation callchain did not start with a DataLayerNM" - ) - # Prepare eval_dataloader # For distributed training it should have disjoint subsets of # all data on every worker @@ -556,6 +667,7 @@

    Source code for nemo.backends.pytorch.actions

    eval_dataloader.sampler.set_epoch(0) else: # Not distributed if dl_nm.dataset is not None: + # Todo: remove local_parameters eval_dataloader = torch.utils.data.DataLoader( dataset=dl_nm.dataset, sampler=None, # not distributed sampler @@ -575,7 +687,6 @@

    Source code for nemo.backends.pytorch.actions

    # there callback.clear_global_var_dict() - data_layer_output_port_names = dl_nm._output_ports.keys() dl_device = dl_nm._device # Evaluation mini-batch for loop @@ -588,29 +699,26 @@

    Source code for nemo.backends.pytorch.actions

    print("Evaluating batch {} out of {}".format(epoch_i, num_batches)) tensors = [] + if isinstance(data, torch.Tensor): + data = (data,) for d in data: if isinstance(d, torch.Tensor): tensors.append(d.to(dl_device)) else: tensors.append(d) - registered_e_tensors = dict( - zip( - [ - _add_uuid_2_name(dl_port_name, - call_chain[0][0]._uuid) - for dl_port_name in data_layer_output_port_names - ], - tensors, - ) - ) + registered_e_tensors = {t.unique_name: d for t, d in + zip(call_chain[0][2].values(), tensors) + if t is not None + } self.__nm_graph_forward_pass( call_chain=call_chain, registered_tensors=registered_e_tensors, mode=ModelMode.eval, ) - values_dict = {} + if not is_distributed or self.local_rank == 0: + values_dict = {} # If distributed. For the outer loop, we need to ensure that # all processes loop through the elements in the same order for t2e in tensors_2_evaluate: @@ -623,7 +731,6 @@

    Source code for nemo.backends.pytorch.actions

    ) continue if is_distributed: - values_dict["IS_FROM_DIST_EVAL"] = True # where we will all_gather results from all workers tensors_list = [] # where we will all_gather tensor sizes @@ -667,7 +774,9 @@

    Source code for nemo.backends.pytorch.actions

    self.depad_tensor(t, size) for t, size in zip(tensors_list, sizes) ] - values_dict[key] = tensors_list + if self.local_rank == 0: + values_dict["IS_FROM_DIST_EVAL"] = True + values_dict[key] = tensors_list else: # NON-DISTRIBUTED TRAINING values_dict["IS_FROM_DIST_EVAL"] = False values_dict[key] = [registered_e_tensors[key]] @@ -690,6 +799,172 @@

    Source code for nemo.backends.pytorch.actions

    for key, val in vals_to_log.items(): callback._swriter.add_scalar(key, val, step) + def _infer(self, tensors_to_return, step, verbose=False): + """ + Does the same as _eval() just with tensors instead of eval callback. + """ + with torch.no_grad(): + # each call chain corresponds to a tensor in tensors_2_evaluate + dl_nm = None + call_chain, _ = self.__get_top_sorted_modules_and_dataloader( + hook=tensors_to_return + ) + dl_nm = call_chain[0][0] + + # Prepare eval_dataloader + # For distributed training it should have disjoint subsets of + # all data on every worker + is_distributed = False + world_size = None + if dl_nm.placement == DeviceType.AllGpu: + assert dist.is_initialized() + is_distributed = True + world_size = torch.distributed.get_world_size() + # print( + # "Doing distributed evaluation. Rank {0} of {1}".format( + # self.local_rank, world_size + # ) + # ) + if dl_nm.dataset is not None: + sampler = torch.utils.data.distributed.DistributedSampler( + dl_nm.dataset + ) + eval_dataloader = torch.utils.data.DataLoader( + dataset=dl_nm.dataset, + sampler=sampler, + num_workers=dl_nm.local_parameters.get( + "num_workers", os.cpu_count() + ), + batch_size=dl_nm.local_parameters["batch_size"], + shuffle=(sampler is None), + ) + else: + eval_dataloader = dl_nm.data_iterator + eval_dataloader.sampler.set_epoch(0) + else: # Not distributed + if dl_nm.dataset is not None: + # Todo: remove local_parameters + eval_dataloader = torch.utils.data.DataLoader( + dataset=dl_nm.dataset, + sampler=None, # not distributed sampler + num_workers=call_chain[0][0].local_parameters.get( + "num_workers", os.cpu_count() + ), + batch_size=call_chain[0][0].local_parameters[ + "batch_size"], + shuffle=call_chain[0][0].local_parameters.get( + "shuffle", + False), + ) + else: + eval_dataloader = dl_nm.data_iterator + # after this eval_dataloader is ready to be used + # reset global_var_dict - results of evaluation will be stored + # there + + if not is_distributed or self.local_rank == 0: + values_dict = {} + for t in tensors_to_return: + values_dict[t.unique_name] = [] + dl_device = dl_nm._device + + # Evaluation mini-batch for loop + num_batches = len(eval_dataloader) + for epoch_i, data in enumerate(eval_dataloader, 0): + if verbose and ( + num_batches < 10 or ( + epoch_i % int(num_batches / 10) == 0) + ): + print("Evaluating batch {} out of {}".format(epoch_i, + num_batches)) + tensors = [] + if isinstance(data, torch.Tensor): + data = (data,) + for d in data: + if isinstance(d, torch.Tensor): + tensors.append(d.to(dl_device)) + else: + tensors.append(d) + + registered_e_tensors = {t.unique_name: d for t, d in + zip(call_chain[0][2].values(), tensors) + if t is not None + } + self.__nm_graph_forward_pass( + call_chain=call_chain, + registered_tensors=registered_e_tensors, + mode=ModelMode.eval, + ) + + # If distributed. 
For the outer loop, we need to ensure that + # all processes loop through the elements in the same order + for t2e in tensors_to_return: + key = t2e.unique_name + if key not in registered_e_tensors.keys(): + print( + "WARNING: Tensor {} was not found during " + "eval".format( + key) + ) + continue + if is_distributed: + # where we will all_gather results from all workers + tensors_list = [] + # where we will all_gather tensor sizes + tensor_on_worker = registered_e_tensors[key] + if tensor_on_worker.shape != torch.Size([]): + tensor_on_worker_size_as_tensor = torch.tensor( + tensor_on_worker.shape + ).cuda() + sizes = [] + for ind in range(world_size): + sizes.append( + torch.empty_like( + tensor_on_worker_size_as_tensor) + ) + dist.all_gather(sizes, + tensor_on_worker_size_as_tensor) + mx_dim, _ = torch.max(torch.stack(sizes), dim=0) + else: # this is a singleton. For example, loss value + sizes = [torch.Size([])] * world_size + mx_dim = None + for ind in range(world_size): + # we have to use max shape for all_gather + if mx_dim is None: # singletons + tensors_list.append( + torch.tensor(2).cuda().type_as( + tensor_on_worker) + ) + else: # non-singletons + tensors_list.append(torch.empty( + mx_dim.cpu().data.numpy().tolist()).cuda() + .type_as( + tensor_on_worker)) + + if mx_dim is not None: + t_to_send = self.pad_tensor(tensor_on_worker, + mx_dim) + else: + t_to_send = tensor_on_worker + dist.all_gather(tensors_list, t_to_send) + tensors_list = [ + self.depad_tensor(t, size) + for t, size in zip(tensors_list, sizes) + ] + if self.local_rank == 0: + values_dict[key] += tensors_list + else: # NON-DISTRIBUTED TRAINING + values_dict[key] += [registered_e_tensors[key]] + + if not is_distributed or self.local_rank == 0: + inferred_tensors = [] + for t in tensors_to_return: + inferred_tensors.append(values_dict[t.unique_name]) + return inferred_tensors + + # For all other ranks + return None +
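
_infer() gathers differently-shaped per-worker tensors by first padding every tensor to the largest shape in the group, calling all_gather, and then trimming each result back to its original size. A stripped-down sketch of that idea for the 1-D case (hypothetical helper, assumes the process group is already initialized and, with NCCL, CUDA tensors):

    import torch
    import torch.distributed as dist

    def gather_variable_length(t):
        """Gather 1-D tensors of different lengths from every worker."""
        world_size = dist.get_world_size()
        # 1) exchange sizes so every rank knows the maximum length
        size = torch.tensor([t.numel()], device=t.device)
        sizes = [torch.empty_like(size) for _ in range(world_size)]
        dist.all_gather(sizes, size)
        max_len = int(max(s.item() for s in sizes))
        # 2) pad to the maximum length and all_gather the padded tensors
        padded = torch.zeros(max_len, dtype=t.dtype, device=t.device)
        padded[:t.numel()] = t
        gathered = [torch.empty_like(padded) for _ in range(world_size)]
        dist.all_gather(gathered, padded)
        # 3) depad each gathered tensor using the original sizes
        return [g[:int(s.item())] for g, s in zip(gathered, sizes)]
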
    [docs] def save_state_to(self, path: str): """ Saves current state such as step, epoch and optimizer parameters @@ -702,8 +977,7 @@

    Source code for nemo.backends.pytorch.actions

    state = { "step": self.step, "epoch_num": self.epoch_num, - "optimizer_state": self.optimizer.state_dict() if self.optimizer - else None, + "optimizer_state": [opt.state_dict() for opt in self.optimizers], } torch.save(state, path)
    @@ -723,67 +997,202 @@

    Source code for nemo.backends.pytorch.actions

    checkpoint = torch.load(path, map_location="cpu") self.step = checkpoint["step"] self.epoch_num = checkpoint["epoch_num"] - if checkpoint["optimizer_state"] is not None: - self.optimizer.load_state_dict(checkpoint["optimizer_state"]) + if checkpoint["optimizer_state"]: + for opt, opt_chkpt in zip( + self.optimizers, checkpoint["optimizer_state"]): + opt.load_state_dict(opt_chkpt) else: raise FileNotFoundError( "Could not find checkpoint file: {0}".format(path))
    + @staticmethod + def _check_all_tensors(list_of_tensors): + """Method that checks if the passed list contains all NmTensors + """ + if not isinstance(list_of_tensors, list): + return False + for tensor in list_of_tensors: + if not isinstance(tensor, NmTensor): + return False + return True + + @staticmethod + def _check_tuples(list_of_tuples): + """Method that checks if the passed tuple contains an optimizer in the + first element, and a list of NmTensors in the second. + """ + for tup in list_of_tuples: + if not (isinstance(tup[0], torch.optim.Optimizer) + and PtActions._check_all_tensors(tup[1])): + return False + return True + + def _get_all_modules( + self, training_loop, callbacks, logging_callchain=None): + """Gets all neural modules that will be used by train() and eval() via + EvaluatorCallbacks. Saves all modules to self.modules + """ + # If there is a SimpleLossLoggerCallback, create an logger_callchain + # with all callchains from training_loop and + # SimpleLossLoggerCallback.tensors + if logging_callchain: + for module in logging_callchain: + self.modules.add(module[0]) + + # Else grab all callchains from training_loop + else: + for step in training_loop: + for module in step[2]: + self.modules.add(module[0]) + + # Lastly, grab all eval modules + if callbacks is not None: + for callback in callbacks: + if isinstance(callback, EvaluatorCallback): + callchain, _ = \ + self.__get_top_sorted_modules_and_dataloader( + hook=callback.eval_tensors) + for module in callchain: + self.modules.add(module[0]) +
    [docs] def train( self, - tensors_to_optimize: List[NmTensor], - tensors_to_evaluate: Optional[List[NmTensor]] = None, + tensors_to_optimize, + optimizer=None, + optimization_params=None, callbacks: Optional[List[ActionCallback]] = None, lr_policy=None, batches_per_step=None, stop_on_nan_loss=False ): - if len(tensors_to_optimize) != 1: - raise NotImplementedError( - "Currently we can only optimize single loss") + if not optimization_params: + optimization_params = {} + num_epochs = optimization_params.get("num_epochs", 1) + max_steps = optimization_params.get("max_steps", None) + grad_norm_clip = optimization_params.get('grad_norm_clip', None) + if batches_per_step is None: batches_per_step = 1 # this is necessary because we average gradients over batch - bps_scale = torch.FloatTensor([1.0/batches_per_step]) - - # Parse graph into a topologically sorted sequence of neural - # modules' calls - opt_call_chain, t_dataset = \ - self.__get_top_sorted_modules_and_dataloader( - hook=tensors_to_optimize - ) - opteval_call_chain = None - if tensors_to_evaluate is not None: - opteval_call_chain, _ = \ + bps_scale = torch.FloatTensor([1.0 / batches_per_step]) + + if tensors_to_optimize is None: + # This is Evaluation Mode + self._init_callbacks(callbacks) + # Do action start callbacks + self._perform_on_action_end(callbacks=callbacks) + return + # Check if tensors_to_optimize is just a list of NmTensors + elif tensors_to_optimize is not None and ( + isinstance(tensors_to_optimize[0], + NmTensor) and PtActions._check_all_tensors( + tensors_to_optimize)): + # Parse graph into a topologically sorted sequence of neural + # modules' calls + opt_call_chain, t_dataset = \ self.__get_top_sorted_modules_and_dataloader( - hook=tensors_to_optimize + tensors_to_evaluate + hook=tensors_to_optimize ) - # Extract trainable weights which will be optimized - params_list = [p[0].parameters() for p in opt_call_chain if - p[0].is_trainable()] - params_to_optimize = itertools.chain(*params_list) + # Extract trainable weights which will be optimized + params_list = [ + p[0].parameters() for p in opt_call_chain + if isinstance(p[0], TrainableNM) or p[0].is_trainable() + ] + params_to_optimize = itertools.chain(*params_list) + + # Setup optimizer instance. By default it is SGD + optimizer_instance = None + optimizer_class = None + if isinstance(optimizer, str): + optimizer_class = optimizer + elif isinstance(optimizer, torch.optim.optimizer): + optimizer_instance = optimizer + else: + raise ValueError("optimizer was not understood") + optimizer = self.__setup_optimizer( + optimizer_instance=optimizer_instance, + optimizer_class=optimizer_class, + optimization_params=optimization_params, + params_to_optimize=params_to_optimize, + ) - # Setup optimizer instance. 
By default it is SGD - optimizer_instance = self._parameters.get("optimizer_instance", None) - optimizer_class = self._parameters.get("optimizer_kind", "sgd") - optimization_params = self._parameters.get( - "optimization_params", {"lr": 0.0003} - ) - grad_norm_clip = optimization_params.get('grad_norm_clip', None) - num_epochs = optimization_params.get("num_epochs", 1) - max_steps = optimization_params.get("max_steps", None) - self.optimizer, opt_call_chain = self.__setup_optimizer( - optimizer_instance=optimizer_instance, - optimizer_class=optimizer_class, - optimization_params=optimization_params, - params_to_optimize=params_to_optimize, - call_chain=opt_call_chain, - optim_level=self._optim_level, - ) + training_loop = [ + (optimizer, tensors_to_optimize, opt_call_chain) + ] + + self.optimizers.append(optimizer) + assert len(self.optimizers) == 1, \ + ("There was more than one optimizer, was create_optimizer() " + "called before train()?") + + elif PtActions._check_tuples(tensors_to_optimize): + if batches_per_step != 1: + raise ValueError("Gradient accumlation with multiple " + "optimizers is not supported") + datasets = [] + training_loop = [] + for step in tensors_to_optimize: + step_call_chain, dataset = \ + self.__get_top_sorted_modules_and_dataloader( + hook=step[1] + ) + datasets.append(dataset) + training_loop.append( + (step[0], step[1], step_call_chain)) + + t_dataset = datasets[0] + for dataset in datasets: + if type(dataset) is not type(t_dataset): + raise ValueError( + "There were two training datasets, we only support 1.") + else: + raise ValueError("tensors_to_optimize was not understood") - dataNM = opt_call_chain[0][0] + logging_callchain = None + # callbacks setup + if callbacks is not None: + for callback in callbacks: + if not isinstance(callback, ActionCallback): + raise ValueError("A callback was received that was not a " + "child of ActionCallback") + elif isinstance(callback, SimpleLossLoggerCallback): + if logging_callchain: + raise ValueError("We only support one logger callback " + "but more than one were found") + logger_step_freq = callback._step_freq + logging_tensors = callback.tensors + all_tensors = logging_tensors + for step in training_loop: + all_tensors = all_tensors + step[1] + logging_callchain, _ = \ + self.__get_top_sorted_modules_and_dataloader( + hook=all_tensors) + + self._get_all_modules(training_loop, callbacks, logging_callchain) + + # Intialize Amp if needed + if self._optim_level in AmpOptimizations: + # Store mapping of self.optimizers to optimizer in callchain + training_loop_opts = [] + for opt in training_loop: + training_loop_opts.append(self.optimizers.index(opt[0])) + self.optimizers = self.__initialize_amp( + optimizer=self.optimizers, + optim_level=self._optim_level, + amp_min_loss_scale=optimization_params.get( + 'amp_min_loss_scale', 1.0)) + # Use stored mapping to map amp_init opts to training loop + for i, step in enumerate(training_loop): + training_loop[i] = ( + self.optimizers[training_loop_opts[i]], step[1], step[2]) + + dataNM = training_loop[0][2][0][0] if dataNM.placement == DeviceType.AllGpu: + if len(training_loop) > 1: + raise NotImplementedError( + "Distributed training does nor work with multiple " + "optimizers") print("Doing distributed training") if t_dataset is not None: train_sampler = \ @@ -829,32 +1238,8 @@

    Source code for nemo.backends.pytorch.actions

    train_dataloader = dataNM.data_iterator train_sampler = None - data_layer_output_port_names = opt_call_chain[0][0]._output_ports\ - .keys() - eval_tensors_debug_freq = 1000000000 - - # callbacks setup - if callbacks is not None: - for callback in callbacks: - if isinstance(callback, EvaluatorCallback): - callback.__setattr__("_compute_callback", self._eval) - elif isinstance(callback, SimpleLossLoggerCallback): - eval_tensors_debug_freq = min(callback._step_frequency, - eval_tensors_debug_freq) - elif isinstance(callback, CheckpointCallback): - callback.__setattr__("call_chain", opt_call_chain) - callback.__setattr__("action", self) - elif isinstance(callback, (ModuleSaverCallback, - ValueSetterCallback)): - pass - else: - raise TypeError("Callback of unknown type") - # Register action start with callbacks - self._fill_callbacks( - callbacks=callbacks, - tensors_to_optimize=tensors_to_optimize, - tensors_to_evaluate=tensors_to_evaluate, - ) + self._init_callbacks(callbacks) + # Do action start callbacks self._perform_on_action_start(callbacks=callbacks) # MAIN TRAINING LOOP @@ -867,28 +1252,20 @@

    Source code for nemo.backends.pytorch.actions

    break # Register epochs start with callbacks - self._fill_callbacks( - callbacks=callbacks, - tensors_to_optimize=tensors_to_optimize, - tensors_to_evaluate=tensors_to_evaluate, - ) self._perform_on_epoch_start(callbacks=callbacks) # iteration over batches in epoch batch_counter = 0 - for epoch_i, data in enumerate(train_dataloader, 0): + for _, data in enumerate(train_dataloader, 0): if max_steps is not None and self.step >= max_steps: break if batch_counter == 0: # Started step, zero gradients - self.optimizer.zero_grad() + curr_optimizer = training_loop[ + self.step % len(training_loop)][0] + curr_optimizer.zero_grad() # Register iteration start with callbacks - self._fill_callbacks( - callbacks=callbacks, - tensors_to_optimize=tensors_to_optimize, - tensors_to_evaluate=tensors_to_evaluate, - ) self._perform_on_iteration_start(callbacks=callbacks) # set learning rate policy @@ -896,16 +1273,20 @@

    Source code for nemo.backends.pytorch.actions

    adjusted_lr = lr_policy( optimization_params["lr"], self.step, self.epoch_num ) - for param_group in self.optimizer.param_groups: + for param_group in curr_optimizer.param_groups: param_group["lr"] = adjusted_lr if self.tb_writer is not None: - value = self.optimizer.param_groups[0]['lr'] + value = curr_optimizer.param_groups[0]['lr'] self.tb_writer.add_scalar('param/lr', value, self.step) # registered_tensors will contain created tensors # named by output port and uuid of module which created them # Get and properly name tensors returned by data layer - dl_device = opt_call_chain[0][0]._device + curr_call_chain = training_loop[ + self.step % len(training_loop)][2] + dl_device = curr_call_chain[0][0]._device + if logging_callchain and self.step % logger_step_freq == 0: + curr_call_chain = logging_callchain tensors = [] if isinstance(data, torch.Tensor): data = (data,) @@ -923,62 +1304,48 @@

    Source code for nemo.backends.pytorch.actions

    else: tensors.append(d) - registered_tensors = dict( - zip( - [ - _add_uuid_2_name(dl_port_name, - opt_call_chain[0][0]._uuid) - for dl_port_name in data_layer_output_port_names - ], - tensors, - ) + registered_tensors = { + t.unique_name: d for t, d in + zip(curr_call_chain[0][2].values(), tensors) + if t is not None + } + disable_allreduce = batch_counter < (batches_per_step - 1) + self.__nm_graph_forward_pass( + call_chain=curr_call_chain, + registered_tensors=registered_tensors, + disable_allreduce=disable_allreduce ) - # Run opteval_call_chain as needed, otherwise run - # opt_call_chain - if ( - self.step % eval_tensors_debug_freq == 0 - and opteval_call_chain is not None - ): - self.__nm_graph_forward_pass( - call_chain=opteval_call_chain, - registered_tensors=registered_tensors, - ) - else: - self.__nm_graph_forward_pass( - call_chain=opt_call_chain, - registered_tensors=registered_tensors - ) - - tto_len = len(tensors_to_optimize) + curr_tensors_to_optimize = training_loop[ + self.step % len(training_loop)][1] + tto_len = len(curr_tensors_to_optimize) for ind in range(tto_len): - registered_name = tensors_to_optimize[ind].unique_name + tensor_name = curr_tensors_to_optimize[ind].unique_name if self._optim_level in AmpOptimizations and ind == \ tto_len - 1: with amp.scale_loss( - registered_tensors[registered_name], - self.optimizer + registered_tensors[tensor_name], + curr_optimizer ) as scaled_loss: if torch.isnan(scaled_loss).any(): if stop_on_nan_loss: raise ValueError('Loss is NaN exiting') - else: - print('WARNING: Loss is NaN') - self.optimizer.zero_grad() + print('WARNING: Loss is NaN') + curr_optimizer.zero_grad() scaled_loss.backward( bps_scale.to(scaled_loss.get_device())) else: - if torch.isnan(registered_tensors[registered_name])\ - .any(): + if torch.isnan(registered_tensors[tensor_name]).any(): if stop_on_nan_loss: raise ValueError('Loss is NaN exiting') - else: - print('WARNING: Loss is NaN') - self.optimizer.zero_grad() + print('WARNING: Loss is NaN') + curr_optimizer.zero_grad() + break - registered_tensors[registered_name].backward( - bps_scale.to(registered_tensors[ - registered_name].get_device())) + registered_tensors[tensor_name].backward( + bps_scale.to( + registered_tensors[tensor_name].get_device()), + retain_graph=(ind != tto_len - 1)) batch_counter += 1 @@ -986,14 +1353,12 @@


    # Ended step. Do optimizer update if grad_norm_clip is not None: torch.nn.utils.clip_grad_norm_( - amp.master_params(self.optimizer), grad_norm_clip) - self.optimizer.step() + amp.master_params(curr_optimizer), grad_norm_clip) + curr_optimizer.step() batch_counter = 0 # Register iteration end with callbacks - self._fill_callbacks( + self._update_callbacks( callbacks=callbacks, - tensors_to_optimize=tensors_to_optimize, - tensors_to_evaluate=tensors_to_evaluate, registered_tensors=registered_tensors, ) self._perform_on_iteration_end(callbacks=callbacks) @@ -1001,25 +1366,16 @@
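The `batches_per_step` handling in the hunks above accumulates gradients over several micro-batches before a single optimizer update, scaling each backward pass by `1/batches_per_step`. A simplified, hypothetical illustration of that pattern in plain PyTorch (not NeMo's actual code):

.. code-block:: python

    import torch

    # Toy model and optimizer standing in for a NeMo call chain.
    model = torch.nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    batches_per_step = 2
    bps_scale = torch.tensor(1.0 / batches_per_step)

    batch_counter = 0
    for _ in range(4):  # stand-in for the data loader
        if batch_counter == 0:
            optimizer.zero_grad()
        loss = model(torch.randn(8, 4)).pow(2).mean()
        # Equivalent to loss.backward(bps_scale) for a scalar loss: gradients
        # are scaled so the accumulated update averages the micro-batches.
        (loss * bps_scale).backward()
        batch_counter += 1
        if batch_counter == batches_per_step:
            optimizer.step()
            batch_counter = 0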


    # End of epoch for loop # Register epochs end with callbacks - self._fill_callbacks( - callbacks=callbacks, - tensors_to_optimize=tensors_to_optimize, - tensors_to_evaluate=tensors_to_evaluate, - ) self._perform_on_epoch_end(callbacks=callbacks) - self._fill_callbacks( - callbacks=callbacks, - tensors_to_optimize=tensors_to_optimize, - tensors_to_evaluate=tensors_to_evaluate, - ) self._perform_on_action_end(callbacks=callbacks)
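The `training_loop[self.step % len(training_loop)]` indexing introduced above lets one training action cycle through several (optimizer, loss, call chain) entries. A minimal, hypothetical sketch of that round-robin pattern with plain PyTorch objects:

.. code-block:: python

    import torch

    # Two independent heads, each owned by its own optimizer.
    model_a = torch.nn.Linear(4, 1)
    model_b = torch.nn.Linear(4, 1)
    training_loop = [
        (torch.optim.SGD(model_a.parameters(), lr=0.01), model_a),
        (torch.optim.SGD(model_b.parameters(), lr=0.01), model_b),
    ]

    for step in range(4):
        # Pick the optimizer (and the part of the graph it trains) for this step.
        curr_optimizer, curr_model = training_loop[step % len(training_loop)]
        curr_optimizer.zero_grad()
        loss = curr_model(torch.randn(8, 4)).pow(2).mean()
        loss.backward()
        curr_optimizer.step()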
    -
    [docs] def infer(self, callback, checkpoint_dir=None, ckpt_pattern=''): +
    [docs] def infer(self, tensors, checkpoint_dir=None, ckpt_pattern='', + logger=None): if checkpoint_dir: # Find all modules that need to be restored call_chain, _ = self.__get_top_sorted_modules_and_dataloader( - hook=callback.eval_tensors + hook=tensors ) modules_to_restore = [] modules_to_restore_name = [] @@ -1033,6 +1389,8 @@


    ) for mod, checkpoint in zip(modules_to_restore, module_checkpoints): + if logger: + logger.info(f"Restoring {mod} from {checkpoint}") mod.restore_from(checkpoint, self._local_rank) # Init Amp @@ -1047,20 +1405,13 @@


    amp.initialize( min_loss_scale=1.0, - max_loss_scale=8192.0, models=pt_modules, optimizers=None, opt_level=AmpOptimizations[self._optim_level], ) # Run infer - self._eval(callback.eval_tensors, callback, step=0, verbose=True) - - evaluated_tensors = [] - for tensor in callback.eval_tensors: - evaluated_tensors.append( - callback._global_var_dict[tensor.unique_name]) - return evaluated_tensors
    + return self._infer(tensors_to_return=tensors, step=0, verbose=True)
    diff --git a/docs/_modules/nemo/backends/pytorch/common/data.html b/docs/_modules/nemo/backends/pytorch/common/data.html index 7e68edea0b77..3ba9daf80419 100644 --- a/docs/_modules/nemo/backends/pytorch/common/data.html +++ b/docs/_modules/nemo/backends/pytorch/common/data.html @@ -88,8 +88,9 @@
  • Getting started
  • Fast Training
  • Speech Recognition
- • NEMO Collections API
- • NEMO API
+ • Natural Language Processing
+ • NeMo Collections API
+ • NeMo API
  • Frequently Asked Questions (FAQ)
  • @@ -242,6 +243,8 @@

    Source code for nemo.backends.pytorch.common.data

    for i, s in enumerate(batch_list): texts[i].narrow(0, 0, s.size(0)).copy_(s) + assert len(texts.shape) == 2 + return texts
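The collate logic above copies variable-length token tensors into one pre-padded 2-D batch tensor; the new assertion simply guards that invariant. A standalone sketch of the same idea (using 0 as the pad value purely for illustration):

.. code-block:: python

    import torch

    batch_list = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
    max_len = max(s.size(0) for s in batch_list)
    # Pre-allocate [batch, max_len]; 0 plays the role of the pad id here.
    texts = torch.zeros(len(batch_list), max_len, dtype=torch.long)
    for i, s in enumerate(batch_list):
        texts[i].narrow(0, 0, s.size(0)).copy_(s)
    assert len(texts.shape) == 2
    print(texts)  # tensor([[1, 2, 3], [4, 5, 0]])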
    diff --git a/docs/_modules/nemo/backends/pytorch/common/losses.html b/docs/_modules/nemo/backends/pytorch/common/losses.html index 55bf8adc7a3e..77ecf3930f7c 100644 --- a/docs/_modules/nemo/backends/pytorch/common/losses.html +++ b/docs/_modules/nemo/backends/pytorch/common/losses.html @@ -88,8 +88,9 @@
  • @@ -172,6 +173,9 @@

    Source code for nemo.backends.pytorch.common.losses

Defaults to 0. smoothing_coef (float): Label smoothing coefficient in range [0, 1]. Defaults to 0.0. + sample_wise (bool): Flag that indicates whether the loss sum divisor + should be the batch size. + Defaults to False. aux_ctc (bool): Whether to add auxiliary CTC loss. Defaults to False. ctc_initial_coef (float): Initial coefficient to multiply ctc component @@ -201,7 +205,7 @@


    } return input_ports, output_ports - def __init__(self, pad_id=0, smoothing_coef=0.0, + def __init__(self, pad_id=0, smoothing_coef=0.0, sample_wise=False, aux_ctc=False, ctc_initial_coef=0.1, ctc_blank_id=None, **kwargs): assert (not aux_ctc) or (ctc_blank_id is not None), \ @@ -211,6 +215,7 @@


    self.pad_id = pad_id self.smoothing_coef = smoothing_coef + self.sample_wise = sample_wise self.aux_ctc = aux_ctc self.ctc_coef = ctc_initial_coef @@ -240,7 +245,10 @@


    + self.smoothing_coef * log_probs.mean(-1) pad_mask = pad_mask.float() loss = -torch.sum(loss * pad_mask) - loss = loss / (pad_mask.sum() + EPS) + if self.sample_wise: + loss /= target_log_probs.size(0) + else: + loss /= pad_mask.sum() + EPS return loss def _ctc_loss(self, log_probs, targets, pad_mask): @@ -248,6 +256,39 @@


    loss = self.ctc(log_probs.transpose(0, 1), targets, lengths, lengths) loss = torch.mean(loss) return loss
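The `sample_wise` branch added a few hunks above changes only the divisor of the summed token loss: by default the sum is divided by the number of non-padded tokens, with `sample_wise=True` by the batch size. A small numeric illustration with made-up values:

.. code-block:: python

    import torch

    EPS = 1e-5
    token_loss = torch.tensor([[1.0, 2.0, 0.0],
                               [3.0, 0.0, 0.0]])        # [batch, time]
    pad_mask = torch.tensor([[1.0, 1.0, 0.0],
                             [1.0, 0.0, 0.0]])          # 1 = real token
    summed = torch.sum(token_loss * pad_mask)           # 6.0
    per_token = summed / (pad_mask.sum() + EPS)         # ~2.0 (3 real tokens)
    per_sample = summed / token_loss.size(0)            # 3.0 (batch size 2)
    print(per_token.item(), per_sample.item())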
    + + +
    [docs]class CrossEntropyLoss(LossNM): + """ + CrossEntropyLoss + + """ + @staticmethod + def create_ports(): + input_ports = { + "logits": NeuralType({ + 0: AxisType(BatchTag), + 1: AxisType(ChannelTag) + }), + "labels": NeuralType({ + 0: AxisType(BatchTag), + }) + } + + output_ports = { + "loss": NeuralType(None), + } + return input_ports, output_ports + + def __init__(self, **kwargs): + LossNM.__init__(self, **kwargs) + self._criterion = nn.CrossEntropyLoss() + + def _loss_function(self, + logits, + labels): + loss = self._criterion(logits, labels) + return loss
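The new `CrossEntropyLoss` module is a thin wrapper around `torch.nn.CrossEntropyLoss`, with ports declaring `[batch, channel]` logits, `[batch]` integer labels, and a scalar loss output. A rough plain-PyTorch sketch of what a single call computes:

.. code-block:: python

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()
    logits = torch.randn(8, 5)            # BatchTag x ChannelTag
    labels = torch.randint(0, 5, (8,))    # BatchTag
    loss = criterion(logits, labels)      # scalar, matching NeuralType(None)
    print(loss.item())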
    diff --git a/docs/_modules/nemo/backends/pytorch/common/rnn.html b/docs/_modules/nemo/backends/pytorch/common/rnn.html index 5e6661e2b765..1760a075d7f7 100644 --- a/docs/_modules/nemo/backends/pytorch/common/rnn.html +++ b/docs/_modules/nemo/backends/pytorch/common/rnn.html @@ -88,8 +88,9 @@
  • @@ -248,6 +249,7 @@

    Source code for nemo.backends.pytorch.common.rnn

    voc_size = pad_to(voc_size, 8) # 8-divisors trick self.embedding = nn.Embedding(voc_size, hidden_size) + # noinspection PyTypeChecker self.in_dropout = nn.Dropout(in_dropout) rnn_class = getattr(nn, rnn_type.upper()) self.rnn = rnn_class(hidden_size, hidden_size, n_layers, @@ -285,6 +287,7 @@


    # Inputs decoder_inputs = self.embedding(decoder_inputs) + # noinspection PyCallingNonCallable decoder_inputs = self.in_dropout(decoder_inputs) # RNN diff --git a/docs/_modules/nemo/backends/pytorch/common/search.html b/docs/_modules/nemo/backends/pytorch/common/search.html index 63732136b1c9..f49f72b095c5 100644 --- a/docs/_modules/nemo/backends/pytorch/common/search.html +++ b/docs/_modules/nemo/backends/pytorch/common/search.html @@ -88,8 +88,9 @@
  • @@ -156,7 +157,7 @@

    Source code for nemo.backends.pytorch.common.search

     import torch
     
    -from nemo.backends.pytorch.nm import TrainableNM
    +from nemo.backends.pytorch.nm import NonTrainableNM
     from nemo.core.neural_types import NeuralType, AxisType, BatchTag, TimeTag, \
         ChannelTag
     
    @@ -165,7 +166,7 @@ 


    # TODO: Validate, compare to `BeamSearch` -
    [docs]class GreedySearch(TrainableNM): +
    [docs]class GreedySearch(NonTrainableNM): """Greedy translation search. For encoder-decoder based models. @@ -272,10 +273,11 @@


    self.beam_size = beam_size - @torch.no_grad() def forward(self, encoder_outputs=None): k = self.beam_size + fdtype = self.decoder.embedding.weight.dtype if self.batch_size is None: + encoder_outputs = encoder_outputs.to(fdtype) bs = encoder_outputs.size(0) # [BK]TC # encoder_output = encoder_output.repeat_interleave(k, 0) @@ -290,13 +292,13 @@


    bs * k, 1, dtype=torch.long, device=self._device ).fill_(self.bos_id) # [BK]1 - scores = torch.zeros_like(predictions, dtype=torch.float) # [BK]1 + scores = torch.zeros_like(predictions, dtype=fdtype) # [BK]1 pad_profile = torch.zeros_like(predictions) # [BK]1 if encoder_outputs is not None: t = encoder_outputs.shape[1] # [BK]1T attention_weights = torch.empty( - bs * k, 1, t, device=self._device + bs * k, 1, t, dtype=fdtype, device=self._device ).fill_(1. / t) else: attention_weights = None diff --git a/docs/_modules/nemo/backends/pytorch/common/zero_data.html b/docs/_modules/nemo/backends/pytorch/common/zero_data.html index 45f68cf92b05..80f74012ac9e 100644 --- a/docs/_modules/nemo/backends/pytorch/common/zero_data.html +++ b/docs/_modules/nemo/backends/pytorch/common/zero_data.html @@ -88,8 +88,9 @@
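The `fdtype` changes above make freshly created score and attention tensors match the decoder embedding's dtype, which matters when the model runs in half precision under amp. A hypothetical stand-alone illustration (shapes and sizes are placeholders):

.. code-block:: python

    import torch

    embedding = torch.nn.Embedding(10, 4).half()
    fdtype = embedding.weight.dtype                       # torch.float16
    predictions = torch.zeros(3, 1, dtype=torch.long)
    # Score/attention buffers are created in the same dtype as the weights,
    # avoiding float32/float16 mismatches later in the search.
    scores = torch.zeros_like(predictions, dtype=fdtype)
    attention = torch.empty(3, 1, 7, dtype=fdtype).fill_(1.0 / 7)
    print(scores.dtype, attention.dtype)                  # float16 float16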
  • diff --git a/docs/_modules/nemo/backends/pytorch/module_wrapper.html b/docs/_modules/nemo/backends/pytorch/module_wrapper.html index 75f1f4d7df41..22f3079803f4 100644 --- a/docs/_modules/nemo/backends/pytorch/module_wrapper.html +++ b/docs/_modules/nemo/backends/pytorch/module_wrapper.html @@ -88,8 +88,9 @@
  • @@ -180,7 +181,7 @@

    Source code for nemo.backends.pytorch.module_wrapper

    self._input_ports = input_ports_dict self._output_ports = output_ports_dict self._device = t.device( - "cuda" if self.placement == DeviceType.GPU or DeviceType.AllGpu + "cuda" if self.placement in [DeviceType.GPU, DeviceType.AllGpu] else "cpu" ) self._pt_module = pt_nn_module diff --git a/docs/_modules/nemo/backends/pytorch/nm.html b/docs/_modules/nemo/backends/pytorch/nm.html index 3b97ceebb3f4..d6f601d11255 100644 --- a/docs/_modules/nemo/backends/pytorch/nm.html +++ b/docs/_modules/nemo/backends/pytorch/nm.html @@ -88,8 +88,9 @@
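The placement check rewritten just above fixes a classic Python pitfall: `x == A or B` parses as `(x == A) or B`, and a bare enum member is always truthy, so the old expression could never be False. A self-contained illustration (the enum member values are placeholders, not NeMo's):

.. code-block:: python

    from enum import Enum

    class DeviceType(Enum):
        CPU = 1
        GPU = 2
        AllGpu = 3

    placement = DeviceType.CPU
    old_condition = placement == DeviceType.GPU or DeviceType.AllGpu  # always truthy
    new_condition = placement in [DeviceType.GPU, DeviceType.AllGpu]  # False here
    print(bool(old_condition), new_condition)  # True False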
  • @@ -191,9 +192,7 @@

    Source code for nemo.backends.pytorch.nm

             NeuralModule.__init__(self, **kwargs)  # For NeuralModule API
             nn.Module.__init__(self)  # For PyTorch API
             self._device = t.device(
    -            "cuda"
    -            if self.placement == DeviceType.GPU or self.placement ==
    -            DeviceType.AllGpu
    +            "cuda" if self.placement in [DeviceType.GPU, DeviceType.AllGpu]
                 else "cpu"
             )
     
    @@ -224,9 +223,9 @@ 


     
    [docs] def tie_weights_with(self, module, weight_names, name2name_and_transform=None): if module is None: - raise ValueError("Module with which to tie weights can't be None") + raise ValueError("Module to tie weights can't be None") if weight_names is None or len(weight_names) == 0: - raise ValueError("Please provide weigth names to tie") + raise ValueError("Please provide weight names to tie") if name2name_and_transform is None: for name in weight_names: @@ -261,7 +260,7 @@


     
    [docs] def restore_from(self, path, local_rank=0): # self._pt_module.load_state_dict(t.load(path)) if self.placement == DeviceType.AllGpu: - load_device = "cuda:{}".format(local_rank) + load_device = f"cuda:{local_rank}" else: load_device = self._device self.load_state_dict(t.load(path, map_location=load_device))
    @@ -295,9 +294,7 @@


         def __init__(self, **kwargs):
             NeuralModule.__init__(self, **kwargs)  # For NeuralModule API
             self._device = t.device(
    -            "cuda"
    -            if self.placement == DeviceType.GPU or self.placement ==
    -            DeviceType.AllGpu
    +            "cuda" if self.placement in [DeviceType.GPU, DeviceType.AllGpu]
                 else "cpu"
             )
     
    @@ -310,7 +307,7 @@ 


                 return NeuralModule.__call__(self, **kwargs)
     
     
    [docs] def forward(self, *input): - r"""Defines the computation performed at every call. + """Defines the computation performed at every call. Should be overridden by all subclasses. """ @@ -510,8 +507,7 @@


             pass
     
         def __call__(self, force_pt=False, *input, **kwargs):
    -        pt_call = force_pt
    -        if pt_call:
    +        if force_pt:
                 return self._loss_function(**kwargs)
             else:
                 return NeuralModule.__call__(self, **kwargs)
    diff --git a/docs/_modules/nemo/core/neural_factory.html b/docs/_modules/nemo/core/neural_factory.html index 3b6e41e00fd1..b1119e5edacf 100644 --- a/docs/_modules/nemo/core/neural_factory.html +++ b/docs/_modules/nemo/core/neural_factory.html @@ -88,8 +88,9 @@
  • @@ -155,13 +156,15 @@

    Source code for nemo.core.neural_factory

     # Copyright (c) 2019 NVIDIA Corporation
    -import logging
    -import numpy as np
    +import random
     from abc import ABC, abstractmethod
     from typing import List, Optional
     
    -from .callbacks import ActionCallback
    +import numpy as np
    +
    +from .callbacks import ActionCallback, EvaluatorCallback
     from .neural_types import *
    +from ..utils import ExpManager
     
     
     
    [docs]class Backend(Enum): @@ -179,14 +182,13 @@


     
     
     
    [docs]class Optimization(Enum): - """Various levels of Optimization. + """Various levels of Apex/amp Optimization. WARNING: This might have effect on model accuracy.""" - nothing = 0 - mxprO0 = 1 - mxprO1 = 2 - mxprO2 = 3 - mxprO3 = 4
    + mxprO0 = 0 + mxprO1 = 1 + mxprO2 = 2 + mxprO3 = 3
    [docs]class DeviceType(Enum): @@ -200,13 +202,10 @@


     
    [docs]class Actions(ABC): """Basic actions allowed on graphs of Neural Modules""" - def __init__(self, params, local_rank): - self._parameters = params + def __init__(self, local_rank, + optimization_level=Optimization.mxprO0): self._local_rank = local_rank - if "optimization_level" in params: - self._optim_level = params["optimization_level"] - else: - self._optim_level = Optimization.nothing + self._optim_level = optimization_level self.step = None self.epoch_num = None @@ -223,7 +222,6 @@


         def train(
                 self,
                 tensors_to_optimize: List[NmTensor],
    -            tensors_to_evaluate: Optional[List[NmTensor]],
                 callbacks: Optional[List[ActionCallback]],
                 lr_policy=None,
                 batches_per_step=None,
    @@ -234,7 +232,6 @@ 


             Args:
                 tensors_to_optimize: which tensors to optimize. Typically this is
                     single loss tesnor.
    -            tensors_to_evaluate: which tensors to compute during evaluation.
                 callbacks: list of callback objects
                 lr_policy: function which should take (initial_lr, step, epoch) and
                     return learning rate
    @@ -286,10 +283,32 @@ 


     
             Returns:
     
    +        """
    +        pass
    + +
[docs] @abstractmethod + def create_optimizer( + self, + optimizer, + things_to_optimize, + optimizer_params): + """ + Creates an optimizer object to be used in the train() method. + + Args: + optimizer: Specifies which optimizer to use. + things_to_optimize: A list of neural modules or tensors to be + optimized. + optimizer_params: Specifies the parameters of the optimizer + + Returns: + Optimizer """ pass
    def _perform_on_iteration_start(self, callbacks): + # TODO: Most of these checks can be relaxed since we enforce callbacks + # to be a list of ActionCallback objects if callbacks is not None and isinstance(callbacks, List) and len( callbacks) > 0: for callback in callbacks: @@ -331,26 +350,34 @@


                     callback._local_rank = self.local_rank
                     callback.on_epoch_end()
     
    -    def _fill_callbacks(
    +    def _init_callbacks(self, callbacks):
    +        if callbacks is not None and isinstance(callbacks, List) and len(
    +                callbacks) > 0:
    +            for callback in callbacks:
    +                callback.action = self
    +
    +    def _update_callbacks(
                 self,
                 callbacks=None,
    -            tensors_to_optimize=None,
    -            tensors_to_evaluate=None,
                 registered_tensors=None,
         ):
             # if self.local_rank is None or self.local_rank == 0:
             if callbacks is not None and isinstance(callbacks, List) and len(
                     callbacks) > 0:
                 for callback in callbacks:
    -                callback._step = self.step
    -                callback._epoch_num = self.epoch_num
    -                callback._tensors_to_optimize = tensors_to_optimize
    -                callback._tensors_to_evaluate = tensors_to_evaluate
    -                callback._registered_tensors = registered_tensors
    -                callback._local_rank = self.local_rank
    + callback._registered_tensors = registered_tensors
    + + +def _str_to_opt_level(opt_str: str) -> Optimization: + number = int(opt_str[1:]) + if number not in Optimization._value2member_map_: + raise ValueError(f"Unknown optimization value {opt_str}") + return Optimization(number)
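`_str_to_opt_level` lets the factory accept amp-style strings such as "O1" in addition to `Optimization` members. A standalone version of the same conversion (the enum values are copied from the hunk above):

.. code-block:: python

    from enum import Enum

    class Optimization(Enum):
        mxprO0 = 0
        mxprO1 = 1
        mxprO2 = 2
        mxprO3 = 3

    def str_to_opt_level(opt_str: str) -> Optimization:
        number = int(opt_str[1:])
        if number not in Optimization._value2member_map_:
            raise ValueError(f"Unknown optimization value {opt_str}")
        return Optimization(number)

    print(str_to_opt_level("O2"))  # Optimization.mxprO2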
    [docs]class NeuralModuleFactory(object): + _DEFAULT = None + """ Neural Module Factory instance is used to create neural modules and trainers @@ -371,24 +398,46 @@


                 randomness. This should be used for debugging purposes as it might
                 have negative impact on performance. Can't be used when
                 `cudnn_benchmark=True`.
    +        master_process (bool): (default True) Flag for master process
    +            indication
    +        set_default (bool): (default True) True if should set this instance as
    +            default factory for modules instantiating.
         """
     
         def __init__(
                 self,
                 backend=Backend.PyTorch,
                 local_rank=None,
    -            optimization_level=Optimization.nothing,
    -            placement=DeviceType.GPU,
    +            optimization_level=Optimization.mxprO0,
    +            placement=None,
                 cudnn_benchmark=False,
                 random_seed=None,
    -            master_process=True,
    +            set_default=True,
    +            log_dir=None,
    +            checkpoint_dir=None,
    +            tensorboard_dir=None,
    +            create_tb_writer=False,
    +            files_to_copy=None,
    +            add_time_to_log_dir=False
         ):
             self._local_rank = local_rank
    +
    +        if isinstance(optimization_level, str):
    +            optimization_level = _str_to_opt_level(optimization_level)
             self._optim_level = optimization_level
    -        self._placement = placement
    +
    +        if placement is None:
    +            if local_rank is not None:
    +                device = DeviceType.AllGpu
    +            else:
    +                device = DeviceType.GPU
    +
    +            self._placement = device
    +        else:
    +            self._placement = placement
    +
             self._backend = backend
             self._world_size = 1
    -        self._master_process = master_process
             if backend == Backend.PyTorch:
                 # TODO: Move all framework specific code from this file
                 import torch
    @@ -401,6 +450,7 @@ 


                     torch.backends.cudnn.benchmark = False
                     torch.manual_seed(random_seed)
                     np.random.seed(random_seed)
    +                random.seed(random_seed)
     
                 if self._local_rank is not None:
                     torch.cuda.set_device(self._local_rank)
    @@ -412,6 +462,37 @@ 


                 raise NotImplementedError(
                     "Only Pytorch backend is currently supported.")
     
    +        if set_default:
    +            NeuralModuleFactory.set_default_factory(self)
    +
    +        # Create ExpManager
    +        # if log_dir is None, only create logger
    +        self._exp_manager = ExpManager(
    +            work_dir=log_dir,
    +            ckpt_dir=checkpoint_dir,
    +            use_tb=create_tb_writer,
    +            tb_dir=tensorboard_dir,
    +            local_rank=local_rank,
    +            files_to_copy=files_to_copy,
    +            add_time=add_time_to_log_dir,
    +            exist_ok=True)
    +        self._tb_writer = self._exp_manager.tb_writer
    +
    +        # Create trainer
    +        self._trainer = self._get_trainer(tb_writer=self._tb_writer)
    +
    +
    [docs] @classmethod + def get_default_factory(cls): + return cls._DEFAULT
    + +
    [docs] @classmethod + def set_default_factory(cls, factory): + cls._DEFAULT = factory
    + +
    [docs] @classmethod + def reset_default_factory(cls): + cls._DEFAULT = None
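`_DEFAULT`, together with `set_default_factory`/`get_default_factory`/`reset_default_factory`, gives the most recently created factory process-wide visibility so modules created later can find it without being handed a reference. A hypothetical, dependency-free sketch of the pattern:

.. code-block:: python

    class Factory:
        _DEFAULT = None

        def __init__(self, set_default=True):
            if set_default:
                Factory.set_default_factory(self)

        @classmethod
        def get_default_factory(cls):
            return cls._DEFAULT

        @classmethod
        def set_default_factory(cls, factory):
            cls._DEFAULT = factory

        @classmethod
        def reset_default_factory(cls):
            cls._DEFAULT = None

    f = Factory()
    assert Factory.get_default_factory() is f
    Factory.reset_default_factory()
    assert Factory.get_default_factory() is None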
    + @staticmethod def __name_import(name): components = name.split(".") @@ -541,17 +622,16 @@


             """
             if params is not None and "optimization_level" in params:
                 if params["optimization_level"] != self._optim_level:
    -                if self._master_process:
    -                    logging.warning(
    -                        "Module's {0} requested optimization level {1} is"
    -                        "different from the one specified by factory - {2}."
    -                        "Using: {3} for this module".format(
    -                            name,
    -                            params["optimization_level"],
    -                            self._optim_level,
    -                            params["optimization_level"],
    -                        )
    +                self.logger.warning(
    +                    "Module's {0} requested optimization level {1} is"
    +                    "different from the one specified by factory - {2}."
    +                    "Using: {3} for this module".format(
    +                        name,
    +                        params["optimization_level"],
    +                        self._optim_level,
    +                        params["optimization_level"],
                         )
    +                )
             else:
                 if params is None:
                     params = {}
    @@ -565,23 +645,94 @@ 


             else:
                 return None
    -
    [docs] def get_trainer(self, params, tb_writer=None): +
    [docs] def create_optimizer(self, + optimizer, + things_to_optimize, + optimizer_params): + return self._trainer.create_optimizer( + optimizer=optimizer, + things_to_optimize=things_to_optimize, + optimizer_params=optimizer_params)
    + +
    [docs] def train(self, + tensors_to_optimize, + optimizer=None, + optimization_params=None, + callbacks: Optional[List[ActionCallback]] = None, + lr_policy=None, + batches_per_step=None, + stop_on_nan_loss=False, + reset=False): + if reset: + self.reset_trainer() + return self._trainer.train( + tensors_to_optimize=tensors_to_optimize, + optimizer=optimizer, + optimization_params=optimization_params, + callbacks=callbacks, + lr_policy=lr_policy, + batches_per_step=batches_per_step, + stop_on_nan_loss=stop_on_nan_loss)
    + +
[docs] def eval(self, + callbacks: List[EvaluatorCallback]): + if callbacks is None or len(callbacks) == 0: + raise ValueError(f"You need to provide at least one evaluation " + f"callback to eval") + for callback in callbacks: + if not isinstance(callback, EvaluatorCallback): + raise TypeError(f"All callbacks passed to the eval action must " + f"be inherited from EvaluatorCallback") + self.train( + tensors_to_optimize=None, + optimizer='sgd', + callbacks=callbacks + )
    + +
    [docs] def infer(self, tensors: List[NmTensor], checkpoint_dir=None, + ckpt_pattern=''): + return self._trainer.infer( + tensors=tensors, checkpoint_dir=checkpoint_dir, + ckpt_pattern=ckpt_pattern, logger=self.logger)
    + + def _get_trainer(self, tb_writer=None): if self._backend == Backend.PyTorch: - params["optimization_level"] = self._optim_level constructor = NeuralModuleFactory.__name_import( "nemo.backends.pytorch.PtActions" ) - instance = constructor(params=params, - local_rank=self._local_rank, - tb_writer=tb_writer) + instance = constructor(local_rank=self._local_rank, + tb_writer=tb_writer, + optimization_level=self._optim_level) return instance else: - raise ValueError("Only PyTorch backend is currently supported.")
    + raise ValueError("Only PyTorch backend is currently supported.") + +
[docs] def get_trainer(self, tb_writer=None): + self.logger.warning( + f"This function is deprecated and will be removed " + f"in future versions of NeMo. " + f"Please use the .train(...), .eval(...), .infer(...) and " + f".create_optimizer(...) methods directly from the " + f"NeuralModuleFactory instance.") + if self._trainer: + self.logger.warning( + "The trainer instance was created during initialization of " + "Neural factory, using the already created instance.") + return self._trainer + return self._get_trainer(tb_writer)
    + +
    [docs] def reset_trainer(self): + del self._trainer + self._trainer = self._get_trainer(tb_writer=self._tb_writer)
    @property def world_size(self): return self._world_size + @property + def tb_writer(self): + return self._tb_writer + @property def placement(self): return self._placement @@ -591,8 +742,16 @@


             return self._optim_level
     
         @property
    -    def master_process(self):
    -        return self._master_process
    + def logger(self): + return self._exp_manager.logger + + @property + def checkpoint_dir(self): + return self._exp_manager.ckpt_dir + + @property + def work_dir(self): + return self._exp_manager.work_dir
    diff --git a/docs/_modules/nemo/core/neural_modules.html b/docs/_modules/nemo/core/neural_modules.html index b27e6e9d9a3a..9f8126ec8fa4 100644 --- a/docs/_modules/nemo/core/neural_modules.html +++ b/docs/_modules/nemo/core/neural_modules.html @@ -88,8 +88,9 @@
  • @@ -159,9 +160,11 @@

    Source code for nemo.core.neural_modules

     import uuid
     import logging
     from abc import ABC, abstractmethod
    +from collections import namedtuple
     from enum import Enum
     from typing import Optional, Dict, Set, Tuple, List
     from inspect import getargvalues, stack
    +from nemo.core import NeuralModuleFactory
     
     from .neural_factory import Optimization, DeviceType
     from .neural_types import CanNotInferResultNeuralType,\
    @@ -178,10 +181,17 @@ 


         TRANSPOSE = 1
+PretrainedModelInfo = namedtuple("PretrainedModelInfo", + ("pretrained_model_name", "description", + "parameters", "location")) + +
    [docs]class NeuralModule(ABC): """Abstract class that every Neural Module must inherit from. Args: + pretrained_model_name (str): name of pretrained model to use in order + to initialize this neural module create_port_args (dict): arguments that are passed to create_ports() factory (NeuralModuleFactory): :class:`NeuralModuleFactory` which created or which should mange this instance. Required for @@ -193,12 +203,13 @@


     
         def __init__(
                 self, *,
    +            pretrained_model_name=None,
                 create_port_args=None,
                 factory=None,
                 placement=None,
                 **kwargs
         ):
    -
    +        self._pretrained_model_name = pretrained_model_name
             self._local_parameters = self.update_local_params()
     
             if create_port_args is None:
    @@ -206,13 +217,17 @@ 


             self._input_ports, self._output_ports = self.create_ports(
                 **create_port_args)
     
    +        default_factory = NeuralModuleFactory.get_default_factory()
    +        if (factory is None) and (default_factory is not None):
    +            factory = default_factory
    +
             # Set module properties from factory else use defaults
             self._placement = factory.placement if factory is not None\
                 else DeviceType.GPU
             self._opt_level = factory.optim_level if factory is not None\
    -            else Optimization.nothing
    -        self._master_process = factory.master_process if factory is not None\
    -            else True
    +            else Optimization.mxprO0
    +        self._logger = factory.logger if factory is not None\
    +            else logging
     
             # Update module properties using overrides if overrides exist
             if placement is not None:
    @@ -221,12 +236,16 @@ 


             self._factory = factory
             self._uuid = str(uuid.uuid4())
     
    -        if self._master_process and kwargs:
    -            logging.warning(
    +        if kwargs:
    +            self._logger.warning(
                     "When constructing {}. The base "
                     "NeuralModule class received the following unused "
                     "arguments:".format(self.__class__.__name__))
    -            logging.warning("{}".format(kwargs.keys()))
    +            self._logger.warning("{}".format(kwargs.keys()))
    +
    +
    [docs] @staticmethod + def pretrained_storage(): + return ''
    def __call__(self, **kwargs): """This method allows objects to be called with their port names @@ -327,6 +346,9 @@


                     )
                 return tuple(result)
     
    +    def __str__(self):
    +        return self.__class__.__name__
    +
     
    [docs] @abstractmethod def get_weights(self) -> Optional[Dict[(str, bool)]]: """Returns NeuralModule's weights copy. @@ -366,6 +388,22 @@


             """
             pass
    +
    [docs] @staticmethod + def list_pretrained_models() -> Optional[List[PretrainedModelInfo]]: + """List all available pre-trained models (e.g. weights) for this NM. + + Returns: + A list of PretrainedModelInfo tuples. + The pretrained_model_name field of the tuple can be used to + retrieve pre-trained model's weights (pass it as + pretrained_model_name argument to the module's constructor) + """ + return None
    + +
    [docs] def get_config_dict_and_checkpoint(self, pretrained_model_name): + """WARNING: This part is work in progress""" + return None
    +
    [docs] @abstractmethod def tie_weights_with( self, diff --git a/docs/_modules/nemo/core/neural_types.html b/docs/_modules/nemo/core/neural_types.html index d4ba2af78c18..96923e55d98c 100644 --- a/docs/_modules/nemo/core/neural_types.html +++ b/docs/_modules/nemo/core/neural_types.html @@ -88,8 +88,9 @@
  • @@ -163,6 +164,7 @@

    Source code for nemo.core.neural_types

     of incompatible types.
     """
     from enum import Enum
    +import uuid
     
     
     
    [docs]class BaseTag(object): @@ -446,6 +448,7 @@


             self._producer = producer
             self._producer_args = producer_args
             self._name = name
    +        self._uuid = str(uuid.uuid4())
     
         @property
         def producer(self):
    @@ -484,7 +487,7 @@ 


             """
             if self._producer is None:
                 raise ValueError("This NmTensor does not have a unique name")
    -        return self._name + "~~~" + self.producer._uuid
    + return f"{self._name}~~~{self.producer}~~~{self._uuid}"
    [docs]class NeuralTypeError(Exception): diff --git a/docs/_modules/nemo_asr/data_layer.html b/docs/_modules/nemo_asr/data_layer.html index fafcd72310d2..457e9d2650c0 100644 --- a/docs/_modules/nemo_asr/data_layer.html +++ b/docs/_modules/nemo_asr/data_layer.html @@ -88,8 +88,9 @@
  • @@ -162,7 +163,7 @@

    Source code for nemo_asr.data_layer

     import torch
     from apex import amp
     
    -from nemo.backends.pytorch.nm import DataLayerNM, NonTrainableNM
    +from nemo.backends.pytorch.nm import DataLayerNM, TrainableNM, NonTrainableNM
     from nemo.core import Optimization, DeviceType
     from nemo.core.neural_types import *
     from .parts.dataset import AudioDataset, seq_collate_fn
    @@ -268,13 +269,12 @@ 


                 labels=labels,
                 featurizer=self._featurizer, max_duration=max_duration,
                 min_duration=min_duration, normalize=normalize_transcripts,
    -            trim=trim_silence, verbose=self._master_process,
    +            trim=trim_silence, logger=self._logger,
                 eos_id=eos_id, load_audio=load_audio
             )
     
             if self._placement == DeviceType.AllGpu:
    -            if self._master_process:
    -                print('Parallelizing DATALAYER')
    +            self._logger.info('Parallelizing DATALAYER')
                 sampler = torch.utils.data.distributed.DistributedSampler(
                     self._dataset)
             else:
    @@ -302,7 +302,7 @@ 


             return self._dataloader
    -
    [docs]class AudioPreprocessing(NonTrainableNM): +
    [docs]class AudioPreprocessing(TrainableNM): """ Neural Module that does batch processing of audio files and converts them to spectrogram representations @@ -388,7 +388,7 @@


                 raise NotImplementedError("AudioPreprocessing currently only "
                                           "accepts 'fbank' or 'logfbank' as "
                                           "feat_type")
    -        NonTrainableNM.__init__(self, **kwargs)
    +        TrainableNM.__init__(self, **kwargs)
     
             self.featurizer = FilterbankFeatures(
                 sample_rate=sample_rate,
    @@ -404,14 +404,14 @@ 


                 dither=dither,
                 pad_to=pad_to,
                 frame_splicing=frame_splicing,
    -            stft_conv=stft_conv
    +            stft_conv=stft_conv,
    +            logger=self._logger
             )
             # _pre_procesing_config = self.local_parameters
             # self.featurizer = FeatureFactory.from_config(_pre_procesing_config)
             self.featurizer.to(self._device)
     
    -        stft_conv = kwargs.get("stft_conv", False)
    -        self.disable_casts = (self._opt_level != Optimization.nothing and
    +        self.disable_casts = (self._opt_level == Optimization.mxprO1 and
                                   not stft_conv)
     
         def forward(self, input_signal, length):
    diff --git a/docs/_modules/nemo_asr/jasper.html b/docs/_modules/nemo_asr/jasper.html
    index ffcb99c326d6..bf5fe4e9b5af 100644
    --- a/docs/_modules/nemo_asr/jasper.html
    +++ b/docs/_modules/nemo_asr/jasper.html
    @@ -88,8 +88,9 @@
     
  • diff --git a/docs/_sources/api-docs/modules.rst.txt b/docs/_sources/api-docs/modules.rst.txt index 61547e382b27..3c1e96408bbc 100644 --- a/docs/_sources/api-docs/modules.rst.txt +++ b/docs/_sources/api-docs/modules.rst.txt @@ -1,7 +1,7 @@ -NEMO API +NeMo API ======== .. toctree:: :maxdepth: 8 - nemo \ No newline at end of file + nemo diff --git a/docs/_sources/asr/datasets.rst.txt b/docs/_sources/asr/datasets.rst.txt index c3eb72f9dadf..5651cb1e16e7 100644 --- a/docs/_sources/asr/datasets.rst.txt +++ b/docs/_sources/asr/datasets.rst.txt @@ -36,6 +36,90 @@ Switchboard and CallHome ------------------------ coming soon ... +Fisher English Training Speech +------------------------------ + +Run these scripts to convert the Fisher English Training Speech data into a format expected by the `nemo_asr` collection. + +In brief, the following scripts convert the .sph files to .wav, slice those files into smaller audio samples, match the smaller slices with their corresponding transcripts, and split the resulting audio segments into train, validation, and test sets (with one manifest each). + +.. note:: + You will need at least 106GB of space to run the .wav conversion, and an additional 105GB for the slicing and matching. + You will need to have sph2pipe installed in order to run the .wav conversion. + + +**Instructions** + +These scripts assume that you already have the Fisher dataset from the Linguistic Data Consortium, with a directory structure that looks something like this: + +.. code-block:: bash + + FisherEnglishTrainingSpeech/ + ├── LDC2004S13-Part1 + │   ├── fe_03_p1_transcripts + │   ├── fisher_eng_tr_sp_d1 + │   ├── fisher_eng_tr_sp_d2 + │   ├── fisher_eng_tr_sp_d3 + │   └── ... + └── LDC2005S13-Part2 + ├── fe_03_p2_transcripts + ├── fe_03_p2_sph1 + ├── fe_03_p2_sph2 + ├── fe_03_p2_sph3 + └── ... + +The transcripts that will be used are located in `fe_03_p<1,2>_transcripts/data/trans`, and the audio files (.sph) are located in the remaining directories in an `audio` subdirectory. + +First, convert the audio files from .sph to .wav by running: + +.. code-block:: bash + + cd /scripts + python fisher_audio_to_wav.py \ + --data_root= --dest_root= + +This will place the unsliced .wav files in `/LDC200[4,5]S13-Part[1,2]/audio-wav/`. +It will take several minutes to run. + +Next, process the transcripts and slice the audio data: + +.. code-block:: bash + + python process_fisher_data.py \ + --audio_root= --transcript_root= \ + --dest_root= \ + --remove_noises + +This script will split the full dataset into train, validation, and test sets, and place the audio slices in the corresponding folders in the destination directory. +One manifest will be written out per set, which includes each slice's transcript, duration, and path. + +This will likely take around 20 minutes to run. +Once finished, you may delete the 10 minute long .wav files if you wish. + +2000 HUB5 English Evaluation Speech +----------------------------------- + +Run the following script to convert the HUB5 data into a format expected by the `nemo_asr` collection. + +Similarly to the Fisher dataset processing scripts, this script converts the .sph files to .wav, slices the audio files and transcripts into utterances, and combines them into segments of some minimum length (default is 10 seconds). +The resulting segments are all written out to an audio directory, and the corresponding transcripts are written to a manifest JSON. + +.. note:: + You will need 5GB of free space to run this script. 
+ You will also need to have sph2pipe installed. + +This script assumes you already have the 2000 HUB5 dataset from the Linguistic Data Consortium. + +Run the following to process the 2000 HUB5 English Evaluation Speech samples: + +.. code-block:: bash + + python process_hub5_data.py \ + --data_root= \ + --dest_root= + +You may optionally include `--min_slice_duration=` if you would like to change the minimum audio segment duration. + Building Your Own Dataset ------------------------- coming soon ... diff --git a/docs/_sources/asr/tutorial.rst.txt b/docs/_sources/asr/tutorial.rst.txt index e1022a38f61c..982feca96db2 100644 --- a/docs/_sources/asr/tutorial.rst.txt +++ b/docs/_sources/asr/tutorial.rst.txt @@ -10,7 +10,7 @@ See :ref:`installation` section. Introduction ------------- -This Automatic Speech Recognition (ASR) tutorial is focused on Jasper :cite:`li2019jasper` model. Jasper is CTC-based :cite:`graves2006` end-to-end model. The model is called "end-to-end" because it transcripts speech samples without any additional alignmet information. CTC allows finding an alignment between audio and text. +This Automatic Speech Recognition (ASR) tutorial is focused on Jasper :cite:`li2019jasper` model. Jasper is CTC-based :cite:`graves2006` end-to-end model. The model is called "end-to-end" because it transcripts speech samples without any additional alignment information. CTC allows finding an alignment between audio and text. CTC-ASR training pipeline consists of the following blocks: 1. audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC) @@ -25,7 +25,7 @@ CTC-ASR training pipeline consists of the following blocks: Get data -------- -We will be using an open-source Librispeech :cite:`panayotov2015librispeech` dataset. These scripts will download and convert Librispeech into format expected by `nemo_asr`: +We will be using an open-source LibriSpeech :cite:`panayotov2015librispeech` dataset. These scripts will download and convert LibriSpeech into format expected by `nemo_asr`: .. code-block:: bash @@ -42,7 +42,7 @@ We will be using an open-source Librispeech :cite:`panayotov2015librispeech` d You should have at least 26GB of disk space available if you've used ``--data_set=dev_clean,train_clean_100``; and at least 110GB if you used ``--data_set=ALL``. Also, it will take some time to download and process, so go grab a coffee. -After donwload and conversion are completed, your `data` folder should contain 2 manifests: +After download and conversion, your `data` folder should contain 2 json files: * dev_clean.json * train_clean_100.json @@ -85,16 +85,19 @@ The script below does both training (on `train_clean_100.json`) and evaluation ( # NeMo's ASR collection import nemo_asr - # We will use tensorboardX to keep track of train loss, eval wer etc. 
- from tensorboardX import SummaryWriter - - tb_writer = SummaryWriter('jasper12x1SEP') + # Create a Neural Factory + # It creates log files and tensorboard writers for us among other functions + nf = nemo.core.NeuralModuleFactory( + log_dir='jasper12x1SEP', + create_tb_writer=True) + tb_writer = nf.tb_writer + logger = nf.logger # Path to our training manifest - train_manifest = "/train_clean_100.json" + train_dataset = "/train_clean_100.json" # Path to our validation manifest - val_manifest = "/dev_clean.json" + eval_datasets = "/dev_clean.json" # Jasper Model definition from ruamel.yaml import YAML @@ -107,21 +110,25 @@ The script below does both training (on `train_clean_100.json`) and evaluation ( labels = jasper_model_definition['labels'] # Instantiate neural modules - data_layer = nemo_asr.AudioToTextDataLayer(manifest_filepath=train_manifest, + data_layer = nemo_asr.AudioToTextDataLayer( + manifest_filepath=train_dataset, labels=labels, batch_size=32) - data_layer_val = nemo_asr.AudioToTextDataLayer(manifest_filepath=val_manifest, + data_layer_val = nemo_asr.AudioToTextDataLayer( + manifest_filepath=eval_datasets, labels=labels, batch_size=32, shuffle=False) data_preprocessor = nemo_asr.AudioPreprocessing() spec_augment = nemo_asr.SpectrogramAugmentation(rect_masks=5) - jasper_encoder = nemo_asr.JasperEncoder(feat_in=64, + jasper_encoder = nemo_asr.JasperEncoder( + feat_in=64, **jasper_model_definition['JasperEncoder']) - jasper_decoder = nemo_asr.JasperDecoderForCTC(feat_in=1024, num_classes=len(labels)) + jasper_decoder = nemo_asr.JasperDecoderForCTC( + feat_in=1024, num_classes=len(labels)) ctc_loss = nemo_asr.CTCLossNM(num_classes=len(labels)) greedy_decoder = nemo_asr.GreedyCTCDecoder() - ## Training DAG (Model) + # Training DAG (Model) audio_signal, audio_signal_len, transcript, transcript_len = data_layer() processed_signal, processed_signal_len = data_preprocessor( input_signal=audio_signal, length=audio_signal_len) @@ -130,7 +137,8 @@ The script below does both training (on `train_clean_100.json`) and evaluation ( audio_signal=aug_signal, length=processed_signal_len) log_probs = jasper_decoder(encoder_output=encoded) predictions = greedy_decoder(log_probs=log_probs) - loss = ctc_loss(log_probs=log_probs, targets=transcript, + loss = ctc_loss( + log_probs=log_probs, targets=transcript, input_length=encoded_len, target_length=transcript_len) # Validation DAG (Model) @@ -144,29 +152,36 @@ The script below does both training (on `train_clean_100.json`) and evaluation ( audio_signal=processed_signal_v, length=processed_signal_len_v) log_probs_v = jasper_decoder(encoder_output=encoded_v) predictions_v = greedy_decoder(log_probs=log_probs_v) - loss_v = ctc_loss(log_probs=log_probs_v, targets=transcript_v, + loss_v = ctc_loss( + log_probs=log_probs_v, targets=transcript_v, input_length=encoded_len_v, target_length=transcript_len_v) # These helper functions are needed to print and compute various metrics # such as word error rate and log them into tensorboard - # they are domain-specific and are provided by NEMO's collections + # they are domain-specific and are provided by NeMo's collections from nemo_asr.helpers import monitor_asr_train_progress, \ process_evaluation_batch, process_evaluation_epoch + from functools import partial # Callback to track loss and print predictions during training train_callback = nemo.core.SimpleLossLoggerCallback( - tensorboard_writer=tb_writer, - # How to print loss to screen - tensor_list2string=lambda x: str(x[0].item()), - # How to print 
predictions and compute WER for train batches - tensor_list2string_evl=lambda x: monitor_asr_train_progress(x, labels=labels)) - - saver_callback = nemo.core.ModuleSaverCallback( - save_modules_list=[jasper_encoder, jasper_decoder], + tb_writer=tb_writer, + # Define the tensors that you want SimpleLossLoggerCallback to + # operate on + # Here we want to print our loss, and our word error rate which + # is a function of our predictions, transcript, and transcript_len + tensors=[loss, predictions, transcript, transcript_len], + # To print logs to screen, define a print_func + print_func=partial( + monitor_asr_train_progress, + labels=labels, + logger=logger + )) + + saver_callback = nemo.core.CheckpointCallback( folder="./", - # If set to x > 0 it will save modules every x steps - # If set to = -1 it will only save once, after training is done - step_frequency=-1) + # Set how often we want to save checkpoints + step_freq=100) # PRO TIP: while you can only have 1 train DAG, you can have as many # val DAGs and callbacks as you want. This is useful if you want to monitor @@ -175,27 +190,32 @@ The script below does both training (on `train_clean_100.json`) and evaluation ( eval_callback = nemo.core.EvaluatorCallback( eval_tensors=[loss_v, predictions_v, transcript_v, transcript_len_v], # how to process evaluation batch - e.g. compute WER - user_iter_callback=lambda x, y: process_evaluation_batch(x, y, labels=labels), + user_iter_callback=partial( + process_evaluation_batch, + labels=labels + ), # how to aggregate statistics (e.g. WER) for the evaluation epoch - user_epochs_done_callback=lambda x: process_evaluation_epoch(x, tag="DEV-CLEAN"), + user_epochs_done_callback=partial( + process_evaluation_epoch, tag="DEV-CLEAN", logger=logger + ), eval_step=500, - tensorboard_writer=tb_writer) - - # Neural Module Factory manages training - # You will need to specify which backend to use - # Currently we only support PyTorch - nf = nemo.core.NeuralModuleFactory() - - # Optimizer - optimizer = nf.get_trainer(params={"optimizer_kind": "novograd", - "optimization_params": {"num_epochs": 50, "lr": 0.02, "weight_decay": 1e-4}}) - - # Run training + tb_writer=tb_writer) + + # Run training using your Neural Factory # Once this "action" is called data starts flowing along train and eval DAGs # and computations start to happen - optimizer.train(tensors_to_optimize=[loss], + nf.train( + # Specify the loss to optimize for + tensors_to_optimize=[loss], + # Specify which callbacks you want to run callbacks=[train_callback, eval_callback, saver_callback], - tensors_to_evaluate=[predictions, transcript, transcript_len]) + # Specify what optimizer to use + optimizer="novograd", + # Specify optimizer parameters such as num_epochs and lr + optimization_params={ + "num_epochs": 50, "lr": 0.02, "weight_decay": 1e-4 + } + ) .. note:: This script trains should finish 50 epochs in about 7 hours on GTX 1080. @@ -239,7 +259,7 @@ Enabling multi-GPU training with NeMo is easy: .. code-block:: bash - python -m torch.distributed.launch --nproc_per_node=8 /examples/asr/jasper.py --num_gpus=8 ... + python -m torch.distributed.launch --nproc_per_node=8 /examples/asr/jasper.py ... Large Training Example @@ -251,7 +271,7 @@ Assuming, you are working with Volta-based DGX, you can run training like this: .. 
code-block:: bash - python -m torch.distributed.launch --nproc_per_node=8 /examples/asr/jasper.py --batch_size=64 --num_gpus=8 --num_epochs=100 --lr=0.015 --warmup_steps=8000 --weight_decay=0.001 --train_manifest=/manifests/librivox-train-all.json --val_manifest1=/manifests/librivox-dev-clean.json --val_manifest2=/manifests/librivox-dev-other.json --model_config=/nemo/examples/asr/configs/jasper15x5SEP.yaml --exp_name=MyLARGE-ASR-EXPERIMENT + python -m torch.distributed.launch --nproc_per_node=8 /examples/asr/jasper.py --batch_size=64 --num_epochs=100 --lr=0.015 --warmup_steps=8000 --weight_decay=0.001 --train_dataset=/manifests/librivox-train-all.json --eval_datasets /manifests/librivox-dev-clean.json /manifests/librivox-dev-other.json --model_config=/nemo/examples/asr/configs/jasper15x5SEP.yaml --exp_name=MyLARGE-ASR-EXPERIMENT The command above should trigger 8-GPU training with mixed precision. In the command above various manifests (.json) files are various datasets. Substitute them with the ones containing your data. @@ -280,60 +300,9 @@ Inference First download pre-trained model (jasper_encoder, jasper_decoder and configuration files) `from here `_ into ``. We will use this pre-trained model to measure WER on LibriSpeech dev-clean dataset. -.. code-block:: python - - import nemo - import nemo_asr - - # Path to the inference data - inference_manifest = "/dev_clean.json" - - # Import Jasper model definition - # Note that we are using a much larger 15x5 model now instead of 12x1 - from ruamel.yaml import YAML - yaml = YAML(typ="safe") - with open("/examples/asr/configs/jasper15x5SEP.yaml") as f: - jasper_model_definition = yaml.load(f) - labels = jasper_model_definition['labels'] - - # Instantiate neural modules - data_layer = nemo_asr.AudioToTextDataLayer(manifest_filepath=inference_manifest, - labels=labels, batch_size=64, shuffle=False,) - data_preprocessor = nemo_asr.AudioPreprocessing() - jasper_encoder = nemo_asr.JasperEncoder(feat_in=64, - **jasper_model_definition['JasperEncoder']) - jasper_decoder = nemo_asr.JasperDecoderForCTC(feat_in=1024, num_classes=len(labels)) - greedy_decoder = nemo_asr.GreedyCTCDecoder() - - # Define inference model - audio_signal, audio_signal_len, transcript, transcript_len = data_layer() - processed_signal, processed_signal_len = data_preprocessor( - input_signal=audio_signal, length=audio_signal_len) - encoded, encoded_len = jasper_encoder( - audio_signal=processed_signal, length=processed_signal_len) - log_probs = jasper_decoder(encoder_output=encoded) - predictions = greedy_decoder(log_probs=log_probs) - - eval_tensors=[predictions, transcript, transcript_len] - from nemo_asr.helpers import post_process_predictions, \ - post_process_transcripts, word_error_rate - infer_callback = nemo.core.InferenceCallback( - eval_tensors=eval_tensors) - - nf = nemo.core.NeuralModuleFactory() - - optimizer = nf.get_trainer(params={}) - evaluated_tensors = optimizer.infer(callback=infer_callback, - checkpoint_dir="/15x5SEP/") - - hypotheses = post_process_predictions(evaluated_tensors[0], labels=labels) - references = post_process_transcripts(evaluated_tensors[1], labels=labels, - transcript_len_list=evaluated_tensors[2]) - wer = word_error_rate(hypotheses=hypotheses, references=references) - print("Greedy WER {:.2f}".format(wer*100)) - - +.. 
code-block:: bash + python /examples/asr/jasper_infer.py --model_config=/examples/asr/configs/jasper15x5SEP.yaml --eval_datasets "/dev_clean.json" --load_dir= Inference with Language Model @@ -341,71 +310,22 @@ Inference with Language Model Using KenLM ~~~~~~~~~~~ -We will be using `BAIDU's CTC decoder with LM implementation. `_. +We will be using `Baidu's CTC decoder with LM implementation. `_. Perform the following steps: * Go to `cd /scripts` - * Install BAIDU's CTC decoders `sudo apt-get install swig` and `./install_decoders.sh` + * Install Baidu's CTC decoders `sudo apt-get install swig` and `./install_decoders.sh` * Build 6-gram KenLM model on LibriSpeech `./build_6-gram_OpenSLR_lm.sh` - * Add the following lines to the inference script right after - ``predictions = greedy_decoder(log_probs=log_probs)`` : - - .. code-block:: python - - predictions = greedy_decoder(log_probs=log_probs) - - import os - - # Instantiate BeamSearch NM - # Feel free to experiment with alpha, and beta parameters - beam_search_with_lm = nemo_asr.BeamSearchDecoderWithLM( - vocab=labels, - beam_width=128, - alpha=2.2, - beta=0.5, - lm_path="/6-gram.binary", - num_cpus=max(os.cpu_count(), 1)) - beam_predictions = beam_search_with_lm(log_probs=log_probs, - log_probs_length=encoded_len) - eval_tensors.append(beam_predictions) - - # Rest of code is slightly modified from the above script - from nemo_asr.helpers import post_process_predictions, \ - post_process_transcripts, word_error_rate - infer_callback = nemo.core.InferenceCallback( - # We add beam_predictions to eval_tensors - eval_tensors=[predictions, transcripts, beam_predictions], - ) - - nf = nemo.core.NeuralModuleFactory(backend=nemo.core.Backend.PyTorch) + * Run jasper_infer.py with the --lm_path flag - optimizer = nf.get_trainer(params={}) - evaluated_tensors = optimizer.infer(callback=infer_callback, - checkpoint_dir="/15x5SEP/") + .. code-block:: bash - hypotheses = post_process_predictions(evaluated_tensors[0], labels=labels) - references = post_process_transcripts(evaluated_tensors[1], labels=labels, - transcript_len_list=evaluated_tensors[2]) - wer = word_error_rate(hypotheses=hypotheses, references=references) - print("Greedy WER {:.2f}".format(wer*100)) - - # Post processing the new beam search predictions - beam_hypotheses = [] - for i in evaluated_tensors[-1]: - # Over samples - for j in i: - beam_hypotheses.append(j[0][1]) - - beam_wer = word_error_rate( - hypotheses=beam_hypotheses, references=references) - print("Beam WER {:.2f}".format(beam_wer*100)) - - * Run your updated inference script! + python /examples/asr/jasper_infer.py --model_config=/examples/asr/configs/jasper15x5SEP.yaml --eval_datasets "/dev_clean.json" --load_dir= --lm_path= References -------------- +---------- .. bibliography:: Jasperbib.bib :style: plain diff --git a/docs/_sources/collections/core.rst.txt b/docs/_sources/collections/core.rst.txt index 965f4c15653f..10f8c56c7418 100644 --- a/docs/_sources/collections/core.rst.txt +++ b/docs/_sources/collections/core.rst.txt @@ -1,7 +1,7 @@ -NEMO Common Collection +NeMo Common Collection ====================== -NEMO core package comes with "common" collection for pytorch built-in: +NeMo core package comes with "common" collection for pytorch built-in: .. 
automodule:: nemo.backends.pytorch.common.data :members: diff --git a/docs/_sources/collections/modules.rst.txt b/docs/_sources/collections/modules.rst.txt index d8dce788a3c0..b7640e39707e 100644 --- a/docs/_sources/collections/modules.rst.txt +++ b/docs/_sources/collections/modules.rst.txt @@ -1,10 +1,10 @@ .. _collection-docs: -NEMO Collections API +NeMo Collections API ==================== .. toctree:: :maxdepth: 8 core - ncollections \ No newline at end of file + ncollections diff --git a/docs/_sources/collections/ncollections.rst.txt b/docs/_sources/collections/ncollections.rst.txt index a82e558f1c3b..841264d37bbd 100644 --- a/docs/_sources/collections/ncollections.rst.txt +++ b/docs/_sources/collections/ncollections.rst.txt @@ -1,4 +1,4 @@ -NEMO support the following collections +NeMo support the following collections NEMO_ASR collection =================== @@ -18,3 +18,4 @@ Automatic Speech Recognition modules :undoc-members: :show-inheritance: :exclude-members: forward, create_ports + diff --git a/docs/_sources/index.rst.txt b/docs/_sources/index.rst.txt index e83caacf3507..5ca0c35f2bef 100644 --- a/docs/_sources/index.rst.txt +++ b/docs/_sources/index.rst.txt @@ -1,4 +1,4 @@ -Welcome to NEMO! +Welcome to NeMo! ================================ .. toctree:: @@ -10,13 +10,14 @@ Welcome to NEMO! tutorials/intro training asr/intro + nlp/intro collections/modules api-docs/modules faq .. image:: nemo-icon-256x256.png :align: center - :alt: NEMO + :alt: NeMo NEural MOdules (NeMo) is a high level toolkit for training AI applications using Neural Modules. NeMo comes with neural module collections for automatic speech recognition (ASR) and natural language processing (NLP). @@ -31,9 +32,24 @@ NeMo is built for fast training on GPUs. It provides: Automatic Speech Recognition ----------------------------- -`A Short VIDEO walk-through about using NEMO to experiment with ASR systems. `_ +Video walk-through +################## -**You can use ``nemo_nlp`` collection to construct the following models** +`A short VIDEO walk-through about using NeMo for ASR. `_ + + +.. raw:: html + +
    + +You can use ``nemo_asr`` collection to construct the following models +##################################################################### * Jasper * QuartzNet diff --git a/docs/_sources/installation.rst.txt b/docs/_sources/installation.rst.txt index 7fc96530f759..2b27c5496246 100644 --- a/docs/_sources/installation.rst.txt +++ b/docs/_sources/installation.rst.txt @@ -13,7 +13,7 @@ Installation 6) (Recommended for distributed training) `NCCL `_ >= 2.4 -**Installing NEMO and Collections** +**Installing NeMo and Collections** 1) Clone the repository: @@ -23,7 +23,7 @@ Installation 2) Go to ``nemo`` folder and do: ``python setup.py install`` -3) Run unittests to validate instalation: +3) Run unittests to validate installation: .. code-block:: bash @@ -41,5 +41,3 @@ For development do: ``python setup.py develop`` instead of ``python setup.py ins 5) Go to ``examples/start_here`` to get started with few simple examples - - diff --git a/docs/_sources/nlp/asr-improvement.rst.txt b/docs/_sources/nlp/asr-improvement.rst.txt new file mode 100644 index 000000000000..d9f7920bbfc9 --- /dev/null +++ b/docs/_sources/nlp/asr-improvement.rst.txt @@ -0,0 +1,148 @@ +Tutorial +=========================== + +In this tutorial we will train an ASR postprocessing model to correct mistakes in +output of end-to-end language model. This model method works similar to translation model +in contrast to traditional ASR language model rescoring. The model architecture is +attention based encoder-decoder where both encoder and decoder are initialized with +pretrained BERT language model. To train this model we collected dataset with typical +ASR errors by using pretrained Jasper ASR model :cite:`li2019jasper`. + +Data +----------- +**Data collection.** We collected dataset for this tutorial with Jasper ASR model +:cite:`li2019jasper` trained on Librispeech dataset :cite:`panayotov2015librispeech`. +Librispeech training dataset consists of three parts -- train-clean-100, train-clean-360 and +train-clean-500 which give 281k training examples in total. +To augment this data we used two techniques: + +* We split all training data into 10 folds and trained 10 Jasper models in cross-validation manner: a model was trained on 9 folds and used to make ASR predictions for the remaining fold. +* We took pretrained Jasper model and enabled dropout during inference on training data. This procedure was repeated multiple times with different random seeds. + +**Data postprocessing.** The collecred dataset was postprocessed by removing duplicates +and examples with word error rate higher than 0.5. +The resulting training dataset consists of 1.7M pairs of "bad" English-"good" English examples. + +**Dev and test datasets preparation**. Librispeech contains 2 dev datasets +(dev-clean and dev-other) and 2 test datasets (test-clean and test-other). +For our task we kept the same splits. We fed these datasets to a pretrained +Jasper model with the greedy decoding to get the ASR predictions that are used +for evaluation in our tutorial. + +Importing parameters from pretrained BERT +----------------------------------------- +Both encoder and decoder are initialized with pretrained BERT parameters. Since BERT language +model has the same architecture as transformer encoder, there is no need to do anything +additional. 
To prepare decoder parameters from pretrained BERT we wrote a script +``get_decoder_params_from_bert.py`` that downloads BERT parameters from +pytorch-transformers repository :cite:`huggingface2019transformers` and maps them into a transformer decoder. +Encoder-decoder attention is initialized with self-attention parameters. +The script is located under ``nemo/scripts`` directory and accepts 2 arguments: +``--model_name`` (ex. ``bert-base-cased``, ``bert-base-uncased``, etc) and ``--save_to`` +(a directory where the parameters will be saved): + + .. code-block:: bash + + $ python get_decoder_params_from_bert.py --model_name bert-base-uncased + + +Neural modules overview +-------------------------- +First we define tokenizer to convert tokens into indices. We will use ``bert-base-uncased`` +vocabukary, since our dataset only contains lower-case text: + + .. code-block:: python + + tokenizer = NemoBertTokenizer(pretrained_model="bert-base-uncased") + + +The encoder block is a neural module corresponding to BERT language model from +``nemo.nemo_nlp.huggingface`` collection: + + .. code-block:: python + + zeros_transform = neural_factory.get_module( + name="ZerosLikeNM", + params={}, + collection="nemo_nlp" + ) + encoder = neural_factory.get_module( + name="higgingface.BERT", + params={ + "pretrained_model_name": args.pretrained_model_name, + "local_rank": args.local_rank + }, + collection="nemo_nlp" + ) + + .. tip:: + Making embedding size (as well as all other tensor dimensions) divisible + by 8 will help to get the best GPU utilization and speed-up with mixed precision + training. + +We also pad the matrix of embedding parameters with zeros to have all the dimensions sizes +divisible by 8, which will speed up the computations on GPU with AMP: + + .. code-block:: python + + vocab_size = 8 * math.ceil(tokenizer.vocab_size / 8) + tokens_to_add = vocab_size - tokenizer.vocab_size + device = encoder.bert.embeddings.word_embeddings.weight.get_device() + zeros = torch.zeros((tokens_to_add, args.d_model)).to(device=device) + + encoder.bert.embeddings.word_embeddings.weight.data = torch.cat( + (encoder.bert.embeddings.word_embeddings.weight.data, zeros)) + + +Next we construct transformer decoder neural module. Since we will be initializing decoder +with pretrained BERT parameters, we set hidden activation to ``"hidden_act": "gelu"`` and learn +positional encodings ``"learn_positional_encodings": True``: + + .. code-block:: python + + decoder = neural_factory.get_module( + name="TransformerDecoderNM", + params={ + "d_model": args.d_model, + "d_inner": args.d_inner, + "num_layers": args.num_layers, + "num_attn_heads": args.num_heads, + "fully_connected_dropout": args.fully_connected_dropout, + "vocab_size": vocab_size, + "max_seq_length": max_sequence_length, + "embedding_dropout": args.embedding_dropout, + "learn_positional_encodings": True, + "hidden_act": "gelu", + **dec_first_sublayer_params + }, + collection="nemo_nlp" + ) + +To load the pretrained parameters into decoder, we use ``restore_from`` attribute function +of the decoder neural module: + + .. code-block:: python + + decoder.restore_from(args.restore_from, local_rank=args.local_rank) + + +Model training +-------------- + +To train the model run ``bert_asr_improvement.py`` located in ``nemo\examples\nlp`` directory. 
+We train with novograd optimizer :cite:`ginsburg2019stochastic`, learning rate ``lr=0.001``, +polynomial learning rate decay policy, ``1000`` warmup steps, per-gpu batch size of ``4096*8`` tokens, +and ``0.25`` dropout probability. We trained on 8 GPUS. To launch the training in +multi-gpu mode run the following command: + + .. code-block:: bash + + $ python -m torch.distributed.launch --nproc_per_node=8 bert_asr_improvement.py --dataset_dir ../../tests/data/pred_real/ --restore_from ../../scripts/bert-base-uncased_decoder.pt + + + +References +------------------ + +.. bibliography:: asr_impr.bib + :style: plain \ No newline at end of file diff --git a/docs/_sources/nlp/intro.rst.txt b/docs/_sources/nlp/intro.rst.txt new file mode 100644 index 000000000000..1fda38c5c787 --- /dev/null +++ b/docs/_sources/nlp/intro.rst.txt @@ -0,0 +1,52 @@ +.. _nlp-docs: + +Natural Language Processing +=========================== + +Neural Machine Translation (NMT) +-------------------------------- +.. toctree:: + :maxdepth: 8 + + neural-machine-translation + + +Language Modeling (LM) +-------------------------------- +.. toctree:: + :maxdepth: 8 + + language-modeling + + +BERT +---- +.. toctree:: + :maxdepth: 8 + + pretraining + + +Named Entity Recognition +------------------------ + +.. toctree:: + :maxdepth: 8 + + ner + + +Intent and Slot filling +----------------------- +.. toctree:: + :maxdepth: 8 + + joint_intent_slot_filling + + +Improving speech recognition with BERTx2 post-processing model +-------------------------------------------------------------- +.. toctree:: + :maxdepth: 8 + + asr-improvement diff --git a/docs/_sources/nlp/joint_intent_slot_filling.rst.txt b/docs/_sources/nlp/joint_intent_slot_filling.rst.txt new file mode 100644 index 000000000000..fb6600a080c2 --- /dev/null +++ b/docs/_sources/nlp/joint_intent_slot_filling.rst.txt @@ -0,0 +1,192 @@ +Tutorial +======== + +In this tutorial, we are going to implement a joint intent and slot filling system with pretrained BERT model based on `BERT for Joint Intent Classification and Slot Filling `_ :cite:`chen2019bert`. All code used in this tutorial is based on ``examples/nlp/joint_intent_slot_with_bert.py``. + +There are four pretrained BERT models that we can select from using the argument `--pretrained_bert_model`. We're currently using the script for loading pretrained models from `pytorch_transformers`. See the list of available pretrained models `here `__. + + +Preliminaries +------------- + +**Model details** +This model jointly train the sentence-level classifier for intents and token-level classifier for slots by minimizing the combined loss of the two classifiers: + + intent_loss * intent_loss_weight + slot_loss * (1 - intent_loss_weight) + +When `intent_loss_weight = 0.5`, this loss jointly maximizes: + + p(y | x)P(s1, s2, ..., sn | x) + +with x being the sequence of n tokens (x1, x2, ..., xn), y being the predicted intent for x, and s1, s2, ..., sn being the predicted slots corresponding to x1, x2, ..., xn. + +**Datasets.** + +This model can work with any dataset that follows the format: + * input file: a `tsv` file with the first line as a header [sentence][tab][label] + + * slot file: slot labels for all tokens in the sentence, separated by space. The length of the slot labels should be the same as the length of all tokens in sentence in input file. 
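+
+For example, a single (hypothetical) training pair in this format could look like the
+following, where the first line is a row of the input ``tsv`` file (``<TAB>`` stands
+for a literal tab character, and the header row is omitted) and the second line is the
+corresponding row of the slot file; the intent and slot names are made up for
+illustration only:
+
+    .. code-block:: text
+
+        list flights from denver to san francisco<TAB>request_flight
+        O O O B-from_city O B-to_city I-to_city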
+ +Currently, the datasets that we provide pre-processing script for include ATIS which can be downloaded from `Kaggle `_ and the SNIPS spoken language understanding research dataset which can be requested from `here `__. You can find the pre-processing script in ``collections/nemo_nlp/nemo_nlp/text_data_utils.py``. + + +Code structure +-------------- + +First, we instantiate Neural Module Factory which defines 1) backend (PyTorch or TensorFlow), 2) mixed precision optimization level, 3) local rank of the GPU, and 4) an experiment manager that creates a timestamped folder to store checkpoints, relevant outputs, log files, and TensorBoard graphs. + + .. code-block:: python + + nf = nemo.core.NeuralModuleFactory( + backend=nemo.core.Backend.PyTorch, + local_rank=args.local_rank, + optimization_level=args.amp_opt_level, + log_dir=work_dir, + create_tb_writer=True, + files_to_copy=[__file__]) + +We define tokenizer which transforms text into BERT tokens, using a built-in tokenizer by `pytorch_transformers`. This will tokenize text following the mapping of the original BERT model. + + .. code-block:: python + + from pytorch_transformers import BertTokenizer + tokenizer = BertTokenizer.from_pretrained(args.pretrained_bert_model) + +Next, we define all Neural Modules participating in our joint intent slot filling classification pipeline. + + * Data layer: converting from the formatted dataset to data loader that feeds data into our model. + + .. code-block:: python + + data_layer = nemo_nlp.BertJointIntentSlotDataLayer( + path_to_data=data_file, + path_to_slot=slot_file, + pad_label=pad_label, + tokenizer=tokenizer, + mode=mode, + max_seq_length=max_seq_length, + num_samples=num_samples, + batch_size=batch_size, + shuffle=shuffle, + num_workers=0, + local_rank=local_rank + ) + + ids, type_ids, input_mask, slot_mask, intents, slots = data_layer() + + + * Load the pretrained model and get the hidden states for the corresponding inputs. + + .. code-block:: python + + hidden_states = pretrained_bert_model(input_ids=ids, + token_type_ids=type_ids, + attention_mask=input_mask) + + + * Create the classifier heads for our task. + + .. code-block:: python + + classifier = nemo_nlp.JointIntentSlotClassifier( + hidden_size=hidden_size, + num_intents=num_intents, + num_slots=num_slots, + dropout=args.fc_dropout) + + intent_logits, slot_logits = classifier(hidden_states=hidden_states) + + + * Create loss function + + .. code-block:: python + + loss_fn = nemo_nlp.JointIntentSlotLoss(num_slots=num_slots) + + loss = loss_fn(intent_logits=intent_logits, + slot_logits=slot_logits, + input_mask=input_mask, + intents=intents, + slots=slots) + + + * Create relevant callbacks for saving checkpoints, printing training progresses and evaluating results + + .. 
code-block:: python + + if mode == 'train': + callback_fn = nemo.core.SimpleLossLoggerCallback( + tensors=[loss, intent_logits, slot_logits], + print_func=lambda x: str(np.round(x[0].item(), 3)), + tb_writer=exp.tb_writer, + get_tb_values=lambda x: [["loss", x[0]]], + step_freq=100) + elif mode == 'eval': + callback_fn = nemo.core.EvaluatorCallback( + eval_tensors=[intent_logits, slot_logits, intents, slots], + user_iter_callback=lambda x, y: eval_iter_callback( + x, y, data_layer), + user_epochs_done_callback=lambda x: eval_epochs_done_callback( + x, f'{exp.work_dir}/graphs'), + tb_writer=exp.tb_writer, + eval_step=steps_per_epoch) + + + ckpt_callback = nemo.core.CheckpointCallback( + folder=exp.ckpt_dir, + epoch_freq=args.save_epoch_freq, + step_freq=args.save_step_freq) + + + * Finally, we define the optimization parameters and run the whole pipeline. + + .. code-block:: python + + lr_policy_fn = get_lr_policy(args.lr_policy, + total_steps=args.num_epochs * steps_per_epoch, + warmup_ratio=args.lr_warmup_proportion) + nf.train(tensors_to_optimize=[train_loss], + callbacks=[callback_train, callback_eval, ckpt_callback], + lr_policy=lr_policy_fn, + optimizer=args.optimizer_kind, + optimization_params={"num_epochs": num_epochs, + "lr": args.lr, + "weight_decay": args.weight_decay}) + +Model training +-------------- + +To train a joint intent slot filling model, run ``joint_intent_slot_with_bert.py`` located at ``nemo/examples/nlp``: + + .. code-block:: python + + python -m torch.distributed.launch --nproc_per_node=2 joint_intent_slot_with_bert.py \ + --data_dir + --work_dir \ + --max_seq_length \ + --optimizer_kind + ... + +To do inference, run: + + .. code-block:: python + + python -m joint_intent_slot_infer.py \ + --data_dir \ + --work_dir + + +To do inference on a single query, run: + + .. code-block:: python + + python -m joint_intent_slot_infer.py \ + --work_dir + --query + + +References +---------- + +.. bibliography:: joint_intent_slot.bib + :style: plain diff --git a/docs/_sources/nlp/language-modeling.rst.txt b/docs/_sources/nlp/language-modeling.rst.txt new file mode 100644 index 000000000000..6a2193514954 --- /dev/null +++ b/docs/_sources/nlp/language-modeling.rst.txt @@ -0,0 +1,10 @@ +Tutorial +====================== + +In this tutorial we are going to implement Language Modeling (LM) system based on `Transformer decoder architecture `_ :cite:`baevski2018adaptive`. All code used in this tutorial is based on ``examples/nlp/lm_tutorial.py``. + +References +------------- + +.. bibliography:: lm.bib + :style: plain diff --git a/docs/_sources/nlp/ner.rst.txt b/docs/_sources/nlp/ner.rst.txt new file mode 100644 index 000000000000..0de298f70b5f --- /dev/null +++ b/docs/_sources/nlp/ner.rst.txt @@ -0,0 +1,254 @@ +Tutorial +======== + +Make sure you have ``nemo`` and ``nemo_nlp`` installed before starting this +tutorial. See the :ref:`installation` section for more details. + +Introduction +------------ + +This tutorial explains how to implement named entity recognition (NER) in NeMo. We'll show how to do this with a pre-trained BERT model, or with one that you trained yourself! For more details, check out our BERT pretraining tutorial. + +Download Dataset +---------------- + +`CoNLL-2003`_ is a standard evaluation dataset for NER, but any NER dataset will work. The only requirement is that the files are formatted like this: + +.. _CoNLL-2003: https://www.clips.uantwerpen.be/conll2003/ner/ + +.. code-block:: + + Jennifer B-PER + is O + from O + New B-LOC + York I-LOC + City I-LOC + . 
O + + She O + likes O + ... + +Here, the words and labels are separated with spaces, but in your dataset they should be separated with tabs. Each line should follow the format: [WORD] [TAB] [LABEL] (without spaces in between). There can be columns in between for part-of-speech tags, as shown on the `CoNLL-2003 website`_. There should also be empty lines separating each sequence, as shown above. + +.. _CoNLL-2003 website: https://www.clips.uantwerpen.be/conll2003/ner/ + +Training +-------- + +.. tip:: + + We recommend you try this out in a Jupyter notebook. It'll make debugging much easier! + +Here, we'll fine-tune a BERT model on our downstream NER task. We'll start off with our imports and constants. + +.. code-block:: python + + import math + import os + + import nemo + from nemo.utils.lr_policies import WarmupAnnealing + + import nemo_nlp + from nemo_nlp import NemoBertTokenizer, SentencePieceTokenizer + from nemo_nlp.callbacks.ner import \ + eval_iter_callback, eval_epochs_done_callback + + BATCHES_PER_STEP = 1 + BATCH_SIZE = 32 + CLASSIFICATION_DROPOUT = 0.1 + DATA_DIR = "conll2003" + MAX_SEQ_LENGTH = 128 + NUM_EPOCHS = 3 + LEARNING_RATE = 0.00005 + LR_WARMUP_PROPORTION = 0.1 + OPTIMIZER = "adam" + +Next, we need to create our neural factory. How you should define it depends on whether you'd like to multi-GPU or mixed-precision training. This tutorial assumes that you're training on one GPU, without mixed precision. + +.. code-block:: python + + # Instantiate neural factory with supported backend + neural_factory = nemo.core.NeuralModuleFactory( + backend=nemo.core.Backend.PyTorch, + + # If you're training with multiple GPUs, you should handle this value with + # something like argparse. See examples/nlp/ner.py for an example. + local_rank=None, + + # If you're training with mixed precision, this should be set to mxprO1 or mxprO2. + # See https://nvidia.github.io/apex/amp.html#opt-levels for more details. + optimization_level=nemo.core.Optimization.mxprO0, + + # If you're training with multiple GPUs, this should be set to + # nemo.core.DeviceType.AllGpu + placement=nemo.core.DeviceType.GPU) + +Next, we'll need to define our tokenizer and our BERT model. There are a couple of different ways you can do this. Keep in mind that NER benefits from casing ("New York City" is easier to identify than "new york city"), so we recommend you use cased models. + +.. code-block:: python + + # If you're using a standard BERT model, you should do it like this. To see the full + # list of BERT model names, check out nemo_nlp.huggingface.BERT.list_pretrained_models() + tokenizer = NemoBertTokenizer(pretrained_model="bert-base-cased") + bert_model = nemo_nlp.huggingface.BERT( + pretrained_model_name="bert-base-cased", + factory=neural_factory) + + # If you're using a BERT model that you pre-trained yourself, you should do it like this. + # You should replace BERT-STEP-150000.pt with the path to your checkpoint file. + tokenizer = SentencePieceTokenizer(model_path="tokenizer.model") + tokenizer.add_special_tokens(["[MASK]", "[CLS]", "[SEP]"]) + + bert_model = nemo_nlp.huggingface.BERT( + config_filename=os.path.join("bert_pretraining_checkpoints", "config.json"), + factory=neural_factory) + bert_model.restore_from( + os.path.join("bert_pretraining_checkpoints", "BERT-STEP-150000.pt")) + +Now, we will define the training pipeline: + +.. 
code-block:: python + + train_data_layer = nemo_nlp.BertNERDataLayer( + tokenizer=tokenizer, + path_to_data=os.path.join(DATA_DIR, "train.txt"), + max_seq_length=MAX_SEQ_LENGTH, + batch_size=BATCH_SIZE, + factory=neural_factory) + + tag_ids = train_data_layer.dataset.tag_ids + + ner_loss = nemo_nlp.TokenClassificationLoss( + d_model=bert_model.bert.config.hidden_size, + num_labels=len(tag_ids), + dropout=CLASSIFICATION_DROPOUT, + factory=neural_factory) + + input_ids, input_type_ids, input_mask, labels, _ = train_data_layer() + + hidden_states = bert_model( + input_ids=input_ids, + token_type_ids=input_type_ids, + attention_mask=input_mask) + + train_loss, train_logits = ner_loss( + hidden_states=hidden_states, + labels=labels, + input_mask=input_mask) + +And now, our evaluation pipeline: + +.. code-block:: python + + eval_data_layer = nemo_nlp.BertNERDataLayer( + tokenizer=tokenizer, + path_to_data=os.path.join(DATA_DIR, "dev.txt"), + max_seq_length=MAX_SEQ_LENGTH, + batch_size=BATCH_SIZE, + factory=neural_factory) + + input_ids, input_type_ids, eval_input_mask, \ + eval_labels, eval_seq_ids = eval_data_layer() + + hidden_states = bert_model( + input_ids=input_ids, + token_type_ids=input_type_ids, + attention_mask=eval_input_mask) + + eval_loss, eval_logits = ner_loss( + hidden_states=hidden_states, + labels=eval_labels, + input_mask=eval_input_mask) + +Now, we will set up our callbacks. Here, we will use `SimpleLossLoggerCallback` to print loss values during training, and `EvaluatorCallback` to evaluate our F1 score on the dev dataset. In this example, `EvaluatorCallback` will also output predictions to `output.txt`, which can be helpful with debugging what our model gets wrong. + +.. tip:: + + Tensorboard_ is a great debugging tool. It's not a requirement for this tutorial, but if you'd like to use it, you should install tensorboardX_ and run the following command during fine-tuning: + + .. code-block:: bash + + tensorboard --logdir bert_ner_tb + +.. _Tensorboard: https://www.tensorflow.org/tensorboard +.. _tensorboardX: https://github.com/lanpa/tensorboardX + +.. code-block:: python + + try: + import tensorboardX + tb_writer = tensorboardX.SummaryWriter("bert_ner_tb") + except ModuleNotFoundError: + tb_writer = None + print("Tensorboard is not available") + + callback_train = nemo.core.SimpleLossLoggerCallback( + tensors=[train_loss], + print_func=lambda x: print("Loss: {:.3f}".format(x[0].item())), + get_tb_values=lambda x: [["loss", x[0]]], + tb_writer=tb_writer) + + train_data_size = len(train_data_layer) + + # If you're training on multiple GPUs, this should be + # train_data_size / (batch_size * batches_per_step * num_gpus) + steps_per_epoch = int(train_data_size / (BATCHES_PER_STEP * BATCH_SIZE)) + + callback_eval = nemo.core.EvaluatorCallback( + eval_tensors=[eval_logits, eval_seq_ids], + user_iter_callback=lambda x, y: eval_iter_callback( + x, y, eval_data_layer, tag_ids), + user_epochs_done_callback=lambda x: eval_epochs_done_callback( + x, tag_ids, "output.txt"), + tb_writer=tb_writer, + eval_step=steps_per_epoch) + +Finally, we will define our learning rate policy and our optimizer, and start training. + +.. 
code-block:: python + + lr_policy = WarmupAnnealing(NUM_EPOCHS * steps_per_epoch, + warmup_ratio=LR_WARMUP_PROPORTION) + optimizer = neural_factory.get_trainer() + optimizer.train( + tensors_to_optimize=[train_loss], + callbacks=[callback_train, callback_eval], + lr_policy=lr_policy, + batches_per_step=BATCHES_PER_STEP, + optimizer=OPTIMIZER, + optimization_params={ + "num_epochs": NUM_EPOCHS, + "lr": LEARNING_RATE + }) + +Using Other BERT Models +----------------------- + +In addition to using pre-trained BERT models from Google and BERT models that you've trained yourself, in NeMo it's possible to use other third-party BERT models as well, as long as the weights were exported with PyTorch. For example, if you want to fine-tune an NER task with SciBERT_... + +.. _SciBERT: https://github.com/allenai/scibert + +.. code-block:: bash + + wget https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_cased.tar + tar -xf scibert_scivocab_cased.tar + cd scibert_scivocab_cased + tar -xzf weights.tar.gz + mv bert_config.json config.json + cd .. + +And then, when you load your BERT model, you should specify the name of the directory for the model name. + +.. code-block:: python + + tokenizer = NemoBertTokenizer(pretrained_model="scibert_scivocab_cased") + bert_model = nemo_nlp.huggingface.BERT( + pretrained_model_name="scibert_scivocab_cased", + factory=neural_factory) + +If you want to use a TensorFlow-based model, such as BioBERT, you should be able to use it in NeMo by first using this `model conversion script`_ provided by Hugging Face. + +.. _model conversion script: https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/convert_tf_checkpoint_to_pytorch.py diff --git a/docs/_sources/nlp/neural-machine-translation.rst.txt b/docs/_sources/nlp/neural-machine-translation.rst.txt new file mode 100644 index 000000000000..3baa78ac65d1 --- /dev/null +++ b/docs/_sources/nlp/neural-machine-translation.rst.txt @@ -0,0 +1,157 @@ +Tutorial +======== + +In this tutorial we are going to implement Neural Machine Translation (NMT) system based on `Transformer encoder-decoder architecture `_ :cite:`vaswani2017attention`. All code used in this tutorial is based on ``examples/nlp/nmt_tutorial.py``. + +Preliminaries +------------- + +**Dataset.** We use WMT16 English-German dataset which consists of approximately 4.5 million sentence pairs before preprocessing. To clean the dataset we remove all sentence pairs such that: + + * The length of either source or target is greater than 128 or smaller than 3 tokens. + * Absolute difference between source and target is greater than 25 tokens. + * One sentence is more than 2.5 times longer than the other. + * Target sentence is the exact copy of the source sentence :cite:`ott2018analyzing`. + +We use newstest2013 for development and newstest2014 for testing. All datasets, as well as the tokenizer model can be downloaded from `here `_. In the following steps, we assume that all data is located at ****. + +**Resources.** Training script ``examples/nlp/nmt_tutorial.py`` used in this tutorial allows to train Transformer-big architecture to **29.2** BLEU / **28.5** SacreBLEU on newstest2014 in approximately 15 hours on NVIDIA's DGX-1 with 16GB Volta GPUs. This setup can also be replicated with fewer resources by using gradient accumulation :cite:`ott2018scaling`. + +.. 
tip:: + Launching training script without any arguments will run training on much smaller dataset (newstest2013) of 3000 sentence pairs and validate on the subset of this dataset consisting of 100 sentence pairs. This is useful for debugging purposes: if everything is set up correctly, validation BLEU will reach >99 and training / validation losses will go to <1.5 pretty fast. + +Code overview +------------- + +First of all, we instantiate Neural Module Factory which defines 1) backend, 2) mixed precision optimization level, and 3) local rank of the GPU. + + .. code-block:: python + + neural_factory = nemo.core.NeuralModuleFactory( + backend=nemo.core.Backend.PyTorch, + local_rank=args.local_rank, + optimization_level=nemo.core.Optimization.mxprO2) + +We define tokenizer which allows to transform input text into tokens. In this tutorial, we use joint `Byte Pair Encodings (BPE) `_ :cite:`sennrich2015neural` trained on WMT16 En-De corpus with `YouTokenToMe library `_. In contrast to the models presented in the literature (which usually have vocabularies of size 30000+), we work with 4x smaller vocabulary of 8192 BPEs. It achieves the same level of performance but allows to increase the batch size by 20% which in turn leads to faster convergence. + + + .. code-block:: python + + tokenizer = nemo_nlp.YouTokenToMeTokenizer(model_path="/bpe8k_yttm.model") + + .. tip:: + To leverage the best GPU utilization and mixed precision speedup, make sure that the vocabulary size (as well as all sizes in the model) is divisible by 8. + +Next, we define all Neural Modules participating in our NMT pipeline: + + * Two data layers (one for training and one for evaluation) which pack input sentences into batches of similar length to minimize the use of padding symbol. Note, that the maximum allowed number of tokens in a batch is given in **source and target** tokens. + * Transformer Encoder and Decoder. + * LogSoftmax for mapping output of the decoder into probability distribution over vocabulary. + * Beam Search module for generating translations. + * Loss function (cross entropy with label smoothing regularization). + + .. code-block:: python + + train_data_layer = nemo_nlp.TranslationDataLayer(**train_datalayer_params) + eval_data_layer = nemo_nlp.TranslationDataLayer(**eval_datalayer_params) + encoder = nemo_nlp.TransformerEncoderNM(**encoder_params) + decoder = nemo_nlp.TransformerDecoderNM(**decoder_params) + log_softmax = nemo_nlp.TransformerLogSoftmaxNM(**log_softmax_params) + beam_search = nemo_nlp.BeamSearchTranslatorNM(**beam_search_params) + loss = nemo_nlp.PaddedSmoothedCrossEntropyLossNM(**loss_params) + +Following `Press and Wolf, 2016 `_ :cite:`press2016using`, we also tie the parameters of embedding and softmax layers: + + .. code-block:: python + + log_softmax.log_softmax.dense.weight = encoder.embedding_layer.token_embedding.weight + decoder.embedding_layer.token_embedding.weight = encoder.embedding_layer.token_embedding.weight + +Then, we build the computation graph out of instantiated modules: + + .. 
code-block:: python + + ########################### Training pipeline ########################### + src, src_mask, tgt, tgt_mask, labels, sent_ids = train_data_layer() + src_hiddens = encoder(input_ids=src, input_mask_src=src_mask) + tgt_hiddens = decoder(input_ids_tgt=tgt, + hidden_states_src=src_hiddens, + input_mask_src=src_mask, + input_mask_tgt=tgt_mask) + log_softmax = log_softmax(hidden_states=tgt_hiddens) + train_loss = loss(log_probs=log_softmax, target_ids=labels) + + ########################## Evaluation pipeline ########################## + src_, src_mask_, tgt_, tgt_mask_, labels_, sent_ids_ = eval_data_layer() + src_hiddens_ = encoder(input_ids=src_, input_mask_src=src_mask_) + tgt_hiddens_ = decoder(input_ids_tgt=tgt_, + hidden_states_src=src_hiddens_, + input_mask_src=src_mask_, + input_mask_tgt=tgt_mask_) + log_softmax_ = log_softmax(hidden_states=tgt_hiddens_) + eval_loss = loss(log_probs=log_softmax_, target_ids=labels_) + beam_trans = beam_search(hidden_states_src=src_hiddens_, + input_mask_src=src_mask_) + +Next, we define necessary callbacks for: 1) tracking loss during training, 2) tracking BLEU score on evaluation dataset, 3) saving model checkpoints once in a while. + + .. code-block:: python + + from nemo_nlp.callbacks.translation import eval_iter_callback, eval_epochs_done_callback + + callback_train = nemo.core.SimpleLossLoggerCallback(...) + callback_eval = nemo.core.EvaluatorCallback(...) + callback_ckpt = nemo.core.CheckpointCallback(...) + + .. note:: + + The BLEU score is calculated between detokenized translation (generated with beam search) and genuine evaluation dataset. For the sake of completeness, we report both `SacreBLEU `_ :cite:`post2018call` and `tokenized BLEU score `_ commonly used in the literature. + +Finally, we define the optimization parameters and run the whole pipeline. + + .. code-block:: python + + optimizer = neural_factory.get_trainer(**optimization_params) + optimizer.train(tensors_to_optimize=[train_loss], + callbacks=[callback_train, callback_eval, callback_ckpt]) + + +Model training +-------------- + +To train the Transformer-big model, run ``nmt_tutorial.py`` located at ``nemo/examples/nlp``: + + .. code-block:: python + + python -m torch.distributed.launch --nproc_per_node=8 nmt_tutorial.py \ + --data_root --tokenizer_model bpe8k_yttm.model \ + --eval_datasets valid/newstest2013 --optimizer novograd --lr 0.04 \ + --weight_decay 0.0001 --max_num_steps 40000 --warmup_steps 4000 \ + --d_model 1024 --d_inner 4096 --num_layers 6 --num_attn_heads 16 \ + --batch_size 12288 --grad_accumulation_steps 5 + + + .. note:: + + This command runs training on 8 GPUs with at least 16 GB of memory. If your GPUs have less memory, decrease the **batch_size** parameter. To train with bigger batches which do not fit into the memory, increase the **grad_accumulation_steps** parameter. + +Translation with pretrained model +--------------------------------- + +1. Put your saved checkpoint (or download good checkpoint which obtains 28.5 SacreBLEU on newstest2014 from `here `_) into ****. +2. Run ``nmt_tutorial.py`` in an interactive mode:: + + python nmt_tutorial.py --tokenizer_model bpe8k_yttm.model \ + --eval_datasets test --optimizer novograd --d_model 1024 \ + --d_inner 4096 --num_layers 6 --num_attn_heads 16 \ + --path_to_checkpoints --interactive + + + .. image:: interactive_translation.png + :align: center + +References +---------- + +.. 
bibliography:: nmt.bib + :style: plain diff --git a/docs/_sources/nlp/pretraining.rst.txt b/docs/_sources/nlp/pretraining.rst.txt new file mode 100644 index 000000000000..5c92675bc838 --- /dev/null +++ b/docs/_sources/nlp/pretraining.rst.txt @@ -0,0 +1,298 @@ +Pretraining BERT +================ + +Make sure you have ``nemo`` and ``nemo_nlp`` installed before starting this +tutorial. See the :ref:`installation` section for more details. + +Introduction +------------ + +This tutorial is focused on pretraining BERT from scratch. Creating domain-specific BERT models can be advantageous for a wide range of applications. Most notably, in a biomedical setting, similar to BioBERT :cite:`lee2019biobert` and SciBERT :cite:`beltagy2019scibert`. + +Download Corpus +--------------- + +For demonstration purposes, we will be using the very small WikiText-2 corpus. This script will download and unzip the corpus for you: + +.. code-block:: bash + + ./tests/data/get_wt2.sh + +After the download has completed, there should be a `wikitext-2` folder in your current directory, which should include `train.txt`, `valid.txt`, and `test.txt`. + +Build Vocabulary +---------------- + +.. note:: + This step is optional! If you don't want to use a custom vocabulary, using the `vocab.txt` file from any `pretrained BERT model`_ will do. Also, depending on the size of your corpus, this may take awhile. + +.. _pretrained BERT model: https://github.com/google-research/bert#pre-trained-models + +Another script can be used to generate your vocabulary file. In this example with WikiText-2, you can build it like this: + +.. code-block:: bash + + # In this example, our dataset consists of one file, so we can run it like this: + python tests/data/create_vocab.py --train_path wikitext-2/train.txt + + # If your corpus consists of many different files, you should run it like this instead: + python tests/data/create_vocab.py --dataset_dir path_to_dataset/ + +The script will output two important files: `tokenizer.vocab` and `tokenizer.model`. We'll explain how to use both in the next section. + +Training +-------- + +.. tip:: + + We recommend you try this out in a Jupyter notebook. It'll make debugging much easier! + +Here, we will pre-train a BERT model from scratch on the WikiText-2 corpus. We'll start off with our imports and constants. + +.. code-block:: python + + import math + import os + + import nemo + from nemo.utils.lr_policies import CosineAnnealing + + import nemo_nlp + from nemo_nlp import NemoBertTokenizer, SentencePieceTokenizer + from nemo_nlp.callbacks.bert_pretraining import eval_iter_callback, \ + eval_epochs_done_callback + + BATCHES_PER_STEP = 1 + BATCH_SIZE = 64 + BATCH_SIZE_EVAL = 16 + CHECKPOINT_DIR = "bert_pretraining_checkpoints" + D_MODEL = 768 + D_INNER = 3072 + HIDDEN_ACT = "gelu" + LEARNING_RATE = 1e-2 + LR_WARMUP_PROPORTION = 0.05 + MASK_PROBABILITY = 0.15 + MAX_SEQ_LENGTH = 128 + NUM_EPOCHS = 10 + NUM_HEADS = 12 + NUM_LAYERS = 12 + OPTIMIZER = "novograd" + WEIGHT_DECAY = 0 + +Next, we need to create our neural factory. How you should define it depends on whether you'd like to multi-GPU or mixed-precision training. This tutorial assumes that you're training on one GPU, without mixed precision. + +.. code-block:: python + + # Instantiate neural factory with supported backend + neural_factory = nemo.core.NeuralModuleFactory( + backend=nemo.core.Backend.PyTorch, + + # If you're training with multiple GPUs, you should handle this value with + # something like argparse. 
See examples/nlp/bert_pretraining.py for an example. + local_rank=None, + + # If you're training with mixed precision, this should be set to mxprO1 or mxprO2. + # See https://nvidia.github.io/apex/amp.html#opt-levels for more details. + optimization_level=nemo.core.Optimization.mxprO0, + + # If you're training with multiple GPUs, this should be set to + # nemo.core.DeviceType.AllGpu + placement=nemo.core.DeviceType.GPU) + +Now, we need to define our tokenizer. If you'd like to use a custom vocabulary file, we strongly recommend you use our `SentencePieceTokenizer`. Otherwise, if you'll be using a vocabulary file from another pre-trained BERT model, you should use `NemoBertTokenizer`. + +.. code-block:: python + + # If you're using a custom vocabulary, create your tokenizer like this + tokenizer = SentencePieceTokenizer(model_path="tokenizer.model") + tokenizer.add_special_tokens(["[MASK]", "[CLS]", "[SEP]"]) + + # Otherwise, create your tokenizer like this + tokenizer = NemoBertTokenizer(vocab_file="vocab.txt") + +We also need to define the BERT model that we will be pre-training. Here, you can configure your model size as needed. + +.. code-block:: python + + bert_model = nemo_nlp.huggingface.BERT( + vocab_size=tokenizer.vocab_size, + num_hidden_layers=NUM_LAYERS, + hidden_size=D_MODEL, + num_attention_heads=NUM_HEADS, + intermediate_size=D_INNER, + max_position_embeddings=MAX_SEQ_LENGTH, + hidden_act=HIDDEN_ACT, + factory=neural_factory) + + # If you want to start pre-training from existing BERT checkpoints, you should create + # the model like this instead. For the full list of BERT model names, check out + # nemo_nlp.huggingface.BERT.list_pretrained_models() + bert_model = nemo_nlp.huggingface.BERT( + pretrained_model_name="bert-base-cased", + factory=neural_factory) + +Next, we will define our loss functions. We will demonstrate how to pre-train with both MLM and NSP losses, but you may observe higher downstream accuracy by only pre-training with MLM loss. + +.. code-block:: python + + mlm_log_softmax = nemo_nlp.TransformerLogSoftmaxNM( + vocab_size=tokenizer.vocab_size, + d_model=D_MODEL, + factory=neural_factory) + mlm_loss = nemo_nlp.MaskedLanguageModelingLossNM(factory=neural_factory) + + mlm_log_softmax.log_softmax.dense.weight = \ + bert_model.bert.embeddings.word_embeddings.weight + + nsp_log_softmax = nemo_nlp.SentenceClassificationLogSoftmaxNM( + d_model=D_MODEL, + num_classes=2, + factory=neural_factory) + nsp_loss = nemo_nlp.NextSentencePredictionLossNM(factory=neural_factory) + + bert_loss = nemo_nlp.LossAggregatorNM( + num_inputs=2, + factory=neural_factory) + +Another crucial pre-training component is our data layer. If you're training on larger corpora, you can pass a directory name into the `dataset` argument, but we can do our example like this: + +.. code-block:: python + + train_data_layer = nemo_nlp.BertPretrainingDataLayer( + tokenizer=tokenizer, + dataset=os.path.join("wikitext-2", "train.txt"), + name="train", + max_seq_length=MAX_SEQ_LENGTH, + mask_probability=MASK_PROBABILITY, + batch_size=BATCH_SIZE, + factory=neural_factory) + + test_data_layer = nemo_nlp.BertPretrainingDataLayer( + tokenizer=tokenizer, + dataset=os.path.join("wikitext-2", "test.txt"), + name="test", + max_seq_length=MAX_SEQ_LENGTH, + mask_probability=MASK_PROBABILITY, + batch_size=BATCH_SIZE_EVAL, + factory=neural_factory) + +Next, we will define our training pipeline. + +.. 
code-block:: python + + input_ids, input_type_ids, input_mask, \ + output_ids, output_mask, nsp_labels = train_data_layer() + + hidden_states = bert_model(input_ids=input_ids, + token_type_ids=input_type_ids, + attention_mask=input_mask) + + train_mlm_log_probs = mlm_log_softmax(hidden_states=hidden_states) + train_mlm_loss = mlm_loss(log_probs=train_mlm_log_probs, + output_ids=output_ids, + output_mask=output_mask) + + train_nsp_log_probs = nsp_log_softmax(hidden_states=hidden_states) + train_nsp_loss = nsp_loss(log_probs=train_nsp_log_probs, labels=nsp_labels) + train_loss = bert_loss(loss_1=train_mlm_loss, loss_2=train_nsp_loss) + +And testing pipeline. + +.. code-block:: python + + input_ids_, input_type_ids_, input_mask_, \ + output_ids_, output_mask_, nsp_labels_ = test_data_layer() + + hidden_states_ = bert_model(input_ids=input_ids_, + token_type_ids=input_type_ids_, + attention_mask=input_mask_) + + test_mlm_log_probs = mlm_log_softmax(hidden_states=hidden_states_) + test_mlm_loss = mlm_loss(log_probs=test_mlm_log_probs, + output_ids=output_ids_, + output_mask=output_mask_) + + test_nsp_log_probs = nsp_log_softmax(hidden_states=hidden_states_) + test_nsp_loss = nsp_loss(log_probs=test_nsp_log_probs, labels=nsp_labels_) + +Now, we will define our callbacks. NeMo provides a variety of callbacks for you to use; in this tutorial, we will make use of `SimpleLossLoggerCallback`, which prints loss values during training, `CheckpointCallback`, which saves model checkpoints at set intervals, and `EvaluatorCallback`, which evaluates test loss at set intervals. + +.. tip:: + + Tensorboard_ is a great debugging tool. It's not a requirement for this tutorial, but if you'd like to use it, you should install tensorboardX_ and run the following command during pre-training: + + .. code-block:: bash + + tensorboard --logdir bert_pretraining_tb + +.. _Tensorboard: https://www.tensorflow.org/tensorboard +.. _tensorboardX: https://github.com/lanpa/tensorboardX + +.. code-block:: python + + try: + import tensorboardX + tb_writer = tensorboardX.SummaryWriter("bert_pretraining_tb") + except ModuleNotFoundError: + tb_writer = None + print("Tensorboard is not available") + + callback_loss = nemo.core.SimpleLossLoggerCallback( + tensors=[train_loss], + print_func=lambda x: print("Loss: {:.3f}".format(x[0].item())), + get_tb_values=lambda x: [["loss", x[0]]], + tb_writer=tb_writer) + + callback_ckpt = nemo.core.CheckpointCallback( + folder=CHECKPOINT_DIR, + step_freq=25000) + + train_data_size = len(train_data_layer) + + # If you're training on multiple GPUs, this should be + # train_data_size / (batch_size * batches_per_step * num_gpus) + steps_per_epoch = int(train_data_size / (BATCHES_PER_STEP * BATCH_SIZE)) + + callback_test = nemo.core.EvaluatorCallback( + eval_tensors=[test_mlm_loss, test_nsp_loss], + user_iter_callback=eval_iter_callback, + user_epochs_done_callback=eval_epochs_done_callback, + eval_step=steps_per_epoch, + tb_writer=tb_writer) + +We also recommend you export your model's parameters to a config file. This makes it easier to load your BERT model into NeMo later, as explained in our NER tutorial. + +.. code-block:: python + + if not os.path.exists(CHECKPOINT_DIR): + os.makedirs(CHECKPOINT_DIR) + + config_path = os.path.join(CHECKPOINT_DIR, "config.json") + if not os.path.exists(config_path): + bert_model.config.to_json_file(config_path) + +Finally, you should define your optimizer, and start training! + +.. 
code-block:: python + + lr_policy = CosineAnnealing(NUM_EPOCHS * steps_per_epoch, + warmup_ratio=LR_WARMUP_PROPORTION) + neural_factory.train(tensors_to_optimize=[train_loss], + lr_policy=lr_policy, + callbacks=[callback_loss, callback_ckpt, callback_test], + batches_per_step=BATCHES_PER_STEP, + optimizer=OPTIMIZER, + optimization_params={ + "batch_size": BATCH_SIZE, + "num_epochs": NUM_EPOCHS, + "lr": LEARNING_RATE, + "weight_decay": WEIGHT_DECAY, + "betas": (0.95, 0.98), + "grad_norm_clip": None + }) + +References +---------- + +.. bibliography:: Bertbib.bib + :style: plain diff --git a/docs/_sources/training.rst.txt b/docs/_sources/training.rst.txt index ffe43817f493..6cc07b6030de 100644 --- a/docs/_sources/training.rst.txt +++ b/docs/_sources/training.rst.txt @@ -1,13 +1,13 @@ Fast Training ============= -Training large model (especially from scratch) requires significant compute. NeMo provides support for mixed precision and distributed training to speed-up trainng. For this NEMO uses `NVIDIA's APEX library `_, which enables to get maximum performance out of NVIDIA's GPUs. Furthermore, multi-GPU systems (such as DGX Station, DGX-1 and DGX-2) have *NVLINK* to speed-up multi-GPU communication. +Training large model (especially from scratch) requires significant compute. NeMo provides support for mixed precision and distributed training to speed-up training. NeMo uses `NVIDIA's APEX library `_ to get maximum performance out of NVIDIA's GPUs. Furthermore, multi-GPU systems (such as DGX Station, DGX-1 and DGX-2) have *NVLINK* to speed-up multi-GPU communication. Mixed Precision ~~~~~~~~~~~~~~~ NVIDIA Volta and Turing GPUs have *Tensor Cores* which can do fast matrix multiplications with values in float16 format. -To enable mixed-precision in NeMo all you need to do is to set `optimization_level` parameter of `nemo.core.NeuralModuleFactory` to `nemo.core.Optimization.mxprO1`. For example: +To enable mixed-precision in NeMo all you need to do is to set `optimization_level` parameter of `nemo.core.NeuralModuleFactory` to `nemo.core.Optimization.mxprO1`. For example: .. code-block:: python @@ -15,7 +15,7 @@ To enable mixed-precision in NeMo all you need to do is to set `optimization_lev optimization_level=nemo.core.Optimization.mxprO1) .. important:: - Mixed precision requires Tensor Cores, so it works only on NVIDIA Volta and Turing GPUs. + Mixed precision requires Tensor Cores, so it works only on NVIDIA Volta and Turing GPUs. Multi-GPU training ~~~~~~~~~~~~~~~~~~ @@ -23,7 +23,7 @@ Multi-GPU training For multi-GPU training: (1) Set `placement` to `nemo.core.DeviceType.AllGpu` in NeuralModuleFactory -(2) Add to your script 'local_rank' argument and do not set it yourself: `parser.add_argument("--local_rank", default=None, type=int)` +(2) Add 'local_rank' argument to your script and do not set it yourself: `parser.add_argument("--local_rank", default=None, type=int)` .. code-block:: python diff --git a/docs/_sources/tutorials/complex_training.rst.txt b/docs/_sources/tutorials/complex_training.rst.txt new file mode 100644 index 000000000000..25e31be265fa --- /dev/null +++ b/docs/_sources/tutorials/complex_training.rst.txt @@ -0,0 +1,175 @@ +Complex Training Pipelines (GAN Example) +======================================== + +So far, training examples have utilized one optimizer to optimize one loss +across all Trainable Neural Modules. NeMo further extends to uses cases that +require multiple losses and multiple optimizers. + +.. 
note:: + These pipelines do not currently support multi-gpu training. + +.. note:: + All of our pipelines only support one datalayer. + +Multiple Losses +--------------- +Taking our Hello World example from earlier. Let's say that we now want to +optimize for both a square error loss and a l1 loss. We can pass both the +square error loss tensor and the l1 loss tensor to +:meth:`NeuralFactory.train()`. +An example is shown below. + +.. code-block:: python + + ### Same as previous example ### + import nemo + + # instantiate Neural Factory with supported backend + nf = nemo.core.NeuralModuleFactory() + + # instantiate necessary neural modules + dl = nemo.tutorials.RealFunctionDataLayer( + n=10000, batch_size=128) + fx = nemo.tutorials.TaylorNet(dim=4) + mse_loss = nemo.tutorials.MSELoss() + + # describe activation's flow + x, y = dl() + p = fx(x=x) + mse_loss_tensor = mse_loss(predictions=p, target=y) + + ### New code starts here ### + # We define our new LossNM and as well as our new loss tensor + l1_loss = nemo.tutorials.L1Loss() + l1_loss_tensor = l1_loss(predictions=p, target=y) + + # SimpleLossLoggerCallback will print loss values to console. + # Update printing function to add both losses + callback = nemo.core.SimpleLossLoggerCallback( + tensors=[l1_loss_tensor, mse_loss_tensor], + print_func=lambda x: print( + f'Train Loss: {str(x[0].item() + x[1].item())}') + ) + + # Invoke "train" action with both loss tensors + nf.train([mse_loss_tensor, l1_loss_tensor], callbacks=[callback], + optimization_params={"num_epochs": 3, "lr": 0.0003}, + optimizer="sgd") + +We can further extend this to optimize one loss at a time. Let's say that +instead of computing derivatives and gradients with respect to +mse_loss + l1_loss, we want to first compute gradients with respect to +mse_loss, do a weight update, and then compute gradients with respect to +l1_loss, and do another weight update. Here we have to define our own training +loop. + +.. code-block:: python + + ### Same as previous example ### + import nemo + + # instantiate Neural Factory with supported backend + nf = nemo.core.NeuralModuleFactory() + + # instantiate necessary neural modules + dl = nemo.tutorials.RealFunctionDataLayer( + n=10000, batch_size=128) + fx = nemo.tutorials.TaylorNet(dim=4) + mse_loss = nemo.tutorials.MSELoss() + l1_loss = nemo.tutorials.L1Loss() + + # describe activation's flow + x, y = dl() + p = fx(x=x) + mse_loss_tensor = mse_loss(predictions=p, target=y) + l1_loss_tensor = l1_loss(predictions=p, target=y) + + # SimpleLossLoggerCallback will print loss values to console. 
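+    # Note: print_func receives the tensors in the order given to `tensors`,
+    # so x[0] is the L1 loss and x[1] is the MSE loss below.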
+ callback = nemo.core.SimpleLossLoggerCallback( + tensors=[l1_loss_tensor, mse_loss_tensor], + print_func=lambda x: print( + f'L1 Loss: {str(x[0].item())}' + f'MSE Loss: {str(x[1].item())}') + ) + + ### New code starts here ### + # We need to create optimizers manually to enable complex training pipelines + optimizer = nf.create_optimizer( + optimizer="sgd", + # Note we have to specify the neural modules or nmtensors that we want + # to optimize for + things_to_optimize=[l1_loss_tensor, mse_loss_tensor], + optimizer_params={"lr": 0.0003}) + + # Now we define our training_loop, which is a list of tuples + # Each tuple should have two elements + # The first element is the optimizer to use + # The second element is the loss we want to optimize + training_loop = [ + # Optimizer MSE first and do a weight update + (optimizer, [mse_loss_tensor]), + # Optimizer L1 second and do a weight update + (optimizer, [l1_loss_tensor]), + ] + + # Invoke "train" action + # Note, we no longer need to pass optimizer since we have a training_loop + nf.train(training_loop, callbacks=[callback], + optimization_params={"num_epochs": 3}) + +Multiple Optimizers and Multiple Losses +--------------------------------------- +NeMo additionally supports use cases where a user would want to create more +than one optimizer. One example of such a use case would be a GAN where +we want to create an optimizer for the generator and an optimizer for the +discriminator. We also want to optimize for different losses in both cases. +Here are the highlights from examples/images/gan.py that enable such behaviour. + +.. code-block:: python + + ... + + # Creation of Neural Modules + generator = nemo_simple_gan.SimpleGenerator( + batch_size=batch_size) + discriminator = nemo_simple_gan.SimpleDiscriminator() + + ... + + # Creation of Loss NM Tensors + # Loss 1: Interpolated image loss + interpolated_loss = disc_loss(decision=interpolated_decision) + # Loss 2: Real image loss + real_loss = neg_disc_loss(decision=real_decision) + # Loss 3: WGAN Gradient Penalty + grad_penalty = disc_grad_penalty( + interpolated_image=interpolated_image, + interpolated_decision=interpolated_decision) + + ... + + # Create optimizers + # Note that we only want one optimizer to optimize either the generator + # or the discriminator + optimizer_G = neural_factory.create_optimizer( + things_to_optimize=[generator], + ...) + optimizer_D = neural_factory.create_optimizer( + things_to_optimize=[discriminator], + ...) + + # Define training_loop + # Note in our training loop, we want to optimize the discriminator + # 3x more compared to our generator + losses_G = [generator_loss] + losses_D = [interpolated_loss, real_loss, grad_penalty] + training_loop = [ + (optimizer_D, losses_D), + (optimizer_D, losses_D), + (optimizer_D, losses_D), + (optimizer_G, losses_G), + ] + + neural_factory.train( + tensors_to_optimize=training_loop, + ...) 
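+
+To make the multiple-optimizer pattern above concrete outside of the GAN setting, below
+is a minimal sketch that reuses the toy modules from the earlier examples. It relies only
+on the ``create_optimizer`` and ``train`` calls shown above; the learning rates and the
+2:1 update schedule are arbitrary choices for illustration.
+
+.. code-block:: python
+
+    import nemo
+
+    nf = nemo.core.NeuralModuleFactory()
+
+    # Toy pipeline: one model, two losses (same modules as in the examples above)
+    dl = nemo.tutorials.RealFunctionDataLayer(n=10000, batch_size=128)
+    fx = nemo.tutorials.TaylorNet(dim=4)
+    mse_loss = nemo.tutorials.MSELoss()
+    l1_loss = nemo.tutorials.L1Loss()
+
+    x, y = dl()
+    p = fx(x=x)
+    mse_loss_tensor = mse_loss(predictions=p, target=y)
+    l1_loss_tensor = l1_loss(predictions=p, target=y)
+
+    # Two optimizers with different hyperparameters, each tied to one loss
+    optimizer_mse = nf.create_optimizer(
+        optimizer="sgd",
+        things_to_optimize=[mse_loss_tensor],
+        optimizer_params={"lr": 0.0003})
+    optimizer_l1 = nf.create_optimizer(
+        optimizer="sgd",
+        things_to_optimize=[l1_loss_tensor],
+        optimizer_params={"lr": 0.0001})
+
+    # Interleave updates: two MSE steps for every L1 step (arbitrary choice,
+    # analogous to the discriminator/generator schedule above)
+    training_loop = [
+        (optimizer_mse, [mse_loss_tensor]),
+        (optimizer_mse, [mse_loss_tensor]),
+        (optimizer_l1, [l1_loss_tensor]),
+    ]
+
+    callback = nemo.core.SimpleLossLoggerCallback(
+        tensors=[mse_loss_tensor, l1_loss_tensor],
+        print_func=lambda x: print(
+            f'MSE Loss: {str(x[0].item())} L1 Loss: {str(x[1].item())}'))
+
+    # With a training_loop there is no need to pass an optimizer to train()
+    nf.train(training_loop, callbacks=[callback],
+             optimization_params={"num_epochs": 3})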
diff --git a/docs/_sources/tutorials/examples.rst.txt b/docs/_sources/tutorials/examples.rst.txt index 3085cd319e23..ee24500170e6 100644 --- a/docs/_sources/tutorials/examples.rst.txt +++ b/docs/_sources/tutorials/examples.rst.txt @@ -18,29 +18,29 @@ This example shows how to build a model which learn Taylor's coefficients for y= import nemo - # instantiate Neural Factory + # instantiate Neural Factory with supported backend nf = nemo.core.NeuralModuleFactory() - # instantiate neural modules - data = nf.get_module(name="RealFunctionDataLayer", collection="toys", - params={"n": 10000, "batch_size": 128}) - f = nf.get_module(name="TaylorNet", collection="toys", params={"dim": 4}) - L = nf.get_module(name="MSELoss", collection="toys", params={}) - - # build model out of neural modules using activations - x, y = data() - p = f(x=x) - loss = L(predictions=p, target=y) - - # add SimpleLossLoggerCallback to print loss values to console - # Callback function converts a list of tensors into string - callback = nemo.core.SimpleLossLoggerCallback(tensor_list2string=lambda x: str(x[0].item())) - - # instantiate SGD as optimizer - optimizer = nf.get_trainer(params={"optimization_params": {"num_epochs": 3, "lr": 0.0003}}) - - # start training - optimizer.train([loss], callbacks=[callback]) + # instantiate necessary neural modules + dl = nemo.tutorials.RealFunctionDataLayer( + n=10000, batch_size=128) + fx = nemo.tutorials.TaylorNet(dim=4) + loss = nemo.tutorials.MSELoss() + + # describe activation's flow + x, y = dl() + p = fx(x=x) + lss = loss(predictions=p, target=y) + + # SimpleLossLoggerCallback will print loss values to console. + callback = nemo.core.SimpleLossLoggerCallback( + tensors=[lss], + print_func=lambda x: print(f'Train Loss: {str(x[0].item())}')) + + # Invoke "train" action + nf.train([lss], callbacks=[callback], + optimization_params={"num_epochs": 3, "lr": 0.0003}, + optimizer="sgd") Simple Chatbot --------------- @@ -48,18 +48,18 @@ Simple Chatbot This is an adaptation of `PyTorch's Chatbot tutorial `_ into NeuralModule's framework. It demonstrates how to do training and evaluation. Model can be describes by graph shown below. Model has: - - * two data layers (one for training and another one for inference), - * encoder and decoder (shared by training and inference), - * two loss modules (one for training and another one for inference). + + * two data layers (one for training and another one for inference), + * encoder and decoder (shared by training and inference), + * two loss modules (one for training and another one for inference). .. image:: chatbot.png During training model will print: - * **SOURCE**: model input - * **PREDICTED RESPONSE**: model output - * **TARGET**: target output + * **SOURCE**: model input + * **PREDICTED RESPONSE**: model output + * **TARGET**: target output .. 
code-block:: python @@ -72,47 +72,44 @@ During training model will print: # Get Data data_file = "movie_data.txt" if not os.path.isfile(data_file): - with gzip.open("../../tests/data/movie_lines.txt.gz", 'rb') as f_in: - with open(data_file, 'wb') as f_out: - shutil.copyfileobj(f_in, f_out) + with gzip.open("../../tests/data/movie_lines.txt.gz", 'rb') as f_in: + with open(data_file, 'wb') as f_out: + shutil.copyfileobj(f_in, f_out) # Configuration config = { - "corpus_name": "cornell", - "datafile": data_file, - "attn_model": 'dot', - "hidden_size": 512, - "encoder_n_layers": 2, - "decoder_n_layers": 2, - "dropout": 0.1, - "voc_size": 6104 + 3, - "batch_size": 128, - "num_epochs": 15, - "optimizer_kind": "adam", - "learning_rate": 0.0003, - "tb_log_dir": "ChatBot", + "corpus_name": "cornell", + "datafile": data_file, + "attn_model": 'dot', + "hidden_size": 512, + "encoder_n_layers": 2, + "decoder_n_layers": 2, + "dropout": 0.1, + "voc_size": 6104 + 3, + "batch_size": 128, + "num_epochs": 15, + "optimizer_kind": "adam", + "learning_rate": 0.0003, + "tb_log_dir": "ChatBot", } - #instantiate neural factory + # instantiate neural factory nf = nemo.core.NeuralModuleFactory() - #instantiate neural modules - dl = nf.get_module(name="DialogDataLayer", collection="tutorials", params=config) - - encoder = nf.get_module(name="EncoderRNN", collection="tutorials", params=config) - - decoder = nf.get_module(name="LuongAttnDecoderRNN", collection="tutorials", params=config) - - L = nf.get_module(name="MaskedXEntropyLoss", collection="tutorials", params={}) - - decoderInfer = nf.get_module(name="GreedyLuongAttnDecoderRNN", collection="tutorials", params=config) + # instantiate neural modules + dl = nemo.tutorials.DialogDataLayer(**config) + encoder = nemo.tutorials.EncoderRNN(**config) + decoder = nemo.tutorials.LuongAttnDecoderRNN(**config) + L = nemo.tutorials.MaskedXEntropyLoss() + decoderInfer = nemo.tutorials.GreedyLuongAttnDecoderRNN(**config) # PARAMETER SHARING: between training and auto-regressive inference decoders decoderInfer.tie_weights_with(decoder, list(decoder.get_weights().keys())) # express activations flow src, src_lengths, tgt, mask, max_tgt_length = dl() - encoder_outputs, encoder_hidden = encoder(input_seq=src, input_lengths=src_lengths) + encoder_outputs, encoder_hidden = encoder(input_seq=src, + input_lengths=src_lengths) outputs, hidden = decoder(targets=tgt, encoder_outputs=encoder_outputs, max_target_len=max_tgt_length) loss = L(predictions=outputs, target=tgt, mask=mask) @@ -120,31 +117,34 @@ During training model will print: # run inference decoder to generate predictions outputs_inf, _ = decoderInfer(encoder_outputs=encoder_outputs) + # define callback function which prints intermediate results to console def outputs2words(tensors, vocab): - source_ids = tensors[0][:, 0].cpu().numpy().tolist() - response_ids = tensors[1][:, 0].cpu().numpy().tolist() - tgt_ids = tensors[2][:, 0].cpu().numpy().tolist() - source = list(map(lambda x: vocab[x], source_ids)) - response = list(map(lambda x: vocab[x], response_ids)) - target = list(map(lambda x: vocab[x], tgt_ids)) - source = ' '.join([s for s in source if s!='EOS' and s!='PAD']) - response = ' '.join([s for s in response if s!='EOS' and s!='PAD']) - target = ' '.join([s for s in target if s!='EOS' and s!='PAD']) - return " SOURCE: {0} <---> PREDICTED RESPONSE: {1} <---> TARGET: {2}".format( - source, response, target) + source_ids = tensors[1][:, 0].cpu().numpy().tolist() + response_ids = tensors[2][:, 0].cpu().numpy().tolist() + 
tgt_ids = tensors[3][:, 0].cpu().numpy().tolist() + source = list(map(lambda x: vocab[x], source_ids)) + response = list(map(lambda x: vocab[x], response_ids)) + target = list(map(lambda x: vocab[x], tgt_ids)) + source = ' '.join([s for s in source if s != 'EOS' and s != 'PAD']) + response = ' '.join([s for s in response if s != 'EOS' and s != 'PAD']) + target = ' '.join([s for s in target if s != 'EOS' and s != 'PAD']) + print(f"Train Loss:{str(tensors[0].item())}") + print(f"SOURCE: {source} <---> PREDICTED RESPONSE: {response} " + f"<---> TARGET: {target}") + callback = nemo.core.SimpleLossLoggerCallback( - tensor_list2string=lambda x: str(x[0].item()), - tensor_list2string_evl=lambda x: outputs2words(x, dl.voc.index2word)) - - # instantiate an optimizer for training - optimizer = nf.get_trainer(params={"optimizer_kind": "adam", - "optimization_params": {"num_epochs": config["num_epochs"], "lr": 0.001}}) + tensors=[loss, src, outputs_inf, tgt], + print_func=lambda x: outputs2words(x, dl.voc.index2word) + ) # start training - optimizer.train(tensors_to_optimize=[loss], tensors_to_evaluate=[src, outputs_inf, tgt], - callbacks=[callback]) + nf.train( + tensors_to_optimize=[loss], + callbacks=[callback], + optimizer="adam", + optimization_params={"num_epochs": config["num_epochs"], "lr": 0.001}) .. note:: Look for more examples under `nemo/examples` diff --git a/docs/_sources/tutorials/intro.rst.txt b/docs/_sources/tutorials/intro.rst.txt index 966523582755..e1758354238f 100644 --- a/docs/_sources/tutorials/intro.rst.txt +++ b/docs/_sources/tutorials/intro.rst.txt @@ -10,11 +10,4 @@ Getting started custommodules weightsharing callbacks - - - - - - - - + complex_training diff --git a/docs/_sources/tutorials/weightsharing.rst.txt b/docs/_sources/tutorials/weightsharing.rst.txt index 48d572bfbc1b..480e987405f4 100644 --- a/docs/_sources/tutorials/weightsharing.rst.txt +++ b/docs/_sources/tutorials/weightsharing.rst.txt @@ -12,10 +12,10 @@ For example: .. code-block:: python ... - train_dataloader=nf.get_module(name="TrainDataLayer", params=train_config) - eval_dataloader = nf.get_module(name="EvalDataLayer", params=eval_config) + train_dataloader = nemo.TrainDataLayer(**train_config) + eval_dataloader = nemo.EvalDataLayer(**eval_config) - L = nf.get_module(name="MaskedXEntropyLoss", params={}) + L = nemo.MaskedXEntropyLoss() # training model diff --git a/docs/api-docs/modules.html b/docs/api-docs/modules.html index 662bef4609b5..29bd343a3519 100644 --- a/docs/api-docs/modules.html +++ b/docs/api-docs/modules.html @@ -8,7 +8,7 @@ - NEMO API — nemo 0.1 documentation + NeMo API — nemo 0.1 documentation @@ -90,8 +90,9 @@
  • Getting started
  • Fast Training
  • Speech Recognition
  • -
  • NEMO Collections API
  • -
  • NEMO API
  • Models
  • -
  • NEMO Collections API
  • -
  • NEMO API
  • +
  • Natural Language Processing
  • +
  • NeMo Collections API
  • +
  • NeMo API
  • Frequently Asked Questions (FAQ)
  • @@ -202,6 +205,72 @@

    WSJ
    Switchboard and CallHome

    coming soon …

    +
    +

    Fisher English Training Speech

    +

    Run these scripts to convert the Fisher English Training Speech data into a format expected by the nemo_asr collection.

    +

    In brief, the following scripts convert the .sph files to .wav, slice those files into smaller audio samples, match the smaller slices with their corresponding transcripts, and split the resulting audio segments into train, validation, and test sets (with one manifest each).

    +
    +

    Note

    +

    You will need at least 106GB of space to run the .wav conversion, and an additional 105GB for the slicing and matching. +You will need to have sph2pipe installed in order to run the .wav conversion.

    +
    +

    Instructions

    +

    These scripts assume that you already have the Fisher dataset from the Linguistic Data Consortium, with a directory structure that looks something like this:

    +
    FisherEnglishTrainingSpeech/
    +├── LDC2004S13-Part1
    +│   ├── fe_03_p1_transcripts
    +│   ├── fisher_eng_tr_sp_d1
    +│   ├── fisher_eng_tr_sp_d2
    +│   ├── fisher_eng_tr_sp_d3
    +│   └── ...
    +└── LDC2005S13-Part2
    +    ├── fe_03_p2_transcripts
    +    ├── fe_03_p2_sph1
    +    ├── fe_03_p2_sph2
    +    ├── fe_03_p2_sph3
    +    └── ...
    +
    +
    +

    The transcripts that will be used are located in fe_03_p<1,2>_transcripts/data/trans, and the audio files (.sph) are located in the remaining directories in an audio subdirectory.

    +

    First, convert the audio files from .sph to .wav by running:

    +
    cd <nemo_root>/scripts
    +python fisher_audio_to_wav.py \
    +  --data_root=<fisher_root> --dest_root=<conversion_target_dir>
    +
    +
    +

    This will place the unsliced .wav files in <conversion_target_dir>/LDC200[4,5]S13-Part[1,2]/audio-wav/. +It will take several minutes to run.

    +

    Next, process the transcripts and slice the audio data:

    +
    python process_fisher_data.py \
    +  --audio_root=<conversion_target_dir> --transcript_root=<fisher_root> \
    +  --dest_root=<processing_target_dir> \
    +  --remove_noises
    +
    +
    +

    This script will split the full dataset into train, validation, and test sets, and place the audio slices in the corresponding folders in the destination directory. +One manifest will be written out per set, which includes each slice’s transcript, duration, and path.

    +

    This will likely take around 20 minutes to run. +Once finished, you may delete the 10 minute long .wav files if you wish.

    +
    +
    +
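    As a quick sanity check, each line of a generated manifest is a JSON record describing one audio slice. A minimal sketch for inspecting the first few records is shown below; the manifest filename and the field names (audio_filepath, duration, text) are assumptions based on typical nemo_asr manifests and should be verified against the files the script actually writes.

    .. code-block:: python

        import json

        # Hypothetical path: substitute the manifest that process_fisher_data.py
        # wrote into your <processing_target_dir>.
        manifest_path = "<processing_target_dir>/train_manifest.json"

        with open(manifest_path) as f:
            for i, line in enumerate(f):
                entry = json.loads(line)
                # Field names assumed from typical nemo_asr manifests.
                print(entry.get("audio_filepath"),
                      entry.get("duration"),
                      entry.get("text"))
                if i == 4:
                    break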

    2000 HUB5 English Evaluation Speech

    +

    Run the following script to convert the HUB5 data into a format expected by the nemo_asr collection.

    +

    Similarly to the Fisher dataset processing scripts, this script converts the .sph files to .wav, slices the audio files and transcripts into utterances, and combines them into segments of some minimum length (default is 10 seconds). +The resulting segments are all written out to an audio directory, and the corresponding transcripts are written to a manifest JSON.

    +
    +

    Note

    +

    You will need 5GB of free space to run this script. +You will also need to have sph2pipe installed.

    +
    +

    This script assumes you already have the 2000 HUB5 dataset from the Linguistic Data Consortium.

    +

    Run the following to process the 2000 HUB5 English Evaluation Speech samples:

    +
    python process_hub5_data.py \
    +  --data_root=<path_to_HUB5_data> \
    +  --dest_root=<target_dir>
    +
    +
    +

    You may optionally include --min_slice_duration=<num_seconds> if you would like to change the minimum audio segment duration.

    +

    Building Your Own Dataset

    coming soon …

    diff --git a/docs/asr/garnet.html b/docs/asr/garnet.html index 93c82a4c8b89..69f8f93413df 100644 --- a/docs/asr/garnet.html +++ b/docs/asr/garnet.html @@ -35,7 +35,7 @@ - + @@ -101,8 +101,9 @@ -
  • NEMO Collections API
  • -
  • NEMO API
  • +
  • Natural Language Processing
  • +
  • NeMo Collections API
  • +
  • NeMo API
  • Frequently Asked Questions (FAQ)
  • @@ -185,7 +186,7 @@

    GarNet - + diff --git a/docs/asr/intro.html b/docs/asr/intro.html index b8898e8d552b..cc4eed364be9 100644 --- a/docs/asr/intro.html +++ b/docs/asr/intro.html @@ -95,8 +95,9 @@
  • Models
  • -
  • NEMO Collections API
  • -
  • NEMO API
  • +
  • Natural Language Processing
  • +
  • NeMo Collections API
  • +
  • NeMo API
  • Frequently Asked Questions (FAQ)
  • @@ -189,6 +190,8 @@
  • Mozilla Common Voice
  • WSJ
  • Switchboard and CallHome
  • +
  • Fisher English Training Speech
  • +
  • 2000 HUB5 English Evaluation Speech
  • Building Your Own Dataset
  • diff --git a/docs/asr/jasper.html b/docs/asr/jasper.html index e9282b7d7d2f..939adb440c01 100644 --- a/docs/asr/jasper.html +++ b/docs/asr/jasper.html @@ -101,8 +101,9 @@ -
  • NEMO Collections API
  • -
  • NEMO API
  • +
  • Natural Language Processing
  • +
  • NeMo Collections API
  • +
  • NeMo API
  • Frequently Asked Questions (FAQ)
  • @@ -174,7 +175,7 @@

    Jasper

    -

    Jasper (“Just Another SPeech Recognizer”) [1] is a deep time delay neural network (TDNN) comprising of blocks of 1D-convolutional layers. +

    Jasper (“Just Another SPeech Recognizer”) [1] is a deep time delay neural network (TDNN) composed of blocks of 1D-convolutional layers. The Jasper family of models is denoted Jasper_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within a block. Each sub-block contains a 1-D convolution, batch normalization, ReLU, and dropout:

    japer model diff --git a/docs/asr/models.html b/docs/asr/models.html index 0f999242a148..1a27df4db9be 100644 --- a/docs/asr/models.html +++ b/docs/asr/models.html @@ -101,8 +101,9 @@ -
  • NEMO Collections API
  • -
  • NEMO API
  • +
  • Natural Language Processing
  • +
  • NeMo Collections API
  • +
  • NeMo API
  • Frequently Asked Questions (FAQ)
  • diff --git a/docs/asr/quartznet.html b/docs/asr/quartznet.html index 745749decd22..20ae0274d94a 100644 --- a/docs/asr/quartznet.html +++ b/docs/asr/quartznet.html @@ -101,8 +101,9 @@ -
  • NEMO Collections API
  • -
  • NEMO API
  • +
  • Natural Language Processing
  • +
  • NeMo Collections API
  • +
  • NeMo API
  • Frequently Asked Questions (FAQ)
  • diff --git a/docs/asr/tutorial.html b/docs/asr/tutorial.html index ce2cbd6b532e..5d202ffbb93c 100644 --- a/docs/asr/tutorial.html +++ b/docs/asr/tutorial.html @@ -112,8 +112,9 @@
  • Models
  • -
  • NEMO Collections API
  • -
  • NEMO API
  • +
  • Natural Language Processing
  • +
  • NeMo Collections API
  • +
  • NeMo API
  • Frequently Asked Questions (FAQ)
  • @@ -191,7 +192,7 @@

    Tutorial

    Introduction

    -

    This Automatic Speech Recognition (ASR) tutorial is focused on Jasper [1] model. Jasper is CTC-based [1] end-to-end model. The model is called “end-to-end” because it transcripts speech samples without any additional alignmet information. CTC allows finding an alignment between audio and text. +

    This Automatic Speech Recognition (ASR) tutorial focuses on the Jasper [1] model. Jasper is a CTC-based [1] end-to-end model. The model is called “end-to-end” because it transcribes speech samples without any additional alignment information. CTC allows finding an alignment between audio and text. The CTC-ASR training pipeline consists of the following blocks:

    1. audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC)

    2. @@ -205,7 +206,7 @@

      Introduction

      Get data

      -

      We will be using an open-source Librispeech [3] dataset. These scripts will download and convert Librispeech into format expected by nemo_asr:

      +

      We will be using the open-source LibriSpeech [3] dataset. These scripts will download and convert LibriSpeech into the format expected by nemo_asr:

      -

      After donwload and conversion are completed, your data folder should contain 2 manifests:

      +

      After download and conversion, your data folder should contain 2 json files:

    @@ -413,14 +434,14 @@

    Multi-GPU training
    python -m torch.distributed.launch --nproc_per_node=8 <nemo_git_repo_root>/examples/asr/jasper.py --num_gpus=8 ...
    +
    python -m torch.distributed.launch --nproc_per_node=8 <nemo_git_repo_root>/examples/asr/jasper.py ...
     

    Large Training Example

    Please refer to the <nemo_git_repo_root>/examples/asr/jasper.py for comprehensive example. It builds one train DAG and up to three validation DAGs to evaluate on different datasets.

    Assuming, you are working with Volta-based DGX, you can run training like this:

    -
    python -m torch.distributed.launch --nproc_per_node=8 <nemo_git_repo_root>/examples/asr/jasper.py --batch_size=64 --num_gpus=8 --num_epochs=100 --lr=0.015 --warmup_steps=8000 --weight_decay=0.001 --train_manifest=/manifests/librivox-train-all.json --val_manifest1=/manifests/librivox-dev-clean.json --val_manifest2=/manifests/librivox-dev-other.json --model_config=<nemo_git_repo_root>/nemo/examples/asr/configs/jasper15x5SEP.yaml --exp_name=MyLARGE-ASR-EXPERIMENT
    +
    python -m torch.distributed.launch --nproc_per_node=8 <nemo_git_repo_root>/examples/asr/jasper.py --batch_size=64 --num_epochs=100 --lr=0.015 --warmup_steps=8000 --weight_decay=0.001 --train_dataset=/manifests/librivox-train-all.json --eval_datasets /manifests/librivox-dev-clean.json /manifests/librivox-dev-other.json --model_config=<nemo_git_repo_root>/nemo/examples/asr/configs/jasper15x5SEP.yaml --exp_name=MyLARGE-ASR-EXPERIMENT
     

    The command above should trigger 8-GPU training with mixed precision. The various manifest (.json) files in the command correspond to different datasets; substitute them with the ones containing your data.

    @@ -451,55 +472,7 @@

    Fine-tuning

    Inference

    First, download the pre-trained model (jasper_encoder, jasper_decoder and configuration files) from here into <path_to_checkpoints>. We will use this pre-trained model to measure WER on the LibriSpeech dev-clean dataset.

    -
    import nemo
    -import nemo_asr
    -
    -# Path to the inference data
    -inference_manifest = "<path_to_data>/dev_clean.json"
    -
    -# Import Jasper model definition
    -# Note that we are using a much larger 15x5 model now instead of 12x1
    -from ruamel.yaml import YAML
    -yaml = YAML(typ="safe")
    -with open("<nemo_git_repo_root>/examples/asr/configs/jasper15x5SEP.yaml") as f:
    -    jasper_model_definition = yaml.load(f)
    -labels = jasper_model_definition['labels']
    -
    -# Instantiate neural modules
    -data_layer = nemo_asr.AudioToTextDataLayer(manifest_filepath=inference_manifest,
    -    labels=labels, batch_size=64, shuffle=False,)
    -data_preprocessor = nemo_asr.AudioPreprocessing()
    -jasper_encoder = nemo_asr.JasperEncoder(feat_in=64,
    -    **jasper_model_definition['JasperEncoder'])
    -jasper_decoder = nemo_asr.JasperDecoderForCTC(feat_in=1024, num_classes=len(labels))
    -greedy_decoder = nemo_asr.GreedyCTCDecoder()
    -
    -# Define inference model
    -audio_signal, audio_signal_len, transcript, transcript_len = data_layer()
    -processed_signal, processed_signal_len = data_preprocessor(
    -    input_signal=audio_signal, length=audio_signal_len)
    -encoded, encoded_len = jasper_encoder(
    -    audio_signal=processed_signal, length=processed_signal_len)
    -log_probs = jasper_decoder(encoder_output=encoded)
    -predictions = greedy_decoder(log_probs=log_probs)
    -
    -eval_tensors=[predictions, transcript, transcript_len]
    -from nemo_asr.helpers import post_process_predictions, \
    -                             post_process_transcripts, word_error_rate
    -infer_callback = nemo.core.InferenceCallback(
    -    eval_tensors=eval_tensors)
    -
    -nf = nemo.core.NeuralModuleFactory()
    -
    -optimizer = nf.get_trainer(params={})
    -evaluated_tensors = optimizer.infer(callback=infer_callback,
    -    checkpoint_dir="<path_to_checkpoints>/15x5SEP/")
    -
    -hypotheses = post_process_predictions(evaluated_tensors[0], labels=labels)
    -references = post_process_transcripts(evaluated_tensors[1], labels=labels,
    -                                  transcript_len_list=evaluated_tensors[2])
    -wer = word_error_rate(hypotheses=hypotheses, references=references)
    -print("Greedy WER {:.2f}".format(wer*100))
    +
    python <nemo_git_repo_root>/examples/asr/jasper_infer.py --model_config=<nemo_git_repo_root>/examples/asr/configs/jasper15x5SEP.yaml --eval_datasets "<path_to_data>/dev_clean.json" --load_dir=<path_to_checkpoints>
     
    @@ -507,68 +480,18 @@

    Inference

    Using KenLM

    -

    We will be using BAIDU’s CTC decoder with LM implementation..

    +

    We will be using Baidu’s CTC decoder with LM implementation.

    Perform the following steps:

    • Go to cd <nemo_git_repo_root>/scripts

    • -
    • Install BAIDU’s CTC decoders sudo apt-get install swig and ./install_decoders.sh

    • +
    • Install Baidu’s CTC decoders sudo apt-get install swig and ./install_decoders.sh

    • Build 6-gram KenLM model on LibriSpeech ./build_6-gram_OpenSLR_lm.sh

    • -
    • Add the following lines to the inference script right after -predictions = greedy_decoder(log_probs=log_probs) :

    • +
    • Run jasper_infer.py with the --lm_path flag

    -
    predictions = greedy_decoder(log_probs=log_probs)
    -
    -import os
    -
    -# Instantiate BeamSearch NM
    -# Feel free to experiment with alpha, and beta parameters
    -beam_search_with_lm = nemo_asr.BeamSearchDecoderWithLM(
    -    vocab=labels,
    -    beam_width=128,
    -    alpha=2.2,
    -    beta=0.5,
    -    lm_path="<path_to_lm>/6-gram.binary",
    -    num_cpus=max(os.cpu_count(), 1))
    -beam_predictions = beam_search_with_lm(log_probs=log_probs,
    -                                       log_probs_length=encoded_len)
    -eval_tensors.append(beam_predictions)
    -
    -# Rest of code is slightly modified from the above script
    -from nemo_asr.helpers import post_process_predictions, \
    -                             post_process_transcripts, word_error_rate
    -infer_callback = nemo.core.InferenceCallback(
    -    # We add beam_predictions to eval_tensors
    -    eval_tensors=[predictions, transcripts, beam_predictions],
    -)
    -
    -nf = nemo.core.NeuralModuleFactory(backend=nemo.core.Backend.PyTorch)
    -
    -optimizer = nf.get_trainer(params={})
    -evaluated_tensors = optimizer.infer(callback=infer_callback,
    -   checkpoint_dir="<path_to_checkpoints>/15x5SEP/")
    -
    -hypotheses = post_process_predictions(evaluated_tensors[0], labels=labels)
    -references = post_process_transcripts(evaluated_tensors[1], labels=labels,
    -                          transcript_len_list=evaluated_tensors[2])
    -wer = word_error_rate(hypotheses=hypotheses, references=references)
    -print("Greedy WER {:.2f}".format(wer*100))
    -
    -# Post processing the new beam search predictions
    -beam_hypotheses = []
    -for i in evaluated_tensors[-1]:
    -    # Over samples
    -    for j in i:
    -        beam_hypotheses.append(j[0][1])
    -
    -beam_wer = word_error_rate(
    -    hypotheses=beam_hypotheses, references=references)
    -print("Beam WER {:.2f}".format(beam_wer*100))
    +
    python <nemo_git_repo_root>/examples/asr/jasper_infer.py --model_config=<nemo_git_repo_root>/examples/asr/configs/jasper15x5SEP.yaml --eval_datasets "<path_to_data>/dev_clean.json" --load_dir=<path_to_checkpoints> --lm_path=<path_to_6gram>
     
    -
      -
    • Run your updated inference script!

    • -
    diff --git a/docs/collections/core.html b/docs/collections/core.html index 02256db2361b..290b961a5daa 100644 --- a/docs/collections/core.html +++ b/docs/collections/core.html @@ -8,7 +8,7 @@ - NEMO Common Collection — nemo 0.1 documentation + NeMo Common Collection — nemo 0.1 documentation @@ -36,7 +36,7 @@ - + @@ -90,12 +90,13 @@
  • Getting started
  • Fast Training
  • Speech Recognition
  • -
  • NEMO Collections API @@ -142,9 +143,9 @@
  • Docs »
  • -
  • NEMO Collections API »
  • +
  • NeMo Collections API »
  • -
  • NEMO Common Collection
  • +
  • NeMo Common Collection
  • @@ -164,8 +165,8 @@
    -

    NEMO Common Collection

    -

    NEMO core package comes with “common” collection for pytorch built-in:

    +

    NeMo Common Collection

    +

    The NeMo core package comes with a built-in “common” collection for PyTorch:

    class nemo.backends.pytorch.common.data.TextDataLayer(path, labels, eos_id, pad_id, batch_size, drop_last=False, num_workers=0, **kwargs)[source]
    @@ -219,8 +220,38 @@

    NEMO Common Collection

    +
+class nemo.backends.pytorch.common.losses.CrossEntropyLoss(**kwargs)[source]
+
+    Bases: nemo.backends.pytorch.nm.LossNM
+
+    Input Ports:
+        • logits:
+            • 0-><class ‘nemo.core.neural_types.BatchTag’>:None:None
+            • 1-><class ‘nemo.core.neural_types.ChannelTag’>:None:None
+        • labels:
+            • 0-><class ‘nemo.core.neural_types.BatchTag’>:None:None
+
+    Output Ports:
+        • loss:
+            • non-tensor object
+
    + +
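    A minimal usage sketch for this loss module follows; only the constructor and the logits/labels/loss port names come from the listing above, while the upstream modules that would produce the logits and labels tensors are hypothetical and not shown.

    .. code-block:: python

        from nemo.backends.pytorch.common.losses import CrossEntropyLoss

        def attach_cross_entropy(logits, labels):
            """Attach a CrossEntropyLoss NM to existing NmTensors.

            `logits` ([batch, classes]) and `labels` ([batch]) are assumed to
            be produced by upstream neural modules (e.g. a data layer and a
            classifier), which are omitted here.
            """
            ce_loss = CrossEntropyLoss()
            return ce_loss(logits=logits, labels=labels)

        # The returned loss tensor can then be passed to
        # neural_factory.train(tensors_to_optimize=[loss_tensor], ...).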
    -class nemo.backends.pytorch.common.losses.SequenceLoss(pad_id=0, smoothing_coef=0.0, aux_ctc=False, ctc_initial_coef=0.1, ctc_blank_id=None, **kwargs)[source]
    +class nemo.backends.pytorch.common.losses.SequenceLoss(pad_id=0, smoothing_coef=0.0, sample_wise=False, aux_ctc=False, ctc_initial_coef=0.1, ctc_blank_id=None, **kwargs)[source]

    Bases: nemo.backends.pytorch.nm.LossNM

    Loss for seq2seq tasks

    @@ -230,6 +261,9 @@

    NEMO Common Collection

  • smoothing_coef (float) – Label smoothing coefficient in range [0, 1]. Defaults to 0.0.

  • +
  • sample_wise (bool) – Flag indicates if loss sum divisor should be batch +size. +Defaults to False.

  • aux_ctc (bool) – Whether to add auxiliary CTC loss. Defaults to False.

  • ctc_initial_coef (float) – Initial coefficient to multiply ctc component @@ -414,7 +448,7 @@

    NEMO Common Collection
    class nemo.backends.pytorch.common.search.GreedySearch(decoder, pad_id, bos_id, eos_id, max_len, batch_size=None, **kwargs)[source]
    -

    Bases: nemo.backends.pytorch.nm.TrainableNM

    +

    Bases: nemo.backends.pytorch.nm.NonTrainableNM

    Greedy translation search.

    For encoder-decoder based models.

    @@ -516,7 +550,7 @@

    NEMO Common Collection - +

  • diff --git a/docs/collections/modules.html b/docs/collections/modules.html index f74f20a11b61..0e381225d9c2 100644 --- a/docs/collections/modules.html +++ b/docs/collections/modules.html @@ -8,7 +8,7 @@ - NEMO Collections API — nemo 0.1 documentation + NeMo Collections API — nemo 0.1 documentation @@ -35,8 +35,8 @@ - - + + @@ -90,12 +90,13 @@
  • Getting started
  • Fast Training
  • Speech Recognition
  • -
  • NEMO Collections API @@ -142,7 +143,7 @@
  • Docs »
  • -
  • NEMO Collections API
  • +
  • NeMo Collections API
  • @@ -162,10 +163,10 @@
    -

    NEMO Collections API

    +

    NeMo Collections API

    - + @@ -294,12 +307,16 @@

    E

    - + @@ -334,6 +351,10 @@

    F

    G

    @@ -407,12 +430,16 @@

    L