Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of NLU YAML parser #5989

Merged
merged 4 commits into from
Jul 6, 2020
Merged

Conversation

degiz
Copy link
Contributor

@degiz degiz commented Jun 10, 2020

Part of #5983

Proposed changes:

  • Created new YAML parser
  • Extracted common parts for YAML and MD to the separate classes
  • Added unit tests

Things to address in the next PRs

  • Update the docs
  • Split loaders for Core and NLU files

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@degiz degiz changed the title Initial commit NL Jun 10, 2020
@degiz degiz changed the title NL YAML parser for the NLU training data Jun 10, 2020
@degiz degiz force-pushed the 5983_new_nlu_parser branch 2 times, most recently from 1611414 to ac8aae0 Compare June 17, 2020 12:27
@degiz degiz changed the title YAML parser for the NLU training data Initial implenemtation of NLU YAML parser Jun 17, 2020
@degiz degiz requested review from federicotdn and ricwo June 17, 2020 12:40
@degiz degiz force-pushed the 5983_new_nlu_parser branch 2 times, most recently from e808fb9 to 6ea2e8c Compare June 17, 2020 13:59
@degiz degiz marked this pull request as ready for review June 22, 2020 07:18
@federicotdn
Copy link
Contributor

I started testing this by using an old folder I had initialized with rasa init --no-prompt (on master). Then, I changed nlu.md to nlu.yml, and adapted the contents to YAML. I checked the validity of the YAML just in case. I also deleted nlu.md.

However, after running rasa train there seems to be a problem when stories.md is loaded. An exception is raised:

Traceback (most recent call last):
  File "/home/fede/Workspace/rasa/env/bin/rasa", line 11, in <module>
    load_entry_point('rasa', 'console_scripts', 'rasa')()
  File "/home/fede/Workspace/rasa/rasa/__main__.py", line 108, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/home/fede/Workspace/rasa/rasa/cli/train.py", line 77, in train
    nlu_additional_arguments=extract_nlu_additional_arguments(args),
  File "/home/fede/Workspace/rasa/rasa/train.py", line 52, in train
    nlu_additional_arguments=nlu_additional_arguments,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/home/fede/Workspace/rasa/rasa/train.py", line 88, in train_async
    config, domain, training_files
  File "/home/fede/Workspace/rasa/rasa/importers/importer.py", line 81, in load_from_config
    config, config_path, domain_path, training_data_paths
  File "/home/fede/Workspace/rasa/rasa/importers/importer.py", line 139, in load_from_dict
    RasaFileImporter(config_path, domain_path, training_data_paths)
  File "/home/fede/Workspace/rasa/rasa/importers/rasa.py", line 36, in __init__
    training_data_paths
  File "/home/fede/Workspace/rasa/rasa/data.py", line 96, in get_core_nlu_files
    path
  File "/home/fede/Workspace/rasa/rasa/data.py", line 117, in _find_core_nlu_files_in_directory
    if is_nlu_file(full_path):
  File "/home/fede/Workspace/rasa/rasa/data.py", line 145, in is_nlu_file
    return loading.guess_format(file_path) != loading.UNK
  File "/home/fede/Workspace/rasa/rasa/nlu/training_data/loading.py", line 184, in guess_format
    content = io_utils.read_yaml_file(filename)
  File "/home/fede/Workspace/rasa/rasa/utils/io.py", line 209, in read_yaml_file
    return read_yaml(read_file(filename, DEFAULT_ENCODING))
  File "/home/fede/Workspace/rasa/rasa/utils/io.py", line 124, in read_yaml
    return yaml_parser.load(content) or {}
  File "/home/fede/Workspace/rasa/env/lib/python3.7/site-packages/ruamel/yaml/main.py", line 343, in load
    return constructor.get_single_data()
  File "/home/fede/Workspace/rasa/env/lib/python3.7/site-packages/ruamel/yaml/constructor.py", line 111, in get_single_data
    node = self.composer.get_single_node()
  File "_ruamel_yaml.pyx", line 703, in _ruamel_yaml.CParser.get_single_node
  File "_ruamel_yaml.pyx", line 904, in _ruamel_yaml.CParser._parse_next_event
ruamel.yaml.scanner.ScannerError: while scanning an alias
  in "<unicode string>", line 2, column 1
did not find expected alphabetic or numeric character
  in "<unicode string>", line 2, column 2

It appears to be a problem with _is_nlg_story_format, which is not detecting stories.md as a Markdown stories file for some reason. But that's strange since this PR does not modify that function (or the regexp it uses).

@federicotdn
Copy link
Contributor

A simple way to reproduce: on master, run rasa init --no-prompt, then switch to this branch and do rasa train (no YAML needed).

@degiz
Copy link
Contributor Author

degiz commented Jun 23, 2020

@federicotdn thanks for spotting that!
I think locally I only tried the rasa train --nlu. Problem fixed, but I also need to make sure that situation is covered with a test 😄
Also, there's a bunch of test failing in the test_train.py, I'll check that.

Copy link
Contributor

@federicotdn federicotdn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested a couple of different NLU YAML files and it's working great! I have a couple of thoughts that I've written down, in no particular order:

  • Some warnings are appearing three times when I train. For example, if I omit the examples section of a lookup table. Not sure if this is specific to YAML.
  • If a user has a typo on their section name (e.g. regexp instead of regex) I think we could help them with a warning in that case.
  • For the warnings that start with Unexpected block found in..., maybe we could change the >> for a got: {value}, but expected a multiline string.
  • If I specify a list of strings under examples for an intent, an exception is raised. I think we should cover this case because users might try to write examples with lists (without metadata).
  • In interactive learning, I am forced to pick a file ending with .md to write the new data. I am not sure if we want to update interactive learning with this PR, maybe it's out of scope.
  • The rasa init command should create a nlu.yml file instead of a nlu.md file. Again, maybe not in the scope of this PR.
  • Some methods are missing docstrings.

rasa/data.py Outdated Show resolved Hide resolved
rasa/nlu/featurizers/sparse_featurizer/regex_featurizer.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/entities_parser.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/formats/rasa_yaml.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/formats/rasa_yaml.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/formats/rasa_yaml.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/formats/readerwriter.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/loading.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/lookup_tables_parser.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/synonyms_parser.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ricwo ricwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested some basic training and it works well! when editing multiline examples, I think it's quite intuitive to provide lists instead of a multiline string. I think @federicotdn already suggested this in the review, but I think it would be awesome if we could support both.

When making YAML syntax errors, it would be very helpful if we could output some hints about the error. This could even be just the YAML exception. Right now, all the CLI shows is Path 'data' doesn't contain valid NLU data in it. Please verify the data format. The NLU model training will be skipped now.. What do you think?

rasa/nlu/training_data/entities_parser.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/entities_parser.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/entities_parser.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/entities_parser.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/entities_parser.py Outdated Show resolved Hide resolved
tests/nlu/training_data/formats/test_rasa_yaml.py Outdated Show resolved Hide resolved
tests/nlu/training_data/test_entities_parser.py Outdated Show resolved Hide resolved
tests/nlu/training_data/test_training_data.py Outdated Show resolved Hide resolved
@degiz
Copy link
Contributor Author

degiz commented Jun 27, 2020

@ricwo @federicotdn Thanks a lot for your patience and the detailed code review! I know there were obvious issues with missing docstrings and other stuff 😞 I was in a bit of hurry to get this thing working asap, sorry!

@degiz degiz force-pushed the 5983_new_nlu_parser branch 4 times, most recently from 400763f to 4b9ea9a Compare June 28, 2020 07:21
@degiz
Copy link
Contributor Author

degiz commented Jun 30, 2020

When making YAML syntax errors

@ricwo I have a plan to verify both NLU and Core YAML files against a schema. It will print a proper message like "Unexpected key..". I want to do that in a separate PR not to make this one even bigger.

provide lists instead of a multiline string

The latest decision we made was to be strict and support only multiple strings. Thanks for spotting that, I'll adjust the code and the tests.

@degiz degiz force-pushed the 5983_new_nlu_parser branch 5 times, most recently from b51759c to 4f8272a Compare July 1, 2020 15:19
@wochinge wochinge changed the title Initial implenemtation of NLU YAML parser Initial implementation of NLU YAML parser Jul 2, 2020
@degiz degiz force-pushed the 5983_new_nlu_parser branch 5 times, most recently from a2f8a17 to 7e5e5b1 Compare July 2, 2020 12:53
@degiz degiz requested review from federicotdn and ricwo July 2, 2020 13:33
Copy link
Contributor

@ricwo ricwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job 👍

I'm seeing some confusing warning messages when training with the example nlu.yml:

UserWarning: Unexpected key 'metadata' found in 'data/nlu.yml'.Supported keys are: 'intent', 'synonym', 'regex', 'lookup'. This key will be skipped.

and

UserWarning: Unexpected key 'examples' found in 'data/nlu.yml'.Supported keys are: 'intent', 'synonym', 'regex', 'lookup'. This key will be skipped.

It might just be a matter of adding them as elifs in rasa_yaml._parse_items()

rasa/nlu/training_data/entities_parser.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/entities_parser.py Outdated Show resolved Hide resolved
@@ -151,32 +136,27 @@ def _load_files(self, line: Text) -> None:

def _parse_item(self, line: Text) -> None:
"""Parses an md list item line based on the current section type."""
import rasa.nlu.training_data.lookup_tables_parser as lookup_tables_parser
import rasa.nlu.training_data.synonyms_parser as synonyms_parser

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can these be module-level imports since this is called very frequently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately there is something wrong with the cyclic import in the whole NLU package and I haven't managed yet to resolve it.

import rasa.nlu.training_data.entities_parser as entities_parser
import rasa.nlu.training_data.synonyms_parser as synonyms_parser
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, can they be module-level imports or does it lead to circular imports?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same answer.

rasa/nlu/training_data/formats/rasa_yaml.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/formats/markdown.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/lookup_tables_parser.py Outdated Show resolved Hide resolved
rasa/nlu/training_data/lookup_tables_parser.py Outdated Show resolved Hide resolved
Comment on lines 512 to 513
def test_guess_format_from_non_existing_file_path():
assert guess_format("not existing path") == UNK
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test could be added to the parametrized one: test_get_supported_file_format()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, if the path doesn't exist, the get_file_format() raises.

rasa/nlu/training_data/formats/rasa_yaml.py Outdated Show resolved Hide resolved
Copy link
Contributor

@federicotdn federicotdn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some testing again! Some comments:

  • Getting the same as Rick for the examples and metadata keys in particular.
  • I would also add that warnings are appearing twice. If I add a foo key, I get twice:
/home/fede/Workspace/rasa/rasa/utils/common.py:364: UserWarning: Unexpected key 'foo' found in 'data/nlu.yml'.Supported keys are: 'intent', 'synonym', 'regex', 'lookup'. This key will be skipped.
  More info at https://rasa.com/docs/rasa/nlu/training-data-format/
  • Using a list of strings instead of a multiline string crashes the parser with an error that users won't understand. We could either show a warning and ignore the contents, or raise an exception with a more informative message.
  • If I specify a file in examples for a lookup, we are treating that as a one-line lookup. Would there be an easy way of warning the user that the string will not be interpreted as a file path?

I think there will be opportunities in the future to see where users are getting stuck, and how we can help them by adding more warnings here and there. Other than that, the parser looks ready to be merged!

rasa/nlu/training_data/formats/rasa_yaml.py Show resolved Hide resolved
rasa/nlu/training_data/formats/rasa_yaml.py Outdated Show resolved Hide resolved
example.get(KEY_INTENT_TEXT, "")
for example in examples
if example
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't specify text for one of the entries, this defaults to "". Maybe we could show a warning saying that text needs to be specified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point!

Alexander Khizov added 2 commits July 6, 2020 11:45
- Created new YAML parser
- Extracted common parts for YAML and MD to the separate classes
- Added unit tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants