Support for writing back NLU intent/example metadata to YAML #7731

Closed
3 tasks done
chdorner opened this issue Jan 15, 2021 · 3 comments
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR

Comments

@chdorner
Contributor

chdorner commented Jan 15, 2021

Description of Problem:
Rasa 2.0 introduced support for metadata on NLU intents and examples (reference). So far, however, only the RasaYAMLReader can parse it; the RasaYAMLWriter cannot write it back to YAML files.

This came out of https://github.com/RasaHQ/rasa-x/issues/4180.

Overview of the Solution:
Support for intent and example metadata needs to be added to RasaYAMLWriter.process_training_examples_by_key (src).

Considering this YAML structure:

version: "2.0"
nlu:
- intent: greet
  metadata:
    sentiment: neutral
  examples:
    - text: |
        hi
      metadata:
        capitalization: lazy
    - text: |
        Hi
      metadata:
        capitalization: correct

The parser returns:

# ...
[{'text': 'hi',
  'intent': 'greet',
  'metadata': {'intent': {'sentiment': 'neutral'},
               'example': {'capitalization': 'lazy'}}},
 {'text': 'Hi',
  'intent': 'greet',
  'metadata': {'intent': {'sentiment': 'neutral'},
               'example': {'capitalization': 'correct'}}}]
# ...

Rendering the example metadata (the dict with "capitalization") should be fairly straightforward, with few questions to settle upfront.
The intent metadata (the dict with "sentiment"), however, is duplicated on each example, which raises a few questions.

  • Should the writer take the intent metadata from a single example (the first or the last)?
  • Or should the writer collect the intent metadata of all examples and merge it (shallow or deep) before writing it?
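
To make the shallow/deep distinction in the second question concrete, here is a minimal sketch in plain Python. It is independent of Rasa; the helper names `shallow_merge` and `deep_merge` and the `tags` key are hypothetical and only serve as illustration.

```python
from copy import deepcopy
from typing import Any, Dict


def shallow_merge(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Any]:
    """Top-level keys from b replace the same keys in a wholesale."""
    return {**a, **b}


def deep_merge(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Any]:
    """Nested dicts are merged key by key; for any other value, b wins."""
    merged = deepcopy(a)
    for key, value in b.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical intent metadata found on two examples of the same intent.
first = {"sentiment": "neutral", "tags": {"source": "chat"}}
second = {"tags": {"reviewed": True}}

print(shallow_merge(first, second))
# {'sentiment': 'neutral', 'tags': {'reviewed': True}}
print(deep_merge(first, second))
# {'sentiment': 'neutral', 'tags': {'source': 'chat', 'reviewed': True}}
```

A shallow merge silently drops `tags.source`, while a deep merge keeps it, so the choice matters as soon as intent metadata is nested.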

Examples (if relevant):

>>> from rasa.shared.nlu.training_data.message import Message
>>> from rasa.shared.nlu.training_data.training_data import TrainingData
>>> a = Message.build(text="hello", intent="greet", example_metadata={"paraphrases": ["hey", "hi"]})
>>> a.data
{'text': 'hello', 'intent': 'greet', 'metadata': {'example': {'paraphrases': ['hey', 'hi']}}}
>>> a.as_dict()
{'text': 'hello', 'intent': 'greet', 'metadata': {'example': {'paraphrases': ['hey', 'hi']}}}
>>> td = TrainingData([a])
>>> td.nlu_as_yaml()
'version: "2.0"\nnlu:\n- intent: greet\n  examples: |\n    - hello\n'

(hat tip to @dakshvar22 for the code)

Blockers (if relevant):

Definition of Done:

  • Discuss and agree on how to handle intent metadata
  • Tests are added
  • Feature mentioned in the changelog
@chdorner chdorner added type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR area:rasa-oss 🎡 Anything related to the open source Rasa framework labels Jan 15, 2021
@chdorner chdorner self-assigned this Jan 15, 2021
@chdorner
Contributor Author

chdorner commented Jan 18, 2021

Open Questions

Rendering intent metadata

We don't have a Python-object representation for the intent itself when parsing an NLU file, so we attach the intent-level metadata to each example of that intent:

from rasa.shared.nlu.training_data.formats.rasa_yaml import RasaYAMLReader

yaml_string = """version: "2.0"
nlu:
- intent: greet
  metadata:
    sentiment: neutral
  examples: |
    - hi
    - hello
"""

training_data = RasaYAMLReader().reads(yaml_string)

training_data.training_examples[0].as_dict()
# {'text': 'hi',
#  'intent': 'greet',
#  'metadata': {'intent': {'sentiment': 'neutral'}}}

training_data.training_examples[1].as_dict()
# {'text': 'hello',
#  'intent': 'greet',
#  'metadata': {'intent': {'sentiment': 'neutral'}}}

This raises the question of how the RasaYAMLWriter should collect the intent metadata from the examples and render it in YAML.

We can:
a) trust that the training example representation in Python objects follows the rules of the NLU file format and that for each example of a given intent the intent metadata is exactly the same, thus allowing us to just grab the intent metadata from one (first? last?) example.

b) be a bit more defensive and try to collect all intent metadata from the examples of a given intent and try to merge them together (shallow / deep merge?).

Update: The RasaYAMLWriter can assume that all intent metadata from the examples belonging to the same intent are identical, thus it's fine just to take the first one.
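
A minimal sketch of option a), operating on the plain dicts returned by `as_dict()` rather than on Rasa objects. The function name `intent_metadata_by_intent` is hypothetical and is not part of the RasaYAMLWriter API. The warning on mismatching metadata is an optional defensive extra, not something the update above requires.

```python
import warnings
from typing import Any, Dict, List


def intent_metadata_by_intent(examples: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the intent metadata of the first example seen for each intent."""
    result: Dict[str, Any] = {}
    for example in examples:
        intent = example.get("intent")
        metadata = example.get("metadata", {})
        if intent is None or "intent" not in metadata:
            continue
        if intent not in result:
            # First example of this intent: its metadata is the one we keep.
            result[intent] = metadata["intent"]
        elif result[intent] != metadata["intent"]:
            # Later examples are assumed identical; flag it if they are not.
            warnings.warn(f"Intent metadata differs between examples of '{intent}'.")
    return result


examples = [
    {"text": "hi", "intent": "greet",
     "metadata": {"intent": {"sentiment": "neutral"}}},
    {"text": "hello", "intent": "greet",
     "metadata": {"intent": {"sentiment": "neutral"}}},
]

print(intent_metadata_by_intent(examples))
# {'greet': {'sentiment': 'neutral'}}
```

The comparison uses `!=` on the metadata values, so it works for any YAML-compatible type (maps, lists, strings, numbers).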

Data type of metadata

The docs currently say that:

the metadata key can contain arbitrary key-value data.

There is however one test case in the code which has a list of strings as the value of the "metadata" key.

Which one of the two is the truth? Only allowing key-value objects (i.e. Python dicts) would simplify the implementation in the RasaYAMLWriter significantly.

Update: The metadata can be any data type that is supported by YAML including maps, lists, strings, numbers, etc.

Preserving the YAML structure for examples without metadata

The YAML structure differs depending on whether the examples carry metadata. If at least one example of an intent has metadata, the structure looks like this (example 1):

version: "2.0"
nlu:
- intent: greet
  examples:
    - text: |
        hi
      metadata:
        sentiment: neutral
    - text: |
        hello
# ...

If we don't have any metadata on the examples, then we can use a less verbose YAML structure (example 2):

version: "2.0"
nlu:
- intent: greet
  examples: |
    - hi
    - hello

So far the RasaYAMLWriter only supports the less verbose YAML structure. Do we need to preserve this behavior, or can the writer always write the verbose version from now on?
In other words, given example 2 as the input, is it okay if the RasaYAMLWriter always writes it as:

version: "2.0"
nlu:
- intent: greet
  examples:
    - text: |
        hi
    - text: |
        hello

Update: The YAML output should be identical to the input.
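
One way to read this update is: use the compact block form when no example of an intent has metadata, and the verbose form otherwise. The sketch below illustrates that rule on plain dicts. The function name `render_examples` is hypothetical, and the returned value is only the Python structure that a YAML dumper would then serialize, not the writer's actual implementation.

```python
from typing import Any, Dict, List, Union


def render_examples(
    examples: List[Dict[str, Any]],
) -> Union[str, List[Dict[str, Any]]]:
    """Pick the compact or verbose `examples` value for one intent."""
    has_example_metadata = any(
        "example" in example.get("metadata", {}) for example in examples
    )
    if not has_example_metadata:
        # Compact form: one block string with a "- text" line per example.
        return "".join(f"- {example['text']}\n" for example in examples)

    # Verbose form: one mapping per example, metadata only where present.
    rendered = []
    for example in examples:
        item: Dict[str, Any] = {"text": example["text"]}
        metadata = example.get("metadata", {})
        if "example" in metadata:
            item["metadata"] = metadata["example"]
        rendered.append(item)
    return rendered


plain = [{"text": "hi", "intent": "greet"}, {"text": "hello", "intent": "greet"}]
mixed = [
    {"text": "hi", "intent": "greet",
     "metadata": {"example": {"sentiment": "neutral"}}},
    {"text": "hello", "intent": "greet"},
]

print(repr(render_examples(plain)))
# '- hi\n- hello\n'
print(render_examples(mixed))
# [{'text': 'hi', 'metadata': {'sentiment': 'neutral'}}, {'text': 'hello'}]
```

Intent-level metadata alone does not trigger the verbose form here, since it is rendered on the intent rather than on the examples.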

@m-vdb
Collaborator

m-vdb commented Jan 19, 2021

As I followed the initial implementation by @degiz, let me share a few thoughts:

  • Rendering intent metadata: since Rasa Open Source is responsible for loading, manipulating, and dumping intent metadata, I think that a) is more sensible. I'd be a bit more defensive and include a warning in case the intent metadata differs on one or more examples (which would maybe be a bug in our code?). Implementing b) sounds like overkill (what's the use case here?)
  • Data type of metadata: I'd follow the public API and the doc. I think that the test was written at an early stage of implementation. While allowing any kind of metadata sounds appealing, I think it would reduce our ability to manipulate it / combine it, etc... hence reducing value for users.
  • Preserving the YAML structure for examples without metadata: I think we need to focus on the user experience here, and simplify the "look" of the training data as long as it's manageable on our end. Not all examples for all users will have metadata. So I'd respect what's in the documentation and go for both implementations.

(also cc'ing you @tmbo in case you miss reasoning about training data format 😅 )

@chdorner
Contributor Author

Summary from a call w/ @degiz today:

  • Rendering intent metadata: The RasaYAMLWriter can assume that all intent metadata from the examples belonging to the same intent are identical, thus it's fine just to take the first one.
  • Metadata can be any data type that is supported by YAML including maps, lists, strings, numbers, etc.
  • Preserving the YAML structure for examples without metadata: The YAML output should be identical to the input.
