
Use tokens for story structure validation #7436

Merged: 30 commits, Dec 14, 2020

Changes from 8 commits

Commits (30)
d6ab8cf
Add tests
Nov 27, 2020
a72b5fd
Draft first implementation
Dec 2, 2020
d1280a3
Fix random sorting before hash
Dec 3, 2020
71d7638
Update doc strings
Dec 3, 2020
c0e5d98
Merge branch 'e2e' into johannes-res114
Dec 3, 2020
6ffc60f
Add doc strings
Dec 3, 2020
8b223cb
Merge branch 'johannes-res114' of github.com:RasaHQ/rasa into johanne…
Dec 3, 2020
292865b
Merge branch 'e2e' into johannes-res114
Dec 4, 2020
3e32e4f
Merge branch 'e2e' into johannes-res114
Dec 9, 2020
56f8d9c
Fix minor issues
Dec 9, 2020
bccfc2d
Add config file loading
Dec 9, 2020
9e6d850
Fix some docstrings
Dec 9, 2020
87b429f
Make test stories part of the test
Dec 9, 2020
d8cace6
Update tests
Dec 9, 2020
b10b82e
Merge remote-tracking branch 'github/e2e' into johannes-res114
Dec 9, 2020
34c3b60
Merge branch 'e2e' into johannes-res114
Dec 9, 2020
0f98d40
Merge branch 'e2e' into johannes-res114
Dec 14, 2020
166ec5d
Fix minor issues
Dec 14, 2020
3b18c16
Fix config param argument
Dec 14, 2020
f41fe53
Add TrainingType to tests to avoid config change
Dec 14, 2020
3d3db9e
Delete hash again
Dec 14, 2020
4bdb830
Update docs
Dec 14, 2020
cfd899f
Merge branch 'e2e' into johannes-res114
Dec 14, 2020
1a3dc1f
Merge branch 'e2e' into johannes-res114
Dec 14, 2020
4a180cf
Merge branch 'e2e' into johannes-res114
Dec 14, 2020
58ae7f2
Merge branch 'e2e' into johannes-res114
Dec 14, 2020
09ab28b
Merge branch 'e2e' into johannes-res114
Dec 14, 2020
6af7577
Update rasa/core/training/story_conflict.py
Dec 14, 2020
b79c476
Update rasa/shared/nlu/training_data/features.py
Dec 14, 2020
cd16176
Fix minor issues
Dec 14, 2020
4 changes: 4 additions & 0 deletions changelog/7436.improvement.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
Make `rasa data validate stories` work for end-to-end.

The `rasa data validate stories` function now considers the tokenized user text instead of the plain text that is part of a state.
This is closer to what Rasa Core actually uses to distinguish states and thus captures more story structure problems.
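The changelog entry above can be illustrated with a rough sketch. Here `tokenize` is a hypothetical stand-in for Rasa's `WhitespaceTokenizer`, not the actual implementation: two user texts that differ only in whitespace are distinct as plain strings, but collide once reduced to their tokens, so the token-based comparison reports the conflict.

```python
import re

def tokenize(text: str) -> list:
    """Whitespace tokenization, roughly what Rasa's WhitespaceTokenizer does."""
    return re.findall(r"\S+", text)

def tokenized_key(text: str) -> str:
    # Concatenate tokens, so texts that differ only in whitespace collide.
    return "".join(tokenize(text))

a = "truly amazing"
b = "truly   amazing"  # extra spaces

assert a != b                                # plain-text comparison: no conflict found
assert tokenized_key(a) == tokenized_key(b)  # token-based comparison: conflict detected
```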
9 changes: 9 additions & 0 deletions data/test_config/config_defaults.yml
Expand Up @@ -21,3 +21,12 @@ pipeline: []
# ambiguity_threshold: 0.1

data:
policies:
# # No configuration for policies was provided. The following default policies were used to train your model.
# # If you'd like to customize them, uncomment and adjust the policies.
# # See https://rasa.com/docs/rasa/policies for more information.
# - name: MemoizationPolicy
# - name: TEDPolicy
# max_history: 5
# epochs: 100
# - name: RulePolicy
13 changes: 13 additions & 0 deletions data/test_stories/stories_e2e_conflicting_1.yml
@@ -0,0 +1,13 @@
version: "2.0"

stories:
- story: path 1
steps:
- user: |
amazing!
- action: utter_happy
- story: path 2 (should always conflict path 1)
steps:
- user: |
amazing!
- action: utter_cheer_up
13 changes: 13 additions & 0 deletions data/test_stories/stories_e2e_conflicting_2.yml
@@ -0,0 +1,13 @@
version: "2.0"

stories:
- story: path 1
steps:
- user: |
truly amazing
- action: utter_greet
- story: path 2 (should conflict path 1 when WhitespaceTokenizer is used)
steps:
- user: |
truly amazing
- action: utter_goodbye
4 changes: 4 additions & 0 deletions docs/docs/command-line-interface.mdx
Expand Up @@ -338,6 +338,10 @@ rasa data validate stories
:::note
Running `rasa data validate` does **not** test if your [rules](./rules.mdx) are consistent with your stories.
However, during training, the `RulePolicy` checks for conflicts between rules and stories. Any such conflict will abort training.

Furthermore, if you use end-to-end stories, then this might not capture all conflicts. Specifically, if two user inputs
result in different tokens yet exactly the same featurization, then conflicting actions after these inputs
may exist but will not be reported by the tool.
:::

To interrupt validation even for minor issues such as unused intents or responses, use the `--fail-on-warnings` flag.
Expand Down
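The limitation described in the note above can be sketched in a few lines. `VOCABULARY` and `bag_of_words` here are hypothetical stand-ins for a trained featurizer, not Rasa's actual featurization: two user inputs whose tokens are all out of vocabulary produce identical (all-zero) feature vectors, so a conflict between the actions that follow them cannot be detected from the features.

```python
VOCABULARY = {"hello", "goodbye", "thanks"}

def bag_of_words(tokens):
    """Count-vector featurization over a fixed vocabulary; OOV tokens contribute nothing."""
    return tuple(sum(1 for t in tokens if t == word) for word in sorted(VOCABULARY))

# Two different user inputs whose tokens are all out of vocabulary:
features_a = bag_of_words(["flibber"])
features_b = bag_of_words(["wobble"])

# Identical featurization, so any conflicting follow-up actions are invisible:
assert features_a == features_b == (0, 0, 0)
```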
4 changes: 4 additions & 0 deletions docs/docs/setting-up-ci-cd.mdx
Expand Up @@ -59,6 +59,10 @@ always good to run this check before training a model. By including the
:::note
Running `rasa data validate` does **not** test if your [rules](./rules.mdx) are consistent with your stories.
However, during training, the `RulePolicy` checks for conflicts between rules and stories. Any such conflict will abort training.

Furthermore, if you use end-to-end stories, then this might not capture all conflicts. Specifically, if two user inputs
result in different tokens yet exactly the same featurization, then conflicting actions after these inputs
may exist but will not be reported by the tool.
:::

To read more about the validator and all of the available options, see [the documentation for
Expand Down
6 changes: 4 additions & 2 deletions rasa/core/featurizers/single_state_featurizer.py
Expand Up @@ -292,11 +292,13 @@ def encode_entities(
# we cannot build a classifier if there are fewer than 2 classes
return {}

parsed_text = interpreter.featurize_message(Message({TEXT: entity_data[TEXT]}))
parsed_text = (
interpreter.featurize_message(Message({TEXT: entity_data[TEXT]})) or {}
)
Contributor: I'd rather do afterwards

if parsed_text is None:
    return {}

Contributor: why?

Contributor Author: I think returning {} instead of {ENTITY_TAGS: Features(...)} with zero-features saves some bandwidth and computation time.

Contributor Author: I've made that change as vova suggested.

entities = entity_data.get(ENTITIES, [])

_tags = []
for token in parsed_text.get(TOKENS_NAMES[TEXT]):
for token in parsed_text.get(TOKENS_NAMES[TEXT], []):
_tag = determine_token_labels(
token, entities, attribute_key=ENTITY_ATTRIBUTE_TYPE
)
Expand Down
132 changes: 120 additions & 12 deletions rasa/core/training/story_conflict.py
@@ -1,6 +1,6 @@
from collections import defaultdict
import logging
from typing import Dict, Generator, List, NamedTuple, Optional, Text, Tuple
from typing import Dict, Generator, List, NamedTuple, Optional, Text, Tuple, Any

from rasa.core.featurizers.tracker_featurizers import MaxHistoryTrackerFeaturizer
from rasa.shared.core.constants import ACTION_LISTEN_NAME, PREVIOUS_ACTION, USER
Expand All @@ -9,6 +9,14 @@
from rasa.shared.core.generator import TrackerWithCachedStates
from rasa.shared.nlu.constants import INTENT

from rasa.nlu.model import Trainer
from rasa.nlu.components import Component
from rasa.nlu.tokenizers.tokenizer import Tokenizer
from rasa.nlu.config import RasaNLUModelConfig
from rasa.shared.nlu.constants import TEXT
from rasa.shared.nlu.training_data.message import Message
from rasa.shared.utils.io import raise_warning
Contributor: please import the module instead of the function directly.

Contributor Author: Damn! I always overlook those! Sorry :/


logger = logging.getLogger(__name__)


Expand Down Expand Up @@ -124,7 +132,32 @@ class TrackerEventStateTuple(NamedTuple):

@property
def sliced_states_hash(self) -> int:
return hash(str(list(self.sliced_states)))
"""Returns the hash of the sliced states."""
return hash(_as_sorted_text(self.sliced_states))


def _as_sorted_text(obj: Any) -> Text:
"""Returns the string of `obj` after sorting lists and dicts.

Args:
obj: Something made up of lists and dicts and stringifiable objects.

Returns:
A string representation of the object that doesn't change
randomly due to unsorted dicts or sets.
"""
if isinstance(obj, str):
return obj
elif isinstance(obj, dict):
return str(
[
(_as_sorted_text(key), _as_sorted_text(value))
for key, value in sorted(obj.items())
]
)
elif isinstance(obj, (list, set)):
return str(sorted([_as_sorted_text(element) for element in obj]))
return str(obj)


def _get_length_of_longest_story(
Expand All @@ -146,56 +179,100 @@ def find_story_conflicts(
trackers: List[TrackerWithCachedStates],
domain: Domain,
max_history: Optional[int] = None,
nlu_config: Optional[RasaNLUModelConfig] = None,
) -> List[StoryConflict]:
"""Generates `StoryConflict` objects, describing conflicts in the given trackers.

Args:
trackers: Trackers in which to search for conflicts.
domain: The domain.
max_history: The maximum history length to be taken into account.
nlu_config: NLU config.

Returns:
StoryConflict objects.
"""
if not max_history:
max_history = _get_length_of_longest_story(trackers, domain)
if max_history:
logger.info(
f"Considering the preceding {max_history} turns for conflict analysis."
)
else:
logger.info("Considering all preceding turns for conflict analysis.")

logger.info(f"Considering the preceding {max_history} turns for conflict analysis.")
tokenizing_function = _get_tokenizing_function_from_nlu_config(nlu_config)

# We do this in two steps, to reduce memory consumption:

# Create a 'state -> list of actions' dict, where the state is
# represented by its hash
conflicting_state_action_mapping = _find_conflicting_states(
trackers, domain, max_history
trackers, domain, max_history, tokenizing_function
)

# Iterate once more over all states and note the (unhashed) state,
# for which a conflict occurs
conflicts = _build_conflicts_from_states(
trackers, domain, max_history, conflicting_state_action_mapping
trackers,
domain,
max_history,
conflicting_state_action_mapping,
tokenizing_function,
)

return conflicts


def _get_tokenizing_function_from_nlu_config(
nlu_config: Optional[RasaNLUModelConfig] = None,
) -> Optional[callable]:
"""Extracts the `tokenize` function of the first Tokenizer in the pipeline.

Args:
nlu_config: NLU Config.
"""
if not nlu_config:
return None

pipeline: List[Component] = Trainer(
nlu_config, skip_validation=True
).pipeline # ToDo: ComponentBuilder?
Contributor Author: I am not sure if / how I should provide some ComponentBuilder here.

Contributor Author: @wochinge Thanks for the review! Will implement changes tomorrow morning. What about this? Should I somehow add the ComponentBuilder here? It doesn't seem to do anything and I don't know where to get it.

Contributor: No, it's optional and the default will be created automatically. Seems fine for me.
tokenizer: Optional[Tokenizer] = None
for component in pipeline:
if isinstance(component, Tokenizer) and tokenizer:
raise_warning(
"The pipeline contains more than one tokenizer. "
"Only the first tokenizer will be used for story validation.",
category=UserWarning,
)
elif isinstance(component, Tokenizer):
tokenizer = component

return tokenizer.tokenize if tokenizer else None
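The selection logic above (take the first `Tokenizer`, warn about extras) can be sketched without the Rasa `Trainer` machinery. `Tokenizer` and `first_tokenizer` below are simplified stand-ins, not the actual Rasa classes:

```python
import warnings
from typing import Callable, List, Optional

class Tokenizer:
    """Stand-in for rasa.nlu.tokenizers.tokenizer.Tokenizer."""
    def tokenize(self, message: dict, attribute: str) -> list:
        return message.get(attribute, "").split()

def first_tokenizer(pipeline: List[object]) -> Optional[Callable]:
    """Return the `tokenize` method of the first Tokenizer, warning on duplicates."""
    tokenizer = None
    for component in pipeline:
        if isinstance(component, Tokenizer):
            if tokenizer is not None:
                warnings.warn(
                    "The pipeline contains more than one tokenizer. "
                    "Only the first tokenizer will be used for story validation."
                )
            else:
                tokenizer = component
    return tokenizer.tokenize if tokenizer else None

assert first_tokenizer([]) is None                      # no tokenizer in the pipeline
assert first_tokenizer([object(), Tokenizer()]) is not None
```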


def _find_conflicting_states(
trackers: List[TrackerWithCachedStates], domain: Domain, max_history: int
trackers: List[TrackerWithCachedStates],
domain: Domain,
max_history: Optional[int],
tokenizing_function: Optional[callable],
) -> Dict[int, Optional[List[Text]]]:
"""Identifies all states from which different actions follow.

Args:
trackers: Trackers that contain the states.
domain: The domain object.
max_history: Number of turns to take into account for the state descriptions.
tokenizing_function: A `Tokenizer.tokenize` function.

Returns:
A dictionary mapping state-hashes to a list of actions that follow from each state.
"""
# Create a 'state -> list of actions' dict, where the state is
# represented by its hash
state_action_mapping = defaultdict(list)
for element in _sliced_states_iterator(trackers, domain, max_history):
for element in _sliced_states_iterator(
trackers, domain, max_history, tokenizing_function
):
hashed_state = element.sliced_states_hash
if element.event.as_story_string() not in state_action_mapping[hashed_state]:
state_action_mapping[hashed_state] += [element.event.as_story_string()]
Expand All @@ -211,8 +288,9 @@ def _find_conflicting_states(
def _build_conflicts_from_states(
trackers: List[TrackerWithCachedStates],
domain: Domain,
max_history: int,
max_history: Optional[int],
conflicting_state_action_mapping: Dict[int, Optional[List[Text]]],
tokenizing_function: Optional[callable],
) -> List["StoryConflict"]:
"""Builds a list of `StoryConflict` objects for each given conflict.

Expand All @@ -222,6 +300,7 @@ def _build_conflicts_from_states(
max_history: Number of turns to take into account for the state descriptions.
conflicting_state_action_mapping: A dictionary mapping state-hashes to a list of actions
that follow from each state.
tokenizing_function: A `Tokenizer.tokenize` function.

Returns:
A list of `StoryConflict` objects that describe inconsistencies in the story
Expand All @@ -230,7 +309,9 @@ def _build_conflicts_from_states(
# Iterate once more over all states and note the (unhashed) state,
# for which a conflict occurs
conflicts = {}
for element in _sliced_states_iterator(trackers, domain, max_history):
for element in _sliced_states_iterator(
trackers, domain, max_history, tokenizing_function
):
hashed_state = element.sliced_states_hash

if hashed_state in conflicting_state_action_mapping:
Expand All @@ -252,7 +333,10 @@ def _build_conflicts_from_states(


def _sliced_states_iterator(
trackers: List[TrackerWithCachedStates], domain: Domain, max_history: int
trackers: List[TrackerWithCachedStates],
domain: Domain,
max_history: Optional[int],
tokenizing_function: Optional[callable],
) -> Generator[TrackerEventStateTuple, None, None]:
"""Creates an iterator over sliced states.

Expand All @@ -263,6 +347,7 @@ def _sliced_states_iterator(
trackers: List of trackers.
domain: Domain (used for tracker.past_states).
max_history: Assumed `max_history` value for slicing.
tokenizing_function: A `Tokenizer.tokenize` function.

Yields:
A (tracker, event, sliced_states) triplet.
Expand All @@ -276,10 +361,33 @@ def _sliced_states_iterator(
sliced_states = MaxHistoryTrackerFeaturizer.slice_state_history(
states[: idx + 1], max_history
)
if tokenizing_function:
_apply_tokenizer_to_states(tokenizing_function, sliced_states)
# ToDo: deal with oov (different tokens can lead to identical features if some of those tokens are out of vocabulary for all featurizers)
yield TrackerEventStateTuple(tracker, event, sliced_states)
idx += 1


def _apply_tokenizer_to_states(
tokenizing_function: callable, states: List[State]
) -> None:
"""Split each user text into tokens and concatenate them again.

Args:
tokenizing_function: Should take a message and an attribute and return the tokens,
just like `Tokenizer.tokenize`.
states: The states to be tokenized.
"""
for state in states:
if USER in state:
state[USER][TEXT] = "".join(
token.text
for token in tokenizing_function(
Message({TEXT: state[USER][TEXT]}), TEXT
)
)
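A minimal sketch of the normalization performed by `_apply_tokenizer_to_states`, with `whitespace_tokens` as a hypothetical stand-in for `Tokenizer.tokenize` (the real function takes a `Message` and returns `Token` objects): each user text is split into tokens and concatenated again, so texts that differ only in whitespace map to the same state.

```python
USER, TEXT = "user", "text"

def whitespace_tokens(text: str) -> list:
    # Hypothetical stand-in for `Tokenizer.tokenize`: the words of the text.
    return text.split()

def apply_tokenizer_to_states(states: list) -> None:
    """Replace each user text with its tokens joined back together (in place)."""
    for state in states:
        if USER in state:
            state[USER][TEXT] = "".join(whitespace_tokens(state[USER][TEXT]))

states = [
    {USER: {TEXT: "truly   amazing"}},
    {"prev_action": {"action_name": "utter_greet"}},  # no user text: left untouched
]
apply_tokenizer_to_states(states)
assert states[0][USER][TEXT] == "trulyamazing"
```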


def _get_previous_event(
state: Optional[State],
) -> Tuple[Optional[Text], Optional[Text]]:
Expand Down
18 changes: 14 additions & 4 deletions rasa/shared/nlu/training_data/features.py
Expand Up @@ -7,7 +7,7 @@


class Features:
"""Stores the features produces by any featurizer."""
"""Stores the features produced by any featurizer."""

def __init__(
self,
Expand All @@ -16,6 +16,14 @@ def __init__(
attribute: Text,
origin: Union[Text, List[Text]],
) -> None:
"""Initializes the Features object.

Args:
features: The features.
feature_type: Type of the feature, e.g. FEATURE_TYPE_SENTENCE.
attribute: Message attribute, e.g. INTENT or TEXT.
origin: Name of the component that created the features.
"""
self.features = features
self.type = feature_type
self.origin = origin
Expand Down Expand Up @@ -83,10 +91,12 @@ def __key__(
) -> Tuple[
Text, Text, Union[np.ndarray, scipy.sparse.spmatrix], Union[Text, List[Text]]
]:
return (self.type, self.attribute, self.features, self.origin)
"""Returns a 4-tuple of defining properties.

def __hash__(self) -> int:
return hash(self.__key__())
Returns:
Tuple of type, attribute, features, and origin properties.
"""
return (self.type, self.attribute, self.features, self.origin)

def __eq__(self, other: Any) -> bool:
if not isinstance(other, Features):
Expand Down