New core featurization #6296

evgeniiaraz · 2020-07-29T13:29:18Z

Proposed changes:

Modified core states -- each state changes from a flat dictionary to a dictionary of dictionaries (e.g., {intent_greet: 1.0, prev_utter_greet: 1.0} --> {USER: {intent: greet}, PREVIOUS_ACTION: {action_name: utter_greet}})
Moved rasa.core.featurizers into a separate package
Core State Featurization:
-- SingleStateFeaturizer uses NLU Interpreter (if provided) to featurize ACTION_NAME, ACTION_TEXT, INTENT, TEXT
-- Featurization of each attribute is stored in a Features object;
-- SLOTS, ENTITIES, INTENT and ACTION_NAME are sparse;
-- LabelTokenizerSingleStateFeaturizer is deprecated;
Moved Features to rasa.utils.features (from rasa.nlu.featurizers.featurizer)
RasaModelData -- changed to store, add and pad data in dict[dict] form
RasaModel -- changed to accommodate the dict[dict] RasaModelData
Created TransformerRasaModel -- a RasaModel with methods shared by DIET and TED; it's an abstract class containing helper methods for transformer sequence models.
TED -- changed to process sparse features and accept both text and name features
TED -- instead of recomputing label embeddings, gather them from all_labels_embed by index
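
To make the state change in the first item concrete, here is a minimal sketch that regroups an old flat state into the nested form, using only the example given above. The helper name `nest_flat_state` is hypothetical and not part of this PR, the string values of `USER` and `PREVIOUS_ACTION` are placeholders (the real constants in Rasa may hold different values), and slots and entities are left out.

```python
from typing import Dict

FlatState = Dict[str, float]
NestedState = Dict[str, Dict[str, str]]

# Placeholder keys copied from the PR description; the actual constant
# values used inside Rasa may differ.
USER = "USER"
PREVIOUS_ACTION = "PREVIOUS_ACTION"


def nest_flat_state(flat_state: FlatState) -> NestedState:
    """Hypothetical helper: regroup old-style flat keys into sub-states."""
    nested: NestedState = {}
    for key, value in flat_state.items():
        if not value:
            continue  # a 0.0 entry means the feature is inactive
        if key.startswith("intent_"):
            nested.setdefault(USER, {})["intent"] = key[len("intent_"):]
        elif key.startswith("prev_"):
            nested.setdefault(PREVIOUS_ACTION, {})["action_name"] = key[len("prev_"):]
    return nested


old_state = {"intent_greet": 1.0, "prev_utter_greet": 1.0}
new_state = nest_flat_state(old_state)
print(new_state)
# {'USER': {'intent': 'greet'}, 'PREVIOUS_ACTION': {'action_name': 'utter_greet'}}
```

Grouping by sub-state lets each attribute (intent, action name, and so on) be featurized separately, which is what the SingleStateFeaturizer items in the list rely on.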

Status (please check what you already did):

added some tests for the functionality
updated the documentation
updated the changelog (please check changelog for instructions)
reformat files using black (please check Readme for instructions)

rasa/core/domain.py

rasa/core/events/__init__.py

rasa/core/featurizers.py

rasa/core/policies/ensemble.py

rasa/core/policies/ted_policy.py

rasa/core/training/generator.py

Ghostvv · 2020-07-31T11:31:34Z

I think UserUttered.as_story_string() should be updated

changelog/6296.removal.md

wochinge · 2020-09-04T14:07:44Z

Training breaks for me with the rasa init project:

Traceback (most recent call last):
  File "/Users/tobias/.pyenv/versions/rasa2.0/bin/rasa", line 11, in <module>
    load_entry_point('rasa', 'console_scripts', 'rasa')()
  File "/Users/tobias/Workspace/stack/rasa/__main__.py", line 109, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/Users/tobias/Workspace/stack/rasa/cli/train.py", line 77, in train
    nlu_additional_arguments=extract_nlu_additional_arguments(args),
  File "/Users/tobias/Workspace/stack/rasa/train.py", line 53, in train
    nlu_additional_arguments=nlu_additional_arguments,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/Users/tobias/Workspace/stack/rasa/train.py", line 109, in train_async
    nlu_additional_arguments=nlu_additional_arguments,
  File "/Users/tobias/Workspace/stack/rasa/train.py", line 202, in _train_async_internal
    old_model_zip_path=old_model,
  File "/Users/tobias/Workspace/stack/rasa/train.py", line 258, in _do_training
    or _interpreter_from_previous_model(old_model_zip_path),
  File "/Users/tobias/Workspace/stack/rasa/train.py", line 404, in _train_core_with_validated_data
    interpreter=interpreter,
  File "/Users/tobias/Workspace/stack/rasa/core/train.py", line 66, in train
    agent.train(training_data, **additional_arguments)
  File "/Users/tobias/Workspace/stack/rasa/core/agent.py", line 718, in train
    training_trackers, self.domain, interpreter=self.interpreter, **kwargs
  File "/Users/tobias/Workspace/stack/rasa/core/policies/ensemble.py", line 203, in train
    trackers_to_train, domain, interpreter=interpreter, **kwargs
  File "/Users/tobias/Workspace/stack/rasa/core/policies/ted_policy.py", line 323, in train
    training_trackers, domain, interpreter, **kwargs
  File "/Users/tobias/Workspace/stack/rasa/core/policies/policy.py", line 151, in featurize_for_training
    training_trackers, domain, interpreter
  File "/Users/tobias/Workspace/stack/rasa/core/featurizers/tracker_featurizers.py", line 125, in featurize_trackers
    tracker_state_features = self._featurize_states(trackers_as_states, interpreter)
  File "/Users/tobias/Workspace/stack/rasa/core/featurizers/tracker_featurizers.py", line 58, in _featurize_states
    for tracker_states in trackers_as_states
  File "/Users/tobias/Workspace/stack/rasa/core/featurizers/tracker_featurizers.py", line 58, in <listcomp>
    for tracker_states in trackers_as_states
  File "/Users/tobias/Workspace/stack/rasa/core/featurizers/tracker_featurizers.py", line 56, in <listcomp>
    for state in tracker_states
  File "/Users/tobias/Workspace/stack/rasa/core/featurizers/single_state_featurizer.py", line 196, in encode_state
    self._extract_state_features(sub_state, interpreter, sparse=True)
  File "/Users/tobias/Workspace/stack/rasa/core/featurizers/single_state_featurizer.py", line 164, in _extract_state_features
    parsed_message = interpreter.featurize_message(message)
  File "/Users/tobias/Workspace/stack/rasa/core/interpreter.py", line 295, in featurize_message
    result = self.interpreter.featurize_message(message)
  File "/Users/tobias/Workspace/stack/rasa/nlu/model.py", line 416, in featurize_message
    component.process(message, **self.context)
  File "/Users/tobias/Workspace/stack/rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py", line 549, in process
    attribute, [message_tokens]
  File "/Users/tobias/Workspace/stack/rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurizer.py", line 429, in _create_features
    seq_vec = self.vectorizers[attribute].transform(tokens)
  File "/Users/tobias/.pyenv/versions/3.7.8/envs/rasa2.0/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1247, in transform
    self._check_vocabulary()
  File "/Users/tobias/.pyenv/versions/3.7.8/envs/rasa2.0/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 467, in _check_vocabulary
    raise NotFittedError("Vocabulary not fitted or provided")
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided

evgeniiaraz · 2020-09-04T14:16:12Z

@wochinge this is because we haven't reverted the reading of action_names yet, right?

Ghostvv · 2020-09-04T14:37:20Z

seems like the right error, CVF (CountVectorsFeaturizer) was not trained for action_name
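
A toy sketch of that failure mode, assuming (as the `self.vectorizers[attribute]` line in the traceback suggests) one vectorizer per attribute. `ToyVectorizer` is a stdlib-only stand-in written for illustration; it is not the CountVectorsFeaturizer or scikit-learn code, and it raises a plain `RuntimeError` where scikit-learn raises `NotFittedError`.

```python
from typing import Dict, List, Optional


class ToyVectorizer:
    """Illustrative stand-in: a vocabulary exists only after fit()."""

    def __init__(self) -> None:
        self.vocabulary: Optional[Dict[str, int]] = None

    def fit(self, tokens: List[str]) -> None:
        # Assign each distinct token a column index.
        self.vocabulary = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

    def transform(self, tokens: List[str]) -> List[int]:
        if self.vocabulary is None:
            raise RuntimeError("Vocabulary not fitted or provided")
        counts = [0] * len(self.vocabulary)
        for tok in tokens:
            if tok in self.vocabulary:
                counts[self.vocabulary[tok]] += 1
        return counts


# One vectorizer per attribute; only "text" sees training data here.
vectorizers = {"text": ToyVectorizer(), "action_name": ToyVectorizer()}
vectorizers["text"].fit(["hello", "there"])

text_counts = vectorizers["text"].transform(["hello", "hello"])

try:
    vectorizers["action_name"].transform(["utter_greet"])
    error_message = None
except RuntimeError as err:
    # Same situation as in the traceback: featurizing an attribute
    # whose vectorizer never received training examples.
    error_message = str(err)
```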

wochinge · 2020-09-04T14:38:36Z

I updated my PR with the latest state of this PR. feel free to merge once the tests pass. Sorry, completely forgot about the PR.

wochinge · 2020-09-04T14:39:38Z

Btw, model size and training time seem comparable 👍

Ghostvv · 2020-09-04T14:51:57Z

we didn't roll out the big TED yet, so the difference model-wise is the addition of sparse-to-dense layers

… into new_core_featurization

Ghostvv · 2020-09-05T09:50:32Z

@wochinge could it be that test_train.py fails because of that (rasa train doesn't work, due to importer)?

Revert "remove end-to-end reading yaml"

Ghostvv · 2020-09-07T09:17:56Z

the failing windows test seems to be unrelated to this PR, so I'll merge it

evgeniiaraz force-pushed the new_core_featurization branch from b0c8aab to ae8de05 · July 30, 2020 19:24

evgeniiaraz force-pushed the tokenizers_process_action_text branch from 0557b2e to a25fca3 · July 30, 2020 21:26

evgeniiaraz force-pushed the new_core_featurization branch from ae8de05 to b18a75a · July 31, 2020 10:42