
Pre-trained embeddings not used as feature for CRFEntityExtractor #8930

Closed
4 tasks done
liaeh opened this issue Jun 22, 2021 · 15 comments
Labels: area:rasa-oss 🎡, area:rasa-oss/ml/nlu-components, effort:atom-squad/2, type:bug 🐛

Comments

@liaeh

liaeh commented Jun 22, 2021

Rasa version: 2.7.1

Rasa SDK version (if used & relevant): 2.7.0

Rasa X version (if used & relevant):

Python version: 3.8.8

Operating system (windows, osx, ...): Windows-10-10.0.19041-SP0

Issue:
In the docs for CRFEntityExtractor component, it says:

If you want to pass custom features, such as pre-trained word embeddings, to CRFEntityExtractor, you can add any dense featurizer to the pipeline before the CRFEntityExtractor. CRFEntityExtractor automatically finds the additional dense features and checks if the dense features are an iterable of len(tokens), where each entry is a vector. A warning will be shown in case the check fails.

However, I get identical results when using different language models, or even no language model at all. I'm using Rasa NLU only for a simple entity extraction task. This leads me to think that the pre-trained embeddings are not getting passed on to the CRFEntityExtractor, even though LanguageModelFeaturizer generates dense features and no warning is shown indicating that the pre-trained embeddings are not passed.

For example, when training a CRFEntityExtractor using config1/2/3 on the same train data and testing also on the same test set, I get identical precision/recall/f1 results.
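For reference, the check the docs describe amounts to something like the following (a minimal sketch, not Rasa's actual code; the function name is made up for illustration):

```python
def dense_features_match_tokens(dense_features, tokens):
    """Mirror the documented check: the dense features must be an
    iterable of len(tokens), where each entry is a vector."""
    if len(dense_features) != len(tokens):
        return False  # this is the case that would trigger the warning
    return all(hasattr(vec, "__len__") for vec in dense_features)

tokens = ["book", "a", "table"]
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one vector per token
print(dense_features_match_tokens(embeddings, tokens))  # True
```

Since no warning is shown, the featurizer output apparently passes this check, which makes it all the more surprising that the results don't change.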

Error (including full traceback):

Command or request that led to error:

rasa train nlu -c config1 (or 2 or 3)
rasa test nlu -c config1  (or 2 or 3)

Content of configuration file (config.yml) (if relevant):

Config 1

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 2

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "distilbert"
    model_weights: "distilbert-base-uncased"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Content of domain file (domain.yml) (if relevant):

Content of train data
I am just using a few utterances from the SNIPS dataset. Here's a small example of my train data.

version: '2.0'

nlu:
- intent: General
  examples: |
    - find a [restaurant](restaurant_type) for [marylou and i](party_size_description) [within walking distance](spatial_relation) of [my mum s hotel](poi)
    - book a table at a [bar](restaurant_type) in [cambodia](country) that serves [cheese fries](served_dish)
    - i m in [bowling green](poi) please book a [restaurant](restaurant_type) for [1](party_size_number) [close by](spatial_relation)
    - book a [restaurant](restaurant_type) at a [steakhouse](restaurant_type) [around](spatial_relation) [in town](poi) that serves [empanada](served_dish) for [me and my son](party_size_description)
    - book me a table for [me and my nephew](party_size_description) [near](spatial_relation) [my location](poi) at an [indoor](facility) [pub](restaurant_type)
    - book a table for [me and belinda](party_size_description) serving [minestra](served_dish) in a [bar](restaurant_type)
    - i need seating for [ten](party_size_number) people at a [bar](restaurant_type) that serves [czech](cuisine) cuisine
    - book a spot for [connie earline and rose](party_size_description) at an [oyster bar](restaurant_type) that serves [chicken fried bacon](served_dish) in [beauregard](city) [delaware](state)
    - reserve a table for [two](party_size_number) at a [restaurant](restaurant_type) which serves [creole](cuisine) [around](spatial_relation) here in [myanmar](country)
    - take me a [top-rated](sort) [restaurant](restaurant_type) for [nine](party_size_number) [close](spatial_relation) to [westfield](city) [delaware](state)
    - book a [joint restaurant](restaurant_type) for [four](party_size_number) with an [outdoor](facility) [area within the same area](spatial_relation) as [borough de denali](poi)
    - make reservations for [7](party_size_number) people at a [top-rated](sort) [brazilian](cuisine) [pub](restaurant_type) [around](spatial_relation) [rockaway park-beach 116th](poi)
    - need to book a table [downtown](poi) [within walking distance](spatial_relation) of me at [j g melon](restaurant_name)

Definition of done

  • Determine if this is only a documentation issue by looking through 1.4, 1.5, and 2.x + asking the research team
  • If so, then we should update the docs and add warnings
  • Otherwise, create another issue for addressing this bug
  • Reviewed by @koernerfelicia
liaeh added the area:rasa-oss 🎡 and type:bug 🐛 labels on Jun 22, 2021
@sara-tagger
Collaborator

Thanks for raising this issue, @lty4 will get back to you about it soon✨

Please also check out the docs and the forum in case your issue was raised there too 🤗

@liaeh
Author

liaeh commented Jun 23, 2021

@koaning
Contributor

koaning commented Jun 25, 2021

I'm checking this out right now. My gut feeling is that the LanguageModelTokenizer is meant to handle the byte-pair tokenizer inside Hugging Face. You're using it as a tokenizer for Rasa, so I imagine that's where something goes awry.

Note that the LanguageModelTokenizer is also deprecated.

@koaning koaning self-assigned this Jun 25, 2021
@koaning
Contributor

koaning commented Jun 25, 2021

I might also ask, is there a reason why you weren't using DIET?

@koaning
Contributor

koaning commented Jun 28, 2021

Correction!

I was able to reproduce the issue. These two pipelines yield the same results.

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: CRFEntityExtractor
pipeline:
  - name: WhitespaceTokenizer
  - name: CRFEntityExtractor

Even the confidence values are the same (confirmed via rasa shell nlu).

@liaeh
Author

liaeh commented Jun 28, 2021


Great, glad you were able to reproduce it.

The reason I'm not using DIET is because I want to have a benchmark on how the NLU pipeline performs with/without finetuning a transformer model.

@koaning
Contributor

koaning commented Jun 28, 2021

In the meantime then; you can turn off the transformer layers inside of DIET. That way you can still get your measurement.
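A sketch of what such a pipeline could look like (parameter names per the Rasa 2.x DIETClassifier docs; `number_of_transformer_layers: 0` skips the transformer layers, and disabling intent classification restricts DIET to entity extraction — exact values are illustrative):

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: DIETClassifier
    number_of_transformer_layers: 0  # no transformer layers on top of the features
    intent_classification: False
    entity_recognition: True
```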

@koaning koaning removed their assignment Jul 12, 2021
TyDunn added the area:rasa-oss/ml/nlu-components label on Aug 5, 2021
@samsucik
Contributor

samsucik commented Aug 6, 2021

This looks like an investigation issue where the definition of done would involve producing a simple example (possibly just the one that @koaning used, once he shares it), identifying the root cause, and creating a followup issue to implement and test the fix.

@dakshvar22
Contributor

I am not completely sure, but it looks like a documentation issue. As far as I can remember, CRFEntityExtractor never used dense features (word embeddings) and always relied on syntactic features. I even cross-checked on the 1.10.x branch, and at a quick look it doesn't seem to use the word embeddings of tokens when training a model.
As I said, this is just speculation; we should check it thoroughly.

TyDunn added the effort:atom-squad/2 label on Aug 6, 2021
@tttthomasssss
Contributor

tttthomasssss commented Aug 18, 2021

Status of reproducing this issue across Rasa versions, starting from Rasa v1.4.0. I have used the full SNIPS data for reproduction. Note that the configs differ for older versions, as a lot of the functionality wasn't available earlier (Rasa 1.4.0 is from June 2019!). All done with Python 3.7.6.

  • identical performance in rasa v1.4.0 (using a slightly different config)
  • identical performance in rasa v1.5.3 (using a slightly different config)
  • identical performance in rasa v1.6.1 (using a slightly different config)
  • struggling to install rasa 1.7.x but I assume it won't deviate from the rest

  • identical performance in rasa v1.8.3 (using a minimally different config)
  • identical performance in rasa v1.9.7 (using a minimally different config)
  • identical performance in rasa v1.10.26 (using a minimally different config)
  • identical performance in rasa v2.0.8 (using a minimally different config)
  • identical performance in rasa v2.1.3 (using a minimally different config)
  • identical performance in rasa v2.2.10 (using a minimally different config)
  • identical performance in rasa v2.3.5 (using a minimally different config)
  • identical performance in rasa v2.4.3 (using a minimally different config)
  • identical performance in rasa v2.5.2 (using a minimally different config)
  • identical performance in rasa v2.6.3 (using a minimally different config)
  • identical performance in rasa v2.7.2 (using a minimally different config)
  • identical performance in rasa v2.8.3 (using a minimally different config)

Configs used for rasa 1.4.0, 1.5.3, 1.6.1:

  • config 1:

    language: "en"
    
    pipeline:
    - name: "SpacyNLP"
    - name: "SpacyTokenizer"
    - name: "SpacyFeaturizer"
    - name: "CRFEntityExtractor"
  • config 2:

    language: "en"
    
    pipeline:
    - name: "SpacyNLP"
    - name: "SpacyTokenizer"
    - name: "CRFEntityExtractor"

Configs used for rasa 1.8.3, 1.9.7, 1.10.26, 2.0.8, 2.1.3, 2.2.10, 2.3.5, 2.4.3, 2.5.2, 2.6.3, 2.7.2, 2.8.3:

  • config 1:

    language: en
    
    pipeline:
      - name: HFTransformersNLP
      - name: LanguageModelTokenizer
      - name: LexicalSyntacticFeaturizer
        "features": [
          # features for the word preceding the word being evaluated
          [ "suffix2", "prefix2" ],
          # features for the word being evaluated
          [ "BOS", "EOS" ],
          # features for the word following the word being evaluated
          [ "suffix2", "prefix2" ]]
      - name: CRFEntityExtractor
  • config 2:

    language: en
    
    pipeline:
      - name: HFTransformersNLP
      - name: LanguageModelTokenizer
      - name: LanguageModelFeaturizer
        model_name: "roberta"
        model_weights: "roberta-base"
      - name: LexicalSyntacticFeaturizer
        "features": [
          # features for the word preceding the word being evaluated
          [ "suffix2", "prefix2" ],
          # features for the word being evaluated
          [ "BOS", "EOS" ],
          # features for the word following the word being evaluated
          [ "suffix2", "prefix2" ]]
      - name: CRFEntityExtractor
    

@tttthomasssss
Contributor

Given the reproduction above, this looks like a docs issue. CRFEntityExtractor has indeed never used any dense word embeddings.

@tttthomasssss
Contributor

I had a closer look into the code for CRFEntityExtractor, and what's slightly odd is that the feature preprocessing does extract dense features from whatever embedding model you specify and adds them to the CRFToken, but then ignores them again when building the X_train matrix that's used for training the eventual sklearn_crfsuite CRF. @dakshvar22, do you know anything more about the use of dense features in the CRFEntityExtractor?

Given this is unused code, we should probably remove it? (@TyDunn might need an updated definition of done if we want to remove the unused code.)
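The pattern described above looks roughly like this (a heavily simplified sketch with made-up names, not Rasa's actual internals): the embedding rides along on the token, but only sparse features end up in the dict handed to sklearn_crfsuite.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CRFToken:
    """Simplified stand-in for Rasa's internal CRFToken."""
    text: str
    dense_features: List[float] = field(default_factory=list)

def token_to_crf_features(token: CRFToken) -> Dict[str, str]:
    # Only sparse features make it into the dict used to build X_train;
    # token.dense_features is carried but never copied in, so the CRF
    # never sees the embeddings.
    features = {"word.lower": token.text.lower()}
    return features

tok = CRFToken("Restaurant", dense_features=[0.1, 0.2])
print(token_to_crf_features(tok))  # {'word.lower': 'restaurant'}
```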

@tttthomasssss
Contributor

I had another poke at the issue, and it is possible to make CRFEntityExtractor use dense embeddings. The three configs below all give different results. It now looks more like a documentation issue, as it's not documented how to configure CRFEntityExtractor to use dense features (so the code in question from the comment above is indeed very much in use). I am also not sure whether this is an intended or an accidental feature (@TyDunn or @dakshvar22 might know more?).

Config 1:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
        # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]

Config 2:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
        # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
        # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]
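For anyone curious how a dense embedding can feed a CRF at all: sklearn_crfsuite takes per-token feature dicts, so each embedding dimension has to become its own keyed entry. A minimal sketch of that flattening (key names are illustrative, not Rasa's exact internals):

```python
from typing import Dict, List

def add_dense_features(features: Dict[str, float],
                       dense: List[float]) -> Dict[str, float]:
    # Flatten a per-token embedding into the dict-of-values format that
    # sklearn_crfsuite expects: one keyed entry per embedding dimension.
    for i, value in enumerate(dense):
        features[f"text_dense_features[{i}]"] = float(value)
    return features

print(add_dense_features({"word.lower": "bar"}, [0.5, -0.25]))
# {'word.lower': 'bar', 'text_dense_features[0]': 0.5, 'text_dense_features[1]': -0.25}
```

This is presumably what the `"features": [["text_dense_features"]]` option switches on in configs 1 and 3 above.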

@dakshvar22
Contributor

dakshvar22 commented Sep 6, 2021

@tttthomasssss I am sure it's accidental that the documentation lacks information on how to use dense features. We should add it if it's not already there.

tttthomasssss added a commit that referenced this issue Sep 8, 2021
tttthomasssss added a commit that referenced this issue Sep 8, 2021
@tttthomasssss
Contributor

Merged with #9572.
