
Pre-trained embeddings not used as feature for CRFEntityExtractor #8930

Closed
4 tasks done
liaeh opened this issue Jun 22, 2021 · 15 comments
Labels: area:rasa-oss 🎡, area:rasa-oss/ml/nlu-components, effort:atom-squad/2, type:bug 🐛

Comments

@liaeh

liaeh commented Jun 22, 2021

Rasa version: 2.7.1

Rasa SDK version (if used & relevant): 2.7.0

Rasa X version (if used & relevant):

Python version: 3.8.8

Operating system (windows, osx, ...): Windows-10-10.0.19041-SP0

Issue:
In the docs for CRFEntityExtractor component, it says:

If you want to pass custom features, such as pre-trained word embeddings, to CRFEntityExtractor, you can add any dense featurizer to the pipeline before the CRFEntityExtractor. CRFEntityExtractor automatically finds the additional dense features and checks if the dense features are an iterable of len(tokens), where each entry is a vector. A warning will be shown in case the check fails.

However, I get identical results when using different language models, or even no language model at all. I'm using Rasa NLU only for a simple entity extraction task. This leads me to think that the pre-trained embeddings are not getting passed on to the CRFEntityExtractor, even though LanguageModelFeaturizer generates dense features and no warning is shown indicating that the pre-trained embeddings are not passed.

For example, when training a CRFEntityExtractor using config1/2/3 on the same train data and testing also on the same test set, I get identical precision/recall/f1 results.
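For reference, the check the docs describe amounts to something like the following (a minimal sketch, not Rasa's actual code; the function name is made up for illustration):

```python
def dense_features_match_tokens(dense_features, tokens):
    """Mirror the documented check: the dense features must be an
    iterable of len(tokens), where each entry is a vector."""
    if len(dense_features) != len(tokens):
        return False  # this is the case that would trigger the warning
    return all(hasattr(vec, "__len__") for vec in dense_features)

tokens = ["book", "a", "table"]
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one vector per token
print(dense_features_match_tokens(embeddings, tokens))  # True
```

Since no warning is shown, the featurizer output apparently passes this check, which makes it all the more surprising that the results don't change.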

Error (including full traceback):

Command or request that led to error:

rasa train nlu -c config1 (or 2 or 3)
rasa test nlu -c config1  (or 2 or 3)

Content of configuration file (config.yml) (if relevant):

Config 1

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 2

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "distilbert"
    model_weights: "distilbert-base-uncased"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Content of domain file (domain.yml) (if relevant):

Content of train data
I am just using a few utterances from the SNIPS dataset. Here's a small example of my train data.

version: '2.0'

nlu:
- intent: General
  examples: |
    - find a [restaurant](restaurant_type) for [marylou and i](party_size_description) [within walking distance](spatial_relation) of [my mum s hotel](poi)
    - book a table at a [bar](restaurant_type) in [cambodia](country) that serves [cheese fries](served_dish)
    - i m in [bowling green](poi) please book a [restaurant](restaurant_type) for [1](party_size_number) [close by](spatial_relation)
    - book a [restaurant](restaurant_type) at a [steakhouse](restaurant_type) [around](spatial_relation) [in town](poi) that serves [empanada](served_dish) for [me and my son](party_size_description)
    - book me a table for [me and my nephew](party_size_description) [near](spatial_relation) [my location](poi) at an [indoor](facility) [pub](restaurant_type)
    - book a table for [me and belinda](party_size_description) serving [minestra](served_dish) in a [bar](restaurant_type)
    - i need seating for [ten](party_size_number) people at a [bar](restaurant_type) that serves [czech](cuisine) cuisine
    - book a spot for [connie earline and rose](party_size_description) at an [oyster bar](restaurant_type) that serves [chicken fried bacon](served_dish) in [beauregard](city) [delaware](state)
    - reserve a table for [two](party_size_number) at a [restaurant](restaurant_type) which serves [creole](cuisine) [around](spatial_relation) here in [myanmar](country)
    - take me a [top-rated](sort) [restaurant](restaurant_type) for [nine](party_size_number) [close](spatial_relation) to [westfield](city) [delaware](state)
    - book a [joint restaurant](restaurant_type) for [four](party_size_number) with an [outdoor](facility) [area within the same area](spatial_relation) as [borough de denali](poi)
    - make reservations for [7](party_size_number) people at a [top-rated](sort) [brazilian](cuisine) [pub](restaurant_type) [around](spatial_relation) [rockaway park-beach 116th](poi)
    - need to book a table [downtown](poi) [within walking distance](spatial_relation) of me at [j g melon](restaurant_name)

Definition of done

  • Determine if this is only a documentation issue by looking through 1.4, 1.5, and 2.x + asking the research team
  • If so, then we should update the docs and add warnings
  • Otherwise, create another issue for addressing this bug
  • Reviewed by @koernerfelicia
liaeh added the area:rasa-oss 🎡 and type:bug 🐛 labels on Jun 22, 2021
@sara-tagger
Collaborator

Thanks for raising this issue, @lty4 will get back to you about it soon✨

Please also check out the docs and the forum in case your issue was raised there too 🤗

@liaeh
Author

liaeh commented Jun 23, 2021

@koaning
Contributor

koaning commented Jun 25, 2021

I'm checking this out right now. My gut feeling is that the LanguageModelTokenizer is meant to handle the byte-pair tokenizer inside Hugging Face. You're using it as a tokenizer for Rasa, so I imagine that's where something goes awry.

Note that the LanguageModelTokenizer is also deprecated.

@koaning koaning self-assigned this Jun 25, 2021
@koaning
Contributor

koaning commented Jun 25, 2021

I might also ask, is there a reason why you weren't using DIET?

@koaning
Contributor

koaning commented Jun 28, 2021

Correction!

I was able to reproduce the issue. These two pipelines yield the same results.

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: CRFEntityExtractor
pipeline:
  - name: WhitespaceTokenizer
  - name: CRFEntityExtractor

Even the confidence values are the same (confirmed via rasa shell nlu).

@liaeh
Author

liaeh commented Jun 28, 2021


Great, glad you were able to reproduce it.

The reason I'm not using DIET is because I want to have a benchmark on how the NLU pipeline performs with/without finetuning a transformer model.

@koaning
Contributor

koaning commented Jun 28, 2021

In the meantime then; you can turn off the transformer layers inside of DIET. That way you can still get your measurement.
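A sketch of what such a pipeline could look like (parameter names per the Rasa 2.x DIETClassifier docs; `number_of_transformer_layers: 0` skips the transformer layers, and disabling intent classification restricts DIET to entity extraction — exact values are illustrative):

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: DIETClassifier
    number_of_transformer_layers: 0  # no transformer layers on top of the features
    intent_classification: False
    entity_recognition: True
```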

@koaning koaning removed their assignment Jul 12, 2021
TyDunn added the area:rasa-oss/ml/nlu-components label on Aug 5, 2021
@samsucik
Contributor

samsucik commented Aug 6, 2021

This looks like an investigation issue where the definition of done would involve producing a simple example (possibly just the one that @koaning used, once he shares it), identifying the root cause, and creating a followup issue to implement and test the fix.

@dakshvar22
Contributor

I am not completely sure, but it looks like a documentation issue. As far as I can remember, CRFEntityExtractor never used dense features (word embeddings) and always relied on syntactic features. I even cross-checked on the 1.10.x branch, and at a quick look it doesn't seem to use the word embeddings of tokens when training a model.
As I said, this is just speculation; we should check it thoroughly.

TyDunn added the effort:atom-squad/2 label on Aug 6, 2021
@tttthomasssss
Contributor

tttthomasssss commented Aug 18, 2021

Status of reproducing this issue across Rasa versions, starting from Rasa v1.4.0. I have used the full SNIPS data for reproduction. Note that the configs differ for older versions, as a lot of the functionality wasn't available earlier (Rasa 1.4.0 is from June 2019!). All done with Python 3.7.6.

  • identical performance in rasa v1.4.0 (using a slightly different config)
  • identical performance in rasa v1.5.3 (using a slightly different config)
  • identical performance in rasa v1.6.1 (using a slightly different config)
  • struggling to install rasa 1.7.x but I assume it won't deviate from the rest

  • identical performance in rasa v1.8.3 (using a minimally different config)
  • identical performance in rasa v1.9.7 (using a minimally different config)
  • identical performance in rasa v1.10.26 (using a minimally different config)
  • identical performance in rasa v2.0.8 (using a minimally different config)
  • identical performance in rasa v2.1.3 (using a minimally different config)
  • identical performance in rasa v2.2.10 (using a minimally different config)
  • identical performance in rasa v2.3.5 (using a minimally different config)
  • identical performance in rasa v2.4.3 (using a minimally different config)
  • identical performance in rasa v2.5.2 (using a minimally different config)
  • identical performance in rasa v2.6.3 (using a minimally different config)
  • identical performance in rasa v2.7.2 (using a minimally different config)
  • identical performance in rasa v2.8.3 (using a minimally different config)

Configs used for rasa 1.4.0, 1.5.3, 1.6.1:

  • config 1:

    language: "en"
    
    pipeline:
    - name: "SpacyNLP"
    - name: "SpacyTokenizer"
    - name: "SpacyFeaturizer"
    - name: "CRFEntityExtractor"
  • config 2:

    language: "en"
    
    pipeline:
    - name: "SpacyNLP"
    - name: "SpacyTokenizer"
    - name: "CRFEntityExtractor"

Configs used for rasa 1.8.3, 1.9.7, 1.10.26, 2.0.8, 2.1.3, 2.2.10, 2.3.5, 2.4.3, 2.5.2, 2.6.3, 2.7.2, 2.8.3:

  • config 1:

    language: en
    
    pipeline:
      - name: HFTransformersNLP
      - name: LanguageModelTokenizer
      - name: LexicalSyntacticFeaturizer
        "features": [
          # features for the word preceding the word being evaluated
          [ "suffix2", "prefix2" ],
          # features for the word being evaluated
          [ "BOS", "EOS" ],
          # features for the word following the word being evaluated
          [ "suffix2", "prefix2" ]]
      - name: CRFEntityExtractor
  • config 2:

    language: en
    
    pipeline:
      - name: HFTransformersNLP
      - name: LanguageModelTokenizer
      - name: LanguageModelFeaturizer
        model_name: "roberta"
        model_weights: "roberta-base"
      - name: LexicalSyntacticFeaturizer
        "features": [
          # features for the word preceding the word being evaluated
          [ "suffix2", "prefix2" ],
          # features for the word being evaluated
          [ "BOS", "EOS" ],
          # features for the word following the word being evaluated
          [ "suffix2", "prefix2" ]]
      - name: CRFEntityExtractor
    

@tttthomasssss
Contributor

Given the reproduction above, this looks like a docs issue. CRFEntityExtractor has indeed never used any dense word embeddings.

@tttthomasssss
Contributor

I had a closer look into the code for CRFEntityExtractor, and what's slightly odd is that the feature preprocessing does extract dense features from whatever embedding model you specify and adds them to the CRFToken, but then ignores them again when building the X_train matrix that's used for training the eventual sklearn_crfsuite CRF. @dakshvar22, do you know anything more about the use of dense features in the CRFEntityExtractor?

Given this is unused code, we should probably remove it? (@TyDunn might need an updated definition of done if we want to remove the unused code.)
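The pattern described above looks roughly like this (a heavily simplified sketch with made-up names, not Rasa's actual internals): the embedding rides along on the token, but only sparse features end up in the dict handed to sklearn_crfsuite.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CRFToken:
    """Simplified stand-in for Rasa's internal CRFToken."""
    text: str
    dense_features: List[float] = field(default_factory=list)

def token_to_crf_features(token: CRFToken) -> Dict[str, str]:
    # Only sparse features make it into the dict used to build X_train;
    # token.dense_features is carried but never copied in, so the CRF
    # never sees the embeddings.
    features = {"word.lower": token.text.lower()}
    return features

tok = CRFToken("Restaurant", dense_features=[0.1, 0.2])
print(token_to_crf_features(tok))  # {'word.lower': 'restaurant'}
```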

@tttthomasssss
Contributor

I had another poke at the issue, and it is possible to make CRFEntityExtractor use dense embeddings. The three configs below all give different results. It now looks more like a documentation issue, as it's not documented how to configure CRFEntityExtractor to use dense features (so the code in question from the comment above is indeed very much in use). I am also not sure whether this is an intended or an accidental feature (@TyDunn or @dakshvar22 might know more?).

Config 1:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
        # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]

Config 2:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
        # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
        # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]
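For anyone curious how a dense embedding can feed a CRF at all: sklearn_crfsuite takes per-token feature dicts, so each embedding dimension has to become its own keyed entry. A minimal sketch of that flattening (key names are illustrative, not Rasa's exact internals):

```python
from typing import Dict, List

def add_dense_features(features: Dict[str, float],
                       dense: List[float]) -> Dict[str, float]:
    # Flatten a per-token embedding into the dict-of-values format that
    # sklearn_crfsuite expects: one keyed entry per embedding dimension.
    for i, value in enumerate(dense):
        features[f"text_dense_features[{i}]"] = float(value)
    return features

print(add_dense_features({"word.lower": "bar"}, [0.5, -0.25]))
# {'word.lower': 'bar', 'text_dense_features[0]': 0.5, 'text_dense_features[1]': -0.25}
```

This is presumably what the `"features": [["text_dense_features"]]` option switches on in configs 1 and 3 above.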

@dakshvar22
Contributor

dakshvar22 commented Sep 6, 2021

@tttthomasssss I am sure it's accidental that the documentation lacks information on how to use dense features. We should add it if it's not already there.

tttthomasssss added a commit that referenced this issue Sep 8, 2021
tttthomasssss added a commit that referenced this issue Sep 8, 2021
@tttthomasssss
Contributor

Merged with #9572.
