Commit

addresses #8930 by adding documentation on how to use a dense featurizer with the CRFEntityExtractor
tttthomasssss committed Sep 8, 2021
1 parent 2b943f9 commit ba86407
Showing 1 changed file with 29 additions and 21 deletions.
50 changes: 29 additions & 21 deletions docs/docs/components.mdx
@@ -1713,7 +1713,8 @@ The `SpacyEntityExtractor` extractor does not provide a `confidence` level and w
If you want to pass custom features, such as pre-trained word embeddings, to `CRFEntityExtractor`, you can
add any dense featurizer to the pipeline before the `CRFEntityExtractor`.
add any dense featurizer to the pipeline before the `CRFEntityExtractor` and subsequently configure
`CRFEntityExtractor` to make use of the dense features by adding `"text_dense_features"` to its feature configuration.
`CRFEntityExtractor` automatically finds the additional dense features and checks if the dense features are an
iterable of `len(tokens)`, where each entry is a vector.
A warning will be shown in case the check fails.
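The check described above can be sketched as follows. This is an illustration only, not Rasa's implementation: the helper name `dense_features_match_tokens` and the plain-list representation of the vectors are assumptions made for this example.

```python
import warnings
from numbers import Real
from typing import Sequence


def dense_features_match_tokens(
    tokens: Sequence[str], dense_features: Sequence[Sequence[float]]
) -> bool:
    """Hypothetical helper: True if there is exactly one numeric vector per token."""
    # There must be as many feature entries as there are tokens.
    if len(dense_features) != len(tokens):
        warnings.warn(
            f"Expected {len(tokens)} dense feature vectors, "
            f"got {len(dense_features)}."
        )
        return False
    # Every entry must be a non-empty vector of numbers.
    for vector in dense_features:
        is_vector = isinstance(vector, (list, tuple)) and len(vector) > 0
        if not is_vector or not all(isinstance(v, Real) for v in vector):
            warnings.warn("Each dense feature entry must be a vector of numbers.")
            return False
    return True


tokens = ["fly", "to", "Berlin"]
matching = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one vector per token
too_short = [[0.1, 0.2], [0.3, 0.4]]             # one vector is missing

print(dense_features_match_tokens(tokens, matching))
print(dense_features_match_tokens(tokens, too_short))
```

The second call emits a warning instead of raising, mirroring the behaviour described in the text, where a failed check produces a warning.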
@@ -1730,26 +1731,27 @@ The `SpacyEntityExtractor` extractor does not provide a `confidence` level and w
The following features are available:
```
==============  ==========================================================================================
Feature Name    Description
==============  ==========================================================================================
low             Checks if the token is lower case.
upper           Checks if the token is upper case.
title           Checks if the token starts with an uppercase character and all remaining characters are
                lowercased.
digit           Checks if the token contains just digits.
prefix5         Take the first five characters of the token.
prefix2         Take the first two characters of the token.
suffix5         Take the last five characters of the token.
suffix3         Take the last three characters of the token.
suffix2         Take the last two characters of the token.
suffix1         Take the last character of the token.
pos             Take the Part-of-Speech tag of the token (``SpacyTokenizer`` required).
pos2            Take the first two characters of the Part-of-Speech tag of the token
                (``SpacyTokenizer`` required).
pattern         Take the patterns defined by ``RegexFeaturizer``.
bias            Add an additional "bias" feature to the list of features.
==============  ==========================================================================================
===================  ==========================================================================================
Feature Name         Description
===================  ==========================================================================================
low                  Checks if the token is lower case.
upper                Checks if the token is upper case.
title                Checks if the token starts with an uppercase character and all remaining characters are
                     lowercased.
digit                Checks if the token contains just digits.
prefix5              Take the first five characters of the token.
prefix2              Take the first two characters of the token.
suffix5              Take the last five characters of the token.
suffix3              Take the last three characters of the token.
suffix2              Take the last two characters of the token.
suffix1              Take the last character of the token.
pos                  Take the Part-of-Speech tag of the token (``SpacyTokenizer`` required).
pos2                 Take the first two characters of the Part-of-Speech tag of the token
                     (``SpacyTokenizer`` required).
pattern              Take the patterns defined by ``RegexFeaturizer``.
bias                 Add an additional "bias" feature to the list of features.
text_dense_features  Adds additional features from a dense featurizer.
===================  ==========================================================================================
```
As the featurizer is moving over the tokens in a user message with a sliding window, you can define features for
@@ -1782,6 +1784,7 @@ The `SpacyEntityExtractor` extractor does not provide a `confidence` level and w
"pattern",
],
["low", "title", "upper"],
["text_dense_features"]
]
# The maximum number of iterations for optimization algorithms.
"max_iterations": 50
@@ -1808,6 +1811,11 @@ The `SpacyEntityExtractor` extractor does not provide a `confidence` level and w
:::
:::note
If you use the `text_dense_features` feature, you need to have a dense featurizer (e.g. `LanguageModelFeaturizer`) in
your pipeline.
:::
### DucklingEntityExtractor