[[analysis-htmlstrip-charfilter]]
=== HTML strip character filter
++++
<titleabbrev>HTML strip</titleabbrev>
++++

Strips HTML elements from a text and replaces HTML entities with their decoded
value (e.g., replaces `&amp;` with `&`).

The `html_strip` filter uses Lucene's
{lucene-analysis-docs}/charfilter/HTMLStripCharFilter.html[HTMLStripCharFilter].

[[analysis-htmlstrip-charfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`html_strip` filter to change the text `<p>I&apos;m so <b>happy</b>!</p>` to
`\nI'm so happy!\n`.

[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
----

The filter produces the following text:

[source,text]
----
[ \nI'm so happy!\n ]
----

////
[source,console-result]
----
{
  "tokens": [
    {
      "token": "\nI'm so happy!\n",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}
----
////

[[analysis-htmlstrip-charfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`html_strip` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}
----
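
To test the new analyzer, you can run an <<indices-analyze,analyze API>> request
against the index. This request is a sketch, assuming the `my_index` request
above has been applied; it should return the single term `\nI'm so happy!\n`.

[source,console]
----
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
----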

[[analysis-htmlstrip-charfilter-configure-parms]]
==== Configurable parameters

`escaped_tags`::
(Optional, array of strings)
Array of HTML elements without enclosing angle brackets (`< >`). The filter
skips these HTML elements when stripping HTML from the text. For example, a
value of `[ "p" ]` skips the `<p>` HTML element.

[[analysis-htmlstrip-charfilter-customize]]
==== Customize

To customize the `html_strip` filter, duplicate it to create the basis for a
new custom character filter. You can modify the filter using its configurable
parameters.

The following <<indices-create-index,create index API>> request
configures a new <<analysis-custom-analyzer,custom analyzer>> using a custom
`html_strip` filter, `my_custom_html_strip_char_filter`.

The `my_custom_html_strip_char_filter` filter skips the removal of the `<b>`
HTML element.

[source,console]
----
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}
----
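
To see the custom filter in action, you can run an <<indices-analyze,analyze API>>
request against the index. This request is a sketch, assuming the `my_index`
request above has been applied; because `b` is listed in `escaped_tags`, it
should return the single term `\nI'm so <b>happy</b>!\n`, with the `<b>`
element left in place.

[source,console]
----
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
----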