1 1 [[analysis-cjk-bigram-tokenfilter]]
2- === CJK Bigram Token Filter
2+ === CJK bigram token filter
3+ ++++
4+ <titleabbrev>CJK bigram</titleabbrev>
5+ ++++
36
4- The `cjk_bigram` token filter forms bigrams out of the CJK
5- terms that are generated by the <<analysis-standard-tokenizer,`standard` tokenizer>>
6- or the `icu_tokenizer` (see {plugins}/analysis-icu-tokenizer.html[`analysis-icu` plugin]).
7+ Forms https://en.wikipedia.org/wiki/Bigram[bigrams] out of CJK (Chinese,
8+ Japanese, and Korean) tokens.
79
8- By default, when a CJK character has no adjacent characters to form a bigram,
9- it is output in unigram form. If you always want to output both unigrams and
10- bigrams, set the `output_unigrams` flag to `true`. This can be used for a
11- combined unigram+bigram approach.
10+ This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
11+ analyzer>>. It uses Lucene's
12+ https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html[CJKBigramFilter].
1213
13- Bigrams are generated for characters in `han`, `hiragana`, `katakana` and
14- `hangul`, but bigrams can be disabled for particular scripts with the
15- `ignored_scripts` parameter. All non-CJK input is passed through unmodified.
14+
15+ [[analysis-cjk-bigram-tokenfilter-analyze-ex]]
16+ ==== Example
17+
18+ The following <<indices-analyze,analyze API>> request demonstrates how the
19+ CJK bigram token filter works.
20+
21+ [source,console]
22+ --------------------------------------------------
23+ GET /_analyze
24+ {
25+ "tokenizer" : "standard",
26+ "filter" : ["cjk_bigram"],
27+ "text" : "東京都は、日本の首都であり"
28+ }
29+ --------------------------------------------------
30+
31+ The filter produces the following tokens:
32+
33+ [source,text]
34+ --------------------------------------------------
35+ [ 東京, 京都, 都は, 日本, 本の, の首, 首都, 都で, であ, あり ]
36+ --------------------------------------------------
37+
38+ /////////////////////
39+ [source,console-result]
40+ --------------------------------------------------
41+ {
42+ "tokens" : [
43+ {
44+ "token" : "東京",
45+ "start_offset" : 0,
46+ "end_offset" : 2,
47+ "type" : "<DOUBLE>",
48+ "position" : 0
49+ },
50+ {
51+ "token" : "京都",
52+ "start_offset" : 1,
53+ "end_offset" : 3,
54+ "type" : "<DOUBLE>",
55+ "position" : 1
56+ },
57+ {
58+ "token" : "都は",
59+ "start_offset" : 2,
60+ "end_offset" : 4,
61+ "type" : "<DOUBLE>",
62+ "position" : 2
63+ },
64+ {
65+ "token" : "日本",
66+ "start_offset" : 5,
67+ "end_offset" : 7,
68+ "type" : "<DOUBLE>",
69+ "position" : 3
70+ },
71+ {
72+ "token" : "本の",
73+ "start_offset" : 6,
74+ "end_offset" : 8,
75+ "type" : "<DOUBLE>",
76+ "position" : 4
77+ },
78+ {
79+ "token" : "の首",
80+ "start_offset" : 7,
81+ "end_offset" : 9,
82+ "type" : "<DOUBLE>",
83+ "position" : 5
84+ },
85+ {
86+ "token" : "首都",
87+ "start_offset" : 8,
88+ "end_offset" : 10,
89+ "type" : "<DOUBLE>",
90+ "position" : 6
91+ },
92+ {
93+ "token" : "都で",
94+ "start_offset" : 9,
95+ "end_offset" : 11,
96+ "type" : "<DOUBLE>",
97+ "position" : 7
98+ },
99+ {
100+ "token" : "であ",
101+ "start_offset" : 10,
102+ "end_offset" : 12,
103+ "type" : "<DOUBLE>",
104+ "position" : 8
105+ },
106+ {
107+ "token" : "あり",
108+ "start_offset" : 11,
109+ "end_offset" : 13,
110+ "type" : "<DOUBLE>",
111+ "position" : 9
112+ }
113+ ]
114+ }
115+ --------------------------------------------------
116+ /////////////////////
117+
118+ [[analysis-cjk-bigram-tokenfilter-analyzer-ex]]
119+ ==== Add to an analyzer
120+
121+ The following <<indices-create-index,create index API>> request uses the
122+ CJK bigram token filter to configure a new
123+ <<analysis-custom-analyzer,custom analyzer>>.
124+
125+ [source,console]
126+ --------------------------------------------------
127+ PUT /cjk_bigram_example
128+ {
129+ "settings" : {
130+ "analysis" : {
131+ "analyzer" : {
132+ "standard_cjk_bigram" : {
133+ "tokenizer" : "standard",
134+ "filter" : ["cjk_bigram"]
135+ }
136+ }
137+ }
138+ }
139+ }
140+ --------------------------------------------------
141+
142+
143+ [[analysis-cjk-bigram-tokenfilter-configure-parms]]
144+ ==== Configurable parameters
145+
146+ `ignored_scripts`::
147+ +
148+ --
149+ (Optional, array of character scripts)
150+ Array of character scripts for which to disable bigrams.
151+ Possible values:
152+
153+ * `han`
154+ * `hangul`
155+ * `hiragana`
156+ * `katakana`
157+
158+ All non-CJK input is passed through unmodified.
159+ --
160+
161+ `output_unigrams`::
162+ (Optional, boolean)
163+ If `true`, emit tokens in both bigram and
164+ https://en.wikipedia.org/wiki/N-gram[unigram] form. If `false`, a CJK character
165+ is output in unigram form only when it has no adjacent CJK characters to form
166+ a bigram with. Defaults to `false`.
167+
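The effect of `output_unigrams` can be illustrated with a small Python sketch. This is a simplified model, not Lucene's `CJKBigramFilter` implementation: the helper `cjk_bigrams` is hypothetical, it assumes its input is already a run of adjacent CJK characters, and it ignores script detection and `ignored_scripts`. The placement of each unigram before its bigram is an assumption of the sketch.

```python
def cjk_bigrams(run, output_unigrams=False):
    """Model bigram formation over one run of adjacent CJK characters.

    Hypothetical helper for illustration only; not part of any library.
    """
    # A lone character has no neighbor, so it is emitted as a unigram.
    if len(run) == 1:
        return [run]
    tokens = []
    for i, ch in enumerate(run):
        if output_unigrams:
            tokens.append(ch)            # unigram for every character
        if i + 1 < len(run):
            tokens.append(run[i:i + 2])  # overlapping two-character window
    return tokens


# The example text split at the punctuation mark, which is not a CJK letter.
runs = ["東京都は", "日本の首都であり"]
default_tokens = [t for run in runs for t in cjk_bigrams(run)]
with_unigrams = cjk_bigrams("東京都", output_unigrams=True)
```

With the default setting, `default_tokens` matches the ten bigrams listed in the example section. With `output_unigrams` set to `True`, the run `東京都` yields each character in addition to the bigrams `東京` and `京都`.
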
168+ [[analysis-cjk-bigram-tokenfilter-customize]]
169+ ==== Customize
170+
171+ To customize the CJK bigram token filter, duplicate it to create the basis
172+ for a new custom token filter. You can modify the filter using its configurable
173+ parameters.
16 174
17 175 [source,console]
18 176 --------------------------------------------------
@@ -30,9 +188,9 @@ PUT /cjk_bigram_example
30188 "han_bigrams_filter" : {
31189 "type" : "cjk_bigram",
32190 "ignored_scripts": [
191+ "hangul",
33192 "hiragana",
34- "katakana",
35- "hangul"
193+ "katakana"
36194 ],
37195 "output_unigrams" : true
38196 }