Table of contents
SELECT
statement in SQL is the most common query that retrieves data from Elasticsearch index. In this doc, only simple SELECT
statement with single index and query involved is covered. A SELECT
statement includes SELECT
, FROM
, WHERE
, GROUP BY
, HAVING
, ORDER BY
and LIMIT
clause. Among these clauses, SELECT
and FROM
are the foundation to specify which fields to be fetched and which index they should be fetched from. All others are optional and used according to your needs. Please read on for their description, syntax and use cases in details.
The syntax of SELECT
statement is as follows:
SELECT [ALL | DISTINCT] (* | expression) [[AS] alias] [, ...] FROM index_name [WHERE predicates] [GROUP BY expression [, ...] [HAVING predicates]] [ORDER BY expression [ASC | DESC] [NULLS {FIRST | LAST}] [, ...]] [LIMIT [offset, ] size]
Although multiple query statements to execute in batch is not supported, ending with semicolon ;
is still allowed. For example, you can run SELECT * FROM accounts;
without issue. This is useful to support queries generated by other tool, such as Microsoft Excel or BI tool.
Apart from predefined keyword of SQL language, the most basic element is literal and identifier. Literal is numeric, string, date or boolean constant. Identifier represents Elasticsearch index or field name. With arithmetic operators and SQL functions applied, the basic literals and identifiers can be built into complex expression.
Rule expressionAtom
:
The expression in turn can be combined into predicate with logical operator. Typically, predicate is used in WHERE
and HAVING
clause to filter out data by conditions specified.
Rule expression
:
Rule predicate
:
The actual order of execution is very different from its appearance:
FROM index WHERE predicates GROUP BY expressions HAVING predicates SELECT expressions ORDER BY expressions LIMIT size
SELECT
clause specifies which fields in Elasticsearch index should be retrieved.
Rule selectElements
:
Rule selectElement
:
You can use *
to fetch all fields in the index which is very convenient when you just want to have a quick look at your data.
SQL query:
POST /_opendistro/_sql { "query" : "SELECT * FROM accounts" }
Explain:
{ "from" : 0, "size" : 200 }
Result set:
account_number | firstname | gender | city | balance | employer | state | address | lastname | age | |
---|---|---|---|---|---|---|---|---|---|---|
1 | Amber | M | Brogan | 39225 | Pyrami | IL | [email protected] | 880 Holmes Lane | Duke | 32 |
6 | Hattie | M | Dante | 5686 | Netagy | TN | [email protected] | 671 Bristol Street | Bond | 36 |
13 | Nanette | F | Nogal | 32838 | Quility | VA | [email protected] | 789 Madison Street | Bates | 28 |
18 | Dale | M | Orick | 4180 | null | MD | [email protected] | 467 Hutchinson Court | Adams | 33 |
More often you would give specific field name(s) in SELECT
clause to avoid large and unnecessary data retrieved.
SQL query:
POST /_opendistro/_sql { "query" : "SELECT firstname, lastname FROM accounts" }
Explain:
{ "from" : 0, "size" : 200, "_source" : { "includes" : [ "firstname", "lastname" ], "excludes" : [ ] } }
Result set:
firstname | lastname |
---|---|
Amber | Duke |
Dale | Adams |
Hattie | Bond |
Nanette | Bates |
Alias is often used to make your query more readable by giving your field a shorter name.
SQL query:
POST /_opendistro/_sql { "query" : "SELECT account_number AS num FROM accounts" }
Explain:
{ "from" : 0, "size" : 200, "_source" : { "includes" : [ "account_number" ], "excludes" : [ ] } }
Result set:
num |
---|
1 |
6 |
13 |
18 |
By default, SELECT ALL
takes effect to return all rows. DISTINCT
is useful when you want to de-duplicate and get unique field value. You can provide one or more field names ('DISTINCT *' is not supported yet).
SQL query:
POST /_opendistro/_sql { "query" : "SELECT DISTINCT age FROM accounts" }
Explain:
{ "from" : 0, "size" : 0, "_source" : { "includes" : [ "age" ], "excludes" : [ ] }, "stored_fields" : "age", "aggregations" : { "age" : { "terms" : { "field" : "age", "size" : 200, "min_doc_count" : 1, "shard_min_doc_count" : 0, "show_term_doc_count_error" : false, "order" : [ { "_count" : "desc" }, { "_key" : "asc" } ] } } } }
Result set:
age |
---|
28 |
32 |
33 |
36 |
In fact your can use any expression in a DISTINCT
clause as follows:
od> SELECT DISTINCT SUBSTRING(lastname, 1, 1) FROM accounts; fetched rows / total rows = 3/3 +-----------------------------+ | SUBSTRING(lastname, 1, 1) | |-----------------------------| | A | | B | | D | +-----------------------------+
FROM
clause specifies Elasticsearch index where the data should be retrieved from. You've seen how to specify a single index in FROM clause in last section. Here we provide examples for more use cases.
Subquery in FROM
clause is also supported. Please check out the documentation for more details.
Rule tableName
:
Similarly you can give index in FROM
clause an alias and use it across clauses in query.
SQL query:
POST /_opendistro/_sql { "query" : "SELECT acc.account_number FROM accounts acc" }
Alternatively you can query from multiple indices of similar names by index pattern. This is very convenient for indices created by Logstash index template with date as suffix.
SQL query:
POST /_opendistro/_sql { "query" : "SELECT account_number FROM account*" }
You can also specify type name explicitly though this has been deprecated in later Elasticsearch version.
SQL query:
POST /_opendistro/_sql { "query" : "SELECT account_number FROM accounts/account" }
WHERE
clause specifies only Elasticsearch documents that meet the criteria should be affected. It consists of predicates that uses =
, <>
, >
, >=
, <
, <=
, IN
, BETWEEN
, LIKE
, IS NULL
or IS NOT NULL
. These predicates can be combined by logical operator NOT
, AND
or OR
to build more complex expression.
For LIKE
and other full text search topics, please refer to Full Text Search documentation.
Besides SQL query, WHERE clause can also be used in SQL statement such as DELETE
. Please refer to Data Manipulation Language documentation for details.
Basic comparison operators, such as =
, <>
, >
, >=
, <
, <=
, can work for number, string or date. IN
and BETWEEN
is convenient for comparison with multiple values or a range.
SQL query:
POST /_opendistro/_sql { "query" : """ SELECT account_number FROM accounts WHERE account_number = 1 """ }
Explain:
{ "from" : 0, "size" : 200, "query" : { "bool" : { "filter" : [ { "bool" : { "must" : [ { "term" : { "account_number" : { "value" : 1, "boost" : 1.0 } } } ], "adjust_pure_negative" : true, "boost" : 1.0 } } ], "adjust_pure_negative" : true, "boost" : 1.0 } }, "_source" : { "includes" : [ "account_number" ], "excludes" : [ ] } }
Result set:
account_number |
---|
1 |
As NoSQL database, Elasticsearch allows for flexible schema that documents in an index may have different fields. In this case, you can use IS NULL
or IS NOT NULL
to retrieve missing fields or existing fields only.
Note that for now we don't differentiate missing field and field set to NULL
explicitly.
SQL query:
POST /_opendistro/_sql { "query" : """ SELECT account_number, employer FROM accounts WHERE employer IS NULL """ }
Explain:
{ "from" : 0, "size" : 200, "query" : { "bool" : { "filter" : [ { "bool" : { "must" : [ { "bool" : { "must_not" : [ { "exists" : { "field" : "employer.keyword", "boost" : 1.0 } } ], "adjust_pure_negative" : true, "boost" : 1.0 } } ], "adjust_pure_negative" : true, "boost" : 1.0 } } ], "adjust_pure_negative" : true, "boost" : 1.0 } }, "_source" : { "includes" : [ "account_number", "employer" ], "excludes" : [ ] } }
Result set:
account_number | employer |
---|---|
18 | null |
GROUP BY
groups documents with same field value into buckets. It is often used along with aggregation functions to aggregate inside each bucket. Please refer to SQL Functions documentation for more details.
Note that WHERE
clause is applied before GROUP BY
clause.
SQL query:
POST /_opendistro/_sql { "query" : """ SELECT age FROM accounts GROUP BY age """ }
Explain:
{ "from" : 0, "size" : 0, "_source" : { "includes" : [ "age" ], "excludes" : [ ] }, "stored_fields" : "age", "aggregations" : { "age" : { "terms" : { "field" : "age", "size" : 200, "min_doc_count" : 1, "shard_min_doc_count" : 0, "show_term_doc_count_error" : false, "order" : [ { "_count" : "desc" }, { "_key" : "asc" } ] } } } }
Result set:
age |
---|
28 |
32 |
33 |
36 |
Field alias is accessible in GROUP BY
clause.
SQL query:
POST /_opendistro/_sql { "query" : """ SELECT account_number AS num FROM accounts GROUP BY num """ }
Explain:
{ "from" : 0, "size" : 0, "_source" : { "includes" : [ "account_number" ], "excludes" : [ ] }, "stored_fields" : "account_number", "aggregations" : { "num" : { "terms" : { "field" : "account_number", "size" : 200, "min_doc_count" : 1, "shard_min_doc_count" : 0, "show_term_doc_count_error" : false, "order" : [ { "_count" : "desc" }, { "_key" : "asc" } ] } } } }
Result set:
num |
---|
1 |
6 |
13 |
18 |
Alternatively field ordinal in SELECT
clause can be used too. However this is not recommended because your GROUP BY
clause depends on fields in SELECT
clause and require to change accordingly.
SQL query:
POST /_opendistro/_sql { "query" : """ SELECT age FROM accounts GROUP BY 1 """ }
Explain:
{ "from" : 0, "size" : 0, "_source" : { "includes" : [ "age" ], "excludes" : [ ] }, "stored_fields" : "age", "aggregations" : { "age" : { "terms" : { "field" : "age", "size" : 200, "min_doc_count" : 1, "shard_min_doc_count" : 0, "show_term_doc_count_error" : false, "order" : [ { "_count" : "desc" }, { "_key" : "asc" } ] } } } }
Result set:
age |
---|
28 |
32 |
33 |
36 |
Scalar function can be used in GROUP BY
clause and it's required to be present in SELECT
clause too.
SQL query:
POST /_opendistro/_sql { "query" : """ SELECT ABS(age) AS a FROM accounts GROUP BY ABS(age) """ }
Explain:
{ "from" : 0, "size" : 0, "_source" : { "includes" : [ "script" ], "excludes" : [ ] }, "stored_fields" : "abs(age)", "script_fields" : { "abs(age)" : { "script" : { "source" : "def abs_1 = Math.abs(doc['age'].value);return abs_1;", "lang" : "painless" }, "ignore_failure" : false } }, "aggregations" : { "abs(age)" : { "terms" : { "script" : { "source" : "def abs_1 = Math.abs(doc['age'].value);return abs_1;", "lang" : "painless" }, "size" : 200, "min_doc_count" : 1, "shard_min_doc_count" : 0, "show_term_doc_count_error" : false, "order" : [ { "_count" : "desc" }, { "_key" : "asc" } ] } } } }
Result set:
a |
---|
28.0 |
32.0 |
33.0 |
36.0 |
HAVING
clause filters result from GROUP BY
clause by predicate(s). Because of this, aggregation function, even different from those on SELECT
clause, can be used in predicate.
SQL query:
POST /_opendistro/_sql { "query" : """ SELECT age, MAX(balance) FROM accounts GROUP BY age HAVING MIN(balance) > 10000 """ }
Explain:
{ "from" : 0, "size" : 0, "_source" : { "includes" : [ "age", "MAX" ], "excludes" : [ ] }, "stored_fields" : "age", "aggregations" : { "age" : { "terms" : { "field" : "age", "size" : 200, "min_doc_count" : 1, "shard_min_doc_count" : 0, "show_term_doc_count_error" : false, "order" : [ { "_count" : "desc" }, { "_key" : "asc" } ] }, "aggregations" : { "MAX_0" : { "max" : { "field" : "balance" } }, "min_0" : { "min" : { "field" : "balance" } }, "bucket_filter" : { "bucket_selector" : { "buckets_path" : { "min_0" : "min_0", "MAX_0" : "MAX_0" }, "script" : { "source" : "params.min_0 > 10000", "lang" : "painless" }, "gap_policy" : "skip" } } } } } }
Result set:
age | MAX(balance) |
---|---|
28 | 32838 |
32 | 39225 |
ORDER BY
clause specifies which fields used to sort the result and in which direction.
Besides regular field names, ordinal, alias or scalar function can also be used similarly as in GROUP BY
. ASC
(by default) or DESC
can be appended to indicate sorting in ascending or descending order.
SQL query:
POST /_opendistro/_sql { "query" : "SELECT account_number FROM accounts ORDER BY account_number DESC" }
Explain:
{ "from" : 0, "size" : 200, "_source" : { "includes" : [ "account_number" ], "excludes" : [ ] }, "sort" : [ { "account_number" : { "order" : "desc" } } ] }
Result set:
account_number |
---|
18 |
13 |
6 |
1 |
Additionally you can specify if documents with missing field be put first or last. The default behavior of Elasticsearch is to return nulls or missing last. You can make them present before non-nulls by using IS NOT NULL
.
SQL query:
POST /_opendistro/_sql { "query" : """ SELECT employer FROM accounts ORDER BY employer IS NOT NULL """ }
Explain:
{ "from" : 0, "size" : 200, "_source" : { "includes" : [ "employer" ], "excludes" : [ ] }, "sort" : [ { "employer.keyword" : { "order" : "asc", "missing" : "_first" } } ] }
Result set:
employer |
---|
null |
Netagy |
Pyrami |
Quility |
Note that the example above is essentially sorting on a predicate expression. In this case, nulls are put first because it's evaluated to false (0), though all the rest are evaluated to true and still in random order. If you want to specify order for both nulls and non-nulls, NULLS FIRST
or NULLS LAST
in SQL standard can help. Basically, it allows you to specify an independent order for nulls along with ASC
or DESC
keyword:
od> SELECT employer FROM accounts ORDER BY employer ASC NULLS LAST; fetched rows / total rows = 4/4 +------------+ | employer | |------------| | Netagy | | Pyrami | | Quility | | null | +------------+
The sorting rule can be summarized as follows:
- Without
NULLS
clauseASC
: sort non-nulls in ascending order and put nulls firstDESC
: sort non-nulls in descending order and put nulls last
- With
NULLS
clause: just use the nulls order given
Here is another example for sort in descending order without NULLS
clause:
od> SELECT employer FROM accounts ORDER BY employer DESC; fetched rows / total rows = 4/4 +------------+ | employer | |------------| | Quility | | Pyrami | | Netagy | | null | +------------+
Aggregate functions are allowed to be used in ORDER BY
clause. You can reference it by same function call or its alias or ordinal in select list:
od> SELECT gender, MAX(age) FROM accounts GROUP BY gender ORDER BY MAX(age) DESC; fetched rows / total rows = 2/2 +----------+------------+ | gender | MAX(age) | |----------+------------| | M | 36 | | F | 28 | +----------+------------+
Even if it's not present in SELECT
clause, it can be also used as follows:
od> SELECT gender, MIN(age) FROM accounts GROUP BY gender ORDER BY MAX(age) DESC; fetched rows / total rows = 2/2 +----------+------------+ | gender | MIN(age) | |----------+------------| | M | 32 | | F | 28 | +----------+------------+
Mostly specifying maximum number of documents returned is necessary to prevent fetching large amount of data into memory. LIMIT clause is helpful in this case. Basically the limit is set to the query planning, so different LIMIT and OFFSET might end up unpredictable subset in the results. Thus it is suggested to use order by in query with limit keyword to enforce a fixed ordering in the result set.
Given a positive number, LIMIT
uses it as page size to fetch result of that size at most.
SQL query:
POST /_opendistro/_sql { "query" : """ SELECT account_number FROM accounts ORDER BY account_number LIMIT 1 """ }
Explain:
{ "from" : 0, "size" : 1, "_source" : { "includes" : [ "account_number" ], "excludes" : [ ] }, "sort" : [ { "account_number" : { "order" : "asc" } } ] }
Result set:
account_number |
---|
1 |
Offset position can be given as first argument to indicate where to start fetching. This can be used as simple pagination solution though it's inefficient on large index. Generally ORDER BY
is required in this case to ensure the same order between pages.
SQL query:
POST /_opendistro/_sql { "query" : """ SELECT account_number FROM accounts ORDER BY account_number LIMIT 1, 1 """ }
Explain:
{ "from" : 1, "size" : 1, "_source" : { "includes" : [ "account_number" ], "excludes" : [ ] }, "sort" : [ { "account_number" : { "order" : "asc" } } ] }
Result set:
account_number |
---|
6 |
Offset position can be given following the OFFSET keyword as well, here is an example:
>od SELECT age FROM accounts ORDER BY age LIMIT 2 OFFSET 1 fetched rows / total rows = 2/2 +-------+ | age | |-------| | 32 | | 33 | +-------+
Generally, sort plan is pushed down into the Elasticsearch DSL in plan optimization, but note that if a query has complex sorting, like sort expression, which would not be pushed down during optimization (see Optimizations for details), but computed in local memory. However, the engine fetches the index of a default size that is set in plugin setting (See Settings <../admin/settings.rst> opendistro.query.size_limit for details). Therefore, the result might not be absolutely correct if the index size is larger than the default size of index scan. For example, the engine has a index scan size of 200 and the index size is 500. Then a query with limit 300 can only fetch 200 rows of the index, compute and return the sorted result with 200 rows, while the rest 300 rows of the index are ignored and would not be fetched into the engine. To get an absolutely correct result, it is suggested to set the query size limit to a larger value before run the query.