diff --git a/rfcs/text/0001-wildcard-data-type.md b/rfcs/text/0001-wildcard-data-type.md
index ff81ac08a0..448a929f4e 100644
--- a/rfcs/text/0001-wildcard-data-type.md
+++ b/rfcs/text/0001-wildcard-data-type.md
@@ -1,8 +1,8 @@
# 0001: Wildcard Field Adoption into ECS
-- Stage: **1 (proposal)**
-- Date: **TBD**
+- Stage: **2 (draft)**
+- Date: **2020-10-02**
Wildcard is a data type for Elasticsearch string fields being introduced in Elasticsearch 7.9. Wildcard optimizes performance for queries using wildcards (`*`) and regex, allowing users to perform `grep`-like searches without the limitations of the existing
text[0] and keyword[1] types.
@@ -10,28 +10,46 @@ text[0] and keyword[1] types.
## Fields
-For a field to use wildcard, it will require changing the the field's defined schema `type` from `keyword` to `wildcard`. The following fieldsets are expected to adopt `wildcard` in at least one of their fields:
-
-* `agent.*`
-* `destination.*`
-* `error.*`
-* `file.*`
-* `host.*`
-* `http.*`
-* `os.*`
-* `process.*`
-* `registry.*`
-* `source.*`
-* `url.*`
-* `user.*`
-* `user_agent.*`
+### Identified Wildcard Fields
+
+For a field to use wildcard, it will require changing the field's defined schema `type` from `keyword` to `wildcard`. The following fields are candidates for `wildcard`:
+
+| Field Set | Field(s) |
+| --------- | -------- |
+| [`agent`](0001/agent.yml) | `agent.build.original` |
+| [`as`](0001/as.yml) | `as.organization.name` |
+| [`client`](0001/client.yml) | `client.domain`<br>`client.registered_domain` |
+| [`destination`](0001/destination.yml) | `destination.domain`<br>`destination.registered_domain` |
+| [`dns`](0001/dns.yml) | `dns.question.name`<br>`dns.answers.data` |
+| [`error`](0001/error.yml) | `error.stack_trace`<br>`error.type` |
+| [`event`](0001/event.yml) | `event.original` |
+| [`file`](0001/file.yml) | `file.directory`<br>`file.path`<br>`file.target_path` |
+| [`geo`](0001/geo.yml) | `geo.name` |
+| [`host`](0001/host.yml) | `host.hostname` |
+| [`http`](0001/http.yml) | `http.request.referrer`<br>`http.request.body.content`<br>`http.response.body.content` |
+| [`log`](0001/log.yml) | `log.file.path`<br>`log.logger` |
+| [`os`](0001/os.yml) | `os.name`<br>`os.full` |
+| [`pe`](0001/pe.yml) | `pe.original_file_name` |
+| [`process`](0001/process.yml) | `process.command_line`<br>`process.executable`<br>`process.name`<br>`process.title`<br>`process.working_directory` |
+| [`registry`](0001/registry.yml) | `registry.key`<br>`registry.path`<br>`registry.data.strings` |
+| [`server`](0001/server.yml) | `server.domain`<br>`server.registered_domain` |
+| [`source`](0001/source.yml) | `source.domain`<br>`source.registered_domain` |
+| [`tls`](0001/tls.yml) | `tls.client.issuer`<br>`tls.client.subject`<br>`tls.server.issuer`<br>`tls.server.subject` |
+| [`url`](0001/url.yml) | `url.full`<br>`url.original`<br>`url.path`<br>`url.domain`<br>`url.registered_domain` |
+| [`user`](0001/user.yml) | `user.name`<br>`user.full_name`<br>`user.email`<br>`user.domain` |
+| [`user_agent`](0001/user_agent.yml) | `user_agent.original` |
+| [`x509`](0001/x509.yml) | `x509.issuer.distinguished_name`<br>`x509.subject.distinguished_name` |
+
+The full set of schema files transitioning to `wildcard` is located in the [rfcs/text/0001/](0001/) directory.
+
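As a rough illustration of how these per-fieldset files could be consumed, the sketch below applies a list of type overrides to a base field list. It assumes the override entries have the `name`/`type` shape used in `rfcs/text/0001/`. The base definition, the `apply_type_overrides` helper, and the choice to skip overrides with no matching base field are hypothetical and not part of the ECS tooling. Dropping `ignore_above` follows this RFC's statement that `wildcard` fields will not define that option initially.

```python
from copy import deepcopy

# Hypothetical base definition, loosely modelled on a keyword field.
BASE_FIELDS = [
    {"name": "command_line", "type": "keyword", "ignore_above": 1024},
    {"name": "pid", "type": "long"},
]

# Same shape as the entries in rfcs/text/0001/process.yml (two of them).
OVERRIDES = [
    {"name": "command_line", "type": "wildcard"},
    {"name": "executable", "type": "wildcard"},
]


def apply_type_overrides(base_fields, overrides):
    """Return a copy of base_fields with the overridden types applied."""
    wanted = {o["name"]: o["type"] for o in overrides}
    result = []
    for field in deepcopy(base_fields):
        new_type = wanted.get(field["name"])
        if new_type is not None:
            field["type"] = new_type
            if new_type == "wildcard":
                # No ignore_above on wildcard fields for the initial adoption.
                field.pop("ignore_above", None)
        result.append(field)
    return result


merged = apply_type_overrides(BASE_FIELDS, OVERRIDES)
print(merged)
```

Overrides that name a field missing from the base list (here `executable`) are silently skipped in this sketch; a real tool might prefer to report them.
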
+### Example definition
Here's an example of applying this change to the `process.command_line` field:
-**Definition as of ECS 1.5.0**
+**Definition as of ECS 1.6.0**
Schema definition:
@@ -137,7 +155,7 @@ The following table is a comparison of `wildcard` vs. `keyword` [2]:
| Searched by "all fields" queries | Y | Y |
| Disk costs for mostly unique values | high (see *5) | lower (see *5) |
| Dist costs for mostly identical values | low (see *5) | medium (see *5) |
-| Max character size for a field value | 256 for default JSON string mapping (1024 for ECS), 32766 Luence max | unlimited |
+| Max character size for a field value | 256 for default JSON string mapping (1024 for ECS), 32766 Lucene max | unlimited |
| Supports normalizers in mappings | Y | N |
| Indexing speeds | Fast | Slower (see *6) |
@@ -232,9 +250,11 @@ Additional cases for wildcard searching against command line executions:
## Source data
+### Categories
+
* Windows events
* Sysmon events
* Powershell events
@@ -244,6 +264,138 @@ Stage 1: Provide a high-level description of example sources of data. This does
* Endpoint agents
* Application stack traces
+### Real world examples
+
+Each example in this section contains a partial index mapping, a partial event, and one wildcard search query. Each query uses a leading wildcard on a field expected to have high cardinality, where `wildcard` performs far better than `keyword`.
+
+**Windows registry event from Sysmon:**
+
+```
+### Mapping (partial)
+...
+    "registry" : {
+      "properties" : {
+        "key" : {
+          "type" : "wildcard"
+        }
+      }
+    }
+...
+
+### Event (partial)
+...
+    "registry": {
+      "path": "HKU\\S-1-5-21-1957236100-58272097-297103362-500\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\Advanced\\HideFileExt",
+      "hive": "HKU",
+      "key": "S-1-5-21-1957236100-58272097-297103362-500\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\Advanced\\HideFileExt",
+      "value": "HideFileExt",
+      "data": {
+        "strings": [
+          "1"
+        ],
+        "type": "SZ_DWORD"
+      }
+    }
+...
+
+### Query
+
+GET winlogbeat-*/_search
+{
+  "query": {
+    "wildcard": {
+      "registry.key": {
+        "value": "*CurrentVersion*"
+      }
+    }
+  }
+}
+
+```
+
+**Windows PowerShell logging event:**
+
+```
+### Mapping (partial)
+...
+    "process" : {
+      "properties" : {
+        "command_line" : {
+          "type" : "wildcard",
+          "fields" : {
+            "text" : {
+              "type" : "text",
+              "norms" : false
+            }
+          }
+        }
+      }
+    }
+...
+
+### Event (partial)
+
+    "process": {
+      "pid": 3540,
+      ...
+      "command_line": "C:\\Windows\\System32\\svchost.exe -k netsvcs -p -s NetSetupSvc"
+    }
+
+### Query
+
+GET winlogbeat-*/_search
+{
+  "_source": false,
+  "query": {
+    "wildcard": {
+      "process.command_line": {
+        "value": "*-k netsvcs -p*"
+      }
+    }
+  }
+}
+```
+
+**Wildcard query against the original URL from a Squid web proxy event:**
+
+```
+### Mapping (partial)
+
+...
+    "url" : {
+      "properties" : {
+        "original" : {
+          "type" : "wildcard",
+          "fields" : {
+            "text" : {
+              "type" : "text",
+              "norms" : false
+            }
+          }
+        }
+      }
+    }
+...
+
+### Event (partial)
+
+...
+    "url": {
+      "original": "http://example.com/cart.do?action=view&itemId=HolyGouda",
+      "domain": "example.com"
+    }
+...
+
+### Query
+
+GET filebeat-*/_search
+{
+  "_source": false,
+  "query": {
+    "wildcard": {
+      "url.original": {
+        "value": "*action=view*Gouda"
+      }
+    }
+  }
+}
+```
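To make the matching behaviour of the three queries above easier to reason about, here is a minimal Python sketch that emulates wildcard pattern semantics locally: `*` matches any run of characters, `?` matches exactly one, and the pattern must cover the whole field value. It is an approximation for illustration only and says nothing about how Elasticsearch evaluates the query internally. The helper names are hypothetical, and backslash escaping of `*` and `?` is deliberately left out.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate `*` and `?` into a regular expression; escape everything else."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def wildcard_matches(pattern: str, value: str) -> bool:
    """True when the pattern matches the entire value."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


# (pattern, field value) pairs taken from the three examples above.
EXAMPLES = [
    ("*CurrentVersion*",
     r"S-1-5-21-1957236100-58272097-297103362-500\Software\Microsoft\Windows"
     r"\CurrentVersion\Explorer\Advanced\HideFileExt"),
    ("*-k netsvcs -p*",
     r"C:\Windows\System32\svchost.exe -k netsvcs -p -s NetSetupSvc"),
    ("*action=view*Gouda",
     "http://example.com/cart.do?action=view&itemId=HolyGouda"),
]

for pattern, value in EXAMPLES:
    print(pattern, "->", wildcard_matches(pattern, value))
```

Dropping the leading `*` from any of these patterns makes the match fail, since the pattern then has to match from the first character of the value. That is why each example relies on a leading wildcard.
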
## Scope of impact
@@ -270,7 +422,7 @@ ECS is and will remain an open source licensed project. However, there will be f
## Concerns
### Wildcard and case-insensitivity
@@ -287,6 +439,8 @@ Performance and storage characteristics between wildcard and keyword will be dif
ECS applies the `ignore_above` setting to keyword fields to prevent strings longer than 1024 characters from being indexed or stored. While `ignore_above` can be raised, Lucene implements a term byte-length limit of 32766 which cannot be adjusted. Wildcard supports an unlimited max character size for a field value. The `wildcard` field type will still have the `ignore_above` option available, and a reasonable limit may be need applied to mitigate unexpected side-effects.
+For the initial adoption into ECS, `wildcard` fields will not have an `ignore_above` option defined.
+
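The difference between the limits described above can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch code: it assumes `ignore_above` is compared against a character count (approximated here with Python's `len`) while the Lucene limit is compared against the UTF-8 byte length. It also assumes an over-long keyword term is rejected at index time. The function names and outcome labels are hypothetical.

```python
ECS_IGNORE_ABOVE = 1024        # ECS default for keyword fields (characters)
LUCENE_MAX_TERM_BYTES = 32766  # fixed Lucene term limit (bytes)


def keyword_outcome(value: str, ignore_above: int = ECS_IGNORE_ABOVE) -> str:
    """Model what happens to a value sent to a keyword field."""
    if len(value) > ignore_above:
        return "ignored"   # not indexed or stored
    if len(value.encode("utf-8")) > LUCENE_MAX_TERM_BYTES:
        return "rejected"  # assumption: exceeds the Lucene term limit
    return "indexed"


def wildcard_outcome(value: str, ignore_above=None) -> str:
    """Model a wildcard field; no ignore_above is set for the initial adoption."""
    if ignore_above is not None and len(value) > ignore_above:
        return "ignored"
    return "indexed"


long_command_line = "a" * 40000
print(keyword_outcome(long_command_line))                      # ECS default applies
print(keyword_outcome(long_command_line, ignore_above=50000))  # raised limit
print(wildcard_outcome(long_command_line))
```

In this model, raising `ignore_above` on a keyword field past the Lucene limit trades silently skipped values for outright failures, whereas the wildcard field accepts the value either way.
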
### Licensing
Until now ECS has relied only on OSS licensed features, but ECS will also support Elastic licensed features. The ECS project will remain OSS licensed with the schema implementing Elastic licensed features as part of the specification. When ECS adopts a feature available only under a license, it will be noted in the documentation. ECS plans to provide tooling options which continue to support OSS consumers of ECS and the Elastic Stack.
@@ -295,6 +449,23 @@ Until now ECS has relied only on OSS licensed features, but ECS will also suppor
A data shipper which uses the `wildcard` field type may need to verify that the configured output Elasticsearch destination can support it (>= 7.9.0). For example, if a future version of Beats adopts `wildcard` in index mappings, Beats would may need to gracefully handle a scenario where the targeted Elasticsearch instance doesn't support the data type.
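One way a shipper might perform such a check is sketched below in Python. The version string is assumed to have already been obtained from the target cluster; how it is fetched is out of scope here. Falling back to `keyword` on older versions is only one possible strategy and is not prescribed by this RFC. All function names are hypothetical.

```python
MIN_WILDCARD_VERSION = (7, 9, 0)  # first Elasticsearch release with wildcard


def parse_version(version: str) -> tuple:
    """Turn '7.10.2' or '8.0.0-SNAPSHOT' into a comparable (major, minor, patch)."""
    core = version.split("-", 1)[0]  # drop a suffix such as '-SNAPSHOT'
    parts = [int(part) for part in core.split(".")]
    parts += [0] * (3 - len(parts))  # pad '7.9' to (7, 9, 0)
    return tuple(parts[:3])


def supports_wildcard(version: str) -> bool:
    """True when the given Elasticsearch version is at least 7.9.0."""
    return parse_version(version) >= MIN_WILDCARD_VERSION


def field_type_for(version: str) -> str:
    """Pick the mapping type to use; the keyword fallback is an assumption."""
    return "wildcard" if supports_wildcard(version) else "keyword"


for v in ("7.8.1", "7.9.0", "7.10.2"):
    print(v, field_type_for(v))
```

Comparing integer tuples rather than raw strings matters here: as strings, `"7.10.2"` sorts before `"7.9.0"`.
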
+### Text fields migrating to wildcard
+
+ECS currently has two `text` fields that would likely benefit from migrating to `wildcard`.
+Doing so on the canonical field (as opposed to adding a multi-field) would be a breaking change.
+However, adding a `.wildcard` multi-field may cause confusion, as these would be the only
+places where `wildcard` appears as a multi-field.
+
+The fields are:
+
+- `message`
+- `error.message`
+
+Paradoxically, in some cases these fields also benefit from the `text` data type.
+A prime example is the main message of a Windows Event Log entry, which is stored in the `message` field.
+
+The situation is captured here for addressing at a later stage.
+
## People
The following are the people that consulted on the contents of this RFC.
@@ -326,3 +497,4 @@ The following are the people that consulted on the contents of this RFC.
* Stage 0: https://github.com/elastic/ecs/pull/890
* Stage 1: https://github.com/elastic/ecs/pull/904
+* Stage 2: https://github.com/elastic/ecs/pull/970
diff --git a/rfcs/text/0001/agent.yml b/rfcs/text/0001/agent.yml
new file mode 100644
index 0000000000..d09e77111d
--- /dev/null
+++ b/rfcs/text/0001/agent.yml
@@ -0,0 +1,5 @@
+---
+- name: agent
+  fields:
+    - name: build.original
+      type: wildcard
diff --git a/rfcs/text/0001/as.yml b/rfcs/text/0001/as.yml
new file mode 100644
index 0000000000..96cf45621c
--- /dev/null
+++ b/rfcs/text/0001/as.yml
@@ -0,0 +1,5 @@
+---
+- name: as
+  fields:
+    - name: organization.name
+      type: wildcard
diff --git a/rfcs/text/0001/client.yml b/rfcs/text/0001/client.yml
new file mode 100644
index 0000000000..14ed3a9a37
--- /dev/null
+++ b/rfcs/text/0001/client.yml
@@ -0,0 +1,7 @@
+---
+  - name: client
+    fields:
+      - name: domain
+        type: wildcard
+      - name: registered_domain
+        type: wildcard
diff --git a/rfcs/text/0001/destination.yml b/rfcs/text/0001/destination.yml
new file mode 100644
index 0000000000..d64a84c6be
--- /dev/null
+++ b/rfcs/text/0001/destination.yml
@@ -0,0 +1,7 @@
+---
+  - name: destination
+    fields:
+      - name: domain
+        type: wildcard
+      - name: registered_domain
+        type: wildcard
diff --git a/rfcs/text/0001/dns.yml b/rfcs/text/0001/dns.yml
new file mode 100644
index 0000000000..54f9ccd69a
--- /dev/null
+++ b/rfcs/text/0001/dns.yml
@@ -0,0 +1,7 @@
+---
+- name: dns
+  fields:
+    - name: question.name
+      type: wildcard
+    - name: answers.data
+      type: wildcard
diff --git a/rfcs/text/0001/error.yml b/rfcs/text/0001/error.yml
new file mode 100644
index 0000000000..f2004d3fe0
--- /dev/null
+++ b/rfcs/text/0001/error.yml
@@ -0,0 +1,9 @@
+---
+- name: error
+  fields:
+    - name: stack_trace
+      index: true
+      type: wildcard
+
+    - name: type
+      type: wildcard
diff --git a/rfcs/text/0001/event.yml b/rfcs/text/0001/event.yml
new file mode 100644
index 0000000000..07daa3ac87
--- /dev/null
+++ b/rfcs/text/0001/event.yml
@@ -0,0 +1,5 @@
+---
+- name: event
+  fields:
+    - name: original
+      type: wildcard
diff --git a/rfcs/text/0001/file.yml b/rfcs/text/0001/file.yml
new file mode 100644
index 0000000000..f4938d38be
--- /dev/null
+++ b/rfcs/text/0001/file.yml
@@ -0,0 +1,9 @@
+---
+- name: file
+  fields:
+    - name: directory
+      type: wildcard
+    - name: path
+      type: wildcard
+    - name: target_path
+      type: wildcard
diff --git a/rfcs/text/0001/geo.yml b/rfcs/text/0001/geo.yml
new file mode 100644
index 0000000000..d3445a5a2b
--- /dev/null
+++ b/rfcs/text/0001/geo.yml
@@ -0,0 +1,5 @@
+---
+  - name: geo
+    fields:
+      - name: name
+        type: wildcard
diff --git a/rfcs/text/0001/host.yml b/rfcs/text/0001/host.yml
new file mode 100644
index 0000000000..91f3d1bbc2
--- /dev/null
+++ b/rfcs/text/0001/host.yml
@@ -0,0 +1,4 @@
+- name: host
+  fields:
+    - name: hostname
+      type: wildcard
diff --git a/rfcs/text/0001/http.yml b/rfcs/text/0001/http.yml
new file mode 100644
index 0000000000..1722cdc5e7
--- /dev/null
+++ b/rfcs/text/0001/http.yml
@@ -0,0 +1,9 @@
+---
+- name: http
+  fields:
+    - name: request.body.content
+      type: wildcard
+    - name: request.referrer
+      type: wildcard
+    - name: response.body.content
+      type: wildcard
diff --git a/rfcs/text/0001/log.yml b/rfcs/text/0001/log.yml
new file mode 100644
index 0000000000..8a2f2dd397
--- /dev/null
+++ b/rfcs/text/0001/log.yml
@@ -0,0 +1,7 @@
+---
+- name: log
+  fields:
+    - name: file.path
+      type: wildcard
+    - name: logger
+      type: wildcard
diff --git a/rfcs/text/0001/organization.yml b/rfcs/text/0001/organization.yml
new file mode 100644
index 0000000000..594581413b
--- /dev/null
+++ b/rfcs/text/0001/organization.yml
@@ -0,0 +1,5 @@
+---
+- name: organization
+  fields:
+    - name: name
+      type: wildcard
diff --git a/rfcs/text/0001/os.yml b/rfcs/text/0001/os.yml
new file mode 100644
index 0000000000..ec9d71a79c
--- /dev/null
+++ b/rfcs/text/0001/os.yml
@@ -0,0 +1,7 @@
+---
+- name: os
+  fields:
+    - name: name
+      type: wildcard
+    - name: full
+      type: wildcard
diff --git a/rfcs/text/0001/pe.yml b/rfcs/text/0001/pe.yml
new file mode 100644
index 0000000000..6e729b39f4
--- /dev/null
+++ b/rfcs/text/0001/pe.yml
@@ -0,0 +1,5 @@
+---
+  - name: pe
+    fields:
+      - name: original_file_name
+        type: wildcard
diff --git a/rfcs/text/0001/process.yml b/rfcs/text/0001/process.yml
new file mode 100644
index 0000000000..da492e4564
--- /dev/null
+++ b/rfcs/text/0001/process.yml
@@ -0,0 +1,13 @@
+---
+- name: process
+  fields:
+    - name: command_line
+      type: wildcard
+    - name: executable
+      type: wildcard
+    - name: name
+      type: wildcard
+    - name: title
+      type: wildcard
+    - name: working_directory
+      type: wildcard
diff --git a/rfcs/text/0001/registry.yml b/rfcs/text/0001/registry.yml
new file mode 100644
index 0000000000..66f6f6b22c
--- /dev/null
+++ b/rfcs/text/0001/registry.yml
@@ -0,0 +1,9 @@
+---
+- name: registry
+  fields:
+    - name: key
+      type: wildcard
+    - name: path
+      type: wildcard
+    - name: data.strings
+      type: wildcard
diff --git a/rfcs/text/0001/server.yml b/rfcs/text/0001/server.yml
new file mode 100644
index 0000000000..70c285f374
--- /dev/null
+++ b/rfcs/text/0001/server.yml
@@ -0,0 +1,7 @@
+---
+  - name: server
+    fields:
+      - name: domain
+        type: wildcard
+      - name: registered_domain
+        type: wildcard
diff --git a/rfcs/text/0001/source.yml b/rfcs/text/0001/source.yml
new file mode 100644
index 0000000000..d810a6cb79
--- /dev/null
+++ b/rfcs/text/0001/source.yml
@@ -0,0 +1,7 @@
+---
+- name: source
+  fields:
+    - name: domain
+      type: wildcard
+    - name: registered_domain
+      type: wildcard
diff --git a/rfcs/text/0001/tls.yml b/rfcs/text/0001/tls.yml
new file mode 100644
index 0000000000..4f5378a313
--- /dev/null
+++ b/rfcs/text/0001/tls.yml
@@ -0,0 +1,11 @@
+---
+- name: tls
+  fields:
+    - name: client.issuer
+      type: wildcard
+    - name: client.subject
+      type: wildcard
+    - name: server.issuer
+      type: wildcard
+    - name: server.subject
+      type: wildcard
diff --git a/rfcs/text/0001/url.yml b/rfcs/text/0001/url.yml
new file mode 100644
index 0000000000..0d5f66c36a
--- /dev/null
+++ b/rfcs/text/0001/url.yml
@@ -0,0 +1,13 @@
+---
+- name: url
+  fields:
+    - name: original
+      type: wildcard
+    - name: full
+      type: wildcard
+    - name: path
+      type: wildcard
+    - name: domain
+      type: wildcard
+    - name: registered_domain
+      type: wildcard
diff --git a/rfcs/text/0001/user.yml b/rfcs/text/0001/user.yml
new file mode 100644
index 0000000000..89e182fbee
--- /dev/null
+++ b/rfcs/text/0001/user.yml
@@ -0,0 +1,9 @@
+---
+- name: user
+  fields:
+    - name: name
+      type: wildcard
+    - name: full_name
+      type: wildcard
+    - name: email
+      type: wildcard
diff --git a/rfcs/text/0001/user_agent.yml b/rfcs/text/0001/user_agent.yml
new file mode 100644
index 0000000000..c413a9d702
--- /dev/null
+++ b/rfcs/text/0001/user_agent.yml
@@ -0,0 +1,5 @@
+---
+- name: user_agent
+  fields:
+    - name: original
+      type: wildcard
diff --git a/rfcs/text/0001/x509.yml b/rfcs/text/0001/x509.yml
new file mode 100644
index 0000000000..d1c7d8af6b
--- /dev/null
+++ b/rfcs/text/0001/x509.yml
@@ -0,0 +1,7 @@
+---
+- name: x509
+  fields:
+    - name: issuer.distinguished_name
+      type: wildcard
+    - name: subject.distinguished_name
+      type: wildcard