elastic · ruflin · Aug 26, 2019 · Sep 5, 2019 · Sep 5, 2019 · Sep 5, 2019
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,4 @@
+.DS_Store
+
+build
+.idea
diff --git a/integration/README.md b/integration/README.md
@@ -0,0 +1,131 @@
+**STATUS: Proposal**
+
+# Overview
+
+This describes how integrations are built and then packaged for the [integrations registry](https://github.com/elastic/integrations-registry). The format described in this document focuses on building and testing integrations, which are then packaged with tools.
+
+The format of a package for the registry is described [here](https://github.com/elastic/integrations-registry#package-structure). For the development of a package, however, this is not enough, as we also need to test datasets, which requires additional metadata. The proposed structure allows testing metricsets / filesets in a similar way as we do today. With `mage package`, all assets are packaged together to conform to the package structure.
+
+# Definitions
+
+**Integration package**: An integration package is a packaged version of an integration. This is what is served by the integrations registry. An example of what such a package looks like can be found here. Note that the shipped package is not identical to the format described here, which is optimised for development and testing.
+
+**Integration**: The definition of an integration, consisting of a manifest and several datasets that contain the assets for the Elastic Stack.
+
+**Dataset**: A set of assets grouped together for testing and development purposes.
+
+
+# Integration files structure
+
+As with modules today, the base structure is split into integrations, each of which contains multiple datasets. All assets live inside a dataset, and the dataset file structure is very similar to the final structure of a package. The structure looks as follows:
+
+```
+{integration-name}/dataset/{dataset-name}/{package-structure}
+```
+
+The top level of each integration contains the `manifest.yml`, `LICENSE.txt`, and `changelog.yml` files.
+
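A minimal sketch of how a collection script might split an asset path that follows the layout above. The names `AssetPath` and `split_asset_path` are hypothetical and not part of the proposal; only the `{integration-name}/dataset/{dataset-name}/{package-structure}` shape comes from the README.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class AssetPath(NamedTuple):
    integration: str
    dataset: str
    package_path: str  # asset path relative to the dataset directory


def split_asset_path(path: str) -> AssetPath:
    """Split '{integration}/dataset/{dataset}/{package-structure}' into its parts."""
    parts = PurePosixPath(path).parts
    # The second component must be the literal directory name "dataset".
    if len(parts) < 4 or parts[1] != "dataset":
        raise ValueError(f"not a dataset asset path: {path!r}")
    return AssetPath(parts[0], parts[2], "/".join(parts[3:]))


asset = split_asset_path("apache/dataset/apache.access/fields/fields.yml")
print(asset.integration, asset.dataset, asset.package_path)
```

Files that live on the top level of an integration, such as `manifest.yml`, are rejected by this helper on purpose, since they are not dataset assets.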
+## Assets
+
+All existing assets are described below. Assets that are already defined in [ASSETS.md](https://github.com/elastic/integrations-registry/blob/master/ASSETS.md) as part of the package definition are not repeated here.
+
+### manifest.yml
+
+The manifest contains all the information about the integration and follows the logic of the [package manifest](https://github.com/elastic/integrations-registry/blob/master/ASSETS.md#general-assets). The manifest might be enriched with further information from its datasets during packaging. Verifications such as version compatibility checks will also be done.
+
+It contains a few additional fields which are not part of the package:
+
+**datasets**
+
+This is a list of datasets this integration depends on. As packages today cannot depend on other packages, it is important to have a dependency feature when building an integration so that assets do not have to be duplicated. Examples are the fields for ECS or the fields specific to Filebeat. An example is below:
+
+```
+datasets:
+    - name: "ecs:ecs"
+    - name: "filebeat:filebeat"
+```
+
+No dataset versions are mentioned above. It's up to the implementer to increase the version number of the integration if a dependency changes. Alternatively, we could use versions which are then validated, with an error thrown if they are no longer correct. This would probably be more dev friendly but more complex to implement.
+
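A minimal sketch of resolving the `datasets` entries to source directories. It assumes a reference has the form `integration:dataset` and maps to `{integration}/dataset/{dataset}`; the README does not spell this out, and `dataset_dir` is a hypothetical helper name.

```python
def dataset_dir(reference: str) -> str:
    """Map a reference such as 'ecs:ecs' to its assumed source directory."""
    integration, sep, dataset = reference.partition(":")
    if not sep or not integration or not dataset:
        raise ValueError(f"expected 'integration:dataset', got {reference!r}")
    return f"{integration}/dataset/{dataset}"


# The manifest is shown as an already-parsed dict to keep the sketch stdlib-only.
manifest = {"datasets": [{"name": "ecs:ecs"}, {"name": "filebeat:filebeat"}]}
dirs = [dataset_dir(entry["name"]) for entry in manifest["datasets"]]
print(dirs)
```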
+**Package config**
+
+Not all integrations in this repo need packaging. For example, the Filebeat and ECS integration directories are only placeholders for assets and will not become integrations themselves. To prevent these from being packaged, `package: false` can be set.
+
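A minimal sketch of how packaging could skip placeholder integrations. It assumes that a manifest without a `package` key is packaged by default, which the proposal does not state; `packageable` and the `apache` version value are hypothetical.

```python
from typing import Dict, List


def packageable(manifests: Dict[str, dict]) -> List[str]:
    """Return the integrations to package, skipping those with `package: false`."""
    # Assumption: a missing `package` key means the integration is packaged.
    return sorted(name for name, m in manifests.items() if m.get("package", True))


manifests = {
    "apache": {"version": "1.2.3"},   # hypothetical manifest content
    "ecs": {"package": False},        # placeholder for shared ECS fields
    "filebeat": {"package": False},   # placeholder for Filebeat-specific fields
}
print(packageable(manifests))  # prints ['apache']
```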
+### changelog.yml
+
+Every integration should keep a changelog so that when a user upgrades, we can show what changed. If a dependency of an integration changes, it's up to the integration to add these items to the changelog if needed.
+
+The changelog is in a structured format, so it can be read and visualised in the package manager.
+
+```
+# The changelog.yml contains all the changes made to the integration and its datasets.
+# If a dataset is adjusted, the change should also be added to this changelog.
+# The changelog is in a structured format so the order does not matter and it can be used
+# for visualisation in the UI.
+
+  # Description of the change
+- description: Added dataset foo
+
+  # The versions here should follow semver
+  version: 1.2.3
+
+  # The options here are:
+  # - breaking change
+  # - bugfix
+  # - Added
+  # - Deprecated
+  # - Known Issue
+  type: bugfix
+
+  # Link to a Github issue or a PR
+  link: https://github.com/elastic/integrations
+```
+
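A minimal sketch of validating one changelog entry against the rules in the comments above. It treats all four fields as required and uses a simplified semver check without pre-release or build metadata; both are assumptions, and `validate_entry` is a hypothetical name.

```python
import re
from typing import List

# Simplified semver check: MAJOR.MINOR.PATCH only.
SEMVER = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")

# Change types as listed in the comment of the example changelog.yml.
CHANGE_TYPES = {"breaking change", "bugfix", "Added", "Deprecated", "Known Issue"}


def validate_entry(entry: dict) -> List[str]:
    """Return the problems found in one changelog entry (empty list if none)."""
    problems = []
    for key in ("description", "version", "type", "link"):
        if key not in entry:
            problems.append(f"missing field: {key}")
    if "version" in entry and not SEMVER.match(str(entry["version"])):
        problems.append(f"version is not MAJOR.MINOR.PATCH: {entry['version']}")
    if "type" in entry and entry["type"] not in CHANGE_TYPES:
        problems.append(f"unknown type: {entry['type']}")
    return problems


entry = {
    "description": "Added dataset foo",
    "version": "1.2.3",
    "type": "bugfix",
    "link": "https://github.com/elastic/integrations",
}
print(validate_entry(entry))  # prints []
```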
+### testing.yml
+
+The testing.yml can contain information about how the integration should be tested. So far the focus is on testing datasets, so this file might not be necessary. It could be included by datasets to share common testing info.
+
+### docs/README.md
+
+README document which contains all the documentation about the integration. It is possible that each dataset has its own additional documentation. It is expected that this will be just appended to the main README on packaging.
+
+### img/
+
+Directory for all the icons, screenshots and potentially videos. 
+
+Question: Should we name this media?
+
+### dataset/{dataset}/testing.yml
+
+This yaml file should contain all the configuration on how a dataset can be tested. It might contain which services have to be booted up for testing and how the tests should be run.
+
+### dataset/{dataset}/testdata
+
+All the data used for testing, for example sample logs and the output generated from them.
+
+## Reusable content
+
+Any dataset can be reused by just referencing it in the manifest. But some of these reused assets don't need packaging on their own. These go into integration directories like `filebeat` or `ecs` as datasets, with `package: false` set. This allows reusing these assets without also getting a package for them. It would be possible to store these assets outside the "integration" directory for better separation, but implementing the collection script has shown that the script stays much simpler this way.
+
+
+## Versioning
+
+The version of a package is taken from the manifest file. If the CoreDNS package contains `version: 1.2.3`, the package `coredns-1.2.3` is built. For now, no exact version of a dataset is specified. If a dataset is updated, the next time packaging is run for an integration, it will pull in the newest assets. So if there is a breaking change in a dataset, it's up to the integration package dev to decide if this is needed. To reduce errors, we could introduce exact specification of a dataset version. This would mean that when a dataset version is updated, all datasets which reference it must be updated too. As everything is in one repository, this shouldn't be too much hassle and would make things more explicit.
+
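A minimal sketch of the two ideas in this section: deriving the package name from the manifest, and the optional check for pinned dataset versions. The manifest key `name` and the helper `stale_pins` are assumptions for illustration; the proposal only fixes the `coredns-1.2.3` naming.

```python
from typing import Dict, List


def package_name(manifest: dict) -> str:
    """Build '<name>-<version>' from manifest fields (the `name` key is assumed)."""
    return f"{manifest['name']}-{manifest['version']}"


def stale_pins(pinned: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return dataset references whose pinned version differs from the current one."""
    return sorted(ref for ref, version in pinned.items() if current.get(ref) != version)


print(package_name({"name": "coredns", "version": "1.2.3"}))  # prints coredns-1.2.3
```

With exact pins, packaging could fail whenever `stale_pins` returns a non-empty list, which makes a dataset update explicit instead of silently pulling in the newest assets.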
+### Backports
+
+I expect most integrations to only move forward and to rarely have breaking changes. Because of this, no backports to branches etc. are needed. In case they are needed, there are several options:
+
+* Have a branch for this integration. Packaging will just work as is.
+* Have this integration in a separate repo.
+
+
+## Conversion of modules to integrations
+
+As the data of a module and an integration stay mostly the same, the transformation from a module to an integration can mostly be automated. I started to play around with some tooling to convert existing modules to integrations, but I would prefer to delay the discussion around this until we have agreed on the format for building integrations.
+
+# Questions
+
+* Why do we store the main assets of an integration in, for example, `coredns/dataset/coredns` instead of just at the top level?
+  * One of the main reasons is that it heavily simplifies the code for collecting assets, as there is just one place to look and no checks have to be made for assets on the top level. It also prevents potential directory name conflicts.
+
diff --git a/integration/apache/CHANGELOG.md b/integration/apache/CHANGELOG.md
@@ -0,0 +1,4 @@
+
+
+* Entries for each version -> fix can go into multiple versions?
+* How will the module read these changelog entries? Should we use yaml?
diff --git a/integration/apache/LICENSE.txt b/integration/apache/LICENSE.txt
diff --git a/integration/apache/dataset/apache.access/dataset.yml b/integration/apache/dataset/apache.access/dataset.yml
@@ -0,0 +1,67 @@
+name: apache.access
+version: master
+testing:
+  files:
+    - testfile.log
+
+# TODO: What should we call this?
+
+
+datasource:
+  apache.access:
+    vars:
+      - name: paths
+        default:
+          - /var/log/apache2/access.log*
+          - /var/log/apache2/other_vhosts_access.log*
+        os.darwin:
+          - /usr/local/var/log/apache2/access_log*
+        os.windows:
+          - "C:/tools/Apache/httpd-2.*/Apache24/logs/access.log*"
+          - "C:/Program Files/Apache Software Foundation/Apache2.*/logs/access.log*"
+
+        # TODO: Configure input for the module. This can be overwritten by central management.
+        input:
+
+ingest_pipeline: ingest/default.json
+input: config/access.yml
+
+# This becomes a requirement of the module
+elasticsearch:
+  requires.processors:
+    - name: user_agent
+      plugin: ingest-user-agent
+    - name: geoip
+      plugin: ingest-geoip
+
+
+# From PH
+configs:
+  include:
+    - package: common/logs.yml:configs
+  # Note I need to play with the syntax here, but I think that would work.
+  # it will be up to the building to create the right input but we could use some helpers
+  # methods to fix that.
+  transport:
+    type: nested
+    default: file
+    validations:
+      enum: ["file", "tcp"]
+      required: true
+  transport.type:
+    type: string
+    default: "file"
+    enum: ["file", "tcp"]
+  transport.file:
+    paths:
+      type: Array<PATH>
+      default: ["%base_path/access.log"]
+      validations:
+        presence: true
+  transport.tcp:
+    port:
+      type: range # We can have Port type which has the default validations.
+      default: 8000
+      validations:
+        min: 0
+        max: 65535
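A minimal sketch of how the `paths` variable above could be resolved per operating system. It assumes an `os.<name>` key overrides `default` when it matches the target OS; the helper `resolve_default` is hypothetical and the variable is shown as an already-parsed dict.

```python
from typing import List


def resolve_default(var: dict, goos: str) -> List[str]:
    """Pick the `os.<goos>` value if present, otherwise fall back to `default`."""
    return var.get(f"os.{goos}", var["default"])


paths_var = {
    "name": "paths",
    "default": [
        "/var/log/apache2/access.log*",
        "/var/log/apache2/other_vhosts_access.log*",
    ],
    "os.darwin": ["/usr/local/var/log/apache2/access_log*"],
    "os.windows": [
        "C:/tools/Apache/httpd-2.*/Apache24/logs/access.log*",
        "C:/Program Files/Apache Software Foundation/Apache2.*/logs/access.log*",
    ],
}

print(resolve_default(paths_var, "darwin"))
print(resolve_default(paths_var, "linux"))  # no os.linux key, so `default` is used
```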
diff --git a/integration/apache/dataset/apache.access/elasticsearch/ingest-pipeline/default.json b/integration/apache/dataset/apache.access/elasticsearch/ingest-pipeline/default.json
@@ -0,0 +1,73 @@
+{
+    "description": "Pipeline for parsing Apache HTTP Server access logs. Requires the geoip and user_agent plugins.",
+    "on_failure": [
+        {
+            "set": {
+                "field": "error.message",
+                "value": "{{ _ingest.on_failure_message }}"
+            }
+        }
+    ],
+    "processors": [
+        {
+            "grok": {
+                "field": "message",
+                "ignore_missing": true,
+                "patterns": [
+                    "%{IPORHOST:source.address} - %{DATA:user.name} \\[%{HTTPDATE:apache.access.time}\\] \"(?:%{WORD:http.request.method} %{DATA:url.original} HTTP/%{NUMBER:http.version}|-)?\" %{NUMBER:http.response.status_code:long} (?:%{NUMBER:http.response.body.bytes:long}|-)( \"%{DATA:http.request.referrer}\")?( \"%{DATA:user_agent.original}\")?",
+                    "%{IPORHOST:source.address} - %{DATA:user.name} \\[%{HTTPDATE:apache.access.time}\\] \"-\" %{NUMBER:http.response.status_code:long} -",
+                    "\\[%{HTTPDATE:apache.access.time}\\] %{IPORHOST:source.address} %{DATA:apache.access.ssl.protocol} %{DATA:apache.access.ssl.cipher} \"%{WORD:http.request.method} %{DATA:url.original} HTTP/%{NUMBER:http.version}\" %{NUMBER:http.response.body.bytes:long}"
+                ]
+            }
+        },
+        {
+            "remove": {
+                "field": "message"
+            }
+        },
+        {
+            "grok": {
+                "field": "source.address",
+                "ignore_missing": true,
+                "patterns": [
+                    "^(%{IP:source.ip}|%{HOSTNAME:source.domain})$"
+                ]
+            }
+        },
+        {
+            "rename": {
+                "field": "@timestamp",
+                "target_field": "event.created"
+            }
+        },
+        {
+            "date": {
+                "field": "apache.access.time",
+                "formats": [
+                    "dd/MMM/yyyy:H:m:s Z"
+                ],
+                "ignore_failure": true,
+                "target_field": "@timestamp"
+            }
+        },
+        {
+            "remove": {
+                "field": "apache.access.time",
+                "ignore_failure": true
+            }
+        },
+        {
+            "user_agent": {
+                "field": "user_agent.original",
+                "ignore_failure": true
+            }
+        },
+        {
+            "geoip": {
+                "field": "source.ip",
+                "ignore_missing": true,
+                "target_field": "source.geo"
+            }
+        }
+    ]
+}
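A rough Python approximation of what the first grok pattern and the `date` processor in this pipeline do, useful for checking test log lines by hand. This is a sketch and not the grok implementation: the regex is simplified, the sample line is made up, and the `user_agent`, `geoip`, and IP/hostname split steps are left out. Unlike the pipeline, it raises on an unparsable date instead of ignoring the failure.

```python
import re
from datetime import datetime
from typing import Optional

ACCESS_LINE = re.compile(
    r'(?P<source_address>\S+) - (?P<user_name>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?:(?P<method>\w+) (?P<url>\S+) HTTP/(?P<http_version>[\d.]+)|-)?" '
    r'(?P<status>\d+) (?:(?P<bytes>\d+)|-)'
    r'(?: "(?P<referrer>[^"]*)")?(?: "(?P<user_agent>[^"]*)")?'
)

# Python equivalent of the HTTPDATE layout, e.g. 26/Aug/2019:10:15:30 +0200
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

OPTIONAL_FIELDS = {
    "method": "http.request.method",
    "url": "url.original",
    "http_version": "http.version",
    "referrer": "http.request.referrer",
    "user_agent": "user_agent.original",
}


def parse_access_line(line: str) -> Optional[dict]:
    """Return a flat dict of field names to values, or None if the line does not match."""
    m = ACCESS_LINE.match(line)
    if m is None:
        return None
    doc = {
        "source.address": m["source_address"],
        "user.name": m["user_name"],
        "http.response.status_code": int(m["status"]),
        "@timestamp": datetime.strptime(m["time"], TIME_FORMAT).isoformat(),
    }
    if m["bytes"] is not None:
        doc["http.response.body.bytes"] = int(m["bytes"])
    for group, field in OPTIONAL_FIELDS.items():
        if m[group] is not None:
            doc[field] = m[group]
    return doc


sample = '127.0.0.1 - - [26/Aug/2019:10:15:30 +0200] "GET /index.html HTTP/1.1" 200 612 "-" "curl/7.64.0"'
print(parse_access_line(sample))
```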
diff --git a/integration/apache/dataset/apache.access/elasticsearch/template/default.json b/integration/apache/dataset/apache.access/elasticsearch/template/default.json
@@ -0,0 +1,23 @@
+{
+    "index_patterns": [
+        "te*",
+        "bar*"
+    ],
+    "mappings": {
+        "_source": {
+            "enabled": false
+        },
+        "properties": {
+            "created_at": {
+                "format": "EEE MMM dd HH:mm:ss Z yyyy",
+                "type": "date"
+            },
+            "host_name": {
+                "type": "keyword"
+            }
+        }
+    },
+    "settings": {
+        "number_of_shards": 1
+    }
+}
diff --git a/integration/apache/dataset/apache.access/fields/fields.yml b/integration/apache/dataset/apache.access/fields/fields.yml
@@ -0,0 +1,14 @@
+- name: access
+  type: group
+  description: >
+    Contains fields for the Apache HTTP Server access logs.
+  fields:
+    - name: ssl.protocol
+      type: keyword
+      description: >
+        SSL protocol version.
+
+    - name: ssl.cipher
+      type: keyword
+      description: >
+        SSL cipher name.
diff --git a/integration/apache/dataset/apache.access/filebeat/input.yml b/integration/apache/dataset/apache.access/filebeat/input.yml
@@ -0,0 +1,6 @@
+type: log
+paths:
+{{ range $i, $path := .paths }}
+ - {{$path}}
+{{ end }}
+exclude_files: [".gz$"]
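A small Python stand-in for what this template renders to, given a list of paths. The file itself uses Go template syntax; this sketch only mimics the output and drops the blank lines that the `range` and `end` actions would leave behind. `render_input` is a hypothetical helper.

```python
from typing import List


def render_input(paths: List[str]) -> str:
    """Render the Filebeat log input config for the given paths."""
    lines = ["type: log", "paths:"]
    lines += [f" - {path}" for path in paths]
    lines.append('exclude_files: [".gz$"]')
    return "\n".join(lines) + "\n"


print(render_input(["/var/log/apache2/access.log*"]))
```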
diff --git a/integration/apache/dataset/apache.access/filebeat/module.yml b/integration/apache/dataset/apache.access/filebeat/module.yml
@@ -0,0 +1,8 @@
+- module: apache
+  # Access logs
+  access:
+    enabled: true
+
+    # Set custom paths for the log files. If left empty,
+    # Filebeat will choose the paths depending on your OS.
+    #var.paths: