POC/Discuss: Building integrations package #3
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| .DS_Store | ||
|
|
||
| build | ||
| .idea |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,146 @@ | ||
| **STATUS: Proposal** | ||
|
|
||
| # Overview | ||
|
|
||
| This describes how integrations are built and then packaged for the [integrations registry](https://github.com/elastic/integrations-registry). The format described in this document focuses on allowing integrations to be built and tested, and then packaged with tooling. | ||
|
|
||
| The format of a package for the registry can be found [here](https://github.com/elastic/integrations-registry#package-structure). But for the development of a package this is not enough, as we also need testing of datasets, which requires additional metadata. The proposed structure allows testing metricsets / filesets in a similar way as we do today. With `mage package`, all assets are packaged together to conform to the package structure. | ||
|
|
||
| # Definitions | ||
|
|
||
| **Integration package**: An integration package is a packaged version of an integration. This is what is served by the integrations registry. An example of what such a package looks like can be found [here](https://github.com/elastic/integrations-registry#package-structure). It’s important to state that the shipped package is not identical to the format described here, which is optimised for development and testing. | ||
|
|
||
| **Integration**: An integration definition consisting of a manifest and several datasets that contain the assets for the Elastic Stack. | ||
|
|
||
| **Dataset**: A group of assets that are kept together for testing and development purposes. | ||
|
|
||
|
|
||
| # Integration files structure | ||
|
|
||
| As with modules today, the base structure of the implementation is split into integrations, each of which contains multiple datasets. All assets live inside a dataset, and the dataset file structure is very similar to the final structure of a package. The structure looks as follows: | ||
|
|
||
| ``` | ||
| {integration-name}/dataset/{dataset-name}/{package-structure} | ||
|
||
| ``` | ||
|
|
||
| An example of the above for an apache.access log ingest pipeline is: | ||
|
|
||
| ``` | ||
| apache/dataset/access/elasticsearch/ingest-pipeline/default.json | ||
| ``` | ||
|
|
||
| The top level of each integration contains the `manifest.yml`, `LICENSE.txt`, and `changelog.yml` files. | ||
|
Contributor (author): @jsoriano I guess we don't have to add a
Member: I guess that we'd need a license per repository, and it should be added to the generated packages, yes.
Contributor (author): We probably need more than one ;-) |
||
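Putting the structure above together, a hypothetical layout for an `apache` integration could look like the following sketch (only the files and directories named in this document are given; everything else is an assumption):

```
apache/
├── manifest.yml
├── LICENSE.txt
├── changelog.yml
├── docs/
│   └── README.md
├── img/
└── dataset/
    └── access/
        ├── fields/fields.yml
        ├── testing.yml
        ├── testdata/
        └── elasticsearch/ingest-pipeline/default.json
```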
|
|
||
| ## Assets | ||
|
|
||
| All existing assets are described below. Assets already defined in [ASSETS.md](https://github.com/elastic/integrations-registry/blob/master/ASSETS.md) as part of the package definition are not repeated here. | ||
|
|
||
| ### manifest.yml | ||
|
||
|
|
||
| The manifest contains all the information about the integration and follows the logic of the [package manifest](https://github.com/elastic/integrations-registry/blob/master/ASSETS.md#general-assets). The manifest might be enriched with further information from its datasets during packaging. Verifications, for example of version compatibility, will also be done. | ||
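As a rough illustration, a development manifest might look like the sketch below. Every field name here is an assumption modelled on the package manifest; only the additional fields described in this document are defined by this proposal:

```yaml
# Hypothetical manifest.yml for an integration; field names are
# assumptions based on the package manifest format.
name: apache
version: 1.0.4
description: Apache HTTP Server integration.
```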
|
|
||
| It contains a few additional fields which are not part of the package: | ||
|
|
||
| **datasets** | ||
|
|
||
| This is a list of datasets this integration depends on. As packages today cannot depend on other packages, it is important to have a dependency feature when building an integration, to avoid duplicating all the assets. Some examples are fields for ECS or fields specific to Filebeat. An example is below: | ||
|
|
||
| ``` | ||
| datasets: | ||
|
Member: Could we use here a different name that more explicitly refers to dependencies as We may also need to differentiate "runtime" dependencies from build-time dependencies (
Contributor (author): I like
Contributor (author): @jsoriano Changing the code, I just had an idea for another name:
Member: SGTM, I would only avoid the generic |
||
| - name: "ecs:ecs" | ||
| - name: "filebeat:filebeat" | ||
| ``` | ||
|
|
||
| No versions of the datasets are mentioned above. It's up to the implementer to make sure to increase the version number of the integration if a dependency changes. Alternatively, we could use versions which are then validated; if they are no longer correct, an error is thrown. This would probably be more dev friendly but more complex to implement. | ||
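The stricter alternative could be sketched as below. The `name:version` pin syntax and the in-repo dataset registry are assumptions for illustration, not actual tooling:

```python
# Hypothetical validation of pinned dataset versions at build time.
# The registry of datasets available in the repo (an assumption):
available_datasets = {"ecs": "1.0.0", "filebeat": "7.4.0"}

def validate_datasets(pins):
    """Each pin is "dataset:version"; raise if a pinned version is stale."""
    for pin in pins:
        name, _, pinned = pin.partition(":")
        current = available_datasets[name]
        if pinned != current:
            raise ValueError(
                f"dataset {name} is at {current}, manifest pins {pinned}")

validate_datasets(["ecs:1.0.0", "filebeat:7.4.0"])  # passes silently
```

With this approach, updating a dataset forces every integration that pins it to be touched, which is exactly the explicitness described above.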
|
|
||
| **Package config** | ||
|
|
||
| Not all integrations in this repo need packaging. For example, the Filebeat or ECS integration directories are only placeholders for assets and will never become an integration on their own. To prevent these from being packaged, `package: false` can be set. | ||
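For instance, the manifest of an assets-only placeholder such as `ecs` could opt out of packaging like this (a minimal sketch, assuming the flag lives at the top level of the manifest):

```yaml
# manifest.yml of an assets-only integration; never built into a package.
package: false
```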
|
||
|
|
||
| ### changelog.yml | ||
|
|
||
| Every integration should keep a changelog so that when a user upgrades, we can show the user what changed. If a dependency of an integration changes, it's up to the integration to add these items to the changelog list if needed. | ||
|
|
||
| The changelog is in a structured format, so it can be read and visualised in the package manager. | ||
|
|
||
| More details about the changelog can be found here: https://github.com/elastic/integrations-registry/blob/master/ASSETS.md#changelogyml | ||
|
|
||
| ``` | ||
| # The changelog.yml contains all the changes made to the integration and its datasets. | ||
| # If a dataset is adjusted, it should also be added to this changelog. | ||
| # The changelog is in a structured format so the order does not matter and it can be used | ||
| # for visualisation in the UI. | ||
|
|
||
| - version: 1.0.4 | ||
| changes: | ||
| - description: > | ||
| Unexpected breaking change had to be introduced. This should not happen in a minor. | ||
| type: breaking-change | ||
| link: https://github.com/elastic/beats/issues/13504 | ||
| - version: 1.0.3 | ||
| changes: | ||
| - description: Fix broken template | ||
| type: bugfix | ||
| link: https://github.com/elastic/beats/issues/13507 | ||
| - description: It is a known issue that the dashboard does not load properly | ||
| type: known-issue | ||
| link: https://github.com/elastic/beats/issues/13506 | ||
| ``` | ||
|
|
||
| ### testing.yml | ||
|
|
||
| The testing.yml can contain information about how the integration should be tested. So far the focus is on testing datasets, so this file might not be necessary. It could be included in datasets to share common testing information. | ||
|
|
||
| ### docs/README.md | ||
|
|
||
| The README document contains all the documentation about the integration. Each dataset may have its own additional documentation, which is expected to simply be appended to the main README on packaging. | ||
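That appending step could be sketched as follows, assuming the docs are already loaded as strings; the function name and join logic are hypothetical:

```python
# Hypothetical packaging step: concatenate the integration's main README
# with each dataset's extra documentation.
def build_readme(main_readme, dataset_readmes):
    """Append per-dataset docs to the main README, one section per dataset."""
    parts = [main_readme] + list(dataset_readmes)
    return "\n\n".join(p.strip() for p in parts if p.strip())

doc = build_readme(
    "# Apache integration",
    ["## Access dataset\nParses Apache HTTP Server access logs."],
)
```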
|
|
||
| ### img/ | ||
|
|
||
| Directory for all the icons, screenshots and potentially videos. | ||
|
|
||
| Question: Should we name this media? | ||
|
Member: If it is expected to contain also videos then yes please 🙂
Contributor (author): Long term, yes. Will rename. |
||
|
|
||
| ### dataset/{dataset}/testing.yml | ||
|
|
||
| This yaml file should contain all the configuration on how a dataset can be tested. It might contain which services have to be booted up for testing and how the tests should be run. | ||
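Since the format of this file is not yet defined, the following is only a hypothetical sketch of what it could contain; every key here is an assumption:

```yaml
# Hypothetical dataset testing.yml; all keys are assumptions.
services:
  - name: apache
    image: httpd:2.4
tests:
  - type: ingest-pipeline
    input: testdata/access.log
    expected: testdata/access.log-expected.json
```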
|
|
||
| ### dataset/{dataset}/testdata | ||
|
|
||
| All the data used for testing. For example, sample logs and the output generated from them. | ||
|
|
||
| ### dataset/fields/fields.yml | ||
|
|
||
| The fields.yml contains the content to generate the Elasticsearch index template and the Kibana index pattern. This happens in the integrations manager. | ||
|
|
||
| One thing that is important is that all fields.yml files used start with the global full path. At the moment, some fields.yml files in the datasets are relative to the module they are in; we need to change those. | ||
| The idea is that the order of the fields.yml files does not matter when combining them all into one file. |
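A minimal sketch of why full paths make the merge order-independent; the field dicts mirror fields.yml entries, but the merge function itself is an assumption, not the actual tooling:

```python
# Hypothetical merge of several fields.yml contents into one list.
# Because every entry uses its full dotted path, the result does not
# depend on the order in which the files are combined.
def combine_fields(*field_lists):
    """Merge field definitions, de-duplicating by full-path name."""
    merged = {}
    for fields in field_lists:
        for field in fields:
            merged[field["name"]] = field
    return sorted(merged.values(), key=lambda f: f["name"])

ecs_fields = [{"name": "source.address", "type": "keyword"}]
apache_fields = [
    {"name": "apache.access.ssl.protocol", "type": "keyword"},
    {"name": "source.address", "type": "keyword"},  # shared with ECS
]

combined = combine_fields(apache_fields, ecs_fields)
print([f["name"] for f in combined])
# ['apache.access.ssl.protocol', 'source.address']
```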
|
|
||
| ## Reusable content | ||
|
|
||
| Any dataset can be reused by simply referencing it in the manifest. But some of these reused assets do not need packaging on their own. These go into integration directories like `filebeat` or `ecs` as datasets where `package: false` is set. This allows reusing all these assets without also getting a package for them. It would be possible to store these assets outside the integration directory for better separation, but implementation of the collection script has shown that the script stays much simpler like this. | ||
|
|
||
|
|
||
| ## Versioning | ||
|
|
||
| The version of a package is taken from the manifest file. If the CoreDNS package contains `version: 1.2.3`, it will build the package `coredns-1.2.3`. For now, no exact version of a dataset is specified. If a dataset is updated, the next time packaging is called for an integration it will pull in the newest assets. So if there is a breaking change in a dataset, it's up to the integration package dev to decide if this is needed. To reduce errors we could introduce exact specification of a dataset version. This would mean that when a dataset version is updated, all integrations which reference it must be updated too. As everything is in one repository, this shouldn't be too much hassle but would make it more explicit. | ||
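As a sketch of this naming rule, a packaging step could derive the artifact name from the manifest and enforce semantic versioning. The function and the strict `major.minor.patch` check are assumptions for illustration, not the actual tooling:

```python
import re

# Plain major.minor.patch check; pre-release and build suffixes from the
# full semver spec are left out of this sketch.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")

def package_name(manifest):
    """Derive the {name}-{version} package identifier from a manifest dict."""
    version = manifest["version"]
    if not SEMVER.match(version):
        raise ValueError(f"not a semantic version: {version}")
    return f"{manifest['name']}-{version}"

print(package_name({"name": "coredns", "version": "1.2.3"}))  # coredns-1.2.3
```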
|
Member: Even if everything is in the same repository now, we have to keep thinking that we should support integrations living out of this repository.
Contributor (author): I think the tricky part if something is in another repository will be the
Member: We may end up needing some other repository for build-time dependencies 🙂
Member: Or maybe the external source can be specified in the include itself, e.g. for an integration including ECS: To allow something like this in the future without needing to support transitive dependencies, we may add the restriction that datasets with And this makes me wonder if datasets with
Contributor (author): All good questions :-) I initially had it in its separate library directory. This had two disadvantages:
So in summary, I think
Contributor (author): Thinking about this again: the good news with the syntax we have (independent of the name) is that we can always extend it with more fields, as it's an array. Not having a source just assumes it's local.
Member: But then if it is packaged it won't have
Contributor (author): Yes, it will not have
|
||
|
|
||
| It is expected that [semantic versioning](https://semver.org) is used. | ||
|
|
||
| ### Backports | ||
|
|
||
| I expect most integrations to only move forward and rarely have breaking changes. Because of this, no backports to branches etc. are needed. In case backports are needed, there are several options: | ||
|
|
||
| * Have a branch for this integration. Packaging will just work as is. | ||
| * Have this integration in a separate repo. | ||
|
|
||
|
|
||
| ## Conversion of modules to integrations | ||
|
|
||
| As the data of a module and an integration stays mostly the same, the transformation from a module to an integration can mostly be automated. I started to play around with some tooling to convert existing modules to integrations, but I would prefer to delay that discussion until we have agreed on the format for building integrations. | ||
|
Member: Remember Metricbeat light modules 🙂
Contributor (author): I think the main thing this could change is that potentially there are more config files? But in the first phase I expect these are still shipped as part of MB. Only when we get to inputs do the configs the user has to paste in become much bigger?
Member: Yes, current light modules can be shipped as part of MB, but as they are just plain text files we could have integrations with them.
Contributor (author): I guess for now, we need to keep them shipped with MB. With Fleet, it will become easy to just ship them to the agent. I assume lightweight modules do not necessarily require a restart of MB? The other option is that we let users copy / paste them as a backup.
Member: Yeah, light modules don't require a restart of MB, they are loaded "lazily" when instantiated, so new light modules can already be installed at run time. |
||
|
|
||
| # Questions | ||
|
|
||
| * Why do we store the main assets of an integration in, for example, `coredns/dataset/coredns` instead of at the top level? | ||
| * One of the main reasons is that it heavily simplifies the code for collecting assets, as there is just one location and no checks have to be made for additional assets at the top level. It also prevents potential directory name conflicts. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,73 @@ | ||
| { | ||
| "description": "Pipeline for parsing Apache HTTP Server access logs. Requires the geoip and user_agent plugins.", | ||
| "on_failure": [ | ||
| { | ||
| "set": { | ||
| "field": "error.message", | ||
| "value": "{{ _ingest.on_failure_message }}" | ||
| } | ||
| } | ||
| ], | ||
| "processors": [ | ||
| { | ||
| "grok": { | ||
| "field": "message", | ||
| "ignore_missing": true, | ||
| "patterns": [ | ||
| "%{IPORHOST:source.address} - %{DATA:user.name} \\[%{HTTPDATE:apache.access.time}\\] \"(?:%{WORD:http.request.method} %{DATA:url.original} HTTP/%{NUMBER:http.version}|-)?\" %{NUMBER:http.response.status_code:long} (?:%{NUMBER:http.response.body.bytes:long}|-)( \"%{DATA:http.request.referrer}\")?( \"%{DATA:user_agent.original}\")?", | ||
| "%{IPORHOST:source.address} - %{DATA:user.name} \\[%{HTTPDATE:apache.access.time}\\] \"-\" %{NUMBER:http.response.status_code:long} -", | ||
| "\\[%{HTTPDATE:apache.access.time}\\] %{IPORHOST:source.address} %{DATA:apache.access.ssl.protocol} %{DATA:apache.access.ssl.cipher} \"%{WORD:http.request.method} %{DATA:url.original} HTTP/%{NUMBER:http.version}\" %{NUMBER:http.response.body.bytes:long}" | ||
| ] | ||
| } | ||
| }, | ||
| { | ||
| "remove": { | ||
| "field": "message" | ||
| } | ||
| }, | ||
| { | ||
| "grok": { | ||
| "field": "source.address", | ||
| "ignore_missing": true, | ||
| "patterns": [ | ||
| "^(%{IP:source.ip}|%{HOSTNAME:source.domain})$" | ||
| ] | ||
| } | ||
| }, | ||
| { | ||
| "rename": { | ||
| "field": "@timestamp", | ||
| "target_field": "event.created" | ||
| } | ||
| }, | ||
| { | ||
| "date": { | ||
| "field": "apache.access.time", | ||
| "formats": [ | ||
| "dd/MMM/yyyy:H:m:s Z" | ||
| ], | ||
| "ignore_failure": true, | ||
| "target_field": "@timestamp" | ||
| } | ||
| }, | ||
| { | ||
| "remove": { | ||
| "field": "apache.access.time", | ||
| "ignore_failure": true | ||
| } | ||
| }, | ||
| { | ||
| "user_agent": { | ||
| "field": "user_agent.original", | ||
| "ignore_failure": true | ||
| } | ||
| }, | ||
| { | ||
| "geoip": { | ||
| "field": "source.ip", | ||
| "ignore_missing": true, | ||
| "target_field": "source.geo" | ||
| } | ||
| } | ||
| ] | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| { | ||
| "index_patterns": [ | ||
| "te*", | ||
| "bar*" | ||
| ], | ||
| "mappings": { | ||
| "_source": { | ||
| "enabled": false | ||
| }, | ||
| "properties": { | ||
| "created_at": { | ||
| "format": "EEE MMM dd HH:mm:ss Z yyyy", | ||
| "type": "date" | ||
| }, | ||
| "host_name": { | ||
| "type": "keyword" | ||
| } | ||
| } | ||
| }, | ||
| "settings": { | ||
| "number_of_shards": 1 | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| - name: access | ||
| type: group | ||
| description: > | ||
| Contains fields for the Apache HTTP Server access logs. | ||
| fields: | ||
| - name: ssl.protocol | ||
| type: keyword | ||
| description: > | ||
| SSL protocol version. | ||
|
|
||
| - name: ssl.cipher | ||
| type: keyword | ||
| description: > | ||
| SSL cipher name. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| type: log | ||
| paths: | ||
| {{ range $i, $path := .paths }} | ||
| - {{$path}} | ||
| {{ end }} | ||
| exclude_files: [".gz$"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| - module: apache | ||
| # Access logs | ||
| access: | ||
| enabled: true | ||
|
|
||
| # Set custom paths for the log files. If left empty, | ||
| # Filebeat will choose the paths depending on your OS. | ||
| #var.paths: |
If this is a group of assets, why don't we call them "Assets"? In my head, when I read datasets I'm thinking of some sort of structured information about some topic that has already been extracted from "somewhere" and is ready to analyze. The Wikipedia definition also states something similar: https://en.wikipedia.org/wiki/Data_set
Just doing some "duck testing" here 😄

The description here is unfortunate. These assets together are needed to create a dataset. But I think you have a point here, and it needs either some revising of the description or potential renaming. More "duck testing" needed :-)

I'm tempted to agree here that we should call the directory inside each integration `assets` instead of `dataset`. Because in some cases, it is literally only assets and nothing in it generates the dataset. One of the reasons for this grouping is also the "things" we test together. All these assets are in one place. It seems straightforward to explain that everything we need for testing together goes into one asset directory with a name. @sayden Would that help? @jsoriano WDYT?

Trying to also add an example here:
The coredns integration has 3 asset groups:

Yes, on a first read `dataset` also sounded a bit funny to me for that. +1 to consider another naming like `assets`. For the directory we could also think of something like `src` or `source`, as they contain the "source code" of the package.

src reminds me too much of source code, but it's actually mostly "config" files. My favorite at the moment: assets