4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
.DS_Store

build
.idea
146 changes: 146 additions & 0 deletions integration/README.md
@@ -0,0 +1,146 @@
**STATUS: Proposal**

# Overview

This describes how integrations are built and then packaged for the [integrations registry](https://github.com/elastic/integrations-registry). The format described in this document is focused on allowing integrations to be built and tested, and then packaged with tools.

The format of a package for the registry can be found [here](https://github.com/elastic/integrations-registry#package-structure). But for the development of a package this is not enough, as we also need testing of datasets, which requires additional metadata. The proposed structure allows testing metricsets / filesets in a similar way as we do today. With `mage package`, all assets are packaged together to conform to the package structure.

# Definitions

**Integration package**: An integration package is a packaged version of an integration. This is what is served by the integrations registry. An example of what such a package looks like can be found [here](https://github.com/elastic/integrations-registry#package-structure). It's important to state that the shipped package does not look identical to the format here, which is optimised for development and testing.

**Integration**: An integration definition with a manifest and several datasets containing the assets for the Elastic Stack.

**Dataset**: A set of assets which belong together for testing and development purposes.
**Contributor:**

If this is a group of assets, why don't we call them "Assets"? In my head, when I read datasets I'm thinking of some sort of structured information about some topic that has already been extracted from "somewhere" and is ready to analyze. The wikipedia definition also states something similar: https://en.wikipedia.org/wiki/Data_set

Just doing some "duck testing" here 😄

**Contributor Author:**

The description here is unfortunate. These assets together are needed to create a dataset. But I think you have a point here and it needs either some revising of the description or potential renaming. More "duck testing" needed :-)

**Contributor Author:**

I'm tempted to agree here that we should call the directory inside each integration assets instead of dataset, because in some cases it is literally only assets and nothing in it generates the dataset. One of the reasons for this grouping is also the "things" we test together. All these assets are in one place. It seems straightforward to explain that everything we need for testing together goes into one asset directory with a name. @sayden Would that help?

@jsoriano WDYT?

**Contributor Author:**

Trying to also add an example here:

The coredns integration has 3 asset groups:

  • coredns (all the generic assets)
  • log (assets for the log dataset)
  • stats (assets for the stats dataset)

**Member:**

Yes, on a first read dataset also sounded a bit funny to me for that. +1 to consider another name like assets.
For the directory we could also think of something like src or source, as they contain the "source code" of the package.

**Contributor Author:**

src reminds me too much of source code but it's actually mostly "config" files. My favorite at the moment: assets



# Integration files structure

As with modules today, the base structure for the implementation is split up into integrations which contain multiple datasets. All the assets are inside each dataset, and the dataset file structure is very similar to the final structure of a package. The structure looks as follows:

```
{integration-name}/dataset/{dataset-name}/{package-structure}
```

An example of the above for an apache.access log ingest pipeline is:

```
apache/dataset/access/elasticsearch/ingest-pipeline/default.json
```

The top level of each integration contains the `manifest.yml`, `LICENSE.txt`, and `changelog.yml` files.
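Putting this together, a hypothetical layout for a coredns integration might look like the following sketch (file and directory names beyond those mentioned in this document are illustrative assumptions):

```
coredns/
├── manifest.yml
├── changelog.yml
├── LICENSE.txt
├── docs/
│   └── README.md
├── img/
│   └── icon.png
└── dataset/
    ├── log/
    │   ├── testing.yml
    │   ├── testdata/
    │   └── elasticsearch/
    │       └── ingest-pipeline/
    │           └── default.json
    └── stats/
        └── fields/
            └── fields.yml
```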
**Contributor Author:**

@jsoriano I guess we don't have to add a LICENSE.txt file into each integration. We can add it during packaging time based on the entry in the manifest.yml.

**Member:**

I guess that we'd need a license per repository, and it should be added to the generated packages, yes.

**Contributor Author:**

We probably need more than one ;-)


## Assets

All the existing assets are described below. The assets which are already defined in [ASSETS.md](https://github.com/elastic/integrations-registry/blob/master/ASSETS.md) as part of the package definition are not repeated here.

### manifest.yml

The manifest contains all the information about the integration and follows the logic of the [package manifest](https://github.com/elastic/integrations-registry/blob/master/ASSETS.md#general-assets). The manifest might be enriched with further information from its datasets during packaging. Verifications of version compatibility etc. will also be done.

It contains a few additional fields which are not part of the package:

**datasets**

This is a list of datasets this integration depends on. As packages today cannot depend on other packages, it is important to have a dependency feature when building integrations, to avoid duplicating all the assets. Some examples are fields for ECS or fields specific to Filebeat. An example is below:

```
datasets:
- name: "ecs:ecs"
- name: "filebeat:filebeat"
```

**Member:**

Could we use a different name here that more explicitly refers to dependencies, such as `requires` or `depends_on`? I think that `datasets` could be confusing, as it could also mean the datasets provided by the package.

We may also need to differentiate "runtime" dependencies from build-time dependencies (`includes`?).

**Contributor Author:**

I like `includes` as in the end it's included in the package.

**Contributor Author:**

@jsoriano Changing the code, I just had an idea for another name: `imports`?

**Member:**

SGTM, I would only avoid the generic `datasets` 🙂

No versions of the datasets are mentioned above. It's up to the implementer to make sure to increase the version number of the integration if a dependency changes. Alternatively, we could use versions which are then validated, with an error thrown if they are no longer correct. This would probably be more dev-friendly but more complex to implement.

**Package config**

Not all integrations in this repo need packaging. For example, the Filebeat or ECS integration directories are only placeholders for assets and will not become integrations on their own. To prevent these from being packaged, `package: false` can be set.
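A minimal sketch of such a placeholder manifest follows; only `package: false` is taken from this document, while the `name` and `version` fields are assumptions based on the package manifest format:

```
name: ecs
version: 1.0.0
package: false
```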

### changelog.yml

Every integration should keep a changelog so that if a user upgrades, we can show the user what changed. If a dependency of an integration changes, it's up to the integration to add these items to the changelog list if needed.

The changelog is in a structured format, so it can be read out and visualised in the package manager.

More details about the changelog can be found here: https://github.com/elastic/integrations-registry/blob/master/ASSETS.md#changelogyml

```
# The changelog.yml contains all the changes made to the integration and its datasets.
# If a dataset is adjusted, it should also be added to this changelog.
# The changelog is in a structured format so the order does not matter and it can be used
# for visualisation in the UI.

- version: 1.0.4
  changes:
  - description: >
      Unexpected breaking change had to be introduced. This should not happen in a minor.
    type: breaking-change
    link: https://github.com/elastic/beats/issues/13504
- version: 1.0.3
  changes:
  - description: Fix broken template
    type: bugfix
    link: https://github.com/elastic/beats/issues/13507
  - description: It is a known issue that the dashboard does not load properly
    type: known-issue
    link: https://github.com/elastic/beats/issues/13506
```

### testing.yml

The testing.yml can contain information about how the integration should be tested. So far the focus is on testing datasets, so this file might not be necessary. It could be included in datasets to share common testing info.

### docs/README.md

README document which contains all the documentation about the integration. It is possible that each dataset has its own additional documentation. It is expected that this will be just appended to the main README on packaging.

### img/

Directory for all the icons, screenshots and potentially videos.

Question: Should we name this media?
**Member:**

If it is expected to also contain videos, then yes please 🙂

**Contributor Author:**

long term, yes. Will rename.


### dataset/{dataset}/testing.yml

This yaml file should contain all the configuration on how a dataset can be tested. It might contain which services have to be booted up for testing and how the tests should be run.
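Since the format is not yet defined, a purely hypothetical sketch of such a dataset-level testing.yml could look like this (all field names are assumptions):

```
# Services to boot up before running the tests
services:
- apache:2.4
# How the tests should be run
tests:
- type: pipeline
  input: testdata/access.log
  expected: testdata/access.log-expected.json
```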

### dataset/{dataset}/testdata

All the data used for testing, for example sample logs and the output generated from them.

### dataset/{dataset}/fields/fields.yml

The fields.yml contains the content to generate the Elasticsearch index template and the Kibana index pattern. This happens in the integrations manager.

One thing that is important is that all fields.yml files used start with the global full path. At the moment we have some fields.yml files in the datasets which are relative to the module they are in. We need to change those.
The idea is that the order of the fields.yml files does not matter when combining them all into one file.
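As an illustration of full-path fields, the SSL fields from the access dataset later in this PR would be written with their complete path rather than relative to the module (a sketch of the proposed change, not the current file contents):

```
- name: apache.access.ssl.protocol
  type: keyword
  description: >
    SSL protocol version.
- name: apache.access.ssl.cipher
  type: keyword
  description: >
    SSL cipher name.
```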

## Reusable content

Any dataset can be reused by just referencing it in the manifest. But some of these reused assets don't need packaging on their own. These go into integration directories like `filebeat` or `ecs` as datasets where `package: false` is set. This allows all these assets to be reused without also producing a package for them. It would be possible to store these assets outside the "integration" directory for better separation, but the implementation of the collection script has shown that the script stays much simpler this way.


## Versioning

The version of a package is taken from the manifest file. If the CoreDNS package contains `version: 1.2.3`, it will build the package `coredns-1.2.3`. For now, no exact version of a dataset is specified. If a dataset is updated, the next time packaging is called for an integration it will pull in the newest assets. So if there is a breaking change in a dataset, it's up to the integration package dev to decide if this is needed. To reduce errors we could introduce exact specification of a dataset version. This would mean that in case a dataset version is updated, all integrations which reference it must be updated too. As everything is in one repository, this shouldn't be too much hassle, but it would make things more explicit.
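If exact dataset versions were introduced, the `datasets` list in the manifest might look like this (hypothetical syntax, not part of the current proposal):

```
datasets:
- name: "ecs:ecs"
  version: 1.2.0
```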
**Member:**

Even if everything is in the same repository now, we have to keep in mind that we should support integrations living outside of this repository.

**Contributor Author:**

I think the tricky part, if something is in another repository, will be the includes. But I'm sure we could figure out some automation around copying these over.

**Member:**

We may end up needing some other repository for build-time dependencies 🙂

**Member:**

Or maybe the external source can be specified in the include itself, e.g. for an integration including ECS:

```
includes:
- name: "ecs:ecs"
  source: https://github.com/elastic/integrations
```

To allow something like this in the future without needing to support transitive dependencies, we may add the restriction that datasets with `package: false` cannot have includes, and only datasets with `package: false` can be included.

And this makes me wonder if datasets with `package: false` should just be another thing like "libraries"(?) instead of datasets, as they probably will have more restrictions.

**Contributor Author:**

All good questions :-)

I initially had it in its own separate library directory. This had two disadvantages:

  • My collection script became more and more complex
  • It's not possible to have a dependency "directly" on an existing package. Let me give an example: I'm assuming ECS will become its own package that we should also ship. But almost all integrations will depend on it. We could now move 99% of the package into a library, or have it as a package which others depend on. Having its own package also solves the changelog question, for example, and I would guess it's easier to understand / find. But technically I guess both options work.

So in summary, I think datasets with `package: false` cannot include others, but I don't think that `package: true` means a dataset cannot be included. But I see how this could lead us into building include-trees. Perhaps we could go with this limitation to keep things simple to get started.

**Contributor Author:**

Thinking about this again: the good news with the syntax we have (independent of the name) is that we can always extend it with more fields, as it's an array. Not having a `source` just assumes it's local.

**Member:**

> I'm assuming ECS will become its own package that we should also ship. But almost all integrations will depend on it.

But then if it is packaged it won't have `package: false`, right? Or am I misunderstanding something? 🙂

**Contributor Author:**

Yes, it will not have `package: false`, but others will still depend on it.


It is expected that [semantic versioning](https://semver.org) is used.

### Backports

I expect most integrations to only move forward and to rarely have breaking changes. Because of this, no backports to branches etc. are needed. In case this is needed, there are several options:

* Have a branch for this integration. Packaging will just work as is.
* Have this integration in a separate repo.


## Conversion of modules to integrations

As the data of a module and an integration stays mostly the same, the transformation from a module to an integration can mostly be automated. I started to play around with some tooling to convert existing modules to integrations, but I would prefer to delay the discussion around this until we have agreed on the format for building integrations.
**Member:**

Remember Metricbeat light modules 🙂

**Contributor Author:**

I think the main thing this could change is that potentially there are more config files? But in the first phase I expect these are still shipped as part of MB. Only when we get to inputs do the configs the user has to paste in become much bigger?

**Member:**

Yes, current light modules can be shipped as part of MB, but as they are just plain text files we could have integrations with them.

**Contributor Author:**

I guess for now, we need to keep them shipped with MB. With Fleet, it will become easy to just ship them to the agent. I assume lightweight modules do not necessarily require a restart of MB? The other option is that we let users copy / paste it as a backup.

**Member:**

Yeah, light modules don't require a restart of MB; they are loaded "lazily" when instantiated, so new light modules can already be installed at runtime.


# Questions

* Why don't we store the main assets of an integration in for example `coredns/dataset/coredns` instead of just the top level?
* One of the main reasons is that it heavily simplifies the code for collecting assets, as there is just one location and no checks have to be made for whether there are also assets at the top level. It also prevents potential directory name conflicts.

Empty file added integration/apache/LICENSE.txt
Empty file.
@@ -0,0 +1,73 @@
{
  "description": "Pipeline for parsing Apache HTTP Server access logs. Requires the geoip and user_agent plugins.",
  "on_failure": [
    {
      "set": {
        "field": "error.message",
        "value": "{{ _ingest.on_failure_message }}"
      }
    }
  ],
  "processors": [
    {
      "grok": {
        "field": "message",
        "ignore_missing": true,
        "patterns": [
          "%{IPORHOST:source.address} - %{DATA:user.name} \\[%{HTTPDATE:apache.access.time}\\] \"(?:%{WORD:http.request.method} %{DATA:url.original} HTTP/%{NUMBER:http.version}|-)?\" %{NUMBER:http.response.status_code:long} (?:%{NUMBER:http.response.body.bytes:long}|-)( \"%{DATA:http.request.referrer}\")?( \"%{DATA:user_agent.original}\")?",
          "%{IPORHOST:source.address} - %{DATA:user.name} \\[%{HTTPDATE:apache.access.time}\\] \"-\" %{NUMBER:http.response.status_code:long} -",
          "\\[%{HTTPDATE:apache.access.time}\\] %{IPORHOST:source.address} %{DATA:apache.access.ssl.protocol} %{DATA:apache.access.ssl.cipher} \"%{WORD:http.request.method} %{DATA:url.original} HTTP/%{NUMBER:http.version}\" %{NUMBER:http.response.body.bytes:long}"
        ]
      }
    },
    {
      "remove": {
        "field": "message"
      }
    },
    {
      "grok": {
        "field": "source.address",
        "ignore_missing": true,
        "patterns": [
          "^(%{IP:source.ip}|%{HOSTNAME:source.domain})$"
        ]
      }
    },
    {
      "rename": {
        "field": "@timestamp",
        "target_field": "event.created"
      }
    },
    {
      "date": {
        "field": "apache.access.time",
        "formats": [
          "dd/MMM/yyyy:H:m:s Z"
        ],
        "ignore_failure": true,
        "target_field": "@timestamp"
      }
    },
    {
      "remove": {
        "field": "apache.access.time",
        "ignore_failure": true
      }
    },
    {
      "user_agent": {
        "field": "user_agent.original",
        "ignore_failure": true
      }
    },
    {
      "geoip": {
        "field": "source.ip",
        "ignore_missing": true,
        "target_field": "source.geo"
      }
    }
  ]
}
@@ -0,0 +1,23 @@
{
  "index_patterns": [
    "te*",
    "bar*"
  ],
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "created_at": {
        "format": "EEE MMM dd HH:mm:ss Z yyyy",
        "type": "date"
      },
      "host_name": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "number_of_shards": 1
  }
}
14 changes: 14 additions & 0 deletions integration/apache/asset/access/fields/fields.yml
@@ -0,0 +1,14 @@
- name: access
  type: group
  description: >
    Contains fields for the Apache HTTP Server access logs.
  fields:
  - name: ssl.protocol
    type: keyword
    description: >
      SSL protocol version.

  - name: ssl.cipher
    type: keyword
    description: >
      SSL cipher name.
6 changes: 6 additions & 0 deletions integration/apache/asset/access/filebeat/input.yml
@@ -0,0 +1,6 @@
type: log
paths:
{{ range $i, $path := .paths }}
- {{$path}}
{{ end }}
exclude_files: [".gz$"]
8 changes: 8 additions & 0 deletions integration/apache/asset/access/filebeat/module.yml
@@ -0,0 +1,8 @@
- module: apache
  # Access logs
  access:
    enabled: true

    # Set custom paths for the log files. If left empty,
    # Filebeat will choose the paths depending on your OS.
    #var.paths: