4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
.DS_Store

build
.idea
131 changes: 131 additions & 0 deletions integration/README.md
@@ -0,0 +1,131 @@
**STATUS: Proposal**

# Overview

This describes how integrations are built and then packaged for the [integrations registry](https://github.com/elastic/integrations-registry). The format described in this document is focused on building and testing integrations, which are then packaged with tooling.

The format of a package for the registry can be found here: https://github.com/elastic/integrations-registry#package-structure. But for the development of a package, this is not enough, as we also need to test datasets, which requires additional metadata. The proposed structure allows testing metricsets / filesets in a similar way as we do today. With `mage package`, all assets are packaged together to conform to the package structure.

# Definitions

**Integration package**: An integration package is a packaged version of an integration. This is what is served by the integrations registry. An example of what such a package looks like can be found here. It’s important to state that the shipped package does not look identical to the format here, which is optimised for development and testing.

**Integration**: Integration definition with manifest and several datasets with the assets for the Elastic Stack.

**Dataset**: A set of assets grouped together for testing and development purposes.

Contributor

If this is a group of assets, why don't we call them "Assets"? In my head, when I read datasets I'm thinking in some sort of structured information about some topic that have been already extracted from "somewhere" and is ready to analyze. The wikipedia definition also states something similar: https://en.wikipedia.org/wiki/Data_set

Just doing some "duck testing" here 😄

Contributor Author

The description here is unfortunate. These assets together are needed to create a dataset. But I think you have a point here and it needs either some revising of the description or potential renaming. More "duck testing" needed :-)

Contributor Author

I'm tempted to agree here that we should call the directory inside each integration assets instead of dataset. Because in some cases, it is literally only assets and nothing in it generates the dataset. One of the reasons of this grouping is also the "things" we test together. All these assets are in one place. It seems to be straight forward to explain that everything we need for testing together goes into one asset directory with a name. @sayden Would that help?

@jsoriano WDYT?

Contributor Author

Trying to also add an example here:

The coredns integration has 3 asset groups:

  • coredns (all the generic assets)
  • log (assets for the log dataset)
  • stats (assets for the stats dataset)

Contributor

Yes, on a first read dataset also sounded a bit funny to me for that. +1 to consider another naming like assets.
For the directory we could also think in something like src or source as they contain the "source code" of the package.

Contributor Author

src reminds me too much of source code but it's actually mostly "config" files. My favorite at the moment: assets



# Integration files structure

As today with modules, the base structure for the implementation is split up into integrations, each of which contains multiple datasets. All the assets are inside each dataset, and the dataset file structure is very similar to the final structure of a package. The structure looks as follows:

```
{integration-name}/dataset/{dataset-name}/{package-structure}
```
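
To make the layout concrete, here is a minimal sketch of how a collection script could walk it. The helper names `list_datasets` and `collect_assets` are hypothetical and not part of any existing tooling; the only thing taken from this proposal is the `{integration-name}/dataset/{dataset-name}/` path convention.

```python
from pathlib import Path


def list_datasets(integration_dir: Path) -> list:
    """Return the dataset names found under {integration}/dataset/."""
    dataset_root = integration_dir / "dataset"
    if not dataset_root.is_dir():
        return []
    return sorted(p.name for p in dataset_root.iterdir() if p.is_dir())


def collect_assets(integration_dir: Path, dataset: str) -> list:
    """List every file of one dataset, relative to the dataset directory."""
    base = integration_dir / "dataset" / dataset
    return sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
    )
```

Because all assets live below `dataset/`, the sketch never has to special-case files on the top level of the integration.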

The top level of each integration contains the `manifest.yml`, `LICENSE.txt` and `changelog.yml` files.

Contributor Author

@jsoriano I guess we don't have to add a LICENSE.txt file into each integration. We can add it during packaging time based on the entry in the manifest.yml.

Contributor

I guess that we'd need a license per repository, and it should be added to the generated packages, yes.

Contributor Author

We probably need more than 1 ;-)


## Assets

Below, all the existing assets are described. Assets which are already defined in [ASSETS.md](https://github.com/elastic/integrations-registry/blob/master/ASSETS.md) from the package definition are not repeated here.

### manifest.yml

The manifest contains all the information about the integration and follows the logic of the [package manifest](https://github.com/elastic/integrations-registry/blob/master/ASSETS.md#general-assets). The manifest might be enriched with further information from its datasets during packaging. Verifications, such as version compatibility checks, will also be done.

It contains a few additional fields which are not part of the package:

**datasets**

This is a list of datasets this integration depends on. As packages today cannot depend on other packages, it is important to have a dependency feature when building integrations so that assets do not have to be duplicated. Some examples are fields for ECS or fields specific to Filebeat. An example is below:

```
datasets:
- name: "ecs:ecs"
- name: "filebeat:filebeat"
```

Contributor

Could we use here a different name that more explicitly refers to dependencies, such as requires or depends_on? I think that datasets could be confusing as it could also mean the datasets provided by the package.

We may also need to differentiate "runtime" dependencies from build-time dependencies (includes?).

Contributor Author

I like includes as in the end it's included in the package.

Contributor Author

@jsoriano Changing the code, I just had an idea for another name: imports?

Contributor

SGTM, I would only avoid the generic datasets 🙂

No versions of the datasets are mentioned above. It's up to the implementer to make sure to increase the version number of the integration if a dependency changes. Alternatively, we could use versions which are then validated, throwing an error if they are no longer correct. This would probably be more dev friendly but more complex to implement.
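
The validated alternative could look roughly like the sketch below. It assumes each entry of the `datasets` list may carry an optional `version` key, which is hypothetical and not part of the proposal, and that the build already knows the current version of every dataset in the repo. The function name `check_include_versions` is also made up for illustration.

```python
def check_include_versions(includes, available):
    """Return error messages for references whose pinned version is stale.

    includes:  list of dicts such as {"name": "ecs:ecs", "version": "1.0.0"};
               the "version" key is a hypothetical, optional extension.
    available: mapping from dataset reference to the version found in the repo.
    """
    errors = []
    for inc in includes:
        name = inc["name"]
        if name not in available:
            errors.append(f"{name}: unknown dataset")
            continue
        pinned = inc.get("version")
        # Entries without a pin keep today's behaviour: always use the newest.
        if pinned is not None and pinned != available[name]:
            errors.append(f"{name}: pinned {pinned}, found {available[name]}")
    return errors
```

Since unpinned entries are accepted as they are, such a check could be introduced without touching existing manifests.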

**Package config**

Not all integrations in this repo need packaging. For example, the Filebeat and ECS integration directories are only placeholders for assets and will not become integrations themselves. To prevent these from being packaged, `package: false` can be set.
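
As a small sketch of how the packaging step might honour this flag: the function below takes already-parsed manifests as plain dictionaries and assumes that a missing `package` key means the integration is packaged. That default is an assumption, as is the name `packageable`.

```python
def packageable(manifests):
    """Return the names of integrations that should be turned into packages.

    manifests maps an integration name to its parsed manifest.yml.
    A missing "package" key is treated as True (assumption).
    """
    return [name for name, m in manifests.items() if m.get("package", True)]
```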

### changelog.yml

Every integration should keep a changelog so if a user upgrades, we can show the user what changed. If a dependency of an integration changes, it's up to the integration to add these items to the changelog list if needed.

The changelog is in a structured format, so it can be read out and visualised in the package manager.

```
# The changelog.yml contains all the changes made to the integration and its datasets.
# If a dataset is adjusted, it should also be added to this changelog.
# The changelog is in a structured format so the order does not matter and it can be used
# for visualisation in the UI.

# Description of the change
- description: Added dataset foo

  # The versions here should follow semver
  version: 1.2.3

  # The options here are:
  # - breaking change
  # - bugfix
  # - Added
  # - Deprecated
  # - Known Issue
  type: bugfix

  # Link to a Github issue or a PR
  link: https://github.com/elastic/integrations
```
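
Because the changelog is structured, entries can be checked mechanically. The sketch below validates one parsed entry. The allowed types are the ones listed in the example above, compared case-insensitively. Treating `description`, `version` and `type` as required and `link` as optional is an assumption, and `validate_entry` is a hypothetical helper name.

```python
import re

# Types listed in the changelog.yml example, lower-cased for comparison.
ALLOWED_TYPES = {"breaking change", "bugfix", "added", "deprecated", "known issue"}

# Plain MAJOR.MINOR.PATCH only; pre-release and build suffixes are not
# handled in this sketch.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def validate_entry(entry):
    """Return a list of problems found in one changelog entry (a dict)."""
    errors = []
    for key in ("description", "version", "type"):
        if not entry.get(key):
            errors.append(f"missing {key}")
    version = str(entry.get("version", ""))
    if version and not SEMVER.match(version):
        errors.append(f"version {version!r} is not MAJOR.MINOR.PATCH")
    type_ = str(entry.get("type", "")).lower()
    if type_ and type_ not in ALLOWED_TYPES:
        errors.append(f"unknown type {entry['type']!r}")
    return errors
```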

### testing.yml

The testing.yml can contain information about how the integration should be tested. So far the focus is on testing datasets, so this file might not be necessary. It could be included by datasets to share common testing info.

### docs/README.md

README document which contains all the documentation about the integration. It is possible that each dataset has its own additional documentation. It is expected that this will be just appended to the main README on packaging.
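
A minimal sketch of the append step, assuming the dataset documents are appended in alphabetical order of the dataset name. The ordering is an assumption, as nothing in this proposal fixes it, and `merge_readmes` is a hypothetical helper.

```python
def merge_readmes(main, dataset_docs):
    """Append each dataset's documentation to the main README text.

    main:         content of docs/README.md of the integration.
    dataset_docs: mapping from dataset name to that dataset's extra docs.
    """
    parts = [main.rstrip()]
    for name in sorted(dataset_docs):
        parts.append(dataset_docs[name].strip())
    # Separate sections by one blank line and end with a single newline.
    return "\n\n".join(parts) + "\n"
```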

### img/

Directory for all the icons, screenshots and potentially videos.

Question: Should we name this media?

Contributor

If it is expected to contain also videos then yes please 🙂

Contributor Author

long term, yes. Will rename.


### dataset/{dataset}/testing.yml

This yaml file should contain all the configuration on how a dataset can be tested. It might contain which services have to be booted up for testing and how the tests should be run.

### dataset/{dataset}/testdata

All the data used for testing, for example sample logs and the output generated from them.

## Reusable content

Any dataset can be reused by just referencing it in the manifest. But some of these reused assets don't need packaging on their own. These go into integration directories like `filebeat` or `ecs` as datasets where `package: false` is set. This allows reusing all these assets without also getting a package for them. It would be possible to store these assets outside the "integration" directory for better separation, but implementing the collection script has shown that the script stays much simpler this way.
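
The sketch below shows one way the collection step could turn the references into a flat list of dataset directories. It assumes that a reference such as `ecs:ecs` reads as `integration:dataset`. That is how the examples in this document look, but the syntax is not formally defined here. Both function names are hypothetical.

```python
def parse_reference(ref):
    """Split a reference like "ecs:ecs" into (integration, dataset)."""
    integration, sep, dataset = ref.partition(":")
    if not sep or not integration or not dataset:
        raise ValueError(f"expected 'integration:dataset', got {ref!r}")
    return integration, dataset


def datasets_to_package(integration, own_datasets, manifest):
    """Return (integration, dataset) pairs whose assets go into the package.

    own_datasets: dataset names found in the integration's own directory.
    manifest:     parsed manifest.yml; its "datasets" list holds references.
    """
    pairs = [(integration, d) for d in own_datasets]
    for item in manifest.get("datasets", []):
        pair = parse_reference(item["name"])
        if pair not in pairs:  # skip duplicates, keep the first position
            pairs.append(pair)
    return pairs
```

Since reused datasets live in the same directory layout as regular ones, the referenced pairs can be collected with exactly the same code as the integration's own datasets.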


## Versioning

The version of a package is taken from the manifest file. If the CoreDNS package contains `version: 1.2.3` it will build the package `coredns-1.2.3`. For now, no exact version of a dataset is specified. If a dataset is updated, the next time packaging is called for an integration, it will pull in the newest assets. So if there is a breaking change in a dataset, it's up to the integration package dev to decide if this is needed. To reduce errors we could introduce exact specification of a dataset version. This would mean that when a dataset version is updated, all datasets which reference it must be updated too. As everything is in one repository, this shouldn't be too much hassle but would make it more explicit.
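
The naming rule is simple enough to state as code. In this sketch `package_name` is a hypothetical helper, and failing on a missing version is an assumption about how the tooling should behave.

```python
def package_name(integration, manifest):
    """Build the package name, e.g. "coredns-1.2.3", from the manifest version."""
    version = manifest.get("version")
    if not version:
        raise ValueError(f"{integration}: manifest has no version")
    return f"{integration}-{version}"
```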

Contributor

Even if everything is in the same repository now we have to keep thinking that we should support integrations living out of this repository.

Contributor Author

I think the tricky part if something is in another repository will be the includes. But I'm sure we could figure out some automation around copying these over.

Contributor

We may end up needing some other repository for build-time dependencies 🙂

Contributor

Or maybe the external source can be specified in the include itself, e.g. for an integration including ECS:

includes:
- name: "ecs:ecs"
  source: https://github.com/elastic/integrations

To allow something like this in the future without needing to support transitive dependencies, we may add the restriction that datasets with package: false cannot have includes, and only datasets with package: false can be included.

And this makes me wonder if datasets with package: false should just be another thing like "libraries"(?) instead of datasets, as they probably will have more restrictions.

Contributor Author

All good questions :-)

I initially had it in its separate library directory. This had two disadvantages:

  • My collection script became more complex
  • It's not possible to have a dependency "directly" on an existing package. Let me give an example: I'm assuming ECS will become its own package that we should also ship. But almost all integrations will depend on it. We could now move 99% of the package into library or have it as a package which others depend on. Having its own package also solves the changelog question, for example, and I would guess it's easier to understand / find. But technically I guess both options work.

So in summary, I think package: false cannot include others, but I don't think that package: true means it cannot be included. But I see how this could bring us in building include-trees. Perhaps we could go with this limitation to keep things simple to get started.

Contributor Author

Thinking about this again: The good news with the syntax we have (independent of the name) is that we can always extend it with more fields as it's an array. Not having a source just assumes it's local.

Contributor

I'm assuming ECS will become it's own package that we should also ship. But almost all integrations will depend on it.

But then if it is packaged it won't have package: false, right? Or I am misunderstanding something? 🙂

Contributor Author

yes, it will not have package: false but others will still depend on it.


### Backports

I expect most integrations to only move forward and to rarely have breaking changes. Because of this, no backports to branches etc. are needed. In case a backport is needed, there are several options:

* Have a branch for this integration. Packaging will just work as is.
* Have this integration in a separate repo.


## Conversion of modules to integrations

As the data of a module and an integration stay mostly the same, transformation from a module to an integration can mostly be automated. I started to play around with some tooling to convert existing modules to integrations, but I would prefer to delay the discussion around this until we have agreed on the format for building integrations.

Contributor

Remember Metricbeat light modules 🙂

Contributor Author

I think the main thing this could change is that potentially there are more config files? But in the first phase I expect these are still shipped as part of MB. Only when we get to inputs, the configs the user has to paste in become much bigger?

Contributor

Yes, current light modules can be shipped as part of MB, but as they are just plain text files we could have integrations with them.

Contributor Author

I guess for now, we need to keep them shipped with MB. With Fleet, it will become easy to just ship them to the agent. I assume lightweight modules do not necessarily require a restart of MB? The other option is that we let the user copy / paste it as a backup.

Contributor

Yeah, light modules don't require a restart of MB, they are loaded "lazily" when instantiated, so new light modules can already be installed in run time.


# Questions

* Why do we store the main assets of an integration in, for example, `coredns/dataset/coredns` instead of just the top level?
* One of the main reasons is that it heavily simplifies the code for collecting assets, as there is just one place to look and no checks have to be made for additional assets on the top level. It also prevents potential directory name conflicts.

4 changes: 4 additions & 0 deletions integration/apache/CHANGELOG.md
@@ -0,0 +1,4 @@


* Entries for each version -> fix can go into multiple versions?
* How will the module read these changelog entries? Should we use yaml?
Empty file added integration/apache/LICENSE.txt
Empty file.
67 changes: 67 additions & 0 deletions integration/apache/dataset/apache.access/dataset.yml
@@ -0,0 +1,67 @@
name: apache.access
version: master
testing:
  files:
    - testfile.log

# TODO: What should we call this?


datasource:
  apache.access:
    vars:
      - name: paths
        default:
          - /var/log/apache2/access.log*
          - /var/log/apache2/other_vhosts_access.log*
        os.darwin:
          - /usr/local/var/log/apache2/access_log*
        os.windows:
          - "C:/tools/Apache/httpd-2.*/Apache24/logs/access.log*"
          - "C:/Program Files/Apache Software Foundation/Apache2.*/logs/access.log*"

# TODO: Configure input for the module. This can be overwritten by central management.
input:

ingest_pipeline: ingest/default.json
input: config/access.yml

# This becomes a requirement of the module
elasticsearch:
  requires.processors:
    - name: user_agent
      plugin: ingest-user-agent
    - name: geoip
      plugin: ingest-geoip


# From PH
configs:
  include:
    - package: common/logs.yml:configs
  # Note I need to play with the syntax here, but I think that would work.
  # it will be up to the building to create the right input but we could use some helpers
  # methods to fix that.
  transport:
    type: nested
    default: file
    validations:
      enum: ["file", "tcp"]
      required: true
  transport.type:
    type: string
    default: "file"
    enum: ["file", "tcp"]
  transport.file:
    paths:
      type: Array<PATH>
      default: ["%base_path/access.log"]
      validations:
        presence: true
  transport.tcp:
    port:
      type: range # We can have Port type which has the default validations.
      default: 8000
      validations:
        min: 0
        max: 65568

Contributor

Interesting, but I think that using the keys as names for the configs can lead to ambiguous situations, for example here transport.type is nested in the first block, and an object in the second block. What about having it as a list with the name of the option as one more attribute?

configuration:
  include:
    - package: common/logs.yml:configurations
  options:
  - name: transport
    type: nested
    default: file
    validations:
      enum: ["file", "tcp"]
      required: true
  - name: transport.type
    type: string
    default: "file"
    enum: ["file", "tcp"]
  - name: transport.file.paths
    type: Array<PATH>
    default: ["%base_path/access.log"]
    validations:
      presence: true
  - name: transport.tcp.port
    type: range # We can have Port type which has the default validations.
    default: 8000
    validations:
      min: 0
      max: 65568

Contributor Author

We will actually not need this for our first version. This is a copy / paste from an initial proposal from @ph .

Contributor

This may be interesting for that elastic/beats#13357

@@ -0,0 +1,73 @@
{
"description": "Pipeline for parsing Apache HTTP Server access logs. Requires the geoip and user_agent plugins.",
"on_failure": [
{
"set": {
"field": "error.message",
"value": "{{ _ingest.on_failure_message }}"
}
}
],
"processors": [
{
"grok": {
"field": "message",
"ignore_missing": true,
"patterns": [
"%{IPORHOST:source.address} - %{DATA:user.name} \\[%{HTTPDATE:apache.access.time}\\] \"(?:%{WORD:http.request.method} %{DATA:url.original} HTTP/%{NUMBER:http.version}|-)?\" %{NUMBER:http.response.status_code:long} (?:%{NUMBER:http.response.body.bytes:long}|-)( \"%{DATA:http.request.referrer}\")?( \"%{DATA:user_agent.original}\")?",
"%{IPORHOST:source.address} - %{DATA:user.name} \\[%{HTTPDATE:apache.access.time}\\] \"-\" %{NUMBER:http.response.status_code:long} -",
"\\[%{HTTPDATE:apache.access.time}\\] %{IPORHOST:source.address} %{DATA:apache.access.ssl.protocol} %{DATA:apache.access.ssl.cipher} \"%{WORD:http.request.method} %{DATA:url.original} HTTP/%{NUMBER:http.version}\" %{NUMBER:http.response.body.bytes:long}"
]
}
},
{
"remove": {
"field": "message"
}
},
{
"grok": {
"field": "source.address",
"ignore_missing": true,
"patterns": [
"^(%{IP:source.ip}|%{HOSTNAME:source.domain})$"
]
}
},
{
"rename": {
"field": "@timestamp",
"target_field": "event.created"
}
},
{
"date": {
"field": "apache.access.time",
"formats": [
"dd/MMM/yyyy:H:m:s Z"
],
"ignore_failure": true,
"target_field": "@timestamp"
}
},
{
"remove": {
"field": "apache.access.time",
"ignore_failure": true
}
},
{
"user_agent": {
"field": "user_agent.original",
"ignore_failure": true
}
},
{
"geoip": {
"field": "source.ip",
"ignore_missing": true,
"target_field": "source.geo"
}
}
]
}
@@ -0,0 +1,23 @@
{
"index_patterns": [
"te*",
"bar*"
],
"mappings": {
"_source": {
"enabled": false
},
"properties": {
"created_at": {
"format": "EEE MMM dd HH:mm:ss Z yyyy",
"type": "date"
},
"host_name": {
"type": "keyword"
}
}
},
"settings": {
"number_of_shards": 1
}
}
14 changes: 14 additions & 0 deletions integration/apache/dataset/apache.access/fields/fields.yml
@@ -0,0 +1,14 @@
- name: access
  type: group
  description: >
    Contains fields for the Apache HTTP Server access logs.
  fields:
    - name: ssl.protocol
      type: keyword
      description: >
        SSL protocol version.

    - name: ssl.cipher
      type: keyword
      description: >
        SSL cipher name.
6 changes: 6 additions & 0 deletions integration/apache/dataset/apache.access/filebeat/input.yml
@@ -0,0 +1,6 @@
type: log
paths:
{{ range $i, $path := .paths }}
- {{$path}}
{{ end }}
exclude_files: [".gz$"]
8 changes: 8 additions & 0 deletions integration/apache/dataset/apache.access/filebeat/module.yml
@@ -0,0 +1,8 @@
- module: apache
  # Access logs
  access:
    enabled: true

    # Set custom paths for the log files. If left empty,
    # Filebeat will choose the paths depending on your OS.
    #var.paths: