## TSDB Track

This data is anonymized monitoring data from elastic-apps designed to test
our TSDB project. TSDB support is under active development in Elasticsearch
right now, so it's best to run this track only against Elasticsearch's
master branch.

TSDB needs us to be careful about how we anonymize. Too much randomization and
TSDB can no longer do its job of identifying time series, metrics, and rates of
change. Too little and everyone knows all the software we run. We mostly err
towards openness here, but with a dash of paranoia.


### Example document

```
{
  "@timestamp": "2021-04-28T19:45:28.222Z",
  "kubernetes": {
    "namespace": "namespace0",
    "node": {"name": "gke-apps-node-name-0"},
    "pod": {"name": "pod-name-pod-name-0"},
    "volume": {
      "name": "volume-0",
      "fs": {
        "capacity": {"bytes": 7883960320},
        "used": {"bytes": 12288},
        "inodes": {"used": 9, "free": 1924786, "count": 1924795},
        "available": {"bytes": 7883948032}
      }
    }
  },
  "metricset": {"name": "volume", "period": 10000},
  "fields": {"cluster": "elastic-apps"},
  "host": {"name": "gke-apps-host-name0"},
  "agent": {
    "id": "96db921d-d0a0-4d00-93b7-2b6cfc591bc3",
    "version": "7.6.2",
    "type": "metricbeat",
    "ephemeral_id": "c0aee896-0c67-45e4-ba76-68fcd6ec4cde",
    "hostname": "gke-apps-host-name-0"
  },
  "ecs": {"version": "1.4.0"},
  "service": {"address": "service-address-0", "type": "kubernetes"},
  "event": {
    "dataset": "kubernetes.volume",
    "module": "kubernetes",
    "duration": 132588484
  }
}
```


### Fetching new data

This data comes from a private elastic-apps k8s cluster. If you have
access to that cluster's logging then you can update the test data by
fetching raw, non-anonymized data from its monitoring cluster with something
like elastic-dump. You want one document per line.
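
For example, a dump with the elasticsearch-dump (`elasticdump`) utility might look
roughly like the sketch below. The host, index pattern, and flag values here are
assumptions rather than a tested recipe; use whatever tool and options give you one
JSON document per line.

```
# Hypothetical sketch only: host, index pattern, and flag values are assumptions.
# The goal is simply a file with one raw JSON document per line.
elasticdump \
  --input=https://monitoring-cluster.example.com:9200/metricbeat-7.6.2 \
  --output=data.json \
  --type=data \
  --limit=10000
```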

Now that you have the data, the goal is to get the anonymizer to run on the
entire dump in one go. But, in order to do that, let's slice out parts of the
file and make sure they can be processed first. Then we'll run them all at once.

First, let's tackle the documents with `message` fields. These are super diverse
and likely to fail on every new batch of data.
```
mkdir tmp
cd tmp

grep \"message\" ../data.json | split -l 100000
for file in x*; do
echo $file
python ../tsdb/_tools/anonymize.py < $file > /dev/null && rm $file
done
```

Some of the runs in the `for` loop are likely to fail. Those will leave files
around so you can fix the anonymizer and retry. We try to keep as many real
messages as we can without leaking too much information, so any new kind of
message will fail this process. Modify the script to redact the new message
types that show up in the errors, then rerun the for loop above until it
finishes without error.
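
If it helps, a quick frequency count over the leftover chunks can show which
message types remain. This is only a sketch and assumes the dump keeps the raw
`"message": "..."` key formatting:

```
# Rough helper: list the most common message prefixes left in the failed chunks
# so new redaction rules can be added. Assumes the documents keep the raw
# "message" key formatting from the dump.
grep -ho '"message": *"[^"]\{0,60\}' x* | sort | uniq -c | sort -rn | head -20
```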

Now run the same process on documents without a message.

```
grep -v \"message\" ../data.json | split -l 100000
for file in x*; do
echo $file
python ../tsdb/_tools/anonymize.py < $file > /dev/null && rm $file
done
```

These are less likely to fail, but there are more documents without messages,
and any new failure will require a little thought about whether the field it
found needs to be redacted, modified, or passed through unchanged. Good luck.

Once all of those finish you should run the tool on the newly acquired data
from start to finish. You can't split the input data or the anonymizer won't
make consistent time series ids. Do something like:

```
cd ..
rm -rf tmp
python tsdb/_tools/anonymize.py < data.json > documents.json
```

Once that finishes, generate `documents-1k.json` for easy testing:
```
head -n 1000 documents.json > documents-1k.json
```
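
As an optional sanity check before wiring the files into the track, confirm the
line counts look plausible (the 1k sample should have exactly 1000 lines):

```
# Optional sanity check: documents-1k.json should be exactly 1000 lines and
# documents.json should match the raw dump's line count.
wc -l data.json documents.json documents-1k.json
```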

### Parameters

This track allows overriding the following parameters using `--track-params` (see the example invocation after the list):

* `bulk_size` (default: 10000)
* `bulk_indexing_clients` (default: 8): Number of clients that issue bulk indexing requests.
* `ingest_percentage` (default: 100): A number between 0 and 100 that defines how much of the document corpus should be ingested.
* `number_of_replicas` (default: 0)
* `number_of_shards` (default: 1)
* `force_merge_max_num_segments` (default: unset): An integer specifying the maximum number of segments the force-merge operation should merge down to.
* `index_mode` (default: time_series): Whether to create a standard index (`standard`) or a time series index (`time_series`).
* `codec` (default: default): The codec to use when compressing the index. `default` uses more space and less CPU; `best_compression` uses less space and more CPU.
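
For example, a race that overrides a few of these parameters might look like the
sketch below; the pipeline and target host are assumptions for a benchmark-only
run against an existing cluster:

```
# Hypothetical invocation: the target host is an assumption, adjust to your setup.
esrally race --track=tsdb \
  --pipeline=benchmark-only --target-hosts=localhost:9200 \
  --track-params="bulk_indexing_clients:4,index_mode:standard,codec:best_compression"
```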

### License

All articles that are included are licensed as CC-BY-NC-ND (https://creativecommons.org/licenses/by-nc-nd/4.0/)

This data set is licensed under the same terms. Please refer to https://creativecommons.org/licenses/by-nc-nd/4.0/ for details.