Create a track for tsdb data #222
Merged
Changes from 1 commit. 14 commits in this pull request, all by nik9000:

- ecae40e Create a track for tsdb data
- f7def01 Apply suggestions from code review
- d02d012 Apply suggestions from code review
- 042bb26 Format strings
- 59b4d45 Move message replacement to loop
- 83b42ad Black
- e0fef24 Fix different names
- a2873dd Point
- c7ba60e Size
- be91414 Move
- f329f2b Fixup
- 6ef62df Time series
- 3f5eaad start and end time
- 9821c61 License
@@ -0,0 +1,119 @@
## TSDB Track

This data is anonymized monitoring data from elastic-apps designed to test
our TSDB project. TSDB needs us to be careful how we anonymize. Too much
randomization and TSDB can no longer do its job identifying time series and
metrics and rates of change. Too little and everyone knows all the software we
run. We mostly err towards openness here, but with a dash of paranoia.
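The anonymizer therefore has to be deterministic: the same real name must always map to the same replacement, or the time series fall apart. A toy shell illustration of that property (not how `anonymize.py` actually works, just the idea):

```
# Toy example only: a hash gives a stable pseudonym, so the same pod name
# always anonymizes to the same value and its time series stays intact.
echo -n "some-real-pod-name" | sha1sum | cut -c1-8
```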
### Example document

```
{
  "@timestamp": "2021-04-28T19:45:28.222Z",
  "kubernetes": {
    "namespace": "namespace0",
    "node": {"name": "gke-apps-node-name-0"},
    "pod": {"name": "pod-name-pod-name-0"},
    "volume": {
      "name": "volume-0",
      "fs": {
        "capacity": {"bytes": 7883960320},
        "used": {"bytes": 12288},
        "inodes": {"used": 9, "free": 1924786, "count": 1924795},
        "available": {"bytes": 7883948032}
      }
    }
  },
  "metricset": {"name": "volume", "period": 10000},
  "fields": {"cluster": "elastic-apps"},
  "host": {"name": "gke-apps-host-name0"},
  "agent": {
    "id": "96db921d-d0a0-4d00-93b7-2b6cfc591bc3",
    "version": "7.6.2",
    "type": "metricbeat",
    "ephemeral_id": "c0aee896-0c67-45e4-ba76-68fcd6ec4cde",
    "hostname": "gke-apps-host-name-0"
  },
  "ecs": {"version": "1.4.0"},
  "service": {"address": "service-address-0", "type": "kubernetes"},
  "event": {
    "dataset": "kubernetes.volume",
    "module": "kubernetes",
    "duration": 132588484
  }
}
```
### Fetching new data

To fetch new data, grab it from elastic-apps's monitoring cluster with something
like elastic-dump. You want one document per line.
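For instance, a dump invocation along these lines should produce newline-delimited JSON; the host, index name, and exact flags here are assumptions, so check them against whichever dump tool you actually use:

```
# Placeholder host and index name; point these at the real monitoring cluster.
elasticdump \
  --input=https://monitoring-cluster.example.com:9200/metricbeat-7.6.2 \
  --output=data.json \
  --type=data

# Quick sanity check: every line of data.json should be a single JSON document.
head -n 1 data.json
wc -l data.json
```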
OK! Now that you have the data, the goal is to get the anonymizer to run on the
entire dump in one go. But, in order to do that, let's slice out parts of the
file and make sure they can be processed first. Then we'll run them all at once.

First, let's tackle the documents with `message` fields. These are super diverse
and likely to fail on every new batch of data.
```
mkdir tmp
cd tmp

grep \"message\" ../data.json | split -l 100000
for file in x*; do
  echo $file
  python ../tsdb/_tools/anonymize.py < $file > /dev/null && rm $file
done
```
Some of the runs in the `for` loop are likely to fail. Those will leave files
around so you can fix the anonymizer and retry. We try to keep as many real
messages as we can without leaking too much information. So any new kind of
message will fail this process. Modify the script, redacting new message types
that come from the errors. Rerun the `for` loop above until it finishes without
error.
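Before changing the script it can help to see what a failed chunk actually contains; a rough survey of its `message` values with standard tools looks something like this (`xaa` stands in for whichever chunk failed):

```
# Count the most common message values in a failed chunk so you can decide
# which new message types need a redaction rule in the anonymizer.
grep -o '"message": *"[^"]*"' xaa | sort | uniq -c | sort -rn | head -n 20
```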
Now run the same process on documents without a message.

```
grep -v \"message\" ../data.json | split -l 100000
for file in x*; do
  echo $file
  python ../tsdb/_tools/anonymize.py < $file > /dev/null && rm $file
done
```
These are less likely to fail, but there are more documents without messages,
and any new failure will require a little thought about whether the field it
found needs to be redacted, modified, or passed through unchanged. Good luck.
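If it is not obvious which field tripped the anonymizer, one option is to count the leaf field paths in the failed chunk; this assumes `jq` is installed and, again, `xaa` is just the name of a failed chunk:

```
# List leaf field paths (e.g. kubernetes.volume.fs.used.bytes) by frequency
# to spot fields the anonymizer has not seen before.
jq -r 'paths(scalars) | map(tostring) | join(".")' xaa | sort | uniq -c | sort -rn | head -n 40
```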
Once all of those finish you should run the tool on the newly acquired data
from start to finish. You can't split the input data or the anonymizer won't
make consistent time series ids. Do something like:

```
cd ..
rm -rf tmp
python tsdb/_tools/anonymize.py < data.json > documents.json
```
Once that finishes you need to generate `documents-1k.json` for easy testing:

```
head -n 1000 documents.json > documents-1k.json
```
### Parameters

This track allows you to overwrite the following parameters using `--track-params` (an example invocation follows the list):

* `bulk_size` (default: 10000)
* `bulk_indexing_clients` (default: 8): Number of clients that issue bulk indexing requests.
* `ingest_percentage` (default: 100): A number between 0 and 100 that defines how much of the document corpus should be ingested.
* `number_of_replicas` (default: 0)
* `number_of_shards` (default: 1)
* `force_merge_max_num_segments` (default: unset): An integer specifying the maximum number of segments the force-merge operation should use.
* `index_mode` (default: standard): Whether to make a standard index (`standard`) or a time series index (`time_series`).
* `codec` (default: default): The codec to use when compressing the index. `default` uses more space and less CPU; `best_compression` uses less space and more CPU.
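As a sketch of how these parameters might be passed, a Rally invocation could look roughly like this; the target host and the chosen values are examples only, and the exact command depends on your Rally version and setup:

```
# Example values only; adjust the host and parameters for your environment.
esrally race --track=tsdb --target-hosts=localhost:9200 --pipeline=benchmark-only \
  --track-params="bulk_indexing_clients:4,ingest_percentage:10,index_mode:time_series"
```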
Can you please mention somewhere prominent that by default it only works with Elasticsearch 8.x?
:+1: