
meta-tags (previously known as extrinsic tags) #660

Closed
shalstea opened this issue Jun 21, 2017 · 22 comments

@shalstea

Metrics 2.0 supports adding metadata to metrics, but at the cost of network bandwidth. A lot of metadata can be very static (e.g. the data center a machine is in). It would be very nice to have a means of bulk-loading / updating static metadata and having it merge in with tags.

For example, every metric might have a host tag. Associated with the host is a collection of static data: cluster, data center, OS, OS version, etc. We would like to feed this in. From Grafana this would appear as tags on the metric.

@Dieterbe
Contributor

Dieterbe commented Jun 22, 2017

these are all known issues.

  1. all metadata currently transmitted (especially in the mdm format) is very redundant and too resource intensive (in network bandwidth, but also in (de)serialization overhead in the form of CPU time and memory allocations)
  2. any metadata should thus be able to be sent/maintained asynchronously from the data stream (see metricdefinition refactor #199 for some concrete ideas on how to address this)
  3. this applies to both intrinsic and extrinsic properties (tags that affect the metric id and tags that don't)
  4. as for exposing to grafana, I see 2 main ways: A) extending the query language to support tag-based searching/filtering etc. B) a custom datasource for metrictank that is a superset of the graphite datasource, but with extensions for displaying tags in the editor etc.

related : #352

@shalstea
Author

shalstea commented Jun 30, 2017

We would love to be able to add detailed descriptions to metrics, with some way to access them by clicking on a visual indicator on a panel that uses the metric. Users would then be able to really understand what the metric means.

Other static data would include things like units, whether to graph as a rate, etc.

@Dieterbe
Contributor

Dieterbe commented Jul 1, 2017

you're describing a feature request for grafana. maybe @daniellee or @torkelo can advise where to direct that topic.
I believe this at least partially overlaps with grafana/grafana#1153

@TheStigB

Some additional details about the metadata / extrinsic tags.

The goal is to provide better filtering and group-by capabilities in Grafana / metrictank, by being able to augment the core tags with additional tags / metadata that should work like any first-class tag in Grafana.

So we would like to be able to upload, on a daily or weekly basis, a set of tags that refer to a core tag.
Example (the first column is the primary tag, the remaining columns are additional tags):

| HostId | Rollout Stage | OS | Location | ... |
| --- | --- | --- | --- | --- |
| 100 | S1 | RHEL7.1 | DataCenter1 | |

Initially it's ok if we only have one version of the extrinsic tags; having historical versions and handling changes in them over time is a nice-to-have, but not a requirement at this point.
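For illustration, here is a minimal Go sketch of what one such upload row could look like (all names are hypothetical, not actual metrictank types):

```go
package main

import "fmt"

// metaTagUpload is a hypothetical shape for one row of the bulk upload:
// every series carrying the primary tag gets the additional tags merged in.
type metaTagUpload struct {
	primaryTag     string            // e.g. "HostId=100"
	additionalTags map[string]string // tags to expose as first-class tags
}

func main() {
	row := metaTagUpload{
		primaryTag: "HostId=100",
		additionalTags: map[string]string{
			"RolloutStage": "S1",
			"OS":           "RHEL7.1",
			"Location":     "DataCenter1",
		},
	}
	fmt.Printf("%s -> %v\n", row.primaryTag, row.additionalTags)
}
```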

@Dieterbe
Contributor

The big question here, I think, is how we want to do the associations.
Am I reading this right that you'd like to assign the additional tags by hooking them onto pre-existing tags?
for example, assigning OS and location tags to metrics by specifying a hostId tag, such that all metrics with that hostId get the tags?
implementation-wise it could take the form of a set of dynamic high-level rules that we take into account when we query the index, or we could actually go and apply all these tags to every single metric.

looping in @DanCech; we've previously discussed this but I don't remember the outcome.

@TheStigB

You got it. Basically a "table join" on a given tag between the provided (per metric point) tags and the additional tags.

As a future enhancement it might be interesting to have two primary tags, so we can limit additional tags to a given namespace. (No need to do this for the MVP.)

As for "the upload", probably the easiest is via a special message / topic over Kafka. I don't think one upload needs to be atomic; each individual "key tag value" and its "additional tags" can be handled independently from the other tag values.

@DanCech
Contributor

DanCech commented Jul 12, 2018

After some lengthy discussions, we came up with a concept that may work for this functionality.

We would start by adding a separate index to hold meta-records that map tag queries (used to identify target series) to a set of "extrinsic" tag/value pairs to be added to those series.

We would then add a second reverse index to allow looking those records up by tag & value, in the same way as the existing reverse index is used to look up series.

The existing index would be augmented by adding a list of the meta-records associated with each series.

When altering the meta-records the system would look up the associated series (by executing the tag queries against the primary reverse index) and update their lists.

When adding a series, it would be compared against each entry in the meta-index to build the list.

We may also want to maintain in the meta-index a list of all series that are associated with each record, since we already have to do the work to produce the lists of meta-records associated with each series. This would be a cost in terms of index size, but would potentially be a big performance boost at query time (see below).

When executing a query, there are quite a few complex scenarios that would need to be dealt with, mostly around how to deal with query conditions. Basically we would need to pick a query condition that requires a non-empty value (as we do already), then do a lookup for that condition in both reverse indexes. Series matched from the primary index would be added to the prospective result set as normal, while any results from the lookup in the second reverse index would be used to look up series associated with the meta-records (either by executing the tag queries against the main reverse index or by getting a list of matching series directly if they were stored in the meta-record), and those would also be added to the prospective result set.

At this point each entry in the result set would need to be "enriched" with the tags from the associated meta-records by walking the list of meta-records in each series (we need to determine how conflicts would be handled when a meta-record contains extrinsic tag values that conflict with other meta-records associated with the series or with intrinsic tags), then we would be able to apply the rest of the query conditions to filter the enriched result down to the final set of series.
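A sketch of the data structures this design implies, in Go; all names are hypothetical and this is not the actual metrictank implementation:

```go
package main

import "fmt"

// metaRecord pairs tag queries (identifying target series) with the set of
// extrinsic tag/value pairs to be added to those series.
type metaRecord struct {
	queries   []string          // run against the primary index, e.g. ["host=~dc1-.*"]
	extrinsic map[string]string // extrinsic tags to attach, e.g. {"datacenter": "dc1"}
}

type seriesID string

// metaIndex holds the records plus the second reverse index described above:
// extrinsic tag -> value -> records defining that pair.
type metaIndex struct {
	records []metaRecord
	reverse map[string]map[string][]int
	// optionally: record index -> associated series, trading index size
	// for skipping re-execution of the tag queries at query time.
	series map[int][]seriesID
}

func (m *metaIndex) add(r metaRecord) int {
	id := len(m.records)
	m.records = append(m.records, r)
	for tag, val := range r.extrinsic {
		if m.reverse[tag] == nil {
			m.reverse[tag] = map[string][]int{}
		}
		m.reverse[tag][val] = append(m.reverse[tag][val], id)
	}
	return id
}

func main() {
	m := &metaIndex{reverse: map[string]map[string][]int{}, series: map[int][]seriesID{}}
	m.add(metaRecord{
		queries:   []string{"host=~dc1-.*"},
		extrinsic: map[string]string{"datacenter": "dc1"},
	})
	// Query side: a condition like datacenter=dc1 hits the reverse index
	// to find the records whose target series should join the result set.
	fmt.Println("records for datacenter=dc1:", m.reverse["datacenter"]["dc1"])
}
```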

@shanson7
Collaborator

shanson7 commented Aug 6, 2018

For a concrete use case, we will have tags like datacenter that will match millions of series and only have a couple of values. In this case, we would need a list of thousands of host values that map to a particular dc.

As an alternative implementation, it could be possible to send the extrinsic tags along with the normal tags, but not use them to calculate the series id. In that case, metrictank would just need to update the index when a full MetricData message is received with different extrinsic tags. If they don't change then the MetricPoint optimization can still be used. This would mean that the application of extrinsic tags would be handled externally to MT.
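As a toy illustration of that alternative: derive the series id from the name and intrinsic tags only, so re-sending the same series with different extrinsic tags never creates a new id. The hashing scheme below is an assumption for illustration, not metrictank's actual id format:

```go
package main

import (
	"crypto/md5"
	"fmt"
	"sort"
	"strings"
)

// seriesKey hashes the name plus the sorted intrinsic tags only; extrinsic
// tags deliberately stay out of the hash, so a MetricData message carrying
// changed extrinsic tags updates the index entry instead of creating a new
// series, and the MetricPoint optimization keeps working in between.
func seriesKey(name string, intrinsic map[string]string) string {
	parts := make([]string, 0, len(intrinsic))
	for k, v := range intrinsic {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return fmt.Sprintf("%x", md5.Sum([]byte(name+";"+strings.Join(parts, ";"))))
}

func main() {
	intrinsic := map[string]string{"host": "host1"}
	// extrinsic tags, e.g. datacenter=dc1, would travel alongside in the
	// message but never feed into seriesKey, so the id stays stable.
	fmt.Println(seriesKey("cpu.percent.idle", intrinsic))
}
```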

@shanson7
Collaborator

Thinking about this a little this morning, I've got the following notes:

  1. Memory usage - The memory idx is currently about 60% of our heap usage. In our use case, every extrinsic tag we add will match almost every series. That means giant sets of metric ids.
  2. Cost of Update - We will likely do updates every few hours. Having to run every series through an expression match could be quite costly in this scenario, especially seeing as most of them are unlikely to change.

In our particular use case, what we actually need is a mapping from one tag key/value to another. As alluded to in the original post, most of ours will be keyed off of the host tag. Almost every series has a host tag, and there are many orders of magnitude more series than there are hosts. With this in mind, I propose that we allow a simple mapping of intrinsic tag to extrinsic tags.

That way we could upload something like dc=dc1 maps to host=[host1,host3,host6,...].

At query time, we determine if an expression references an extrinsic tag and do an efficient lookup (likely a map) to determine if the current series' mapped key matches the requested value. e.g. if someone asks for something like name=abc AND dc=dc1, and we find abc;host=host1 and abc;host=host2 (using the name=abc filter), we can quickly look up extrinsic_tags["dc"]["dc1"]["host1"] and extrinsic_tags["dc"]["dc1"]["host2"] to find that only host1 matches (a sketch of this lookup follows the list below).

  1. Memory Usage - With this approach, each extrinsic tag just needs the set of mapped intrinsic tags that match. This is a much smaller set in our case (about 20k vs 400M).
  2. Cost of Update - If we require that the entire set be updated at once (i.e. all mappings defined for dc=dc1 must be supplied for each update), then it's pretty efficient as a map insert/update.
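A compact Go sketch of that lookup, using the nested-map shape from the comment above (names hypothetical):

```go
package main

import "fmt"

// extrinsicTags mirrors the proposed mapping:
// meta tag key -> meta tag value -> set of intrinsic tag values it maps to,
// e.g. extrinsicTags["dc"]["dc1"] holds host1, host3, host6, ...
var extrinsicTags = map[string]map[string]map[string]struct{}{
	"dc": {
		"dc1": {"host1": {}, "host3": {}, "host6": {}},
	},
}

// matches reports whether a series' intrinsic tag value (here: its host)
// satisfies an extrinsic condition like dc=dc1, in O(1) map lookups.
// Indexing missing keys on nested maps is safe in Go and returns false.
func matches(metaKey, metaValue, intrinsicValue string) bool {
	_, ok := extrinsicTags[metaKey][metaValue][intrinsicValue]
	return ok
}

func main() {
	// Query: name=abc AND dc=dc1. Candidates found via the name=abc filter:
	// abc;host=host1 and abc;host=host2.
	fmt.Println(matches("dc", "dc1", "host1")) // true: kept
	fmt.Println(matches("dc", "dc1", "host2")) // false: filtered out
}
```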

@Dieterbe Dieterbe added this to the 1.0 milestone Aug 22, 2018
@Dieterbe Dieterbe changed the title Static Meta-Data uploading and merging into tags meta-tags (previously known as extrinsic tags) Oct 24, 2018
@Dieterbe
Contributor

Dieterbe commented Oct 24, 2018

from here on, this ticket is about the meta tags. the other tangential ideas (such as uploading generic info for display only but not for searching/filtering) can be done in a new ticket.

we're in the design phase for this feature. @replay can you share your work-in-progress design doc?

@replay
Contributor

replay commented Jan 22, 2019

A status update on where we're at:
Last week I started working on the implementation as planned in the design doc.
So far I have written the API calls to add and modify the rules which define the associations between meta-tags (extrinsic) and metric-tags (intrinsic); that's pushed in this branch: #960

Over the course of this and next week I'll implement the procedures to update the new index data structures when meta-records get added/modified/deleted and when metrics get added/deleted.
Querying the new index will be the last part; I suspect it will also be the most complicated one.

@shanson7 I would like to come up with an estimate of how much additional memory those new data structures are going to consume; I will take the current plans to optimize memory efficiency into account (@robert-milan is working on that). Could you maybe provide some numbers for your typical planned use cases of the meta tag index? Such as:

  • Number of series per MT
  • Average number of metric/intrinsic tags per series
  • Total number of meta tag rules defining associations between meta tags and metric/intrinsic tags
  • Average number of meta tags per series

@shanson7
Collaborator

> Number of series per MT

We have about 4 million series in the index per MT instance.

> Average number of metric/intrinsic tags per series

I'm not sure how to calculate this, but every series has at least 5 tags, so the average would be 6 or 7 tags.

> Total number of meta tag rules defining associations between meta tags and metric/intrinsic tags

We would likely have at least one per host, so in the tens of thousands, each one affecting a small set of the series.

> Average number of meta tags per series

Probably 5-10.

@replay
Contributor

replay commented Feb 4, 2019

For reference i'm linking the current design doc from here: https://docs.google.com/document/d/1Kk3QYd3X1yIEUcRFigEjdx23dgZMEH2lM4pmka9oAcc

@replay
Contributor

replay commented May 8, 2019

Update:
We just merged PR #1301.
#1301 merges a part of what's in the branch of #960, plus some improvements.
Next I'll rebase #960 onto the current master and then create more small PRs to merge the modifications of that branch piece by piece. That way they are easier to review, it's easier to keep them concise, and it's easier to ensure that there are no unexpected regressions.

There will be at least 4 follow-up PRs; 2-4 mostly just copy the modifications over from the branch of #960:

  1. Move the input validation for tag queries into the API layer; currently that's in the index. We discussed that in Meta tags part1: meta record data structures and corresponding CRUD api calls #1301 (comment), and this should be relatively simple
  2. Refactor the query expression type to make it more flexible: to implement the querying/filtering by meta record we need to be able to build sub-queries from meta records, which requires that extra flexibility
  3. Start using the meta records to build sub-queries, this will allow us to query by meta record
  4. Implement the enrichment (at first without a cache)

After the above is done, I'll need to:

  • Add a way to persist the meta records into a permanent store
  • Implement the ability to swap out a whole set of meta records, instead of updating them one-by-one
  • Implement the enrichment cache

@replay
Contributor

replay commented Aug 28, 2019

Status Update:

This refers to the above comment (#660 (comment)):

The features listed in points 1 - 4 are done and working in my test environment. The "enrichment cache" mentioned at the bottom is also done and merged. These changes have not been deployed in any production environment yet, as far as I'm aware.

A PR for the ability to swap out a whole set of meta records is waiting for review: #1442

The persisting of meta records is not implemented yet; I'm currently working on that. The plan is to add a new table in Cassandra/BigTable if the feature flag meta-tag-support is enabled; on startup the records would be read from there.

Ideas for improvements:

  • We want to improve how the meta records get propagated across a cluster. Currently this is done via HTTP calls between cluster nodes; if one cluster node is not available, the client that submitted the original request will receive an error indicating that. This is not optimal, because with large clusters it can be normal that some number of MTs are down at any point in time, so we want to switch to a mechanism that allows us to come to a consensus among all nodes without requiring them all to be available at the same time.
  • I believe there is room for improvement in how the evaluation order of a given set of query expressions gets determined. Currently, when MT determines the order in which to evaluate the given expressions, it only takes their operators and the cardinality of the involved parts of the metric index (intrinsic index) into account. This could be made smarter by also taking the cardinality of the meta tag index into account, if the meta tag feature is enabled.

@agao48
Contributor

agao48 commented Sep 10, 2019

We did some preliminary research with Bloomberg's metrictank setup, enabling meta tags and comparing setups with varying numbers of meta tags (from no meta tags up to 3 meta tags). Details can be found here: https://gist.github.com/agao48/e3e2681d3652b8ca083b32b40733e550. More information, like memory and CPU performance while ingesting and querying, can also be provided.

@replay
Contributor

replay commented Sep 11, 2019

Thanks for the results @agao48.

Based on your profile, it looks to me like the enrichment phase is slower than expected. The enrichment works like this (a rough Go sketch follows the list):

  • When the first query gets received after some meta tag records have been modified, a new enricher gets instantiated. That enricher has a set of filter functions for all the defined meta tags.
  • After the lookup of series is done, each series gets passed to the enricher, which does a reverse lookup over the meta tag index; this yields the set of meta tags that need to be associated with each series in the result set.
  • The correct meta tags then get associated with each series in the result set, and this result also gets cached for the next time this metric needs to be enriched.
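A rough sketch of that flow with the result cache, under assumed shapes (the real enricher in metrictank differs in detail):

```go
package main

import "fmt"

type metaRec struct {
	filter func(tags map[string]string) bool // compiled from the record's tag queries
	tags   map[string]string                 // meta tags to attach on a match
}

// enricher is rebuilt whenever meta records change; per-series results are
// cached so repeat queries skip the per-record filter functions entirely.
type enricher struct {
	records []metaRec
	cache   map[string]map[string]string // series id -> resolved meta tags
}

func (e *enricher) enrich(id string, tags map[string]string) map[string]string {
	if cached, ok := e.cache[id]; ok {
		return cached // served from the enrichment cache
	}
	out := map[string]string{}
	for _, r := range e.records {
		if r.filter(tags) {
			for k, v := range r.tags {
				out[k] = v
			}
		}
	}
	e.cache[id] = out
	return out
}

func main() {
	e := &enricher{
		records: []metaRec{{
			filter: func(t map[string]string) bool { return t["host"] == "host1" },
			tags:   map[string]string{"dc": "dc1"},
		}},
		cache: map[string]map[string]string{},
	}
	fmt.Println(e.enrich("abc;host=host1", map[string]string{"host": "host1"}))
	fmt.Println(e.enrich("abc;host=host1", nil)) // second call: cache hit
}
```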

I have a few questions regarding your benchmarks:

  1. In your benchmarks with no meta tags, was the meta tag support feature flag turned on or off? This makes a difference, because if the support is turned on, then even if no meta tags are defined certain parts of the lookup will be a bit slower.
  2. In your test query seriesByTag('namespace=os','name=cpu.percent.idle.g'), is the namespace tag a meta tag? When you did your benchmarks without meta tags, did you run the same query while the namespace tag wasn't present? Or did you run a different query during the benchmarks without meta tags?
  3. There is a config setting called enrichment-cache-size; was that set to the default of 10000? If so, then I'm surprised that so much CPU time was spent on the enrichment, because if your result set consisted of only 21 time series, then their enrichment results should have been cached on the first query and reused ever after. In that case I'd need to check what I can do to improve the enrichment cache speed.

@replay
Contributor

replay commented Sep 11, 2019

For completeness I'm copy-pasting the reply that I got from @agao48:

  1. The original benchmark was with meta tags enabled but no meta tags added. I updated the gist with meta tags completely disabled. As you stated already, lookup was faster when the meta tag feature was completely disabled.
  2. That query contains no meta tags at all, so all the benchmarks use a query with no meta tags. I did run a test locally where we specified one meta tag, and lookup seemed to be faster. If you would like, I can rerun those tests to get you the stats there.
  3. enrichment-cache-size was set to the default of 10000. Before each test, I queried the data once, also thinking the data should get cached. I can experiment with an extremely low value to test the impact of cache size if you would like.

@replay
Contributor

replay commented Sep 11, 2019

@agao48 I think we have found a bug in how the enricher gets instantiated and fixed it with this PR:
#1455

If you get a chance, could you please retry the same benchmark with the latest master that includes this PR? Thanks.

@agao48
Contributor

agao48 commented Sep 11, 2019

@replay Rebuilt and tested with the master that has that fix. Results in table format below; it looks a lot better.

Query:

GET http://localhost:6060/render?target=sumSeries(seriesByTag('namespace=os','name=cpu.percent.idle.g'))&from=1567922400&until=1567954800&format=json

Test: $vegeta attack -duration 120s -rate 10 -timeout 0

Before: results from initial benchmarking

After: results after fixing the bug in the enricher instantiation

| Latencies | 1 tag before | 1 tag after | 2 tags before | 2 tags after | 3 tags before | 3 tags after |
| --- | --- | --- | --- | --- | --- | --- |
| mean | 205.651992ms | 12.962963ms | 213.400526ms | 13.47646ms | 220.924238ms | 13.324256ms |
| 50th pct | 189.744883ms | 12.187033ms | 188.395671ms | 12.517429ms | 197.006305ms | 12.30844ms |
| 90th pct | 291.740413ms | 18.750181ms | 358.45829ms | 20.224633ms | 344.310485ms | 20.233422ms |
| 99th pct | 400.456912ms | 27.464192ms | 505.764124ms | 31.077227ms | 591.181863ms | 34.614008ms |
| max | 786.23406ms | 46.510145ms | 834.770011ms | 49.137374ms | 894.33863ms | 99.592477ms |

@Dieterbe Dieterbe modified the milestones: vnext, sprint-2 Oct 7, 2019
@fkaleo fkaleo modified the milestones: sprint-2, sprint-3 Oct 28, 2019
@replay replay modified the milestones: sprint-3, sprint-4 Nov 18, 2019
@robert-milan robert-milan modified the milestones: sprint-4, sprint-5 Dec 9, 2019
@replay
Contributor

replay commented Dec 17, 2019

Can we close this issue? As far as I'm aware the feature is "done", as in "it works". If any further issues come up, they would be filed as new issues. Or would you prefer to wait until you've deployed it, @agao48?

@shanson7
Collaborator

I'm ok with closing this. I think we can open more specific issues if/when we find the need.
