@cyriltovena cyriltovena commented Feb 10, 2022

What this PR does / why we need it:

This PR adds usage reporting from Loki to grafana.com.

It adds a new module that never fails. When running, the module tries to reach a consensus on the cluster's unique ID, and then every running component sends a report every hour. The cluster ID is used on the server side to aggregate the reports from all components.

How does the consensus work?

Ingesters are the leaders in the consensus, meaning they are the only components that can store the unique ID in the object store. They do that using the Loki kv store, plus the object store to persist the data across restarts.

Each ingester does the following:

  • Check if the cluster ID exists in the kv store.
    • If it does, verify that it also exists in the object store and reconcile if needed.
  • Check if the cluster ID exists in the object store and, if so, reconcile the kv store.
  • If neither is true, the ingester tries to CAS the kv store to set a new cluster ID; if it wins, it stores the cluster ID in the object store.
  • Finally, it uses that cluster ID to send reports.
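
The steps above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `memStore`, `errNotFound`, and `leaderClusterID` are hypothetical names, the in-memory store stands in for both the kv store and the object store, and a mismatch between the two stores is not handled because the description does not say which side wins.

```go
package main

import (
	"errors"
	"sync"
)

// errNotFound is a hypothetical sentinel meaning "no cluster ID stored here".
var errNotFound = errors.New("cluster id not found")

// memStore is an in-memory stand-in for both the kv store and the object
// store. It is illustrative only and is not Loki's or dskit's API.
type memStore struct {
	mu sync.Mutex
	id string
}

// Get returns the stored ID, or errNotFound when nothing is stored.
func (m *memStore) Get() (string, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.id == "" {
		return "", errNotFound
	}
	return m.id, nil
}

// Put overwrites the stored ID unconditionally.
func (m *memStore) Put(id string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.id = id
}

// CAS writes id only if nothing is stored yet and reports whether it won.
func (m *memStore) CAS(id string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.id != "" {
		return false
	}
	m.id = id
	return true
}

// leaderClusterID follows the ingester steps from the description.
func leaderClusterID(kv, obj *memStore, newID func() string) (string, error) {
	// Step 1: the kv store knows the ID; make sure the object store has it too.
	if id, err := kv.Get(); err == nil {
		if _, err := obj.Get(); errors.Is(err, errNotFound) {
			obj.Put(id)
		}
		return id, nil
	}
	// Step 2: only the object store knows the ID; repopulate the kv store.
	if id, err := obj.Get(); err == nil {
		kv.CAS(id)
		return kv.Get()
	}
	// Step 3: nobody knows the ID; race to set one with a compare-and-swap.
	candidate := newID()
	if kv.CAS(candidate) {
		// The winner persists the ID so it survives restarts.
		obj.Put(candidate)
		return candidate, nil
	}
	// The loser adopts whatever the winner stored.
	return kv.Get()
}
```

Calling `leaderClusterID` twice against the same pair of stores returns the same ID both times, and calling it with an empty object store but a populated kv store copies the ID into the new object store.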

Other components (followers) only retry indefinitely to fetch the cluster ID from the object store; once they have it, they start sending reports with that ID.

If unmarshalling the cluster ID fails many times, any component can decide to nuke it.

What happens if we change to a new object store?

Since we also store the cluster ID in the kv store, an ingester will notice that it is missing from the new object store and will reconcile.
This means that if you nuke the object store AND the kv store at the same time, you'll end up with a new cluster ID, but we consider this case to be rare.

What stats are we sending?

Full disclosure: we're not sending any confidential data, only information about:

  • What object store is being used?
  • What's the scale of the data being ingested?
  • How fast are we ingesting and flushing?
  • How fast are queries in that cluster?
  • Version, CPU count and memory size.

See the JSON below.

This is a report from a single binary; if you're running multiple components, some stats may be missing from one component to another.

json report
{
	"clusterID": "f06b33a4-be8a-45d5-a8f9-9667f003b700",
	"createdAt": "2022-02-09T08:32:10.26395+01:00",
	"interval": "2022-02-10T08:36:10.26395+01:00",
	"target": "all",
	"version": {
		"version": "",
		"revision": "",
		"branch": "",
		"buildUser": "",
		"buildDate": "",
		"goVersion": "go1.17.2"
	},
	"os": "darwin",
	"arch": "amd64",
	"edition": "oss",
	"metrics": {
		"ingester_flushed_chunks_age_seconds": {
			"stddev": 0,
			"stdvar": 0,
			"avg": 32857.973619,
			"count": 1,
			"min": 32857.973619,
			"max": 32857.973619
		},
		"num_cpu": 16,
		"distributor_replication_factor": 1,
		"ingester_streams_count": 1,
		"query_metric_bytes_per_second": {
			"avg": 86512.48688046652,
			"count": 1715,
			"min": 0,
			"max": 7001745,
			"stddev": 578305.5424162439,
			"stdvar": 334437300389.34607
		},
		"query_metric_lines_per_second": {
			"min": 0,
			"max": 308201,
			"stddev": 25884.49007341756,
			"stdvar": 670006826.3608522,
			"avg": 3873.5586005830855,
			"count": 1715
		},
		"ingester_active_tenants": 1,
		"ingester_target_size_bytes": 1572864,
		"memstats": {
			"sys": 70534152,
			"heap_alloc": 33771944,
			"num_gc": 101,
			"gc_cpu_fraction": 0.00025775059945585605,
			"alloc": 33771944,
			"total_alloc": 1515006248,
			"heap_inuse": 41517056,
			"stack_inuse": 3997696,
			"pause_total_ns": 19223528
		},
		"compactor_retention_enabled": "false",
		"distributor_bytes_received": {
			"total": 30968,
			"rate": 516.1260609192866
		},
		"ingester_flushed_chunks": {
			"total": 0,
			"rate": 0
		},
		"query_log_bytes_per_second": {
			"stddev": 663299.4385104065,
			"stdvar": 439966145128.22064,
			"avg": 101709.73578717193,
			"count": 2744,
			"min": 0,
			"max": 7778734
		},
		"store_object_type": "filesystem",
		"ingester_flushed_chunks_lines": {
			"avg": 594,
			"count": 1,
			"min": 594,
			"max": 594,
			"stddev": 0,
			"stdvar": 0
		},
		"ingester_wal": "enabled",
		"ingester_chunk_created": {
			"total": 0,
			"rate": 0
		},
		"ingester_compression": "gzip",
		"ingester_flushed_chunks_lifespan_seconds": {
			"stdvar": 0,
			"avg": 9.126944444444444,
			"count": 1,
			"min": 9.126944444444444,
			"max": 9.126944444444444,
			"stddev": 0
		},
		"ingester_flushed_chunks_utilization": {
			"avg": 0.0017712910970052083,
			"count": 1,
			"min": 0.0017712910970052083,
			"max": 0.0017712910970052083,
			"stddev": 0,
			"stdvar": 0
		},
		"num_goroutine": 258,
		"distributor_lines_received": {
			"total": 3871,
			"rate": 64.51575570097039
		},
		"compactor_default_retention": "31d",
		"store_schema": "v11",
		"query_log_lines_per_second": {
			"count": 2744,
			"min": 0,
			"max": 315413,
			"stddev": 27780.167284388925,
			"stdvar": 771737694.3486327,
			"avg": 4281.011297376088
		},
		"store_index_type": "boltdb-shipper",
		"ingester_flushed_chunks_bytes": {
			"min": 3008,
			"max": 3008,
			"stddev": 0,
			"stdvar": 0,
			"avg": 3008,
			"count": 1
		}
	}
}

Special notes for your reviewer:

Found a bug in dskit and had to re-vendor a fix; see grafana/dskit#132.

Fixes #5062

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

@cyriltovena cyriltovena requested a review from a team as a code owner February 10, 2022 08:32
@cyriltovena cyriltovena requested a review from DanCech February 10, 2022 08:32
Signed-off-by: Cyril Tovena <[email protected]>

@jeschkies jeschkies left a comment

Could you document the option to disable reports? I think we should be transparent on this.

// sendReport sends the report to the stats server
func sendReport(ctx context.Context, seed *ClusterSeed, interval time.Time) error {
	report := buildReport(seed, interval)
	out, err := jsoniter.MarshalIndent(report, "", " ")

I thought it's gonna be Prometheus metrics. What's the reason for a custom API and store?

Contributor Author

It's very hard to read a Prometheus metric, and I needed more stat types like counter, min, max, and string!

@dannykopping dannykopping left a comment

This is pretty goddamn awesome @cyriltovena!
I'd love to add some usage stats around recording/alerting rules, but we can do this later

@cyriltovena
Contributor Author

The new dskit brought in some linter issues.

Signed-off-by: Cyril Tovena <[email protected]>

@kavirajk kavirajk left a comment

Looks super cool 🎉

@cyriltovena
Contributor Author

I'll follow up with documentation on what we collect.

@dannykopping dannykopping merged commit bbaef79 into grafana:main Feb 10, 2022
dannykopping added a commit that referenced this pull request Feb 10, 2022
* Adds leader election process

Signed-off-by: Cyril Tovena <[email protected]>

* fluke

Signed-off-by: Cyril Tovena <[email protected]>

* fixes the kv typecheck

* wire up the http client

* Hooking into loki services, hit a bug

* Add stats variable.

* re-vendor dskit and improve to never fail service

* Intrument Loki with the package

* Add changelog entry

Signed-off-by: Cyril Tovena <[email protected]>

* Fixes compactor test

Signed-off-by: Cyril Tovena <[email protected]>

* Add configuration documentation

Signed-off-by: Cyril Tovena <[email protected]>

* Update pkg/usagestats/reporter.go

Co-authored-by: Danny Kopping <[email protected]>

* Add boundary check

Signed-off-by: Cyril Tovena <[email protected]>

* Add log for success report.

Signed-off-by: Cyril Tovena <[email protected]>

* lint

Signed-off-by: Cyril Tovena <[email protected]>

* Update pkg/usagestats/reporter.go

Co-authored-by: Danny Kopping <[email protected]>

Co-authored-by: Danny Kopping <[email protected]>

Successfully merging this pull request may close these issues.

Add usage reporting capability for Loki to (optionally) send usage stats to Grafana Labs
