
feat: add per-tenant time sharding for long out-of-order ingestion #14711

Merged — 5 commits merged from time-sharding into main on Nov 6, 2024

Conversation

na-- (Member) commented Nov 1, 2024

What this PR does / why we need it:

This adds support for automatically splitting incoming log streams in the distributor by injecting a __time_shard__ label. The width of each shard is bounded by ingester.max_chunk_age/2, which should allow the ingesters to accept all logs without rejecting them as too far behind here:

loki/pkg/ingester/stream.go, lines 424–428 (at c0856bf):

    // The validity window for unordered writes is the highest timestamp present minus 1/2 * max-chunk-age.
    cutoff := highestTs.Add(-s.cfg.MaxChunkAge / 2)
    if !isReplay && s.unorderedWrites && !highestTs.IsZero() && cutoff.After(entries[i].Timestamp) {
        failedEntriesWithError = append(failedEntriesWithError, entryWithError{&entries[i], chunkenc.ErrTooFarBehind(entries[i].Timestamp, cutoff)})
        s.writeFailures.Log(s.tenant, fmt.Errorf("%w for stream %s", failedEntriesWithError[len(failedEntriesWithError)-1].e, s.labels))
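
For illustration only, here is a minimal, hedged sketch of how the distributor-side bucketing could work. It is not the code from this PR; the helper name, the exact "start_end" label-value format, and the truncation approach are assumptions:

    package main

    import (
        "fmt"
        "time"
    )

    // timeShardLabelValue maps a log timestamp into a time bucket whose width is
    // half of the ingester's max chunk age, so every entry inside a bucket stays
    // within the ingester's out-of-order validity window shown above.
    func timeShardLabelValue(ts time.Time, maxChunkAge time.Duration) string {
        bucketWidth := maxChunkAge / 2
        start := ts.Truncate(bucketWidth)
        end := start.Add(bucketWidth)
        // Hypothetical "start_end" format in unix seconds, e.g. "1730444400_1730448000".
        return fmt.Sprintf("%d_%d", start.Unix(), end.Unix())
    }

    func main() {
        ts := time.Date(2024, 11, 1, 7, 5, 0, 0, time.UTC)
        // With max_chunk_age = 2h the bucket width is 1h.
        fmt.Println(timeShardLabelValue(ts, 2*time.Hour))
    }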

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.

na-- requested a review from a team as a code owner, November 1, 2024 07:05
github-actions bot added the type/docs label, Nov 1, 2024
@@ -330,6 +330,7 @@ func (t *Loki) initDistributor() (services.Service, error) {
 	logger := log.With(util_log.Logger, "component", "distributor")
 	t.distributor, err = distributor.New(
 		t.Cfg.Distributor,
+		t.Cfg.Ingester,
na-- (Member, Author) commented:

This is somewhat risky because, in microservices mode, the distributor and ingester can be run with different -ingester.max-chunk-age CLI flag values. In practice this is very, very unlikely, but it's something we should be mindful of, because I am not sure there is a way to avoid it.

Contributor replied:

Yes, we cannot prevent it even if it happens, because the CLI flag can be set on the ingesters only, and the distributors will be out of sync for this flag...

@vlad-diachenko (Contributor) left a review:

Looks awesome 💎
Also, I left some comments and one proposal

pkg/distributor/distributor.go — 4 review comment threads (outdated, resolved)

	}
	maybeShardByTime := func(stream logproto.Stream, labels labels.Labels, pushSize int) {
		if shardStreamsCfg.TimeShardingEnabled {
			streamsByTime := shardStreamByTime(stream, labels, d.ingesterCfg.MaxChunkAge/2)
Contributor commented:

I would apply time sharding only if the logs are older than now - d.ingesterCfg.MaxChunkAge/2.
It would let normal logs be ingested without the need to create a time bucket every hour, while still allowing old logs to be ingested without strict ordering requirements.
wdyt?

Contributor added:

I believe it would help us avoid spikes of created streams like this one:
[screenshot: spike in stream creation]

na-- (Member, Author) replied:

I considered this, but I'm somewhat worried that it might cause more issues than it solves:

  1. Assuming that the number of out-of-order logs is not insignificant, this will actually result in more total and more active streams than the current approach: we will have logs in both {foo="bar"} and {foo="bar", __time_shard__="111_222"} streams.
  2. The "now" value of the distributors can be slightly different from the "now" value of the ingesters. And even if they are the same, some time passes until the ingesters handle the push request. So if we try to calculate in the distributors which logs will be rejected by the ingesters based on the current time, we might be wrong in a small percentage of cases, resulting in TooFarBehind errors. We can add a safety margin, of course, but it would make everything more complicated.

So, because this is a per-tenant config, I think the tradeoffs are slightly better if we always inject the __time_shard__ label when it's enabled for a specific tenant, even for current data 🤔 Or maybe it should be configurable and we should support both ways? 🤔 What do you think, am I missing something here?

Contributor replied:

"this will actually result in more total and more active streams than the current approach."

I do not think so.
Let's say we ingest 4 log lines with timestamps now, now-15m, now-1h, and now-1d.
With your version, we would split these logs into 4 separate streams with __time_shard__ values A, B, C, D
(if now and now-15m fall into different time buckets).

With the changes that I propose, we would not create new buckets for now and now-15m. So, as a result, we would have 3 streams: the original one, C, and D.

In general, I would try to guess whether the data will be rejected by the ingester or not (maybe with some safety margin; 15m is probably enough), because then we would not create new streams every hour for data that is fresh enough...

We already see that this almost doubles the stream count in cases where a lot of fresh data is ingested. It also affects the chunks, because we get more underutilized chunks that are flushed with reason "idle".

Also, this change does not give us any drawbacks in terms of ingesting old, out-of-order logs...

na-- (Member, Author) replied:

"With the changes that I propose, we would not create new buckets for now and now-15m. So, as a result, we would have 3 streams: the original one, C, and D."

The question is how the streams are going to be distributed over time. In your example, if you keep getting more and more logs with now, now-15m, now-1h, and now-1d timestamps as the value of now changes, eventually you'd still cover the whole time range with 2 sets of streams: ones that have __time_shard__ and the original ones that don't. That would probably be worse than the current approach.

But if these out-of-order logs arrive only occasionally, we'd be much better off with what you suggest, for sure.

So I think I'll add another per-tenant config option called something like time_sharding_ignore_recent with a value of 30 minutes or so. Any logs with timestamps greater than now - time_sharding_ignore_recent won't be split into a __time_shard__ stream, but if we set the option to 0, everything will be sharded (i.e. the current behavior from this PR will be applied). That should give us the flexibility to configure this for optimal results for different tenants. WDYT?
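
(For illustration only — a minimal sketch of how such a gate could behave, assuming the option works roughly as described above; the function name and signature here are made up and are not the code that eventually landed:)

    package main

    import (
        "fmt"
        "time"
    )

    // shouldTimeShard sketches the proposed gate: entries newer than
    // now - ignoreRecent keep their original stream, while older entries would
    // get a __time_shard__ label. ignoreRecent == 0 means "shard everything",
    // i.e. the behavior of the PR before this change.
    func shouldTimeShard(ts, now time.Time, ignoreRecent time.Duration) bool {
        if ignoreRecent == 0 {
            return true
        }
        return !ts.After(now.Add(-ignoreRecent))
    }

    func main() {
        now := time.Now()
        fmt.Println(shouldTimeShard(now.Add(-5*time.Minute), now, 30*time.Minute)) // false: recent, left as-is
        fmt.Println(shouldTimeShard(now.Add(-24*time.Hour), now, 30*time.Minute))  // true: old, gets a time shard
    }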

Contributor replied:

yep, it would be ideal.

na-- (Member, Author) commented:

This should now be resolved by the most recent commit, PTAL: 7787735

na-- merged commit 0d6d68d into main on Nov 6, 2024
60 checks passed
na-- deleted the time-sharding branch on November 6, 2024 15:03
Labels: size/L, type/docs