feat(loki.process): Add debug metrics for CRI stage to track truncation of lines and partial line flushing #5399
if c.cfg.MaxPartialLineSizeTruncate && len(e.Line) > int(c.cfg.MaxPartialLineSize) {
	e.Line = e.Line[:c.cfg.MaxPartialLineSize]
	if c.linesTruncatedMetric != nil {
		c.linesTruncatedMetric.Inc()
I considered adding the log labels as metric labels, to make it easier to identify log streams with long lines. I suspect most of the time it'll be a particular stream. But for now I don't want to make changes that could lead to too many metrics.
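The cardinality trade-off mentioned here can be sketched in a few lines. This is not Alloy code: `seriesKey`, `countSeries`, and the sample label sets are hypothetical, and a plain map stands in for a metrics registry. The sketch only counts how many distinct time series a truncation counter would expose with and without per-stream labels.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey is a hypothetical helper that renders a label set as a stable
// string, similar to how a registry tells one time series from another.
func seriesKey(labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}

// countSeries returns how many distinct series a counter would expose for the
// given log streams: one in total if it carries no labels, or one per unique
// label set if the stream labels are copied onto the metric.
func countSeries(streams []map[string]string) (unlabeled, labeled int) {
	seen := map[string]struct{}{}
	for _, s := range streams {
		seen[seriesKey(s)] = struct{}{}
	}
	if len(streams) > 0 {
		unlabeled = 1
	}
	return unlabeled, len(seen)
}

func main() {
	// Made-up streams for illustration only.
	streams := []map[string]string{
		{"namespace": "a", "pod": "web-1"},
		{"namespace": "a", "pod": "web-2"},
		{"namespace": "b", "pod": "db-1"},
		{"namespace": "a", "pod": "web-1"}, // same stream again, same series
	}
	u, l := countSeries(streams)
	fmt.Printf("unlabeled=%d labeled=%d\n", u, l)
}
```

With pod-level labels the series count grows with the number of distinct streams, which is the growth the unlabeled counter avoids at the cost of not identifying the offending stream.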
kalleep left a comment
I don't know how useful these metrics would be.
Could you explain how these could be used to actually find issues?
No suggestions for docs. Looks OK as-is.
It's hard to tell what values to set for
Hi @kalleep, have you had a chance to think about this, please? I'm open to other ways of debugging this, but a metric seems like the easy, low-cost way of doing it.
It looks good to me but still we should use the helper function I mentioned here #5399 (comment) |
…on of lines and partial line flushing (#5399)
🤖 I have created a release *beep* *boop*

---

## [1.14.0](v1.13.0...v1.14.0) (2026-03-06)

### ⚠ BREAKING CHANGES

* **loki.secretfilter:** Some config options are removed entirely:
  - `partial_mask` (replaced with `redact_percent`)
  - `allowlist` (now controlled with custom gitleaks config)
  - `enable_entropy`
  - `include_generic` (now controlled with custom gitleaks config)
  - `types` (now controlled with custom gitleaks config)
* **otelcol.receiver.prometheus:** `otelcol.receiver.prometheus` no longer sets start times of OTLP metrics. Grafana Cloud and Mimir do not currently use OTLP metric start times. If you do want your metrics to have them, you can use `otelcol.processor.metric_start_time` with `strategy` set to `true_reset_point` to get the same behaviour.

### Features 🌟

* Add automatic reconnection to database_observability components ([#5444](#5444)) ([553f967](553f967))
* Add limited type checking for validate command ([#5076](#5076)) ([045fb76](045fb76))
* **database_observability.mysql:** Collect client info for query samples ([#5552](#5552)) ([257a699](257a699))
* **database_observability.postgres:** Add exclude databases/users for `logs` collector ([#5569](#5569)) ([5dddd9b](5dddd9b))
* **database_observability.postgres:** Add logs collector ([#5445](#5445)) ([46d79d4](46d79d4))
* **database_observability.postgres:** Allow excluding queries ran by specific users ([#5544](#5544)) ([2d0ca15](2d0ca15))
* Deprecate prometheus.write.queue ([#5509](#5509)) ([ee0f227](ee0f227))
* Introduce SeriesRefMappingStore ([#5522](#5522)) ([33ee297](33ee297))
* **local.file_match, loki.source.file:** Match multiple files using doublestar `{...}` expressions ([#5470](#5470)) ([284e48f](284e48f))
* **loki.process:** Add debug metrics for CRI stage to track truncation of lines and partial line flushing ([#5399](#5399)) ([a1728f6](a1728f6))
* **mixin:** Add OTel Engine Overview dashboard ([#5573](#5573)) ([df52116](df52116))
* **mixin:** Add zipped dashboards as a release artifact ([#5603](#5603)) ([4f7fe85](4f7fe85))
* **otel:** Add receivers used in the otel k8s helm chart presets ([#5466](#5466)) ([100f6ea](100f6ea))
* **otelcol.receiver.prometheus:** Remove requirement to run Alloy with `--stability.level=experimental` in order to translate Prometheus native histograms into OTLP exponential histograms. ([#5308](#5308)) ([237e985](237e985))
* **otelcol:** Expose missing tail_sampling drop and bytes_limiting ([6021154](6021154))
* **prometheus.exporter.postgres:** Update to version `0.19.0` and expose new collectors settings ([#4640](#4640)) ([aa01e45](aa01e45))
* **prometheus.exporter.postgres:** Update to version 0.19.1 ([#5659](#5659)) ([9f4e88f](9f4e88f))
* Update github exporter with github app authentication ([#5377](#5377)) ([ca741a6](ca741a6))
* Update grafana cadvisor fork to v0.54.1 ([#5447](#5447)) ([2a3aba0](2a3aba0))
* Upgrade prometheus to version 0.309.1 ([#5479](#5479)) ([633944b](633944b))

### Bug Fixes 🐛

* Add /FORCEREGISTRY flag to windows installer ([#5517](#5517)) ([6b22d4e](6b22d4e))
* Add missing otelcol alias to make OTel Engine work with OTel Collector helm chart ([#5473](#5473)) ([90478cd](90478cd))
* **controller:** Prevent duplicate loaders from being created ([#5446](#5446)) ([31d5eea](31d5eea))
* **database_observability.mysql:** Skip wait events with `NULL` timer_wait ([#5478](#5478)) ([48750e5](48750e5))
* **database_observability.postgres:** Correctly handle table name casing when parsing postgres queries ([#5440](#5440)) ([7cca2b9](7cca2b9))
* **deps:** Update module github.com/go-git/go-git/v5 to v5.16.5 [SECURITY] ([#5485](#5485)) ([71a1b8b](71a1b8b))
* Ensure Valid/Clear States in Alloy Engine Extension ([#5551](#5551)) ([99ad024](99ad024))
* Expose missing `otelcol.processor.tail_sampling` options ([#5606](#5606)) ([6021154](6021154))
* **loki.process:** Registration of stage.metric when used inside stage.match ([#5460](#5460)) ([81caf72](81caf72))
* **loki.source.docker:** Parse timestamp correctly when log line only contains newline ([#5489](#5489)) ([162011d](162011d))
* **loki.source.file:** Close file if we cannot find encoding ([#5528](#5528)) ([56bcb26](56bcb26))
* **mixin:** Support OTel exporter batching ([#5618](#5618)) ([f2b7cb8](f2b7cb8))
* **prometheus.echo:** Return zero for SeriesRef ([#5622](#5622)) ([31a8680](31a8680))
* **prometheus.exporter.cloudwatch:** Respect debug flag ([#5469](#5469)) ([44ade00](44ade00))
* **prometheus.receive_http:** Bump prometheus patch for bugfix ([#5505](#5505)) ([b7a1d05](b7a1d05))
* **prometheus.remote_write:** Fix sent_batch_duration_seconds measuring before the request was sent [backport] ([#5698](#5698)) ([150aecb](150aecb))
* Use read-write mutex locks to prevent concurrent tagsCache map reads and writes ([#5534](#5534)) ([8efed2e](8efed2e))

### Performance

* **loki.secretfilter:** Change secretfilter implementation to use Gitleaks ([#5503](#5503)) ([08e265c](08e265c))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: grafana-alloybot[bot] <167359181+grafana-alloybot[bot]@users.noreply.github.com>
Pull Request Details
For partial lines there is currently a log line, but a metric could be an easier way to tell whether something is going wrong, and it means we could alert on it.
For truncation there are currently no logs or metrics, so this will be the first time we can track it.
Both are low-cardinality metrics, and I don't expect them to have much impact.
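
For readers without the diff open, the truncation path can be sketched as a standalone program. This is an illustration under assumptions, not the merged implementation: `counter`, `criConfig`, `criStage`, and `truncate` are hypothetical names, and `counter` is a stand-in for a real Prometheus counter so the example needs only the standard library. The condition and the nil guard mirror the hunk under review.

```go
package main

import "fmt"

// counter is a hypothetical stand-in for a Prometheus counter.
type counter struct{ n int }

func (c *counter) Inc() { c.n++ }

// criConfig holds the two settings referenced in the reviewed hunk.
type criConfig struct {
	MaxPartialLineSize         uint64
	MaxPartialLineSizeTruncate bool
}

// criStage is a hypothetical, reduced version of the stage.
type criStage struct {
	cfg                  criConfig
	linesTruncatedMetric *counter // nil-checked before use, as in the diff
}

// truncate cuts line to MaxPartialLineSize bytes when truncation is enabled
// and counts each line that was actually shortened.
func (c *criStage) truncate(line string) string {
	if c.cfg.MaxPartialLineSizeTruncate && len(line) > int(c.cfg.MaxPartialLineSize) {
		// The cut is byte-based, so it can split a multi-byte UTF-8 character.
		line = line[:c.cfg.MaxPartialLineSize]
		if c.linesTruncatedMetric != nil {
			c.linesTruncatedMetric.Inc()
		}
	}
	return line
}

func main() {
	m := &counter{}
	s := &criStage{
		cfg:                  criConfig{MaxPartialLineSize: 8, MaxPartialLineSizeTruncate: true},
		linesTruncatedMetric: m,
	}
	for _, l := range []string{"short", "a much longer line"} {
		fmt.Printf("%q\n", s.truncate(l))
	}
	fmt.Println("lines truncated:", m.n)
}
```

A steadily rising count would suggest that the configured maximum is too small for at least one stream, which is the signal an alert could be built on.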