64 changes: 63 additions & 1 deletion administration/buffering-and-storage.md
@@ -151,6 +151,64 @@
| `storage.keep.rejected` | When enabled, the dead-letter queue feature stores failed chunks that can't be delivered. Accepted values: `Off`, `On`. | `Off`|
| `storage.rejected.path` | When specified, the dead-letter queue is stored in a subdirectory (stream) under `storage.path`. The default value `rejected` is used at runtime if not set. | _none_ |

### Dead letter queue (DLQ)

The dead letter queue (DLQ) feature preserves chunks that can't be delivered to output destinations. Instead of discarding this data, Fluent Bit copies the rejected chunks to a dedicated storage location for later analysis and troubleshooting.

#### When DLQ is triggered

Chunks are copied to the DLQ in the following failure scenarios:

- **Permanent errors**: When an output plugin returns an unrecoverable error (`FLB_ERROR`).
- **Retry limit reached**: When a chunk exhausts all configured retry attempts.
- **Retries disabled**: When `retry_limit` is set to `no_retries` and a flush fails.
- **Scheduler failures**: When the retry scheduler can't schedule a retry (for example, due to resource constraints).

#### Requirements

To use the DLQ feature, both of the following settings are required:

- `storage.path` is configured, which enables filesystem storage.
- `storage.keep.rejected` is set to `On`.
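
These two requirements can be checked mechanically before a deployment. The following Python sketch works on a `service` section that has already been loaded into a dictionary. The function name `dlq_settings` and the set of spellings treated as enabled are assumptions made for illustration; they aren't part of Fluent Bit.

```python
import posixpath

# Assumption for this sketch: these spellings count as "enabled".
TRUTHY = {"on", "true", "yes"}


def dlq_settings(service):
    """Return (enabled, rejected_dir) for a parsed service section.

    `service` is a plain dict of key/value pairs, for example the result
    of loading the `service:` block of a YAML configuration file.
    """
    storage_path = service.get("storage.path")
    keep = service.get("storage.keep.rejected", "off")
    # YAML loaders may turn `on` into a boolean, so accept both forms.
    keep_enabled = keep is True or str(keep).strip().lower() in TRUTHY

    # Both requirements must hold, otherwise the DLQ is not active.
    if not storage_path or not keep_enabled:
        return False, None

    # `rejected` is the documented runtime default for the subdirectory.
    subdir = service.get("storage.rejected.path") or "rejected"
    return True, posixpath.join(storage_path, subdir)
```

Calling `dlq_settings` with only `storage.keep.rejected` set reports the DLQ as disabled, because `storage.path` is missing.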

#### DLQ file location and format

Rejected chunks are stored in a subdirectory under `storage.path`. For example, with the following configuration:

```yaml
service:
  storage.path: /var/log/flb-storage/
  storage.keep.rejected: on
  storage.rejected.path: rejected
```

Rejected chunks are stored at `/var/log/flb-storage/rejected/`.

Each DLQ file is named using this format:

```text
<sanitized_tag>_<status_code>_<output_name>_<unique_id>.flb
```

For example: `kube_var_log_containers_test_400_http_0x7f8b4c.flb`

The file contains the original chunk data in the internal Fluent Bit format, preserving all records and metadata.
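
Because the sanitized tag can itself contain underscores, a file name is easiest to split from the right. The following Python sketch illustrates this under the assumption that the output name and unique ID contain no underscores; `DlqName` and `parse_dlq_name` are hypothetical helper names, not part of Fluent Bit.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class DlqName(NamedTuple):
    """The four documented components of a DLQ file name."""
    tag: str
    status_code: str
    output_name: str
    unique_id: str


def parse_dlq_name(path):
    """Split a DLQ file name into its four documented components."""
    name = PurePosixPath(path).name
    if not name.endswith(".flb"):
        raise ValueError(f"not a DLQ chunk file: {name}")
    stem = name[: -len(".flb")]
    # Split from the right: the sanitized tag may contain underscores.
    # Assumption: the output name and unique ID contain none.
    parts = stem.rsplit("_", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected DLQ file name: {name}")
    return DlqName(*parts)
```

For the example file name above, `parse_dlq_name` yields the tag `kube_var_log_containers_test`, the status code `400`, the output name `http`, and the unique ID `0x7f8b4c`.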

#### Troubleshooting with DLQ

The DLQ feature supports the following tasks:

- **Data preservation**: Invalid or rejected chunks are preserved instead of being permanently lost.
- **Root cause analysis**: Investigate why specific data failed to be delivered without impacting live processing.
- **Data recovery**: Replay or transform rejected chunks after fixing the underlying issue.
- **Debugging**: Analyze the exact content of problematic records.

To examine DLQ chunks, you can use the storage metrics endpoint (when `storage.metrics` is enabled) or directly inspect the files in the rejected directory.

{% hint style="info" %}
DLQ files remain on disk until manually removed. Monitor disk usage in the rejected directory and implement a cleanup policy for older files.
{% endhint %}
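
One possible cleanup policy is to delete DLQ files older than a fixed number of days. The following Python sketch assumes that rejected chunks sit directly in the rejected directory with a `.flb` extension and that file modification time is an acceptable measure of age. The function name `remove_old_dlq_files` and its parameters are hypothetical.

```python
import time
from pathlib import Path


def remove_old_dlq_files(rejected_dir, max_age_days, now=None, dry_run=True):
    """Select `.flb` files older than `max_age_days` and return their paths.

    With `dry_run=True` (the default) nothing is deleted, so the selection
    can be reviewed first. With `dry_run=False` the files are removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    selected = []
    for path in sorted(Path(rejected_dir).glob("*.flb")):
        # Age is judged by modification time, an assumption of this sketch.
        if path.is_file() and path.stat().st_mtime < cutoff:
            selected.append(path)
            if not dry_run:
                path.unlink()
    return selected
```

Running the function once with the default `dry_run=True` shows which files a given retention period would remove before anything is deleted.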

A Service section will look like this:

{% tabs %}
@@ -165,6 +223,8 @@
  storage.checksum: off
  storage.backlog.mem_limit: 5M
  storage.backlog.flush_on_shutdown: off
  storage.keep.rejected: on
  storage.rejected.path: rejected
```

{% endtab %}
@@ -179,12 +239,14 @@
  storage.checksum off
  storage.backlog.mem_limit 5M
  storage.backlog.flush_on_shutdown off
  storage.keep.rejected on
  storage.rejected.path rejected
```

{% endtab %}
{% endtabs %}

This configuration sets an optional buffering mechanism where the route to the data is `/var/log/flb-storage/`. It uses `normal` synchronization mode, without running a checksum and up to a maximum of 5&nbsp;MB of memory when processing backlog data.
This configuration sets an optional buffering mechanism where the route to the data is `/var/log/flb-storage/`. It uses `normal` synchronization mode, without running a checksum and up to a maximum of 5&nbsp;MB of memory when processing backlog data. Additionally, the dead letter queue is enabled, and rejected chunks are stored in `/var/log/flb-storage/rejected/`.

### Input section configuration

19 changes: 19 additions & 0 deletions administration/configuring-fluent-bit/yaml/service-section.md
@@ -26,8 +26,27 @@
| `sp.convert_from_str_to_num` | If enabled, the stream processor converts strings that represent numbers to a numeric type. | `true` |
| `windows.maxstdio` | If specified, adjusts the limit of `stdio`. Only provided for Windows. Values from `512` to `2048` are allowed. | `512` |

### Storage configuration

The following storage-related keys can be set in the `service` section:

| Key | Description | Default Value |
| --- | ----------- | ------------- |
| `storage.path` | Set a location in the file system to store streams and chunks of data. Required for filesystem buffering. | _none_ |
| `storage.sync` | Configure the synchronization mode used to store data in the file system. Accepted values: `normal` or `full`. | `normal` |
| `storage.checksum` | Enable data integrity check when writing and reading data from the filesystem. Accepted values: `off` or `on`. | `off` |
| `storage.max_chunks_up` | Set the maximum number of chunks that can be `up` in memory when using filesystem storage. | `128` |
| `storage.backlog.mem_limit` | Set the memory limit for backlog data chunks. | `5M` |
| `storage.backlog.flush_on_shutdown` | Attempt to flush all backlog chunks during shutdown. Accepted values: `off` or `on`. | `off` |
| `storage.metrics` | Enable storage layer metrics on the HTTP endpoint. Accepted values: `off` or `on`. | `off` |
| `storage.delete_irrecoverable_chunks` | Delete irrecoverable chunks during runtime and at startup. Accepted values: `off` or `on`. | `off` |
| `storage.keep.rejected` | Enable the dead letter queue (DLQ) to preserve chunks that fail to be delivered. Accepted values: `off` or `on`. | `off` |
| `storage.rejected.path` | Subdirectory name under `storage.path` for storing rejected chunks. | `rejected` |

For scheduler and retry details, see [scheduling and retries](../../scheduling-and-retries.md#Scheduling-and-Retries).

For storage and buffering details, see [buffering and storage](../../buffering-and-storage.md).

## Configuration example

The following configuration example defines a `service` section with [hot reloading](../../hot-reload.md) enabled and a pipeline with a `random` input and `stdout` output:
4 changes: 4 additions & 0 deletions administration/scheduling-and-retries.md
@@ -95,6 +95,10 @@ The scheduler provides a configuration option called `Retry_Limit`, which can be
| `Retry_Limit` | `no_limits` or `False` | When set, there is no limit on the number of retries that the scheduler can perform. |
| `Retry_Limit` | `no_retries` | When set, retries are disabled and the scheduler doesn't try to send data to the destination again if the first attempt fails. |

{% hint style="info" %}
When a chunk exhausts all retry attempts or retries are disabled, the data is discarded by default. To preserve rejected data for later analysis, enable the [Dead Letter Queue (DLQ)](buffering-and-storage.md#dead-letter-queue-dlq) feature by setting `storage.keep.rejected` to `on` in the Service section.
{% endhint %}

### Retry example

The following example configures two outputs, where the HTTP plugin has an unlimited number of retries, and the Elasticsearch plugin has a limit of `5` retries:
55 changes: 55 additions & 0 deletions administration/troubleshooting.md
@@ -2,9 +2,64 @@

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=759ddb3d-b363-4ee6-91fa-21025259767a" />

- [Dead letter queue: preserve failed chunks](#dead-letter-queue)
- [Tap: generate events or records](#tap)
- [Dump internals signal](#dump-internals-and-signal)

## Dead letter queue

The dead letter queue (DLQ) feature preserves chunks that can't be delivered to output destinations, so you can troubleshoot delivery failures without losing data.

### Enable DLQ

To enable the DLQ, add the following to your Service section:

{% tabs %}
{% tab title="fluent-bit.yaml" %}

```yaml
service:
  storage.path: /var/log/flb-storage/
  storage.keep.rejected: on
  storage.rejected.path: rejected
```

{% endtab %}
{% tab title="fluent-bit.conf" %}

```text
[SERVICE]
  storage.path /var/log/flb-storage/
  storage.keep.rejected on
  storage.rejected.path rejected
```

{% endtab %}
{% endtabs %}

### What gets stored

Chunks are copied to the DLQ when:

- An output plugin returns an unrecoverable error.
- A chunk exhausts all configured retry attempts.
- Retries are disabled (`retry_limit: no_retries`) and the flush fails.
- The scheduler fails to schedule a retry.

### Examine DLQ files

DLQ files are stored in the configured path (for example, `/var/log/flb-storage/rejected/`) with names that include the tag, status code, and output plugin name. This helps identify which records failed and why.

For example, a file named `kube_var_log_containers_test_400_http_0x7f8b4c.flb` indicates a chunk with tag `kube.var.log.containers.test` that failed with status code `400` when sending to the `http` output.
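
When many chunks have been rejected, grouping the file names by output and status code can show where failures concentrate. The following Python sketch counts files per `(output_name, status_code)` pair. It assumes the naming layout `<sanitized_tag>_<status_code>_<output_name>_<unique_id>.flb` and that the output name and unique ID contain no underscores; `summarize_rejections` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def summarize_rejections(rejected_dir):
    """Count DLQ files per (output_name, status_code) pair."""
    counts = Counter()
    for path in Path(rejected_dir).glob("*.flb"):
        # Split from the right because the tag may contain underscores.
        parts = path.stem.rsplit("_", 3)
        if len(parts) != 4:
            continue  # Skip names that don't match the expected layout.
        _tag, status_code, output_name, _unique_id = parts
        counts[(output_name, status_code)] += 1
    return counts
```

Calling `summarize_rejections("/var/log/flb-storage/rejected/").most_common()` lists the pairs with the most rejected chunks first.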

### DLQ management

{% hint style="warning" %}
DLQ files remain on disk until manually removed. Monitor disk usage and implement a cleanup policy.
{% endhint %}
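
As a starting point for monitoring, the following Python sketch reports how many DLQ files exist and how much space they occupy. It assumes rejected chunks are `.flb` files located directly in the rejected directory; `dlq_disk_usage` is a hypothetical helper name, and any alerting threshold is left to the caller.

```python
from pathlib import Path


def dlq_disk_usage(rejected_dir):
    """Return (file_count, total_bytes) for `.flb` files in the directory."""
    sizes = [
        path.stat().st_size
        for path in Path(rejected_dir).glob("*.flb")
        if path.is_file()
    ]
    return len(sizes), sum(sizes)
```

The returned values can be exported to an existing monitoring system or compared against a size limit in a scheduled job.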

For more details on DLQ configuration, see [Buffering and Storage](buffering-and-storage.md#dead-letter-queue-dlq).

## Tap

Tap can be used to generate events or records detailing what messages pass through Fluent Bit, at what time and what filters affect them.