Commit 792964b

add proposal for parquet storage
Signed-off-by: Ben Ye <[email protected]>
1 file changed: +128 −0

docs/proposals/parquet-storage.md

---
title: "Parquet-based Storage"
linkTitle: "Parquet-based Storage"
weight: 1
slug: parquet-storage
---

- Authors: [Ben Ye](https://github.com/yeya24), [Alan Protasio](https://github.com/alanprot)
- Date: April 2025
- Status: Proposed

## Background

Since the introduction of Block Storage in Cortex, the TSDB format and the Store Gateway have been the de-facto way to query long-term data on object storage. However, this approach presents several significant challenges:

### TSDB Format Limitations

The TSDB format, while efficient for write-heavy workloads on local SSDs, is not designed for object storage:

- The index relies heavily on random reads to serve queries, and each random read becomes a request to the object store
- To reduce the number of requests to the object store, reads need to be merged, which leads to higher overfetch
- The index relies on postings, which can be a huge bottleneck for high-cardinality data

### Store Gateway Operational Challenges

The Store Gateway was originally introduced in Thanos, and the Cortex and Thanos communities have collaborated on many optimizations to it. However, it still has problems inherent to its design.

1. Resource Intensive
   - Requires significant local disk space to store index headers
   - High memory utilization due to index-header mmap
   - Often needs over-provisioning to handle query spikes

2. State Management and Scaling Difficulties
   - Requires complex data sharding when scaling, which often causes issues such as consistency-check failures and is hard for users to configure
   - Initial sync causes long startup times, which affects service availability in both scaling and failure-recovery scenarios

3. Query Inefficiencies
   - Attempts to minimize storage requests often lead to overfetching, causing high bandwidth usage
   - Complex caching logic with varying effectiveness; latency varies a lot on cache misses
   - Processes a single block with one goroutine, leading to high latency for large blocks, and cannot scale without complex data partitioning

### Why Parquet?

Apache Parquet is a columnar storage format designed specifically for efficient data storage and retrieval from object storage systems. It offers several key advantages that directly address the problems we face with TSDB and the Store Gateway:

- Data is organized by columns rather than rows, which reduces the number of requests to object storage since only limited IO is required to fetch a whole column
- Rich file metadata and indexes mean no local state such as an index header is required to query the data, making readers stateless
- Advanced compression techniques reduce storage costs and improve query performance
- Parallel-processing friendly via Parquet row groups

There are other benefits of the Parquet format, but they are not directly related to this proposal:

- Wide ecosystem and tooling support
- Column-pruning opportunities using projection pushdown

## Out of Scope

- Allowing the Ingester and Compactor to create Parquet files directly instead of TSDB blocks. This could be on the future roadmap, but this proposal focuses only on converting and querying Parquet files.

## Proposed Design

### Components

This design introduces two new Cortex components/modules.

#### 1. Parquet Converter

The Parquet Converter is a new component that converts TSDB blocks on the object store to the Parquet file format.

It is similar to the Compactor, but it only converts a single block at a time. The converted Parquet files are stored in the same TSDB block folder, so the lifecycle of a Parquet file is managed together with its block.

Conversion can be configured for only certain blocks, based on block duration: for example, we only convert a block if its duration is >= 12h. A minimal sketch of such an eligibility check is shown below.

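The following sketch illustrates duration-based gating; the `blockMeta` fields mirror a TSDB block's `meta.json`, while `shouldConvert` and the 12h threshold wiring are hypothetical, not the actual Cortex implementation:

```go
package main

import (
	"fmt"
	"time"
)

// blockMeta mirrors the fields of a TSDB block's meta.json that matter here.
type blockMeta struct {
	MinTime int64 // block start, milliseconds since epoch
	MaxTime int64 // block end, milliseconds since epoch
}

// shouldConvert reports whether a block is eligible for Parquet conversion:
// only blocks covering at least minDuration (e.g. 12h) are converted.
func shouldConvert(m blockMeta, minDuration time.Duration) bool {
	blockDuration := time.Duration(m.MaxTime-m.MinTime) * time.Millisecond
	return blockDuration >= minDuration
}

func main() {
	twoHourBlock := blockMeta{MinTime: 0, MaxTime: 2 * time.Hour.Milliseconds()}
	dayBlock := blockMeta{MinTime: 0, MaxTime: 24 * time.Hour.Milliseconds()}

	fmt.Println(shouldConvert(twoHourBlock, 12*time.Hour)) // false: shorter than 12h
	fmt.Println(shouldConvert(dayBlock, 12*time.Hour))     // true: >= 12h
}
```
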
#### 2. Parquet Queryable

Similar to the existing distributorQueryable and blockStorageQueryable, the Parquet queryable is a queryable implementation that allows Cortex to query Parquet files; it can be used in both the Cortex Querier and the Ruler.

If the Parquet queryable is enabled, the block storage queryable is disabled and the Cortex querier no longer queries the Store Gateway, although it still queries Ingesters.

The Cortex querier remains a stateless component when the Parquet queryable is enabled.

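A minimal sketch of the queryable selection described above, with a simplified `Queryable` interface and hypothetical type names standing in for the real Cortex wiring (only the distributor/parquet/block-storage split is from this proposal):

```go
package main

import "fmt"

// Queryable abstracts anything the querier can read series from, mirroring
// the role of Prometheus' storage.Queryable without its exact signature.
type Queryable interface {
	Name() string
}

type distributorQueryable struct{}

func (distributorQueryable) Name() string { return "distributor (ingesters)" }

type blockStorageQueryable struct{}

func (blockStorageQueryable) Name() string { return "block storage (store gateway)" }

type parquetQueryable struct{}

func (parquetQueryable) Name() string { return "parquet files" }

// buildQueryables returns the set of queryables the querier fans out to.
// Ingesters are always queried; Parquet replaces the store-gateway path.
func buildQueryables(parquetEnabled bool) []Queryable {
	qs := []Queryable{distributorQueryable{}}
	if parquetEnabled {
		qs = append(qs, parquetQueryable{})
	} else {
		qs = append(qs, blockStorageQueryable{})
	}
	return qs
}

func main() {
	for _, q := range buildQueryables(true) {
		fmt.Println("querying:", q.Name())
	}
}
```
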
### Architecture

```
┌──────────┐    ┌─────────────┐    ┌──────────────┐
│ Ingester │───>│    TSDB     │───>│   Parquet    │
└──────────┘    │   Blocks    │    │  Converter   │
                └─────────────┘    └──────────────┘
                                          │
                                          v
┌──────────┐    ┌─────────────┐    ┌──────────────┐
│  Query   │───>│   Parquet   │───>│   Parquet    │
│ Frontend │    │   Querier   │    │    Files     │
└──────────┘    └─────────────┘    └──────────────┘
```

### Data Format

Following the current design of Cortex, each Parquet file contains at most 1 day of data.

#### Schema Overview

The Parquet format consists of two types of files:

1. **Labels Parquet File**
   - Each row represents a unique time series
   - Each column corresponds to a label name (e.g., `__name__`, `label1`, ..., `labelN`)
   - Row groups are sorted by `__name__` alphabetically in ascending order

2. **Chunks Parquet File**
   - Maintains row and row group order matching the Labels file
   - Contains multiple chunk columns for time-series data, each column covering one time range of chunks: 0-8h, 8h-16h, and 16h-24h

The sketch after this list illustrates how the shared row order is used at query time.

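A minimal in-memory sketch of why the shared row order matters: rows sorted by `__name__` allow a binary search for a metric's row range in the Labels file, and the same row indexes then address the series' chunks in the Chunks file. Plain Go slices stand in for real Parquet readers; all names here are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

type labelsRow struct {
	metricName string            // the __name__ column
	labels     map[string]string // the s_lbl_{name} columns
}

// findMetricRows returns the [start, end) row range for a metric name in a
// labels file whose rows are sorted by __name__ ascending.
func findMetricRows(rows []labelsRow, metric string) (int, int) {
	start := sort.Search(len(rows), func(i int) bool { return rows[i].metricName >= metric })
	end := sort.Search(len(rows), func(i int) bool { return rows[i].metricName > metric })
	return start, end
}

func main() {
	labelsFile := []labelsRow{
		{metricName: "http_requests_total", labels: map[string]string{"job": "api"}},
		{metricName: "http_requests_total", labels: map[string]string{"job": "web"}},
		{metricName: "up", labels: map[string]string{"job": "api"}},
	}
	// Chunks file: same row order, so chunksFile[i] holds row i's encoded chunks.
	chunksFile := [][]byte{{0x01}, {0x02}, {0x03}}

	start, end := findMetricRows(labelsFile, "http_requests_total")
	for i := start; i < end; i++ {
		fmt.Printf("series %v -> chunks %v\n", labelsFile[i].labels, chunksFile[i])
	}
}
```
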
#### Column Specifications

| Column Name | Description | Type | Encoding/Compression/SkipPageBounds | Required |
|-------------|-------------|------|-------------------------------------|----------|
| `s_hash` | Hash of all labels | INT64 | None/Zstd/Yes | No |
| `s_col_indexes` | Bitmap indicating which columns store the label set for this row (series) | ByteArray (bitmap) | DeltaByteArray/Zstd/Yes | Yes |
| `s_lbl_{labelName}` | Values for a given label name. Rows are sorted by metric name | ByteArray (string) | RLE_DICTIONARY/Zstd/No | Yes |
| `s_data_{n}` | Chunk columns (0 to data_cols_count). Each column contains data from `[n*duration, (n+1)*duration]`, where duration is `24h/data_cols_count` | ByteArray (encoded chunks) | DeltaByteArray/Zstd/Yes | Yes |

`data_cols_count_md` will be a Parquet file metadata entry; its value is usually 3, but it can be configured to adjust for different use cases. The sketch below illustrates the resulting timestamp-to-column mapping.

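A small sketch of the timestamp-to-column mapping implied by the table above; `dataColumnIndex` is a hypothetical helper, not part of the proposal:

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// dataColumnIndex returns the index n of the s_data_{n} column that holds a
// chunk starting at tsMillis (milliseconds within the file's day), given the
// file's data_cols_count metadata value: each column covers
// 24h/data_cols_count of the day.
func dataColumnIndex(tsMillis int64, dataColsCount int) int {
	duration := day.Milliseconds() / int64(dataColsCount)
	return int(tsMillis / duration)
}

func main() {
	// With the usual data_cols_count of 3, each column covers 8h.
	fmt.Println(dataColumnIndex(1*time.Hour.Milliseconds(), 3))  // 0: 0h-8h
	fmt.Println(dataColumnIndex(9*time.Hour.Milliseconds(), 3))  // 1: 8h-16h
	fmt.Println(dataColumnIndex(20*time.Hour.Milliseconds(), 3)) // 2: 16h-24h
}
```
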
## Open Questions

1. Should we use a Parquet Gateway to replace the Store Gateway?
   - Separates the query engine from storage
   - We can make the Parquet Gateway semi-stateful (e.g., with data locality) for better performance
