

@alanprot (Collaborator) commented Apr 16, 2025

This PR introduces a tsdbRowReader inspired by the Cloudflare PoC, but with a key design change.

Instead of using a fixed number of encoded data columns, this implementation configures only the duration of each data column. This makes the format more flexible and lets it adapt to blocks of varying time ranges.

Ex:

  • Configured column duration: 8h

  • Block duration: 24h
    → Result: 3 data columns

  • Block duration: 48h
    → Result: 6 data columns
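For concreteness, here is a minimal Go sketch of the column-count calculation above; the function and parameter names are illustrative, not identifiers from this PR:

```go
package main

import "fmt"

// numDataCols returns how many data columns are needed to cover the block
// range [minT, maxT], rounding up so a partial trailing window still gets
// its own column. All values are millisecond timestamps, as in TSDB.
func numDataCols(minT, maxT, colDurationMs int64) int64 {
	return (maxT - minT + colDurationMs - 1) / colDurationMs
}

func main() {
	const hour = int64(60 * 60 * 1000)
	fmt.Println(numDataCols(0, 24*hour, 8*hour)) // 3
	fmt.Println(numDataCols(0, 48*hour, 8*hour)) // 6
}
```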

Timestamp Layout

Each data column covers a fixed-duration window at a calculated offset from the block's minimum timestamp (min_ts). Ex:

  • min_ts = x, duration = 8h
    data_col_1 = (x, x + 8h]
    data_col_2 = (x + 8h, x + 16h]
    data_col_3 = (x + 16h, x + 24h]
    

The minTs, maxTs and duration can be stored in the parquet metadata, so we can use this info to know which data columns to open when running a query.
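As a sketch of how that could work at query time (a hypothetical helper, not this PR's code; blockMinT, colDurMs and numCols stand in for values read back from the parquet key/value metadata):

```go
package main

import "fmt"

// dataColsForQuery returns the inclusive range of data-column indexes a
// query [qMinT, qMaxT] needs to open, clamped to the columns that exist.
// blockMinT, colDurMs and numCols would come from the file's metadata.
func dataColsForQuery(blockMinT, colDurMs, numCols, qMinT, qMaxT int64) (first, last int64) {
	first = (qMinT - blockMinT) / colDurMs
	last = (qMaxT - blockMinT) / colDurMs
	if first < 0 {
		first = 0
	}
	if last > numCols-1 {
		last = numCols - 1
	}
	return first, last
}

func main() {
	const hour = int64(60 * 60 * 1000)
	// 24h block with 8h columns: a query over [10h, 20h] touches cols 1 and 2.
	fmt.Println(dataColsForQuery(0, 8*hour, 3, 10*hour, 20*hour))
}
```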

Another change is that we re-encode the chunks to make sure they fit exactly within the data column boundaries.
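For illustration, a sketch of what that re-encoding could look like using the prometheus/tsdb/chunkenc iterator/appender API; splitChunk is a hypothetical helper, not this PR's code, and it only handles float samples:

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/tsdb/chunkenc"
)

// splitChunk re-encodes the float samples of chk into one XOR chunk per
// data-column window (x, x+colDurMs], keyed by column index.
func splitChunk(chk chunkenc.Chunk, blockMinT, colDurMs int64) (map[int64]chunkenc.Chunk, error) {
	out := map[int64]chunkenc.Chunk{}
	apps := map[int64]chunkenc.Appender{}

	it := chk.Iterator(nil)
	for it.Next() == chunkenc.ValFloat {
		t, v := it.At()
		// Windows are left-open: shift by 1ms so a sample exactly on a
		// boundary lands in the earlier column, matching the layout above.
		col := (t - blockMinT - 1) / colDurMs
		app, ok := apps[col]
		if !ok {
			c := chunkenc.NewXORChunk()
			a, err := c.Appender()
			if err != nil {
				return nil, err
			}
			out[col], apps[col], app = c, a, a
		}
		app.Append(t, v)
	}
	return out, it.Err()
}

func main() {
	c := chunkenc.NewXORChunk()
	app, _ := c.Appender()
	for ts := int64(1); ts <= 10; ts++ {
		app.Append(ts, float64(ts))
	}
	out, _ := splitChunk(c, 0, 4)
	fmt.Println(len(out)) // samples 1..10 split over 4ms windows -> 3 chunks
}
```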

PS: I want to add more tests to this PR, but I'm opening it now just to start the discussion.

Signed-off-by: alanprot <[email protected]>
@MichaHoffmann (Collaborator) commented:

Great stuff! I have one proposal - right now we cannot define how long a parquet file should be - it's always as long as the range that all the blocks we use to create it cover. We could add parameters to the tsdb row reader that define this range ("minT, maxT uint64") - that way we can break down 14d blocks (which Thanos or Prometheus could compact to) into many parquet files of length 1d if we want - or anything in between!

@alanprot (Collaborator, Author) commented:

> Great stuff! I have one proposal - right now we cannot define how long a parquet file should be - it's always as long as the range that all the blocks we use to create it cover. We could add parameters to the tsdb row reader that define this range ("minT, maxT uint64") - that way we can break down 14d blocks (which Thanos or Prometheus could compact to) into many parquet files of length 1d if we want - or anything in between!

Makes sense! I'll change the PR to add this filter.
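For illustration, one possible shape for that filter; none of these identifiers exist in the repository, they just make the proposal concrete:

```go
package main

import "fmt"

// rowReaderOpts is a hypothetical options struct: the window of data, in
// millisecond timestamps, that one output parquet file should cover.
type rowReaderOpts struct {
	minT, maxT int64
}

// fileWindows breaks a block's [minT, maxT) range into fixed-length output
// windows, e.g. a 14d block into fourteen 1d parquet files.
func fileWindows(blockMinT, blockMaxT, fileDurMs int64) []rowReaderOpts {
	var out []rowReaderOpts
	for start := blockMinT; start < blockMaxT; start += fileDurMs {
		end := start + fileDurMs
		if end > blockMaxT {
			end = blockMaxT
		}
		out = append(out, rowReaderOpts{minT: start, maxT: end})
	}
	return out
}

func main() {
	const day = int64(24 * 60 * 60 * 1000)
	fmt.Println(len(fileWindows(0, 14*day, day))) // 14
}
```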

@jesusvazquez jesusvazquez changed the base branch from readme to main April 16, 2025 07:58
@alanprot alanprot force-pushed the converter branch 2 times, most recently from a8498ff to 1983967 on April 16, 2025 18:05
@alanprot alanprot merged commit 4b6b605 into prometheus-community:main Apr 16, 2025
5 checks passed