
Create new TestAnalytics "daily-rollups" file #3016

Open
Swatinem opened this issue Nov 26, 2024 · 3 comments

Comments

@Swatinem

The existing "daily rollups" pipeline for test analytics works like this:

  • The reports_dailytestrollups table stores pre-aggregated data on a per-calendar-day basis.
  • Each test is inserted into that table using an INSERT ON CONFLICT UPDATE statement.
  • That table is being queried and further aggregated into 6 different files holding aggregations for different time ranges.
  • Those aggregation files have to be touched every single day, as they contain data with aggregation windows relative to today.

I propose to replace all this with a single file containing the following data:

  • N days worth of consecutive per-calendar-day aggregates

Storing the data in this format avoids the need to re-aggregate it every day, and allows running queries for an arbitrary date range by intersecting the range of data stored for each test with the range we want to query.

Querying data would work like this:

  • Assuming that the data stored for a given test covers 2024-11-20 back to 2024-11-18 (newest first)
  • Assuming that today is 2024-11-21
  • I want to query / aggregate data for today and yesterday
# data stored for the test:
… | 2024-11-21 | 2024-11-20 | 2024-11-19 | 2024-11-18 | …
               ^- begin     |            |            ^- end
# data we would like to query:
… | 2024-11-21 | 2024-11-20 | …
  ^ today      ^ yesterday  |
# the overlap between stored data and requested data is the data we actually aggregate:
… | 2024-11-21 | 2024-11-20 | …
               ^- begin     ^- end

Range intersection (overlap) also lets us filter out tests that have no data within the requested range: any test whose intersection with the query range is empty can simply be skipped.
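The linked implementation uses a custom binary format (in Rust), but the query logic itself is format-agnostic. Here is a minimal Python sketch of the intersection-based query, assuming a hypothetical newest-first bucket layout where `buckets[0]` holds the aggregate for `test_start` and `buckets[i]` the aggregate for `test_start - i` days; none of these names come from the actual format.

```python
from datetime import date, timedelta

def query_range(test_start: date, buckets: list[int], q_start: date, q_end: date):
    """Aggregate bucket values over the overlap of the stored and queried ranges.

    Hypothetical layout: buckets are newest-first, with buckets[0] = test_start.
    q_start is the newest day of the query, q_end the oldest (both inclusive).
    Returns None when the intersection is empty (test can be filtered out).
    """
    test_end = test_start - timedelta(days=len(buckets) - 1)
    # Intersect [test_end, test_start] with [q_end, q_start]:
    overlap_start = min(test_start, q_start)  # newest day of the overlap
    overlap_end = max(test_end, q_end)        # oldest day of the overlap
    if overlap_start < overlap_end:
        return None  # empty intersection: no data in the requested range
    # Map the overlapping days back to bucket indices and aggregate.
    first = (test_start - overlap_start).days
    last = (test_start - overlap_end).days
    return sum(buckets[first:last + 1])
```

With the example from above (data stored for 2024-11-20 through 2024-11-18, querying today 2024-11-21 and yesterday 2024-11-20), only the 2024-11-20 bucket falls in the overlap and gets aggregated.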

Updating data would work by first shifting existing data to the right if necessary, and adding the new test data to the first bucket.
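The update step can be sketched the same way, under the same hypothetical newest-first layout: if the incoming day is newer than the stored start, shift the existing data right by prepending empty buckets; then add the value into the matching bucket. The sketch also extends the range on the old end, which would cover late-arriving data for past days.

```python
from datetime import date, timedelta

def insert_day(test_start: date, buckets: list[int], day: date, value: int):
    """Insert `value` for `day`, shifting the newest-first buckets as needed.

    Hypothetical layout: buckets[0] = test_start (the newest stored day).
    Returns the (possibly new) start date and bucket list.
    """
    buckets = list(buckets)  # work on a copy
    if day > test_start:
        # Shift existing data to the right by prepending empty buckets.
        shift = (day - test_start).days
        buckets = [0] * shift + buckets
        test_start = day
    idx = (test_start - day).days
    if idx >= len(buckets):
        # Older than anything stored: extend the window on the old end.
        buckets += [0] * (idx - len(buckets) + 1)
    buckets[idx] += value
    return test_start, buckets
```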

Garbage collection would work by simply removing tests for which the start date of their stored data range is past the data retention period.
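Garbage collection then reduces to a single filter over tests. A sketch, again with hypothetical names and an assumed 30-day retention period (the actual retention policy is not specified in this issue):

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # assumed retention period, not from the issue

def gc(tests: dict, today: date) -> dict:
    """Drop tests whose newest stored day already falls outside retention.

    `tests` maps a test name to (start_date, buckets), newest-first as above.
    """
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return {name: data for name, data in tests.items() if data[0] >= cutoff}
```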


By storing and working with a time window of data, we have the following benefits compared to the existing workflow:

  • We avoid storing data twice (in a DB table and in files); it lives in just one place, a single file
  • By avoiding the database altogether, we have the following benefits:
    • no more complicated INSERT ON CONFLICT UPDATE which might be prone to DB locking
    • data retention policy can be done using file TTLs instead of DB deletions (or partition deletion as is used right now?)
  • Storing a time window of data in a file means that we can store a single file vs 6 files for each desired aggregation window.
  • Using overlapping ranges between the time window being stored and the time window being aggregated, we avoid having to touch aggregation files each day.

codecov/test-results-parser#51 is implementing a custom binary file format that stores separate data tables for tests and their time windows of data.

It should be possible to implement the basic ideas explained above using any data format, whether Arrow, Cap'n Proto, or even JSON.
However, the custom format linked above is optimized for quick access and queries, and does not require deserializing all the data it stores, as JSON for example would.

Nonetheless, I recognize that creating a custom binary format comes with a non-zero maintenance burden that an off-the-shelf data format would not have.

There is indeed quite a bit of code and effort tied to wrangling the details of the file format itself.
However, a big portion of the code written so far actually deals with managing the underlying time window of data: shifting it correctly on inserts and merges, and making sure that querying works correctly.

@Swatinem Swatinem changed the title Create new TestAnalytics "daily-rollups" file. Create new TestAnalytics "daily-rollups" file Nov 26, 2024
@trent-codecov (Contributor)

How large do we expect these files to get? There may be some perf/cost issues with pulling massive files from GCS often.

@matt-codecov

Are we dividing on "time we received" or "time the test allegedly ran"? Around the date cutoff there may be awkward cases where data from yesterday finishes processing today, while other data that arrived later finishes processing sooner. So you can't necessarily just append; you may have to insert.

@Swatinem (Author)

> How large do we expect these files to get? There may be some perf/cost issues with pulling massive files from GCS often.

The format I am designing is not optimized for file size, but it should compress well with zstd, which we are moving towards for storage.

> Are we dividing on "time we received" or "time the test allegedly ran"?

Good point. I haven't considered inserting "older" data yet, but I can adapt to that quickly.
