
Create new TestAnalytics "daily-rollups" file #3016

Open
Swatinem opened this issue Nov 26, 2024 · 3 comments

Comments

@Swatinem

The existing "daily rollups" pipeline for test analytics works like this:

  • The reports_dailytestrollups table stores pre-aggregated data on a per-calendar-day basis.
  • Each test is inserted into that table using an INSERT ON CONFLICT UPDATE statement.
  • That table is being queried and further aggregated into 6 different files holding aggregations for different time ranges.
  • Those aggregation files have to be touched every single day, as they contain data with aggregation windows relative to today.

I propose to replace all this with a single file containing the following data:

  • N days worth of consecutive per-calendar-day aggregates

Storing the data in this format avoids the need to re-aggregate it every day, and allows running queries for an arbitrary date range by intersecting the range of data stored for each test with the range we want to query.

Querying data would work like this:

  • Assuming that the data stored for a given test covers 2024-11-20 back to 2024-11-18 (newest first)
  • Assuming that today is 2024-11-21
  • I want to query / aggregate data for today and yesterday
# data stored for the test:
… | 2024-11-21 | 2024-11-20 | 2024-11-19 | 2024-11-18 | …
               ^- begin     |            |            ^- end
# data we would like to query:
… | 2024-11-21 | 2024-11-20 | …
  ^ today      ^ yesterday  |
# the overlap between stored data and requested data is the data we actually aggregate:
… | 2024-11-21 | 2024-11-20 | …
               ^- begin     ^- end

Range intersection (overlap) also lets us filter out tests that have no data within the requested range: any test whose intersection with the query range is empty can simply be skipped.
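The linked implementation uses a custom binary format (in Rust), but the query logic itself is format-agnostic. Here is a minimal Python sketch of the intersection-based query, assuming a hypothetical newest-first bucket layout where `buckets[0]` holds the aggregate for `test_start` and `buckets[i]` the aggregate for `test_start - i` days; none of these names come from the actual format.

```python
from datetime import date, timedelta

def query_range(test_start: date, buckets: list[int], q_start: date, q_end: date):
    """Aggregate bucket values over the overlap of the stored and queried ranges.

    Hypothetical layout: buckets are newest-first, with buckets[0] = test_start.
    q_start is the newest day of the query, q_end the oldest (both inclusive).
    Returns None when the intersection is empty (test can be filtered out).
    """
    test_end = test_start - timedelta(days=len(buckets) - 1)
    # Intersect [test_end, test_start] with [q_end, q_start]:
    overlap_start = min(test_start, q_start)  # newest day of the overlap
    overlap_end = max(test_end, q_end)        # oldest day of the overlap
    if overlap_start < overlap_end:
        return None  # empty intersection: no data in the requested range
    # Map the overlapping days back to bucket indices and aggregate.
    first = (test_start - overlap_start).days
    last = (test_start - overlap_end).days
    return sum(buckets[first:last + 1])
```

With the example from above (data stored for 2024-11-20 through 2024-11-18, querying today 2024-11-21 and yesterday 2024-11-20), only the 2024-11-20 bucket falls in the overlap and gets aggregated.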

Updating data would work by first shifting existing data to the right if necessary, and adding the new test data to the first bucket.
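The update step can be sketched the same way, under the same hypothetical newest-first layout: if the incoming day is newer than the stored start, shift the existing data right by prepending empty buckets; then add the value into the matching bucket. The sketch also extends the range on the old end, which would cover late-arriving data for past days.

```python
from datetime import date, timedelta

def insert_day(test_start: date, buckets: list[int], day: date, value: int):
    """Insert `value` for `day`, shifting the newest-first buckets as needed.

    Hypothetical layout: buckets[0] = test_start (the newest stored day).
    Returns the (possibly new) start date and bucket list.
    """
    buckets = list(buckets)  # work on a copy
    if day > test_start:
        # Shift existing data to the right by prepending empty buckets.
        shift = (day - test_start).days
        buckets = [0] * shift + buckets
        test_start = day
    idx = (test_start - day).days
    if idx >= len(buckets):
        # Older than anything stored: extend the window on the old end.
        buckets += [0] * (idx - len(buckets) + 1)
    buckets[idx] += value
    return test_start, buckets
```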

Garbage collection would work by simply removing tests for which the start date of their stored data range is past the data retention period.
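Garbage collection then reduces to a single filter over tests. A sketch, again with hypothetical names and an assumed 30-day retention period (the actual retention policy is not specified in this issue):

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # assumed retention period, not from the issue

def gc(tests: dict, today: date) -> dict:
    """Drop tests whose newest stored day already falls outside retention.

    `tests` maps a test name to (start_date, buckets), newest-first as above.
    """
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return {name: data for name, data in tests.items() if data[0] >= cutoff}
```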


By storing and working with a time window of data, we have the following benefits compared to the existing workflow:

  • We avoid storing data twice (in a DB table and in files); it lives in just one place, a single file
  • By avoiding the database altogether, we have the following benefits:
    • no more complicated INSERT ON CONFLICT UPDATE which might be prone to DB locking
    • data retention policy can be done using file TTLs instead of DB deletions (or partition deletion as is used right now?)
  • Storing a time window of data in a file means that we can store a single file vs 6 files for each desired aggregation window.
  • Using overlapping ranges between the time window being stored and the time window being aggregated, we avoid having to touch aggregation files each day.

codecov/test-results-parser#51 is implementing a custom binary file format that stores separate data tables for tests and their time windows of data.

It should be possible to implement the basic ideas explained above using any data format, whether Arrow, Cap'n Proto, or even JSON.
However, the custom format linked above is optimized for quick access and queries, and does not require deserializing all the data it stores, as JSON for example would.

Nonetheless, I recognize that creating a custom binary format comes with a non-zero maintenance burden that an off-the-shelf data format would not have.

There is indeed quite a bit of code and effort tied to wrangling the details of the file format itself.
However, a big portion of the code written so far actually deals with managing the underlying time window of data: shifting it correctly on inserts and merges, and making sure that querying works correctly.

@Swatinem Swatinem changed the title Create new TestAnalytics "daily-rollups" file. Create new TestAnalytics "daily-rollups" file Nov 26, 2024
@trent-codecov (Contributor)

How large do we expect these files to get? There may be some perf/cost issues with pulling massive files from GCS often.

@matt-codecov

Are we dividing on "time we received" or "time the test allegedly ran"? Around the date cutoff there may be awkward cases where data from yesterday finishes processing today, while other data that arrived later finishes processing sooner. So you can't necessarily just append; you may have to insert.

@Swatinem (Author)

> How large do we expect these files to get? There may be some perf/cost issues with pulling massive files from GCS often.

The format I am designing is not optimized for file size, but it should compress well with zstd, which we are moving towards for storage.

> Are we dividing on "time we received" or "time the test allegedly ran"?

Good point. I haven't considered inserting "older" data yet, but I can adapt to that quickly.
