Create new TestAnalytics "daily-rollups" file #3016
Comments
How large do we expect these files to get? There may be some perf/cost issues with pulling massive files from GCS often.
Are we dividing on "time we received" or "time the test allegedly ran"? Around the date cutoff point there may be awkward cases where data from yesterday finishes processing today, while other data that arrived later finished processing sooner. So you can't necessarily just append; you may have to insert.
The format I am designing is not optimized for filesize, but it should compress well.
Good point. I haven't considered inserting "older" data yet, but can adapt to that quickly.
The existing "daily rollups" for test analytics look like this: the `reports_dailytestrollups` table stores pre-aggregated data on a per-calendar-day basis, updated via an `INSERT … ON CONFLICT UPDATE` statement.
I propose to replace all this with a single file containing the following data:
Storing the data in such a format would avoid the need to re-aggregate the data every day, and would allow running queries for an arbitrary date range by intersecting the range of the data stored for each test with the range we want to query.
Querying data would work like this (example dates):
- 2024-11-20 – 2024-11-18
- 2024-11-21
By using range intersection (overlap), we can also exclude tests that have no data within the requested range, by dropping empty ranges after the intersection.
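The range-intersection query described above can be sketched as follows. This is a minimal illustration, assuming each test stores one bucket per calendar day in a contiguous window; the function and variable names (`query_range`, `buckets`) are hypothetical, not the actual implementation.

```python
from datetime import date, timedelta

def query_range(start: date, buckets: list[int],
                query_start: date, query_end: date) -> list[int]:
    """Return the buckets overlapping [query_start, query_end] via range intersection."""
    # The stored window covers [start, end], one bucket per day, oldest first.
    end = start + timedelta(days=len(buckets) - 1)
    lo = max(start, query_start)
    hi = min(end, query_end)
    if lo > hi:
        # Empty intersection: this test has no data in the requested range.
        return []
    return buckets[(lo - start).days : (hi - start).days + 1]

# Data stored for 2024-11-18..2024-11-21, querying 2024-11-20..2024-11-25:
print(query_range(date(2024, 11, 18), [1, 2, 3, 4],
                  date(2024, 11, 20), date(2024, 11, 25)))
# → [3, 4]
```

Tests whose intersection comes back empty would simply be filtered out of the result set.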
Updating data would work by first shifting existing data to the right if necessary, and adding the new test data to the first bucket.
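The shift-then-insert update can be sketched like this. This is a hedged illustration only, assuming bucket 0 holds the newest day and older days follow; it also handles the "older data arrives late" case raised in the comments by indexing into (or extending) the tail instead of always writing to the front. `insert_day` and its signature are hypothetical.

```python
from datetime import date, timedelta

def insert_day(newest: date, buckets: list[int], day: date, value: int) -> date:
    """Merge `value` into the bucket for `day`, shifting data right if `day` is newer.

    Returns the (possibly updated) date of the first bucket.
    """
    if day > newest:
        # Shift existing data to the right by prepending empty buckets for the gap.
        gap = (day - newest).days
        buckets[:0] = [0] * gap
        newest = day
    idx = (newest - day).days
    if idx >= len(buckets):
        # "Older" data arriving late: grow the window toward the past.
        buckets.extend([0] * (idx - len(buckets) + 1))
    buckets[idx] += value
    return newest

buckets = [5, 3]                                            # newest day: 2024-11-20
newest = insert_day(date(2024, 11, 20), buckets, date(2024, 11, 21), 2)
# buckets is now [2, 5, 3]: old data shifted right, new data in the first bucket.
newest = insert_day(newest, buckets, date(2024, 11, 19), 1)
# buckets is now [2, 5, 4]: late-arriving older data merged in place.
```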
Garbage collection would work by simply removing tests for which the start date of their stored data range is past the data retention period.
By storing and working with a time window of data, we have the following benefits compared to the existing workflow: in particular, we avoid the `INSERT … ON CONFLICT UPDATE` statements, which might be prone to DB locking.
codecov/test-results-parser#51 is implementing a custom binary file format, storing different data tables for tests and their time window of data.
It should be possible to implement the basic ideas explained above using any kind of data format, whether `arrow`, `capnproto`, or even `json`.
However, the custom data format linked above is optimized for quick access and queries, and does not require deserializing all the data stored within it, as for example `json` would.
Nonetheless, I recognize that creating a custom binary format comes with a non-zero maintenance burden that an off-the-shelf data format would not have.
There is indeed quite some code and effort tied to just wrangling the details of the file format itself.
However, a big portion of the code written so far is actually tied to managing the underlying idea of working with a time window of data: making sure to properly shift that window when doing data inserts and merges, and making sure that querying works correctly.