
Investigate why GCS data is being saved without meta info #119

Closed
Tracked by #100
trent-codecov opened this issue Jul 20, 2023 · 5 comments · Fixed by codecov/codecov-api#72 or codecov/shared#22

@trent-codecov
Contributor

trent-codecov commented Jul 20, 2023

Context: https://sentry.slack.com/archives/C04M5GE2R5M/p1689792185704099

@giovanni-guidini

giovanni-guidini commented Jul 24, 2023

I checked several commits in several repos and they all had the "meta" info.
However, in the logs we do see quite a few "meta info not found in encoded SessionArray" warnings (see here).

That hasn't happened in the last 2 days, though.

@trent-codecov
Contributor Author

Marking as in review. Please check it again before the end of the sprint, and let's close it unless more cases arise.

giovanni-guidini added a commit to codecov/codecov-api that referenced this issue Jul 31, 2023
In an effort to understand why we're getting `SessionArray` data from GCS without 'meta' info,
we'll add a temporary log line when saving data from the API.
Mostly because I don't think this should happen, so I don't expect to see that line.
But who knows...

Part of codecov/engineering-team#119
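
For context, a rough sketch of the kind of temporary breadcrumb this commit describes (hypothetical names, not the actual codecov-api code): the API should not normally write this field, so the log line is only expected to fire on the bad path.

import json
import logging

log = logging.getLogger(__name__)

def set_files_array_from_api(report_details, files_array):
    # Temporary breadcrumb: we don't expect the API to write this field,
    # so any occurrence of this line in the logs is worth investigating.
    log.warning(
        "Setting files_array from the API",
        extra=dict(data_has_meta="meta" in json.dumps(files_array)),
    )
    report_details.files_array = files_array
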
@giovanni-guidini

There are still cases of this happening. It's a silent error of sorts, because it no longer matters since the fix, but I'll add some log lines to help us debug where the problem is.

@giovanni-guidini

giovanni-guidini commented Aug 3, 2023

You can dig through the logs using the following messages:

  • "Saving files_array data without meta"
    indicates worker saved report_details data and the stringified data doesn't include the substring "meta"
  • "Setting files_array from the API"
    indicates the api is saving report_details data (which should not happen)
  • "meta info not found in encoded SessionArray"
    the original warning message that sparked this whole thing

From what I see in the logs we had 2 times data without meta (when creating the report details, the data was []. In the same period 0 times data being saved by the api. And a staggering 57K times the warning.

So I think this leads us to believe that:

  1. Data is being saved correctly but never updated (by the worker)
  2. We're doing some strange transformation to the data that makes it lose the "meta" info somehow
  3. There's another edge case that I'm missing that is the actual cause of this issue, because point 2 does seem unlikely

UPDATE
After a few days the trend above seems to be confirmed. We will need to investigate further the process of retrieving data from the API.

@giovanni-guidini

giovanni-guidini commented Aug 8, 2023

I reproduced this locally. The thing that uses this is the sunburst chart in the UI (see image)

Image

Now it starts to get interesting.

I changed the retrieve function to do this:

# Roughly https://github.com/codecov/codecov-api/blob/14dfd7c868e34a840387f9529b51194924c2a669/utils/model_utils.py#L87
@lru_cache(maxsize=1)
def _get_value_from_archive(self, obj):
    repository = obj.get_repository()
    archive_service = ArchiveService(repository=repository)
    archive_field = getattr(obj, self.archive_field_name)
    if archive_field:
        try:
            file_str = archive_service.read_file(archive_field)
            if self.public_name == 'files_array':
                rehydrated_data = self.rehydrate_fn(obj, json.loads(file_str))
                for file in rehydrated_data:
                    if "session_totals" not in file or "meta" not in file['session_totals']:
                        log.warning("META NOT IN SESSION_TOTALS", extra=dict(data=file))
            return self.rehydrate_fn(obj, json.loads(file_str))
        except FileNotInStorageError:
            log.error(
                "Archive enabled field not in storage",
                extra=dict(
                    storage_path=archive_field,
                    object_id=obj.id,
                    commit=obj.get_commitid(),
                ),
            )
...

so we would get the META NOT IN SESSION_TOTALS message if something went wrong there.
I saw NO LOGS of those.

At the same time, I changed the place where we use this in a similar fashion:

# Roughly https://github.com/codecov/codecov-api/blob/14dfd7c868e34a840387f9529b51194924c2a669/services/report.py#L206
files_array = report_details.files_array
for file in files_array:
    if "session_totals" not in file or "meta" not in file['session_totals']:
        log.warning("really missing meta info", extra=dict(data=file))
return {
    file["filename"]: ReportFileSummary(
        file_index=file["file_index"],
        file_totals=ReportTotals(*file["file_totals"]),
        session_totals=file["session_totals"],
        diff_totals=file["diff_totals"],
    )
    for file in files_array
}

And... well...

WARNING: really missing meta info --- {"asctime": "2023-08-08 21:38:27,275", "name": "services.report", "lineno": 210, "pathname": "/app/services/report.py", "funcName": "build_files", "threadName": "MainThread", "data": {"filename": "services/fetch_data/tests/test_fetch_data.py", "file_index": 24, "file_totals": [0, 1, 1, 0, 0, "100", 0, 0, 0, 0, 0, 0, 0], "session_totals": {"0": [0, 1, 1, 0, 0, "100"]}, "diff_totals": null}}

EDIT
Disabling the lru_cache on the function solves the problem (here).
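
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual codecov code) of how `lru_cache` plus a caller that mutates the returned dict produces exactly this symptom: the first caller pops "meta" out of the cached dict, and every later read through the cache is missing it.

from functools import lru_cache

@lru_cache(maxsize=1)
def get_encoded_session_totals():
    # Stand-in for the rehydrated data read from GCS; lru_cache hands every
    # caller the exact same dict object.
    return {"meta": {"session_count": 1}, "0": [0, 1, 1, 0, 0, "100"]}

def build_sessions_destructively(encoded):
    # A destructive build step: removes "meta" from the dict it was given.
    meta = encoded.pop("meta")
    return {"meta": meta, "sessions": dict(encoded)}

build_sessions_destructively(get_encoded_session_totals())
print("meta" in get_encoded_session_totals())  # False -> "meta info not found ..."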

giovanni-guidini added a commit to codecov/codecov-api that referenced this issue Aug 8, 2023
Context: codecov/engineering-team#119

So it seems that after much investigation the issue was not in our code per se,
but in the `lru_cache` function. Probably because of the complex nature of both
the arguments and the return value.

I propose this solution as a quick and simple one. The idea is that string values
are safer to use as the cached return value. Obviously there's some extra processing
that will need to be done on each access, but I suppose it's an acceptable
trade-off if we get the correct values every time.

closes codecov/engineering-team#119
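
A minimal sketch of the trade-off described in this commit message (hypothetical names): cache the raw string read from storage instead of the parsed value, and pay the `json.loads` cost on every access so each caller gets a fresh object that cannot corrupt the cache.

import json
from functools import lru_cache

@lru_cache(maxsize=1)
def _read_archive_string(path):
    # Stand-in for ArchiveService.read_file(path); the cached value is a
    # plain string, which callers cannot mutate.
    return '{"meta": {"session_count": 1}, "0": [0, 1, 1, 0, 0, "100"]}'

def get_encoded_session_totals(path):
    return json.loads(_read_archive_string(path))  # fresh dict on every call

first = get_encoded_session_totals("some/path")
first.pop("meta")  # a destructive caller...
print("meta" in get_encoded_session_totals("some/path"))  # True: cache unaffected
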
giovanni-guidini added a commit to codecov/shared that referenced this issue Aug 9, 2023
Recently we had [issues](codecov/engineering-team#119) with meta info missing from the encoded data
of the sessions array. After much investigation and head scratching, the reason is that moving data to GCS and *caching* the data
causes us to re-use the dictionary instance of files_array.

Why does this matter? Because dictionaries are passed by *reference*. This means that if you mutate the dict, the change is visible
through every reference to that dictionary (read: the *cached value* in `ArchiveField`).
So basically we were shooting ourselves in the foot.

The good news is that data is being saved and fetched correctly. We just need to not make destructive changes to the encoded data,
because it might be reused in the future.

The current changes address that, making the building of `SessionTotalArray` not destructive for the encoded data.

closes codecov/engineering-team#119
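
A rough sketch of the idea behind this fix (hypothetical names, not the actual `SessionTotalArray` code): build the session totals by reading from the encoded dict instead of popping keys out of it, so the cached encoded data stays intact for future reads.

def build_session_totals(encoded):
    # Non-destructive build: read "meta" and copy the session entries instead
    # of mutating the dict that may still live in the ArchiveField cache.
    meta = encoded.get("meta")
    sessions = {key: value for key, value in encoded.items() if key != "meta"}
    return {"meta": meta, "sessions": sessions}
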
giovanni-guidini added a commit to codecov/codecov-api that referenced this issue Aug 10, 2023
Context: codecov/engineering-team#119

So the real issue with the meta info is fixed in codecov/shared#22.
spoiler: reusing the report details cached values and _changing_ them is not a good idea.

However, in the process of debugging that, @matt-codecov pointed out that we were not using lru_cache correctly.
Check out this very well-made video: https://www.youtube.com/watch?v=sVjtp6tGo0g

So the present changes upgrade shared so we fix the meta info stuff AND address the cache issue.
There are further complications with the caching situation, which explain why I decided to add the cached value in the
`obj` instead of `self`. The thing is that there's only 1 instance of `ArchiveField` shared among ALL instances of
the model class (for example, all `ReportDetail` instances). This kinda makes sense because we only create an instance
of `ArchiveField` in the declaration of the `ReportDetail` class.

Because of that, if the cache lives on the `self` of `ArchiveField`, different instances of `ReportDetails` will see stale cached values from other `ReportDetails` instances and we get wrong values. To fix that I envision 3 possibilities:
1. Putting the cached value in the `ReportDetails` instance directly (the `obj`) and checking for the presence of that value.
If it's there, it's guaranteed that we put it there, and we can update it on writes, so we can always use it. Because the cache is
per `ReportDetails` instance we always get the correct value, and it's cleared when the instance is garbage collected (sketched below, after this commit message).

2. Storing an entire table of cached values in the `self` (`ArchiveField`) and using the appropriate cached value when possible. The problem here is that we need to manage the cache ourselves (which is not that hard, honestly) and probably set a maximum size. Then we populate the cache and over time evict old values. The second problem is that the values themselves might be too big to hold in memory (which can be mitigated by setting a very small cache size). There's a fine line there, but it's more work than option 1 anyway.

3. Moving the getting and parsing of the value outside `ArchiveField` (so it's a normal function) and using `lru_cache` on that function. Because the `rehydrate` function takes a reference to `obj` I don't think we should pass that, so the issue here is that we can't cache the rehydrated value and would have to rehydrate every time (which currently is not expensive at all in any model).

This is an instance cache, so it shouldn't need to be cleaned for the duration of the instance's life
(because it is updated on SET).

closes codecov/engineering-team#119
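
A minimal sketch of option 1 from the commit message above (hypothetical names; the real `ArchiveField` also handles rehydration, storage paths, etc.): because the descriptor instance is shared by every model instance, it stashes the cached value on each `obj` rather than on itself, and refreshes it on writes.

class ArchiveFieldSketch:
    """Descriptor shared by all model instances; caches per-instance on `obj`."""

    def __init__(self, name):
        self.name = name
        self.cache_attr = f"_archive_cache_{name}"

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if not hasattr(obj, self.cache_attr):
            # Expensive path: stand-in for reading and rehydrating from GCS.
            setattr(obj, self.cache_attr, self._load_from_storage(obj))
        return getattr(obj, self.cache_attr)

    def __set__(self, obj, value):
        self._save_to_storage(obj, value)         # write-through to storage
        setattr(obj, self.cache_attr, value)      # keep the per-instance cache fresh

    def _load_from_storage(self, obj):
        return {"files": [], "meta": {}}          # placeholder value

    def _save_to_storage(self, obj, value):
        pass                                      # placeholder write
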
adrian-codecov added a commit to codecov/codecov-api that referenced this issue Aug 10, 2023
* Add trial logic

* Get rid of unnecessary comments + linting

* reintroduce print statement and revert gql nullability

* Added expiration value when expiring trial

* Add pretrial user count field

* create a new endpoint that marks the completion step for uploads, which triggers notifs and syncs pulls (#49)

* fix: Cast UUID to str before saving in the db (#56)

* chore: add debug log for files array saving (#47)

In an effort to understand why we're getting `SessionArray` data from GCS without 'meta' info
we'll add a temporary log line when saving data from the API.
Mostly because I don't think this should happen, so I don't expect to see that line.
But who knows...

Part of codecov/engineering-team#119

* chore: change logic to write data to storage (#48)

After moving `ReportDetails.files_array` and `Commit.report` to archive successfully for Codecov
we want to do the same for some selected customers (in `repo_ids`) and then we will want
to start the process for new fields. Only for codecov.

The way things were before, if a field was set to write to GCS, that applied to all codecov
repos AND all repos under `repo_ids`, whereas we only want that for codecov repos.

So by changing the `master_write_switch` to strings we can have more power over each individual
field. Now we can use:
*  `True` or `codecov_access` to write *only* for codecov repos
*  `restricted_access` to write to all repos in `repo_ids` (+ codecov)
*  `general_access` to write to all repos

Because currently we're only writing for codecov repos, the change will not affect any customers.

Part of codecov/engineering-team#100

* fix: Force owner login during pagination test (#58)

* merge

* Small fixes + addons

* Misc additions to the plan service + trial

* fix: Ensure we don't accidentally leak stack trace info (#59)

* Add pretrial_plan resolver

* Fix test

---------

Co-authored-by: Dana Yaish <[email protected]>
Co-authored-by: scott-codecov <[email protected]>
Co-authored-by: Giovanni M Guidini <[email protected]>
This issue was closed.