
Investigate why GCS data is being saved without meta info #119

Closed
Tracked by #100
trent-codecov opened this issue Jul 20, 2023 · 5 comments · Fixed by codecov/codecov-api#72 or codecov/shared#22

@trent-codecov
Contributor

trent-codecov commented Jul 20, 2023

Context: https://sentry.slack.com/archives/C04M5GE2R5M/p1689792185704099

@giovanni-guidini

giovanni-guidini commented Jul 24, 2023

I checked several commits in several repos and they all had the "meta" info.
However, in the logs we do see quite a few "meta info not found in encoded SessionArray" warnings (see here).

That hasn't happened in the last 2 days, though.

@trent-codecov
Contributor Author

Marking as in review. Please check it again before the end of the sprint, and let's close it unless more cases arise.

giovanni-guidini added a commit to codecov/codecov-api that referenced this issue Jul 31, 2023
In an effort to understand why we're getting `SessionArray` data from GCS without 'meta' info,
we'll add a temporary log line when saving data from the API.
Mostly because I don't think this should happen, so I don't expect to see that line.
But who knows...

Part of codecov/engineering-team#119
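
For context, a rough sketch of the kind of temporary breadcrumb this commit describes (hypothetical names, not the actual codecov-api code): the API should not normally write this field, so the log line is only expected to fire on the bad path.

import json
import logging

log = logging.getLogger(__name__)

def set_files_array_from_api(report_details, files_array):
    # Temporary breadcrumb: we don't expect the API to write this field,
    # so any occurrence of this line in the logs is worth investigating.
    log.warning(
        "Setting files_array from the API",
        extra=dict(data_has_meta="meta" in json.dumps(files_array)),
    )
    report_details.files_array = files_array
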
@giovanni-guidini

There are still cases of this happening. It's a silent error of sorts, because it no longer matters since the fix, but I'll add some log lines to help us debug where the problem is.

@giovanni-guidini

giovanni-guidini commented Aug 3, 2023

You can dig through the logs using the following messages:

  • "Saving files_array data without meta"
    indicates worker saved report_details data and the stringified data doesn't include the substring "meta"
  • "Setting files_array from the API"
    indicates the api is saving report_details data (which should not happen)
  • "meta info not found in encoded SessionArray"
    the original warning message that sparked this whole thing

From what I see in the logs we had 2 times data without meta (when creating the report details, the data was []. In the same period 0 times data being saved by the api. And a staggering 57K times the warning.

So I think this leads us to believe that:

  1. Data is being saved correctly but never updated (by the worker)
  2. We're doing some strange transformation to the data that makes it lose the "meta" info somehow
  3. There's another edge case that I'm missing that is the actual cause of this issue, because point 2 does seem unlikely

UPDATE
After a few days the trend above seems to be confirmed. We will need to investigate further the process of retrieving data from the API.

@giovanni-guidini

giovanni-guidini commented Aug 8, 2023

I reproduced this locally. The thing that uses this is the sunburst chart in the UI (see image)

Image

Now it starts to get interesting.

I changed the retrieve function to do this:

# Roughly https://github.com/codecov/codecov-api/blob/14dfd7c868e34a840387f9529b51194924c2a669/utils/model_utils.py#L87
@lru_cache(maxsize=1)
def _get_value_from_archive(self, obj):
    repository = obj.get_repository()
    archive_service = ArchiveService(repository=repository)
    archive_field = getattr(obj, self.archive_field_name)
    if archive_field:
        try:
            file_str = archive_service.read_file(archive_field)
            if self.public_name == 'files_array':
                rehydrated_data = self.rehydrate_fn(obj, json.loads(file_str))
                for file in rehydrated_data:
                    if "session_totals" not in file or "meta" not in file['session_totals']:
                        log.warning("META NOT IN SESSION_TOTALS", extra=dict(data=file))
            return self.rehydrate_fn(obj, json.loads(file_str))
        except FileNotInStorageError:
            log.error(
                "Archive enabled field not in storage",
                extra=dict(
                    storage_path=archive_field,
                    object_id=obj.id,
                    commit=obj.get_commitid(),
                ),
            )
...

so we would get the META NOT IN SESSION_TOTALS message if something went wrong there.
I saw NO LOGS of those.

At the same time, I changed the place where we use this in a similar fashion:

# Roughly https://github.com/codecov/codecov-api/blob/14dfd7c868e34a840387f9529b51194924c2a669/services/report.py#L206
files_array = report_details.files_array
for file in files_array:
    if "session_totals" not in file or "meta" not in file['session_totals']:
        log.warning("really missing meta info", extra=dict(data=file))
return {
    file["filename"]: ReportFileSummary(
        file_index=file["file_index"],
        file_totals=ReportTotals(*file["file_totals"]),
        session_totals=file["session_totals"],
        diff_totals=file["diff_totals"],
    )
    for file in files_array
}

And... well...

WARNING: really missing meta info --- {"asctime": "2023-08-08 21:38:27,275", "name": "services.report", "lineno": 210, "pathname": "/app/services/report.py", "funcName": "build_files", "threadName": "MainThread", "data": {"filename": "services/fetch_data/tests/test_fetch_data.py", "file_index": 24, "file_totals": [0, 1, 1, 0, 0, "100", 0, 0, 0, 0, 0, 0, 0], "session_totals": {"0": [0, 1, 1, 0, 0, "100"]}, "diff_totals": null}}

EDIT
Disabling the lru_cache on the function solves the problem (here).
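
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual codecov code) of how `lru_cache` plus a caller that mutates the returned dict produces exactly this symptom: the first caller pops "meta" out of the cached dict, and every later read through the cache is missing it.

from functools import lru_cache

@lru_cache(maxsize=1)
def get_encoded_session_totals():
    # Stand-in for the rehydrated data read from GCS; lru_cache hands every
    # caller the exact same dict object.
    return {"meta": {"session_count": 1}, "0": [0, 1, 1, 0, 0, "100"]}

def build_sessions_destructively(encoded):
    # A destructive build step: removes "meta" from the dict it was given.
    meta = encoded.pop("meta")
    return {"meta": meta, "sessions": dict(encoded)}

build_sessions_destructively(get_encoded_session_totals())
print("meta" in get_encoded_session_totals())  # False -> "meta info not found ..."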

giovanni-guidini added a commit to codecov/codecov-api that referenced this issue Aug 8, 2023
Context: codecov/engineering-team#119

So it seems that after much investigation the issue was not in our code per se,
but in the `lru_cache` function. Probably because of the complex nature of both
the arguments and the return value.

I propose this solution as a quick and simple one. The idea is that string values
are safer to use as the cached return value. Obviously there's some extra processing
that will need to be done on each access, but I suppose it's an acceptable
trade-off if we get the correct values every time.

closes codecov/engineering-team#119
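
A minimal sketch of the trade-off described in this commit message (hypothetical names): cache the raw string read from storage instead of the parsed value, and pay the `json.loads` cost on every access so each caller gets a fresh object that cannot corrupt the cache.

import json
from functools import lru_cache

@lru_cache(maxsize=1)
def _read_archive_string(path):
    # Stand-in for ArchiveService.read_file(path); the cached value is a
    # plain string, which callers cannot mutate.
    return '{"meta": {"session_count": 1}, "0": [0, 1, 1, 0, 0, "100"]}'

def get_encoded_session_totals(path):
    return json.loads(_read_archive_string(path))  # fresh dict on every call

first = get_encoded_session_totals("some/path")
first.pop("meta")  # a destructive caller...
print("meta" in get_encoded_session_totals("some/path"))  # True: cache unaffected
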
giovanni-guidini added a commit to codecov/shared that referenced this issue Aug 9, 2023
Recently we had [issues](codecov/engineering-team#119) with meta info missing from the encoded data
of the sessions array. After much investigation and head scratching, the reason is that moving data to GCS and *caching* the data
causes us to re-use the dictionary instance of files_array.

Why does this matter? Because dictionaries are passed by *reference*. This means that if you mutate the dict, the change is visible
through every reference to that dictionary (read: the *cached value* in `ArchiveField`).
So basically we were shooting ourselves in the foot.

The good news is that data is being saved and fetched correctly. We just need to not make destructive changes to the encoded data,
because it might be reused in the future.

The current changes address that, making the building of `SessionTotalArray` not destructive for the encoded data.

closes codecov/engineering-team#119
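
A rough sketch of the idea behind this fix (hypothetical names, not the actual `SessionTotalArray` code): build the session totals by reading from the encoded dict instead of popping keys out of it, so the cached encoded data stays intact for future reads.

def build_session_totals(encoded):
    # Non-destructive build: read "meta" and copy the session entries instead
    # of mutating the dict that may still live in the ArchiveField cache.
    meta = encoded.get("meta")
    sessions = {key: value for key, value in encoded.items() if key != "meta"}
    return {"meta": meta, "sessions": sessions}
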
giovanni-guidini added a commit to codecov/codecov-api that referenced this issue Aug 10, 2023
Context: codecov/engineering-team#119

So the real issue with the meta info is fixed in codecov/shared#22.
spoiler: reusing the report details cached values and _changing_ them is not a good idea.

However, in the process of debugging that, @matt-codecov pointed out that we were not using lru_cache correctly.
Check out this very well-made video: https://www.youtube.com/watch?v=sVjtp6tGo0g

So the present changes upgrade shared so we fix the meta info stuff AND address the cache issue.
There are further complications with the caching situation, which explain why I decided to add the cached value in the
`obj` instead of `self`. The thing is that there's only 1 instance of `ArchiveField` shared among ALL instances of
the model class (for example, all `ReportDetail` instances). This kinda makes sense because we only create an instance
of `ArchiveField` in the declaration of the `ReportDetail` class.

Because of that, if the cache lives on the `self` of `ArchiveField`, different instances of `ReportDetails` will see stale cached values from other `ReportDetails` instances and we get wrong values. To fix that I envision 3 possibilities:
1. Putting the cached value in the `ReportDetails` instance directly (the `obj`) and checking for the presence of that value.
If it's there, it's guaranteed that we put it there, and we can update it on writes, so we can always use it. Because the cache is
per `ReportDetails` instance we always get the correct value, and it's cleared when the instance is garbage collected (sketched below, after this commit message).

2. Storing an entire table of cached values in the `self` (`ArchiveField`) and using the appropriate cached value when possible. The problem here is that we need to manage the cache ourselves (which is not that hard, honestly) and probably set a maximum size. Then we populate the cache and over time evict old values. The second problem is that the values themselves might be too big to hold in memory (which can be mitigated by setting a very small cache size). There's a fine line there, but it's more work than option 1 anyway.

3. Moving the getting and parsing of the value outside `ArchiveField` (so it's a normal function) and using `lru_cache` on that function. Because the `rehydrate` function takes a reference to `obj` I don't think we should pass that, so the issue here is that we can't cache the rehydrated value and would have to rehydrate every time (which currently is not expensive at all in any model).

This is an instance cache, so it shouldn't need to be cleaned for the duration of the instance's life
(because it is updated on SET).

closes codecov/engineering-team#119
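
A minimal sketch of option 1 from the commit message above (hypothetical names; the real `ArchiveField` also handles rehydration, storage paths, etc.): because the descriptor instance is shared by every model instance, it stashes the cached value on each `obj` rather than on itself, and refreshes it on writes.

class ArchiveFieldSketch:
    """Descriptor shared by all model instances; caches per-instance on `obj`."""

    def __init__(self, name):
        self.name = name
        self.cache_attr = f"_archive_cache_{name}"

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if not hasattr(obj, self.cache_attr):
            # Expensive path: stand-in for reading and rehydrating from GCS.
            setattr(obj, self.cache_attr, self._load_from_storage(obj))
        return getattr(obj, self.cache_attr)

    def __set__(self, obj, value):
        self._save_to_storage(obj, value)         # write-through to storage
        setattr(obj, self.cache_attr, value)      # keep the per-instance cache fresh

    def _load_from_storage(self, obj):
        return {"files": [], "meta": {}}          # placeholder value

    def _save_to_storage(self, obj, value):
        pass                                      # placeholder write
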
adrian-codecov added a commit to codecov/codecov-api that referenced this issue Aug 10, 2023
* Add trial logic

* Get rid of unnecessary comments + linting

* reintroduce print statement and revert gql nullability

* Added expiration value when expiring trial

* Add pretrial user count field

* create a new endpoint that marks the completion step for uploads, which triggers notifs and syncs pulls (#49)

* fix: Cast UUID to str before saving in the db (#56)

* chore: add debug log for files array saving (#47)

In an effort to understand why we're getting `SessionArray` data from GCS without 'meta' info
we'll add a temporary log line when saving data from the API.
Mostly because I don't think this should happen, so I don't expect to see that line.
But who knows...

Part of codecov/engineering-team#119

* chore: change logic to write data to storage (#48)

After moving `ReportDetails.files_array` and `Commit.report` to archive successfully for Codecov
we want to do the same for some selected customers (in `repo_ids`) and then we will want
to start the process for new fields. Only for codecov.

The way things were before, if a field was set to write to GCS, that applied to all codecov
repos AND all repos under `repo_ids`, whereas we only want that for codecov repos.

So by changing the `master_write_switch` to strings we can have more power over each individual
field. Now we can use:
*  `True` or `codecov_access` to write *only* for codecov repos
*  `restricted_access` to write to all repos in `repo_ids` (+ codecov)
*  `general_access` to write to all repos

Because currently we're only writing for codecov repos, the change will not affect any customers.

Part of codecov/engineering-team#100

* fix: Force owner login during pagination test (#58)

* merge

* Small fixes + addons

* Misc additions to the plan service + trial

* fix: Ensure we don't accidentally leak stack trace info (#59)

* Add pretrial_plan resolver

* Fix test

---------

Co-authored-by: Dana Yaish <[email protected]>
Co-authored-by: scott-codecov <[email protected]>
Co-authored-by: Giovanni M Guidini <[email protected]>
This issue was closed.