
fix: lru_cache issues + meta info missing #72

Merged
giovanni-guidini merged 3 commits into main from gio/fix-session-totals-warning on Aug 11, 2023

Conversation

giovanni-guidini (Contributor) commented Aug 8, 2023

Context: codecov/engineering-team#119

So the real issue with the meta info is fixed in codecov/shared#22.
spoiler: reusing the report details cached values and changing them is not a good idea.

However, in the process of debugging that, @matt-codecov pointed out that we were not using lru_cache correctly.
Check this very well-made video: https://www.youtube.com/watch?v=sVjtp6tGo0g

So the present changes upgrade shared so that we fix the meta info issue AND address the cache issue.
There are further complications with the caching situation, which explains why I decided to put the cached value on the
obj instead of self. The thing is that there's only 1 instance of ArchiveField shared among ALL instances of
the model class (for example, all ReportDetails instances). This kinda makes sense because we only create an instance
of ArchiveField in the declaration of the ReportDetails class.

Because of that, if the cache is on the self of ArchiveField, different instances of ReportDetails will see stale cached values from other ReportDetails instances and we get wrong values. To fix that I envision 3 possibilities:

  1. Putting the cached value in the ReportDetails instance directly (the obj), and checking for the presence of that value.
    If it's there, it's guaranteed that we put it there, and we can update it on writes, so we can always use it. Because the cache lives
    on each ReportDetails instance we always get the correct value, and it's cleared when the instance is killed and garbage collected (a rough sketch of this follows below).

  2. Storing an entire table of cached values in the self (ArchiveField) and using the appropriate cached value when possible. The problem here is that we need to manage the cache ourselves (which is not that hard, honestly) and probably set a maximum size. Then we populate the cache and evict old values over time. The 2nd problem is that the values themselves might be too big to hold in memory (which can be mitigated by setting a very small cache size). There's a fine line there, but it's more work than option 1 anyway.

  3. We move the getting and parsing of the value outside ArchiveField (so it's a normal function) and use lru_cache on that function. Because the rehydrate function takes a reference to obj, I don't think we should pass that, so the issue here is that we can't cache the rehydrated value and would have to rehydrate every time (which currently is not expensive at all in any model).

This is an instance cache, so it shouldn't need to be cleaned for the duration of the instance's life
(because it is updated on writes, i.e. on SET).
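
Roughly, option 1 looks like the sketch below. This is just a minimal illustration of the idea, not the actual implementation; `_load_from_archive`, `_write_to_archive` and the cache attribute name are hypothetical stand-ins for the real archive calls:

```python
class ArchiveField:
    """Sketch of option 1: the cached value lives on the model instance (obj)."""

    def __init__(self, rehydrate=None):
        self.rehydrate = rehydrate

    def __set_name__(self, owner, name):
        # One cache attribute per field name, so a model that declares several
        # archive fields doesn't collide with itself.
        self.cached_attr = f"__archive_field_cached_{name}"

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Cache on obj, not on self: different ReportDetails instances never see
        # each other's values, and the cache dies with the instance.
        if not hasattr(obj, self.cached_attr):
            raw = self._load_from_archive(obj)  # hypothetical storage helper
            value = self.rehydrate(obj, raw) if self.rehydrate else raw
            setattr(obj, self.cached_attr, value)
        return getattr(obj, self.cached_attr)

    def __set__(self, obj, value):
        self._write_to_archive(obj, value)  # hypothetical storage helper
        # Update the per-instance cache on writes so reads stay consistent.
        setattr(obj, self.cached_attr, value)

    def _load_from_archive(self, obj):
        ...

    def _write_to_archive(self, obj, value):
        ...
```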

closes codecov/engineering-team#119

matt-codecov — this comment was marked as outdated.

giovanni-guidini changed the title from "fix: session totals missing meta warning" to "fix: lru_cache issues + meta info missing" on Aug 10, 2023
codecov-staging bot commented Aug 10, 2023

Codecov Report

Patch coverage is 100.00% of modified lines.

| Files Changed | Coverage Δ |
| --- | --- |
| utils/model_utils.py | 96.77% <100.00%> (+1.77%) ⬆️ |


codecov bot commented Aug 10, 2023

Codecov Report

Merging #72 (6528bd0) into main (b37dcd5) will not change coverage.
The diff coverage is 100.00%.

@@          Coverage Diff          @@
##            main     #72   +/-   ##
=====================================
  Coverage   95.30%  95.30%           
=====================================
  Files        693     693           
  Lines      14734   14736    +2     
=====================================
+ Hits       14041   14044    +3     
+ Misses       693     692    -1     
| Flag | Coverage Δ |
| --- | --- |
| unit | 95.24% <100.00%> (+<0.01%) ⬆️ |
| unit-latest-uploader | 95.24% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files Changed | Coverage Δ |
| --- | --- |
| utils/model_utils.py | 96.87% <100.00%> (+1.71%) ⬆️ |

matt-codecov (Contributor) commented:
good catch on there only being one ArchiveField instance for the class just like lru_cache

so the goal here is to avoid multiple requests to archive storage for the same resource, right? do we already know why those requests happen or is this speculative?

the ideal caching approach depends on the access characteristics, i guess. the per-instance caching is the best match if, hypothetically, ReportDetails repeatedly dereferences files_array during its processing. the class-wide caching is the best match if, hypothetically, each Commit instance only dereferences its ArchiveField once, but we create many Commit instances for the same commit by querying it from the db whenever we need it / across requests. and if you pick a good cache size for the class-wide cache, it also helps the first case as well

i have a weak preference for the class-wide caching, which i think is your option 2. but i don't think we need to build anything complex; i think we can just use @lru_cache slightly differently:

from functools import lru_cache

class ArchiveField:
    def __init__(self, maxsize=30):
        # wrap the bound method in a per-descriptor lru_cache instead of decorating it at class-definition time
        self._get_value_from_archive = lru_cache(maxsize=maxsize)(self._uncached_get_value_from_archive)

    def _uncached_get_value_from_archive(self, obj):
        # the actual archive fetch for obj goes here
        pass

that fixes the memory leak problem + behaves like your option 2. and it picks a default cache size, but leaves it open to configure per field if we ever need to
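
For example, if we went this route, the cache size could be tuned per field where it's declared (hypothetical usage; the field names here are just placeholders):

```python
class ReportDetails:
    # files_array values can be large, so give it a small cache;
    # the second field falls back to the default maxsize=30.
    files_array = ArchiveField(maxsize=5)
    report_metadata = ArchiveField()
```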

thanks for putting so much thought into this!

giovanni-guidini (Contributor, Author) commented Aug 10, 2023

that fixes the memory leak problem + behaves like your option 2. and it picks a default cache size, but leaves it open to configure per field if we ever need to

My understanding of the "memory leak" is that it holds on to the references of the objects passed to lru_cache. That was the case with self in the video. Your suggestion does fix that (although we can argue that it was not a problem, because we only have 1 ArchiveField instance and it lives until the end of the process anyway).

We then need to be careful with the obj reference that is passed to _uncached_get_value_from_archive. Otherwise those references might leak, and that would be a big leak. It is doable, but I personally think that if we are to do that it's better to go straight to option 3 and move the function outside ArchiveField completely... maybe a compromise somewhere between 2 and 3, using the lru_cache inside the ArchiveField class, is a good option.

EDIT
Given that the cached objects can be on the order of hundreds of MBs (for commit.report), I think we should stick with option 1: put the value in the object instance. It uses less memory, and we can be sure the object references will not stick around in the ArchiveField cache. When the object is done, the memory is freed completely.
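
To make the concern about obj references concrete, here is a hypothetical, self-contained illustration (not project code) of the difference between holding values in an lru_cache keyed by the instance and storing them on the instance itself:

```python
import gc
import weakref
from functools import lru_cache

class Model:
    pass

@lru_cache(maxsize=30)
def load_from_archive(obj):
    # obj becomes a cache key, i.e. a strong reference held by the cache
    return "imagine hundreds of MBs here"

m = Model()
ref = weakref.ref(m)
load_from_archive(m)
del m
gc.collect()
print(ref() is not None)  # True: the cache still holds the instance and the payload

m2 = Model()
m2._cached_value = "imagine hundreds of MBs here"  # option 1: cache on the instance
ref2 = weakref.ref(m2)
del m2
gc.collect()
print(ref2() is None)  # True: instance and cached value were freed together
```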

giovanni-guidini added a commit to codecov/worker that referenced this pull request Aug 10, 2023
These changes are similar to codecov/codecov-api#72
Same reasoning applies.
* Handle case where a single model uses multiple archived fields
(dynamic archived field cached property name)
* Concentrate getting/setting cache in `__get__` and `__set__` methods
@giovanni-guidini giovanni-guidini merged commit e2c6b1c into main Aug 11, 2023
11 of 12 checks passed
@giovanni-guidini giovanni-guidini deleted the gio/fix-session-totals-warning branch August 11, 2023 15:42
giovanni-guidini added a commit to codecov/worker that referenced this pull request Aug 11, 2023
These changes are similar to codecov/codecov-api#72
Same reasoning applies.

The test change happens because we now update the cache on the write. Because of that we
are not doing the encode/decode/rehydrate operations, so the data you put in is the data you get.
On one hand we could do such operations to guarantee consistency. On the other, this is no different from
what we had before `ArchiveField` AND such operations might be expensive.
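
As a self-contained illustration of that write-through behaviour (hypothetical names, not the real ArchiveField code): the setter caches the exact object that was written, so a later read returns it without a serialize/parse round trip.

```python
import json

class WriteThroughField:
    def __set_name__(self, owner, name):
        self.cached_attr = f"__cached_{name}"

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if hasattr(obj, self.cached_attr):
            return getattr(obj, self.cached_attr)
        # Fallback: pretend we fetch the serialized blob and parse it.
        return json.loads(getattr(obj, "_stored", "null"))

    def __set__(self, obj, value):
        obj._stored = json.dumps(value)        # the "archive" write
        setattr(obj, self.cached_attr, value)  # cache the raw value as-is

class ReportDetailsLike:
    files_array = WriteThroughField()

d = ReportDetailsLike()
payload = [{"filename": "awesome.py"}]
d.files_array = payload
assert d.files_array is payload  # the data you put in is the data you get
```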
Development

Successfully merging this pull request may close these issues.

Investigate why GCS data is being saved without meta info
2 participants