Skip to content

Commit

Permalink
fix: lru_cache issues + meta info missing
Browse files Browse the repository at this point in the history
Context: codecov/engineering-team#119

So the real issue with the meta info is fixed in codecov/shared#22.
spoiler: reusing the report details cached values and _changing_ them is not a good idea.

However in the process of debuging that @matt-codecov pointed out that we were not using lru_cache correctly.
Check this very well made video: https://www.youtube.com/watch?v=sVjtp6tGo0g

So the present changes upgrade shared so we fix the meta info stuff AND address the cache issue.
There are further complications with the caching situation, which explain why I decided to add the cached value in the
`obj` instead of `self`. The thing is that there's only 1 instance of `ArchiveField` shared among ALL instances of
the model class (for example, all `ReportDetail` instances). This kinda makes sense because we only create an instance
of `ArchiveField` in the declaration of the `ReportDetail` class.

Because of that if the cache is in the `self` of `ArchiveField` different instances of `ReportDetails` will have dirty cached value of other `ReportDetails` instances and we get wrong values. To fix that I envision 3 possibilities:
1. Putting the cached value in the `ReportDetails` instance directly (the `obj`), and checking for the presence of that value.
If it's there it's guaranteed that we put it there, and we can update it on writes, so that we can always use it. Because it is
for each `ReportDetails` instance we always get the correct value, and it's cleared when the instance is killed and garbage collected.

2. Storing an entire table of cached values in the `self` (`ArchiveField`) and using the appropriate cache value when possible. The problem here is that we need to manage the cache ourselves (which is not that hard, honestly) and probably set a max value. Then we will populate the cache and over time evict old values. The 2nd problem is that the values themselves might be too big to hold in memory (which can be fixed by setting a very small value in the cache size). There's a fine line there, but it's more work than option 1 anyway.

3. We move the getting and parsing of the value to outside `ArchiveField` (so it's a normal function) and use `lru_cache` in that function. Because the `rehydrate` function takes a reference to `obj` I don't think we should pass that, so the issue here is that we can't cache the rehydrated value, and would have to rehydrate every time (which currently is not expensive at all in any model)

This is an instance cache, so it shouldn't need to be cleaned for the duration of the instance's life
(because it is updates on the SET)

closes codecov/engineering-team#119
  • Loading branch information
giovanni-guidini committed Aug 10, 2023
1 parent 3bbd3c4 commit 4bacedc
Show file tree
Hide file tree
Showing 6 changed files with 22 additions and 35 deletions.
9 changes: 4 additions & 5 deletions core/tests/test_models.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,19 +86,18 @@ def test_get_report_from_storage(self, mock_archive):
assert mock_archive.call_count == 1
assert mock_read_file.call_count == 1
# This one to help us understand caching across different instances
# different instances if they are the same
assert commit.report == self.sample_report
assert mock_archive.call_count == 1
assert mock_read_file.call_count == 1
assert mock_archive.call_count == 2
assert mock_read_file.call_count == 2
# Let's see for objects with different IDs
diff_commit = CommitFactory()
storage_path = "https://storage/path/files_array.json"
diff_commit._report = None
diff_commit._report_storage_path = storage_path
diff_commit.save()
assert diff_commit.report == self.sample_report
assert mock_archive.call_count == 2
assert mock_read_file.call_count == 2
assert mock_archive.call_count == 3
assert mock_read_file.call_count == 3

@patch("utils.model_utils.ArchiveService")
def test_get_report_from_storage_file_not_found(self, mock_archive):
Expand Down
9 changes: 4 additions & 5 deletions reports/tests/test_models.py
Original file line number Diff line number Diff line change
Expand Up @@ -107,19 +107,18 @@ def test_get_files_array_from_storage(self, mock_archive):
assert mock_archive.call_count == 1
assert mock_read_file.call_count == 1
# This one to help us understand caching across different instances
# different instances if they are the same
assert details.files_array == self.sample_files_array
assert mock_archive.call_count == 1
assert mock_read_file.call_count == 1
assert mock_archive.call_count == 2
assert mock_read_file.call_count == 2
# Let's see for objects with different IDs
diff_details = ReportDetailsFactory()
storage_path = "https://storage/path/files_array.json"
diff_details._files_array = None
diff_details._files_array_storage_path = storage_path
diff_details.save()
assert diff_details.files_array == self.sample_files_array
assert mock_archive.call_count == 2
assert mock_read_file.call_count == 2
assert mock_archive.call_count == 3
assert mock_read_file.call_count == 3

@patch("utils.model_utils.ArchiveService")
def test_get_files_array_from_storage_file_not_found(self, mock_archive):
Expand Down
2 changes: 1 addition & 1 deletion requirements.in
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ factory-boy
fakeredis
freezegun
git+ssh://[email protected]/codecov/[email protected]#egg=codecovopentelem
git+ssh://[email protected]/codecov/shared.git@8573d97fd99470b73bbff6bdb69c54177e0de409#egg=shared
git+ssh://[email protected]/codecov/shared.git@6a8a33248804a9c101d34f417efda7c11e4bbe63#egg=shared
gunicorn
https://github.com/photocrowd/django-cursor-pagination/archive/f560902696b0c8509e4d95c10ba0d62700181d84.tar.gz
minio
Expand Down
2 changes: 1 addition & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -325,7 +325,7 @@ sentry-sdk==1.19.1
# via -r requirements.in
setproctitle==1.1.10
# via -r requirements.in
shared @ git+ssh://[email protected]/codecov/shared.git@8573d97fd99470b73bbff6bdb69c54177e0de409
shared @ git+ssh://[email protected]/codecov/shared.git@6a8a33248804a9c101d34f417efda7c11e4bbe63
# via -r requirements.in
simplejson==3.17.2
# via -r requirements.in
Expand Down
22 changes: 10 additions & 12 deletions utils/model_utils.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
import json
import logging
from functools import lru_cache
from typing import Any, Callable

from shared.storage.exceptions import FileNotInStorageError
Expand Down Expand Up @@ -77,15 +76,16 @@ def __set_name__(self, owner, name):
self.db_field_name = "_" + name
self.archive_field_name = "_" + name + "_storage_path"

@lru_cache(maxsize=1)
def _get_value_from_archive(self, obj):
repository = obj.get_repository()
archive_service = ArchiveService(repository=repository)
archive_field = getattr(obj, self.archive_field_name)
if archive_field:
try:
file_str = archive_service.read_file(archive_field)
return self.rehydrate_fn(obj, json.loads(file_str))
value = self.rehydrate_fn(obj, json.loads(file_str))
setattr(obj, "__archive_field_cached_value", value)
return value
except FileNotInStorageError:
log.error(
"Archive enabled field not in storage",
Expand All @@ -106,9 +106,14 @@ def _get_value_from_archive(self, obj):
return self.default_value

def __get__(self, obj, objtype=None):
cached_value = getattr(obj, "__archive_field_cached_value", None)
if cached_value:
return cached_value
db_field = getattr(obj, self.db_field_name)
if db_field is not None:
return self.rehydrate_fn(obj, db_field)
value = self.rehydrate_fn(obj, db_field)
setattr(obj, "__archive_field_cached_value", value)
return value
return self._get_value_from_archive(obj)

def __set__(self, obj, value):
Expand All @@ -118,13 +123,6 @@ def __set__(self, obj, value):
archive_service = ArchiveService(repository=repository)
old_file_path = getattr(obj, self.archive_field_name)
table_name = obj._meta.db_table
# DEBUG https://github.com/codecov/platform-team/issues/119
# We don't expect this saving to be done here, actually
if table_name == "reports_reportdetails":
log.info(
"Setting files_array from the API",
extra=dict(commit=obj.get_commitid(), data=value),
)
path = archive_service.write_json_data_to_storage(
commit_id=obj.get_commitid(),
table=table_name,
Expand All @@ -137,6 +135,6 @@ def __set__(self, obj, value):
archive_service.delete_file(old_file_path)
setattr(obj, self.archive_field_name, path)
setattr(obj, self.db_field_name, None)
self._get_value_from_archive.cache_clear()
else:
setattr(obj, self.db_field_name, value)
setattr(obj, "__archive_field_cached_value", value)
13 changes: 2 additions & 11 deletions utils/tests/unit/test_model_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -131,17 +131,8 @@ def test_archive_setter_archive_field(self, db, mocker):
assert test_class._archive_field == None
assert test_class._archive_field_storage_path == "path/to/written/object"
assert test_class.archive_field == some_json
# Writing cleans the cache
assert mock_read_file.call_count == 1
mock_read_file.assert_called_with("path/to/written/object")
mock_write_file.assert_called_with(
commit_id=commit.commitid,
table="test_table",
field="archive_field",
external_id=test_class.external_id,
data=some_json,
encoder=ReportEncoder,
)
# The cache is updated on write, so reading doesn't trigger another read
assert mock_read_file.call_count == 0
mock_archive_service.return_value.delete_file.assert_called_with(
"path/to/old/data"
)

0 comments on commit 4bacedc

Please sign in to comment.