Identical data is loaded into every session wasting memory #3078

sparrowt · 2023-11-28T16:41:24Z

Describe the bug

Each botocore Session creates its own instance of a Loader within which JSON content loaded from botocore/data/ is cached by @instance_cache e.g. on methods like load_service_model & load_data_with_path.

This caching applies to many things including loading endpoints.json into an EndpointResolver which happens in every session and results in approx 6 MB of memory allocation (to load details of all the HTTP endpoints for every region/partition for all 300+ AWS services).

The JSON files shipped with botocore presumably do not change on disk at runtime. Nevertheless if you create several sessions within a process - e.g. in a multi-threaded app because sessions are not thread safe - this exact same data is loaded into memory multiple times and cached separately in each Session's Loader and in its EndpointResolver.

It seems therefore like a bug (of the wasteful memory usage variety) that the immutable JSON cache is per-session rather than per-process. In a multi-threaded app in a resource-constrained environment every 6MB really adds up.

Expected Behavior

When creating a 2nd (and any subsequent) Session, the data which has already been loaded from endpoints.json should be re-used, quickly and without unnecessary extra memory allocation.

Current Behavior

Instead, each new session actually loads the whole thing in again resulting in another ~6 MB of memory usage each time, storing it in a new EndpointResolver (and Loader) with the new Session. (The same issue exists for other JSON data such as service definitions but I'm just focussing on the most common & most impactful example that I observed.)

Reproduction Steps

import boto3.session

# Note: in real usage the below would be in separate threads (otherwise we could just re-use the Session)
# but the threading code is omitted from this example for brevity and because it does not affect the repro

# In one thread
session = boto3.session.Session(region_name='us-east-1')  # +6mb
client = session.client('s3')  # (+5mb as it happens)
# do stuff with client

# In another thread
session = boto3.session.Session(region_name='us-east-1')  # +6mb again = REPRO: this shouldn't need to load endpoints.json again
client = session.client('someotherservice')
# do stuff with client

Possible Solution

One solution would be to make the Loader process-wide with suitable locking on state as necessary. I imagine the small extra overhead is more than paid for by the memory savings if many sessions/clients are created.

A more radical alternative would be for the pre-processing step that generates the botocore/data/ to spit out, instead of each JSON file, a python module (.py file) containing a dict with the same data. Then Loader doesn't have to load JSON, it just lazily imports the python files it needs and python's importlib gives you the process-wide sharing and thread safety for free. I imagine this would be a much more difficult change having seen the existence of things like CUSTOMER_DATA_PATH (~/.aws/models/) so it may not be feasible - but I've included it if nothing else for hypothetical comparison and to illustrate the principle of the problem.

Additional Information/Context

boto/boto3#1670 is very related - this ticket is an attempt at a detailed description of why each session increases memory usage so much and how this might be avoided.

Any

SDK version used

botocore==1.33.1 boto3==1.33.1

Environment details (OS name and version, etc.)

Windows 10 Ubuntu WSL / same happens on Amazon Linux 2

The text was updated successfully, but these errors were encountered:

tim-finnigan · 2023-12-08T18:09:24Z

Hi @sparrowt thanks for reaching out. Have you tried sharing a single loader instance across several sessions? For example:

from botocore.loaders import Loader

loader = Loader()
sessions = some_func_that_makes_multiple_sessions()
for session in session:
    session.register_component('data_loader', loader)

Another option is using a single session to create multiple clients which get passed to the other threads:

session = boto3.session.Session(region_name='us-east-1')
client1 = session.client('s3')
client2 = session.client('someotherservice')

# In one thread
client1.do_something()

# In another thread
client2.do_something()

The endpoints.json file itself is relatively small and only a small fraction of what’s causing the memory usage. The suggestions you described could involve extensive refactoring and I can't guarantee that those changes would be considered. I think it would help to have memory profile reports here that highlight the current memory usage you're seeing and how it compares with the approaches provided above.

sparrowt · 2023-12-11T15:27:41Z

Thanks so much for getting back to me @tim-finnigan. I have not tried that, I assumed Loader was not thread safe (otherwise why would each session need its own?) - before I do could you clarify a couple of things:

is Loader thread safe?
if so, is there any reason not to make this the default behaviour? (i.e. all sessions using the same 'data_loader' component, I guess unless they specify a non-default 'data_path')

To respond to some of your other points:

Another option is using a single session to create multiple clients which get passed to the other threads

Sadly this is not really an option in my case: the app in question is a multi-threaded web server and it is not possible to predict in advance which boto3 clients any given thread might need, so because Session is not thread safe, each thread has to create its own session in order to create the client(s) it needs. I am already caching that session using threading.local() so that subsequent client creations in the same thread don't need to make another session.

The endpoints.json file itself is relatively small and only a small fraction of what’s causing the memory usage.

It is 781KB (only surpassed by a handful of the service definitions) however loading it into memory in python results in nearly 6 MB of memory allocation according to analysis using the Austin profiler e.g. in the memory allocation profile trace below where I did session = boto3.session.Session(region_name='us-east-1') and then client = session.client('s3') you see that within create_default_resolver where it loads endpoints.json there is 5.86 MB of memory allocation, which is quite a large fraction of the total memory allocated, the other major parts (to the right in the trace below) being smaller and s3 specific _load_service_model (4.71 MB) and _load_service_endpoints_ruleset (2.05 MB):

jimdigriz · 2024-01-27T12:55:06Z

Have you tried sharing a single loader instance across several sessions? For example:
[snipped]

Found that each call of boto3.session.Session() (not using the default session) eat 200ms of wall clock time which lead me to this issue after I noticed all the stat/JSON parsing in my profiling; so not a RAM problem per se for me but a really expensive set up time.

The example was not exactly clear to me but did point me in the right direction on what I should be trying, so as a note for others, this got my ~200ms down to ~20ms per call to create and S3 Resource:

# preload and reuse the model to shave ~200ms each time we create a session
# https://github.com/boto/boto3/issues/1670
# https://github.com/boto/botocore/issues/3078
_loader = botocore.loaders.Loader()
# iterate contents of botocore/data/s3
for type_name in frozenset(['endpoint-rule-set-1', 'paginators-1', 'service-2', 'waiters-2']):
    _loader.load_service_model(service_name='s3', type_name=type_name)
# session *instantiation* is not safe either
_boto_session_lock = threading.Lock()
def _session():

    session = botocore.session.get_session()
    session.register_component('data_loader', _loader)
    with _boto_session_lock:
        return boto3.session.Session(region_name=region_name, botocore_session=session)

Then in your threads later you can use the following for a significantly faster setup time:

session = _session()
#session.events.register(...)
resource = session.resource('s3', config=config, endpoint_url=AWS_ENDPOINT_URL)

By default, a botocore session creates and caches an instance of JSONDecoder which consumes a lot of memory. This issue was reported here boto/botocore#3078. In the context of triggers which use boto sessions, this can result in excessive memory usage and as a result reduced capacity on the triggerer. We can reduce memory footprint by sharing the loader instance across the sessions.

sparrowt added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Nov 28, 2023

sparrowt mentioned this issue Nov 28, 2023

Excessive memory usage on multithreading boto/boto3#1670

Open

tim-finnigan added response-requested Waiting on additional info and feedback. p3 This is a minor priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Dec 8, 2023

github-actions bot removed the response-requested Waiting on additional info and feedback. label Dec 19, 2023

dstandish mentioned this issue Jul 8, 2024

Share data loader to across asyncio boto sessions apache/airflow#40658

Merged

tim-finnigan added feature-request This issue requests a feature. needs-review This issue or pull request needs review from a core team member. and removed bug This issue is a confirmed bug. labels Oct 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Identical data is loaded into every session wasting memory #3078

Identical data is loaded into every session wasting memory #3078

sparrowt commented Nov 28, 2023 •

edited

Loading

tim-finnigan commented Dec 8, 2023

sparrowt commented Dec 11, 2023 •

edited

Loading

jimdigriz commented Jan 27, 2024 •

edited

Loading

Identical data is loaded into every session wasting memory #3078

Identical data is loaded into every session wasting memory #3078

Comments

sparrowt commented Nov 28, 2023 • edited Loading

Describe the bug

Expected Behavior

Current Behavior

Reproduction Steps

Possible Solution

Additional Information/Context

SDK version used

Environment details (OS name and version, etc.)

tim-finnigan commented Dec 8, 2023

sparrowt commented Dec 11, 2023 • edited Loading

jimdigriz commented Jan 27, 2024 • edited Loading

sparrowt commented Nov 28, 2023 •

edited

Loading

sparrowt commented Dec 11, 2023 •

edited

Loading

jimdigriz commented Jan 27, 2024 •

edited

Loading