Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify multithreading documentation #1246

Closed
dfee opened this issue Jul 18, 2017 · 37 comments
Closed

Clarify multithreading documentation #1246

dfee opened this issue Jul 18, 2017 · 37 comments
Labels
documentation This is a problem with documentation. enhancement This issue requests an improvement to a current feature.

Comments

@dfee
Copy link

dfee commented Jul 18, 2017

The documentation for boto3 states that:

It is recommended to create a resource instance for each thread / process in a multithreaded or multiprocess application rather than sharing a single instance among the threads / processes.(emphasis mine)

The documentation than goes on to show a code example where a session is created per thread, not a resource.

Reading through previous github issues, I see a note that we should create a separate session per thread. The comment that immediately follows says "resource", however.

So, do we need one session per thread, or are sessions thread safe, but not resources? Is there a 1:1 mapping between the thread safety of a resource and a client?

As a bonus question, how expensive / wasteful is it to create new clients (or sessions, as above) on-demand per executor thread...?

@JordonPhillips JordonPhillips added documentation This is a problem with documentation. enhancement This issue requests an improvement to a current feature. labels Jul 19, 2017
@JordonPhillips
Copy link
Contributor

Thanks for the feedback! I agree that it's a little ambiguous, we'll make sure to get that updated.

@dfee
Copy link
Author

dfee commented Jul 25, 2017

In the meantime (before you update the documentation), can you provide some insight to those questions? Thanks :)

@joguSD joguSD added the response-requested Waiting on additional info and feedback. label Aug 4, 2017
@joethompsonduo
Copy link

An update is still needed.

@shadycuz
Copy link

This ^

@joethompsonduo
Copy link

Once again, echoing the sentiment that some sort of response is needed. Can you acknowledge that the maintainers are at least receiving these communications?

@jweinst1
Copy link

jweinst1 commented May 7, 2018

Question, for firing large amounts of s3 restore_object requests, across multiple threads, what is the safest approach to take?

@tuukkamustonen
Copy link

tuukkamustonen commented Jun 5, 2018

Also wondering how this is. I've been carelessly sharing session and client instances over threads for years without (perceived) problems. Not saying it's the way to go, but I wonder what are the cases where it might break?

Even the docs state:

It is recommended to create ...

I read that as is "you may do just fine without... but we don't guarantee it", which is pretty vague.

@rahulr3
Copy link

rahulr3 commented Jun 23, 2018

Can someone comment ? I'm working on a multithreaded application and each thread is creating it's own session/ resource. Are sessions/resource thread safe ? How expensive it is to create session, resource, clients at scale ?

@jsyrjala
Copy link

@JordonPhillips It would be nice to have some documentation about the thread safety matters.

@rkiyanchuk
Copy link

rkiyanchuk commented Sep 28, 2018

@kyleknap @jamesls

Could someone please provide the clarification:
is boto threadsafe per resource, client, or session?

It's been more than a year with no response :(

@lhufnagel
Copy link

lhufnagel commented Oct 8, 2018

Am I right in assuming that with the switch to urllib3 (which promises thread-safety) in botocore 1.11.0 we should be good?

@NickWoodhams
Copy link

I would also like some insight on this. I have been using put_object in celery workers and getting intermittent errors and I cannot figure out if this is due to the number of concurrent workers and limitations from AWS as to the number of clients, or an issue of threading within Boto.

Could someone please provide some insight? I would be very grateful.

@ori-n
Copy link

ori-n commented Dec 4, 2018

Could someone please share a code example for working with a session per thread? Thanks

@hajapy
Copy link

hajapy commented Aug 7, 2019

I would echo that this remains an issue. Clearer documentation would be helpful, as would general thread safety in the construction of clients/resources/sessions.

@nebi-frame
Copy link

I have no idea why no one from AWS gives a clear explanation for all these issues raised.

@LyleScott
Copy link

LyleScott commented Feb 22, 2020

Instead of

import boto3
s3 = boto3.resource("s3")
s3.Object(bucket_name, filename).put(Body=s.getvalue())

I used

import boto3.session
sess = boto3.session.Session()
c = sess.client("s3")
c.put_object(Bucket=bucket_name, Key=filename, Body=s.getvalue())

and then multithreading worked fine. the boiler plate for that is something like:

# i'm using a partial here ... which is the  slightly more complicated case.
from functools import partial
from io import StringIO
from multiprocessing.pool import ThreadPool

def generate_db_rows(db_client) -> None:
    s = StringIO()
    s.write("some content...")

    # actual line that matters... this is a session per thread.
    sess = boto3.session.Session()

    client = sess.client("s3")
    client.put_object(Bucket='something', Key='thread_unique_filename.csv', Body=s.getvalue())

table_names = ("foo", "bar", "baz")
func = partial(generate_db_rows, db_client)
with ThreadPool(processes=10) as pool:
        pool.map(func, table_names)

@rohan-mo
Copy link

rohan-mo commented Mar 4, 2020

@JordonPhillips I've just seen this documentation and it is still vague as was originally pointed out by the issue author.
Could you please help clarify the documentation here?

@swetashre swetashre removed the response-requested Waiting on additional info and feedback. label Mar 16, 2020
@Luis-Palacios
Copy link

Luis-Palacios commented May 8, 2020

I'm joining the party here, I need to query from different regions depending on certain parameters on each request and just realized initializing a boto3 resource is not exactly cheap so planning on initialization a resource for each possible region on app start as a singleton and then using the corresponding resource on each request

I started thinking about thread-safe, I went to docs and I was not able to understand if this is going to be thread-safe.

I will highly appreciate clarifications too

@jsmodic
Copy link

jsmodic commented May 16, 2020

I've fought with this issue a few times due to adverse effects of role based authentication and multiple botocore sessions.

From reading the code alone, it's clear there are thread safety problems. This initialization pattern combined with how it lazy initializes things and then how it creates clients is definitely not thread safe, and that's just something that immediately stands out.

However, if you look at other parts, it's obvious there is careful consideration to thread safety.

So it really looks like it is supposed to be thread safe but isn't in practice.

@dash-samuel
Copy link

dash-samuel commented Sep 22, 2020

Hi everyone in my case I can successfully create multiple threads that share the same session, and am able to download from an S3 bucket for example without problems.

This is how I do it:

import concurrent.futures
import boto3
import json

# setup client and session
sess = boto3.session.Session()
client = sess.client("s3")

files = ["path-to-file.json", "path-to-file2.json"] 

def download_from_s3(file_path):
    obj = client.get_object(Bucket="<your-bucket>", Key=file_path)
    resp = json.loads(obj["Body"].read())
    return resp

with concurrent.futures.ThreadPoolExecutor() as executor:
     executor.map(download_from_s3, files)

Creating a session for each thread in my case results in a big slow down, whereas with this approach I am seeing an up to 7x improvement in performance compared to synchronous downloads.

@tomaszhlawiczka
Copy link

The doc https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html#multithreading-and-multiprocessing states:

It is recommended to create a resource instance for each thread / process in a multithreaded or multiprocess application rather than sharing a single instance among the threads / processes.

Then how can I benefit from reused connections with max_pool_connectionshttps://botocore.amazonaws.com/v1/documentation/api/latest/reference/config.html#botocore-config ?

@jmehnle
Copy link

jmehnle commented Mar 5, 2021

@dash-samuel wrote:

Hi everyone in my case I can successfully create multiple threads that share the same session, and am able to download from an S3 bucket for example without problems.

Your threads are sharing the client, which is officially thread-safe. This explains why your example works, but this doesn't mean you can generally share a session across threads. Directly operating (such as creating clients) on a shared session from multiple threads is not thread-safe per the boto3 Session docs.

@ryansonshine
Copy link

To summarize, there are two cases, one being multithreading and the other being multiprocessing.

Session: Unsafe in all cases due to shared metadata/urllib3.

Resource: Unsafe in all cases due to its direct interaction with a Botocore session.

Client: (Assuming the client is not used to interact with the underlying Botocore Session) Safe in a threaded environment, unsafe in a multiprocess environment. This is due to forking issues with urllib3’s connection pool and leaves botocore unable to guarantee http messages are read in the right order if the pool isn’t created under the same PID. (psf/requests#4323)

I've opened up a PR boto/boto3#2848 and am requesting feedback if this makes it more clear.

ryansonshine added a commit to ryansonshine/boto3 that referenced this issue May 5, 2021
@ryansonshine
Copy link

@nebi-frame @jsmodic @mattsb42-aws Do you think the PR boto/boto3#2848 adds clarity to this?

@nateprewitt
Copy link
Contributor

nateprewitt commented May 19, 2021

We've merged boto/boto3#2848 today, adding detailed information on multi-threading requirements for each of the main Boto3 primitives (Clients, Resources and Sessions). The updated documentation should be in the next release. I'll leave this open until the end of the week for any further feedback and we'll plan to resolve afterwards. Thanks everyone!

@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@blakete
Copy link

blakete commented Oct 25, 2023

To summarize, there are two cases, one being multithreading and the other being multiprocessing.

Session: Unsafe in all cases due to shared metadata/urllib3.

Resource: Unsafe in all cases due to its direct interaction with a Botocore session.

Client: (Assuming the client is not used to interact with the underlying Botocore Session) Safe in a threaded environment, unsafe in a multiprocess environment. This is due to forking issues with urllib3’s connection pool and leaves botocore unable to guarantee http messages are read in the right order if the pool isn’t created under the same PID. (psf/requests#4323)

I've opened up a PR boto/boto3#2848 and am requesting feedback if this makes it more clear.

@ryansonshine is comment still up to date? My main take away is that clients are thread safe while sessions and resources are not.

@ryansonshine
Copy link

To summarize, there are two cases, one being multithreading and the other being multiprocessing.
Session: Unsafe in all cases due to shared metadata/urllib3.
Resource: Unsafe in all cases due to its direct interaction with a Botocore session.
Client: (Assuming the client is not used to interact with the underlying Botocore Session) Safe in a threaded environment, unsafe in a multiprocess environment. This is due to forking issues with urllib3’s connection pool and leaves botocore unable to guarantee http messages are read in the right order if the pool isn’t created under the same PID. (psf/requests#4323)
I've opened up a PR boto/boto3#2848 and am requesting feedback if this makes it more clear.

@ryansonshine is comment still up to date? My main take away is that clients are thread safe while sessions and resources are not.

Hi @blakete , the information merged on PR boto/boto3#2848 is up to date.

@anteph
Copy link

anteph commented Feb 8, 2024

Hi @ryansonshine , sorry to revive this old thread :)

I was reading the generic problem with sharing a boto3 client with multiple processes. It is my understanding that it is related to the urllib3 connection pool that boto uses under the hood, which is problematic if shared amongst processes.

I was going through one of the linked issues (psf/requests#4323) and this caught my attention:

It can’t happen merely by creating a Session before forking, it has to actually be used before the fork.

I'm no expert on the lower level details of what a fork does, but I believe in Unix it uses a copy on write approach, meaning that it copies the parent process memory only when it effectively intends to modify it. However, this doesn't work for neither sockets or opened files (meaning that they are not fork safe).

So, my expectation is that it would be ok if the connection pool was initialized in the parent process and then used by the child processes because they would eventually get a copy, as long as the connection pool was never used in the parent process (thus, no socket created prior to fork).

Translating this a layer up to Boto, my expectation would be that it is safe to initialize a boto3 client (creating the object) in the parent process and have the child processes using that instance (eventually it will get copied), as long as there was never any operation being performed prior to the fork.

The reason I'm asking this is because of the use case Celery + Boto with Celery using fork to spawn workers.
I'm was basically planning to initialize the boto3 client prior to the fork and use it for operations only after.
The reason for doing this is because celery does import app modules in the master process and only then does the fork: celery/celery#6036. This means that any global variables we declare are evaluated still in the master process.

Do you think this would be safe or am I making incorrect assumptions? I'm also not sure if the initialization of the boto3 client itself is somehow doing some network call that could be already filling the conn pool, so this may eventually be dangerous anyway?

I can always work this around with a lazy initialization of the object, but before doing so wanted to be sure it is really necessary.

Thanks :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation This is a problem with documentation. enhancement This issue requests an improvement to a current feature.
Projects
None yet
Development

No branches or pull requests