Clarify multithreading documentation #1246
Comments
Thanks for the feedback! I agree that it's a little ambiguous, we'll make sure to get that updated. |
In the meantime (before you update the documentation), can you provide some insight to those questions? Thanks :) |
An update is still needed. |
This ^ |
Once again, echoing the sentiment that some sort of response is needed. Can you acknowledge that the maintainers are at least receiving these communications? |
Question, for firing large amounts of s3 restore_object requests, across multiple threads, what is the safest approach to take? |
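A minimal sketch of one way to fan out restore_object calls, assuming a single client is created up front and only that client is shared by the worker threads (later comments in this thread describe clients as safe to share across threads). The helper name restore_all, the one-day retention, and the Bulk tier are illustrative assumptions, and this is not a statement about which approach is safest.

```python
from concurrent.futures import ThreadPoolExecutor


def restore_all(client, bucket, keys, days=1, max_workers=8):
    """Issue one restore_object call per key using a single shared client."""

    def restore(key):
        try:
            client.restore_object(
                Bucket=bucket,
                Key=key,
                # Assumed values: keep the restored copy one day, Bulk retrieval tier.
                RestoreRequest={"Days": days, "GlacierJobParameters": {"Tier": "Bulk"}},
            )
            return key, None
        except Exception as exc:  # collected so one failure does not stop the batch
            return key, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of keys in the returned list.
        return list(pool.map(restore, keys))
```

A call might look like restore_all(boto3.client("s3"), "<your-bucket>", keys), with the client created once in the main thread before the pool starts. Each entry in the result pairs a key with None on success or with the exception that was raised.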
Also wondering how this is. I've been carelessly sharing session and client instances over threads for years without (perceived) problems. Not saying it's the way to go, but I wonder what are the cases where it might break? Even the docs state:
I read that as "you may do just fine without... but we don't guarantee it", which is pretty vague. |
Can someone comment? I'm working on a multithreaded application, and each thread is creating its own session/resource. Are sessions/resources thread-safe? How expensive is it to create sessions, resources, and clients at scale? |
@JordonPhillips It would be nice to have some documentation about the thread safety matters. |
Am I right in assuming that with the switch to urllib3 (which promises thread-safety) in botocore 1.11.0 we should be good? |
I would also like some insight on this. I have been using put_object in celery workers and getting intermittent errors and I cannot figure out if this is due to the number of concurrent workers and limitations from AWS as to the number of clients, or an issue of threading within Boto. Could someone please provide some insight? I would be very grateful. |
Could someone please share a code example for working with a session per thread? Thanks |
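A minimal sketch of the session-per-thread pattern asked about here, assuming threading.local is used to hold one client (built from its own session) per thread. The names get_client and make_client are hypothetical, the S3 service name is only an example, and boto3 is imported lazily so the helper can also be exercised with a stand-in factory.

```python
import threading

# Attributes set on this object are visible only to the thread that set them.
_thread_state = threading.local()


def make_client():
    """Hypothetical default factory: a fresh session and an S3 client."""
    import boto3  # imported lazily; assumes boto3 is installed and configured

    session = boto3.session.Session()
    return session.client("s3")


def get_client(factory=make_client):
    """Return the calling thread's client, creating it on first use."""
    client = getattr(_thread_state, "client", None)
    if client is None:
        client = factory()
        _thread_state.client = client
    return client
```

Worker functions would call get_client() instead of referring to a module-level client, so each thread pays the creation cost once and then reuses its own instance.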
I would echo that this remains an issue. Clearer documentation would be helpful, as would general thread safety in the construction of clients/resources/sessions. |
It seems that the boto3 library is not threadsafe. The solution discussed in GitHub issues such as boto/botocore#1246 boto/boto3#1592 and the documentation at https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html?highlight=multithreading#multithreading-multiprocessing suggest a very simple change that seems to make things work.
I have no idea why no one from AWS gives a clear explanation for all these issues raised. |
Instead of
I used
and then multithreading worked fine. The boilerplate for that is something like:
|
@JordonPhillips I've just seen this documentation and it is still vague as was originally pointed out by the issue author. |
I'm joining the party here. I need to query different regions depending on certain parameters of each request, and I just realized that initializing a boto3 resource is not exactly cheap, so I'm planning on initializing a resource for each possible region on app start as a singleton and then using the corresponding resource on each request. I started thinking about thread safety, went to the docs, and was not able to work out whether this is going to be thread-safe. I would highly appreciate clarification too. |
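A rough sketch of a per-region cache along the lines described above, with two assumptions: it caches clients rather than resources (later comments in this thread describe clients as safe to share across threads and resources as not), and it serializes creation behind a lock so no two threads build a client at the same time. RegionClientCache and make_dynamodb_client are hypothetical names, and DynamoDB is only an example service.

```python
import threading


class RegionClientCache:
    """Creates at most one client per region and hands it out to callers."""

    def __init__(self, factory):
        self._factory = factory        # callable: region name -> client
        self._clients = {}
        self._lock = threading.Lock()  # serializes lookup and creation

    def get(self, region):
        with self._lock:
            client = self._clients.get(region)
            if client is None:
                client = self._factory(region)
                self._clients[region] = client
            return client


def make_dynamodb_client(region):
    """Hypothetical factory: a new session per region, so no session is shared."""
    import boto3  # imported lazily; assumes boto3 is installed and configured

    return boto3.session.Session(region_name=region).client("dynamodb")
```

An application could build RegionClientCache(make_dynamodb_client) once at startup and call cache.get(region) on each request. Calling get() for every expected region during startup would move all client creation out of the request path.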
I've fought with this issue a few times due to adverse effects of role based authentication and multiple botocore sessions. From reading the code alone, it's clear there are thread safety problems. This initialization pattern combined with how it lazy initializes things and then how it creates clients is definitely not thread safe, and that's just something that immediately stands out. However, if you look at other parts, it's obvious there is careful consideration to thread safety. So it really looks like it is supposed to be thread safe but isn't in practice. |
Hi everyone, in my case I can successfully create multiple threads that share the same session, and I am able to download from an S3 bucket, for example, without problems. This is how I do it:

import concurrent.futures
import json

import boto3

# Set up the session and a single client shared by all threads
sess = boto3.session.Session()
client = sess.client("s3")
files = ["path-to-file.json", "path-to-file2.json"]

def download_from_s3(file_path):
    # Fetch the object and parse its body as JSON
    obj = client.get_object(Bucket="<your-bucket>", Key=file_path)
    resp = json.loads(obj["Body"].read())
    return resp

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(download_from_s3, files)

Creating a session for each thread in my case results in a big slowdown, whereas with this approach I am seeing up to a 7x improvement in performance compared to synchronous downloads. |
The doc https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html#multithreading-and-multiprocessing states:
Then how can I benefit from reused connections with |
@dash-samuel wrote:
Your threads are sharing the client, which is officially thread-safe. This explains why your example works, but this doesn't mean you can generally share a session across threads. Directly operating (such as creating clients) on a shared session from multiple threads is not thread-safe per the boto3 Session docs. |
To summarize, there are two cases, one being multithreading and the other being multiprocessing.
Session: Unsafe in all cases due to shared metadata/urllib3.
Resource: Unsafe in all cases due to its direct interaction with a Botocore session.
Client: (Assuming the client is not used to interact with the underlying Botocore Session) Safe in a threaded environment, unsafe in a multiprocess environment. This is due to forking issues with urllib3's connection pool, which leave botocore unable to guarantee that HTTP messages are read in the right order if the pool isn't created under the same PID (psf/requests#4323).
I've opened a PR, boto/boto3#2848, and am requesting feedback on whether this makes it clearer. |
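For the multiprocess case in this summary, one possible arrangement is to create the client inside each worker process instead of inheriting one from the parent. The sketch below assumes concurrent.futures.ProcessPoolExecutor and its initializer hook; init_worker, head_object, and head_all are hypothetical names, and the S3 head_object call is only an example operation.

```python
from concurrent.futures import ProcessPoolExecutor

_client = None  # one client per worker process, set by init_worker


def make_client():
    """Hypothetical factory: a fresh session and an S3 client."""
    import boto3  # imported lazily; assumes boto3 is installed and configured

    return boto3.session.Session().client("s3")


def init_worker(factory=make_client):
    """Runs once inside each worker process, after the process has started."""
    global _client
    _client = factory()


def head_object(bucket, key):
    """Runs in a worker process and uses that process's own client."""
    return _client.head_object(Bucket=bucket, Key=key)


def head_all(bucket, keys, max_workers=4):
    """Fan the keys out over worker processes; results keep the order of keys."""
    with ProcessPoolExecutor(max_workers=max_workers, initializer=init_worker) as pool:
        futures = [pool.submit(head_object, bucket, key) for key in keys]
        return [future.result() for future in futures]
```

Because the connection pool is created under the worker's own PID, nothing socket-related is carried across the fork. Within each worker the client can still be shared by threads, consistent with the summary above.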
@nebi-frame @jsmodic @mattsb42-aws Do you think the PR boto/boto3#2848 adds clarity to this? |
We've merged boto/boto3#2848 today, adding detailed information on multi-threading requirements for each of the main Boto3 primitives (Clients, Resources and Sessions). The updated documentation should be in the next release. I'll leave this open until the end of the week for any further feedback and we'll plan to resolve afterwards. Thanks everyone! |
|
@ryansonshine is this comment still up to date? My main takeaway is that clients are thread safe while sessions and resources are not. |
Hi @blakete , the information merged on PR boto/boto3#2848 is up to date. |
Hi @ryansonshine , sorry to revive this old thread :) I was reading the generic problem with sharing a boto3 client with multiple processes. It is my understanding that it is related to the urllib3 connection pool that boto uses under the hood, which is problematic if shared amongst processes. I was going through one of the linked issues (psf/requests#4323) and this caught my attention:
I'm no expert on the lower-level details of what a fork does, but I believe Unix uses a copy-on-write approach, meaning that it copies the parent process's memory only when a process actually modifies it. However, this does not work for sockets or open files (meaning that they are not fork-safe). So, my expectation is that it would be OK if the connection pool were initialized in the parent process and then used by the child processes, because they would eventually get a copy, as long as the connection pool was never used in the parent process (thus, no socket created prior to the fork). Translating this a layer up to Boto, my expectation would be that it is safe to initialize a boto3 client (creating the object) in the parent process and have the child processes use that instance (eventually it will get copied), as long as no operation was ever performed prior to the fork. The reason I'm asking is the Celery + Boto use case, with Celery using fork to spawn workers. Do you think this would be safe, or am I making incorrect assumptions? I'm also not sure whether the initialization of the boto3 client itself makes some network call that could already be filling the connection pool, so this may be dangerous anyway. I can always work around this with a lazy initialization of the object, but before doing so I wanted to be sure it is really necessary. Thanks :) |
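The lazy-initialization workaround mentioned above could look roughly like the following. It does not settle whether an unused client is safe to inherit across a fork; it sidesteps the question by rebuilding the client whenever the current PID differs from the PID that created it. PerProcessClient and make_client are hypothetical names, and the first call is not guarded against two threads racing to create the client.

```python
import os


class PerProcessClient:
    """Holds a client that is created lazily and re-created after a fork."""

    def __init__(self, factory):
        self._factory = factory
        self._client = None
        self._pid = None  # PID of the process that created self._client

    def get(self):
        pid = os.getpid()
        if self._client is None or self._pid != pid:
            # Either the first use, or we are in a child that inherited
            # the parent's client: build a fresh one for this process.
            self._client = self._factory()
            self._pid = pid
        return self._client


def make_client():
    """Hypothetical factory: a fresh session and an S3 client."""
    import boto3  # imported lazily; assumes boto3 is installed and configured

    return boto3.session.Session().client("s3")
```

The holder itself, for example s3 = PerProcessClient(make_client), can be created at import time in the parent process, since no boto3 object exists until get() is first called in whichever process ends up using it.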
The documentation for boto3 states that:
The documentation then goes on to show a code example where a session is created per thread, not a resource.
Reading through previous GitHub issues, I see a note that we should create a separate session per thread. The comment that immediately follows says "resource", however.
So, do we need one session per thread, or are sessions thread safe, but not resources? Is there a 1:1 mapping between the thread safety of a resource and a client?
As a bonus question, how expensive / wasteful is it to create new clients (or sessions, as above) on-demand per executor thread...?