Description
Confirm this is an issue with the Python library and not an underlying OpenAI API
- This is an issue with the Python library
 
Describe the bug
I noticed that rate limit errors are raised differently in the two APIs during streaming. With chat completions, the error is raised on the initial call, before the stream is read, so it is retried by the client itself. With responses, the initial call to responses.create(...) returns a stream object successfully, and an APIError indicating a rate limit is only raised once the stream is read. As a result, the client's built-in retry logic is skipped entirely and users have to implement their own. It is also unclear whether an error can be raised mid-stream, which adds to the complexity of any fix.
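For context, until the client handles this, a user-side workaround has to wrap the stream iteration itself. Below is a minimal sketch, assuming the error only surfaces once iteration begins; stream_with_retries and its parameters are illustrative, not part of the library:

import time

from openai import OpenAI, APIError

client = OpenAI(base_url="your-endpoint-v1", api_key="your-api-key")

def stream_with_retries(max_retries=3, backoff=2.0):
    # Hypothetical helper: re-issues the whole request when the stream
    # fails before producing any events, since the client's own retries
    # never fire in this case.
    for attempt in range(max_retries + 1):
        stream = client.responses.create(
            model="your-model",
            store=False,
            stream=True,
            input="a long enough prompt" * 10000,
        )
        yielded = False
        try:
            for event in stream:
                yielded = True
                yield event
            return
        except APIError:
            # Retrying after events were already delivered would duplicate
            # output, which is why the mid-stream case mentioned above is
            # the harder one to handle.
            if yielded or attempt == max_retries:
                raise
            time.sleep(backoff * 2 ** attempt)  # simple exponential backoff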
After inspecting the client's code, I initially thought this was not a library issue but a misalignment with the model's service. I contacted Microsoft support, since they host our models, but they insisted I raise an issue here.
To Reproduce
To reproduce this you need an Azure-hosted OpenAI model (I have personally tried this with gpt-4o and gpt-5). Lower your token rate limit to a low enough value and run the snippet below.
Code snippets
from openai import OpenAI

client = OpenAI(
    base_url="your-endpoint-v1",
    api_key="your-api-key",
)

response = client.responses.create(
    model="your-model",
    store=False,
    stream=True,
    input="a long enough prompt" * 10000,  # long enough to hit the rate limit
)

for event in response:
    continue  # an APIError is raised here
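For comparison, the equivalent chat completions call raises the rate limit error on the initial call, where the client's built-in retries apply, per the behavior described above (a sketch using the same placeholder credentials; RateLimitError is the typical error type here):

stream = client.chat.completions.create(
    model="your-model",
    stream=True,
    messages=[{"role": "user", "content": "a long enough prompt" * 10000}],
)  # the rate limit error is raised (and retried by the client) here
for event in stream:
    continue  # no rate limit error expected once iteration starts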
OS
Linux, Windows
Python version
Python >=3.11.9
Library version
openai v1.109.1