Closed as not planned
Labels: enhancement (New feature or request)
Description
The Feature
Instead of returning the response body to the user directly, quickly upload the response to a fast S3-compatible object store (such as GCS or R2) and redirect the client to a presigned URL. This would only work for non-streaming responses.
This has been supported in OpenAI's Python client since openai/openai-python#1100; following redirects with fetch in JavaScript is the default behavior.
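For illustration, a minimal sketch of the full proposed flow, assuming boto3, a hypothetical bucket named llm-responses, and a placeholder upstream completion (none of these names come from LiteLLM itself):
import json
import uuid

import boto3
from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()
# For GCS or R2, point boto3 at their S3-compatible endpoint via endpoint_url.
s3 = boto3.client("s3")
BUCKET = "llm-responses"  # hypothetical bucket name

@app.post("/chat/completions")
async def chat_completions():
    completion = {"id": "chatcmpl-123", "object": "chat.completion", "choices": []}  # placeholder
    key = f"responses/{uuid.uuid4()}.json"
    # Upload the full response body to the object store...
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(completion).encode(),
        ContentType="application/json",
    )
    # ...and hand the client a short-lived presigned URL instead of the body.
    url = s3.generate_presigned_url(
        "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=300
    )
    return RedirectResponse(url=url, status_code=303)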
PoC:
from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()

@app.post("/chat/completions")
async def redirect_to_webhook():
    # 303 See Other makes the client re-issue the request as a GET
    return RedirectResponse(url="https://webhook.site/removed-removed-removed-removed-removed", status_code=303)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="localhost", port=8000)
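As a quick sanity check of the PoC, a sketch using httpx directly (the transport the OpenAI SDK is built on): a 303 response makes the client re-issue the request as a GET to the Location URL.
import httpx

resp = httpx.post(
    "http://localhost:8000/chat/completions",
    json={"model": "x", "messages": []},
    follow_redirects=True,  # per the PR linked above, the OpenAI client enables this too
)
print(resp.status_code)  # 200 from the redirect target, not the 303 itself
print(resp.url)          # the Location URL the PoC redirected to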
Then when using:
#!/usr/bin/env python3.11
# -*- coding: utf-8 -*-
# Author: David Manouchehri
import asyncio
import openai
import logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
c_handler = logging.StreamHandler()
logger.addHandler(c_handler)
client = openai.AsyncOpenAI(
    api_key="FAKE",
    base_url="http://localhost:8000",
)
async def main():
    response = await client.chat.completions.create(
        model="gemini-1.5-pro-preview-0409",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What’s in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                        }
                    }
                ]
            }
        ],
        temperature=0.0,
    )
    logger.info(response)
if __name__ == "__main__":
    asyncio.run(main())
Then this results in a GET request with these headers:
connection: close
x-stainless-async: async:asyncio
x-stainless-runtime-version: 3.11.9
x-stainless-runtime: CPython
x-stainless-arch: arm64
x-stainless-os: MacOS
x-stainless-package-version: 1.28.0
x-stainless-lang: python
user-agent: AsyncOpenAI/Python 1.28.0
content-type: application/json
accept: application/json
accept-encoding: gzip, deflate, br
host: webhook.site
content-length: 
Note to self: do not presign the GET URL with the authorization header.
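A sketch of that presigning step, assuming boto3 and the same hypothetical bucket as above: generate_presigned_url puts the signature in the query string, so no Authorization header is part of the signed request.
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "llm-responses", "Key": "responses/abc.json"},  # hypothetical names
    ExpiresIn=300,  # short TTL; the client fetches the URL immediately
)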
Motivation, pitch
For large responses:
- This might reduce the load on LiteLLM.
- With a slow client, LiteLLM would not need to keep the connection open, e.g. making scaling on serverless platforms more efficient.