# Using Claude Code with Open Models
Below is a practical, step-by-step tutorial that shows you how to point Claude Code at OpenAI's open-model releases (gpt-oss-20b / gpt-oss-120b) or Qwen3-Coder, either by self-hosting on Hugging Face Inference Endpoints or by routing through OpenRouter. It demonstrates the minimal environment-variable technique (base URL + key) as well as an optional LiteLLM proxy for larger fleets. Follow the path that best fits your infrastructure.
## Prerequisites

- Claude Code ≥ 0.5.3 with gateway support (check with `claude --version`). (LiteLLM)
- A Hugging Face account with a read/write token (Settings → Access Tokens). (Hugging Face)
- For OpenRouter, an OpenRouter API key. (OpenRouter)
## Option 1: Self-host on Hugging Face Inference Endpoints

- Open the GPT-OSS repo (`openai/gpt-oss-20b` or `openai/gpt-oss-120b`) on Hugging Face and accept the Apache-2.0 license. (Hugging Face, OpenAI)
- For Qwen, choose `Qwen/Qwen3-Coder-480B-A35B-Instruct` (or a smaller GGUF spin-off if you lack GPUs). (Hugging Face)
- Click Deploy → Inference Endpoint on the model page.
- Select the Text Generation Inference (TGI) template, version ≥ 1.4.0. TGI ships an OpenAI-compatible Messages API; tick "Enable OpenAI compatibility" or add `--enable-openai` in the advanced settings. (Hugging Face)
- Choose hardware (A10G, A100, or CPU for the 20B model) and create the endpoint. (Hugging Face)
After the endpoint status is "Running", copy:

- `ENDPOINT_URL` (ends in `/v1`).
- `HF_API_TOKEN` (your user or org token). (Hugging Face)
Set the environment variables in the shell that launches Claude Code:

```bash
export ANTHROPIC_BASE_URL="https://<your-endpoint>.us-east-1.aws.endpoints.huggingface.cloud"
export ANTHROPIC_AUTH_TOKEN="hf_xxxxxxxxxxxxxxxxx"
export ANTHROPIC_MODEL="gpt-oss-20b"   # or gpt-oss-120b / your Qwen model id
```
Claude Code now believes it is talking to Anthropic but routes to your open model, because TGI mirrors the OpenAI schema. Test it:

```bash
claude --model gpt-oss-20b
```

Streaming works too: TGI returns token streams under `/v1/chat/completions`, just like the real OpenAI API. (Hugging Face)
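If you want to sanity-check the endpoint outside Claude Code first, a direct request to the OpenAI-compatible route should return a completion. This is a minimal sketch; the endpoint URL, token, and model name are placeholders for your own values.

```bash
# Placeholder endpoint URL and token -- substitute the values copied above.
curl "https://<your-endpoint>.us-east-1.aws.endpoints.huggingface.cloud/v1/chat/completions" \
  -H "Authorization: Bearer $ANTHROPIC_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "stream": false
      }'
```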
- HF Inference Endpoints auto-scale, so watch your credit burn. (Hugging Face)
- If you need local control, run TGI in Docker with `docker run --name tgi -p 8080:80 ... --enable-openai`; a sketch follows below. (Hugging Face, GitHub)
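For reference, a local launch might look like the sketch below. The image tag, volume path, and token value are illustrative, and whether you still need `--enable-openai` depends on your TGI version, so check the TGI docs for your release.

```bash
# Illustrative local TGI launch; adjust GPU flags, paths, and the model id for your setup.
docker run --name tgi --gpus all -p 8080:80 \
  -v "$PWD/tgi-data:/data" \
  -e HF_TOKEN="hf_xxxxxxxxxxxxxxxxx" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id openai/gpt-oss-20b
  # append --enable-openai here if your TGI build requires it (see the note above)

# Then point Claude Code at the local server instead of the hosted endpoint:
export ANTHROPIC_BASE_URL="http://localhost:8080"
```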
## Option 2: Route through OpenRouter

OpenRouter exposes hundreds of models (including the new GPT-OSS and Qwen3-Coder slugs) behind one OpenAI-compatible endpoint.

- Sign up at openrouter.ai and copy your API key. (OpenRouter)
- Model slugs:
  - `openai/gpt-oss-20b` or `openai/gpt-oss-120b` (OpenAI open models). (OpenRouter)
  - `qwen/qwen3-coder-480b` (Qwen coder). (OpenRouter)
- Set the environment variables:

  ```bash
  export ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1"
  export ANTHROPIC_AUTH_TOKEN="or_xxxxxxxxx"
  export ANTHROPIC_MODEL="openai/gpt-oss-20b"
  ```
- Run:

  ```bash
  claude --model openai/gpt-oss-20b
  ```

OpenRouter handles billing and fallback; Claude Code stays unchanged. (OpenRouter)
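Before launching Claude Code you can confirm that the key and slug work with a direct request to OpenRouter's OpenAI-compatible endpoint; this is just a sanity check and reuses the variables exported above.

```bash
# Quick check that the OpenRouter key is valid and the slug resolves.
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $ANTHROPIC_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "ping"}]}'
```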
## Optional: a LiteLLM proxy for larger fleets

If you want Claude Code to hot-swap between Anthropic, GPT-OSS, Qwen, and Azure models, drop LiteLLM in front:
```yaml
model_list:
  - model_name: gpt-oss-20b
    litellm_params:
      model: openai/gpt-oss-20b            # via OpenRouter or local TGI
      api_key: os.environ/OPENROUTER_KEY
  - model_name: qwen3-coder
    litellm_params:
      model: openrouter/qwen/qwen3-coder
      api_key: os.environ/OPENROUTER_KEY
```
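One way to bring the proxy up is the LiteLLM CLI. The sketch below assumes the YAML above is saved as `config.yaml` (an illustrative filename) and that the proxy extra is installed; the master key value mirrors the token used in the next step.

```bash
# Install the proxy extra and start LiteLLM on port 4000 with the config above.
pip install 'litellm[proxy]'
export OPENROUTER_KEY="or_xxxxxxxxx"        # key referenced by os.environ/OPENROUTER_KEY
export LITELLM_MASTER_KEY="litellm_master"  # token Claude Code will send as its auth header
litellm --config config.yaml --port 4000
```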
With the proxy running, point Claude Code at it:

```bash
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="litellm_master"
claude --model gpt-oss-20b
```
LiteLLM keeps a cost log and supports simple-shuffle routing; avoid the latency-based routing mode when you still call Anthropic models. (LiteLLM)
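If a model name does not resolve, you can list what the proxy actually serves through its OpenAI-style models route; the port and master key here match the example above.

```bash
# List the model_name entries the proxy exposes (should include gpt-oss-20b and qwen3-coder).
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer litellm_master"
```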
## Troubleshooting

| Symptom | Fix |
|---|---|
| 404 on `/v1/chat/completions` | Ensure the `--enable-openai` flag is active in TGI. (Hugging Face) |
| Empty responses | Verify that `ANTHROPIC_MODEL` matches the slug you mapped. (LiteLLM) |
| 400 error after model swap | Switch the LiteLLM router to simple-shuffle, not latency-based routing. (LiteLLM) |
| Slow first token | Warm up the endpoint with a small prompt after it scales to zero. (Hugging Face) |
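For the slow-first-token case, a one-shot non-interactive prompt is usually enough to wake a scaled-to-zero endpoint before you start a real session; this assumes the `ANTHROPIC_*` variables are already exported.

```bash
# Fire a tiny prompt in print mode to warm the endpoint.
claude -p "ping" --model gpt-oss-20b
```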
## Key takeaways

- Claude Code needs only `ANTHROPIC_BASE_URL` and `ANTHROPIC_AUTH_TOKEN` to talk to any OpenAI-compatible backend.
- Hugging Face TGI 1.4+ exposes that schema, letting you host GPT-OSS or Qwen in your own cloud with minimal glue.
- OpenRouter is the fastest route if you want zero DevOps.
- LiteLLM sits in front when you want policy-based routing across many vendors.

With these methods, you can mix and match open-source and proprietary models inside the same CLI workflow, keeping costs low while preserving the familiar Claude Code developer experience.
## Claude Flow integration

Claude Flow can enhance this setup with its swarm orchestration capabilities:
- Initialize Claude Flow MCP:

  ```bash
  claude mcp add claude-flow npx claude-flow@alpha mcp start
  ```
- Configure for Open Models:

  ```bash
  # Set your chosen model backend
  export ANTHROPIC_BASE_URL="https://openrouter.ai/api/v1"
  export ANTHROPIC_AUTH_TOKEN="your_key"
  export ANTHROPIC_MODEL="openai/gpt-oss-20b"

  # Enable Claude Flow features
  export CLAUDE_FLOW_HOOKS_ENABLED="true"
  export CLAUDE_FLOW_TELEMETRY_ENABLED="true"
  ```
- Leverage Swarm Coordination:

  ```bash
  # Initialize a swarm for complex tasks
  npx claude-flow@alpha swarm init --topology mesh --max-agents 5

  # Use SPARC methodology with open models
  npx claude-flow@alpha sparc run architect "Design authentication system"
  ```
- Cost Optimization: Route simple tasks to smaller models and complex ones to larger models (see the sketch after this list)
- Performance Tracking: Monitor token usage across different models
- Swarm Coordination: Distribute work across multiple model instances
- Memory Persistence: Maintain context across sessions regardless of model
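As a rough illustration of the cost-optimization point, nothing stops you from picking the model per invocation; the prompts below are placeholders and the slugs come from the OpenRouter section above.

```bash
# Route a small cleanup task to the 20B model and a design task to the large coder model.
claude --model openai/gpt-oss-20b -p "Tidy up the helper functions in utils.py"
claude --model qwen/qwen3-coder-480b -p "Design an authentication module for the REST API"
```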
Sources: OpenAI, Business Insider, Hugging Face, OpenRouter, LiteLLM, WIRED
Last updated: January 2025 | Part of Claude Flow v2.0.0-alpha.87 documentation