The heart of Tasya, my (voice) assistant - or at least, a wrapper for it.
- LLM RAG-backed responses
- Text API endpoint
- Voice API endpoint
- Random chatting
- Weather report
- Internet search
- A recent video card with around 16 GB of VRAM for all features. This can be trimmed down to ~6 GB of VRAM for the text-only interface.
- A LLaMA-like model run with Ollama; LLaMA 3 is the preferred choice
- A whisper.cpp instance
- An xtts-api-server instance (subject to change)
- Whispering voice generation: Yandex.Speechkit API key
- Internet search: Tavily API key
- Weather information: OpenWeatherMap API key
- Best-in-class translation: DeepL API key
And last but not least: `pip install -r requirements.txt`
POST /text_input
Generates a text response based on text input. The request body should be JSON; see the example below.
At least one of query or history is required. If both are specified, query is appended to the history.
query: str
- user question for the AI
history: str
- chat history for generation, provided as a text block in which speaker turns are separated by newlines.
History is prepended to the prompt as-is, so it should follow the LLaMA 3 chat format, e.g.:
<|start_header_id|>assistant<|end_header_id|>AI message...<|eot_id|>
session_id: str
- persistent key to save chat history on the server
translate: str
- two-letter language code; the query is translated from this language, and responses are translated back into it.
Internally, the model and history operate in English. This allows interacting with the AI in other languages, with a small loss in quality.
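
For example, a minimal sketch of calling this endpoint with Python's requests library (the host, port, and exact response shape are assumptions, not part of this spec):

```python
import requests

BASE_URL = "http://localhost:8000"  # hypothetical; use your server's address

# History in the LLaMA 3 chat format described above, turns separated by newlines.
history = (
    "<|start_header_id|>user<|end_header_id|>Hi, Tasya!<|eot_id|>\n"
    "<|start_header_id|>assistant<|end_header_id|>Hello! How can I help?<|eot_id|>"
)

resp = requests.post(
    f"{BASE_URL}/text_input",
    json={
        "query": "What's the weather like today?",
        "history": history,
        "session_id": "demo-session",  # keeps the chat history on the server
        "translate": "de",             # converse in German
    },
)
resp.raise_for_status()
print(resp.text)  # exact response format depends on the server
```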
POST /voice_input
Generates a voice response based on voice input. The request body should be multipart/form-data; see the example below.
file: application/octet-stream
- WAV-encoded voice input
history: str
- chat history for generation, provided as a text block in which speaker turns are separated by newlines.
History is prepended to the prompt as-is, so it should follow the LLaMA 3 chat format, e.g.:
<|start_header_id|>assistant<|end_header_id|>AI message...<|eot_id|>
session_id: str
- persistent key to save chat history on the server
translate: str
- two-letter language code; the query is translated from this language, and responses are translated back into it.
Internally, the model and history operate in English. This allows interacting with the AI in other languages, with a small loss in quality.
return_file: bool
- whether to return the resulting audio file in the response or play it directly on the voice_player instance
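
A minimal sketch of calling this endpoint with Python's requests (the host, port, and form-field encoding are assumptions, not part of this spec):

```python
import requests

BASE_URL = "http://localhost:8000"  # hypothetical; use your server's address

with open("question.wav", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/voice_input",
        files={"file": ("question.wav", f, "application/octet-stream")},
        data={
            "session_id": "demo-session",
            "translate": "de",
            "return_file": "true",  # return the WAV instead of remote playback
        },
    )
resp.raise_for_status()

# Assumes the response body is the synthesized WAV when return_file is true.
with open("answer.wav", "wb") as out:
    out.write(resp.content)
```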
A trimmed-down whisper.cpp client for the voice_input endpoint. It should be compiled against the whisper.cpp headers.
A simple WAV audio player operating over the network. It should be used together with the VOICE_PLAYER_HOST
config variable.
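
The actual wire protocol isn't documented here; purely as an illustration, a hypothetical sketch of such a player that accepts raw WAV bytes over a TCP socket and plays them with the third-party simpleaudio package:

```python
import socket
import tempfile

import simpleaudio  # pip install simpleaudio

HOST, PORT = "0.0.0.0", 5050  # hypothetical; match whatever VOICE_PLAYER_HOST points at

with socket.create_server((HOST, PORT)) as server:
    while True:
        conn, _ = server.accept()
        # Buffer the whole WAV payload to a temp file, then play it locally.
        # (NamedTemporaryFile reopening works on Linux; adjust for Windows.)
        with conn, tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            while chunk := conn.recv(65536):
                tmp.write(chunk)
            tmp.flush()
            simpleaudio.WaveObject.from_wave_file(tmp.name).play().wait_done()
```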
- Try LLaMA 70B. This should make it possible to use the OllamaFunctions LangChain wrapper, and should also resolve the "output only ..." problems.
- Try LangGraph. This currently isn't possible because no functions wrapper exists.
- Add more agents for different tasks.
- Tune models for real-time voice conversation.
- Tune prompts (might be unnecessary with the 70B model).
- Find a way to run xtts-api-server with DeepSpeed on ROCm (or a capable Nvidia card) to enable streaming voice generation.