Demo video: 2024.AI.Challenge-Miraculum.generationis_demo.mp4
Bytes is a chat assistant designed to streamline access to various data sources within your company, including Jira, GitHub, Notion, and your company's website. Users can also upload their own files and query them through the assistant. It is built in Python, with Solara for the UI and LangChain for Retrieval-Augmented Generation (RAG).
- Multi-source Data Retrieval: Access data from Jira, GitHub, Notion, and your company's website seamlessly within the chat interface. Users can enable or disable specific sources as they prefer.
- Natural Language Understanding: Utilizes advanced NLP techniques to understand user queries and provide relevant responses.
- Intuitive UI: Built on Solara for a user-friendly interface, making interaction with the assistant smooth and intuitive.
- Augmented Generation: Powered by LangChain, the assistant not only retrieves data but also generates augmented responses, enhancing the user experience.
- Multiple Open Source Models: Supports multiple open-source models from Meta's Llama family and Google's open-source Gemma model, allowing users to switch between LLMs based on their requirements.
- Source Information URLs: Provides URLs for the source information the assistant uses for its answer to promote transparency and traceability.
After cloning the repository, create a .env file in the project directory with the following environment variables (a sample file is shown after the list):
JIRA_API_TOKEN
OPENAI_API_KEY
JIRA_USERNAME
GOOGLE_API_KEY
GITHUB_ACCESS_TOKEN
NOTION_API_KEY
GROQ_API_KEY
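A minimal .env might look like this; the values below are placeholders:

```
JIRA_API_TOKEN=your-jira-api-token
JIRA_USERNAME=you@company.com
OPENAI_API_KEY=your-openai-key
GOOGLE_API_KEY=your-google-api-key
GITHUB_ACCESS_TOKEN=your-github-token
NOTION_API_KEY=your-notion-integration-secret
GROQ_API_KEY=your-groq-key
```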
Obtaining Environment Variables
- JIRA_API_TOKEN and JIRA_USERNAME can be obtained from any Jira company account with read access.
- OPENAI_API_KEY can be obtained from the OpenAI API website.
- GITHUB_ACCESS_TOKEN can be obtained from any GitHub account.
- NOTION_API_KEY can be obtained by creating an integration in Notion as an admin and using its secret key. Add the integration to every Notion page you want to retrieve data from.
- GROQ_API_KEY can be obtained from Groq Cloud (used for the Llama models and Gemma).
Create a Conda environment using the environment.yml file in the project directory:
conda env create -f environment.yml
conda activate <env_name>
Run the following scripts to extract data from various sources:
python data_extraction/extract_corpus_notion.py
python data_extraction/extract_corpus_github.py
python data_extraction/extract_corpus_jira.py
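Each extraction script pulls raw content from its source API and writes it out as a JSON corpus for the later embedding step. As an illustration only, a Jira extraction could look roughly like the sketch below; the server URL, JQL filter, field selection, and output filename are assumptions, not the repository's actual code.

```python
# data_extraction/extract_corpus_jira.py -- illustrative sketch, not the actual script
import json
import os

from dotenv import load_dotenv
from jira import JIRA  # pip install jira

load_dotenv()

JIRA_SERVER = "https://your-company.atlassian.net"  # assumed server URL

# Authenticate with the credentials expected in .env
client = JIRA(
    server=JIRA_SERVER,
    basic_auth=(os.environ["JIRA_USERNAME"], os.environ["JIRA_API_TOKEN"]),
)

# Pull recently updated issues and keep only the fields needed for retrieval
issues = client.search_issues("updated >= -30d", maxResults=200)
corpus = [
    {
        "key": issue.key,
        "summary": issue.fields.summary,
        "description": issue.fields.description or "",
        "url": f"{JIRA_SERVER}/browse/{issue.key}",
    }
    for issue in issues
]

with open("corpus_jira.json", "w") as f:
    json.dump(corpus, f, indent=2)
```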
Run the following script to create and store vector stores and embeddings for all data sources locally:
python save_vector_store.py
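Conceptually, this step chunks each corpus, embeds the chunks, and persists a FAISS index per source. A minimal sketch of that idea with LangChain follows; the corpus filename, chunk sizes, and index path are assumptions.

```python
# Illustrative sketch of building and saving one FAISS index with LangChain
import json

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Assumed corpus file produced by the extraction step
with open("corpus_jira.json") as f:
    corpus = json.load(f)

texts, metadatas = [], []
for item in corpus:
    for chunk in splitter.split_text(item["summary"] + "\n" + item["description"]):
        texts.append(chunk)
        metadatas.append({"source": item["url"]})  # kept so answers can cite URLs

# Build the index and persist it locally for the assistant to load at runtime
store = FAISS.from_texts(texts, embeddings, metadatas=metadatas)
store.save_local("vector_store/jira")
```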
Run the assistant using Solara:
solara run ai_assistant.py
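For orientation, solara run serves the Page component defined in the target module. A stripped-down sketch of such a component is shown below; the real ai_assistant.py wires the input through the RAG pipeline instead of echoing it.

```python
# Minimal Solara page sketch -- illustrative only
import solara

query = solara.reactive("")

@solara.component
def Page():
    solara.InputText("Ask Bytes", value=query)
    if query.value:
        # The real app routes the query through the retrieval chain instead
        solara.Markdown(f"You asked: {query.value}")
```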
That's it! You should now have the chatbot up and running.
This document outlines the methodology and approach of Bytes, which leverages Large Language Models (LLMs) to process and respond to various data sources like Jira, GitHub, Notion, and general websites. The system is designed to intelligently manage and utilize data across these platforms, enhancing retrieval capabilities and embedding management.
Bytes is composed of several key components that interact to process data, generate embeddings, and facilitate intelligent query handling:
- Corpus Loading: JSON files representing different data sources (e.g., Notion, Jira, GitHub) are loaded into the system.
- Data Cleaning: Data from these sources is cleaned and structured. Specific keys are removed from Notion data to refine the content.
- Data Flattening and Parsing: Complex data structures from Jira and GitHub are flattened, and website data is chunked into manageable pieces for further processing.
- Embedding Generation: Uses OpenAIEmbeddings to transform processed text data into dense vector representations.
- FAISS Indexing: Embeddings are stored in FAISS indices; FAISS is a highly efficient similarity-search library that allows quick retrieval of related documents based on vector similarity.
- Chunking: Large texts are split into smaller chunks to manage the load on the LLM and improve the response accuracy.
- Embedding Retrieval: For a given query, the system retrieves the most relevant embeddings from the FAISS store.
- Chains and Prompts: The system uses LangChain to create sophisticated chains of operations, such as data retrieval chains and document processing chains.
- LLM Integration: The assistant integrates with various LLMs, such as Google Generative AI and open-source models from the Llama family and Gemma, to process and generate responses based on the context provided by the embeddings.
- Dynamic LLM Switching: Depending on user preferences, the assistant can switch between different LLM configurations to generate the most accurate and contextually appropriate responses (see the sketch after this list).
- PDF Processing: Capable of extracting text from PDF files, processing the content, and loading it into the vector store for query handling.
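The retrieval-and-generation flow can be pictured roughly as follows. This is a sketch rather than the project's actual code: the index path, prompt wording, and model identifiers are assumptions, and LangChain's ChatGroq and ChatGoogleGenerativeAI integrations stand in for the dynamic model switch.

```python
# Illustrative RAG flow with a simple model switch -- not the actual ai_assistant.py
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_groq import ChatGroq
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
store = FAISS.load_local(
    "vector_store/jira", embeddings, allow_dangerous_deserialization=True
)
retriever = store.as_retriever(search_kwargs={"k": 4})

def get_llm(choice: str):
    """Pick an LLM based on the user's selection in the UI (model ids assumed)."""
    if choice == "llama3":
        return ChatGroq(model="llama3-8b-8192")
    return ChatGoogleGenerativeAI(model="gemini-1.5-flash")

prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below and cite the source URLs.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def answer(question: str, model_choice: str) -> str:
    # Retrieve the most relevant chunks, then let the chosen LLM compose the answer
    docs = retriever.invoke(question)
    context = "\n\n".join(
        f"{d.page_content}\n(Source: {d.metadata.get('source', 'unknown')})"
        for d in docs
    )
    chain = prompt | get_llm(model_choice)
    return chain.invoke({"context": context, "question": question}).content
```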
- Initialization: Load environment variables and initialize embeddings.
- Data Loading: Load and process JSON corpora from multiple sources.
- Embedding Storage: Generate and store embeddings in FAISS.
- Query Handling: On receiving a query, chunk the necessary texts, retrieve relevant embeddings, and generate a response using the selected LLM.
- Output: The system formats and delivers the response, handling any follow-up queries by referencing the stored context and embeddings.
- Parallel Processing: The system is designed to handle multiple tasks in parallel, significantly speeding up data processing and response generation (a brief sketch follows this list).
- Configurable Runtime: Depending on the operational needs, different components of the AI assistant can be configured dynamically, allowing flexible adaptations to various types of queries and data sources.
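As one illustration of the parallelism idea, independent corpora can be embedded and indexed concurrently; the helper function, file layout, and worker count below are assumptions, not the project's actual code.

```python
# Illustrative sketch: build FAISS stores for several corpora in parallel
from concurrent.futures import ThreadPoolExecutor

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

def build_store(name: str, texts: list[str]) -> None:
    # Each source gets its own index so it can be enabled or disabled per query
    store = FAISS.from_texts(texts, embeddings)
    store.save_local(f"vector_store/{name}")

corpora = {
    "jira": ["chunked Jira text"],      # placeholders for the real chunks
    "github": ["chunked GitHub text"],
    "notion": ["chunked Notion text"],
}

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(build_store, name, texts) for name, texts in corpora.items()]
    for future in futures:
        future.result()  # re-raise any exception from a worker
```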
This methodology ensures a robust, scalable, and efficient AI assistant capable of handling complex queries across multiple domains, utilizing advanced AI and machine learning techniques to enhance productivity and decision-making processes.