This repository contains code for a chatbot that can answer user questions based on the content of a file. The chatbot supports PDF, plain text, and DOCX file formats.
The retrieval-augmented generation (RAG) module consists of two phases: retrieval and generation. The retrieval phase finds the passages of a knowledge document that are most relevant to the user's question, and the generation phase passes those passages to a language model, which writes an answer grounded in them. The goal is a chatbot that answers user questions accurately from the provided knowledge document while minimizing hallucination.
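The two phases can be illustrated with a small, dependency-free sketch. This is not the repository's implementation: it swaps the OpenAI embeddings and FAISS index for a toy bag-of-words cosine similarity, and it stops at building the prompt instead of calling a language model. The function names (`bag_of_words`, `retrieve`, `build_prompt`) and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens (toy stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retrieval phase: return the k chunks most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query, bag_of_words(chunk)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Generation phase input: instruct the model to stay within the retrieved context."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n"
        f"Context:\n{joined}\n"
        f"Question: {question}"
    )


# Hypothetical knowledge-document chunks, for illustration only.
chunks = [
    "The warranty covers parts and labor for two years.",
    "Returns are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
question = "How long is the warranty?"
top_chunks = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top_chunks)
```

Restricting the prompt to retrieved text, and telling the model to admit when the context lacks the answer, is one common way to reduce hallucination.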
- Upload a file and ask questions about its content.
- Process PDF files using the PyPDF2 library.
- Extract text from plain text and DOCX files using the textract library.
- Split text into smaller chunks for efficient processing using CharacterTextSplitter from the langchain library.
- Generate embeddings for text chunks using OpenAIEmbeddings from the langchain library.
- Build a knowledge base of text chunks using FAISS from the langchain library.
- Perform similarity search to find relevant documents based on user queries.
- Generate answers with a question-answering chain created by load_qa_chain from the langchain library.
- Display the generated answer to the user using Streamlit.
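The chunking step in the list above can be sketched without any dependencies. The function below is a simplified fixed-window stand-in, not langchain's CharacterTextSplitter, which splits on a separator before merging pieces. The name `split_text` and the default sizes are assumptions chosen for illustration.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_text(sample, chunk_size=10, chunk_overlap=4)
# pieces == ["abcdefghij", "ghijklmnop", "mnopqrstuv", "stuvwxyz"]
```

Overlapping windows keep a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of embedding some text twice.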
- Clone the repository:
git clone https://github.com/tknishh/FileWise.git
- Navigate to the project directory:
cd FileWise
- Install the dependencies:
pip install -r requirements.txt
Note: Make sure to set your OpenAI API key in the .env file before running the application.
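As a rough illustration of what the .env file is expected to hold, the sketch below parses KEY=VALUE lines and checks that a key is present. It assumes the variable is named OPENAI_API_KEY, which is the name the OpenAI client reads from the environment; how this application actually loads the file is not shown here, and `parse_env` and the placeholder value are hypothetical.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Placeholder contents; replace the value with your own key in the real file.
example = '# OpenAI credentials\nOPENAI_API_KEY="sk-your-key-here"\n'
settings = parse_env(example)

if not settings.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is missing from the .env contents")
```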
- Run the application:
streamlit run app.py
- Open the application in your browser by visiting http://localhost:8501 (or the address provided by Streamlit).
- Click on the "Choose File" button to upload a file.
- Once the file is uploaded, enter your question in the text input field.
- The chatbot will process the file, search for relevant documents, and generate an answer.
- The answer will be displayed below the text input field.
This project utilizes the following libraries and frameworks:
- PyPDF2
- textract
- Streamlit
- langchain
- The knowledge document contains sufficient information to answer user questions.
- The user questions are within the scope of the knowledge document.
- The chatbot will be a text-based interface.
- The chatbot will handle one user question at a time.
- Improve retrieval performance by using more advanced models such as Dense Passage Retrieval (DPR) with passage re-ranking.
- Explore different generation techniques, such as controlled text generation or leveraging pretraining on domain-specific data.
- Enhance the chatbot's conversational abilities by incorporating dialogue management techniques and context tracking.
- Deploy the chatbot as a web application or integrate it into existing chat platforms.
- Incorporate feedback loops to continuously improve the chatbot's performance and address user queries.
- Expand the knowledge base and keep it up to date with the latest information.
For any inquiries, please email [email protected].