A simple Android app that lets the user add a PDF/DOCX document and ask natural-language questions about it, with answers generated by an LLM (remote or on-device).
app_demo.mp4
(The PDF used in the demo can be found in the resources directory.)
- Document Question-Answering with Local RAG in Android
- Demonstrate the collective use of an on-device vector database, embeddings model and a custom text-splitter to build a retrieval-augmented generation (RAG) based pipeline for simple document question-answering
- Use modern Android development practices and recommended architecture guidelines
- Explore and suggest better tools/alternatives for building a fully offline, on-device RAG pipeline for Android with minimum compute and storage requirements
Feature | On-Device | Remote |
---|---|---|
Sentence Embedding | ✅ | |
Text Splitter | ✅ | |
Vector Database | ✅ | |
LLM | ✅ | ✅ |
Download the latest APK from GitHub Releases and install it on your Android device.
Clone the main branch:
$> git clone --depth=1 https://github.com/shubham0204/Android-Document-QA
Open the resulting directory in Android Studio. A project build should start automatically; if it does not, run
./gradlew :app:build
in the terminal.
Run the app on a physical device or an emulator.
Obtainium allows users to download and update apps directly from their sources, like GitHub or F-Droid.
- Download the Obtainium app by choosing your device architecture or 'Download Universal APK'.
- From the bottom menu, select '➕Add App'
- In the text field labelled 'App source URL *', enter the following URL and click 'Add' beside the text field:
https://github.com/shubham0204/OnDevice-RAG-Android
Doc-QA should now be visible in the 'Apps' screen. You can get notifications about newer releases and download them directly without going to the GitHub repo.
Get an API key from Google AI Studio to use the Gemini API.
Tap '⋮' in the top-left corner and select Edit Credentials. Paste the Gemini API key in the first text field.
The app supports downloading popular models from HuggingFace and using them locally/on-device for document question-answering.
Tap '⋮' in the top-left corner and select Manage Local Models. Click on the download icon beside the name of the model.
Once the download is complete, click the arrow to load the model.
Note
To access gated models on HuggingFace, you might need to add a HuggingFace access token to the app. Tap '⋮' in the top-left corner and select Edit Credentials. Paste the HuggingFace access token in the first text field.
Perform a Gradle sync, and run the application.
- Apache POI and iTextPDF for parsing DOCX and PDF documents
- ObjectBox for on-device vector-store and NoSQL database
- Sentence Embeddings (all-MiniLM-L6-V2) for generating on-device text/sentence embeddings
- Gemini Android SDK as a hosted large-language model (uses Gemini-1.5-Flash)
- Mediapipe LLM Inference API for executing SLMs/LLMs locally, downloaded from the litert-community HF organization
The basic workflow of the app is as follows:
- When the user selects a PDF/DOCX document (the only formats that can be imported for now), the text is parsed with the libraries mentioned in (1) of Tools. See PDFReader.kt and DOCXReader.kt for reference.
- Chunks, or overlapping sub-sequences, are produced from the text, given the size of each sequence (chunkSize) and the extent of overlap between two consecutive sequences (chunkOverlap). See WhiteSpaceSplitter.kt for reference.
- Each chunk is encoded into a fixed-size vector, i.e. a text embedding. The embeddings are inserted into the vector database, with each chunk/embedding having a distinct chunkId. See SentenceEmbeddingProvider.kt for reference.
- When the user submits a query, we find the top-K most similar chunks from the database by comparing their embeddings.
- The chunks corresponding to the nearest embeddings are injected into a pre-built prompt along with the query, which is provided to the LLM. The LLM generates a well-formed natural-language answer to the user's query. See GeminiRemoteAPI.kt for reference.
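The chunking and retrieval steps above can be illustrated with a minimal Kotlin sketch. It is not the code from WhiteSpaceSplitter.kt or the ObjectBox vector search: it assumes that chunkSize and chunkOverlap are counted in whitespace-separated words, and it replaces the vector database with a brute-force cosine-similarity scan. The names EmbeddedChunk, splitIntoChunks, cosineSimilarity and topKSimilar are hypothetical.

```kotlin
import kotlin.math.sqrt

// Hypothetical container pairing a chunk's text with its embedding.
class EmbeddedChunk(val chunkId: Long, val text: String, val embedding: FloatArray)

// Splits text into chunks of at most chunkSize words; consecutive chunks
// share chunkOverlap words. Assumption: sizes are counted in words.
fun splitIntoChunks(text: String, chunkSize: Int, chunkOverlap: Int): List<String> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    require(chunkOverlap in 0 until chunkSize) { "chunkOverlap must be in [0, chunkSize)" }
    val words = text.trim().split(Regex("\\s+")).filter { it.isNotEmpty() }
    val step = chunkSize - chunkOverlap
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < words.size) {
        val end = minOf(start + chunkSize, words.size)
        chunks.add(words.subList(start, end).joinToString(" "))
        if (end == words.size) break // last chunk reached the end of the text
        start += step
    }
    return chunks
}

// Cosine similarity between two embeddings of the same dimension.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "embeddings must have the same dimension" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    val denom = sqrt(normA) * sqrt(normB)
    return if (denom == 0f) 0f else dot / denom
}

// Brute-force top-K retrieval: a stand-in for the vector database query.
fun topKSimilar(query: FloatArray, chunks: List<EmbeddedChunk>, k: Int): List<EmbeddedChunk> =
    chunks.sortedByDescending { cosineSimilarity(query, it.embedding) }.take(k)
```

With chunkSize = 3 and chunkOverlap = 1, the text "a b c d e" yields the chunks "a b c" and "c d e". A linear scan like topKSimilar is only practical for small documents; an indexed vector store avoids comparing the query against every chunk.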
See the prompt,
You are an intelligent search engine. You will be provided with some retrieved context, as well as the users query.
Your job is to understand the request, and answer based on the retrieved context.
Strictly Use ONLY the following pieces of context to answer the question at the end.
Provide only the answer as a response
Here is the retrieved context:
$CONTEXT
Here is the users query:
$QUERY
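As a rough illustration of how the retrieved chunks and the query could be injected into this prompt, the sketch below fills the $CONTEXT and $QUERY placeholders in a single pass. It is an assumption about the mechanics, not the code in GeminiRemoteAPI.kt; promptTemplate and buildPrompt are hypothetical names, and joining the chunks with a blank line is an arbitrary choice.

```kotlin
// Prompt text copied from above; "\$CONTEXT" and "\$QUERY" are literal placeholders.
val promptTemplate: String = listOf(
    "You are an intelligent search engine. You will be provided with some retrieved context, as well as the users query.",
    "Your job is to understand the request, and answer based on the retrieved context.",
    "Strictly Use ONLY the following pieces of context to answer the question at the end.",
    "Provide only the answer as a response",
    "Here is the retrieved context:",
    "\$CONTEXT",
    "Here is the users query:",
    "\$QUERY"
).joinToString("\n")

// Fills both placeholders in one pass, so placeholder-like text inside a
// chunk or the query is never substituted a second time.
fun buildPrompt(retrievedChunks: List<String>, query: String): String {
    val context = retrievedChunks.joinToString("\n\n")
    return Regex("\\\$(CONTEXT|QUERY)").replace(promptTemplate) { match ->
        if (match.groupValues[1] == "CONTEXT") context else query
    }
}
```

Calling buildPrompt(listOf("Chunk A", "Chunk B"), "What is A?") returns the prompt with both chunks under the context line and the question on the last line.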
Using an on-device LLM is possible in Android, but at the expense of a large app size (>1 GB) and high compute requirements. Google's Edge AI SDK has some options where models like Gemma, MS Phi-2 and Falcon can be used completely on-device and accessed via Mediapipe's Android/iOS/Web APIs. See the official documentation for Mediapipe LLM Inference, which also includes instructions for LoRA fine-tuning.
Moreover, the Android-specific section of the same docs notes:
During development, you can use adb to push the model to your test device for a simpler workflow. For deployment, host the model on a server and download it at runtime. The model is too large to be bundled in an APK.
Integrating the Mediapipe LLM Inference API is easy. Due to the absence of a good Android device, I went ahead with the Cloud API, but it would be great to have an on-device option. Gemini Nano, currently available on a limited set of devices, is also an on-device solution.
Other tools for using LLMs on Android:
- mlc (Also see Llama3 on Android)
- llama.cpp for Android
The app initially used the Universal Sentence Encoder model from Google, as it was the only possible way to generate text/sentence embeddings on an Android device with a built-in API and tokenizer. It generates an embedding of size 100.
After checking the retrieved context (similar chunks) for a few questions, I noticed that the embedding model was not able to understand the context of the sentence to a significant extent. I couldn't find a metric to validate this point on HuggingFace's MTEB. Models such as those from sentence-transformers were particularly good at understanding context, but their integration in an Android app remained an open problem.
The all-MiniLM-L6-V2 model from sentence-transformers has been ported to Android with the help of ONNX/onnxruntime and the Rust implementation of huggingface/tokenizers. See the app's assets folder to find the ONNX model and tokenizer.json.
See the main repository shubham0204/Sentence-Embeddings-Android for more details.
Feel free to raise an issue or open a PR. The following can be improved in the app:
- Instead of just Mediapipe/LiteRT compatible models, use Kotlin/Java bindings from llama.cpp to load any GGUF model.
- Build a new text-splitter, taking inspiration from LangChain or LlamaIndex.