Document Question-Answering with Local RAG in Android

A simple Android app that lets the user add a PDF/DOCX document and ask natural-language questions, with answers generated by an LLM (remote or on-device).

app_demo.mp4

(The PDF used in the demo can be found in the resources directory.)

YT Video: Document Question-Answering with Local RAG in Android

Goals

Demonstrate the collective use of an on-device vector database, embeddings model and a custom text-splitter to build a retrieval-augmented generation (RAG) based pipeline for simple document question-answering
Use modern Android development practices and recommended architecture guidelines
Explore and suggest better tools/alternatives for building a fully offline, on-device RAG pipeline for Android with minimal compute and storage requirements

Feature	On-Device	Remote
Sentence Embedding	✅
Text Splitter	✅
Vector Database	✅
LLM	✅	✅

Usage

Getting the APK

Download the latest APK from GitHub Releases

Download the latest APK from GitHub Releases and install it on your Android device.

Build the Project in Android Studio / IntelliJ IDEA

Clone the main branch:

$> git clone --depth=1 https://github.com/shubham0204/Android-Document-QA

Open the resulting directory in Android Studio. A project build should start automatically; if it does not, run ./gradlew :app:build in the terminal.

Run the app on a physical device or an emulator.

Use Obtainium to install the APK

Obtainium allows users to download and update apps directly from their sources, such as GitHub or F-Droid.

Download the Obtainium app by choosing your device architecture or 'Download Universal APK'.
From the bottom menu, select '➕Add App'
In the text field labelled 'App source URL *', enter the following URL and click 'Add' beside the text field: https://github.com/shubham0204/OnDevice-RAG-Android
Doc-QA should now be visible in the 'Apps' screen. You can get notifications about newer releases and download them directly without going to the GitHub repo.

Configure LLM

Gemini API Key (Cloud based remote inference)

Get an API key from Google AI Studio to use the Gemini API.

Tap '⋮' in the top-left corner, select Edit Credentials. Paste the Gemini API key in the first text-field.

Local Models from HuggingFace

The app supports downloading popular models from HuggingFace and using them locally/on-device for document question-answering.

Tap '⋮' in the top-left corner, select Manage Local Models. Click on the download icon beside the name of the model. Once the download is complete, click the arrow to load the model.

Note

To access gated models on HuggingFace, you might need to add a HuggingFace access token to the app. Tap '⋮' in the top-left corner, select Edit Credentials. Paste the HuggingFace access token in the first text-field.

Perform a Gradle sync, and run the application.

Tools

Apache POI and iTextPDF for parsing DOCX and PDF documents
ObjectBox for on-device vector-store and NoSQL database
Sentence Embeddings (all-MiniLM-L6-V2) for generating on-device text/sentence embeddings
Gemini Android SDK as a hosted large-language model (Uses Gemini-1.5-Flash)
Mediapipe LLM Inference API for executing SLMs/LLMs locally, downloaded from the litert-community HF organization

Working

The basic workflow of the app is as follows:

When the user selects a PDF/DOCX document (the only ones which can be imported for now), the text is parsed with the libraries mentioned in (1) of Tools. See PDFReader.kt and DOCXReader.kt for reference.
Chunks or overlapping sub-sequences are produced from the text, given the size of sequence (chunkSize) and the extent of overlap between two sequences (chunkOverlap). See WhiteSpaceSplitter.kt for reference.
Each chunk is encoded into a fixed-size vector i.e. a text embedding. The embeddings are inserted in the vector database, with each chunk/embedding having a distinct chunkId. See SentenceEmbeddingProvider.kt for reference.
When the user submits a query, we find the top-K most similar chunks from the database by comparing their embeddings.
The chunks corresponding to the nearest embeddings are injected into a pre-built prompt along with the query, which is provided to the LLM. The LLM generates a well-formed natural language answer to the user's query. See GeminiRemoteAPI.kt for reference.
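
The chunking and retrieval steps above can be sketched in plain Kotlin. This is a minimal illustration, not the app's code: the names splitIntoChunks, cosineSimilarity, StoredChunk and topKChunks are hypothetical, chunkSize and chunkOverlap are assumed to be measured in whitespace-separated words, and the brute-force cosine ranking stands in for the nearest-neighbour search that the vector database performs in the app.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper: splits text on whitespace and emits windows of
// `chunkSize` words, each sharing `chunkOverlap` words with the previous one.
// Assumption: sizes are counted in words, not characters.
fun splitIntoChunks(text: String, chunkSize: Int, chunkOverlap: Int): List<String> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    require(chunkOverlap in 0 until chunkSize) { "chunkOverlap must be in [0, chunkSize)" }
    val words = text.trim().split(Regex("\\s+")).filter { it.isNotEmpty() }
    val step = chunkSize - chunkOverlap
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < words.size) {
        val end = minOf(start + chunkSize, words.size)
        chunks.add(words.subList(start, end).joinToString(" "))
        if (end == words.size) break // last window reached the end of the text
        start += step
    }
    return chunks
}

// Cosine similarity between two embeddings of equal length.
// Returns 0 when either vector has zero norm.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "embeddings must have the same dimension" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    val denom = sqrt(normA) * sqrt(normB)
    return if (denom == 0f) 0f else dot / denom
}

// Hypothetical in-memory stand-in for a row in the vector database.
class StoredChunk(val chunkId: Long, val text: String, val embedding: FloatArray)

// Brute-force top-K: rank every stored chunk by similarity to the query embedding.
fun topKChunks(queryEmbedding: FloatArray, chunks: List<StoredChunk>, k: Int): List<StoredChunk> =
    chunks.sortedByDescending { cosineSimilarity(queryEmbedding, it.embedding) }.take(k)
```

For example, splitIntoChunks("a b c d e", chunkSize = 3, chunkOverlap = 1) yields the chunks "a b c" and "c d e", with the word "c" shared between them. A linear scan like topKChunks is fine for a handful of documents; an indexed vector store avoids comparing the query against every chunk.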

The prompt is as follows:

You are an intelligent search engine. You will be provided with some retrieved context, as well as the users query.
Your job is to understand the request, and answer based on the retrieved context.
Strictly Use ONLY the following pieces of context to answer the question at the end.
Provide only the answer as a response

Here is the retrieved context:
    $CONTEXT

Here is the users query:
    $QUERY
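
Filling this template can be sketched as below. This is an illustrative assumption about how the placeholders might be substituted, not the app's implementation: PROMPT_TEMPLATE and buildPrompt are hypothetical names, and joining the retrieved chunks with newlines is a choice made for the example. The placeholders are replaced in a single pass so that a chunk containing the literal text $QUERY is not substituted a second time.

```kotlin
// Prompt text copied from the README. "\$" escapes the dollar sign so that
// Kotlin does not treat $CONTEXT and $QUERY as string templates.
val PROMPT_TEMPLATE: String = listOf(
    "You are an intelligent search engine. You will be provided with some retrieved context, as well as the users query.",
    "Your job is to understand the request, and answer based on the retrieved context.",
    "Strictly Use ONLY the following pieces of context to answer the question at the end.",
    "Provide only the answer as a response",
    "",
    "Here is the retrieved context:",
    "    \$CONTEXT",
    "",
    "Here is the users query:",
    "    \$QUERY"
).joinToString("\n")

// Hypothetical helper: injects the retrieved chunks and the user's query
// into the template. The lambda's return value is inserted literally.
fun buildPrompt(retrievedChunks: List<String>, query: String): String {
    val context = retrievedChunks.joinToString("\n")
    val placeholder = Regex("\\$(CONTEXT|QUERY)")
    return placeholder.replace(PROMPT_TEMPLATE) { match ->
        if (match.groupValues[1] == "CONTEXT") context else query
    }
}
```

The resulting string is what would be sent to the LLM, whether remote or on-device.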

Discussion

Why not use on-device LLMs instead of Gemini's Cloud SDK?

Using an on-device LLM is possible on Android, but at the cost of a large app size (>1GB) and high compute requirements. Google's Edge AI SDK offers options where models like Gemma, MS Phi-2 and Falcon can run completely on-device and be accessed via Mediapipe's Android/iOS/Web APIs. See the official documentation for Mediapipe LLM Inference, which also includes instructions for LoRA fine-tuning.

Moreover, the Android-specific section of the same docs notes:

During development, you can use adb to push the model to your test device for a simpler workflow. For deployment, host the model on a server and download it at runtime. The model is too large to be bundled in an APK.

Integration with the Mediapipe LLM Inference API is easy. Lacking a capable Android device, I went ahead with the Cloud API, but it would be great to have an on-device option. Gemini Nano, currently available on a limited set of devices, is another on-device solution.

Other tools for using LLMs on Android:

mlc (Also see Llama3 on Android)
llama.cpp for Android

(Solved) Better alternatives for the Universal Sentence Encoder (embedding model)

Problem:

The app currently uses the Universal Sentence Encoder model from Google, as it was the only possible way to generate text/sentence embeddings on an Android device with a built-in API and tokenizer. It generates an embedding of size 100.

After checking the retrieved context (similar chunks) for a few questions, I noticed that the embedding model could not capture the context of a sentence to a significant extent. I couldn't find a metric on HuggingFace's MTEB to validate this point. Models from sentence-transformers were particularly good at understanding context, but their integration in an Android app remained an open problem.

Solution

The all-MiniLM-L6-V2 model from sentence-transformers has been ported to Android with the help of ONNX/onnxruntime and the Rust implementation of huggingface/tokenizers. See the app's assets folder to find the ONNX model and tokenizer.json.

See the main repository shubham0204/Sentence-Embeddings-Android for more details.

Contributions and Open Problems

Feel free to raise an issue or open a PR. The following can be improved in the app:

Instead of just Mediapipe/LiteRT compatible models, use Kotlin/Java bindings from llama.cpp to load any GGUF model.
Build a new text-splitter, taking inspiration from LangChain or LlamaIndex.
