Lance DB migration (#4)
truskovskiyk authored Jan 28, 2025
1 parent c530dbe commit 5090ddb
Showing 10 changed files with 383 additions and 142 deletions.
30 changes: 10 additions & 20 deletions README.md
@@ -28,17 +28,11 @@ The core purpose of "No OCR" is to simplify AI-based PDF processing:
- Perform text and/or visual queries using modern embeddings.
- Use open source models for advanced question-answering on document-based diagrams, text, and more.

Key technologies:
- React-based front end (no-ocr-ui) for uploading, managing, and searching documents.
- Python-based API (no-ocr-api) that coordinates ingestion, indexing, and searching.
- Qdrant for efficient vector search and retrieval.
- ColPali & Qwen2-VL handle inference tasks (both text and vision-based).

## Key Features

- Create and manage PDF/document collections, also referred to as "cases".
- Automated ingestion to build Hugging Face-style datasets (HF_Dataset).
- Vector-based search over PDF pages (and relevant images) in Qdrant.
- Vector-based search over PDF pages (and relevant images) in LanceDB.
- Visual question-answering on images and diagrams via Qwen2-VL.
- Deployable via Docker for both the backend (Python) and UI (React).

@@ -58,19 +52,19 @@ sequenceDiagram
participant no-ocr-ui (CreateCase)
participant no-ocr-api
participant HF_Dataset
participant IngestClient
participant Qdrant
participant SearchClient
participant LanceDB
User->>no-ocr-ui (CreateCase): Upload PDFs & specify case name
no-ocr-ui (CreateCase)->>no-ocr-api: POST /create_case with PDFs
no-ocr-api->>no-ocr-api: Save PDFs to local storage
no-ocr-api->>no-ocr-api: Spawn background task (process_case)
no-ocr-api->>HF_Dataset: Convert PDFs to HF dataset
HF_Dataset-->>no-ocr-api: Return dataset
no-ocr-api->>IngestClient: Ingest dataset
IngestClient->>Qdrant: Create collection & upload points
Qdrant-->>IngestClient: Acknowledge ingestion
IngestClient-->>no-ocr-api: Done ingestion
no-ocr-api->>SearchClient: Ingest dataset
SearchClient->>LanceDB: Create collection & upload points
LanceDB-->>SearchClient: Acknowledge ingestion
SearchClient-->>no-ocr-api: Done ingestion
no-ocr-api->>no-ocr-api: Mark case status as 'done'
no-ocr-api-->>no-ocr-ui (CreateCase): Return creation response
no-ocr-ui (CreateCase)-->>User: Display success message
@@ -83,14 +77,14 @@ sequenceDiagram
participant User
participant no-ocr-ui
participant SearchClient
participant Qdrant
participant LanceDB
participant HF_Dataset
participant VLLM
User->>no-ocr-ui: Enter search query and select case
no-ocr-ui->>SearchClient: Search images by text
SearchClient->>Qdrant: Query collection with text embedding
Qdrant-->>SearchClient: Return search results
SearchClient->>LanceDB: Query collection with text embedding
LanceDB-->>SearchClient: Return search results
SearchClient-->>no-ocr-ui: Provide search results
no-ocr-ui->>HF_Dataset: Load dataset for collection
HF_Dataset-->>no-ocr-ui: Return dataset
@@ -166,7 +160,3 @@ sequenceDiagram
cd no-ocr-ui
npm run dev
```
5. (Qdrant) Run qdrant
```bash
docker run -p 6333:6333 qdrant/qdrant:v1.12.5
```
Binary file modified docs/architecture.png
6 changes: 0 additions & 6 deletions no-ocr-api/.env.example
@@ -5,10 +5,4 @@ SEARCH_TOP_K=3
COLPALI_TOKEN=
VLLM_URL=
COLPALI_BASE_URL=
QDRANT_URI="localhost"
QDRANT_PORT=6333
VECTOR_SIZE=128
INDEXING_THRESHOLD=100
QUANTILE=0.99
TOP_K=5
QDRANT_HTTPS=False
5 changes: 4 additions & 1 deletion no-ocr-api/Dockerfile
@@ -16,7 +16,10 @@ RUN pip install --upgrade pip
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

# TODO: replace with lancedb==0.18.1b1
RUN pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ lancedb==0.18.1b1

COPY . .
ENV PYTHONPATH /app/

CMD fastapi run --host 0.0.0.0 --port 8000 --workers 1 np_ocr/api.py
CMD fastapi run --host 0.0.0.0 --port 8000 --workers 1 np_ocr/api.py
75 changes: 24 additions & 51 deletions no-ocr-api/np_ocr/api.py
@@ -16,7 +16,7 @@
from pydantic_settings import BaseSettings

from np_ocr.data import pdfs_to_hf_dataset
from np_ocr.search import IngestClient, SearchClient, call_vllm
from np_ocr.search import SearchClient, call_vllm


class CustomRailwayLogFormatter(logging.Formatter):
@@ -60,13 +60,7 @@ class Settings(BaseSettings):
COLPALI_TOKEN: str
VLLM_URL: str
COLPALI_BASE_URL: str
QDRANT_URI: str
QDRANT_PORT: int
VECTOR_SIZE: int = 128
INDEXING_THRESHOLD: int = 100
QUANTILE: float = 0.99
TOP_K: int = 5
QDRANT_HTTPS: bool = True
VLLM_API_KEY: str
VLLM_MODEL: str = "Qwen2-VL-7B-Instruct"

@@ -77,12 +71,20 @@ class Config:
settings = Settings()


class SearchResult(BaseModel):
score: float
pdf_name: str
pdf_page: int
image_base64: str

class SearchResponse(BaseModel):
search_results: List[SearchResult]

class ImageAnswer(BaseModel):
answer: str

class CaseInfo(BaseModel):
name: str
unique_name: str
status: str
number_of_PDFs: int
files: List[str]
@@ -97,8 +99,7 @@ def update_status(self, new_status: str):
self.save()


search_client = SearchClient(qdrant_uri=settings.QDRANT_URI, port=settings.QDRANT_PORT, https=settings.QDRANT_HTTPS, top_k=settings.TOP_K, base_url=settings.COLPALI_BASE_URL, token=settings.COLPALI_TOKEN)
ingest_client = IngestClient(qdrant_uri=settings.QDRANT_URI, port=settings.QDRANT_PORT, https=settings.QDRANT_HTTPS, index_threshold=settings.INDEXING_THRESHOLD, vector_size=settings.VECTOR_SIZE, quantile=settings.QUANTILE, top_k=settings.TOP_K, base_url=settings.COLPALI_BASE_URL, token=settings.COLPALI_TOKEN)
search_client = SearchClient(storage_dir=settings.STORAGE_DIR, vector_size=settings.VECTOR_SIZE, base_url=settings.COLPALI_BASE_URL, token=settings.COLPALI_TOKEN)


@app.post("/vllm_call")
@@ -137,17 +138,6 @@ def vllm_call(
return image_answer




class SearchResult(BaseModel):
score: float
pdf_name: str
pdf_page: int
image_base64: str

class SearchResponse(BaseModel):
search_results: List[SearchResult]

@app.post("/search", response_model=SearchResponse)
def ai_search(user_query: str = Form(...), user_id: str = Form(...), case_name: str = Form(...)):
logger.info("start ai_search")
@@ -167,8 +157,7 @@ def ai_search(user_query: str = Form(...), user_id: str = Form(...), case_name:
with open(case_info_path, "r") as json_file:
_ = json.load(json_file) # case_info is not used directly below

unique_name =f"{user_id}_{case_name}"
search_results = search_client.search_images_by_text(user_query, case_name=unique_name, top_k=settings.SEARCH_TOP_K)
search_results = search_client.search_images_by_text(user_query, case_name=case_name, user_id=user_id, top_k=settings.SEARCH_TOP_K)
if not search_results:
return {"message": "No results found."}

@@ -178,13 +167,14 @@ def ai_search(user_query: str = Form(...), user_id: str = Form(...), case_name:

dataset = load_from_disk(dataset_path)
search_results_data = []
for result in search_results.points:
payload = result.payload
logger.info(payload)
score = result.score
image_data = dataset[payload["index"]]["image"]
pdf_name = dataset[payload["index"]]["pdf_name"]
pdf_page = dataset[payload["index"]]["pdf_page"]
print(search_results)
for point in search_results:
logger.info(point)
score = point['_distance']
index = point['index']
image_data = dataset[index]["image"]
pdf_name = dataset[index]["pdf_name"]
pdf_page = dataset[index]["pdf_page"]

# Convert image to base64 string
buffered = BytesIO()
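
The hunk above replaces Qdrant's `result.payload` / `result.score` access with plain row dictionaries keyed by `index` and `_distance`. The following is a minimal sketch of that mapping, not the repository's code: `SearchHit`, `rows_to_hits`, and the `image_bytes` field are hypothetical stand-ins (the real dataset stores a PIL image under `image`). It also assumes `_distance` is a distance, where a smaller value usually means a closer match, unlike Qdrant's `score`; this is worth verifying for the metric actually in use.

```python
import base64
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass
class SearchHit:
    """Hypothetical stand-in for the SearchResult model shown in the diff."""
    score: float
    pdf_name: str
    pdf_page: int
    image_base64: str


def rows_to_hits(
    rows: Sequence[Dict[str, Any]],
    dataset: Sequence[Dict[str, Any]],
) -> List[SearchHit]:
    """Join search rows (with 'index' and '_distance') to dataset records."""
    hits: List[SearchHit] = []
    for row in rows:
        # 'index' points back into the dataset built from the PDFs.
        record = dataset[row["index"]]
        hits.append(
            SearchHit(
                # Assumption: this is a distance, so lower means closer.
                score=float(row["_distance"]),
                pdf_name=record["pdf_name"],
                pdf_page=record["pdf_page"],
                # 'image_bytes' is an illustrative field holding raw bytes.
                image_base64=base64.b64encode(record["image_bytes"]).decode("utf-8"),
            )
        )
    return hits
```

An empty `rows` sequence yields an empty list, which keeps the return type uniform for callers that expect a list of results.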
@@ -204,13 +194,13 @@ def ai_search(user_query: str = Form(...), user_id: str = Form(...), case_name:
return SearchResponse(search_results=search_results_data)


def process_case(case_info: CaseInfo):
def process_case(case_info: CaseInfo, user_id: str):
logger.info("start post_process_case")
start_time = time.time()

dataset = pdfs_to_hf_dataset(case_info.case_dir)
dataset.save_to_disk(case_info.case_dir / settings.HF_DATASET_DIRNAME)
ingest_client.ingest(case_info.unique_name, dataset)
search_client.ingest(case_info.name, dataset, user_id)

case_info.update_status("done")
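
With `unique_name` removed, ingestion and search both receive `case_name` and `user_id` separately, so both sides need to resolve the same per-user location. The sketch below shows one way that could be done, assuming the vector data lives under the same `STORAGE_DIR/user_id/case_name` layout that `get_case` uses for case metadata. `case_table_location` is a hypothetical helper, not part of the commit, and how `SearchClient` actually lays out its data is not shown in this diff.

```python
from pathlib import Path


def case_table_location(storage_dir: str, user_id: str, case_name: str) -> Path:
    """Return a per-user, per-case directory (hypothetical layout)."""
    for part in (user_id, case_name):
        # Reject empty values and anything that could escape storage_dir.
        if not part or part in {".", ".."} or "/" in part or "\\" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return Path(storage_dir) / user_id / case_name
```

Keeping the two identifiers as separate path components avoids the ambiguity of a joined string such as `f"{user_id}_{case_name}"`, where `a_b` + `c` and `a` + `b_c` would collide.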

@@ -247,7 +237,6 @@ def create_new_case(

case_info = CaseInfo(
name=case_name,
unique_name=f"{user_id}_{case_name}",
status="processing",
number_of_PDFs=len(files),
files=file_names,
@@ -256,7 +245,7 @@ def create_new_case(
case_info.save()


background_tasks.add_task(process_case, case_info=case_info)
background_tasks.add_task(process_case, case_info=case_info, user_id=user_id)

end_time = time.time()
logger.info(f"done create_new_case, total time {end_time - start_time}")
@@ -308,12 +297,6 @@ def get_cases(user_id: str):

@app.get("/get_case/{case_name}")
def get_case(user_id: str, case_name: str) -> CaseInfo:
logger.info("start get_case")
start_time = time.time()

"""
Return the metadata of a specific case by its name for a specific user.
"""
case_info_path = os.path.join(settings.STORAGE_DIR, user_id, case_name, settings.CASE_INFO_FILENAME)
if not os.path.exists(case_info_path):
# Check common cases
@@ -323,11 +306,7 @@ def get_case(user_id: str, case_name: str) -> CaseInfo:

with open(case_info_path, "r") as json_file:
case_info = CaseInfo(**json.load(json_file))

end_time = time.time()
logger.info(f"done get_case, total time {end_time - start_time}")

return case_info.dict()
return case_info

@app.delete("/delete_case/{case_name}")
def delete_case(user_id: str, case_name: str):
@@ -344,12 +323,6 @@ def delete_case(user_id: str, case_name: str):
else:
raise HTTPException(status_code=404, detail="Case not found in storage.")

# Delete the case from Qdrant
try:
ingest_client.qdrant_client.delete_collection(case_name)
except Exception as e:
raise HTTPException(status_code=500, detail=f"An error occurred while deleting the case from Qdrant: {str(e)}")

end_time = time.time()
logger.info(f"done delete_case, total time {end_time - start_time}")
