@michaelneale michaelneale commented Mar 13, 2025

Using a new library, https://crates.io/crates/extractous, which seems to perform vastly better (still using low-level objects to extract images when needed).

Enhancements:

  • can load large PDFs
  • can load PDFs from URLs/websites
  • can load many other doc formats (office docs, email, ebooks, presentations)
  • can load web content (as text)
  • ADVANCED: if tesseract is installed, it can OCR text out of image files and PDFs

Fixes: #1664

@michaelneale michaelneale marked this pull request as ready for review March 13, 2025 05:16
@michaelneale michaelneale requested review from baxen and Copilot March 13, 2025 05:48
Copilot AI left a comment

Pull Request Overview

This PR improves PDF processing by integrating the extractous library to better handle text extraction (including from URLs) and large PDF files, and it also refines the CI pipeline with more aggressive disk cleanup steps.

  • Replace low-level PDF text extraction with extractous for both file and URL-based PDFs
  • Add tests for URL text extraction and image extraction error handling, and support for large PDF files
  • Enhance CI workflows with additional cleanup steps and dependency updates

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| crates/goose-mcp/src/computercontroller/pdf_tool.rs | Introduces extractous library for enhanced PDF text extraction and improved handling of large texts |
| .github/workflows/ci.yml | Adds aggressive disk cleanup and artifact removal steps to the CI process |
| crates/goose-mcp/Cargo.toml | Adds extractous dependency for improved PDF processing |
| crates/goose-mcp/src/computercontroller/mod.rs | Updates PDF tool description to clarify URL support |

Comments suppressed due to low confidence (1)

.github/workflows/ci.yml:83

  • The removal command for '/opt/hostedtoolcache' is duplicated (also appearing on line 85). Consider removing the duplicate to simplify the CI script.
sudo rm -rf /opt/hostedtoolcache

@michaelneale michaelneale changed the title from "handling larger more complex PDF docs" to "feat: handling larger more complex PDF docs (and fix)" Mar 13, 2025
@baxen baxen left a comment

Looks good! But I'd like to test this out; @wendytang is on it.

@wendytang wendytang self-requested a review March 13, 2025 18:54
rg 'search term' {}\n\n\
Or view portions of it:\n\
head -n 50 {}\n\
tail -n 50 {}\n\

This isn't output to the user, correct? Mainly an FYI for goose, IIUC.

Collaborator Author

Yes, these are instructions for goose to use in case of large files.
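
For readers skimming the thread, here is a minimal sketch of the pattern being discussed: when extracted text is too large to return inline, write it to a file and hand goose a message telling it how to search or page through that file. This is not the code from `pdf_tool.rs`. The function name `inline_or_spill`, the `MAX_INLINE_CHARS` threshold, and the temp-file naming are all assumptions made for illustration; only the `rg` / `head` / `tail` hints mirror the snippet under review.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical threshold; the real tool may use a different limit.
const MAX_INLINE_CHARS: usize = 10_000;

/// Returns the text unchanged when it is small enough to send inline.
/// Otherwise writes it to `<temp dir>/<file_stem>.txt` and returns a short
/// message (meant for the agent, not the end user) describing how to
/// search or page through the saved file.
fn inline_or_spill(text: &str, file_stem: &str) -> io::Result<String> {
    let char_count = text.chars().count();
    if char_count <= MAX_INLINE_CHARS {
        return Ok(text.to_string());
    }

    // Assumed location: the OS temp directory.
    let mut path: PathBuf = env::temp_dir();
    path.push(format!("{file_stem}.txt"));
    fs::write(&path, text)?;

    let p = path.display();
    Ok(format!(
        "Extracted text is large ({char_count} chars) and was saved to: {p}\n\n\
         Search it with:\n\
         rg 'search term' {p}\n\n\
         Or view portions of it:\n\
         head -n 50 {p}\n\
         tail -n 50 {p}\n"
    ))
}
```

Calling `inline_or_spill(&extracted, "report")` would return either the text itself or the instruction message. Keeping the hints in the tool result rather than in user-facing output matches the exchange above: they are guidance for goose.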

@wendytang wendytang left a comment

Was able to process a large pdf that the previous version wasn't 👍

@wendytang wendytang merged commit 4c31832 into main Mar 13, 2025
6 checks passed
@wendytang wendytang deleted the micn/fix-large-pdf branch March 13, 2025 19:37
wendytang pushed a commit that referenced this pull request Mar 13, 2025
michaelneale added a commit that referenced this pull request Mar 14, 2025
* main: (32 commits)
  ui: load builtins (#1679)
  chore(release): release version 1.0.14 (#1676)
  Revert "feat: handling larger more complex PDF docs (and fix) (#1663)" (#1675)
  fix: uvshim default to existing uv configuration (#1670)
  fix: handle interruptions during tool responses (#1651)
  feat: Copy error message button in toast (#1658)
  feat: handling larger more complex PDF docs (and fix) (#1663)
  Add Filesystem Tutorial (#1666)
  docs: figma blog post (#1647)
  docs: updating goose modes doc (#1665)
  docs: Add running tasks guide (#1626)
  docs: Add experimental features (#1644)
  feat(cli): add better error message, support stdin via -i - or just no args (#1660)
  feat: extensions read config (#1637)
  fix: trigger words for memory (#1654)
  fix: cleanup keyboard shortcut indication (#1642)
  Extensions load in background and show pending state (#1657)
  Extension error toast stays until dismissed, and error message cleanup (#1653)
  fix: remove other category in settings (#1641)
  fix: restore image outputs from tool calls (#1640)
  ...
kalvinnchau added a commit that referenced this pull request Mar 14, 2025
* origin/main: (29 commits)
  ui: reorganize extensions settings (#1702)
  feat: google_drive write tools and read comment tool (#1650)
  fix: developer builtin name (#1699)
  chore: update extensions section to work with new endpoints (#1696)
  chore: move things around (#1662)
  ui: extensions state updates (#1674)
  docs: goose ollama blog, updated (#1691)
  ui: load builtins (#1679)
  chore(release): release version 1.0.14 (#1676)
  Revert "feat: handling larger more complex PDF docs (and fix) (#1663)" (#1675)
  fix: uvshim default to existing uv configuration (#1670)
  fix: handle interruptions during tool responses (#1651)
  feat: Copy error message button in toast (#1658)
  feat: handling larger more complex PDF docs (and fix) (#1663)
  Add Filesystem Tutorial (#1666)
  docs: figma blog post (#1647)
  docs: updating goose modes doc (#1665)
  docs: Add running tasks guide (#1626)
  docs: Add experimental features (#1644)
  feat(cli): add better error message, support stdin via -i - or just no args (#1660)
  ...
laanak08 added a commit that referenced this pull request Mar 16, 2025
* main: (31 commits)
  feat: add default metrics for core evals (#1602)
  feat(google_drive): use oauth2 crate for PKCE support, make token storage generic over Serializable (#1645)
  ui: reorganize extensions settings (#1702)
  feat: google_drive write tools and read comment tool (#1650)
  fix: developer builtin name (#1699)
  chore: update extensions section to work with new endpoints (#1696)
  chore: move things around (#1662)
  ui: extensions state updates (#1674)
  docs: goose ollama blog, updated (#1691)
  ui: load builtins (#1679)
  chore(release): release version 1.0.14 (#1676)
  Revert "feat: handling larger more complex PDF docs (and fix) (#1663)" (#1675)
  fix: uvshim default to existing uv configuration (#1670)
  fix: handle interruptions during tool responses (#1651)
  feat: Copy error message button in toast (#1658)
  feat: handling larger more complex PDF docs (and fix) (#1663)
  Add Filesystem Tutorial (#1666)
  docs: figma blog post (#1647)
  docs: updating goose modes doc (#1665)
  docs: Add running tasks guide (#1626)
  ...
cbruyndoncx pushed a commit to cbruyndoncx/goose that referenced this pull request Jul 20, 2025
cbruyndoncx pushed a commit to cbruyndoncx/goose that referenced this pull request Jul 20, 2025
Development

Successfully merging this pull request may close these issues.

Handle large PDFs and more complex document types (computer controller)

4 participants