feat: handling larger more complex PDF docs (and fix) #1663
Pull Request Overview
This PR improves PDF processing by integrating the extractous library to better handle text extraction (including from URLs) and large PDF files, and it also refines the CI pipeline with more aggressive disk cleanup steps.
- Replace low-level PDF text extraction with extractous for both file and URL-based PDFs
- Add tests for URL text extraction and image extraction error handling, and support for large PDF files
- Enhance CI workflows with additional cleanup steps and dependency updates
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| crates/goose-mcp/src/computercontroller/pdf_tool.rs | Introduces extractous library for enhanced PDF text extraction and improved handling of large texts |
| .github/workflows/ci.yml | Adds aggressive disk cleanup and artifact removal steps to the CI process |
| crates/goose-mcp/Cargo.toml | Adds extractous dependency for improved PDF processing |
| crates/goose-mcp/src/computercontroller/mod.rs | Updates PDF tool description to clarify URL support |
Comments suppressed due to low confidence (1)
.github/workflows/ci.yml:83
- The removal command for '/opt/hostedtoolcache' is duplicated (also appearing on line 85). Consider removing the duplicate to simplify the CI script.
sudo rm -rf /opt/hostedtoolcache
baxen
left a comment
Looks good! But I'd like to test this out; @wendytang is on it.
rg 'search term' {}\n\n\
Or view portions of it:\n\
head -n 50 {}\n\
tail -n 50 {}\n\
This isn't output to the user, correct? Mainly an FYI for goose, IIUC.
Yes, these are instructions for goose to use in case of large files.
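For readers following the thread, here is a rough sketch of the large-file behaviour being discussed: small text is returned inline, while large text is written to a file and replaced by `rg`/`head`/`tail` hints. The function name `text_or_file_hint`, the `max_inline_chars` parameter, and the message wording beyond the quoted lines are assumptions for illustration, not the code in `pdf_tool.rs`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not the PR's actual function): returns the text
/// unchanged when it is short enough; otherwise writes it to `out_path`
/// and returns instructions for searching or viewing the saved file.
fn text_or_file_hint(
    text: &str,
    max_inline_chars: usize, // assumed threshold; the PR's real limit is not shown
    out_path: &Path,
) -> io::Result<String> {
    let n = text.chars().count();
    if n <= max_inline_chars {
        return Ok(text.to_string());
    }

    // Persist the full text so it can be inspected piecemeal later.
    fs::write(out_path, text)?;

    Ok(format!(
        "Extracted text is large ({n} characters) and was saved to: {p}\n\n\
         Search within it:\n\
         rg 'search term' {p}\n\n\
         Or view portions of it:\n\
         head -n 50 {p}\n\
         tail -n 50 {p}\n",
        n = n,
        p = out_path.display()
    ))
}
```

Calling `text_or_file_hint(&extracted, 10_000, Path::new("/tmp/doc.txt"))` would return the text itself for short documents, or the hint message once the text exceeds the chosen limit; both the limit and the path here are placeholder values.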
Was able to process a large pdf that the previous version wasn't 👍
This reverts commit 4c31832.
* main: (32 commits) ui: load builtins (#1679) chore(release): release version 1.0.14 (#1676) Revert "feat: handling larger more complex PDF docs (and fix) (#1663)" (#1675) fix: uvshim default to existing uv configuration (#1670) fix: handle interruptions during tool responses (#1651) feat: Copy error message button in toast (#1658) feat: handling larger more complex PDF docs (and fix) (#1663) Add Filesystem Tutorial (#1666) docs: figma blog post (#1647) docs: updating goose modes doc (#1665) docs: Add running tasks guide (#1626) docs: Add experimental features (#1644) feat(cli): add better error message, support stdin via -i - or just no args (#1660) feat: extensions read config (#1637) fix: trigger words for memory (#1654) fix: cleanup keyboard shortcut indication (#1642) Extensions load in background and show pending state (#1657) Extension error toast stays until dismissed, and error message cleanup (#1653) fix: remove other category in settings (#1641) fix: restore image outputs from tool calls (#1640) ...
* origin/main: (29 commits) ui: reorganize extensions settings (#1702) feat: google_drive write tools and read comment tool (#1650) fix: developer builtin name (#1699) chore: update extensions section to work with new endpoints (#1696) chore: move things around (#1662) ui: extensions state updates (#1674) docs: goose ollama blog, updated (#1691) ui: load builtins (#1679) chore(release): release version 1.0.14 (#1676) Revert "feat: handling larger more complex PDF docs (and fix) (#1663)" (#1675) fix: uvshim default to existing uv configuration (#1670) fix: handle interruptions during tool responses (#1651) feat: Copy error message button in toast (#1658) feat: handling larger more complex PDF docs (and fix) (#1663) Add Filesystem Tutorial (#1666) docs: figma blog post (#1647) docs: updating goose modes doc (#1665) docs: Add running tasks guide (#1626) docs: Add experimental features (#1644) feat(cli): add better error message, support stdin via -i - or just no args (#1660) ...
* main: (31 commits) feat: add default metrics for core evals (#1602) feat(google_drive): use oauth2 crate for PKCE support, make token storage generic over Serializable (#1645) ui: reorganize extensions settings (#1702) feat: google_drive write tools and read comment tool (#1650) fix: developer builtin name (#1699) chore: update extensions section to work with new endpoints (#1696) chore: move things around (#1662) ui: extensions state updates (#1674) docs: goose ollama blog, updated (#1691) ui: load builtins (#1679) chore(release): release version 1.0.14 (#1676) Revert "feat: handling larger more complex PDF docs (and fix) (#1663)" (#1675) fix: uvshim default to existing uv configuration (#1670) fix: handle interruptions during tool responses (#1651) feat: Copy error message button in toast (#1658) feat: handling larger more complex PDF docs (and fix) (#1663) Add Filesystem Tutorial (#1666) docs: figma blog post (#1647) docs: updating goose modes doc (#1665) docs: Add running tasks guide (#1626) ...
Uses a new library, https://crates.io/crates/extractous, which seems to perform vastly better (low-level objects are still used to extract images when needed).
Enhancements:
Fixes: #1664