Performance improvements and .Net 9 support #1116
Improved the processing of letters based on their text orientation by preallocating separate lists for each orientation (horizontal, rotate270, rotate180, rotate90, and other). This change reduces multiple calls to `GetWords` and minimizes enumerations and allocations, enhancing performance and readability. Each letter is now added to the appropriate list in a single iteration over the `letters` collection.
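The single-pass partitioning described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `Orientation`, `Glyph`, and `LettersByOrientation` are hypothetical stand-ins defined here so the example is self-contained, and the real library types may differ.

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins for the library's orientation enum and letter type.
enum Orientation { Other, Horizontal, Rotate180, Rotate90, Rotate270 }
record Glyph(string Value, Orientation Orientation);

// One list per orientation, filled in a single pass over the input.
sealed class LettersByOrientation
{
    public List<Glyph> Horizontal { get; } = new();
    public List<Glyph> Rotate270 { get; } = new();
    public List<Glyph> Rotate180 { get; } = new();
    public List<Glyph> Rotate90 { get; } = new();
    public List<Glyph> Other { get; } = new();

    public static LettersByOrientation From(IReadOnlyList<Glyph> letters)
    {
        var result = new LettersByOrientation();

        // Each letter is inspected once and appended to exactly one list,
        // instead of filtering the whole collection once per orientation.
        foreach (var letter in letters)
        {
            switch (letter.Orientation)
            {
                case Orientation.Horizontal: result.Horizontal.Add(letter); break;
                case Orientation.Rotate270: result.Rotate270.Add(letter); break;
                case Orientation.Rotate180: result.Rotate180.Add(letter); break;
                case Orientation.Rotate90: result.Rotate90.Add(letter); break;
                default: result.Other.Add(letter); break;
            }
        }

        return result;
    }
}
```

Compared with running five separate filters over `letters`, this enumerates the input once and avoids the intermediate sequences each filter would create.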
Expanded compatibility in `UglyToad.PdfPig.csproj` by adding `net9.0` to the list of target frameworks, alongside existing versions.
Updated project files for UglyToad.PdfPig to target .NET 9.0, enhancing compatibility with the latest framework features. Refactored `GetBlocks` in `DocstrumBoundingBoxes.cs` for improved input handling and performance. Significantly optimized `NearestNeighbourWordExtractor.cs` by replacing multiple lists with an array of buckets and implementing parallel processing for better efficiency. Consistent updates across `Fonts`, `Tests`, `Tokenization`, and `Tokens` project files to include .NET 9.0 support.
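As a rough sketch of the "array of buckets plus parallel processing" idea, the example below processes each bucket independently and then merges the results. The description does not say which parallel API the PR uses, so `Parallel.For` is an assumption here, and `BucketProcessing`, `BuildWord`, and the `char` buckets are hypothetical placeholders for the real per-orientation word extraction.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

static class BucketProcessing
{
    // Hypothetical per-bucket work: joins the characters of one bucket into a string.
    private static string BuildWord(List<char> bucket) => new string(bucket.ToArray());

    public static List<string> ProcessBuckets(List<char>[] buckets)
    {
        // One result slot per bucket, so parallel iterations never write to the same index.
        var perBucket = new string[buckets.Length];

        Parallel.For(0, buckets.Length, i =>
        {
            perBucket[i] = BuildWord(buckets[i]);
        });

        // Merge sequentially, skipping buckets that produced nothing.
        var results = new List<string>(perBucket.Length);
        foreach (var word in perBucket)
        {
            if (word.Length > 0)
            {
                results.Add(word);
            }
        }

        // Release any unused capacity once the final size is known.
        results.TrimExcess();
        return results;
    }
}
```

Writing each bucket's output to its own array slot keeps the parallel section free of locks; whether parallelism pays off for only a handful of buckets depends on how much work each bucket carries, which is why a benchmark is worth running.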
  public IReadOnlyList<TextBlock> GetBlocks(IEnumerable<Word> words)
  {
-     if (words?.Any() != true)
+     if (words == null)
Use `words is null` instead.
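For an interface-typed parameter such as `IEnumerable<Word>`, `== null` and `is null` both perform a reference check, so the suggestion above is mostly stylistic. The two can differ when a type overloads `==`, because `is null` never calls a user-defined operator. A contrived sketch, where `AlwaysEqual` and `NullCheckDemo` are hypothetical types written only for this illustration:

```csharp
sealed class AlwaysEqual
{
    // Deliberately misleading overloads: every comparison reports equality.
    public static bool operator ==(AlwaysEqual a, AlwaysEqual b) => true;
    public static bool operator !=(AlwaysEqual a, AlwaysEqual b) => false;
    public override bool Equals(object obj) => true;
    public override int GetHashCode() => 0;
}

static class NullCheckDemo
{
    // Returns the outcome of both styles of null check for the same value.
    public static (bool ViaOperator, bool ViaPattern) Check(AlwaysEqual value)
        => (value == null, value is null);
}
```

For a non-null instance, `Check` reports `ViaOperator` as true (the overload is invoked) and `ViaPattern` as false, which is why `is null` is often preferred as the unambiguous form.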
-     return GetBlocks(words.ToList(),
+     // Avoid multiple enumeration and unnecessary ToList() if already a list
+     var wordList = words as IReadOnlyList<Word> ?? words.ToList();
Use `ToArray()` instead (this was my mistake from when I originally wrote the code).
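The pattern under review, reusing the input when it is already an indexable collection and otherwise materializing it exactly once, can be sketched generically as below. `Materialize` and `AsReadOnlyList` are hypothetical names for this illustration and are not part of the library.

```csharp
using System.Collections.Generic;
using System.Linq;

static class Materialize
{
    // Returns the source itself when it already implements IReadOnlyList<T>
    // (List<T> and T[] both do); otherwise enumerates it once into an array.
    public static IReadOnlyList<T> AsReadOnlyList<T>(IEnumerable<T> source)
        => source as IReadOnlyList<T> ?? source.ToArray();
}
```

Passing a `List<T>` or an array returns the same instance with no copy, while a lazy LINQ query is enumerated a single time. An array produced by `ToArray()` is sized exactly to its contents, whereas a `List<T>` from `ToList()` may hold spare capacity.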
src/UglyToad.PdfPig.DocumentLayoutAnalysis/WordExtractor/NearestNeighbourWordExtractor.cs
Thanks @chuckbeasley for the PR, very interesting work! Do you mind providing a benchmark to assess the improvement?
I don't have benchmarks. I am using this library at work and ran it through the VS 2022 profiler. My changes targeted a hot path, and the reduced times are based on the profiler only.
I do have an interesting problem with certain PDFs. When I process the file I'm testing with, it uses 2.5 GB of memory. I think it's related to fonts, but I haven't traced the source of the issue. It runs fine on my PC. However, when it runs on Azure App Service, it causes an out-of-memory error in the PageContentParser during tokenization.
@chuckbeasley No problem regarding the benchmarks, I'll run some myself. Regarding your issue, the best option might be to open an issue, especially if you can share the PDF document. Side note: I've requested some very minor changes. I believe it's good to go once updated.
Yes, I saw and will be making those changes later this afternoon.
- Updated null check for `words` in `DocstrumBoundingBoxes.cs` for better readability and performance.
- Changed from `ToList()` to `ToArray()` to avoid unnecessary enumeration.
- Added `results.TrimExcess()` in `NearestNeighbourWordExtractor.cs` to optimize memory usage.
Let me know if I need to make any changes.
Refactor letter handling by orientation for efficiency
Improved the processing of letters based on their text orientation by preallocating separate lists for each orientation (horizontal, rotate270, rotate180, rotate90, and other). This change reduces multiple calls to `GetWords` and minimizes enumerations and allocations, enhancing performance and readability. Each letter is now added to the appropriate list in a single iteration over the `letters` collection.
Update target frameworks to include net9.0
Expanded compatibility in `UglyToad.PdfPig.csproj` by adding `net9.0` to the list of target frameworks, alongside existing versions.
Add .NET 9.0 support and refactor key components
Updated project files for UglyToad.PdfPig to target .NET 9.0, enhancing compatibility with the latest framework features.
Refactored `GetBlocks` in `DocstrumBoundingBoxes.cs` for improved input handling and performance.
Significantly optimized `NearestNeighbourWordExtractor.cs` by replacing multiple lists with an array of buckets and implementing parallel processing for better efficiency.
Consistent updates across `Fonts`, `Tests`, `Tokenization`, and `Tokens` project files to include .NET 9.0 support.