
Conversation

@chuckbeasley
Contributor

Refactor letter handling by orientation for efficiency

Improved the processing of letters based on their text orientation by preallocating separate lists for each orientation (horizontal, rotate270, rotate180, rotate90, and other). This change reduces multiple calls to `GetWords` and minimizes enumerations and allocations, enhancing performance and readability. Each letter is now added to the appropriate list in a single iteration over the `letters` collection.
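The single-pass bucketing described above can be sketched roughly as follows. This is a simplified illustration, not PdfPig's actual code: orientations are modelled as plain ints rather than the library's `TextOrientation` type, and letters as tuples.

```csharp
using System.Collections.Generic;

// Simplified sketch (not PdfPig code): orientations are modelled as ints,
// 0 = horizontal, 1 = rotate90, 2 = rotate180, 3 = rotate270, 4 = other.
List<(char Glyph, int Orientation)>[] PartitionByOrientation(
    IEnumerable<(char Glyph, int Orientation)> letters)
{
    const int orientationCount = 5;

    // Preallocate one list per orientation up front.
    var buckets = new List<(char Glyph, int Orientation)>[orientationCount];
    for (int i = 0; i < orientationCount; i++)
    {
        buckets[i] = new List<(char Glyph, int Orientation)>();
    }

    // Single pass: each letter lands in exactly one bucket, so word
    // extraction can later run once per bucket instead of re-enumerating
    // the whole collection once per orientation.
    foreach (var letter in letters)
    {
        buckets[letter.Orientation].Add(letter);
    }

    return buckets;
}
```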

Update target frameworks to include net9.0

Expanded compatibility in `UglyToad.PdfPig.csproj` by adding `net9.0` to the list of target frameworks, alongside existing versions.

Add .NET 9.0 support and refactor key components

Updated project files for UglyToad.PdfPig to target .NET 9.0, enhancing compatibility with the latest framework features.

Refactored GetBlocks in DocstrumBoundingBoxes.cs for improved input handling and performance.

Significantly optimized NearestNeighbourWordExtractor.cs by replacing multiple lists with an array of buckets and implementing parallel processing for better efficiency.

Consistent updates across Fonts, Tests, Tokenization, and Tokens project files to include .NET 9.0 support.
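The "array of buckets plus parallel processing" shape described for `NearestNeighbourWordExtractor.cs` can be sketched like this. It is a toy example (summing ints), not the extractor's real word-grouping logic; the point is that independent buckets can be processed concurrently without locking.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Toy sketch (not the extractor's real logic): an array of independent
// buckets replaces several separate lists, and Parallel.For processes
// the buckets concurrently since each index is written exactly once.
int[] SumBucketsInParallel(List<int>[] buckets)
{
    var totals = new int[buckets.Length];
    Parallel.For(0, buckets.Length, i => totals[i] = buckets[i].Sum());
    return totals;
}
```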

Chuck Beasley added 3 commits on July 31, 2025:

- Refactor letter handling by orientation for efficiency
- Update target frameworks to include net9.0
- Add .NET 9.0 support and refactor key components
In `DocstrumBoundingBoxes.cs`:

```diff
 public IReadOnlyList<TextBlock> GetBlocks(IEnumerable<Word> words)
 {
-    if (words?.Any() != true)
+    if (words == null)
```
Collaborator
Use `words is null` instead.
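For context, this suggestion reflects standard C# guidance: `x is null` is a constant pattern match that always compiles to a plain reference null check, whereas `x == null` can be intercepted by a user-defined `==` operator. A minimal illustration:

```csharp
// `is null` performs a direct reference check and cannot be redirected
// by an overloaded == operator on the operand's type.
bool IsMissing(object value) => value is null;
```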


```diff
-    return GetBlocks(words.ToList(),
+    // Avoid multiple enumeration and unnecessary ToList() if already a list
+    var wordList = words as IReadOnlyList<Word> ?? words.ToList();
```
Collaborator

Use `ToArray()` instead (was my mistake from when I originally wrote the code).
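The reuse-or-materialize pattern, combined with the reviewer's `ToArray()` suggestion, might look like the sketch below. `Materialize` is a hypothetical helper name for illustration, not part of PdfPig's API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not a PdfPig API): reuse an existing list, otherwise
// materialize the sequence exactly once. Per the review, ToArray() yields a
// compact buffer that is slightly cheaper than List<T> when the result is
// only read afterwards.
IReadOnlyList<T> Materialize<T>(IEnumerable<T> source)
{
    return source as IReadOnlyList<T> ?? source.ToArray();
}
```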

@BobLd
Collaborator

BobLd commented Aug 1, 2025

thanks @chuckbeasley for the PR, very interesting work! Do you mind providing a benchmark to assess the improvement?

@chuckbeasley
Contributor Author

I don't have benchmarks. I'm using this library at work and ran it through the VS 2022 profiler. My changes targeted a hot path, and the reduction in times is based on the profiler only.

@chuckbeasley
Contributor Author

I do have an interesting problem with certain PDFs. When I process the file I'm testing with, it uses 2.5 GB of memory. I think it's related to fonts, but I haven't traced the source of the issue. It runs fine on my PC; however, when it runs on Azure App Service, it causes an out-of-memory error in the PageContentParser during tokenization.

@BobLd
Collaborator

BobLd commented Aug 1, 2025

@chuckbeasley No problem for the benchmarks, I'll run some myself.

Regarding your issue, the best option might be to open an issue, especially if you can share the PDF document.

Side note: I've requested some very minor changes. I believe it's good to go once updated.

@chuckbeasley
Contributor Author

Yes, I saw and will be making those changes later this afternoon.

- Updated null check for `words` in `DocstrumBoundingBoxes.cs` for better readability and performance.
- Changed from `ToList()` to `ToArray()` to avoid unnecessary enumeration.
- Added `results.TrimExcess()` in `NearestNeighbourWordExtractor.cs` to optimize memory usage.
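A small illustration of what `TrimExcess()` does; this is documented `List<T>` behavior, not PdfPig-specific code. Once a result list is fully built, the unused tail of its backing array is released, which matters when many over-allocated lists are kept alive.

```csharp
using System;
using System.Collections.Generic;

// Illustration of List<T>.TrimExcess(): after the list is fully built,
// the unused tail of the backing array is released (capacity drops to
// Count when the list is under ~90% full).
var results = new List<int>(capacity: 1024);
for (int i = 0; i < 10; i++)
{
    results.Add(i);
}

results.TrimExcess();
Console.WriteLine($"Count={results.Count}, Capacity={results.Capacity}");
```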
@chuckbeasley
Contributor Author

Let me know if I need to make any changes.
