
Conversation

@chuckbeasley
Contributor

Refactor letter handling by orientation for efficiency

Improved the processing of letters based on their text orientation by preallocating separate lists for each orientation (horizontal, rotate270, rotate180, rotate90, and other). This change reduces multiple calls to `GetWords` and minimizes enumerations and allocations, enhancing performance and readability. Each letter is now added to the appropriate list in a single iteration over the `letters` collection.
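The single-pass bucketing described above can be sketched roughly as follows. This is a simplified illustration, not PdfPig's actual code: orientations are modelled as plain ints rather than the library's `TextOrientation` type, and letters as tuples.

```csharp
using System.Collections.Generic;

// Simplified sketch (not PdfPig code): orientations are modelled as ints,
// 0 = horizontal, 1 = rotate90, 2 = rotate180, 3 = rotate270, 4 = other.
List<(char Glyph, int Orientation)>[] PartitionByOrientation(
    IEnumerable<(char Glyph, int Orientation)> letters)
{
    const int orientationCount = 5;

    // Preallocate one list per orientation up front.
    var buckets = new List<(char Glyph, int Orientation)>[orientationCount];
    for (int i = 0; i < orientationCount; i++)
    {
        buckets[i] = new List<(char Glyph, int Orientation)>();
    }

    // Single pass: each letter lands in exactly one bucket, so word
    // extraction can later run once per bucket instead of re-enumerating
    // the whole collection once per orientation.
    foreach (var letter in letters)
    {
        buckets[letter.Orientation].Add(letter);
    }

    return buckets;
}
```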

Update target frameworks to include net9.0

Expanded compatibility in `UglyToad.PdfPig.csproj` by adding `net9.0` to the list of target frameworks, alongside existing versions.

Add .NET 9.0 support and refactor key components

Updated project files for UglyToad.PdfPig to target .NET 9.0, enhancing compatibility with the latest framework features.

Refactored GetBlocks in DocstrumBoundingBoxes.cs for improved input handling and performance.

Significantly optimized NearestNeighbourWordExtractor.cs by replacing multiple lists with an array of buckets and implementing parallel processing for better efficiency.

Consistent updates across Fonts, Tests, Tokenization, and Tokens project files to include .NET 9.0 support.
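The "array of buckets plus parallel processing" shape described for `NearestNeighbourWordExtractor.cs` can be sketched like this. It is a toy example (summing ints), not the extractor's real word-grouping logic; the point is that independent buckets can be processed concurrently without locking.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Toy sketch (not the extractor's real logic): an array of independent
// buckets replaces several separate lists, and Parallel.For processes
// the buckets concurrently since each index is written exactly once.
int[] SumBucketsInParallel(List<int>[] buckets)
{
    var totals = new int[buckets.Length];
    Parallel.For(0, buckets.Length, i => totals[i] = buckets[i].Sum());
    return totals;
}
```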

Chuck Beasley added 3 commits on July 31, 2025:

- Refactor letter handling by orientation for efficiency
- Update target frameworks to include net9.0
- Add .NET 9.0 support and refactor key components
In `DocstrumBoundingBoxes.cs`:

```diff
 public IReadOnlyList<TextBlock> GetBlocks(IEnumerable<Word> words)
 {
-    if (words?.Any() != true)
+    if (words == null)
```
Collaborator
Use `words is null` instead.
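For context, this suggestion reflects standard C# guidance: `x is null` is a constant pattern match that always compiles to a plain reference null check, whereas `x == null` can be intercepted by a user-defined `==` operator. A minimal illustration:

```csharp
// `is null` performs a direct reference check and cannot be redirected
// by an overloaded == operator on the operand's type.
bool IsMissing(object value) => value is null;
```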


```diff
-    return GetBlocks(words.ToList(),
+    // Avoid multiple enumeration and unnecessary ToList() if already a list
+    var wordList = words as IReadOnlyList<Word> ?? words.ToList();
```
Collaborator

Use `ToArray()` instead (was my mistake from when I originally wrote the code).
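The reuse-or-materialize pattern, combined with the reviewer's `ToArray()` suggestion, might look like the sketch below. `Materialize` is a hypothetical helper name for illustration, not part of PdfPig's API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not a PdfPig API): reuse an existing list, otherwise
// materialize the sequence exactly once. Per the review, ToArray() yields a
// compact buffer that is slightly cheaper than List<T> when the result is
// only read afterwards.
IReadOnlyList<T> Materialize<T>(IEnumerable<T> source)
{
    return source as IReadOnlyList<T> ?? source.ToArray();
}
```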

@BobLd
Collaborator

BobLd commented Aug 1, 2025

thanks @chuckbeasley for the PR, very interesting work! Do you mind providing a benchmark to assess the improvement?

@chuckbeasley
Contributor Author

I don't have benchmarks. I'm using this library at work and ran it through the VS 2022 profiler. My changes targeted a hot path, and the reduction in times is based on the profiler only.

@chuckbeasley
Contributor Author

I do have an interesting problem with certain PDFs. When I process the file I'm testing with, it uses 2.5 GB of memory. I think it's related to fonts, but I haven't traced the source of the issue. It runs fine on my PC; however, when it runs on Azure App Service, it causes an out-of-memory error in the PageContentParser during tokenization.

@BobLd
Collaborator

BobLd commented Aug 1, 2025

@chuckbeasley No problem for the benchmarks, I'll run some myself.

Regarding your issue, the best option might be to open an issue, especially if you can share the PDF document.

Side note: I've requested some very minor changes. I believe it's good to go once updated.

@chuckbeasley
Contributor Author

Yes, I saw and will be making those changes later this afternoon.

- Updated null check for `words` in `DocstrumBoundingBoxes.cs` for better readability and performance.
- Changed from `ToList()` to `ToArray()` to avoid unnecessary enumeration.
- Added `results.TrimExcess()` in `NearestNeighbourWordExtractor.cs` to optimize memory usage.
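A small illustration of what `TrimExcess()` does; this is documented `List<T>` behavior, not PdfPig-specific code. Once a result list is fully built, the unused tail of its backing array is released, which matters when many over-allocated lists are kept alive.

```csharp
using System;
using System.Collections.Generic;

// Illustration of List<T>.TrimExcess(): after the list is fully built,
// the unused tail of the backing array is released (capacity drops to
// Count when the list is under ~90% full).
var results = new List<int>(capacity: 1024);
for (int i = 0; i < 10; i++)
{
    results.Add(i);
}

results.TrimExcess();
Console.WriteLine($"Count={results.Count}, Capacity={results.Capacity}");
```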
@chuckbeasley
Contributor Author

Let me know if I need to make any changes.
