Lazy-load SharedStringsTable with streaming parser, dirty tracking, and allocation reduction #1728
Merged
…ptimizations Co-authored-by: tonyqus <772561+tonyqus@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Optimize XSSFWorkbook loading performance with lazy-loaded SST" to "Lazy-load SharedStringsTable with streaming parser, dirty tracking, and allocation reduction" on Mar 12, 2026
Member
BenchmarkDotNet v0.13.12, Windows 11 (10.0.26200.8037)

Benchmark Summary based on NPOI 2.7.6
Job=ShortRun IterationCount=3 LaunchCount=1

Benchmark Summary based on NPOI master branch
Job=ShortRun IterationCount=3 LaunchCount=1

XSSFWorkbookLargeSstOpenDispose: 11.15G -> 10.35G (7% decrease)
`SharedStringsTable` eagerly parsed the full `sharedStrings.xml` DOM on every workbook open, even when no string cells were ever accessed. Opening and then writing an untouched workbook unnecessarily forced a full SST parse and re-serialisation.

Changes

Lazy loading + dirty tracking
- `SharedStringsTable(PackagePart)` no longer calls `ReadFrom()`. Parsing is deferred via `EnsureLoaded()`, called only by `GetEntryAt`, `Items`, `Count`, `UniqueCount`, and `AddEntry`.
- `AddEntry()` sets `_dirty = true`. Reads never do.
- `Commit()`/`PrepareForCommit()` are no-ops when `_dirty == false`: the original `sharedStrings.xml` bytes pass through the ZIP unchanged without ever being parsed.

Streaming parser (full fidelity, replaces DOM)
- Replaces `ConvertStreamToXml()` + `SstDocument.Parse()` with an `XmlReader` state-machine parser.
- Handles `CT_Rst` including `<t>`, `<r>`/`<rPr>` (all sub-elements), `<rPh>`, and `<phoneticPr>`.
- `DtdProcessing.Prohibit` + `XmlResolver = null` applied in both the constructor security scan and the streaming parser (preserves XML-bomb rejection).

Allocation reduction
- `ArrayPool<char>.Shared` + `reader.ReadValueChunk()` for text node buffering: one allocation per `<t>` value.
- `stmap` (dedup dictionary) is not populated during parse; it is built lazily from `strings` on first `AddEntry()` via `EnsureStmapBuilt()`.

Benchmark results
Using `Test1,000,000x10_SharingStrings.xlsx` (36 MB on disk, `sharedStrings.xml` decompresses to 31 MB, 1,000,000 unique entries):
- `XSSFWorkbookLargeSstOpenDispose`: open + dispose, SST never parsed
- `XSSFWorkbookLargeSstLoadStrings`: open + force SST parse

Any caller that opens without touching string cells (open-then-write, read numeric cells, inspect sheet names) saves the full 239 MB and ~665 ms of SST parse cost.
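The lazy-loading and dirty-tracking behaviour described under "Changes" can be illustrated with a small stand-alone sketch. This is not the code from this PR: `LazyStringTable`, the `Func<Stream>` opener, the `ParseCount`/`WriteCount` counters, and the `Commit(Stream)` signature are hypothetical, and a one-string-per-line format stands in for `sharedStrings.xml`. Deduplication in `AddEntry` is omitted for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for SharedStringsTable; illustrates the pattern only.
public sealed class LazyStringTable
{
    private readonly Func<Stream> _openPart;              // deferred access to the part's bytes
    private readonly List<string> _strings = new List<string>();
    private bool _loaded;
    private bool _dirty;

    // Counters exist only so the sketch can show when work actually happens.
    public int ParseCount { get; private set; }
    public int WriteCount { get; private set; }

    public LazyStringTable(Func<Stream> openPart)
    {
        _openPart = openPart;                              // constructor does no parsing
    }

    private void EnsureLoaded()
    {
        if (_loaded) return;
        _loaded = true;
        ParseCount++;
        using (var reader = new StreamReader(_openPart()))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                _strings.Add(line);                        // assumption: one entry per line
        }
    }

    public int Count
    {
        get { EnsureLoaded(); return _strings.Count; }
    }

    public string GetEntryAt(int index)
    {
        EnsureLoaded();
        return _strings[index];
    }

    public int AddEntry(string value)
    {
        EnsureLoaded();
        _strings.Add(value);
        _dirty = true;                                     // only mutations mark the table dirty
        return _strings.Count - 1;
    }

    public void Commit(Stream output)
    {
        if (!_dirty) return;                               // untouched: leave original bytes alone
        WriteCount++;
        var writer = new StreamWriter(output);
        foreach (var s in _strings) writer.WriteLine(s);
        writer.Flush();
        _dirty = false;
    }
}
```

With this shape, constructing the table and calling `Commit` without any read or `AddEntry` leaves both counters at zero, which is the open-then-write case the benchmark above targets. A read triggers exactly one parse but still leaves `Commit` a no-op.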
Original prompt
Goal
Optimize XSSFWorkbook loading performance by making the OOXML Shared Strings Table (SST) lazy-loaded and avoiding SST parse/serialization unless actually used/modified. Additionally, when SST is loaded, replace DOM-based parsing with a streaming parser that preserves full fidelity (rich text runs and phonetic runs) while reducing allocations using .NET Span / pooling techniques where applicable.
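As a rough illustration of the streaming and pooling idea in the goal, the sketch below reads only the text of `<t>` elements with `XmlReader`, buffering each value through `ArrayPool<char>.Shared` and `ReadValueChunk`. It is a simplified assumption of how such a reader might look, not the parser from this PR: `SstTextReader` and `ReadTexts` are hypothetical names, rich-text and phonetic structure is ignored (every `<t>` is collected, including those inside `<r>` or `<rPh>`), and it assumes a runtime where `System.Buffers` is available.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper; collects the text of every <t> element in document order.
public static class SstTextReader
{
    public static List<string> ReadTexts(Stream xml)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Prohibit,   // any DOCTYPE raises XmlException
            XmlResolver = null                        // never resolve external resources
        };
        var result = new List<string>();
        char[] buffer = ArrayPool<char>.Shared.Rent(1024);
        try
        {
            using (XmlReader reader = XmlReader.Create(xml, settings))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "t")
                        result.Add(reader.IsEmptyElement
                            ? string.Empty
                            : ReadElementText(reader, ref buffer));
                }
            }
        }
        finally
        {
            ArrayPool<char>.Shared.Return(buffer);    // returns the latest (possibly grown) buffer
        }
        return result;
    }

    private static bool HasTextValue(XmlNodeType type)
    {
        return type == XmlNodeType.Text || type == XmlNodeType.CDATA
            || type == XmlNodeType.Whitespace || type == XmlNodeType.SignificantWhitespace;
    }

    // Reads the content of the current element into the pooled buffer,
    // so the only per-value allocation is the final string.
    private static string ReadElementText(XmlReader reader, ref char[] buffer)
    {
        int length = 0;
        while (reader.Read() && reader.NodeType != XmlNodeType.EndElement)
        {
            if (!HasTextValue(reader.NodeType)) continue;
            while (true)
            {
                if (length == buffer.Length)
                {
                    // Grow by renting a larger array and handing the old one back.
                    char[] bigger = ArrayPool<char>.Shared.Rent(buffer.Length * 2);
                    Array.Copy(buffer, bigger, length);
                    ArrayPool<char>.Shared.Return(buffer);
                    buffer = bigger;
                }
                int read = reader.ReadValueChunk(buffer, length, buffer.Length - length);
                if (read == 0) break;
                length += read;
            }
        }
        return new string(buffer, 0, length);
    }
}
```

The buffer is passed by `ref` so that a grown array replaces the caller's reference and is the one returned to the pool. A full-fidelity parser would additionally track element context to rebuild runs and phonetic data, which this sketch leaves out.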
Repository
nissl-lab/npoi 7400815

Context
`SharedStringsTable` currently eagerly parses `sharedStrings.xml` in its `PackagePart` constructor via `ConvertStreamToXml` + `SstDocument.Parse(XmlDocument, ...)`, building a full DOM and object model. `XSSFWorkbook` discovers and assigns the SST part during `OnDocumentRead()`, so SST parsing happens during workbook open even when the caller never accesses shared strings. The user wants:
- On `Write()`, SST must remain unloaded if untouched (no parsing, no re-serialization, keep original bytes).
- Full-fidelity `CT_Rst` structures.

Requirements

A. Lazy load behavior
- `ooxml/XSSF/Model/SharedStringsTable.cs`:
  - `SharedStringsTable(PackagePart part)` MUST NOT parse SST immediately.
  - Store the `PackagePart` (or a stream factory) internally.
  - Add `EnsureLoaded()` that loads/parses SST only on first access to SST content.
  - `GetEntryAt`, `Items`, and any APIs that require SST content must call `EnsureLoaded()`.
  - `Count`/`UniqueCount` should be available without fully parsing if feasible; if not feasible, ensure they do not force load unless accessed.

B. Do not touch SST bytes when unused
- Mutations (`AddEntry`, etc.) must mark SST dirty.
- Update `Commit()` so that:
  - When SST is not dirty, `Commit()` is a no-op (do not open output stream or write), preserving original SST bytes and avoiding forced load.
- `XSSFWorkbook` open + `Write()` with no shared-string access does not cause SST parsing or writing.

C. Replace DOM parsing with streaming parser (full fidelity)
- Change `ReadFrom(Stream)` in `SharedStringsTable` to use `XmlReader` streaming parsing rather than `XmlDocument`.
- Handle `CT_Sst` header attributes `count` and `uniqueCount`.
- Handle `CT_Rst` entries including:
  - `<t>` text
  - `<r>` with `<rPr>` and `<t>`
  - `<rPh>` and `<phoneticPr>`
- `51519.xlsx`.

D. Allocation reduction / Span usage
- Read text via `XmlReader.ReadValueChunk` (or equivalent) into pooled buffers.
- Do not populate `stmap` during read; build lazily only when `AddEntry()` is used.
- Avoid using `CT_Rst.XmlText` as a key during read (it is allocation-heavy).

E. Tests
Add/extend tests under `testcases/`:
- Open a workbook and call `Write()` without accessing any string cells or `GetSharedStringSource()`, and assert SST was not loaded and not committed. Use reflection if needed.
- For a rich-text workbook (e.g. `51519.xlsx`), ensure parsed shared strings match expectations and round-trip via write+read.

F. Benchmarks (optional but desired)
- Extend `benchmarks/NPOI.Benchmarks/LargeExcelFileBenchmark.cs` to demonstrate improved load time/memory with lazy SST (e.g., workbook open+dispose without touching shared strings).

Acceptance criteria
- `XSSFWorkbook` open does not parse SST unless shared-string functionality is accessed.
- `Write()` does not parse or rewrite SST if not used/modified.

Notes
- `master`.

The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Help me optimize the reading performance of Shared string table and make it lazy loading when necessary
Assistant: To optimize Shared Strings Table (SST) reading performance and make it lazy-load when necessary, it helps to separate the two main consumers in NPOI:
- `XSSFWorkbook`) currently uses `XSSF.Model.SharedStringsTable`, which eagerly parses the full XML DOM.

This pull request was created from Copilot chat.