Lazy-load SharedStringsTable with streaming parser, dirty tracking, and allocation reduction #1728

Merged: tonyqus merged 4 commits into master from copilot/optimize-ooxml-sst-loading on Mar 12, 2026

Copilot AI (Contributor) commented Mar 12, 2026

SharedStringsTable eagerly parsed the full sharedStrings.xml DOM on every workbook open, even when no string cells were ever accessed. Opening and then writing an untouched workbook therefore forced a full SST parse and re-serialisation for no benefit.

Changes

Lazy loading + dirty tracking

  • SharedStringsTable(PackagePart) no longer calls ReadFrom(). Parsing is deferred via EnsureLoaded(), called only by GetEntryAt, Items, Count, UniqueCount, and AddEntry.
  • Only AddEntry() sets _dirty = true. Reads never do.
  • Commit() / PrepareForCommit() are no-ops when _dirty == false — the original sharedStrings.xml bytes pass through the ZIP unchanged without ever being parsed.
```csharp
// Before: SST parsed on every open
var wb = new XSSFWorkbook("large.xlsx"); // DOM parse here, always

// After: SST untouched until first access
var wb = new XSSFWorkbook("large.xlsx"); // no SST I/O
wb.Write(outStream);                     // bytes copied as-is; IsLoaded == false
```
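
The lazy-load and dirty-tracking contract described above can be sketched in isolation. This is a minimal illustration, not the code from this PR: `LazyStringTable`, its `Func<List<string>>` loader (standing in for parsing the PackagePart), and the `Commit(Action<...>)` signature are hypothetical. Only the names `EnsureLoaded`, `GetEntryAt`, `AddEntry`, `IsLoaded`, and `_dirty` come from the description, and the sketch skips the de-duplication that the real `AddEntry` performs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for SharedStringsTable; illustrates the pattern only.
public sealed class LazyStringTable
{
    private readonly Func<List<string>> _loader; // assumption: stands in for parsing the part
    private List<string> _strings;               // null until first access
    private bool _dirty;                         // set only by mutations

    public LazyStringTable(Func<List<string>> loader)
    {
        _loader = loader;
    }

    public bool IsLoaded => _strings != null;

    private void EnsureLoaded()
    {
        if (_strings == null)
        {
            _strings = _loader(); // the expensive parse happens at most once
        }
    }

    public string GetEntryAt(int index)
    {
        EnsureLoaded();           // a read loads but never marks dirty
        return _strings[index];
    }

    public int AddEntry(string value)
    {
        EnsureLoaded();
        _strings.Add(value);      // no de-duplication in this sketch
        _dirty = true;
        return _strings.Count - 1;
    }

    // Returns true only when something was actually written.
    public bool Commit(Action<IReadOnlyList<string>> write)
    {
        if (!_dirty)
        {
            return false;         // untouched or read-only: original bytes stay as they are
        }
        write(_strings);
        _dirty = false;
        return true;
    }
}
```

With this shape, an open-then-write sequence that never touches strings calls neither the loader nor the writer, which is the behaviour the lazy test in this PR is meant to assert.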

Streaming parser (full fidelity, replaces DOM)

  • Replaces ConvertStreamToXml() + SstDocument.Parse() with an XmlReader state-machine parser.
  • Correctly populates CT_Rst including <t>, <r>/<rPr> (all sub-elements), <rPh>, and <phoneticPr>.
  • DtdProcessing.Prohibit + XmlResolver = null applied in both the constructor security scan and the streaming parser (preserves XML-bomb rejection).
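
As a rough picture of what an XmlReader state machine over sharedStrings.xml looks like, the sketch below extracts only the visible text of each `<si>` entry. It is a simplified assumption, not the parser added in this PR: the class and method names are hypothetical, it returns plain strings rather than CT_Rst, it ignores `<rPr>` formatting, and it skips the text inside `<rPh>` instead of preserving it. The hardened reader settings are the ones named in the bullet above.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper; the real parser populates CT_Rst with full fidelity.
public static class SstTextSketch
{
    public static List<string> ReadPlainText(Stream input)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Prohibit, // any DOCTYPE raises XmlException
            XmlResolver = null                      // never resolve external resources
        };

        var strings = new List<string>();
        var text = new StringBuilder();
        bool inSi = false, inText = false, inPhonetic = false;

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        // Empty elements such as <t/> produce no EndElement node.
                        bool empty = reader.IsEmptyElement;
                        if (reader.LocalName == "si")
                        {
                            text.Clear();
                            inSi = !empty;
                            if (empty) strings.Add(string.Empty);
                        }
                        else if (reader.LocalName == "rPh")
                        {
                            inPhonetic = inSi && !empty;
                        }
                        else if (reader.LocalName == "t")
                        {
                            inText = inSi && !inPhonetic && !empty;
                        }
                        break;
                    }
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        if (inText) text.Append(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        if (reader.LocalName == "t") inText = false;
                        else if (reader.LocalName == "rPh") inPhonetic = false;
                        else if (reader.LocalName == "si")
                        {
                            strings.Add(text.ToString());
                            inSi = false;
                        }
                        break;
                }
            }
        }
        return strings;
    }
}
```

Tracking state with flags rather than calling `ReadElementContentAsString()` inside the loop avoids skipping a node, since that method already advances the reader past the end tag.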

Allocation reduction

  • ArrayPool<char>.Shared + reader.ReadValueChunk() for text node buffering — one allocation per <t> value.
  • stmap (dedup dictionary) is not populated during parse; built lazily from strings on first AddEntry() via EnsureStmapBuilt().
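
The pooled text buffering mentioned above might look roughly like the following. This is a sketch under assumptions, not the PR's code: `PooledTextReader.ReadText`, the initial buffer size of 256, and the doubling growth policy are all hypothetical. It only shows how `XmlReader.ReadValueChunk` and `ArrayPool<char>.Shared` combine so that the final `string` is the only allocation that escapes.

```csharp
using System;
using System.Buffers;
using System.Xml;

// Hypothetical helper illustrating pooled buffering of one text node.
public static class PooledTextReader
{
    // Expects the reader to be positioned on a text node (for example the
    // content of <t>); ReadValueChunk throws on nodes that have no value.
    public static string ReadText(XmlReader reader)
    {
        char[] buffer = ArrayPool<char>.Shared.Rent(256); // assumed starting size
        int length = 0;
        try
        {
            int read;
            // ReadValueChunk returns 0 once the node's value is exhausted.
            while ((read = reader.ReadValueChunk(buffer, length, buffer.Length - length)) > 0)
            {
                length += read;
                if (length == buffer.Length)
                {
                    // Buffer is full: move to a larger pooled array.
                    char[] bigger = ArrayPool<char>.Shared.Rent(buffer.Length * 2);
                    Array.Copy(buffer, bigger, length);
                    ArrayPool<char>.Shared.Return(buffer);
                    buffer = bigger;
                }
            }
            return new string(buffer, 0, length); // the single lasting allocation
        }
        finally
        {
            ArrayPool<char>.Shared.Return(buffer);
        }
    }
}
```

A caller would advance the reader onto the text node first, for instance with `reader.MoveToContent()` followed by `reader.Read()` on a `<t>hello</t>` fragment, and then pass it to `ReadText`.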

Benchmark results

Using Test1,000,000x10_SharingStrings.xlsx (36 MB on disk, sharedStrings.xml decompresses to 31 MB, 1,000,000 unique entries):

| Method | Mean | Allocated |
| --- | --- | --- |
| XSSFWorkbookLargeSstOpenDispose (open + dispose, SST never parsed) | 35,635 ms | 10,601 MB |
| XSSFWorkbookLargeSstLoadStrings (open + force SST parse) | 36,301 ms | 10,840 MB |
| SST parse cost (delta) | +666 ms | +239 MB |

Any caller that opens a workbook without touching string cells (open-then-write, reading numeric cells, inspecting sheet names) avoids the full SST parse cost measured above: about 666 ms and 239 MB.

Original prompt

Goal

Optimize XSSFWorkbook loading performance by making the OOXML Shared Strings Table (SST) lazy-loaded and avoiding SST parse/serialization unless actually used/modified. Additionally, when SST is loaded, replace DOM-based parsing with a streaming parser that preserves full fidelity (rich text runs and phonetic runs) while reducing allocations using .NET Span / pooling techniques where applicable.

Repository

  • Repo: nissl-lab/npoi
  • Repo ID: 7400815

Context

SharedStringsTable currently eagerly parses sharedStrings.xml in its PackagePart constructor via ConvertStreamToXml + SstDocument.Parse(XmlDocument, ...), building a full DOM and object model. XSSFWorkbook discovers and assigns the SST part during OnDocumentRead(), so SST parsing happens during workbook open even when the caller never accesses shared strings. The user wants:

  1. When opening an existing workbook and then calling Write(), SST must remain unloaded if untouched (no parsing, no re-serialization, keep original bytes).
  2. If/when SST must be loaded, it must parse rich text runs and phonetic runs correctly into CT_Rst structures.
  3. Use .NET Span/pooling techniques to avoid unnecessary allocations during parsing.

Requirements

A. Lazy load behavior

  • Modify ooxml/XSSF/Model/SharedStringsTable.cs:
    • SharedStringsTable(PackagePart part) MUST NOT parse SST immediately.
    • Store the PackagePart (or a stream factory) internally.
    • Introduce EnsureLoaded() that loads/parses SST only on first access to SST content.
    • GetEntryAt, Items, and any APIs that require SST content must call EnsureLoaded().
    • Count/UniqueCount should be available without fully parsing if feasible; if not feasible, ensure they do not force load unless accessed.

B. Do not touch SST bytes when unused

  • Implement dirty tracking:
    • Reading SST does not mark it dirty.
    • Any mutation (AddEntry, etc.) must mark SST dirty.
  • Override Commit() so that:
    • If SST is not dirty, Commit() is a no-op (do not open output stream or write), preserving original SST bytes and avoiding forced load.
    • If SST is dirty, write the updated SST.
  • Ensure that XSSFWorkbook open + Write() with no shared-string access does not cause SST parsing or writing.

C. Replace DOM parsing with streaming parser (full fidelity)

  • Replace/augment ReadFrom(Stream) in SharedStringsTable to use XmlReader streaming parsing rather than XmlDocument.
  • Must correctly parse and populate:
    • CT_Sst header attributes count and uniqueCount.
    • CT_Rst entries including:
      • plain <t> text
      • rich text runs <r> with <rPr> and <t>
      • phonetic runs <rPh> and <phoneticPr>
  • Preserve correctness for existing test files like 51519.xlsx.

D. Allocation reduction / Span usage

  • Use Span/pooling patterns where possible:
    • Prefer XmlReader.ReadValueChunk (or equivalent) into pooled buffers.
    • Build text with minimal intermediate allocations (ValueStringBuilder-like approach or pooled StringBuilder).
    • Avoid building stmap during read; build lazily only when AddEntry() is used.
    • Avoid using CT_Rst.XmlText as a key during read (it is allocation-heavy).

E. Tests

Add/extend tests under testcases/:

  1. Lazy behavior test: open a workbook that contains SST, call Write() without accessing any string cells or GetSharedStringSource(), and assert SST was not loaded and not committed. Use reflection if needed.
  2. Rich/phonetic fidelity test: open workbook with rich + phonetic shared strings (e.g., 51519.xlsx), ensure parsed shared strings match expectations and round-trip via write+read.

F. Benchmarks (optional but desired)

  • If feasible, extend benchmarks/NPOI.Benchmarks/LargeExcelFileBenchmark.cs to demonstrate improved load time/memory with lazy SST (e.g., workbook open+dispose without touching shared strings).

Acceptance criteria

  • Existing tests pass.
  • New tests pass and validate lazy/no-touch behavior.
  • XSSFWorkbook open does not parse SST unless shared-string functionality is accessed.
  • Write() does not parse or rewrite SST if not used/modified.
  • Rich text and phonetic runs in SST are parsed correctly.

Notes

  • This PR should target base branch master.
  • Keep public API changes minimal; prefer internal lazy loading without breaking existing call sites.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Help me optimize the reading performance of Shared string table and make it lazy loading when necessary

Assistant: To optimize Shared Strings Table (SST) reading performance and make it lazy-load when necessary, it helps to separate the two main consumers in NPOI:

  1. UserModel (.xlsx via XSSFWorkbook) currently uses XSSF.Model.SharedStringsTable which eagerly parses the full XML DOM.
  2. EventUserModel (SAX-style) already has a lighter `ReadOnlyShared...

This pull request was created from Copilot chat.


Copilot AI and others added 2 commits March 12, 2026 11:07
…ptimizations

Co-authored-by: tonyqus <772561+tonyqus@users.noreply.github.com>
Co-authored-by: tonyqus <772561+tonyqus@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Optimize XSSFWorkbook loading performance with lazy-loaded SST" to "Lazy-load SharedStringsTable with streaming parser, dirty tracking, and allocation reduction" on Mar 12, 2026
tonyqus added this to the NPOI 2.8.0 milestone on Mar 12, 2026
Co-authored-by: tonyqus <772561+tonyqus@users.noreply.github.com>
tonyqus marked this pull request as ready for review on March 12, 2026 12:16
tonyqus merged commit d4f0963 into master on Mar 12, 2026
3 of 4 checks passed
tonyqus (Member) commented Mar 12, 2026

BenchmarkDotNet v0.13.12, Windows 11 (10.0.26200.8037)
11th Gen Intel Core i7-1165G7 2.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 10.0.103
[Host] : .NET 8.0.24 (8.0.2426.7010), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI [AttachedDebugger]
ShortRun : .NET 8.0.24 (8.0.2426.7010), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Benchmark Summary based on NPOI 2.7.6

Job=ShortRun IterationCount=3 LaunchCount=1
WarmupCount=3

| Method | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- |
| XSSFWorkbookLargeSstOpenDispose | 69.76 s | 20.63 s | 1.131 s | 1912000.0000 | 875000.0000 | 31000.0000 | 11.15 GB |
| XSSFWorkbookLargeSstLoadStrings | 67.11 s | 121.31 s | 6.649 s | 1910000.0000 | 874000.0000 | 30000.0000 | 11.15 GB |

Benchmark Summary based on NPOI master branch

Job=ShortRun IterationCount=3 LaunchCount=1
WarmupCount=3

| Method | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- |
| XSSFWorkbookLargeSstOpenDispose | 92.17 s | 187.17 s | 10.26 s | 1789000.0000 | 822000.0000 | 23000.0000 | 10.35 GB |
| XSSFWorkbookLargeSstLoadStrings | 94.56 s | 354.80 s | 19.45 s | 1819000.0000 | 836000.0000 | 19000.0000 | 10.59 GB |

Allocated memory, NPOI 2.7.6 -> master:
XSSFWorkbookLargeSstOpenDispose: 11.15 GB -> 10.35 GB (7% decrease)
XSSFWorkbookLargeSstLoadStrings: 11.15 GB -> 10.59 GB (5% decrease)
