Eliminate redundant CT_Sst.si parallel list in SharedStringsTable write path #1793
Merged
…te path
- WriteTo: bypass _sstDoc.Save(), iterate strings list directly for streaming write
- ReadFromStreamViaXmlReader: remove _sstDoc init and sst.si.Add(currentSi) dual-write
- AddEntry: remove _sstDoc.GetSst().si.Add(newSt) dual-write
- CT_Rst.Write: change internal to public so ooxml assembly can call it directly
- Add SSTWriteBenchmark for write-side performance comparison

Agent-Logs-Url: https://github.com/nissl-lab/npoi/sessions/89e6fe6d-1f73-414c-9962-5790e867a50e
Co-authored-by: tonyqus <772561+tonyqus@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Refactor write path for SharedStringsTable to eliminate redundancy" to "Eliminate redundant CT_Sst.si parallel list in SharedStringsTable write path" on Apr 6, 2026
- SharedStringsTable.UseDirectWrite = true (default): direct streaming write from strings list
- UseDirectWrite = false: legacy path, rebuild sst.si from strings, call _sstDoc.Save()
- SSTWriteBenchmark: split into XSSFWorkbookWriteLargeSstDirectWrite and XSSFWorkbookWriteLargeSstLegacy for side-by-side comparison

Agent-Logs-Url: https://github.com/nissl-lab/npoi/sessions/9a4c0e1a-4eda-4e84-8552-ea404c7db16c
Co-authored-by: tonyqus <772561+tonyqus@users.noreply.github.com>
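To make the two paths concrete, here is a minimal, self-contained sketch of a flag-gated write. It is not NPOI's code: SstSketch, WriteItems, the use of plain strings in place of CT_Rst, and the simplified sst element (only uniqueCount is written) are assumptions made for illustration. Only the UseDirectWrite name and the idea of rebuilding a parallel list on the legacy path come from the commit message above.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical stand-in for SharedStringsTable; plain strings replace CT_Rst.
public sealed class SstSketch
{
    private const string Ns = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Same name as the switch in the commit message; everything else is illustrative.
    public static bool UseDirectWrite { get; set; } = true;

    // The single authoritative list of entries.
    private readonly List<string> strings = new List<string>();

    public void AddEntry(string value) => strings.Add(value);

    public void WriteTo(Stream output)
    {
        if (UseDirectWrite)
        {
            // Direct path: stream straight from the authoritative list.
            WriteItems(output, strings);
        }
        else
        {
            // Legacy-style path: copy into a second list first, then serialize that.
            var si = new List<string>(strings);
            WriteItems(output, si);
        }
    }

    private static void WriteItems(Stream output, IReadOnlyList<string> items)
    {
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("sst", Ns);
            writer.WriteAttributeString("uniqueCount", XmlConvert.ToString(items.Count));
            foreach (string item in items)
            {
                writer.WriteStartElement("si", Ns);
                writer.WriteElementString("t", Ns, item);
                writer.WriteEndElement(); // </si>
            }
            writer.WriteEndElement(); // </sst>
        }
    }
}
```

In this sketch both branches emit the same XML; the only difference is the extra list allocated on the legacy-style branch, which is the kind of overhead the PR sets out to remove.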
Member
/oc run new sst write benchmark and show me the result
Contributor
Results from
At 100K rows, the direct write path is ~1.4% faster and allocates ~2 MB less (~0.6% reduction). The difference is modest because the benchmark measures full
The 10K results are noisy (ShortRun, only 3 iterations); the ~15% speed regression there is within the error margin.
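For a sense of scale, the sketch below estimates how many bytes a List<T> of references allocates for its backing arrays when it is filled one Add at a time. Whether this is where the ~2 MB saving comes from is an assumption, not something stated in this thread. The helper name BackingArrayBytes, the 8-byte reference size (64-bit runtime), and the growth policy (start at 4, double when full, as in current .NET List<T>) are likewise assumptions, and array headers are ignored.

```csharp
// Hypothetical helper, not part of NPOI or the PR.
public static class ListGrowthEstimate
{
    // Sums the sizes of every backing array allocated while adding `count`
    // references one at a time, assuming capacity starts at 4 and doubles when full.
    public static long BackingArrayBytes(int count, int referenceSize = 8)
    {
        long totalBytes = 0;
        long capacity = 0;
        for (int added = 1; added <= count; added++)
        {
            if (added > capacity)
            {
                // Growth step: a new, larger array is allocated and the old one becomes garbage.
                capacity = capacity == 0 ? 4 : capacity * 2;
                totalBytes += capacity * referenceSize;
            }
        }
        return totalBytes;
    }
}
```

Under these assumptions, BackingArrayBytes(100_000) returns 2,097,120 bytes, roughly 2 MB, which is the same order of magnitude as the difference reported above. The entries themselves are shared between the two lists, so only the list's own arrays would be duplicated.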

- Add internal static bool UseDirectWrite { get; set; } = true; to SharedStringsTable
- In WriteTo, branch on UseDirectWrite: when false, rebuild sst.si from strings and call _sstDoc.Save() (old path)
- SSTWriteBenchmark.cs: split into two benchmark methods, one per path, for side-by-side comparison
- TestSharedStringsTable tests pass
Original prompt
Background
PR #1728 introduced lazy-loading and a streaming XmlReader-based parser for SharedStringsTable. However, the write path still has a redundant intermediate layer:
- SharedStringsTable.strings (a List<CT_Rst>) holds all shared string entries
- _sstDoc.GetSst().si (another List<CT_Rst>) holds the exact same object references in parallel
- WriteTo() goes through _sstDoc.Save() → sst.si → CT_Rst.Write(), when it could directly iterate strings
Changes Required
1. ooxml/XSSF/Model/SharedStringsTable.cs
a) WriteTo: bypass _sstDoc entirely
Replace:
With a direct streaming write that iterates strings instead of sst.si:
b) ReadFromStreamViaXmlReader: remove the dual-write to sst.si
In the EndElement → "si" case (around line 366-372), remove the sst.si.Add(currentSi) line, keeping only strings.Add(currentSi):
Since sst.si is no longer consumed by WriteTo, the local CT_Sst sst = _sstDoc.GetSst(); variable at the top of ReadFromStreamViaXmlReader is also no longer needed and can be removed. The _sstDoc initialization lines at the top of ReadFromStreamViaXmlReader can also be removed (the field _sstDoc is still needed for the UseXmlReader = false / DOM fallback path via ReadFromStreamViaXmlDocument, so do NOT remove the _sstDoc field itself).
c) AddEntry: remove the dual-write to sst.si
Remove the _sstDoc.GetSst().si.Add(newSt) line:
2. benchmarks/NPOI.Benchmarks/LargeSSTBenchmark.cs
Add a new benchmark method that measures the write path for a dirty SST: create a workbook with many unique shared strings and write it to a stream. This lets us compare allocation and throughput before and after the change.
Add a new benchmark class or new methods to LargeSSTBenchmark:
Place this as a new benchmark class SSTWriteBenchmark in a new file benchmarks/NPOI.Benchmarks/SSTWriteBenchmark.cs with [ShortRunJob] and [MemoryDiagnoser] attributes.
Invariants to preserve
- Do NOT remove UseXmlReader, ReadFromStreamViaXmlDocument, the _sstDoc field, or any related code; these must remain intact for the DOM fallback path.
- The EnsureLoaded() call in WriteTo must remain; it ensures the lazy-loaded SST is populated before writing.
- The ReadFromStreamViaXmlDocument path still populates _sstDoc.GetSst().si (not strings); this is a known pre-existing limitation of that path and should NOT be changed in this PR.
Goal
Eliminate the redundant parallel List<CT_Rst> population in the streaming read path and write path, reducing per-entry allocations and indirect call overhead during WriteTo.
The following is the prior conversation context from the user's chat exploration (may be truncated):
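Assistant: I now have a complete picture of the entire write path. Let me analyze it:
Assistant: I now have a full grasp of the code structure and can give a detailed analysis and a proposed approach.
Current state analysis
PR #1728 already made the following optimizations:
- XmlReader parsing replaces DOM, skipping the XmlDocument object
- _dirty flag: an unmodified SST is passed through as its original bytes, skipping serialization entirely
However, on the write side (when _dirty = true), the path still contains a redundant intermediate layer: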