Eliminate redundant CT_Sst.si parallel list in SharedStringsTable write path by Copilot · Pull Request #1793 · nissl-lab/npoi

Copilot · 2026-04-06T22:54:17Z

Understand codebase structure
Add internal static bool UseDirectWrite { get; set; } = true; to SharedStringsTable
In WriteTo, branch on UseDirectWrite: when false rebuild sst.si from strings and call _sstDoc.Save() (old path)
Update SSTWriteBenchmark.cs: split into two benchmark methods — one per path — for side-by-side comparison
Build succeeds — 0 errors
All 7 TestSharedStringsTable tests pass

Original prompt

Background

PR #1728 introduced lazy-loading and a streaming XmlReader-based parser for SharedStringsTable. However, the write path still has a redundant intermediate layer:

SharedStringsTable.strings (a List<CT_Rst>) holds all shared string entries
_sstDoc.GetSst().si (another List<CT_Rst>) holds the exact same object references in parallel
On write, WriteTo() goes through _sstDoc.Save() → sst.si → CT_Rst.Write(), when it could directly iterate strings

Changes Required

1. `ooxml/XSSF/Model/SharedStringsTable.cs`

a) WriteTo — bypass _sstDoc entirely:

Replace:

public void WriteTo(Stream out1)
{
    EnsureLoaded();
    CT_Sst sst = _sstDoc.GetSst();
    sst.count = count;
    sst.uniqueCount = uniqueCount;
    _sstDoc.Save(out1);
}

With a direct streaming write that iterates strings instead of sst.si:

public void WriteTo(Stream out1)
{
    EnsureLoaded();
    var sw = new StreamWriter(out1, Encoding.UTF8);
    sw.Write("<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\" ?><sst xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\"");
    sw.WriteAttribute("count", count);
    sw.WriteAttribute("uniqueCount", uniqueCount);
    sw.Write('>');
    foreach (CT_Rst si in strings)
        si.Write(sw, "si");
    sw.Write("</sst>");
    sw.Flush();
}

b) ReadFromStreamViaXmlReader — remove the dual-write to sst.si:

In the EndElement → "si" case (around line 366-372), remove the sst.si.Add(currentSi) line, keeping only strings.Add(currentSi):

case "si":
    if (currentSi != null)
    {
        // sst.si.Add(currentSi);  ← DELETE THIS LINE
        strings.Add(currentSi);
    }
    currentSi = null;
    break;

Since sst.si is no longer consumed by WriteTo, the local CT_Sst sst = _sstDoc.GetSst(); variable at the top of ReadFromStreamViaXmlReader is also no longer needed and can be removed. The _sstDoc initialization lines at the top of ReadFromStreamViaXmlReader can also be removed (the field _sstDoc is still needed for the UseXmlReader = false / DOM fallback path via ReadFromStreamViaXmlDocument, so do NOT remove the _sstDoc field itself).

c) AddEntry — remove the dual-write to sst.si:

Remove the _sstDoc.GetSst().si.Add(newSt) line:

uniqueCount++;
CT_Rst newSt = new CT_Rst();
// _sstDoc.GetSst().si.Add(newSt);  ← DELETE THIS LINE
newSt.Set(st);
int idx = strings.Count;
stmap[s] = idx;
strings.Add(newSt);
_dirty = true;
return idx;

2. `benchmarks/NPOI.Benchmarks/LargeSSTBenchmark.cs`

Add a new benchmark method that measures the write path for a dirty SST — create a workbook with many unique shared strings and write it to a stream. This lets us compare allocation and throughput before and after the change.

Add a new benchmark class or new methods to LargeSSTBenchmark:

/// <summary>
/// Creates a workbook with <see cref="WriteRowCount"/> rows of unique string cells
/// (forcing SST dirty) and writes it to a MemoryStream.
/// Measures the write-side cost of SharedStringsTable serialization.
/// </summary>
[Benchmark]
public void XSSFWorkbookWriteLargeSst()
{
    using var workbook = new XSSFWorkbook();
    var sheet = workbook.CreateSheet();
    for (int i = 0; i < WriteRowCount; i++)
        sheet.CreateRow(i).CreateCell(0).SetCellValue("UniqueString_" + i);
    using var ms = new MemoryStream();
    workbook.Write(ms, leaveOpen: true);
}

[Params(10_000, 100_000)]
public int WriteRowCount { get; set; }

Place this as a new benchmark class SSTWriteBenchmark in a new file benchmarks/NPOI.Benchmarks/SSTWriteBenchmark.cs with [ShortRunJob] and [MemoryDiagnoser] attributes.

Invariants to preserve

Do NOT remove UseXmlReader, ReadFromStreamViaXmlDocument, _sstDoc field, or any related code — these must remain intact for the DOM fallback path.
The EnsureLoaded() call in WriteTo must remain — it ensures lazy-loaded SST is populated before writing.
All existing tests must continue to pass.
The ReadFromStreamViaXmlDocument path still populates _sstDoc.GetSst().si (not strings) — this is a known pre-existing limitation of that path and should NOT be changed in this PR.

Goal

Eliminate the redundant parallel List<CT_Rst> population in the streaming read path and write path, reducing per-entry allocations and indirect call overhead during WriteTo.

The following is the prior conversation context from the user's chat exploration (may be truncated):

Assistant: 现在我对整个写入路径有了完整的认识。让我分析一下：

Assistant: 现在我已经掌握了完整的代码结构，可以给出详尽的分析和方案建议。

当前状态分析

PR #1728 已经做了以下优化：

读取侧：流式 XmlReader 解析替代 DOM，跳过了 XmlDocument 对象
脏标记（_dirty）：未修改的 SST 直接透传原始字节，跳过整个序列化

但写入侧（_dirty = true 时）的路径仍然存在一个冗余的中间层：

WriteTo(Stream)
  → EnsureLoaded()          // 确保已加载 CT_Rst 列表
  → _sstDoc.GetSst()        // 拿到 CT_Sst（sst.si 与 strings 两个列表同步存储）
  → _sstDoc.Save(stream)    //...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

*This pull request was created from Copilot chat.*
>

…te path - WriteTo: bypass _sstDoc.Save(), iterate strings list directly for streaming write - ReadFromStreamViaXmlReader: remove _sstDoc init and sst.si.Add(currentSi) dual-write - AddEntry: remove _sstDoc.GetSst().si.Add(newSt) dual-write - CT_Rst.Write: change internal to public so ooxml assembly can call it directly - Add SSTWriteBenchmark for write-side performance comparison Agent-Logs-Url: https://github.com/nissl-lab/npoi/sessions/89e6fe6d-1f73-414c-9962-5790e867a50e Co-authored-by: tonyqus <772561+tonyqus@users.noreply.github.com>

- SharedStringsTable.UseDirectWrite = true (default): direct streaming write from strings list - UseDirectWrite = false: legacy path — rebuild sst.si from strings, call _sstDoc.Save() - SSTWriteBenchmark: split into XSSFWorkbookWriteLargeSstDirectWrite and XSSFWorkbookWriteLargeSstLegacy for side-by-side comparison Agent-Logs-Url: https://github.com/nissl-lab/npoi/sessions/9a4c0e1a-4eda-4e84-8552-ea404c7db16c Co-authored-by: tonyqus <772561+tonyqus@users.noreply.github.com>

tonyqus · 2026-04-07T04:58:21Z

/oc run new sst write benchmark and show me the result

opencode-agent · 2026-04-07T05:00:41Z

Results from SSTWriteBenchmark (.NET 10, ShortRun, 3 iterations):

Method	Rows	Mean	Allocated
DirectWrite	10K	77.29 ms	34.85 MB
Legacy	10K	67.28 ms	35.1 MB
DirectWrite	100K	870.98 ms	329.96 MB
Legacy	100K	883.75 ms	331.96 MB

At 100K rows, the direct write path is ~1.4% faster and allocates ~2 MB less (~0.6% reduction). The difference is modest because the benchmark measures full workbook.Write(), and cell/row creation dominates. The allocation delta comes purely from eliminating the redundant sst.si list repopulation — at 100K strings that's ~2 MB of CT_Rst references not allocated.

The 10K results are noisy (ShortRun, only 3 iterations) — the ~15% speed regression there is within the error margin.

opencode session | github run

Initial plan

0a05994

Copilot AI assigned Copilot and tonyqus Apr 6, 2026

Copilot started work on behalf of tonyqus April 6, 2026 22:54 View session

Copilot AI changed the title ~~[WIP] Refactor write path for SharedStringsTable to eliminate redundancy~~ Eliminate redundant CT_Sst.si parallel list in SharedStringsTable write path Apr 6, 2026

Copilot AI requested a review from tonyqus April 6, 2026 23:08

Copilot finished work on behalf of tonyqus April 6, 2026 23:08

Copilot started work on behalf of tonyqus April 7, 2026 00:35 View session

Copilot finished work on behalf of tonyqus April 7, 2026 00:41

tonyqus added performance labels Apr 7, 2026

tonyqus modified the milestone: NPOI 2.8.1 Apr 7, 2026

tonyqus added enhancement xssf labels Apr 7, 2026

tonyqus marked this pull request as ready for review April 7, 2026 01:10

tonyqus merged commit 7c86779 into master Apr 7, 2026
1 of 2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Eliminate redundant CT_Sst.si parallel list in SharedStringsTable write path#1793

Eliminate redundant CT_Sst.si parallel list in SharedStringsTable write path#1793
tonyqus merged 3 commits into
masterfrom
copilot/refactor-shared-strings-table-write-path

Copilot AI commented Apr 6, 2026 •

edited

Loading

Uh oh!

Uh oh!

tonyqus commented Apr 7, 2026

Uh oh!

opencode-agent Bot commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Copilot AI commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

Changes Required

1. ooxml/XSSF/Model/SharedStringsTable.cs

2. benchmarks/NPOI.Benchmarks/LargeSSTBenchmark.cs

Invariants to preserve

Goal

当前状态分析

Uh oh!

Uh oh!

tonyqus commented Apr 7, 2026

Uh oh!

opencode-agent Bot commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Apr 6, 2026 •

edited

Loading

1. `ooxml/XSSF/Model/SharedStringsTable.cs`

2. `benchmarks/NPOI.Benchmarks/LargeSSTBenchmark.cs`