Replace Hashtable/ArrayList with generic collections in formula eval hot path #1742
…ation hot path

Eliminates boxing overhead and improves type safety in the most frequently called code during formula evaluation.

- PlainCellCache: Hashtable → Dictionary<Loc, PlainValueCellCacheEntry>
- FormulaCellCache: Hashtable → Dictionary<object, FormulaCellCacheEntry>
  - Also fixes bug: Remove() was keying on cell instead of cell.IdentityKey
- OperationEvaluatorFactory: Hashtable → Dictionary<OperationPtg, Function>
- FormulaUsedBlankCellSet: Hashtable → Dictionary<BookSheetKey, BlankCellSheetGroup>
- Ptg.ReadTokens: ArrayList → List<Ptg>
- FormulaParser.Arguments/ParseArrayRow: ArrayList → List<ParseNode>/List<object>

All lookups converted to TryGetValue to avoid double-lookup patterns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When both operands are very large doubles (> ~7.9e28), casting to decimal throws OverflowException. Fall back to double arithmetic, matching Excel behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ken-swyfft left a comment
Code Review
Overall: Approve with minor suggestions. The collection replacements are mechanically correct, no public API signatures change, and no thread-safety regression.
Good catch: FormulaCellCache.Remove() bug fix
The old code was calling _formulaEntriesByCell.Remove(cell) instead of Remove(cell.IdentityKey) — meaning entries were never actually removed, causing a memory leak during cell mutations. The fix is correct.
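A minimal sketch of why the wrong key went unnoticed, assuming a cache keyed by object as described above. The tuple standing in for a cell and the variable names are hypothetical, not NPOI types. Because the key type is object, passing the cell itself compiles, but the lookup never matches the stored identity key.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a cell: a tuple carrying its identity key.
var cell = (IdentityKey: new object(), Name: "A1");

// Cache keyed by the identity key (typed as object), mirroring the review's description.
var cache = new Dictionary<object, string>();
cache[cell.IdentityKey] = "cached formula entry";

// Buggy pattern: the cell itself is passed as the key.
// It compiles because any value converts to object, but nothing matches.
bool removedByCell = cache.Remove(cell);
int countAfterWrongRemove = cache.Count;

// Fixed pattern: remove by the same key that was used to insert.
bool removedByKey = cache.Remove(cell.IdentityKey);

Console.WriteLine($"Remove(cell): {removedByCell}, entries left: {countAfterWrongRemove}");
Console.WriteLine($"Remove(cell.IdentityKey): {removedByKey}, entries left: {cache.Count}");
```

Since the key type remains object after the refactoring, the compiler still cannot catch this class of mistake, which is one reason a regression test is useful here.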
Moderate concerns
- Missing test coverage for the FormulaCellCache.Remove fix: This is a real behavioral change (fixing a memory leak). A regression test demonstrating that entries are actually evicted after NotifyDeleteCell would guard against this silently regressing.
- Missing test coverage for the MultiplyEval overflow fix: Casting a double > ~7.9e28 to decimal throws OverflowException; the fallback to double arithmetic is correct and matches Excel. However, the existing test file (TestMultiplyEval.cs) doesn't cover this scenario. A test case like Confirm(new NumberEval(1e29), new NumberEval(1e29), 1e58) would lock this in.
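The overflow fallback described above can be sketched as follows. This is an illustration only: MultiplyWithFallback is a hypothetical helper, not the actual MultiplyEval code, and it assumes the multiplication is normally done in decimal.

```csharp
using System;

// Hypothetical helper: multiply via decimal, fall back to double on overflow.
static double MultiplyWithFallback(double a, double b)
{
    try
    {
        // The casts throw OverflowException when a value exceeds decimal's range
        // (about 7.9e28); the decimal product can overflow in the same way.
        return (double)((decimal)a * (decimal)b);
    }
    catch (OverflowException)
    {
        // double has a far larger range, so very large operands are handled here.
        return a * b;
    }
}

Console.WriteLine(MultiplyWithFallback(2.0, 3.0));
Console.WriteLine(MultiplyWithFallback(1e29, 1e29));
```

A test for the large-operand case is probably safest with a relative tolerance, since the double product of two rounded operands need not equal the literal 1e58 bit for bit.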
Minor notes
- The 32% benchmark improvement is plausible: the double-lookup → TryGetValue in the innermost eval loop is likely the primary driver, and the Remove bug may have caused unbounded cache growth slowing hash operations. Profiler attribution would strengthen the claim.
- FormulaCellCache key type remains object (because IEvaluationCell.IdentityKey returns Object). Fine for this PR's scope, just noting it.
- Pre-existing Equals null-safety issues in Loc and BookSheetKey (direct cast without null/type check): out of scope but worth a follow-up.
- Removal of unused System.Runtime.Serialization.Formatters.Binary using in Ptg.cs is correct cleanup.
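For readers unfamiliar with the double-lookup pattern mentioned in the first note, here is a small sketch contrasting the styles. The keys and values are made up for illustration and do not come from the PR.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Non-generic Hashtable: the int key is boxed and the value needs a cast.
var table = new Hashtable();
table[1] = "one";
string fromTable = (string)table[1];

// Generic Dictionary: no boxing of the key, no cast on the value.
var cache = new Dictionary<int, string> { [1] = "one" };

// Double lookup: ContainsKey hashes the key once, the indexer hashes it again.
string viaContains = cache.ContainsKey(1) ? cache[1] : "";

// Single lookup: TryGetValue finds the entry and returns the value in one pass.
bool found = cache.TryGetValue(1, out var viaTry);

Console.WriteLine($"{fromTable} {viaContains} {viaTry} {found}");
```

Whether this accounts for most of the measured gain is not established here; as the note says, profiler attribution would be needed to confirm it.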
…rflow

- TestFormulaCellCacheRemoveActuallyEvicts: verifies entries are evicted after Remove(), guarding against the bug where Remove() keyed on cell instead of cell.IdentityKey
- TestLargeValuesOverflowDecimal: verifies MultiplyEval falls back to double arithmetic when operands exceed decimal range (~7.9e28)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CI is failing above, but I think those failures will be fixed by this PR: #1746. All tests pass cleanly on my local machine.
Thanks for the thorough review! Both moderate concerns were valid and are addressed in 42364a3: 1. 2. Full suite green: 2,752 main + 1,787 OOXML = 4,539 passed across both net8.0 and net472.
A logic change to XSSFRow.GetEnumerator in the recent PR #1753 caused the unit test TestCellIterationOrderWithSparseColumns to fail. After abandoning SortedDictionary, the cells returned from GetEnumerator are no longer sorted by column index. Performance after this change is very good, but it may cause problems for existing NPOI-based logic: developers may assume that the cells returned by GetEnumerator are sorted, as they were in the legacy NPOI package. What do you think?
Good catch — you're right that this is a concern. After I've put up #1759 to fix this with a lazy-sorted cache approach:
Applied to 🤖 This response was written with the help of Claude (AI assistant).
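One way a lazy-sorted cache could work is sketched below. This is a guess at the general idea, not the code from #1759: the dictionary, the SetCell and EnumerateInColumnOrder helpers, and the stale flag are all hypothetical. Writes stay cheap on an unordered Dictionary, and the column indexes are sorted only when enumeration is requested after a change.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Unordered storage: column index -> cell value (strings stand in for cells).
var cells = new Dictionary<int, string>();

// Cached sorted column indexes, rebuilt only when marked stale.
int[] sortedKeys = Array.Empty<int>();
bool sortedKeysStale = true;

// Hypothetical write path: store the cell and invalidate the sorted cache.
void SetCell(int column, string value)
{
    cells[column] = value;
    sortedKeysStale = true;
}

// Hypothetical read path: sort lazily, then yield cells in column order.
IEnumerable<string> EnumerateInColumnOrder()
{
    if (sortedKeysStale)
    {
        sortedKeys = cells.Keys.ToArray();
        Array.Sort(sortedKeys);
        sortedKeysStale = false;
    }
    foreach (int column in sortedKeys)
    {
        yield return cells[column];
    }
}

SetCell(5, "F");
SetCell(0, "A");
SetCell(2, "C");
Console.WriteLine(string.Join(",", EnumerateInColumnOrder()));
```

The trade-off in this sketch is that the first enumeration after any write pays for a full sort, while repeated enumerations without writes reuse the cached order.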
/oc please help review this PR. Remember you are doing pair programming with Claude Pro |
PR Review: Replace Hashtable/ArrayList with generic collections
Approve: clean mechanical refactoring with solid tests.
Changes Reviewed
Key Observations
All tests pass on both net8.0 and net472. No public API changes.
LGTM |

Summary
- Replace Hashtable with Dictionary<K,V> and ArrayList with List<T> in the formula evaluation hot path, eliminating boxing overhead and double-lookup patterns
- Fix FormulaCellCache.Remove(), which was keying on the cell object instead of cell.IdentityKey, causing silent removal failures
- Fix Decimal overflow crash in MultiplyEval when operands exceed decimal range (~7.9e28), falling back to double arithmetic to match Excel behavior
- Pass args to BenchmarkSwitcher.Run() so --filter works from CLI

Files changed
- PlainCellCache.cs: Hashtable → Dictionary<Loc, PlainValueCellCacheEntry>
- FormulaCellCache.cs: Hashtable → Dictionary<object, FormulaCellCacheEntry>
- OperationEvaluatorFactory.cs: Hashtable → Dictionary<OperationPtg, Function>
- FormulaUsedBlankCellSet.cs: Hashtable → Dictionary<BookSheetKey, BlankCellSheetGroup>
- Ptg.cs: ArrayList → List<Ptg>
- FormulaParser.cs: ArrayList → List<ParseNode>/List<object>
- MultiplyEval.cs: catch OverflowException for large values
- Program.cs: pass args through to BenchmarkSwitcher

Benchmark results (EvaluateAll on 1.43M formulas, 17MB .xlsx)
Test plan
- LargeExcelFileBenchmark.Evaluate completes without crash

🤖 Generated with Claude Code