diff --git a/CHANGELOG.md b/CHANGELOG.md
index cf26e39..0f8914c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,19 @@ and this project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.ht
 
 ## [Unreleased]
 
+## [1.27.1] - 2026-05-24
+
+### Performance
+
+- **In-memory deep run search is no longer quadratic.** `InMemoryFlowRunStore`
+  deep search (`deepSearch: true`) scanned the global step dictionary per
+  candidate run — O(runs × total_steps) — so latency and allocation grew
+  quadratically with run history. It now enumerates each run's steps via the
+  existing `_stepKeysByRun` index (O(runs × steps_in_run)). At 10,000 stored
+  runs a deep search drops from ~25 s / 4.6 GB to ~24 ms / 2.6 MB. Added a
+  BenchmarkDotNet case (`tests/benchmarks/.../RunSearchBenchmarks.cs`)
+  characterising the quick vs deep tiers and the before/after.
+
 ## [1.27.0] - 2026-05-24
 
 ### Changed — RUN search performance + dependency roll-up
diff --git a/Directory.Build.props b/Directory.Build.props
index 1ff4ccd..9553bed 100644
--- a/Directory.Build.props
+++ b/Directory.Build.props
@@ -5,7 +5,7 @@
     <RepositoryUrl>https://github.com/hoangsnowy/FlowOrchestrator</RepositoryUrl>
     <RepositoryType>git</RepositoryType>
     <PackageProjectUrl>https://github.com/hoangsnowy/FlowOrchestrator</PackageProjectUrl>
-    <VersionPrefix>1.27.0</VersionPrefix>
+    <VersionPrefix>1.27.1</VersionPrefix>
     <PackageReadmeFile>README.md</PackageReadmeFile>
     <PackageIcon>icon.png</PackageIcon>
   </PropertyGroup>
diff --git a/docs/benchmarks/run-search-quick-vs-deep-2026-05-24.md b/docs/benchmarks/run-search-quick-vs-deep-2026-05-24.md
new file mode 100644
index 0000000..825c7a7
--- /dev/null
+++ b/docs/benchmarks/run-search-quick-vs-deep-2026-05-24.md
@@ -0,0 +1,154 @@
+# RUN search — quick vs deep tier cost, 2026-05-24
+
+Quantifies the cost gap between the two RUN-search tiers added in v1.27.0
+on `IFlowRunStore.GetRunsPageAsync(..., bool deepSearch, ...)`, measured
+against the in-process `InMemoryFlowRunStore`.
+
+- **Quick** (`deepSearch: false`): matches the search term only against run
+  identity columns (id, flow name, trigger key, status, background job id),
+  then short-circuits. Work is O(runs).
+- **Deep** (`deepSearch: true`): additionally scans the step rows for every
+  run that survives the identity filter.
+
+Benchmark: `tests/benchmarks/FlowOrchestrator.Benchmarks/RunSearchBenchmarks.cs`.
+
+## Why the deep path is expensive in the in-memory store
+
+Before the fix, `InMemoryFlowRunStore.MatchesRunSearch`
+(`src/FlowOrchestrator.InMemory/InMemoryFlowRunStore.cs:678`) implemented the
+deep branch as:
+
+```csharp
+return _steps.Values.Any(s =>
+    s.RunId == run.Id
+    && (ContainsIgnoreCase(s.StepKey, search)
+        || ContainsIgnoreCase(s.ErrorMessage, search)
+        || ContainsIgnoreCase(s.OutputJson, search)));
+```
+
+`_steps` is the **global** step dictionary across all runs in history. The
+predicate filters by `s.RunId == run.Id` *inside* the scan, so each candidate
+run walks the entire global step keyspace — O(total_steps). That predicate runs
+once per run surviving the identity filter (`ApplyRunsFilter` →
+`MatchesRunSearch`, line 672-673), so the whole search is
+**O(runs × total_steps)**. With a fixed steps-per-run, total_steps grows with
+run count, making the deep path scale **quadratically** with history size.
+
+The quick path returns `false` immediately after the identity-column checks
+(`InMemoryFlowRunStore.cs:689-690`), so it never touches `_steps`.
+
+## Setup
+
+- Each run has 6 completed steps (Started → Dispatched → Claimed → Completed).
+- Every step's `OutputJson` is a ~300-byte JSON blob; the search needle
+  (`needle-7f3a`) is planted in exactly **one** step's output per run and in
+  **no** identity column. So the quick path is a true negative (0 matches, no
+  step scan) and the deep path does the full representative scan and matches.
+- `take: 20` (the dashboard default page size).
+- `TotalRuns` sweeps {1,000, 10,000}; total steps in store = 6 × TotalRuns.
+- In-process emit toolchain, ShortRun (3 warmup / 3 iterations / 1 launch):
+  before the fix the deep/10,000 cell was ~25 s per op, so a full job was infeasible.
+- BenchmarkDotNet v0.15.8, .NET 10.0.6, Intel Core Ultra 7 255H, 16 cores.
+
+## Results — before fix (original quadratic deep branch)
+
+Figures below are from a flag-free reproduction run; a prior `--job short` run
+agreed within run-to-run noise (e.g. quick/1,000: 61.15 µs vs 62.12 µs; deep
+allocations identical to the byte). Allocations are deterministic and the
+strongest signal; the deep/10,000 time has wide variance (n=3 short iterations
+over a ~25 s op) but the order of magnitude and the quadratic shape are
+unambiguous.
+
+| TotalRuns | Quick mean | Quick alloc | Deep mean | Deep alloc | Deep ÷ Quick (time) | Deep ÷ Quick (alloc) |
+|---:|---:|---:|---:|---:|---:|---:|
+| 1,000  | 62.12 µs   | 133.44 KB   | 91.09 ms  | 47,185 KB (≈46 MB)     | **1,466×** | **354×** |
+| 10,000 | 1,060.7 µs | 1,328.75 KB | 24,966 ms (≈25.0 s) | 4,691,243 KB (≈4.6 GB) | **23,537×** | **3,530×** |
+
+## Fix — per-run step-key index
+
+The deep branch of `MatchesRunSearch` now enumerates the run's own steps via the
+existing `_stepKeysByRun` index and looks each one up directly in `_steps`,
+instead of scanning the global dictionary. The shipped code
+(`src/FlowOrchestrator.InMemory/InMemoryFlowRunStore.cs`) uses the equivalent `stepKeys.Keys.Any(...)`; shown here as a loop:
+
+```csharp
+if (!_stepKeysByRun.TryGetValue(run.Id, out var stepKeys))
+    return false;
+foreach (var stepKey in stepKeys.Keys)
+{
+    if (_steps.TryGetValue((run.Id, stepKey), out var s)
+        && (ContainsIgnoreCase(s.StepKey, search)
+            || ContainsIgnoreCase(s.ErrorMessage, search)
+            || ContainsIgnoreCase(s.OutputJson, search)))
+    {
+        return true;
+    }
+}
+return false;
+```
+
+This makes the per-run cost O(steps_in_run) instead of O(total_steps), so the
+whole deep search is O(runs × steps_in_run) — linear in history size, same
+asymptotic shape as the quick tier (deep just does a small constant of extra
+work per run). It mirrors the index `GetRunDetailAsync` already uses.
+
+## Results — after fix (O(runs × steps_in_run))
+
+| TotalRuns | Quick mean | Quick alloc | Deep mean | Deep alloc | Deep ÷ Quick (time) | Deep ÷ Quick (alloc) |
+|---:|---:|---:|---:|---:|---:|---:|
+| 1,000  | 56.44 µs   | 102.19 KB   | 1.46 ms   | 262.83 KB   | ~26×  | ~2.6× |
+| 10,000 | 1,532.5 µs | 1,016.25 KB | 24.17 ms  | 2,622.36 KB | ~16×  | ~2.6× |
+
+Deep 1,000 → 10,000 now scales ~17× in time (24.17 ms / 1.46 ms, was ~274×),
+which is near-linear rather than quadratic. At 10,000 runs the deep search dropped
+from **24,966 ms → 24.17 ms (~1,030×)** and **~4.6 GB → ~2.6 MB allocations
+(~1,800×)**. (Deep times from n=3 short iterations have a wide CI; the order of magnitude and the now-linear scaling are the signal.)
+
+## Headline
+
+Before the fix, the two tiers differed by more than a constant factor: they were
+in different complexity classes. Scaling `TotalRuns` from 1,000 to 10,000 (10×):
+
+- **Quick** grows ~17× in time (1,061 µs / 62 µs) — linear-ish, dominated by
+  LINQ `Where`/`OrderByDescending`/`ToList` over the run set, and ~10× in
+  allocation (133 KB → 1.33 MB), matching the run-count scaling.
+- **Deep** grows ~274× in time (24,966 ms / 91 ms) and ~99× in allocation
+  (46 MB → 4.6 GB) — quadratic, exactly the O(runs × total_steps) blow-up
+  predicted above (10× runs × 10× total_steps ≈ 100×).
+
+At 10,000 runs in store, a single deep search takes **~25 seconds** and
+allocates **4.6 GB** (driving sustained Gen2 collections), versus **~1 ms /
+1.3 MB** for quick. Choosing the quick tier when the caller does not need
+step-body matching is a **~23,500× latency win** and a **~3,500× allocation
+reduction** at that scale.
+
+This is the data backing the v1.27.0 design decision to default the dashboard
+run list to the quick tier and gate deep search behind an explicit opt-in.
+
+The numbers above are the **original** quadratic implementation; the index fix
+documented in "Fix — per-run step-key index" collapses the in-memory deep path
+to linear (24,966 ms → ~24 ms at 10,000 runs). Quick remains the right default
+for typeahead, but deep is no longer a multi-second, multi-GB cliff on a store
+with history.
+
+> Note: before the fix, the in-memory store was the most extreme case because the
+> deep predicate rescanned the *global* `_steps` per candidate run. The SQL Server /
+> PostgreSQL stores push the step match into a single SQL statement (EXISTS / join),
+> so their deep-vs-quick ratio is smaller, but the asymptotic shape (deep adds a
+> per-run step scan the quick path skips) is the same. This benchmark
+> characterises the in-memory runtime; a Testcontainers-backed SQL benchmark is
+> intentionally out of scope to keep the suite dependency-free and CI-runnable.
+
+## Reproducing
+
+```bash
+# from the repo root, after a Release build; current HEAD reproduces the after-fix numbers
+cd tests/benchmarks/FlowOrchestrator.Benchmarks/bin/Release/net10.0
+./FlowOrchestrator.Benchmarks.exe --filter "*RunSearchBenchmarks*"
+```
+
+No `--job` / `--inProcess` flags are needed — the benchmark pins the in-process
+emit toolchain and a ShortRun job via `[Config(typeof(RunSearchConfig))]`. The
+in-process toolchain is required because the repo's `.claude/worktrees/` copies
+of this project otherwise make BenchmarkDotNet's default toolchain fail on a
+duplicate-project-name discovery error.
diff --git a/src/FlowOrchestrator.InMemory/InMemoryFlowRunStore.cs b/src/FlowOrchestrator.InMemory/InMemoryFlowRunStore.cs
index 1fef533..94b1b84 100644
--- a/src/FlowOrchestrator.InMemory/InMemoryFlowRunStore.cs
+++ b/src/FlowOrchestrator.InMemory/InMemoryFlowRunStore.cs
@@ -689,10 +689,15 @@ private bool MatchesRunSearch(FlowRunRecord run, string search, bool deepSearch)
         if (!deepSearch)
             return false;
 
-        // Deep search also scans the current step rows (incl. OutputJson). Attempt history is
-        // intentionally not searched — it duplicates the current step row and is the dominant cost.
-        return _steps.Values.Any(s =>
-            s.RunId == run.Id
+        // Deep search also scans this run's current step rows (incl. OutputJson). Enumerate via
+        // the per-run step-key index (O(steps_in_run)) and direct-look-up each step, instead of
+        // scanning the global _steps dictionary (O(total_steps) per run — quadratic over run
+        // history). Attempt history is intentionally not searched — it duplicates the current row.
+        if (!_stepKeysByRun.TryGetValue(run.Id, out var stepKeys))
+            return false;
+
+        return stepKeys.Keys.Any(stepKey =>
+            _steps.TryGetValue((run.Id, stepKey), out var s)
             && (ContainsIgnoreCase(s.StepKey, search)
                 || ContainsIgnoreCase(s.ErrorMessage, search)
                 || ContainsIgnoreCase(s.OutputJson, search)));
diff --git a/tests/benchmarks/FlowOrchestrator.Benchmarks/RunSearchBenchmarks.cs b/tests/benchmarks/FlowOrchestrator.Benchmarks/RunSearchBenchmarks.cs
new file mode 100644
index 0000000..edad3e3
--- /dev/null
+++ b/tests/benchmarks/FlowOrchestrator.Benchmarks/RunSearchBenchmarks.cs
@@ -0,0 +1,169 @@
+using System.Globalization;
+using BenchmarkDotNet.Attributes;
+using BenchmarkDotNet.Configs;
+using BenchmarkDotNet.Jobs;
+using BenchmarkDotNet.Toolchains.InProcess.Emit;
+using FlowOrchestrator.InMemory;
+
+namespace FlowOrchestrator.Benchmarks;
+
+/// <summary>
+/// Quantifies the cost difference between the tiered RUN-search modes added in
+/// v1.27.0 on <see cref="InMemoryFlowRunStore.GetRunsPageAsync(System.Guid?, string?, int, int, string?, bool, System.DateTimeOffset?, System.DateTimeOffset?)"/>.
+/// <para>
+/// The <c>quick</c> path (<c>deepSearch: false</c>) matches the search term only
+/// against the run identity columns (id, flow name, trigger key, status, job id)
+/// and short-circuits — its work is O(runs).
+/// </para>
+/// <para>
+/// The <c>deep</c> path (<c>deepSearch: true</c>) additionally scans the step rows
+/// for every run that survives the identity filter. Before v1.27.1 the in-memory
+/// store did this with <c>_steps.Values.Any(s =&gt; s.RunId == run.Id &amp;&amp; ...)</c>,
+/// walking the whole global step keyspace per candidate run: O(runs × total_steps),
+/// quadratic in run history. <c>InMemoryFlowRunStore.MatchesRunSearch</c> now uses the
+/// per-run <c>_stepKeysByRun</c> index, so the cost is O(runs × steps_in_run).
+/// </para>
+/// <para>
+/// The search term is chosen to appear <b>only</b> inside step <c>OutputJson</c>,
+/// never in any identity column, so the quick path matches zero rows (the cheap
+/// path) while the deep path performs the full step scan and returns matches (the
+/// representative path). Both calls request a single page (<c>take: 20</c>),
+/// matching the dashboard's default page size.
+/// </para>
+/// </summary>
+/// <remarks>
+/// Uses the in-process emit toolchain (<see cref="RunSearchConfig"/>) rather than
+/// the default out-of-process job. The repo keeps stale git worktrees under
+/// <c>.claude/worktrees/</c> that each contain a copy of this benchmark project;
+/// BenchmarkDotNet's default toolchain discovers the duplicate <c>.csproj</c> by
+/// assembly name and refuses to build the boilerplate. Running in-process skips
+/// the generated project entirely. The subject is a pure in-memory store, so
+/// in-process execution does not perturb the measurement.
+/// <para>
+/// A reduced job (3 warmup / 3 iterations) is used because the pre-fix deep path at
+/// <c>TotalRuns=10_000</c> was O(runs × total_steps) ≈ 6×10⁸ comparisons per
+/// invocation and took tens of seconds per op; a full job would have run for hours.
+/// </para>
+/// </remarks>
+[MemoryDiagnoser]
+[Config(typeof(RunSearchConfig))]
+public class RunSearchBenchmarks
+{
+    /// <summary>
+    /// In-process, reduced-iteration configuration for <see cref="RunSearchBenchmarks"/>.
+    /// </summary>
+    private sealed class RunSearchConfig : ManualConfig
+    {
+        /// <summary>Initialises the config with the in-process emit toolchain and a short job.</summary>
+        public RunSearchConfig()
+        {
+            AddJob(Job.Default
+                .WithToolchain(InProcessEmitToolchain.Instance)
+                .WithWarmupCount(3)
+                .WithIterationCount(3)
+                .WithLaunchCount(1));
+        }
+    }
+
+    private const int StepsPerRun = 6;
+    private const int PageSize = 20;
+
+    /// <summary>
+    /// A token embedded in exactly one step's output JSON per run, and in no
+    /// identity column. Matching it forces the deep path to do real work while
+    /// the quick path provably returns nothing.
+    /// </summary>
+    private const string DeepOnlyNeedle = "needle-7f3a";
+
+    /// <summary>Total runs seeded into the store before each measurement.</summary>
+    [Params(1_000, 10_000)]
+    public int TotalRuns { get; set; }
+
+    /// <summary>
+    /// Selects the search tier under test: <see langword="false"/> = quick
+    /// (identity-only), <see langword="true"/> = deep (identity + step scan).
+    /// </summary>
+    [Params(false, true)]
+    public bool DeepSearch { get; set; }
+
+    private InMemoryFlowRunStore _store = null!;
+
+    /// <summary>
+    /// Seeds <see cref="TotalRuns"/> completed runs, each with
+    /// <see cref="StepsPerRun"/> steps carrying a non-trivial JSON output blob.
+    /// The needle is planted in one step per run so the deep scan has to walk to
+    /// it; identity columns never contain the needle so the quick path is a true
+    /// negative.
+    /// </summary>
+    [GlobalSetup]
+    public async Task Setup()
+    {
+        _store = new InMemoryFlowRunStore();
+
+        for (var i = 0; i < TotalRuns; i++)
+        {
+            var runId = Guid.NewGuid();
+            await _store.StartRunAsync(
+                flowId: Guid.Empty,
+                flowName: "BenchFlow",
+                runId: runId,
+                triggerKey: "manual",
+                triggerData: null,
+                jobId: null);
+
+            for (var s = 0; s < StepsPerRun; s++)
+            {
+                var stepKey = $"step_{s}";
+                await _store.RecordStepStartAsync(runId, stepKey, "noop", inputJson: null, jobId: null);
+                await _store.TryRecordDispatchAsync(runId, stepKey);
+                await _store.TryClaimStepAsync(runId, stepKey);
+
+                // Only the last step of each run carries the needle, so the deep
+                // scan must enumerate past the earlier steps before it can match.
+                var carriesNeedle = s == StepsPerRun - 1;
+                await _store.RecordStepCompleteAsync(
+                    runId, stepKey,
+                    status: "Succeeded",
+                    outputJson: BuildOutputJson(i, s, carriesNeedle),
+                    errorMessage: null);
+            }
+        }
+    }
+
+    /// <summary>
+    /// Runs a single search page against the store using the tier selected by
+    /// <see cref="DeepSearch"/>. The total from the result tuple is returned so the
+    /// result is consumed and the call cannot be optimised away.
+    /// </summary>
+    [Benchmark(Description = "GetRunsPageAsync(search, take:20)")]
+    public async Task<int> Search()
+    {
+        var (_, total) = await _store.GetRunsPageAsync(
+            flowId: null,
+            status: null,
+            skip: 0,
+            take: PageSize,
+            search: DeepOnlyNeedle,
+            deepSearch: DeepSearch);
+        return total;
+    }
+
+    /// <summary>
+    /// Builds a realistic, non-trivial output payload for a step. The needle is
+    /// embedded as a field value only when <paramref name="carriesNeedle"/> is set
+    /// so it lives exclusively in step output, never in a run identity column.
+    /// </summary>
+    /// <param name="runOrdinal">Sequential index of the run being seeded.</param>
+    /// <param name="stepOrdinal">Index of the step within the run.</param>
+    /// <param name="carriesNeedle">When <see langword="true"/>, plants the deep-only search token.</param>
+    /// <returns>A JSON object string of a few hundred bytes.</returns>
+    private static string BuildOutputJson(int runOrdinal, int stepOrdinal, bool carriesNeedle)
+    {
+        var correlation = carriesNeedle ? DeepOnlyNeedle : "ok";
+        // Hand-built JSON (no serializer dependency) shaped like a typical step
+        // output: a status block, a few scalar fields, and a small nested object.
+        return string.Create(CultureInfo.InvariantCulture, $$"""
+            {"status":"completed","stepOrdinal":{{stepOrdinal}},"runOrdinal":{{runOrdinal}},"correlation":"{{correlation}}","httpStatus":200,"durationMs":{{42 + stepOrdinal}},"payload":{"itemsProcessed":{{100 + runOrdinal % 50}},"warnings":0,"region":"westus2","retryable":false},"timestamp":"2026-05-24T12:00:00Z"}
+            """);
+    }
+}