feat(logsdb): comprehensive benchmarks, engine optimizations, and interactive demo#205
Conversation
Add cross-corpus benchmarks for storage efficiency, query latency, and ingest throughput across 5 synthetic workload types (syslog, structured JSON, high-cardinality, cloud-native, mixed) at 1K/10K/100K scale. Key findings from initial run: - Syslog: 14.2 B/log (5.1× vs text, 14.9× vs ndjson) — excellent - Cloud-native: 21.1 B/log (4.4× vs text, 16.1× vs ndjson) — good - Mixed: 38.9 B/log (4.0× vs text, 9.5× vs ndjson) - Structured JSON: 47.7 B/log (6.3× vs text, 10.5× vs ndjson) - High-cardinality: 66.1 B/log (1.3× vs text, 7.7× vs ndjson) Ingest throughput: 4K-25K records/s depending on workload complexity. Query pruning working well: time_range queries prune 80-90% of chunks. Also includes: - Interactive logsdb-engine demo with Storage Explorer, Logs Explorer, and Query Builder panels using the real o11ylogsdb engine - 32 demo tests (data-gen, storage-model, query-model) - Browser compatibility fix for query.ts (performance.now fallback) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Chunks are immutable after freeze, so decoded LogRecord[] can be cached safely in a WeakMap. The query engine now uses store.decodeChunk() which returns cached results, eliminating redundant ZSTD decompression on repeated queries (the common case for dashboard refresh). Benchmark results (10K records, comprehensive-query): Before: 14-91ms p50 across corpus types After: 0.1-0.9ms p50 (cache-hot) Pattern borrowed from o11ytracesdb/src/engine.ts which uses the same WeakMap<Chunk, Span[]> approach. Memory cost: one additional reference per chunk (records are retained while chunk exists anyway). GC is automatic via WeakMap — no invalidation logic needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds three new benchmark runners (comprehensive-query, comprehensive-storage, ingest-throughput) and registers them with the bench runner. Introduces deterministic synthetic corpus generators and size presets. Implements a body-only extraction API (readBodiesOnly) and a ChunkPolicy.decodeBodiesOnly hook plus typed-columnar token metadata and a TypedColumnar decode-bodies-only path. queryStream gains a body-containing fast path that uses the body-only decoding. Re-exports readBodiesOnly from the public API. Adds a browser demo site (HTML/CSS/JS), a node:zlib browser stub, Vite config, and multiple Vitest suites for data generation, storage-model, and query-model. Possibly related PRs
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
…ry speedup" This reverts commit 92611a4.
| export function formatBodyPreview(body, maxLen = 120) { | ||
| const full = typeof body === "string" ? body : JSON.stringify(body); | ||
| if (full.length <= maxLen) return full; | ||
| return full.slice(0, maxLen) + "…"; |
There was a problem hiding this comment.
⚠️ Bug: formatBodyPreview crashes on undefined/null body
formatBodyPreview at line 166 does JSON.stringify(body) when body is not a string. If body is undefined, JSON.stringify(undefined) returns undefined (not a string), and the subsequent .length access on line 167 throws TypeError. The sibling formatBody function (line 154) already handles null/undefined/Uint8Array, but formatBodyPreview does not. This will crash the Query Builder results table in app.js:492 if any record has a missing body.
Suggested fix:
export function formatBodyPreview(body, maxLen = 120) {
if (body === null || body === undefined) return "";
if (body instanceof Uint8Array) return `<binary ${body.length} bytes>`;
const full = typeof body === "string" ? body : JSON.stringify(body);
if (full.length <= maxLen) return full;
return full.slice(0, maxLen) + "…";
}
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| decodeChunk(streamId: StreamId, chunk: Chunk): LogRecord[] { | ||
| let cached = this._decodeCache.get(chunk); | ||
| if (!cached) { | ||
| const policy = this.policyFor(streamId); | ||
| cached = readRecords(chunk, this.registry, policy); | ||
| this._decodeCache.set(chunk, cached); | ||
| } | ||
| return cached; | ||
| } |
There was a problem hiding this comment.
⚠️ Edge Case: Decode cache returns mutable array references to callers
The decodeChunk WeakMap cache (engine.ts:193-199) returns the same LogRecord[] reference to every caller. While current query.ts code only reads from this array, any future caller (or external consumer via iterRecords) that mutates the returned array (e.g., .sort(), .push(), or modifying record properties) would silently corrupt the cache for all subsequent queries. This is a latent correctness hazard for a caching layer meant to be transparent.
Consider returning a shallow copy (return [...cached]) or freezing the array with Object.freeze(cached) before caching. The cost is negligible compared to the ZSTD decode it replaces.
Suggested fix:
decodeChunk(streamId: StreamId, chunk: Chunk): LogRecord[] {
let cached = this._decodeCache.get(chunk);
if (!cached) {
const policy = this.policyFor(streamId);
cached = readRecords(chunk, this.registry, policy);
Object.freeze(cached);
this._decodeCache.set(chunk, cached);
}
return cached;
}
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| // @ts-nocheck | ||
| // ── Logs Explorer Model ─────────────────────────────────────────────── | ||
| // Curated log exploration view. Surfaces problematic patterns, | ||
| // error clusters, template analysis, and time-based insights. | ||
|
|
||
| import { query } from "o11ylogsdb"; | ||
|
|
||
| /** | ||
| * Analyze the store and produce curated insights. | ||
| */ | ||
| export function analyzeStore(store) { | ||
| store.flush(); | ||
| const stats = store.stats(); | ||
|
|
||
| // Query for errors |
There was a problem hiding this comment.
💡 Quality: Missing tests for logs-model.js module
The PR adds test files for data-gen.js, query-model.js, and storage-model.js, but logs-model.js (171 lines with analyzeStore, error clustering, template analysis, and timeline building) has no corresponding test file. Per the custom review instructions, new test files should be added for significant logic changes. The analyzeStore function contains non-trivial logic (error clustering by prefix, regex-based template extraction, per-service error rate calculation) that would benefit from test coverage.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| for (const svc of Object.keys(services)) { | ||
| const errorResult = query(store, { | ||
| resourceEquals: { "service.name": svc }, | ||
| severityGte: 17, | ||
| limit: 10000, | ||
| }); | ||
| services[svc].errors = errorResult.stats.recordsEmitted; | ||
| services[svc].errorRate = | ||
| services[svc].logs > 0 | ||
| ? ((services[svc].errors / services[svc].logs) * 100).toFixed(2) | ||
| : "0"; |
There was a problem hiding this comment.
💡 Performance: analyzeServices issues N+1 queries per service
In logs-model.js:148-158, analyzeServices iterates over all services and runs a separate query() call per service to count errors. With 6 services this is tolerable, but the pattern is O(services × records). A single query(store, { severityGte: 17 }) followed by grouping results by service attribute would be more efficient and avoid redundant chunk decoding.
Suggested fix:
// Replace per-service queries with a single error query:
const allErrors = query(store, { severityGte: 17 });
for (const r of allErrors.records) {
const svc = r.attributes?.find(a => a.key === "service.name")?.value ?? "unknown";
if (services[svc]) services[svc].errors++;
}
for (const svc of Object.values(services)) {
svc.errorRate = svc.logs > 0
? ((svc.errors / svc.logs) * 100).toFixed(2) : "0";
}
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| @@ -0,0 +1,511 @@ | |||
| // @ts-nocheck | |||
There was a problem hiding this comment.
💡 Quality: All site JS modules use @ts-nocheck instead of proper types
All 5 new JS files under site/logsdb-engine/js/ use @ts-nocheck on line 1. Per the custom review instructions, @ts-nocheck should be avoided unless strictly necessary for existing legacy files. Since these are all brand-new files, consider either converting them to .ts with proper types or at minimum adding JSDoc type annotations and removing the blanket @ts-nocheck.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
|
Note Your trial team has used its Gitar budget, so automatic reviews are paused. Upgrade now to unlock full capacity. Comment "Gitar review" to trigger a review manually. Code Review
|
| Compact |
|
Important
Your trial ends in 5 days — upgrade now to keep code review, CI analysis, auto-apply, custom automations, and more.
Was this helpful? React with 👍 / 👎 | Gitar
Two real engine optimizations for bodyContains queries: 1. Partial column decode (decodeBodiesOnly): skips sidecar JSON.parse entirely — only decodes kinds + templates + variable columns. Result: 9× speedup on high-cardinality body_substring (91ms → 10ms). 2. Template-token chunk pruning: stores template literal tokens in chunk header (codecMeta.toks). For bodyContains queries, checks if any token contains the needle BEFORE ZSTD decompression. If no template token matches and the needle looks like a keyword, the entire chunk is skipped without touching compressed data. Result: syslog body_substring 11ms → 0.1ms (113× faster), high-cardinality 91ms → 0ms (all chunks pruned at header level). These are real engine path improvements — no caching, no extra memory. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 17
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/o11ylogsdb/bench/comprehensive-query.bench.ts`:
- Around line 188-200: The benchmark is passing a record count into
profileEncode (ProfileResult expects outputBytes), so replace the encode closure
used in comprehensive-query.bench.ts (currently returning r.records.length from
query(store, spec)) with a value representing output bytes (e.g., serialize
records and measure byte length or sum individual record byte sizes) or
implement a dedicated query profiler/report type instead of reusing the
compression profiler; update references to outputBytes/bytesPerLog/compression
ratios accordingly so the saved report uses actual byte sizes rather than record
counts.
- Around line 109-154: The benchmarks hard-code syslog-specific predicates
(e.g., service_match uses "service.name":"bench-syslog" and combined_all uses
"corpus.type":"syslog" and bodyContains:"ssh"), causing non-syslog workloads to
exercise no-match paths; update the spec factories to derive predicates from the
provided records instead of fixed literals. For service_match, read a real
service name from the incoming records (e.g., use
records[0].resource["service.name"] when spec receives records) and return
resourceEquals with that value; for combined_all (and combined_time+severity if
applicable) replace the hard-coded corpus.type and body token with values
sampled from records (e.g., records[0].resource["corpus.type"] and pick a
substring present in records[0].body or scan records for a frequent token) so
each workload runs equivalent, corpus-specific queries across all benchmark
cases.
In `@packages/o11ylogsdb/bench/ingest-throughput.bench.ts`:
- Around line 72-99: The current peakHeapMB and peakRssMB only compare memBefore
and memAfter and miss transient spikes; change the measurement to track actual
peaks by sampling memory during the hot path: inside the records loop that calls
store.append(resource, SCOPE, record) (and/or right after store.flush()), call
process.memoryUsage() periodically (e.g., every N records or every append) and
update running maxHeapUsed and maxRss variables, then compute peakHeapMB and
peakRssMB from those max values instead of Math.max(memBefore.heapUsed,
memAfter.heapUsed); keep the existing memBefore/memAfter for snapshots if
desired.
In `@packages/o11ylogsdb/bench/synthetic-corpora.ts`:
- Around line 121-125: The generators leak wall-clock time into produced corpora
via Date.now() (e.g., baseNs = BigInt(Date.now()) * 1_000_000n used when
building records) and by constructing Date-based fields for bodies, so runs with
the same seed/count are not reproducible; replace uses of Date.now() / new
Date(...) and any direct time creation with deterministic values derived from
the seeded PRNG (e.g., generate a deterministic base timestamp from the seed and
then increment by deterministic offsets per record) and similarly generate body
text and any date-string fields using the same seeded RNG (affecting the blocks
around baseNs, the record time generation, and any body/date construction noted
at the other ranges). Ensure randomSeverity() remains seeded or consumes the
same PRNG instance so the entire corpus is deterministic for a given (count,
seed).
- Around line 23-27: The function rand() currently updates the module seed
inside an expression (let t = (_seed += 0x6d2b79f5)) which violates the
noAssignInExpressions rule; split the seed update into its own statement (e.g.,
_seed += 0x6d2b79f5; const t = _seed;) or add a small helper like advanceSeed()
that mutates _seed before computing t, then keep the rest of rand() unchanged
(referencing rand() and the module-level _seed variable to locate the change).
In `@packages/o11ylogsdb/src/engine.ts`:
- Around line 94-99: The decode cache _decodeCache (currently a WeakMap<Chunk,
LogRecord[]>) is unbounded because StreamRegistry keeps strong references to
Chunk, causing decoded arrays to accumulate; replace it with a bounded eviction
cache (e.g., an LRU with a max entries or byte-size limit) and use that cache
instead of _decodeCache in functions that call decode; ensure keys remain Chunk
instances (or a stable chunk id) and evict oldest entries when capacity is
exceeded so decoded LogRecord[] are released and GC-able even when
StreamRegistry retains chunk references.
- Around line 192-199: decodeChunk currently stores and returns the same mutable
LogRecord array from _decodeCache (populated via readRecords), causing
shared-mutation bugs; change to preserve immutability by either (a)
deep-freezing the cached records and their nested body/attributes before calling
this._decodeCache.set(chunk, cached) so all callers get read-only instances, or
(b) keep cached raw objects but return a shallow/deep cloned array and cloned
records (and cloned body/attributes) on each cache hit in decodeChunk so callers
get unique copies; update decodeChunk, the _decodeCache set site, and consider a
helper (e.g., cloneRecord or freezeRecord) to centralize copying/freezing logic
while still referring to decodeChunk, _decodeCache, readRecords, and LogRecord.
In `@site/logsdb-engine/css/base.css`:
- Around line 8-10: The :root CSS custom properties (--mono, --sans, --display)
are defined self-referentially (e.g., --mono: var(--mono, ...)) which creates a
cyclic/invalid value; replace those var(...) definitions with the actual
fallback token values so consumers reading var(--mono), var(--sans), or
var(--display) get the intended font stacks directly (e.g., set --mono to
"JetBrains Mono", monospace; --sans to "Inter Tight", sans-serif; --display to
"Fraunces", serif).
In `@site/logsdb-engine/js/app.js`:
- Around line 118-163: The current setTimeout wrapper still runs generateLogs()
and ingestRecords() synchronously and blocks the UI; change the implementation
to generate and ingest in incremental batches so the event loop can repaint (use
generateLogBatches() or a Web Worker). Specifically, replace the synchronous
call to generateLogs(...) and the direct ingestRecords(store, result.records)
usage inside the setTimeout callback with an async batch loop (or postMessage to
a Worker) that calls generateLogBatches(...) to produce chunks, calls
ingestRecords(store, chunk) for each chunk (or sends chunks back to the main
thread to ingest), updates onProgress and UI between batches, and only when all
batches complete set genStats, compute timings, update getStoreStats(...)
display values, call renderCurrentTab(), enable buttons, and resolve. Ensure
references to generateLogs, generateLogBatches, ingestRecords, createStore,
getStoreStats, and genStats are updated accordingly so progress updates and
stats are computed incrementally rather than after a single blocking call.
- Around line 137-159: When replacing the store after
createStore()/ingestRecords(), clear the cached query result so stale data isn't
rendered: reset the shared variable lastQueryResult (and any related cache like
lastQuery if present) to null/undefined before calling renderCurrentTab();
update the block after ingestRecords(store, ...) where store is assigned and
before renderCurrentTab() to explicitly clear lastQueryResult so Query Builder
won't render results from the previous dataset.
In `@site/logsdb-engine/js/data-gen.js`:
- Around line 340-346: generateLogs currently sets _seed but computes baseTime
from Date.now(), so the same seed still yields different results; change the
logic in generateLogs (and the similar block at the other branch around the
453-459 region) so that when a seed is provided you derive baseTime
deterministically (either compute baseTime from the seed or accept/derive a
fixed startTime from opts) instead of using Date.now(); ensure both code paths
set _seed and the deterministic baseTime consistently so timestamps are
reproducible for identical opts and seed.
- Around line 502-510: When the final batch aligns exactly with batchSize
consumers never see progress:1 because batch is emptied inside the loop; after
the loop, ensure you always emit a completion progress object: if batch.length >
0 yield { batch, progress: 1 } else if (count % batchSize === 0) yield { batch:
[], progress: 1 }; refer to the existing variables batch, batchSize, count and
the loop index i where yields occur and add this extra emit so consumers observe
completion.
- Around line 162-166: The mulberry32 function currently mutates _seed inside an
expression which violates noAssignInExpressions; fix it by first incrementing
_seed on its own (e.g., _seed = _seed + 0x6d2b79f5 or _seed += 0x6d2b79f5) into
a temporary or into _seed, then assign t from that updated value (let t =
_seed); keep the rest of the logic in function mulberry32 unchanged so
references to t and _seed remain correct.
In `@site/logsdb-engine/js/logs-model.js`:
- Around line 147-158: The errorRate is being computed from a capped query
because query(..., { limit: 10000 }) truncates recordsEmitted; update the logic
in the loop that calls query for each svc (the block using resourceEquals,
severityGte and limit) to get an uncapped total error count instead of the
limited recordsEmitted—either remove the limit or call the store/query path that
returns an aggregate/count (so services[svc].errors reflects the actual total)
and then compute services[svc].errorRate from that true total (or, if keeping a
sampled/capped approach, change the displayed label to indicate it is a sample
rather than an exact percentage). Ensure you modify the code that sets
services[svc].errors and services[svc].errorRate accordingly.
In `@site/logsdb-engine/js/query-model.js`:
- Around line 165-168: formatBodyPreview currently attempts to stringify the raw
body itself which can throw for undefined, circular structures, or produce
different previews for binary types; change formatBodyPreview to call the
existing formatBody(body) first (use the formatted string it returns), then
apply the length check and truncation (keeping the "…" suffix) using the maxLen
parameter so previews and formatting are consistent and safe; update references
to formatBodyPreview only, keeping its signature.
In `@site/logsdb-engine/js/storage-model.js`:
- Around line 65-72: The computed throughput logsPerSecond uses elapsed =
performance.now() - t0 which can be 0 (or negative) for tiny ingests, yielding
Infinity; in the return block for the function that builds the result object
(referencing elapsed, appended, and logsPerSecond) guard the divisor by checking
elapsed <= 0 and in that case set logsPerSecond to a safe value (e.g., 0 or
Math.round(appended)) or compute using a small epsilon (e.g., use
Math.max(elapsed, 1) or conditional: elapsed > 0 ? Math.round((appended /
elapsed) * 1000) : 0) before returning the object so Infinity/NaN are never
produced.
In `@site/logsdb-engine/vite.config.ts`:
- Around line 1-6: The Vite config uses __dirname (root: resolve(__dirname))
which is undefined in ESM; replace that with a directory derived from
import.meta.url using fileURLToPath and dirname: import fileURLToPath from 'url'
and dirname from 'path' (or node:path's dirname) and set root to
resolve(dirname(fileURLToPath(import.meta.url))). Keep the existing base and
defineConfig usage and update imports accordingly so root resolves correctly in
ESM.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: c752de1d-5bfe-4cdf-9b2a-ab41cd3df952
📒 Files selected for processing (20)
packages/o11ylogsdb/bench/comprehensive-query.bench.tspackages/o11ylogsdb/bench/comprehensive-storage.bench.tspackages/o11ylogsdb/bench/ingest-throughput.bench.tspackages/o11ylogsdb/bench/run.mjspackages/o11ylogsdb/bench/synthetic-corpora.tspackages/o11ylogsdb/src/engine.tspackages/o11ylogsdb/src/query.tssite/logsdb-engine/css/base.csssite/logsdb-engine/css/demo.csssite/logsdb-engine/index.htmlsite/logsdb-engine/js/app.jssite/logsdb-engine/js/data-gen.jssite/logsdb-engine/js/logs-model.jssite/logsdb-engine/js/query-model.jssite/logsdb-engine/js/storage-model.jssite/logsdb-engine/js/zlib-stub.jssite/logsdb-engine/test/data-gen.test.jssite/logsdb-engine/test/query-model.test.jssite/logsdb-engine/test/storage-model.test.jssite/logsdb-engine/vite.config.ts
| { | ||
| name: "service_match", | ||
| description: "Resource service.name match", | ||
| spec: () => ({ | ||
| resourceEquals: { "service.name": "bench-syslog" }, | ||
| }), | ||
| }, | ||
| { | ||
| name: "service_no_match", | ||
| description: "Resource service.name no match", | ||
| spec: () => ({ | ||
| resourceEquals: { "service.name": "_does_not_exist_" }, | ||
| }), | ||
| }, | ||
| { | ||
| name: "body_substring", | ||
| description: "Body contains common keyword", | ||
| spec: () => ({ bodyContains: "error" }), | ||
| }, | ||
| { | ||
| name: "combined_time+severity", | ||
| description: "Time range + severity filter", | ||
| spec: (records) => { | ||
| const first = records[0]!.timeUnixNano; | ||
| const last = records[records.length - 1]!.timeUnixNano; | ||
| const range = last - first; | ||
| return { | ||
| severityGte: 17, | ||
| range: { from: first, to: first + range / 2n }, | ||
| }; | ||
| }, | ||
| }, | ||
| { | ||
| name: "combined_all", | ||
| description: "Time + severity + service + body", | ||
| spec: (records) => { | ||
| const first = records[0]!.timeUnixNano; | ||
| const last = records[records.length - 1]!.timeUnixNano; | ||
| const range = last - first; | ||
| return { | ||
| severityGte: 13, | ||
| range: { from: first, to: first + range / 2n }, | ||
| resourceEquals: { "corpus.type": "syslog" }, | ||
| bodyContains: "ssh", | ||
| }; | ||
| }, |
There was a problem hiding this comment.
The “cross-corpus” predicates are hard-coded to syslog.
service_match always queries service.name = bench-syslog, and combined_all always filters on corpus.type = syslog plus "ssh". For every non-syslog corpus, those cases mostly measure the no-match path, so the summary table is not comparing equivalent queries across workloads.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/o11ylogsdb/bench/comprehensive-query.bench.ts` around lines 109 -
154, The benchmarks hard-code syslog-specific predicates (e.g., service_match
uses "service.name":"bench-syslog" and combined_all uses "corpus.type":"syslog"
and bodyContains:"ssh"), causing non-syslog workloads to exercise no-match
paths; update the spec factories to derive predicates from the provided records
instead of fixed literals. For service_match, read a real service name from the
incoming records (e.g., use records[0].resource["service.name"] when spec
receives records) and return resourceEquals with that value; for combined_all
(and combined_time+severity if applicable) replace the hard-coded corpus.type
and body token with values sampled from records (e.g.,
records[0].resource["corpus.type"] and pick a substring present in
records[0].body or scan records for a frequent token) so each workload runs
equivalent, corpus-specific queries across all benchmark cases.
| return profileEncode({ | ||
| corpus: corpusType, | ||
| codec: qCase.name, | ||
| inputBytes: totalChunkBytes, | ||
| rawTextBytes, | ||
| rawNdjsonBytes: rawTextBytes * 2, // rough proxy | ||
| logCount: sample.records.length || 1, | ||
| encode: () => { | ||
| const r = query(store, spec); | ||
| return r.records.length; | ||
| }, | ||
| options: { warmup: 2, iterations: 5 }, | ||
| }); |
There was a problem hiding this comment.
profileEncode() is being used with record counts instead of bytes.
profileEncode()/ProfileResult treat encode()'s return value as outputBytes, but this closure returns r.records.length. That makes outputBytes, bytesPerLog, and both compression ratios in the saved report meaningless. This benchmark needs a query-specific profiler/report type instead of reusing the compression harness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/o11ylogsdb/bench/comprehensive-query.bench.ts` around lines 188 -
200, The benchmark is passing a record count into profileEncode (ProfileResult
expects outputBytes), so replace the encode closure used in
comprehensive-query.bench.ts (currently returning r.records.length from
query(store, spec)) with a value representing output bytes (e.g., serialize
records and measure byte length or sum individual record byte sizes) or
implement a dedicated query profiler/report type instead of reusing the
compression profiler; update references to outputBytes/bytesPerLog/compression
ratios accordingly so the saved report uses actual byte sizes rather than record
counts.
| // Force GC before measurement | ||
| const g = globalThis as { gc?: () => void }; | ||
| if (typeof g.gc === "function") g.gc(); | ||
| const memBefore = process.memoryUsage(); | ||
|
|
||
| const t0 = nowMillis(); | ||
| for (const record of records) { | ||
| store.append(resource, SCOPE, record); | ||
| } | ||
| store.flush(); | ||
| const t1 = nowMillis(); | ||
|
|
||
| const memAfter = process.memoryUsage(); | ||
| const ingestMs = t1 - t0; | ||
| const stats = store.stats(); | ||
|
|
||
| return { | ||
| corpusType, | ||
| recordCount: RECORD_COUNT, | ||
| totalRawBytes, | ||
| totalChunkBytes: stats.totalChunkBytes, | ||
| ingestMs, | ||
| recordsPerSecond: RECORD_COUNT / (ingestMs / 1000), | ||
| rawMBPerSecond: (totalRawBytes / 1_000_000) / (ingestMs / 1000), | ||
| bytesPerLog: stats.totalChunkBytes / RECORD_COUNT, | ||
| chunkCount: stats.chunks, | ||
| peakHeapMB: Math.max(memBefore.heapUsed, memAfter.heapUsed) / 1_000_000, | ||
| peakRssMB: Math.max(memBefore.rss, memAfter.rss) / 1_000_000, |
There was a problem hiding this comment.
peakHeapMB and peakRssMB are not peaks here.
This only compares process.memoryUsage() before and after ingest. Any transient spike during append/flush is missed, so the reported “peak” values can materially understate memory pressure. Either sample during the hot path or rename these fields to reflect that they are end-of-run snapshots.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/o11ylogsdb/bench/ingest-throughput.bench.ts` around lines 72 - 99,
The current peakHeapMB and peakRssMB only compare memBefore and memAfter and
miss transient spikes; change the measurement to track actual peaks by sampling
memory during the hot path: inside the records loop that calls
store.append(resource, SCOPE, record) (and/or right after store.flush()), call
process.memoryUsage() periodically (e.g., every N records or every append) and
update running maxHeapUsed and maxRss variables, then compute peakHeapMB and
peakRssMB from those max values instead of Math.max(memBefore.heapUsed,
memAfter.heapUsed); keep the existing memBefore/memAfter for snapshots if
desired.
| const baseNs = BigInt(Date.now()) * 1_000_000n; | ||
| for (let i = 0; i < count; i++) { | ||
| const { severityNumber, severityText } = randomSeverity(); | ||
| records.push({ | ||
| timeUnixNano: baseNs + BigInt(i) * 1_000_000_000n, |
There was a problem hiding this comment.
These generators are not actually deterministic.
Each generator seeds the PRNG, but Date.now()/new Date(Date.now()...) still leak wall-clock time into timeUnixNano and body fields. The same (count, seed) will produce different corpora across runs, which makes the benchmark baselines non-reproducible.
Also applies to: 147-165, 202-202, 263-284, 303-336
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/o11ylogsdb/bench/synthetic-corpora.ts` around lines 121 - 125, The
generators leak wall-clock time into produced corpora via Date.now() (e.g.,
baseNs = BigInt(Date.now()) * 1_000_000n used when building records) and by
constructing Date-based fields for bodies, so runs with the same seed/count are
not reproducible; replace uses of Date.now() / new Date(...) and any direct time
creation with deterministic values derived from the seeded PRNG (e.g., generate
a deterministic base timestamp from the seed and then increment by deterministic
offsets per record) and similarly generate body text and any date-string fields
using the same seeded RNG (affecting the blocks around baseNs, the record time
generation, and any body/date construction noted at the other ranges). Ensure
randomSeverity() remains seeded or consumes the same PRNG instance so the entire
corpus is deterministic for a given (count, seed).
| if (batch.length >= batchSize) { | ||
| yield { batch, progress: i / count }; | ||
| batch = []; | ||
| } | ||
| } | ||
|
|
||
| if (batch.length > 0) { | ||
| yield { batch, progress: 1 }; | ||
| } |
There was a problem hiding this comment.
Emit progress: 1 for exact batch boundaries.
When count is an exact multiple of batchSize, the last batch is yielded from Line 503 with i / count, and the progress: 1 branch never runs. Consumers never observe completion.
Suggested fix
if (batch.length >= batchSize) {
- yield { batch, progress: i / count };
+ yield { batch, progress: (i + 1) / count };
batch = [];
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (batch.length >= batchSize) { | |
| yield { batch, progress: i / count }; | |
| batch = []; | |
| } | |
| } | |
| if (batch.length > 0) { | |
| yield { batch, progress: 1 }; | |
| } | |
| if (batch.length >= batchSize) { | |
| yield { batch, progress: (i + 1) / count }; | |
| batch = []; | |
| } | |
| } | |
| if (batch.length > 0) { | |
| yield { batch, progress: 1 }; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/js/data-gen.js` around lines 502 - 510, When the final
batch aligns exactly with batchSize consumers never see progress:1 because batch
is emptied inside the loop; after the loop, ensure you always emit a completion
progress object: if batch.length > 0 yield { batch, progress: 1 } else if (count
% batchSize === 0) yield { batch: [], progress: 1 }; refer to the existing
variables batch, batchSize, count and the loop index i where yields occur and
add this extra emit so consumers observe completion.
| // Query errors per service | ||
| for (const svc of Object.keys(services)) { | ||
| const errorResult = query(store, { | ||
| resourceEquals: { "service.name": svc }, | ||
| severityGte: 17, | ||
| limit: 10000, | ||
| }); | ||
| services[svc].errors = errorResult.stats.recordsEmitted; | ||
| services[svc].errorRate = | ||
| services[svc].logs > 0 | ||
| ? ((services[svc].errors / services[svc].logs) * 100).toFixed(2) | ||
| : "0"; |
There was a problem hiding this comment.
Don't derive service error rates from a capped query.
limit: 10000 truncates recordsEmitted, so errors and errorRate go low as soon as a service exceeds 10k error logs. If this panel is meant to show exact percentages, use an uncapped count path here; otherwise label the metric as sampled instead of exact.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/js/logs-model.js` around lines 147 - 158, The errorRate is
being computed from a capped query because query(..., { limit: 10000 })
truncates recordsEmitted; update the logic in the loop that calls query for each
svc (the block using resourceEquals, severityGte and limit) to get an uncapped
total error count instead of the limited recordsEmitted—either remove the limit
or call the store/query path that returns an aggregate/count (so
services[svc].errors reflects the actual total) and then compute
services[svc].errorRate from that true total (or, if keeping a sampled/capped
approach, change the displayed label to indicate it is a sample rather than an
exact percentage). Ensure you modify the code that sets services[svc].errors and
services[svc].errorRate accordingly.
| export function formatBodyPreview(body, maxLen = 120) { | ||
| const full = typeof body === "string" ? body : JSON.stringify(body); | ||
| if (full.length <= maxLen) return full; | ||
| return full.slice(0, maxLen) + "…"; |
There was a problem hiding this comment.
Reuse formatBody() here to avoid preview-time crashes.
JSON.stringify(undefined) returns undefined, so Line 167 can throw on full.length; circular objects will also throw, and Uint8Array previews diverge from formatBody(). Format first, then truncate.
Suggested fix
export function formatBodyPreview(body, maxLen = 120) {
- const full = typeof body === "string" ? body : JSON.stringify(body);
+ const full = formatBody(body);
if (full.length <= maxLen) return full;
return full.slice(0, maxLen) + "…";
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| export function formatBodyPreview(body, maxLen = 120) { | |
| const full = typeof body === "string" ? body : JSON.stringify(body); | |
| if (full.length <= maxLen) return full; | |
| return full.slice(0, maxLen) + "…"; | |
| export function formatBodyPreview(body, maxLen = 120) { | |
| const full = formatBody(body); | |
| if (full.length <= maxLen) return full; | |
| return full.slice(0, maxLen) + "…"; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/js/query-model.js` around lines 165 - 168,
formatBodyPreview currently attempts to stringify the raw body itself which can
throw for undefined, circular structures, or produce different previews for
binary types; change formatBodyPreview to call the existing formatBody(body)
first (use the formatted string it returns), then apply the length check and
truncation (keeping the "…" suffix) using the maxLen parameter so previews and
formatting are consistent and safe; update references to formatBodyPreview only,
keeping its signature.
| const elapsed = performance.now() - t0; | ||
|
|
||
| return { | ||
| recordsIngested: appended, | ||
| chunksClosed, | ||
| ingestTimeMs: elapsed, | ||
| logsPerSecond: Math.round((appended / elapsed) * 1000), | ||
| }; |
There was a problem hiding this comment.
Guard the throughput metric against zero elapsed time.
For very small ingests, performance.now() can return the same tick and make logsPerSecond evaluate to Infinity. Clamp or special-case elapsed <= 0.
Suggested fix
const elapsed = performance.now() - t0;
+ const safeElapsed = elapsed > 0 ? elapsed : 1;
return {
recordsIngested: appended,
chunksClosed,
ingestTimeMs: elapsed,
- logsPerSecond: Math.round((appended / elapsed) * 1000),
+ logsPerSecond: Math.round((appended / safeElapsed) * 1000),
};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const elapsed = performance.now() - t0; | |
| return { | |
| recordsIngested: appended, | |
| chunksClosed, | |
| ingestTimeMs: elapsed, | |
| logsPerSecond: Math.round((appended / elapsed) * 1000), | |
| }; | |
| const elapsed = performance.now() - t0; | |
| const safeElapsed = elapsed > 0 ? elapsed : 1; | |
| return { | |
| recordsIngested: appended, | |
| chunksClosed, | |
| ingestTimeMs: elapsed, | |
| logsPerSecond: Math.round((appended / safeElapsed) * 1000), | |
| }; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/js/storage-model.js` around lines 65 - 72, The computed
throughput logsPerSecond uses elapsed = performance.now() - t0 which can be 0
(or negative) for tiny ingests, yielding Infinity; in the return block for the
function that builds the result object (referencing elapsed, appended, and
logsPerSecond) guard the divisor by checking elapsed <= 0 and in that case set
logsPerSecond to a safe value (e.g., 0 or Math.round(appended)) or compute using
a small epsilon (e.g., use Math.max(elapsed, 1) or conditional: elapsed > 0 ?
Math.round((appended / elapsed) * 1000) : 0) before returning the object so
Infinity/NaN are never produced.
| import { resolve } from "node:path"; | ||
| import { defineConfig } from "vite"; | ||
|
|
||
| export default defineConfig({ | ||
| base: process.env.BASE_PATH ?? "/o11ykit/logsdb-engine/", | ||
| root: resolve(__dirname), |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In Vite vite.config.tsfiles loaded as ESM, is__dirnameavailable, or should configs usefileURLToPath(import.meta.url)withdirname() instead?
💡 Result:
In Vite, when the Vite config file is loaded as ESM (e.g. vite.config.ts), you should not rely on Node’s global dirname being available. Instead, prefer deriving it from the file URL, e.g. use fileURLToPath(import.meta.url) + dirname. This is the standard ESM pattern. (And Vite’s docs explicitly note it replaces dirname/filename/import.meta.url in config files and their deps.) Example: import { dirname } from 'node:path' import { fileURLToPath } from 'node:url' const dirname = dirname(fileURLToPath(import.meta.url)) Explanation (what the sources say): - Vite documents that it replaces filename, dirname, and import.meta.url when evaluating config files and their dependencies, which implies those identifiers are not guaranteed to exist like native Node ESM would. - Multiple Vite issue reports show that in true ESM/“type: module” scenarios, Node itself throws “dirname is not defined in ES module scope”, and the correct fix is to define your own dirname from import.meta.url. So: write configs using fileURLToPath(import.meta.url) + dirname, rather than expecting dirname to be present.
Citations:
- 1: https://v2.vite.dev/config
- 2: https://main.vitejs.dev/config
- 3: "__dirname is not defined in ES module scope" in vite.config.js with module type in package.json vitejs/vite#6946
- 4: Docs for MPA config incorrectly suggest using __dirname vitejs/vite#9189
- 5: https://stackoverflow.com/questions/75004188/what-does-fileurltopathimport-meta-url-do
- 6: feat: better config
__dirnamesupport vitejs/vite#8442
__dirname is not available in ESM Vite configs.
Vite evaluates config files as ESM where __dirname is undefined. Use fileURLToPath(import.meta.url) with dirname() instead to derive the directory path.
Suggested fix
-import { resolve } from "node:path";
+import { dirname, resolve } from "node:path";
+import { fileURLToPath } from "node:url";
import { defineConfig } from "vite";
+const __dirname = dirname(fileURLToPath(import.meta.url));
+
export default defineConfig({
base: process.env.BASE_PATH ?? "/o11ykit/logsdb-engine/",
root: resolve(__dirname),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| import { resolve } from "node:path"; | |
| import { defineConfig } from "vite"; | |
| export default defineConfig({ | |
| base: process.env.BASE_PATH ?? "/o11ykit/logsdb-engine/", | |
| root: resolve(__dirname), | |
| import { dirname, resolve } from "node:path"; | |
| import { fileURLToPath } from "node:url"; | |
| import { defineConfig } from "vite"; | |
| const __dirname = dirname(fileURLToPath(import.meta.url)); | |
| export default defineConfig({ | |
| base: process.env.BASE_PATH ?? "/o11ykit/logsdb-engine/", | |
| root: resolve(__dirname), |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/vite.config.ts` around lines 1 - 6, The Vite config uses
__dirname (root: resolve(__dirname)) which is undefined in ESM; replace that
with a directory derived from import.meta.url using fileURLToPath and dirname:
import fileURLToPath from 'url' and dirname from 'path' (or node:path's dirname)
and set root to resolve(dirname(fileURLToPath(import.meta.url))). Keep the
existing base and defineConfig usage and update imports accordingly so root
resolves correctly in ESM.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/o11ylogsdb/src/codec-typed.ts`:
- Around line 1168-1394: decodeBodies duplicates the template dictionary,
slot-type table transposition and the seven var-decoding passes from decode();
extract that duplicated logic into a single helper (e.g., parseTemplatesAndSlots
or parseBodySkeleton) that accepts the ByteCursor (or buf and returns a new
cursor position) and returns the intermediate structures needed by both
functions: templateById, slotTypeArrays, slotPrefixArrays, slotTsShapeArrays,
rawStringByRecord, templatedIndices, tplIds, varCounts, allVars (and any ts
state needed); then replace the in-function parsing blocks in both decodeBodies
and decode to call this helper and continue with only the function-specific
steps (reconstructing templatedBodies, sidecar JSON parsing, and final
stitching), making sure the helper preserves/advances the cursor state so
subsequent reads behave identically.
- Around line 1208-1211: The inline generic for slotTypeMap is triggering the
formatter; create a named type for the mapped value (e.g., SlotTypeInfo) and
replace Map<string, { type: number; prefix?: string; timestampShapeId?: number
}> with Map<string, SlotTypeInfo> (then declare type SlotTypeInfo = { type:
number; prefix?: string; timestampShapeId?: number } above), or alternatively
split the declaration across lines so the generic isn't inline; update the
variable declaration for slotTypeMap and keep existing uses (slotTypeMap)
unchanged.
In `@packages/o11ylogsdb/src/query.ts`:
- Around line 162-179: The loop currently calls readBodiesOnly(chunk,
store.registry, policy) and then readRecords(chunk, store.registry, policy)
which decompresses/decodes the same chunk twice; update the logic to decompress
once and reuse the result—either change readBodiesOnly to return the
decompressed payload (or a tuple of bodies + raw decompressed buffer) or add a
new helper like decompressChunk(chunk, store.registry) and pass its result into
both body-only extraction and full record decoding so that when hasMatch is true
you reuse the cached decompressed payload instead of calling readRecords on the
original chunk again; adjust references to chunk, readBodiesOnly, readRecords,
store.registry, policy and stats.decodeMillis accordingly.
- Around line 155-156: The non-null assertion on spec.bodyContains should be
removed; instead capture the value into a local constant when you evaluate the
guard and reuse that constant. Concretely, where you check the presence of
bodyContains (the earlier condition that verifies spec.bodyContains exists),
assign const bodyContains = spec.bodyContains and then pass that local variable
(e.g., bodyContains or needle) into chunkPrunedByTemplateTokens rather than
using spec.bodyContains!; update usages in the same scope so the linter
noNonNullAssertion error is avoided.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 23f4fd12-7efb-4276-803d-e384097e53c4
📒 Files selected for processing (4)
packages/o11ylogsdb/src/chunk.tspackages/o11ylogsdb/src/codec-typed.tspackages/o11ylogsdb/src/index.tspackages/o11ylogsdb/src/query.ts
| /** | ||
| * Partial decode: extract only body values from the payload. Skips the | ||
| * sidecar JSON.parse() (which is ~40-60% of full decode CPU) and | ||
| * attribute/traceId/spanId reconstruction. Returns AnyValue[] in record | ||
| * order suitable for substring matching. | ||
| * | ||
| * For KIND_OTHER records (structured bodies stored in sidecar), we must | ||
| * still parse those specific sidecar lines. But for templated and | ||
| * raw-string bodies (the 95%+ case), zero JSON parsing occurs. | ||
| */ | ||
| function decodeBodies(buf: Uint8Array, expectedN: number): AnyValue[] { | ||
| const cur = new ByteCursor(buf); | ||
| const n = cur.readVarint(); | ||
| if (n !== expectedN) { | ||
| throw new Error(`typed: count mismatch payload=${n} header=${expectedN}`); | ||
| } | ||
| // Skip timestamps (delta-encoded varints) | ||
| for (let i = 0; i < n; i++) cur.readZZVarintBig(); | ||
| // Skip severities | ||
| cur.readBytes(n); | ||
| // kinds | ||
| const kinds = new Uint8Array(cur.readBytes(n)); | ||
| // template dictionary | ||
| const nTemplates = cur.readVarint(); | ||
| const templateById = new Map<number, string[]>(); | ||
| const dec = new TextDecoder(); | ||
| for (let t = 0; t < nTemplates; t++) { | ||
| const id = cur.readVarint(); | ||
| const len = cur.readVarint(); | ||
| const tplStr = dec.decode(cur.readBytes(len)); | ||
| templateById.set( | ||
| id, | ||
| tplStr.split(/\s+/).filter((s) => s.length > 0) | ||
| ); | ||
| } | ||
| // slot-type table | ||
| const slotTypeArrays = new Map<number, Int8Array>(); | ||
| const slotPrefixArrays = new Map<number, (string | undefined)[]>(); | ||
| const slotTsShapeArrays = new Map<number, (number | undefined)[]>(); | ||
| const nSlotTypes = cur.readVarint(); | ||
| const slotTypeMap = new Map<string, { type: number; prefix?: string; timestampShapeId?: number }>(); | ||
| for (let i = 0; i < nSlotTypes; i++) { | ||
| const tplId = cur.readVarint(); | ||
| const slotIdx = cur.readVarint(); | ||
| const type = cur.readByte(); | ||
| const entry: { type: number; prefix?: string; timestampShapeId?: number } = { type }; | ||
| if (type === SLOT_PREFIXED_INT64 || type === SLOT_PREFIXED_UUID) { | ||
| const plen = cur.readVarint(); | ||
| entry.prefix = dec.decode(cur.readBytes(plen)); | ||
| } else if (type === SLOT_TIMESTAMP_DELTA) { | ||
| entry.timestampShapeId = cur.readVarint(); | ||
| } | ||
| slotTypeMap.set(`${tplId}/${slotIdx}`, entry); | ||
| } | ||
| // Transpose to per-template arrays (same as full decode) | ||
| for (const [tplId, template] of templateById) { | ||
| let nVars = 0; | ||
| for (const t of template) if (t === PARAM_STR) nVars++; | ||
| if (nVars === 0) continue; | ||
| const types = new Int8Array(nVars); | ||
| const prefixes = new Array<string | undefined>(nVars); | ||
| const tsShapes = new Array<number | undefined>(nVars); | ||
| for (let s = 0; s < nVars; s++) { | ||
| const slot = slotTypeMap.get(`${tplId}/${s}`); | ||
| if (!slot) continue; | ||
| types[s] = slot.type; | ||
| if (slot.prefix !== undefined) prefixes[s] = slot.prefix; | ||
| if (slot.timestampShapeId !== undefined) tsShapes[s] = slot.timestampShapeId; | ||
| } | ||
| slotTypeArrays.set(tplId, types); | ||
| slotPrefixArrays.set(tplId, prefixes); | ||
| slotTsShapeArrays.set(tplId, tsShapes); | ||
| } | ||
| // raw-string bodies | ||
| const rawStringByRecord = new Map<number, string>(); | ||
| for (let i = 0; i < n; i++) { | ||
| if (kinds[i] !== KIND_RAW_STRING) continue; | ||
| const len = cur.readVarint(); | ||
| rawStringByRecord.set(i, dec.decode(cur.readBytes(len))); | ||
| } | ||
| // templated columns | ||
| const templatedIndices: number[] = []; | ||
| for (let i = 0; i < n; i++) if (kinds[i] === KIND_TEMPLATED) templatedIndices.push(i); | ||
| const tplIds: number[] = new Array(templatedIndices.length); | ||
| for (let i = 0; i < templatedIndices.length; i++) tplIds[i] = cur.readVarint(); | ||
| const varCounts: number[] = new Array(templatedIndices.length); | ||
| for (let i = 0; i < templatedIndices.length; i++) varCounts[i] = cur.readVarint(); | ||
| const allVars: string[][] = templatedIndices.map( | ||
| (_, i) => new Array<string>(varCounts[i] as number) | ||
| ); | ||
| // 7 passes (same as full decode — bodies depend on variable reconstruction) | ||
| // Pass 1: SLOT_STRING | ||
| for (let i = 0; i < templatedIndices.length; i++) { | ||
| const tplId = tplIds[i] as number; | ||
| const types = slotTypeArrays.get(tplId); | ||
| if (!types) continue; | ||
| const nVars = varCounts[i] as number; | ||
| const vars = allVars[i] as string[]; | ||
| for (let s = 0; s < nVars; s++) { | ||
| if (types[s] !== SLOT_STRING) continue; | ||
| const len = cur.readVarint(); | ||
| vars[s] = dec.decode(cur.readBytes(len)); | ||
| } | ||
| } | ||
| // Pass 2: SLOT_SIGNED_INT | ||
| for (let i = 0; i < templatedIndices.length; i++) { | ||
| const tplId = tplIds[i] as number; | ||
| const types = slotTypeArrays.get(tplId); | ||
| if (!types) continue; | ||
| const nVars = varCounts[i] as number; | ||
| const vars = allVars[i] as string[]; | ||
| for (let s = 0; s < nVars; s++) { | ||
| if (types[s] !== SLOT_SIGNED_INT) continue; | ||
| vars[s] = cur.readZZVarintBig().toString(); | ||
| } | ||
| } | ||
| // Pass 3: SLOT_PREFIXED_INT64 | ||
| for (let i = 0; i < templatedIndices.length; i++) { | ||
| const tplId = tplIds[i] as number; | ||
| const types = slotTypeArrays.get(tplId); | ||
| if (!types) continue; | ||
| const prefixes = slotPrefixArrays.get(tplId); | ||
| const nVars = varCounts[i] as number; | ||
| const vars = allVars[i] as string[]; | ||
| for (let s = 0; s < nVars; s++) { | ||
| if (types[s] !== SLOT_PREFIXED_INT64) continue; | ||
| const big = cur.readI64LE(); | ||
| vars[s] = `${prefixes?.[s] ?? ""}${big.toString()}`; | ||
| } | ||
| } | ||
| // Pass 4: SLOT_UUID | ||
| for (let i = 0; i < templatedIndices.length; i++) { | ||
| const tplId = tplIds[i] as number; | ||
| const types = slotTypeArrays.get(tplId); | ||
| if (!types) continue; | ||
| const nVars = varCounts[i] as number; | ||
| const vars = allVars[i] as string[]; | ||
| for (let s = 0; s < nVars; s++) { | ||
| if (types[s] !== SLOT_UUID) continue; | ||
| vars[s] = bytesToUuid(cur.readBytes(16)); | ||
| } | ||
| } | ||
| // Pass 5: SLOT_UUID_NODASH | ||
| for (let i = 0; i < templatedIndices.length; i++) { | ||
| const tplId = tplIds[i] as number; | ||
| const types = slotTypeArrays.get(tplId); | ||
| if (!types) continue; | ||
| const nVars = varCounts[i] as number; | ||
| const vars = allVars[i] as string[]; | ||
| for (let s = 0; s < nVars; s++) { | ||
| if (types[s] !== SLOT_UUID_NODASH) continue; | ||
| vars[s] = bytesToUuidNodash(cur.readBytes(16)); | ||
| } | ||
| } | ||
| // Pass 6: SLOT_PREFIXED_UUID | ||
| for (let i = 0; i < templatedIndices.length; i++) { | ||
| const tplId = tplIds[i] as number; | ||
| const types = slotTypeArrays.get(tplId); | ||
| if (!types) continue; | ||
| const prefixes = slotPrefixArrays.get(tplId); | ||
| const nVars = varCounts[i] as number; | ||
| const vars = allVars[i] as string[]; | ||
| for (let s = 0; s < nVars; s++) { | ||
| if (types[s] !== SLOT_PREFIXED_UUID) continue; | ||
| vars[s] = `${prefixes?.[s] ?? ""}${bytesToUuid(cur.readBytes(16))}`; | ||
| } | ||
| } | ||
| // Pass 7: SLOT_TIMESTAMP_DELTA | ||
| const tsPrev = new Map<number, bigint>(); | ||
| for (let i = 0; i < templatedIndices.length; i++) { | ||
| const tplId = tplIds[i] as number; | ||
| const types = slotTypeArrays.get(tplId); | ||
| if (!types) continue; | ||
| const tsShapeIds = slotTsShapeArrays.get(tplId); | ||
| const nVars = varCounts[i] as number; | ||
| const vars = allVars[i] as string[]; | ||
| for (let s = 0; s < nVars; s++) { | ||
| if (types[s] !== SLOT_TIMESTAMP_DELTA) continue; | ||
| const dt = cur.readZZVarintBig(); | ||
| const key = (tplId << 16) | s; | ||
| const prev = tsPrev.get(key) ?? 0n; | ||
| const cur2 = prev + dt; | ||
| tsPrev.set(key, cur2); | ||
| const shape = tsShape(tsShapeIds?.[s] as number); | ||
| vars[s] = shape.format(cur2); | ||
| } | ||
| } | ||
| // Reconstruct templated bodies | ||
| const templatedBodies: string[] = new Array(templatedIndices.length); | ||
| for (let i = 0; i < templatedIndices.length; i++) { | ||
| const tplId = tplIds[i] as number; | ||
| const template = templateById.get(tplId); | ||
| if (!template) throw new Error(`typed: missing template id ${tplId}`); | ||
| templatedBodies[i] = reconstruct(template, allVars[i] as string[]); | ||
| } | ||
|
|
||
| // Read sidecar ONLY for KIND_OTHER records (structured bodies) | ||
| const hasOther = kinds.some((k) => k === KIND_OTHER); | ||
| let otherBodies: Map<number, AnyValue> | undefined; | ||
| if (hasOther) { | ||
| const sidecarLen = cur.readVarint(); | ||
| if (sidecarLen > 0) { | ||
| const sidecarText = dec.decode(cur.readBytes(sidecarLen)); | ||
| const lines = sidecarText.split("\n").filter((l) => l.length > 0); | ||
| otherBodies = new Map(); | ||
| for (let i = 0; i < n; i++) { | ||
| if (kinds[i] !== KIND_OTHER) continue; | ||
| const side = JSON.parse(lines[i] as string) as Record<string, unknown>; | ||
| otherBodies.set(i, jsonToAnyValue(side.b)); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Assemble body-only output (no LogRecord construction, no attribute parse) | ||
| const out: AnyValue[] = new Array(n); | ||
| let templatedCursor = 0; | ||
| for (let i = 0; i < n; i++) { | ||
| if (kinds[i] === KIND_RAW_STRING) { | ||
| out[i] = rawStringByRecord.get(i) ?? ""; | ||
| } else if (kinds[i] === KIND_TEMPLATED) { | ||
| out[i] = templatedBodies[templatedCursor++] as string; | ||
| } else { | ||
| out[i] = otherBodies?.get(i) ?? ""; | ||
| } | ||
| } | ||
| return out; | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Significant duplication with decode().
decodeBodies() duplicates ~200 lines from decode(): template dictionary parsing, slot-type table transposition, all 7 var-decoding passes. Only the final record-stitching differs.
Extract shared parsing into a helper returning intermediate structures, then have both functions consume it.
🧰 Tools
🪛 GitHub Actions: validate-o11ykit-monorepo
[error] 1208-1211: biome formatter: Formatter would have printed different generic Map type formatting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/o11ylogsdb/src/codec-typed.ts` around lines 1168 - 1394,
decodeBodies duplicates the template dictionary, slot-type table transposition
and the seven var-decoding passes from decode(); extract that duplicated logic
into a single helper (e.g., parseTemplatesAndSlots or parseBodySkeleton) that
accepts the ByteCursor (or buf and returns a new cursor position) and returns
the intermediate structures needed by both functions: templateById,
slotTypeArrays, slotPrefixArrays, slotTsShapeArrays, rawStringByRecord,
templatedIndices, tplIds, varCounts, allVars (and any ts state needed); then
replace the in-function parsing blocks in both decodeBodies and decode to call
this helper and continue with only the function-specific steps (reconstructing
templatedBodies, sidecar JSON parsing, and final stitching), making sure the
helper preserves/advances the cursor state so subsequent reads behave
identically.
| const t0 = nowMillis(); | ||
| const bodies = readBodiesOnly(chunk, store.registry, policy); | ||
| let hasMatch = false; | ||
| for (let i = 0; i < bodies.length; i++) { | ||
| if (typeof bodies[i] === "string" && (bodies[i] as string).includes(needle)) { | ||
| hasMatch = true; | ||
| break; | ||
| } | ||
| } | ||
| if (!hasMatch) { | ||
| // No body in this chunk matches — skip full decode entirely | ||
| stats.decodeMillis += nowMillis() - t0; | ||
| stats.chunksPruned++; | ||
| continue; | ||
| } | ||
| // Some bodies match — need full records for time/severity post-filtering | ||
| const records = readRecords(chunk, store.registry, policy); | ||
| stats.decodeMillis += nowMillis() - t0; |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Double decompression for matching chunks.
When a body match is found (line 171), you decompress & decode the chunk twice: once via readBodiesOnly (line 163), again via readRecords (line 178). For highly selective queries this trade-off pays off, but for queries matching most chunks, performance degrades.
Consider caching the decompressed payload or restructuring to do a single decompress with optional full-record materialization.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/o11ylogsdb/src/query.ts` around lines 162 - 179, The loop currently
calls readBodiesOnly(chunk, store.registry, policy) and then readRecords(chunk,
store.registry, policy) which decompresses/decodes the same chunk twice; update
the logic to decompress once and reuse the result—either change readBodiesOnly
to return the decompressed payload (or a tuple of bodies + raw decompressed
buffer) or add a new helper like decompressChunk(chunk, store.registry) and pass
its result into both body-only extraction and full record decoding so that when
hasMatch is true you reuse the cached decompressed payload instead of calling
readRecords on the original chunk again; adjust references to chunk,
readBodiesOnly, readRecords, store.registry, policy and stats.decodeMillis
accordingly.
The previous implementation had a heuristic that pruned chunks when the needle 'looked like a keyword' (alphanumeric, >= 3 chars). This is UNSOUND because variable values (PARAM_STR slots) can contain any string including keywords like 'error', 'admin', 'timeout'. The pruner would incorrectly skip chunks containing real matches. Template-token pruning requires additional metadata (bloom filters or variable-value token sets) to be correct. Disabled until we can implement it soundly. The partial column decode (decodeBodiesOnly) remains and provides a real 2-4× speedup on body_substring queries by skipping sidecar JSON.parse. This is the honest, safe optimization. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/o11ylogsdb/src/query.ts`:
- Around line 149-171: The body-only fast path can incorrectly prune chunks when
a policy defines postDecode normalization because readBodiesOnly skips
policy.postDecode while readRecords applies it; update the use of the fast path
in useBodyFastPath handling (the block that computes needle, calls
chunkPrunedByTemplateTokens, readBodiesOnly and sets hasMatch) to skip the
body-only prefilter whenever policy.postDecode is present (or would mutate body
strings), i.e., only run the readBodiesOnly/hasMatch shortcut if
policy.postDecode is undefined/false; otherwise fall through to the full
readRecords path so postDecode is applied before substring checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 0be6b66a-7b94-4793-b0a0-038b97f2e222
📒 Files selected for processing (1)
packages/o11ylogsdb/src/query.ts
| if (useBodyFastPath) { | ||
| // Template-token pruning: if the chunk header carries template | ||
| // literal tokens (toks), check if any token contains the needle | ||
| // as a substring. If no template token matches AND the chunk has | ||
| // no raw-string bodies (raw strings might still match), we can | ||
| // skip ZSTD decompression entirely. | ||
| const needle = spec.bodyContains!; | ||
| if (chunkPrunedByTemplateTokens(chunk, needle)) { | ||
| stats.chunksPruned++; | ||
| continue; | ||
| } | ||
| // Fast path: decode only bodies, check which match the | ||
| // substring. Only do full decode if there are body matches. | ||
| const t0 = nowMillis(); | ||
| const bodies = readBodiesOnly(chunk, store.registry, policy); | ||
| let hasMatch = false; | ||
| for (let i = 0; i < bodies.length; i++) { | ||
| if (typeof bodies[i] === "string" && (bodies[i] as string).includes(needle)) { | ||
| hasMatch = true; | ||
| break; | ||
| } | ||
| } | ||
| if (!hasMatch) { |
There was a problem hiding this comment.
Don't use the body-only prefilter for postDecode-only policies.
hasMatch is computed from readBodiesOnly(), but that helper's fallback skips policy.postDecode() while readRecords() still applies it. That means this branch can prune a chunk that would match after normalization, which breaks the substring contract documented at Lines 30-33.
🔧 Minimal safe guard
- if (useBodyFastPath) {
+ const canUseBodyFastPath =
+ useBodyFastPath &&
+ (policy?.decodeBodiesOnly !== undefined ||
+ policy?.decodePayload !== undefined ||
+ policy?.postDecode === undefined);
+ if (canUseBodyFastPath) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/o11ylogsdb/src/query.ts` around lines 149 - 171, The body-only fast
path can incorrectly prune chunks when a policy defines postDecode normalization
because readBodiesOnly skips policy.postDecode while readRecords applies it;
update the use of the fast path in useBodyFastPath handling (the block that
computes needle, calls chunkPrunedByTemplateTokens, readBodiesOnly and sets
hasMatch) to skip the body-only prefilter whenever policy.postDecode is present
(or would mutate body strings), i.e., only run the readBodiesOnly/hasMatch
shortcut if policy.postDecode is undefined/false; otherwise fall through to the
full readRecords path so postDecode is applied before substring checks.
- Fix noNonNullAssertion in query.ts (use explicit check) - Fix noAssignInExpressions in synthetic-corpora.ts and data-gen.js - Fix useIterableCallbackReturn in app.js - Fix useButtonType in index.html - Fix unused variable in data-gen.test.js Auto-formatted with biome --write
|
Fixed the lint errors:
Remaining Review ItemsThe following are architectural/concerns that require more analysis to address: Critical:
Major:
Minor/Quality:
RecommendationThese are substantial changes that warrant their own PRs. The lint issues are resolved. The remaining concerns are valid but complex. I recommend:
|
There was a problem hiding this comment.
Actionable comments posted: 14
♻️ Duplicate comments (5)
site/logsdb-engine/css/base.css (1)
8-10:⚠️ Potential issue | 🟡 MinorReplace the self-referential font tokens.
These custom properties reference themselves, creating a cycle. Consumers using
var(--mono),var(--sans), orvar(--display)can end up with an invalid value instead of the intended fallback stack. Assign the font lists directly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@site/logsdb-engine/css/base.css` around lines 8 - 10, The custom properties --mono, --sans, and --display are self-referential and create a cycle; replace their values with the actual font stacks instead of var(...) so consumers using var(--mono), var(--sans), or var(--display) get valid fallbacks — set --mono to the JetBrains Mono + monospace stack, --sans to Inter Tight + a sane sans-serif stack, and --display to Fraunces + serif (update the declarations for --mono, --sans, and --display in the CSS).packages/o11ylogsdb/src/query.ts (1)
125-128:⚠️ Potential issue | 🔴 CriticalDisable the body fast path when the policy has
postDecode().
readBodiesOnly()never runspolicy.postDecode(), butreadRecords()and the documented substring semantics operate on post-decoded bodies. Any policy that normalizes body strings inpostDecode()can be falsely pruned by this branch. Guard the optimization onpolicy.postDecode === undefined.Suggested fix
- const useBodyFastPath = spec.bodyContains !== undefined && !spec.bodyLeafEquals; + const wantsBodyFastPath = spec.bodyContains !== undefined && !spec.bodyLeafEquals; ... - if (useBodyFastPath) { + const canUseBodyFastPath = wantsBodyFastPath && policy?.postDecode === undefined; + if (canUseBodyFastPath) {Also applies to: 149-179
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/o11ylogsdb/src/query.ts` around lines 125 - 128, The current body-fast-path condition (used to choose readBodiesOnly()) ignores that policies with a postDecode() normalization can change bodies, so update the logic computing useBodyFastPath to also require policy.postDecode === undefined; specifically, where useBodyFastPath is computed from spec.bodyContains and spec.bodyLeafEquals, add a guard checking policy.postDecode is undefined so we never take the fast path when the policy defines postDecode (also apply the same additional guard to the similar body-only fast-path decision later in the same file).site/logsdb-engine/js/logs-model.js (1)
149-156:⚠️ Potential issue | 🟠 Major
errorRateis still derived from a capped query.
recordsEmittedstops at10_000, so any service above that error volume reports a smallererrorscount and a lowererrorRatethan reality. Use an uncapped count/aggregate path or label this metric as sampled.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@site/logsdb-engine/js/logs-model.js` around lines 149 - 156, The error count is being taken from a capped query (query(store, {..., limit: 10000})) so services[svc].errors (and thus services[svc].errorRate) is understated for high-volume services; update the logic in the block that calls query(store, ...) to use an uncapped aggregate/count path (e.g., a count() or aggregate response instead of limiting to 10000) to set services[svc].errors, or if you must keep sampling explicitly mark the metric as sampled (e.g., set a sampled flag or change the label) so callers know it's not a full count—adjust the code that computes errorResult, services[svc].errors, and services[svc].errorRate accordingly.site/logsdb-engine/js/data-gen.js (2)
502-510:⚠️ Potential issue | 🟡 MinorExact batch boundaries never emit completion progress.
When
count % batchSize === 0, the last in-loop yield reportsi / countand clearsbatch, so theprogress: 1branch never runs. Consumers waiting for an explicit completion event won't see one.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@site/logsdb-engine/js/data-gen.js` around lines 502 - 510, The generator currently yields batch progress inside the loop using batch, batchSize, count and i but when count % batchSize === 0 the final in-loop yield clears batch and the post-loop if (batch.length > 0) yield { batch, progress: 1 } never runs, so consumers miss an explicit completion event; fix by emitting an unconditional completion yield after the loop (e.g., always yield { batch: [], progress: 1 } or yield a final marker) so completion is signaled even when the last batch was emitted inside the loop, ensuring callers always receive a progress: 1 event.
335-340:⚠️ Potential issue | 🟠 MajorSeeded datasets are still time-dependent.
seedresets_seed, butbaseTimestill comes fromDate.now(), so identical options produce differenttimeUnixNanovalues on each run. That breaks reproducible benches and the demo's determinism contract.Also applies to: 452-458
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@site/logsdb-engine/js/data-gen.js` around lines 335 - 340, The generateLogs function sets _seed but uses Date.now() to compute baseTime, making outputs time-dependent; change generateLogs to derive baseTime deterministically (either accept and use an explicit opts.baseTime when provided, or compute baseTime from the seed/constant rather than Date.now()) so that timeUnixNano values are reproducible; update all places that compute baseTime (generateLogs and the other occurrence around lines 452-458) to reference the deterministic baseTime and preserve existing behavior when an explicit baseTime option is supplied.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/o11ylogsdb/bench/ingest-throughput.bench.ts`:
- Around line 109-111: The warmup currently only generates syslog records via
CORPUS_GENERATORS.syslog(1000) but never passes them through the ingest path;
change the warmup to create a temporary/throwaway store (in-memory or a temp
dir) and run the same ingest/codec path used by the benchmark using records from
CORPUS_GENERATORS.syslog (e.g., generate and ingest ~10k records) so ZSTD
dictionaries, codecs, and any JIT compilation are exercised; after warming up,
close/remove the throwaway store and proceed with the real benchmark.
In `@packages/o11ylogsdb/bench/synthetic-corpora.ts`:
- Around line 345-374: The loop is fragile because calling
generateSyslogCorpus/generateStructuredCorpus/generateHighCardinalityCorpus/generateCloudNativeCorpus
with _seed lets those functions call setSeed(_seed) and clobber module RNG
state; fix by removing seed side-effects: refactor the sub-generators to accept
a RNG/seedless API (or an explicit RNG instance) and have them draw from the
shared rand() state instead of calling setSeed, then update the loop to call
generateXCorpus(1, rng) or generateXCorpus(1) (using the shared RNG) and
eliminate the ad-hoc _seed += 7 adjustments so the mixed corpus uses a single
consistent RNG stream.
In `@site/logsdb-engine/css/demo.css`:
- Around line 450-452: Replace the deprecated CSS property in the .log-body code
rule: remove word-break: break-word and add overflow-wrap: anywhere to achieve
the same wrap behavior; update the .log-body code selector in demo.css to use
overflow-wrap: anywhere so the styles remain standards-compliant.
In `@site/logsdb-engine/index.html`:
- Around line 101-131: The tab markup lacks ARIA roles and attributes, so update
the tab container (`#tabs-panel` / .tab-bar) and the tab buttons (.tab-btn with
data-tab="storage|logs|query") to include role="tablist" on the container,
role="tab" on each button, aria-selected="true/false" and aria-controls pointing
to the corresponding panel id (panel-storage, panel-logs, panel-query), and add
id attributes on the buttons to use as aria-labelledby values; also mark the
panels (panel-storage, panel-logs, panel-query) with role="tabpanel" and
aria-labelledby that references the matching tab id and toggle
hidden/aria-hidden consistently when activating a tab; ensure any JS that
switches tabs (the click/keyboard handler tied to .tab-btn or `#tabs-panel`)
updates aria-selected, aria-hidden, aria-controls/aria-labelledby and visible
state together so screen readers and keyboard users get correct semantics.
In `@site/logsdb-engine/js/app.js`:
- Around line 461-462: The variable _svcDist is assigned from
computeServiceDistribution(records) but never used; either remove the unused
declaration (_svcDist = computeServiceDistribution(records)) to eliminate dead
code or replace it with a used variable and wire it into the UI/logic where
service distribution is needed (refer to computeServiceDistribution and the
records input) so the result is consumed rather than ignored.
- Around line 62-64: The escapeHtml function currently only replaces &, <, and
>; extend it to also replace double-quote (") and single-quote (') characters so
escaped strings are safe when inserted into HTML attributes as well as text
nodes. Update the escapeHtml implementation (function name: escapeHtml) to
include replacements for " and ' (and optionally `) and ensure any call sites
that may place values into attributes continue to use escapeHtml. Keep the
escaping order so & is replaced first to avoid double-escaping.
In `@site/logsdb-engine/js/data-gen.js`:
- Around line 154-160: The module currently uses a shared module-level _seed and
a mutating mulberry32() which makes generators like generateLogBatches
non-reentrant; change mulberry32 into a pure RNG factory that accepts an initial
seed and returns a bound random function (e.g., function makeMulberry32(seed) {
return () => { ...use and update local seed... } }), remove mutations of
module-level _seed, and update generateLogBatches and any other exported
generators/helpers (references: mulberry32, _seed, generateLogBatches and the
other RNG-using helpers around the file) to create a per-invocation rng via the
factory and pass that rng down to helper functions instead of relying on shared
state.
In `@site/logsdb-engine/js/logs-model.js`:
- Around line 18-23: The warnings list is being capped with query(store, {
severityGte: 13, limit: 500 }) before dropping severityNumber >= 17, which lets
errors/fatals crowd out warnings; change the logic so the <17 filter is applied
inside the query or perform a separate query for warnings (e.g., call
query(store, { severityGte: 13, severityLt: 17, limit: 500 }) or query only for
severity <17) and then derive warningRecords from that result instead of
filtering warnings.records afterwards—update the code around the warnings
variable and warningRecords usage to use the pre-filtered query results.
- Around line 19-20: The code treats query(store, { limit: N }) as "recent" but
that API returns traversal order, not newest-first; update all places using
recent = query(store, { limit: ... }), the template analysis, timeline, and
getRecentLogs() to first perform an explicit top-N-by-timestamp selection (i.e.,
sort or use a timestamp-based index/path and pick the highest timestamps) and
only then slice/limit, so results are ordered newest-first by the record's
timestamp field before returning or further processing.
In `@site/logsdb-engine/js/query-model.js`:
- Around line 85-94: The code sets stats.totalTimeMs and stats.recordsPerMs to
strings via toFixed, which can surprise consumers; update the return from the
function so totalTimeMs is a numeric value (e.g., totalTimeMs:
Number(elapsed.toFixed(1)) or Math.round/appropriate numeric cast) and preserve
a formatted string alongside it (e.g., totalTimeMsFormatted or totalTimeMsStr),
and do the same for recordsPerMs (keep a numeric recordsPerMs and a formatted
recordsPerMsStr) using the existing elapsed and result.stats.recordsEmitted
values to compute both forms.
In `@site/logsdb-engine/js/storage-model.js`:
- Around line 27-39: The demo store in createStore currently sets
TypedColumnarDrainPolicy with bodyCodec: "raw" (in LogStore creation), which
disables compression and can mislead compression metrics; update the code near
createStore/LogStore/TypedColumnarDrainPolicy to add a clear comment (or a small
UI/debug flag) stating that bodyCodec is intentionally "raw" for the demo to
avoid WASM ZSTD overhead and therefore compression metrics do not apply, or
expose an option to switch to "zstd-19" for benchmarking so readers know why
compression differs.
In `@site/logsdb-engine/test/data-gen.test.js`:
- Around line 91-103: The test currently iterates over generateLogBatches only
collecting batches; record and assert the final progress value instead of
ignoring it. Modify the test that calls generateLogBatches to capture each
yielded item's progress (e.g., track lastProgress from each iteration of
generateLogBatches) and add an assertion that lastProgress === 1 after the loop;
reference the generateLogBatches generator and the yielded progress property to
locate where to add the tracking and the new expect(lastProgress).toBe(1)
assertion.
- Around line 12-17: The determinism test for generateLogs is incomplete because
it only asserts body and severityText; update the test to also assert the time
field to catch timestamp drift: when calling generateLogs({ count: 50,
durationMinutes: 5, seed: 123 }) for a and b, include
expect(a.records[0].timeUnixNano).toEqual(b.records[0].timeUnixNano) and
expect(a.records[49].timeUnixNano).toEqual(b.records[49].timeUnixNano) (or
alternatively replace the selective assertions with a deep equality check of
normalized full records from generateLogs to ensure complete deterministic
output).
In `@site/logsdb-engine/test/query-model.test.js`:
- Around line 165-175: Add a test case to verify computeServiceDistribution
falls back to "unknown" when service.name is missing: create records that either
have no attributes or have attributes that don't include key "service.name" (in
addition to existing valid records), call computeServiceDistribution(records)
and assert that dist.unknown equals the count of those fallback records while
existing assertions for dist.a/dist.b remain valid; reference
computeServiceDistribution and the test block that currently asserts
dist.a/dist.b to place this new expectation.
---
Duplicate comments:
In `@packages/o11ylogsdb/src/query.ts`:
- Around line 125-128: The current body-fast-path condition (used to choose
readBodiesOnly()) ignores that policies with a postDecode() normalization can
change bodies, so update the logic computing useBodyFastPath to also require
policy.postDecode === undefined; specifically, where useBodyFastPath is computed
from spec.bodyContains and spec.bodyLeafEquals, add a guard checking
policy.postDecode is undefined so we never take the fast path when the policy
defines postDecode (also apply the same additional guard to the similar
body-only fast-path decision later in the same file).
In `@site/logsdb-engine/css/base.css`:
- Around line 8-10: The custom properties --mono, --sans, and --display are
self-referential and create a cycle; replace their values with the actual font
stacks instead of var(...) so consumers using var(--mono), var(--sans), or
var(--display) get valid fallbacks — set --mono to the JetBrains Mono +
monospace stack, --sans to Inter Tight + a sane sans-serif stack, and --display
to Fraunces + serif (update the declarations for --mono, --sans, and --display
in the CSS).
In `@site/logsdb-engine/js/data-gen.js`:
- Around line 502-510: The generator currently yields batch progress inside the
loop using batch, batchSize, count and i but when count % batchSize === 0 the
final in-loop yield clears batch and the post-loop if (batch.length > 0) yield {
batch, progress: 1 } never runs, so consumers miss an explicit completion event;
fix by emitting an unconditional completion yield after the loop (e.g., always
yield { batch: [], progress: 1 } or yield a final marker) so completion is
signaled even when the last batch was emitted inside the loop, ensuring callers
always receive a progress: 1 event.
- Around line 335-340: The generateLogs function sets _seed but uses Date.now()
to compute baseTime, making outputs time-dependent; change generateLogs to
derive baseTime deterministically (either accept and use an explicit
opts.baseTime when provided, or compute baseTime from the seed/constant rather
than Date.now()) so that timeUnixNano values are reproducible; update all places
that compute baseTime (generateLogs and the other occurrence around lines
452-458) to reference the deterministic baseTime and preserve existing behavior
when an explicit baseTime option is supplied.
In `@site/logsdb-engine/js/logs-model.js`:
- Around line 149-156: The error count is being taken from a capped query
(query(store, {..., limit: 10000})) so services[svc].errors (and thus
services[svc].errorRate) is understated for high-volume services; update the
logic in the block that calls query(store, ...) to use an uncapped
aggregate/count path (e.g., a count() or aggregate response instead of limiting
to 10000) to set services[svc].errors, or if you must keep sampling explicitly
mark the metric as sampled (e.g., set a sampled flag or change the label) so
callers know it's not a full count—adjust the code that computes errorResult,
services[svc].errors, and services[svc].errorRate accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 33ab3148-391d-4d28-b62e-216043b592ef
📒 Files selected for processing (16)
packages/o11ylogsdb/bench/comprehensive-query.bench.tspackages/o11ylogsdb/bench/comprehensive-storage.bench.tspackages/o11ylogsdb/bench/ingest-throughput.bench.tspackages/o11ylogsdb/bench/synthetic-corpora.tspackages/o11ylogsdb/src/codec-typed.tspackages/o11ylogsdb/src/query.tssite/logsdb-engine/css/base.csssite/logsdb-engine/css/demo.csssite/logsdb-engine/index.htmlsite/logsdb-engine/js/app.jssite/logsdb-engine/js/data-gen.jssite/logsdb-engine/js/logs-model.jssite/logsdb-engine/js/query-model.jssite/logsdb-engine/js/storage-model.jssite/logsdb-engine/test/data-gen.test.jssite/logsdb-engine/test/query-model.test.js
| // Warmup pass | ||
| process.stderr.write(" Warmup…\n"); | ||
| CORPUS_GENERATORS.syslog(1000); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Warmup doesn't exercise the ingest path.
The warmup only generates 1000 syslog records but discards them. This warms up the RNG and generator but not the store/codec path. Consider also ingesting into a throwaway store to warm up ZSTD dictionaries and JIT.
♻️ Suggested warmup improvement
// Warmup pass
process.stderr.write(" Warmup…\n");
- CORPUS_GENERATORS.syslog(1000);
+ const warmupRecords = CORPUS_GENERATORS.syslog(1000);
+ const warmupStore = new LogStore({
+ registry: defaultRegistry().register(new ZstdCodec(19)),
+ policy: new TypedColumnarDrainPolicy({ bodyCodec: "zstd-19" }),
+ rowsPerChunk: 1024,
+ });
+ for (const r of warmupRecords) {
+ warmupStore.append(buildResource("syslog"), SCOPE, r);
+ }
+ warmupStore.flush();📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Warmup pass | |
| process.stderr.write(" Warmup…\n"); | |
| CORPUS_GENERATORS.syslog(1000); | |
| // Warmup pass | |
| process.stderr.write(" Warmup…\n"); | |
| const warmupRecords = CORPUS_GENERATORS.syslog(1000); | |
| const warmupStore = new LogStore({ | |
| registry: defaultRegistry().register(new ZstdCodec(19)), | |
| policy: new TypedColumnarDrainPolicy({ bodyCodec: "zstd-19" }), | |
| rowsPerChunk: 1024, | |
| }); | |
| for (const r of warmupRecords) { | |
| warmupStore.append(buildResource("syslog"), SCOPE, r); | |
| } | |
| warmupStore.flush(); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/o11ylogsdb/bench/ingest-throughput.bench.ts` around lines 109 - 111,
The warmup currently only generates syslog records via
CORPUS_GENERATORS.syslog(1000) but never passes them through the ingest path;
change the warmup to create a temporary/throwaway store (in-memory or a temp
dir) and run the same ingest/codec path used by the benchmark using records from
CORPUS_GENERATORS.syslog (e.g., generate and ingest ~10k records) so ZSTD
dictionaries, codecs, and any JIT compilation are exercised; after warming up,
close/remove the throwaway store and proceed with the real benchmark.
| for (let i = 0; i < count; i++) { | ||
| const roll = rand(); | ||
| let record: LogRecord; | ||
|
|
||
| if (roll < 0.4) { | ||
| // 40% syslog-style templated | ||
| const [r] = generateSyslogCorpus(1, _seed); | ||
| record = r!; | ||
| _seed += 7; | ||
| } else if (roll < 0.75) { | ||
| // 35% structured JSON | ||
| const [r] = generateStructuredCorpus(1, _seed); | ||
| record = r!; | ||
| _seed += 7; | ||
| } else if (roll < 0.9) { | ||
| // 15% high-cardinality | ||
| const [r] = generateHighCardinalityCorpus(1, _seed); | ||
| record = r!; | ||
| _seed += 7; | ||
| } else { | ||
| // 10% cloud-native | ||
| const [r] = generateCloudNativeCorpus(1, _seed); | ||
| record = r!; | ||
| _seed += 7; | ||
| } | ||
|
|
||
| // Override timestamp to maintain ordering | ||
| record!.timeUnixNano = baseNs + BigInt(i) * 1_000_000_000n; | ||
| records.push(record!); | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Mixed corpus seed management is fragile.
Each sub-generator call (e.g., generateSyslogCorpus(1, _seed)) internally calls setSeed(_seed), overwriting the module-level _seed. Then the outer loop does _seed += 7. This works but is confusing — the seed passed to sub-generators is immediately clobbered by setSeed(), and the += 7 increment is somewhat arbitrary.
♻️ Consider isolating sub-generator calls
Instead of relying on setSeed side effects, generate the mixed corpus inline using the shared RNG state without calling the full sub-generators:
if (roll < 0.4) {
// 40% syslog-style templated
- const [r] = generateSyslogCorpus(1, _seed);
- record = r!;
- _seed += 7;
+ const { severityNumber, severityText } = randomSeverity();
+ record = {
+ timeUnixNano: 0n, // overwritten below
+ severityNumber,
+ severityText,
+ body: fillSyslogTemplate(pick(SYSLOG_TEMPLATES)),
+ attributes: [{ key: "host", value: `server-${randInt(1, 20)}` }],
+ };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/o11ylogsdb/bench/synthetic-corpora.ts` around lines 345 - 374, The
loop is fragile because calling
generateSyslogCorpus/generateStructuredCorpus/generateHighCardinalityCorpus/generateCloudNativeCorpus
with _seed lets those functions call setSeed(_seed) and clobber module RNG
state; fix by removing seed side-effects: refactor the sub-generators to accept
a RNG/seedless API (or an explicit RNG instance) and have them draw from the
shared rand() state instead of calling setSeed, then update the loop to call
generateXCorpus(1, rng) or generateXCorpus(1) (using the shared RNG) and
eliminate the ad-hoc _seed += 7 adjustments so the mixed corpus uses a single
consistent RNG stream.
| .log-body code { | ||
| font-size: 0.7rem; | ||
| word-break: break-word; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Check if the file exists and view the specific lines
if [ -f "site/logsdb-engine/css/demo.css" ]; then
echo "=== File content around lines 450-452 ==="
sed -n '448,455p' site/logsdb-engine/css/demo.css
else
echo "File not found at specified path"
find . -name "demo.css" -type f 2>/dev/null | head -20
fiRepository: strawgate/o11ykit
Length of output: 205
🌐 Web query:
word-break: break-word CSS deprecated status 2024 2025
💡 Result:
word-break: break-word is deprecated in CSS. It remains supported in browsers for backward compatibility but should not be used in new code. The CSS Text Module Level 3/4 specification marks it as deprecated, with the same effect as word-break: normal combined with overflow-wrap: anywhere.
Citations:
- 1: https://rocketvalidator.com/html-validation/css-the-value-break-word-is-deprecated
- 2: https://developer.mozilla.org/en-US/docs/Web/CSS/word-break
- 3: https://developer.mozilla.org/en-US/docs/Web/CSS/Reference/Properties/word-break
🌐 Web query:
overflow-wrap: anywhere vs word-break: break-word CSS standard
💡 Result:
overflow-wrap: anywhere and word-break: break-word are both CSS properties/values that control line breaking for long words to prevent overflow, but they differ in behavior, especially regarding intrinsic sizing. Key points from CSS Text Module Level 3 (Candidate Recommendation) and MDN docs: - overflow-wrap: anywhere - Breaks unbreakable strings (long words/URLs) at arbitrary points only if needed to prevent overflow. - Soft wrap opportunities (potential break points) are considered when calculating min-content intrinsic sizes. - This affects flexbox, grid, floats, etc., where containers size based on content: the container can be smaller because breaks are anticipated. - word-break: break-word (deprecated/legacy) - Behaves like overflow-wrap: anywhere + word-break: normal. - For compatibility; now largely superseded by overflow-wrap: anywhere. - Supported in browsers but discouraged in favor of overflow-wrap. In most visual cases, they appear similar. The difference shows in min-content sizing scenarios (e.g., width: min-content with max-width constraint, flex items): - anywhere allows smaller containers by accounting for breaks in sizing. - break-word sizes as if no breaks, potentially causing larger containers or overflow until layout. Example (flex-grow scenario): - Long word in flex item: anywhere breaks early, balanced layout; break-word grows item disproportionately. Use overflow-wrap: anywhere for modern code. For compatibility: overflow-wrap: break-word; overflow-wrap: anywhere; (break-word falls back). Note: word-break: break-word is distinct from overflow-wrap: break-word (legacy word-wrap), which does NOT consider soft wraps in min-content (opposite of anywhere). The query likely intends the common legacy pairing with word-break: break-word ~ anywhere behavior.
Citations:
- 1: https://developer.mozilla.org/en/CSS/word-wrap
- 2: https://developer.mozilla.org/en-US/docs/Web/CSS/Reference/Properties/overflow-wrap
- 3: https://developer.mozilla.org/en-US/docs/Web/CSS/word-break
- 4: https://www.w3.org/TR/css-text-3/
- 5: https://stackoverflow.com/questions/68593758/difference-between-overflow-wrap-anywhere-overflow-wrap-break-word
- 6: https://blog.logrocket.com/guide-css-word-wrap-overflow-wrap-word-break/
- 7: https://stackoverflow.com/questions/17259916/difference-between-overflow-wrap-and-word-break
- 8: https://stackoverflow.com/questions/77651244/what-are-soft-wrap-opportunities-introduced-by-the-word-break
Replace deprecated word-break: break-word with overflow-wrap: anywhere.
word-break: break-word is deprecated per CSS Text Module Level 3/4 specification. Use overflow-wrap: anywhere to achieve the same wrapping behavior for long log bodies with modern, standards-compliant CSS.
🧰 Tools
🪛 Stylelint (17.9.0)
[error] 452-452: Deprecated keyword "break-word" for property "word-break" (declaration-property-value-keyword-no-deprecated)
(declaration-property-value-keyword-no-deprecated)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/css/demo.css` around lines 450 - 452, Replace the
deprecated CSS property in the .log-body code rule: remove word-break:
break-word and add overflow-wrap: anywhere to achieve the same wrap behavior;
update the .log-body code selector in demo.css to use overflow-wrap: anywhere so
the styles remain standards-compliant.
| <div id="tabs-panel" class="tabs-panel" hidden> | ||
| <div class="tab-bar"> | ||
| <button type="button" class="tab-btn active" data-tab="storage">Storage Explorer</button> | ||
| <button type="button" class="tab-btn" data-tab="logs">Logs Explorer</button> | ||
| <button type="button" class="tab-btn" data-tab="query">Query Builder</button> | ||
| </div> | ||
|
|
||
| <!-- Storage Explorer --> | ||
| <div id="panel-storage" class="tab-content"> | ||
| <h3>Per-Service Storage Breakdown</h3> | ||
| <div id="service-breakdown"></div> | ||
| <h3 style="margin-top:1.5rem">Chunks</h3> | ||
| <div id="chunk-list"></div> | ||
| </div> | ||
|
|
||
| <!-- Logs Explorer --> | ||
| <div id="panel-logs" class="tab-content" hidden> | ||
| <div id="logs-loading" class="muted" hidden>Analyzing...</div> | ||
| <h3>Service Health</h3> | ||
| <div id="service-health"></div> | ||
| <div id="error-clusters"></div> | ||
| <div id="template-analysis"></div> | ||
| </div> | ||
|
|
||
| <!-- Query Builder --> | ||
| <div id="panel-query" class="tab-content" hidden> | ||
| <h3>Query Builder</h3> | ||
| <p class="muted">Build a query against the log store. The engine uses zone-map pruning on time range and severity to skip chunks without decoding.</p> | ||
| <div id="query-form"></div> | ||
| <div id="query-results"></div> | ||
| </div> |
There was a problem hiding this comment.
Add real tab semantics to this tabbed UI.
The controls are visually tabs, but the markup lacks role="tablist", role="tab", aria-selected, aria-controls, and matching role="tabpanel"/aria-labelledby. Screen readers won't announce the relationship, and keyboard users lose context when the active panel changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/index.html` around lines 101 - 131, The tab markup lacks
ARIA roles and attributes, so update the tab container (`#tabs-panel` / .tab-bar)
and the tab buttons (.tab-btn with data-tab="storage|logs|query") to include
role="tablist" on the container, role="tab" on each button,
aria-selected="true/false" and aria-controls pointing to the corresponding panel
id (panel-storage, panel-logs, panel-query), and add id attributes on the
buttons to use as aria-labelledby values; also mark the panels (panel-storage,
panel-logs, panel-query) with role="tabpanel" and aria-labelledby that
references the matching tab id and toggle hidden/aria-hidden consistently when
activating a tab; ensure any JS that switches tabs (the click/keyboard handler
tied to .tab-btn or `#tabs-panel`) updates aria-selected, aria-hidden,
aria-controls/aria-labelledby and visible state together so screen readers and
keyboard users get correct semantics.
| function escapeHtml(str) { | ||
| return str.replace(/&/g, "&").replace(/</g, "<").replace(/>/g, ">"); | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
escapeHtml doesn't escape quotes — safe in text nodes, risky in attributes.
Current usage places escaped values in <code> text content (safe), but if used in attribute contexts, unescaped " or ' could enable injection.
🛡️ Extend escapeHtml for attribute safety
function escapeHtml(str) {
- return str.replace(/&/g, "&").replace(/</g, "<").replace(/>/g, ">");
+ return str
+ .replace(/&/g, "&")
+ .replace(/</g, "<")
+ .replace(/>/g, ">")
+ .replace(/"/g, """)
+ .replace(/'/g, "'");
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/js/app.js` around lines 62 - 64, The escapeHtml function
currently only replaces &, <, and >; extend it to also replace double-quote (")
and single-quote (') characters so escaped strings are safe when inserted into
HTML attributes as well as text nodes. Update the escapeHtml implementation
(function name: escapeHtml) to include replacements for " and ' (and optionally
`) and ensure any call sites that may place values into attributes continue to
use escapeHtml. Keep the escaping order so & is replaced first to avoid
double-escaping.
| return { | ||
| records: result.records, | ||
| stats: { | ||
| ...result.stats, | ||
| totalTimeMs: elapsed.toFixed(1), | ||
| recordsPerMs: | ||
| result.stats.recordsEmitted > 0 ? (result.stats.recordsEmitted / elapsed).toFixed(1) : "0", | ||
| }, | ||
| spec, | ||
| }; |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
totalTimeMs is a string, may surprise consumers.
elapsed.toFixed(1) returns a string, so stats.totalTimeMs is "12.3" not 12.3. The UI in app.js:467 interpolates it directly (fine), but any numeric comparison would fail silently.
📝 Consider keeping numeric value alongside formatted string
stats: {
...result.stats,
- totalTimeMs: elapsed.toFixed(1),
+ totalTimeMs: elapsed,
+ totalTimeMsFormatted: elapsed.toFixed(1),
recordsPerMs:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| return { | |
| records: result.records, | |
| stats: { | |
| ...result.stats, | |
| totalTimeMs: elapsed.toFixed(1), | |
| recordsPerMs: | |
| result.stats.recordsEmitted > 0 ? (result.stats.recordsEmitted / elapsed).toFixed(1) : "0", | |
| }, | |
| spec, | |
| }; | |
| return { | |
| records: result.records, | |
| stats: { | |
| ...result.stats, | |
| totalTimeMs: elapsed, | |
| totalTimeMsFormatted: elapsed.toFixed(1), | |
| recordsPerMs: | |
| result.stats.recordsEmitted > 0 ? (result.stats.recordsEmitted / elapsed).toFixed(1) : "0", | |
| }, | |
| spec, | |
| }; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/js/query-model.js` around lines 85 - 94, The code sets
stats.totalTimeMs and stats.recordsPerMs to strings via toFixed, which can
surprise consumers; update the return from the function so totalTimeMs is a
numeric value (e.g., totalTimeMs: Number(elapsed.toFixed(1)) or
Math.round/appropriate numeric cast) and preserve a formatted string alongside
it (e.g., totalTimeMsFormatted or totalTimeMsStr), and do the same for
recordsPerMs (keep a numeric recordsPerMs and a formatted recordsPerMsStr) using
the existing elapsed and result.stats.recordsEmitted values to compute both
forms.
| export function createStore(opts = {}) { | ||
| const { rowsPerChunk = 1024 } = opts; | ||
|
|
||
| const store = new LogStore({ | ||
| rowsPerChunk, | ||
| policyFactory: () => | ||
| new TypedColumnarDrainPolicy({ | ||
| bodyCodec: "raw", | ||
| }), | ||
| }); | ||
|
|
||
| return store; | ||
| } |
There was a problem hiding this comment.
Demo uses bodyCodec: "raw" — compression disabled.
Unlike the benchmarks which use "zstd-19", the demo store uses "raw". This is fine for a browser demo (avoids WASM ZSTD overhead), but the compression metrics shown will be misleading since no compression occurs.
📝 Consider adding a comment or UI indicator
const store = new LogStore({
rowsPerChunk,
policyFactory: () =>
new TypedColumnarDrainPolicy({
- bodyCodec: "raw",
+ bodyCodec: "raw", // No compression for browser demo performance
}),
});📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| export function createStore(opts = {}) { | |
| const { rowsPerChunk = 1024 } = opts; | |
| const store = new LogStore({ | |
| rowsPerChunk, | |
| policyFactory: () => | |
| new TypedColumnarDrainPolicy({ | |
| bodyCodec: "raw", | |
| }), | |
| }); | |
| return store; | |
| } | |
| export function createStore(opts = {}) { | |
| const { rowsPerChunk = 1024 } = opts; | |
| const store = new LogStore({ | |
| rowsPerChunk, | |
| policyFactory: () => | |
| new TypedColumnarDrainPolicy({ | |
| bodyCodec: "raw", // No compression for browser demo performance | |
| }), | |
| }); | |
| return store; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/js/storage-model.js` around lines 27 - 39, The demo store
in createStore currently sets TypedColumnarDrainPolicy with bodyCodec: "raw" (in
LogStore creation), which disables compression and can mislead compression
metrics; update the code near createStore/LogStore/TypedColumnarDrainPolicy to
add a clear comment (or a small UI/debug flag) stating that bodyCodec is
intentionally "raw" for the demo to avoid WASM ZSTD overhead and therefore
compression metrics do not apply, or expose an option to switch to "zstd-19" for
benchmarking so readers know why compression differs.
| it("has deterministic output (same seed)", () => { | ||
| const a = generateLogs({ count: 50, durationMinutes: 5, seed: 123 }); | ||
| const b = generateLogs({ count: 50, durationMinutes: 5, seed: 123 }); | ||
| expect(a.records[0].body).toEqual(b.records[0].body); | ||
| expect(a.records[49].severityText).toEqual(b.records[49].severityText); | ||
| }); |
There was a problem hiding this comment.
This determinism test misses the unstable field.
It only compares body and severityText, so timestamp drift from the current Date.now()-based baseTime still passes. Include timeUnixNano in the assertion set, or compare normalized full records for the same seed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/test/data-gen.test.js` around lines 12 - 17, The
determinism test for generateLogs is incomplete because it only asserts body and
severityText; update the test to also assert the time field to catch timestamp
drift: when calling generateLogs({ count: 50, durationMinutes: 5, seed: 123 })
for a and b, include
expect(a.records[0].timeUnixNano).toEqual(b.records[0].timeUnixNano) and
expect(a.records[49].timeUnixNano).toEqual(b.records[49].timeUnixNano) (or
alternatively replace the selective assertions with a deep equality check of
normalized full records from generateLogs to ensure complete deterministic
output).
| it("streaming generator yields complete coverage", () => { | ||
| const batches = []; | ||
| for (const { batch } of generateLogBatches({ | ||
| count: 500, | ||
| durationMinutes: 5, | ||
| batchSize: 100, | ||
| })) { | ||
| batches.push(batch); | ||
| } | ||
| const totalRecords = batches.reduce((sum, b) => sum + b.length, 0); | ||
| expect(totalRecords).toBe(500); | ||
| expect(batches.length).toBe(5); | ||
| }); |
There was a problem hiding this comment.
Assert the terminal progress event.
This checks coverage and batch count, but not that the last yielded progress is 1. The exact-batch-boundary bug in generateLogBatches() can regress here without failing the suite.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/test/data-gen.test.js` around lines 91 - 103, The test
currently iterates over generateLogBatches only collecting batches; record and
assert the final progress value instead of ignoring it. Modify the test that
calls generateLogBatches to capture each yielded item's progress (e.g., track
lastProgress from each iteration of generateLogBatches) and add an assertion
that lastProgress === 1 after the loop; reference the generateLogBatches
generator and the yielded progress property to locate where to add the tracking
and the new expect(lastProgress).toBe(1) assertion.
| it("computeServiceDistribution groups by service", () => { | ||
| const records = [ | ||
| { attributes: [{ key: "service.name", value: "a" }] }, | ||
| { attributes: [{ key: "service.name", value: "a" }] }, | ||
| { attributes: [{ key: "service.name", value: "b" }] }, | ||
| ]; | ||
| const dist = computeServiceDistribution(records); | ||
| expect(dist.a).toBe(2); | ||
| expect(dist.b).toBe(1); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider testing the "unknown" service fallback.
computeServiceDistribution defaults missing service names to "unknown", but the test only covers records with valid service.name attributes.
💚 Add test for missing service attribute
it("computeServiceDistribution groups by service", () => {
const records = [
{ attributes: [{ key: "service.name", value: "a" }] },
{ attributes: [{ key: "service.name", value: "a" }] },
{ attributes: [{ key: "service.name", value: "b" }] },
+ { attributes: [] }, // missing service.name
+ { attributes: undefined },
];
const dist = computeServiceDistribution(records);
expect(dist.a).toBe(2);
expect(dist.b).toBe(1);
+ expect(dist.unknown).toBe(2);
});📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| it("computeServiceDistribution groups by service", () => { | |
| const records = [ | |
| { attributes: [{ key: "service.name", value: "a" }] }, | |
| { attributes: [{ key: "service.name", value: "a" }] }, | |
| { attributes: [{ key: "service.name", value: "b" }] }, | |
| ]; | |
| const dist = computeServiceDistribution(records); | |
| expect(dist.a).toBe(2); | |
| expect(dist.b).toBe(1); | |
| }); | |
| }); | |
| it("computeServiceDistribution groups by service", () => { | |
| const records = [ | |
| { attributes: [{ key: "service.name", value: "a" }] }, | |
| { attributes: [{ key: "service.name", value: "a" }] }, | |
| { attributes: [{ key: "service.name", value: "b" }] }, | |
| { attributes: [] }, // missing service.name | |
| { attributes: undefined }, | |
| ]; | |
| const dist = computeServiceDistribution(records); | |
| expect(dist.a).toBe(2); | |
| expect(dist.b).toBe(1); | |
| expect(dist.unknown).toBe(2); | |
| }); | |
| }); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@site/logsdb-engine/test/query-model.test.js` around lines 165 - 175, Add a
test case to verify computeServiceDistribution falls back to "unknown" when
service.name is missing: create records that either have no attributes or have
attributes that don't include key "service.name" (in addition to existing valid
records), call computeServiceDistribution(records) and assert that dist.unknown
equals the count of those fallback records while existing assertions for
dist.a/dist.b remain valid; reference computeServiceDistribution and the test
block that currently asserts dist.a/dist.b to place this new expectation.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
packages/o11ylogsdb/src/query.ts (2)
160-183:⚠️ Potential issue | 🟠 MajorAvoid decoding matching chunks twice.
Once
hasMatchflips true, the chunk has already gone throughreadBodiesOnly()and is then fully decoded again viareadRecords(). On low-selectivitybodyContainsqueries this makes the “fast path” slower than the baseline path. Reuse a shared decompressed payload, or have the body-only path populate the same decode cache thatreadRecords()consumes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/o11ylogsdb/src/query.ts` around lines 160 - 183, The fast-path decodes bodies with readBodiesOnly() and then decodes the chunk again with readRecords() once hasMatch is true; change the flow so the decompressed payload or per-index decoded body data produced by readBodiesOnly() is stored in a shared cache/structure that readRecords() can consume (or make readBodiesOnly() return both bodies and the decompressed blob) to avoid a second full decode; update the logic around bodies, hasMatch, readBodiesOnly, and readRecords to pass/reuse that cached payload and adjust stats updates accordingly so chunks are only decoded once.
125-128:⚠️ Potential issue | 🔴 CriticalGuard the body-only fast path by policy, not just by
QuerySpec.This still lets Line 149 prune a chunk from
readBodiesOnly()even when the active policy only normalizes body strings inpostDecode(). In that casehasMatchis computed on a different representation thanrecordMatches(), so valid matches can be dropped.Minimal safe guard
- const useBodyFastPath = spec.bodyContains !== undefined && !spec.bodyLeafEquals; - for (const id of store.streams.ids()) { @@ const chunks = store.streams.chunksOf(id); const policy = store.policyFor(id); + const canUseBodyFastPath = + spec.bodyContains !== undefined && + !spec.bodyLeafEquals && + policy?.postDecode === undefined; for (const chunk of chunks) { @@ - if (useBodyFastPath) { + if (canUseBodyFastPath) {Also applies to: 149-179
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/o11ylogsdb/src/query.ts` around lines 125 - 128, The body-only fast path (useBodyFastPath) is being enabled solely from QuerySpec, which allows readBodiesOnly() to prune chunks incorrectly when postDecode() normalizes body strings; change the guard so the fast-path is enabled only when both spec conditions (spec.bodyContains && !spec.bodyLeafEquals) AND the active policy indicates it is safe to evaluate bodyContains on the pre-decode representation (i.e., postDecode will not alter body text). Update the useBodyFastPath calculation and adjust readBodiesOnly()/hasMatch logic to consult that policy flag (or a function like policy.allowsBodyFastPath / policy.postDecodeNoopForBodies) before using pre-decode matching, and ensure recordMatches/postDecode remain authoritative when the policy disallows the fast path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@packages/o11ylogsdb/src/query.ts`:
- Around line 160-183: The fast-path decodes bodies with readBodiesOnly() and
then decodes the chunk again with readRecords() once hasMatch is true; change
the flow so the decompressed payload or per-index decoded body data produced by
readBodiesOnly() is stored in a shared cache/structure that readRecords() can
consume (or make readBodiesOnly() return both bodies and the decompressed blob)
to avoid a second full decode; update the logic around bodies, hasMatch,
readBodiesOnly, and readRecords to pass/reuse that cached payload and adjust
stats updates accordingly so chunks are only decoded once.
- Around line 125-128: The body-only fast path (useBodyFastPath) is being
enabled solely from QuerySpec, which allows readBodiesOnly() to prune chunks
incorrectly when postDecode() normalizes body strings; change the guard so the
fast-path is enabled only when both spec conditions (spec.bodyContains &&
!spec.bodyLeafEquals) AND the active policy indicates it is safe to evaluate
bodyContains on the pre-decode representation (i.e., postDecode will not alter
body text). Update the useBodyFastPath calculation and adjust
readBodiesOnly()/hasMatch logic to consult that policy flag (or a function like
policy.allowsBodyFastPath / policy.postDecodeNoopForBodies) before using
pre-decode matching, and ensure recordMatches/postDecode remain authoritative
when the policy disallows the fast path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 2f792218-8066-44ec-aff8-e87b212a8744
📒 Files selected for processing (1)
packages/o11ylogsdb/src/query.ts
…ranges Use materializeRangeOwned instead of materializeRange to ensure we get owned copies of arrays rather than borrowed references to shared scratch buffers that get invalidated on subsequent decodeView calls.
Summary
Brings o11ylogsdb toward production maturity with three key additions:
1. Comprehensive Benchmark Suite
New synthetic corpus generators covering 5 workload types at multiple scales:
Three new benchmark modules:
comprehensive-storage— B/log and compression ratio across all corpus typescomprehensive-query— Query latency for 10 predicate patterns across all corporaingest-throughput— Records/s and MB/s at 100K scale2. WeakMap Decode Cache (50-100× speedup)
Chunks are immutable after freeze — cached decoded records eliminate redundant ZSTD decompression on repeated queries:
3. Interactive LogsDB Demo (
site/logsdb-engine/)Three-panel experience using the real o11ylogsdb engine:
Includes 32 tests for the demo modules and browser compatibility fixes.
Storage Efficiency Results (10K records)
Key Findings