SAW-6662: add Track A Snowflake parquet schema mode#38
Conversation
There was a problem hiding this comment.
Telemetry Review — Changes Required
This PR introduces a new Snowflake Parquet adapter that creates a persistent, queryable archive with no equivalent in the existing Datadog path. Four high-severity findings require changes before this ships.
F1 — customer.id in indexed hot partition
customer.id is hardcoded in defaultAttributesHotKeys and written verbatim to attributes_hot_text with no redaction or hashing. Remove it from the hot list or pseudonymise the value.
F4 — Unfiltered PII in log body → Snowflake archive
deriveBodyJSONAndMessage writes the full body verbatim. Live log bodies carry userinfo.username, sender_login, session_id, ip, remote_addr, org_id, delivery_id, and installation_id. A field-level deny-list is needed before serialisation.
F2 / F3 — Unbounded cold attribute and tag key spaces
attributes_cold_text and tags_cold_text accept all 200+ live attribute/tag keys with no cap on key count or value length, including k8s.pod.uid, k8s.pod.ip, and k8s.pod.start_time. Add a configurable allowlist or max-keys/max-value-bytes guard.
F6 (medium) — transaction.id linkage in hot partition
Co-locating transaction.id with customer.id in the same indexed column enables trivial cross-row identity linkage. Consider moving transaction.id to cold or using a surrogate key.
F5 (medium) — Hot-bucket cardinality encoding
Both hot keys have unbounded cardinality and use DELTA_BYTE_ARRAY; clustering or search optimisation on the raw JSON column will not be effective. Consider dedicated top-level columns if indexed filtering is required.
There was a problem hiding this comment.
Compared with previous telemetry review: 0 new, 3 still open, 3 fixed, 0 accepted.
Two high-severity PII findings block this PR. Both introduce new persistent Snowflake archive paths with no equivalent in the existing Datadog adapter.
F1 — Direct identifier in indexed hot column (adapter.go:36): customer.id is hardcoded in defaultAttributesHotKeys and written verbatim to attributes_hot_text. The test at adapter_test.go:44 explicitly asserts the raw value lands there.
F2 — Unfiltered body PII to archive (adapter.go:248): deriveBodyJSONAndMessage serialises the full log body with no field filtering. Live log bodies carry userinfo.username, ip, remote_addr, session_id, sender_login, orgId, delivery_id, and installation_id.
F3 — Quasi-identifier linkage (adapter.go:37, advisory): transaction.id in the same hot column as customer.id enables trivial cross-row identity linkage.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.chloggen/saw-6662-track-a-snowflake-parquet.yaml:
- Line 3: The YAML has an unquoted plain scalar for the note key that contains a
colon, causing a parse error; update the note entry (the "note" mapping) so its
value is wrapped in double quotes (e.g., change the note value that currently
contains "schema: snowflake" to a quoted string) to prevent the YAML parser from
treating the colon as a mapping delimiter.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 024b02e2-0ecd-487a-b9a8-e4e5a3ecbe45
📒 Files selected for processing (1)
.chloggen/saw-6662-track-a-snowflake-parquet.yaml
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".chloggen/saw-6662-track-a-snowflake-parquet.yaml">
<violation number="1" location=".chloggen/saw-6662-track-a-snowflake-parquet.yaml:3">
P2: Unquoted `note` value contains `schema: snowflake` (colon + space), which is invalid in a YAML plain scalar — the parser treats it as a mapping delimiter, producing a syntax error. Wrap the value in double quotes so changelog tooling can parse this file.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
There was a problem hiding this comment.
[ARCH-REVIEW] Request changes: 1 blocking issue.
extension/parquetlogencodingextension/extension.go:236 stops checking max_file_size_bytes after the first size-triggered flush in a batch, so a larger batch can refill the new writer past the configured cap before the next call. That defeats the size limit and can produce oversized parquet objects.
There was a problem hiding this comment.
3 issues found across 8 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="extension/parquetlogencodingextension/extension.go">
<violation number="1" location="extension/parquetlogencodingextension/extension.go:51">
P1: Shutdown does not account for queued flush payloads, so it can report success while buffered parquet blobs remain undelivered.</violation>
<violation number="2" location="extension/parquetlogencodingextension/extension.go:301">
P2: Popping queued flushes by reslicing without clearing the first slot can retain large parquet byte buffers in memory longer than needed.</violation>
</file>
<file name="extension/parquetlogencodingextension/adapters/snowflake/adapter.go">
<violation number="1" location="extension/parquetlogencodingextension/adapters/snowflake/adapter.go:323">
P2: Byte-level truncation `v[:maxArchivedStringValueSize]` can split a multi-byte UTF-8 character that straddles the 4096-byte boundary, producing an invalid UTF-8 string in the parquet output. Use `utf8`-aware truncation (e.g., iterate runes or use `strings.ToValidUTF8` after slicing) to guarantee valid output.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
There was a problem hiding this comment.
[ARCH-REVIEW] Re-review: request changes.
Prior size-gating issue in addLogRecordWithFlushMetadata is fixed. New blocking issue remains: once queuedFlushes exists, shutdown also needs to drain or account for it, otherwise already-serialized parquet chunks can be lost when the live writer buffer is empty.
There was a problem hiding this comment.
Two related telemetry issues introduced in adapters/snowflake/adapter.go around the zero-timestamp drop path:
- Silent data loss (F2):
logger.Debug(...)is used for a permanent record drop. At production log levels (Info+) the drop is completely invisible — no counter, no warning. The Datadog adapter does not drop these records at all. - Log spam (F1): The same Debug call sits inside the innermost of three nested loops. Under high-volume or misconfigured input it emits unbounded identical messages with no service/count context, flooding the log stream.
Suggested fix: Remove the per-record Debug call; increment a droppedCount counter in the loop body; after the loops emit a single logger.Warn with the total count when droppedCount > 0. Optionally add an OTel dropped-records counter for alerting.
|
@sawmills-architect-review latest head is Delta on this head:
Please re-review the current head. |
There was a problem hiding this comment.
[ARCH-REVIEW] Re-review: request changes.
The pending-record replay failure still exits before cleanup, so Shutdown can return an error while the buffer-state loop and telemetry remain active. If the caller does not retry, that leaves background work running after shutdown.
There was a problem hiding this comment.
[ARCH-REVIEW] Re-review: request changes.
Prior size-gating issue is fixed. One lifecycle concern remains: the buffered-records error path returns before stopBufferStateLoop() and telemetry.shutdown(), so a failed shutdown leaks the ticker/metrics machinery until a later retry.
| if e.writer != nil && len(e.writer.Objs) > 0 { | ||
| e.telemetry.shutdown() | ||
| e.mutex.Unlock() | ||
| return errors.New("buffered parquet records remain at shutdown; flush before shutdown") |
There was a problem hiding this comment.
🟡 [concern] This return skips stopBufferStateLoop() and telemetry.shutdown(), so if shutdown is failing because records are still buffered the extension keeps a ticker goroutine and metrics builder alive until some later retry. Please tear those down on the error path too, or move the cleanup into a defer.
There was a problem hiding this comment.
Fixed on the latest head. The buffered-records shutdown error path now stops the buffer-state loop and shuts down telemetry before returning, and the shutdown regression test now locks that behavior.
| bufferedRecordCount := len(e.writer.Objs) | ||
| bufferedCompressedBytes := e.getCurrentCompressedSize() | ||
|
|
||
| buf, reason, completedAt, err := e.checkAndFlushWithMetadata(false, flushReasonSize) |
There was a problem hiding this comment.
🟡 [concern] The size-flush error path requeues the tail records, but it returns immediately without calling recordBufferStateLocked() after appendPendingRecordsLocked(...). If checkAndFlushWithMetadata() fails, the buffer gauges stay stale until the next write or flush, which hides the retried tail from operators.
There was a problem hiding this comment.
Fixed on the latest head. The size-flush error path now refreshes buffer-state metrics after requeueing the pending tail, and the split-flush failure regression test now asserts the pending records remain visible to operators.
|
@sawmills-architect-review latest head is Delta since the last architect-reviewed head:
Current state on this head:
Please re-review the latest head. |
|
@sawmills-architect-review latest head is Fixed the new blocker from the failed check run:
Please re-review the current head. |
There was a problem hiding this comment.
[ARCH-REVIEW] Re-review: approve.
Prior blocking issue in extension.go shutdown buffering path is addressed by draining pendingRecords before the shutdown guard.
No unresolved review threads remain.
No new blocking architectural concerns found in the delta since the last review.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 8895bbc. Configure here.
| } | ||
| if e.writer != nil && len(e.writer.Objs) > 0 { | ||
| e.telemetry.shutdown() | ||
| e.mutex.Unlock() |
There was a problem hiding this comment.
Shutdown leaks goroutine and telemetry on buffered-records error
Medium Severity
When Shutdown finds buffered records in the writer (after successfully moving pending records), it returns an error without calling stopBufferStateLoop() or telemetry.shutdown(). The first error path (movePendingRecordsIntoWriterLocked failure) correctly cleans up both, but the second error path ("buffered parquet records remain") skips cleanup entirely. If the collector's shutdown sequence doesn't retry, the background ticker goroutine from startBufferStateLoop and the OTel metric registrations leak. This asymmetry between the two error paths is inconsistent — one cleans up resources, the other doesn't.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8895bbc. Configure here.
There was a problem hiding this comment.
Fixed on the latest head. The buffered-records shutdown error path now stops the buffer-state loop and shuts down telemetry before returning, and the shutdown regression test now locks that behavior.
There was a problem hiding this comment.
[ARCH-REVIEW] Re-review: approve.
Prior blocking issue around buffered spill state at shutdown has been addressed by consolidating queued chunks into pendingRecords and enforcing shutdown guards on pending payloads.
No new architectural or correctness concerns found in the updated implementation.


parquet_log_encoding: Add additive
schema: snowflakeTrack A support for Snowflake-oriented parquet export in contrib.datadogdefault and newsnowflakeadapterValidation:
cd extension/parquetlogencodingextension && go test ./...cd exporter/awss3exporter && go test ./... -run 'TestLoadConfig_LegacyParquetEncoding|TestLoadConfig_SnowflakeParquetEncoding'\n-make genotelcontribcol\n-make otelcontribcol\n-./bin/otelcontribcol_darwin_arm64 validate --config=file:exporter/awss3exporter/testdata/snowflake-parquet.yamlNote
Medium Risk
Adds a new parquet encoding schema and refactors buffering/flush/telemetry paths, which can affect log export correctness and file boundaries; changes are gated behind
schema: snowflakebut touch shared flush logic.Overview
Adds an opt-in
schema: snowflakemode toparquet_log_encoding(default remainsdatadog) with a new Snowflake adapter that emits a Snowflake-friendly log row shape (typed top-level columns plus canonicalized JSON-text buckets) and configurable hot-key splitting.Refactors the parquet extension’s buffering/flush behavior to support spill/pending records across size-triggered flushes, periodically record buffer state (including oldest age), and emit a new dropped-records metric; includes extensive readback/behavior tests.
Wires the new extension into the contrib distribution and coverage config, updates
awss3exporterdocs/tests to validate pairing withparquet_log_encoding(including a Snowflake example), and adds Snowflake SQL artifacts + design docs for Track A benchmarking. Also adds agovulncheckwrapper script used bymake govulncheckto ignore selected no-fix advisories.Reviewed by Cursor Bugbot for commit 49f96e7. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by cubic
Adds an opt-in Snowflake-oriented Parquet schema to
parquet_log_encodingfor SAW-6662 Track A while keepingdatadogas default. Improves flush/shutdown robustness and telemetry, and wires the extension into the contrib build withawss3exporterpairing and Snowflake SQL artifacts.New Features
schema: snowflake(logs only);datadogremains default.ts,service,status,message_text,host,source,trace_id,span_id,schema_version) plus JSON-text buckets with precision-safe numbers; OTLP trace/span IDs; hot/cold buckets with sane defaults/bounds.awss3exporterpairing docs/tests and example; Snowflake SQL artifacts (external table, secure view, benchmark queries); Codecov component mapping. Aligns with SAW-6662 Track A.Bug Fixes
datadogtag parsing; reject non‑finite numeric tag values; preserve normal decimal JSON numbers in canonical JSON; preserve Snowflake cold‑bucket fields.govulncheckfilter script to ignore selected no‑fix advisories; add Codecov component mapping forextension_parquetlogencoding.Written for commit 49f96e7. Summary will update on new commits.
Summary by CodeRabbit
New Features
Documentation
Tests
Chores