feat(clp-s)!: Add `Timestamp` column type to replace `DateString` column type; Bump the archive version to `0.5.0`. #1788
Conversation
Walkthrough

Adds explicit `Timestamp` support, renames `DateString` → `DeprecatedDateString`, refactors timestamp ingestion/serialization to return (epochtime, pattern_id) via a timestamp parser, updates readers/writers/dictionaries and query/search paths, bumps the archive format with a deprecated-format marker, updates CMake linkages, and adds tests and test data.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant JsonParser
    participant ArchiveWriter
    participant TimestampDictWriter as TimestampDictionaryWriter
    participant timestamp_parser as timestamp_parser
    JsonParser->>ArchiveWriter: detect timestamp field & value
    ArchiveWriter->>TimestampDictWriter: ingest_string_timestamp(key,node_id,value,is_json_literal)
    TimestampDictWriter->>timestamp_parser: parse/identify pattern
    timestamp_parser-->>TimestampDictWriter: (epochtime_t, pattern_id)
    TimestampDictWriter-->>ArchiveWriter: (epochtime_t, pattern_id)
    ArchiveWriter->>ArchiveWriter: create Timestamp node and write encoded value with pattern_id
```

```mermaid
sequenceDiagram
    participant ArchiveReader
    participant SchemaReader
    participant TimestampDictReader as TimestampDictionaryReader
    participant timestamp_parser as timestamp_parser
    participant TimestampColumnReader
    ArchiveReader->>SchemaReader: open archive
    SchemaReader->>TimestampDictReader: read(decompressor, has_deprecated_timestamp_format)
    TimestampDictReader->>timestamp_parser: construct/validate patterns
    timestamp_parser-->>TimestampDictReader: TimestampPattern objects
    SchemaReader->>TimestampColumnReader: instantiate for Timestamp column
    TimestampColumnReader->>TimestampDictReader: append_timestamp_to_buffer(epoch, pattern_id, buffer)
    TimestampDictReader->>timestamp_parser: format timestamp per pattern
    timestamp_parser-->>TimestampDictReader: formatted string
    TimestampDictReader-->>TimestampColumnReader: formatted timestamp bytes
    TimestampColumnReader-->>ArchiveReader: deliver serialized timestamp value
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
components/webui/client/src/pages/SearchPage/SearchState/Presto/useTimestampKeyInit/sql.ts (1)

31-37: Update SQL query to include both deprecated and new timestamp types.

The `WHERE` clause filters only by the deprecated timestamp type (value 8). To support both old archives (using `DeprecatedDateString`) and new archives (using `Timestamp`), update the query to check for both enum values.

🔧 Proposed fix to query both timestamp types

After determining the enum value for `NodeType::Timestamp` (see verification script in previous comment), update the query:

```diff
+const NEW_TIMESTAMP_TYPE = /* value of NodeType::Timestamp */;
+
 const buildTimestampColumnsSql = (datasetName: string): string => `
     SELECT DISTINCT ${CLP_COLUMN_METADATA_TABLE_COLUMN_NAMES.NAME}
     FROM ${settings.SqlDbClpTablePrefix}${datasetName}_${SqlTableSuffix.COLUMN_METADATA}
-    WHERE ${CLP_COLUMN_METADATA_TABLE_COLUMN_NAMES.TYPE} = ${TIMESTAMP_TYPE}
+    WHERE ${CLP_COLUMN_METADATA_TABLE_COLUMN_NAMES.TYPE} IN (${TIMESTAMP_TYPE}, ${NEW_TIMESTAMP_TYPE})
     ORDER BY ${CLP_COLUMN_METADATA_TABLE_COLUMN_NAMES.NAME};
`;
```

components/core/src/clp_s/search/QueryRunner.cpp (2)
63-95: Track all `DeprecatedDateString` columns, not just one.

If a legacy archive contains multiple deprecated date columns, the single pointer drops all but the last and can lead to missed matches or `at()` failures later. Store them in a map (like `m_timestamp_readers`) and iterate accordingly in wildcard/exact paths.

🛠️ Proposed fix (core change; update header + call sites)

```diff
-    m_deprecated_datestring_reader = nullptr;
+    m_deprecated_datestring_readers.clear();
 ...
-    } else if (auto* deprecated_date_reader
-            = dynamic_cast<DeprecatedDateStringColumnReader*>(column_reader);
-            nullptr != deprecated_date_reader)
-    {
-        m_deprecated_datestring_reader = deprecated_date_reader;
+    } else if (auto* deprecated_date_reader
+            = dynamic_cast<DeprecatedDateStringColumnReader*>(column_reader);
+            nullptr != deprecated_date_reader)
+    {
+        m_deprecated_datestring_readers.emplace(column_id, deprecated_date_reader);
     } else {
```

Then update wildcard/exact timestamp paths to iterate `m_deprecated_datestring_readers`.
1187-1219: Normalise deprecated `DateString` timestamps to nanoseconds before compare.

Legacy `DateString` values are millisecond-precision; comparing them directly to `Timestamp` literals (nanoseconds) will miss matches by 1e6. Scale before calling `evaluate_int_filter_core`.

🛠️ Proposed fix

```diff
 bool QueryRunner::evaluate_epoch_date_filter(
         FilterOperation op,
         DeprecatedDateStringColumnReader* reader,
         std::shared_ptr<Literal>& operand
 ) {
+    constexpr int64_t cNanosecondsInMillisecond{1000 * 1000LL};
     if (FilterOperation::EXISTS == op || FilterOperation::NEXISTS == op) {
         return true;
     }
@@
-    return evaluate_int_filter_core(op, reader->get_encoded_time(m_cur_message), op_value);
+    return evaluate_int_filter_core(
+            op,
+            reader->get_encoded_time(m_cur_message) * cNanosecondsInMillisecond,
+            op_value
+    );
 }
```
🤖 Fix all issues with AI agents
In `@components/core/src/clp_s/ArchiveWriter.cpp`:
- Around line 333-341: Remove the unreachable break after the throw in the
NodeType::DeprecatedDateString case: locate the switch handling NodeType in
ArchiveWriter.cpp (the case labeled NodeType::DeprecatedDateString) and delete
the trailing "break;" that follows "throw OperationFailed(...)" so the case
cleanly throws without an unreachable statement; leave surrounding cases (e.g.,
NodeType::DeltaInteger and NodeType::Timestamp that append column writers)
unchanged.
In `@components/core/src/clp_s/clp-s.cpp`:
- Around line 186-192: The version comparison should mask off the development
flag so dev builds don't misclassify; change the condition that compares
archive_reader->get_header().version against cDatestringDeprecationVersion to
use (archive_reader->get_header().version &
~cArchiveFormatDevelopmentVersionFlag) < cDatestringDeprecationVersion, leaving
the existing ast::SetTimestampLiteralPrecision creation and
date_precision_pass.run(expr) call intact so pre-0.5.0 (minus dev flag) archives
still get millisecond precision normalization.
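The masking this instruction describes can be sketched in isolation as follows; the flag value, version values, and function name are illustrative assumptions for this sketch, not the project's actual constants:

```cpp
#include <cassert>
#include <cstdint>

// Assumed flag value for illustration only; the real constant lives in clp-s.
constexpr std::uint32_t cArchiveFormatDevelopmentVersionFlag{0x8000'0000U};

// Clear the development flag before comparing so a dev build of a
// post-deprecation format isn't misclassified as pre-deprecation.
constexpr bool is_pre_deprecation_version(
        std::uint32_t archive_version,
        std::uint32_t deprecation_version
) {
    return (archive_version & ~cArchiveFormatDevelopmentVersionFlag) < deprecation_version;
}
```

Without the mask, a dev-flagged pre-deprecation version would compare as a huge number and skip the millisecond-precision normalization pass.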
In `@components/core/src/clp_s/SchemaReader.cpp`:
- Around line 20-54: In SchemaReader::mark_column_as_timestamp, the
NodeType::Float branch casts the fractional seconds to epochtime_t before
scaling which truncates sub-second precision; modify the m_get_timestamp lambda
for the Float case to first multiply the extracted double by
cNanosecondsInSecond and then convert to epochtime_t (optionally using
std::lround or similar to preserve proper rounding) so the scaling preserves
sub‑second precision.
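The truncation issue can be reproduced standalone; `epochtime_t` and `cNanosecondsInSecond` mirror the names above, but this is an illustrative sketch rather than the project's code:

```cpp
#include <cassert>
#include <cstdint>

using epochtime_t = std::int64_t;

constexpr epochtime_t cNanosecondsInSecond{1'000'000'000LL};

// Cast-then-scale: the fractional part of the seconds value is discarded
// by the integer cast before it can be scaled into nanoseconds.
constexpr epochtime_t to_ns_truncating(double seconds) {
    return static_cast<epochtime_t>(seconds) * cNanosecondsInSecond;
}

// Scale-then-cast: the sub-second component survives the conversion.
constexpr epochtime_t to_ns_preserving(double seconds) {
    return static_cast<epochtime_t>(seconds * static_cast<double>(cNanosecondsInSecond));
}
```

For an input of 1.5 seconds, the first form yields 1,000,000,000 ns while the second yields 1,500,000,000 ns, which is the behaviour the fix above asks for.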
- Around line 495-500: The structured-template generators are dropping
NodeType::Timestamp and NodeType::DeprecatedDateString; update
generate_structured_object_template() to handle DeprecatedDateString via
JsonSerializer::Op::AddStringField and Timestamp via
JsonSerializer::Op::AddLiteralField (mirroring the root behavior in
generate_json_template()), and update generate_structured_array_template() to
emit timestamp elements by adding a new serializer operation
JsonSerializer::Op::AddLiteralValue and using it for NodeType::Timestamp while
using the existing string emission for DeprecatedDateString; ensure the new
AddLiteralValue is implemented in the JsonSerializer ops set and used where
array element literals are serialized so structured arrays no longer drop
timestamp values.
In `@components/core/src/clp_s/timestamp_parser/TimestampParser.cpp`:
- Around line 1818-1840: The function estimate_timestamp_precision negates
timestamp to compute abs_timestamp which UB for INT64_MIN; fix by computing an
unsigned absolute value instead of negating signed INT64_MIN: replace the signed
negation logic for abs_timestamp in estimate_timestamp_precision with an
unsigned conversion that handles INT64_MIN safely (e.g. convert timestamp to
uint64_t and, when negative, compute the magnitude by using -(timestamp+1) cast
to uint64_t then add 1), leaving the rest of the comparisons and returned pair
(cFactor and precision char 'N'/'C'/'L'/'E') unchanged.
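A minimal sketch of the safe absolute-value computation described above; the helper name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Compute |t| as an unsigned value without negating a signed value,
// which is undefined behaviour when t == INT64_MIN.
constexpr std::uint64_t unsigned_abs(std::int64_t t) {
    if (t >= 0) {
        return static_cast<std::uint64_t>(t);
    }
    // t + 1 is representable for every negative t (including INT64_MIN),
    // so -(t + 1) is safe; adding 1 back yields the true magnitude.
    return static_cast<std::uint64_t>(-(t + 1)) + 1ULL;
}
```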
In `@components/core/src/clp_s/TimestampDictionaryWriter.cpp`:
- Around line 204-208: TimestampDictionaryWriter::clear() resets m_next_id,
m_string_patterns_and_ids, and m_column_id_to_range but forgets to reset
m_numeric_pattern_to_id, which can leak numeric-pattern state and cause ID
collisions after m_next_id is reused; update the clear() implementation to also
clear m_numeric_pattern_to_id so all pattern-to-id maps are reset consistently
with m_next_id (ensure you reference and clear m_numeric_pattern_to_id alongside
m_string_patterns_and_ids and m_column_id_to_range in
TimestampDictionaryWriter::clear()).
- Around line 150-182: The multiplication epoch_timestamp = timestamp * factor
in TimestampDictionaryWriter::ingest_unknown_precision_epoch_timestamp can
overflow epochtime_t; change the calculation to use a wider intermediate type
(e.g., int128 or a 128-bit arithmetic helper) or perform a pre-multiply bounds
check against numeric_limits<epochtime_t>::min()/max() using the values from
timestamp_parser::estimate_timestamp_precision, and if the result would
overflow, log an error and fail fast (throw OperationFailed) before calling
timestamp_entry_it->second.ingest_timestamp; ensure the rest of the function
(pattern creation via m_numeric_pattern_to_id, m_next_id, and returned pattern
id) only runs when the epoch_timestamp has been validated.
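A standalone sketch of such a pre-multiply bounds check, assuming the precision factor is always positive (e.g. 1, 1'000, or 1'000'000'000); the names and the `std::optional` return are illustrative, not the project's API, which would instead throw `OperationFailed`:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

using epochtime_t = std::int64_t;

// Returns nullopt instead of overflowing when timestamp * factor would
// exceed the range of epochtime_t. Assumes factor >= 1.
std::optional<epochtime_t> checked_scale(epochtime_t timestamp, epochtime_t factor) {
    constexpr auto cMax{std::numeric_limits<epochtime_t>::max()};
    constexpr auto cMin{std::numeric_limits<epochtime_t>::min()};
    if (timestamp > 0 && timestamp > cMax / factor) {
        return std::nullopt;
    }
    if (timestamp < 0 && timestamp < cMin / factor) {
        return std::nullopt;
    }
    return timestamp * factor;
}
```

The divisions establish the largest magnitude that can be scaled safely, so the multiplication only runs once it is known not to overflow.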
In
`@components/webui/client/src/pages/SearchPage/SearchState/Presto/useTimestampKeyInit/sql.ts`:
- Around line 15-19: The SQL currently only filters for TIMESTAMP_TYPE (constant
TIMESTAMP_TYPE = 8) which maps to DeprecatedDateString; add a new constant
NEW_TIMESTAMP_TYPE = 14 (or rename TIMESTAMP_TYPE to DEPRECATED_TIMESTAMP_TYPE
for clarity) and update the query WHERE clause that references
CLP_COLUMN_METADATA_TABLE_COLUMN_NAMES.TYPE to use an IN list including both
values (e.g., IN (${TIMESTAMP_TYPE}, ${NEW_TIMESTAMP_TYPE})) so columns using
the new NodeType::Timestamp are discovered.
Besides the one presto webui change that rabbit pointed out, we likely also need to change how the webui interprets timestamps in search results -- I'll double check and change that if necessary.

Can u create an issue to track this?
```cpp
if (nullptr != m_deprecated_datestring_reader) {
    if (evaluate_epoch_date_filter(op, m_deprecated_datestring_reader, literal)) {
        return true;
    }
}
```
Since `m_deprecated_datestring_reader` and `m_timestamp_readers` are mutually exclusive, shall we add `continue` instead of falling back to the `m_timestamp_readers` check?
In general, I think there might be a better way to do this (for example, using a variant). But let's defer this to a future PR...
Sorry, could you explain what you mean by that? I could put the iteration over the timestamp readers into the else case, but in general we can't short circuit this function because we need to go through all of the columns for the other possibly matching types.
E.g., for a query like `* > 10` against the record `{"ts": "1970-01-01", "int": 11}`, where `ts` is stored as a `DateString`, completely short-circuiting this function after the `DateString` case would lead us to falsely reject this record, since we wouldn't end up falling into the code block below where we consider all of the primitive types.
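The evaluation semantics being discussed can be illustrated with a toy sketch (not the actual QueryRunner code): a record matches a wildcard filter if any column's filter matches, so no single column type may short-circuit the loop with a rejection:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy model: each element checks one column of the current record against
// the wildcard filter. Returning early after one non-matching column type
// (e.g. the DateString column) would falsely reject records that match on
// another column (e.g. an integer column holding 11 for the query * > 10).
bool record_matches(std::vector<std::function<bool()>> const& column_filters) {
    for (auto const& column_matches : column_filters) {
        if (column_matches()) {
            return true;
        }
    }
    return false;
}
```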
Uh right, sorry for the confusion. I think the right thing to do is probably put m_timestamp_reader in an else case.
```diff
 return evaluate_timestamp_filter(
         expr->get_operation(),
-        m_datestring_readers[column_id],
+        m_timestamp_readers.at(column_id),
```
I guess the upstream caller will ensure the column_id here must be a valid key in the reader map?
Yep, it happens in the code to prepare a given ERT to be searched.
@junhaoliao could you take a look at the presto webui change? Presto/velox won't work with the v0.5.0 archives introduced by this PR until I update them to understand the new Timestamp column type, but I figured I'd try to update the presto webui code related to finding the timestamp column to work with v0.5.0 archives while I was changing other references to the DateString column. I can revert the presto webui change and put it in a separate small PR after the presto/velox changes are merged if you prefer, though. I'm fairly certain this change won't affect the rest of the webui, since I made sure the timestamp we associate with each search result as extra metadata is still millisecond precision after this change, and when I run the webui the timeline & timestamps appear correct.

will do. Thanks for also updating the webui part from your side!
junhaoliao left a comment:

approved the webui changes
lgtm. risk of merging is low as well - the change is self-contained and doesn't affect the rest of the WebUI.
LinZhihao-723 left a comment:
For the PR title, how about:
feat(clp-s)!: Add `Timestamp` column type to deprecate `DateString` column type; Bump the archive version to `0.5.0`.
After this PR was merged, timestamp strings that are not ISO 8601–compliant (e.g., RFC 2822 / RFC 822 formats) are now rejected instead of being parsed on a best-effort basis. Attempting to compress logs with such timestamps will produce errors like `Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.`

Since none of the existing CLP users rely on non–ISO8601–compliant timestamps, and our goal is to ensure strict and accurate timestamp parsing, we will retain the current behavior of rejecting unsupported timestamp formats.

@coderabbitai please create:
Issue 1: Documentation Update

Title:
Labels:
Assignee:
Body:

## Background

After PR `#1788` was merged, CLP-S now uses stricter timestamp parsing that rejects non–ISO8601–compliant timestamp formats (e.g., RFC 2822 / RFC 822 formats). Attempting to compress logs with such timestamps now produces errors:

```
2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.
```

Issue 2: Feature Request

Title:
Labels:
Assignee:
Body:

## Feature Request

Add support for parsing RFC 2822 and RFC 822 timestamp formats in CLP-S timestamp parser.

## Background

Currently, CLP-S only accepts ISO8601-compliant timestamp formats. After PR `#1788` introduced stricter timestamp parsing, non-compliant formats like RFC 2822 / RFC 822 are rejected, causing compression errors:

```
2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.
```

You can create these issues by:

Or use the GitHub CLI from your authenticated terminal:

```shell
gh issue create --title "..." --body-file <file> --assignee junhaoliao --label <label>
```
@coderabbitai i'm sure you can create the issues. try again |
Issue 1: Documentation Update

You can create this by running from your authenticated terminal:

```shell
gh issue create \
    --repo y-scope/clp \
    --title "docs: Temporarily remove non-ISO8601 datasets from resources documentation" \
    --assignee junhaoliao \
    --label documentation \
    --body '## Background

After PR `#1788` was merged, CLP-S now uses stricter timestamp parsing that rejects non-ISO8601-compliant timestamp formats (e.g., RFC 2822 / RFC 822 formats). Attempting to compress logs with such timestamps now produces errors.

## Task

Temporarily remove datasets with non-ISO8601-compliant timestamps (e.g., the "postgresql" dataset) from the resources documentation at:

https://github.com/y-scope/clp/blob/a456dc914f26e85344e5ffe46471a45760fe804b/docs/src/user-docs/resources-datasets.md

These datasets can be restored once RFC 2822 / RFC 822 timestamp parsing support is added (see related feature request).

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`'
```

Issue 2: Feature Request

```shell
gh issue create \
    --repo y-scope/clp \
    --title "feat: Add support for parsing RFC 2822 / RFC 822 timestamp strings" \
    --assignee junhaoliao \
    --label enhancement \
    --body '## Feature Request

Add support for parsing RFC 2822 and RFC 822 timestamp formats in CLP-S timestamp parser.

## Background

Currently, CLP-S only accepts ISO8601-compliant timestamp formats. After PR `#1788` introduced stricter timestamp parsing, non-compliant formats like RFC 2822 / RFC 822 are rejected, causing compression errors.

## Use Case

Some datasets (e.g., "postgresql" logs) use RFC 2822 / RFC 822 timestamp formats. Supporting these formats would allow CLP to compress a wider range of log datasets.

## Acceptance Criteria

- CLP-S timestamp parser can parse RFC 2822 and RFC 822 timestamp strings
- Affected datasets can be restored to the resources documentation

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`'
```

Simply copy and paste these commands into your terminal (where you have GitHub authentication configured) and both issues will be created immediately! 🚀
Description
This PR integrates the new `Timestamp` column type into the archive format, deprecating `DateString` columns. The new `Timestamp` column uses `clp_s::timestamp_parser` to represent string, float, and integer timestamp literals in the same column, storing timestamps as delta-encoded nanosecond-precision epoch timestamps.

This PR deprecates the following encoding features on the write-side, but retains support for reading older archives with this content:

- `DateString` columns
- `DoubleEpoch` encoded timestamp ranges
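For readers unfamiliar with delta encoding, here is a minimal standalone sketch of the idea (illustrative only; it does not reflect the archive's actual on-disk layout):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using epochtime_t = std::int64_t;

// Consecutive nanosecond epoch timestamps are stored as differences, which
// are small and compress well for mostly-ordered log streams.
std::vector<epochtime_t> delta_encode(std::vector<epochtime_t> const& timestamps) {
    std::vector<epochtime_t> deltas;
    deltas.reserve(timestamps.size());
    epochtime_t prev{0};
    for (auto const ts : timestamps) {
        deltas.push_back(ts - prev);
        prev = ts;
    }
    return deltas;
}

// Decoding is the running sum of the deltas.
std::vector<epochtime_t> delta_decode(std::vector<epochtime_t> const& deltas) {
    std::vector<epochtime_t> timestamps;
    timestamps.reserve(deltas.size());
    epochtime_t prev{0};
    for (auto const delta : deltas) {
        prev += delta;
        timestamps.push_back(prev);
    }
    return timestamps;
}
```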
clp_s::TimestampPattern.Most of the rest of the diff in this PR comprises:
TimestampDictionaryReader/Writerto work withclp_s::timestamp_parserand the newTimestampcolumnJsonParserto handle parsing timestamps as the newTimestamptypeTimestamptypeThis change is marked breaking for the following reasons:
clp_s::timestamp_parser, which will remove this limitation.indexer, since the timestamp column in new archives should always be of typeTimestampBenchmarking results on compatible open-source datasets:
Generally, compression ratio improves (sometimes significantly), compression speed is largely unchanged, and decompression speed seems to have a strong dependence on dataset. I will investigate the datasets with slow decompression speed, but if performance work is required it will likely be part of a follow-up PR.
Note: compression speed results were very high variance, so I might re-gather them.
Checklist
breaking change.
Validation performed

- `timestamp()` expression when querying timestamp column
- `Timestamp` columns.

Summary by CodeRabbit
New Features
Behaviour
Tests
Chores