
feat(clp-s)!: Add Timestamp column type to replace DateString column type; Bump the archive version to 0.5.0. #1788

Merged
gibber9809 merged 54 commits into y-scope:main from gibber9809:clp-s-add-timestamp-column
Feb 17, 2026

Conversation

Contributor

@gibber9809 gibber9809 commented Dec 16, 2025

Description

This PR integrates the new Timestamp column type into the archive format, deprecating DateString columns. The new Timestamp column uses clp_s::timestamp_parser to represent string, float, and integer timestamp literals in the same column, storing timestamps as delta-encoded nanosecond-precision epoch timestamps.
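
To illustrate what "delta-encoded nanosecond-precision epoch timestamps" means, here is a minimal sketch. The helper names `delta_encode` and `delta_decode` are hypothetical and are not the PR's `TimestampColumnWriter` API; the sketch also assumes adjacent timestamps are close enough that their difference fits in an `int64_t`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: keeps the first timestamp as-is and stores every later
// timestamp as the difference from its predecessor.
std::vector<int64_t> delta_encode(std::vector<int64_t> const& epoch_ns) {
    std::vector<int64_t> deltas;
    deltas.reserve(epoch_ns.size());
    int64_t prev{0};
    for (auto const ts : epoch_ns) {
        deltas.push_back(ts - prev);
        prev = ts;
    }
    return deltas;
}

// Hypothetical helper: rebuilds the original timestamps with a running sum.
std::vector<int64_t> delta_decode(std::vector<int64_t> const& deltas) {
    std::vector<int64_t> epoch_ns;
    epoch_ns.reserve(deltas.size());
    int64_t cur{0};
    for (auto const delta : deltas) {
        cur += delta;
        epoch_ns.push_back(cur);
    }
    return epoch_ns;
}
```

Because log records usually arrive in near-chronological order, most deltas are small numbers, which tend to compress better than full 19-digit nanosecond values.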

This PR deprecates the following encoding features on the write-side, but retains support for reading older archives with this content:

  1. DateString columns
  2. DoubleEpoch encoded timestamp ranges

We remove most code related to the write-side for these encodings, including most of the functionality for parsing timestamps of unknown format using clp_s::TimestampPattern.

Most of the rest of the diff in this PR comprises:

  1. Substantially rewriting TimestampDictionaryReader/Writer to work with clp_s::timestamp_parser and the new Timestamp column
  2. Updating JsonParser to handle parsing timestamps as the new Timestamp type
  3. Adding/updating tests for the new Timestamp type

This change is marked breaking for the following reasons:

  1. It removes undefined behaviour in the timestamp dictionary that could previously occur when timestamps in an archive were not millisecond precision
  2. It temporarily prevents us from ingesting some datasets that we could handle before when those datasets contain timezones that require the tzdb for correct interpretation (the open source postgresql dataset, for example). Support for parsing timezones using the tzdb is planned for clp_s::timestamp_parser, which will remove this limitation.
  3. We remove the special-case indexing behaviour related to timestamps in the indexer, since the timestamp column in new archives should always be of type Timestamp

Benchmarking results on compatible open-source datasets:

| dataset | % improvement compression ratio | % improvement compression speed | % improvement decompression speed |
| --- | --- | --- | --- |
| cockroachdb | 0.29% | -9.58% | 0.15% |
| mongodb | 16.03% | 11.96% | -21.42% |
| elasticsearch | 4.64% | -1.03% | -16.16% |
| spark | -0.29% | 0.53% | 29.57% |
| geomean | 4.97% | 0.77% | -3.84% |

Generally, compression ratio improves (sometimes significantly), compression speed is largely unchanged, and decompression speed seems to have a strong dependence on dataset. I will investigate the datasets with slow decompression speed, but if performance work is required it will likely be part of a follow-up PR.

Note: compression speed results were very high variance, so I might re-gather them.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Updated search tests to use timestamp() expression when querying timestamp column
  • Added tests to validate end-to-end compression/decompression for Timestamp columns.
  • Manually validated that timeline renders correctly in webui, and timestamps associated with each record are correct.

Summary by CodeRabbit

  • New Features

    • Dedicated Timestamp type with DeprecatedDateString fallback; archives expose deprecated-format awareness.
  • Behaviour

    • Unified timestamp parsing/ingestion, storage, search and JSON emission (timestamps as literal fields); search queries accept explicit timestamp literals.
  • Tests

    • New end-to-end timestamp test, updated search tests to use timestamp literals, and added diverse timestamp test data.
  • Chores

    • Archive format bumped to 0.5.0; build now requires the timestamp parsing component.

Contributor

coderabbitai bot commented Dec 16, 2025

Walkthrough

Adds explicit Timestamp support, renames DateString → DeprecatedDateString, refactors timestamp ingestion/serialization to return (epochtime, pattern_id) via a timestamp parser, updates readers/writers/dictionaries and query/search paths, bumps archive format with a deprecated-format marker, updates CMake linkages, and adds tests and test data.
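
As a rough illustration of an ingest call that returns `(epochtime, pattern_id)`, the sketch below models a pattern dictionary that assigns IDs on first sight. `PatternDictionary` and its methods are hypothetical stand-ins, not the real `TimestampDictionaryWriter`; in particular, the caller here passes in an already-identified pattern string, whereas the PR delegates that step to `clp_s::timestamp_parser`.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

using epochtime_t = int64_t;

// Hypothetical, heavily simplified stand-in for a timestamp dictionary writer.
class PatternDictionary {
public:
    // Returns the ID for `pattern`, assigning the next free ID the first time
    // the pattern is seen.
    uint64_t get_or_add_pattern_id(std::string const& pattern) {
        auto const [it, inserted] = m_pattern_to_id.try_emplace(pattern, m_next_id);
        if (inserted) {
            ++m_next_id;
        }
        return it->second;
    }

    // Mirrors the "(epochtime, pattern_id)" return shape described above.
    std::pair<epochtime_t, uint64_t>
    ingest(epochtime_t epoch_ns, std::string const& pattern) {
        return {epoch_ns, get_or_add_pattern_id(pattern)};
    }

    // Resets the ID counter and the map together; resetting only one of them
    // could hand out an ID that is still mapped to an old pattern.
    void clear() {
        m_next_id = 0;
        m_pattern_to_id.clear();
    }

private:
    uint64_t m_next_id{0};
    std::unordered_map<std::string, uint64_t> m_pattern_to_id;
};
```

Storing a small pattern ID next to each encoded timestamp is what would let a reader format the value back into its original textual form.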

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Build / CMake | components/core/cmake/Options/options.cmake, components/core/src/clp_s/CMakeLists.txt, components/core/src/clp_s/indexer/CMakeLists.txt | Adds CLP_BUILD_CLP_S_TIMESTAMP_PARSER to validation and links clp_s::timestamp_parser for CLP‑S targets (archive reader/writer, indexer). |
| NodeType / Schema | components/core/src/clp_s/SchemaTree.hpp, components/core/src/clp_s/SchemaTree.cpp | Renames DateString → DeprecatedDateString and maps both DeprecatedDateString and Timestamp to LiteralType::TimestampT. |
| Archive Reader / Writer / Adaptor | components/core/src/clp_s/ArchiveReader.cpp, components/core/src/clp_s/ArchiveWriter.cpp, components/core/src/clp_s/ArchiveWriter.hpp, components/core/src/clp_s/ArchiveReaderAdaptor.cpp, components/core/src/clp_s/CMakeLists.txt | Replaces DateString handling with DeprecatedDateString/Timestamp; writer uses TimestampColumnWriter, reader uses TimestampColumnReader/DeprecatedDateStringColumnReader; ArchiveReader passes version flag to timestamp dictionary read; ingest calls updated to new pair-returning APIs. |
| Column Readers / Writers | components/core/src/clp_s/ColumnReader.hpp, components/core/src/clp_s/ColumnReader.cpp, components/core/src/clp_s/ColumnWriter.hpp, components/core/src/clp_s/ColumnWriter.cpp | Renames DateStringColumnReader → DeprecatedDateStringColumnReader; removes DateStringColumnWriter; refactors TimestampColumnWriter to use delta-encoded int64 storage plus separate encoding vector. |
| Timestamp Dictionary (Reader/Writer) | components/core/src/clp_s/TimestampDictionaryReader.hpp, components/core/src/clp_s/TimestampDictionaryReader.cpp, components/core/src/clp_s/TimestampDictionaryWriter.hpp, components/core/src/clp_s/TimestampDictionaryWriter.cpp | Introduces dual-path storage for deprecated vs new parser patterns; read now accepts has_deprecated_timestamp_format; renames get_string_encoding → get_deprecated_timestamp_string_encoding; adds append_timestamp_to_buffer; adds ingest_*_timestamp APIs returning (epochtime, pattern_id) and changes serialization layout. |
| Timestamp Pattern / Parser / Entry | components/core/src/clp_s/TimestampPattern.hpp, components/core/src/clp_s/TimestampPattern.cpp, components/core/src/clp_s/timestamp_parser/TimestampParser.hpp, components/core/src/clp_s/timestamp_parser/TimestampParser.cpp, components/core/src/clp_s/TimestampEntry.hpp, components/core/src/clp_s/TimestampEntry.cpp | Removes legacy known-pattern init/search helpers; adds estimate_timestamp_precision API in parser; consolidates timestamp ingestion to epoch semantics with millisecond normalization; removes double‑epoch/merge_range helpers. |
| JSON Parsing / Serialization | components/core/src/clp_s/JsonParser.cpp, components/core/src/clp_s/JsonSerializer.hpp | Detects numeric/string timestamps and routes to new ingest APIs, mapping matched nodes to Timestamp; adds JsonSerializer::Op::AddLiteralField for literal timestamp emission. |
| Schema Reader / Templates | components/core/src/clp_s/SchemaReader.cpp | Adapts column marking, template generation, and JSON emission to support Timestamp and DeprecatedDateString; uses AddLiteralField for timestamps and adjusts time-unit handling. |
| ParsedMessage | components/core/src/clp_s/ParsedMessage.hpp | Removes std::pair<uint64_t, epochtime_t> variant and the add_value(int32_t,uint64_t,epochtime_t) overload from the public variant/type. |
| Search / Query Runner | components/core/src/clp_s/search/QueryRunner.hpp, components/core/src/clp_s/search/QueryRunner.cpp | Replaces date‑string reader map with TimestampColumnReader map and a DeprecatedDateStringColumnReader*; adds evaluate_timestamp_filter and updates epoch‑date filter signatures and evaluation flow. |
| Indexer / Metadata | components/core/src/clp_s/indexer/IndexManager.cpp | Stops remapping timestamp‑authoritative leaf nodes to DateString; preserves original node types when updating metadata. |
| Archive Versioning & Init | components/core/src/clp_s/SingleFileArchiveDefs.hpp, components/core/src/clp_s/clp-s.cpp, components/core/tests/clp_s_test_utils.cpp | Bumps archive minor/patch version and adds deprecated-format marker; adds ArchiveHeader::has_deprecated_timestamp_format(); gates millisecond precision adjustments for deprecated archives; removes unconditional TimestampPattern::init() calls. |
| Tests & Test Data | components/core/tests/*, components/core/tests/test_log_files/test_timestamp.jsonl | Adds end‑to‑end timestamp test, new JSONL test data covering varied timestamp formats, and updates search tests to use timestamp("...") literals. |
| Web UI Query SQL | components/webui/client/src/pages/SearchPage/SearchState/Presto/useTimestampKeyInit/sql.ts | Adds DEPRECATED_TIMESTAMP_TYPE and includes both DeprecatedDateString and Timestamp in generated SQL (uses IN clause). |

Sequence Diagram(s)

sequenceDiagram
    participant JsonParser
    participant ArchiveWriter
    participant TimestampDictWriter as TimestampDictionaryWriter
    participant timestamp_parser as timestamp_parser

    JsonParser->>ArchiveWriter: detect timestamp field & value
    ArchiveWriter->>TimestampDictWriter: ingest_string_timestamp(key,node_id,value,is_json_literal)
    TimestampDictWriter->>timestamp_parser: parse/identify pattern
    timestamp_parser-->>TimestampDictWriter: (epochtime_t, pattern_id)
    TimestampDictWriter-->>ArchiveWriter: (epochtime_t, pattern_id)
    ArchiveWriter->>ArchiveWriter: create Timestamp node and write encoded value with pattern_id
sequenceDiagram
    participant ArchiveReader
    participant SchemaReader
    participant TimestampDictReader as TimestampDictionaryReader
    participant timestamp_parser as timestamp_parser
    participant TimestampColumnReader

    ArchiveReader->>SchemaReader: open archive
    SchemaReader->>TimestampDictReader: read(decompressor, has_deprecated_timestamp_format)
    TimestampDictReader->>timestamp_parser: construct/validate patterns
    timestamp_parser-->>TimestampDictReader: TimestampPattern objects
    SchemaReader->>TimestampColumnReader: instantiate for Timestamp column
    TimestampColumnReader->>TimestampDictReader: append_timestamp_to_buffer(epoch, pattern_id, buffer)
    TimestampDictReader->>timestamp_parser: format timestamp per pattern
    timestamp_parser-->>TimestampDictReader: formatted string
    TimestampDictReader-->>TimestampColumnReader: formatted timestamp bytes
    TimestampColumnReader-->>ArchiveReader: deliver serialized timestamp value

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.63% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title accurately summarizes the main change: adding a new Timestamp column type and deprecating DateString. The title reflects the primary objective across the extensive changeset. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@gibber9809 gibber9809 changed the title from "feat(clp-s): Add Timestamp column type; deprecate DateString column type." to "feat(clp-s)!: Add Timestamp column type; deprecate DateString column type." Jan 20, 2026
@gibber9809 gibber9809 marked this pull request as ready for review January 20, 2026 18:33
@gibber9809 gibber9809 requested a review from a team as a code owner January 20, 2026 18:33
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
components/webui/client/src/pages/SearchPage/SearchState/Presto/useTimestampKeyInit/sql.ts (1)

31-37: Update SQL query to include both deprecated and new timestamp types.

The WHERE clause filters only by the deprecated timestamp type (value 8). To support both old archives (using DeprecatedDateString) and new archives (using Timestamp), update the query to check for both enum values.

🔧 Proposed fix to query both timestamp types

After determining the enum value for NodeType::Timestamp (see verification script in previous comment), update the query:

+const NEW_TIMESTAMP_TYPE = /* value of NodeType::Timestamp */;
+
 const buildTimestampColumnsSql = (datasetName: string): string => `
     SELECT DISTINCT
         ${CLP_COLUMN_METADATA_TABLE_COLUMN_NAMES.NAME}
     FROM ${settings.SqlDbClpTablePrefix}${datasetName}_${SqlTableSuffix.COLUMN_METADATA}
-    WHERE ${CLP_COLUMN_METADATA_TABLE_COLUMN_NAMES.TYPE} = ${TIMESTAMP_TYPE}
+    WHERE ${CLP_COLUMN_METADATA_TABLE_COLUMN_NAMES.TYPE} IN (${TIMESTAMP_TYPE}, ${NEW_TIMESTAMP_TYPE})
     ORDER BY ${CLP_COLUMN_METADATA_TABLE_COLUMN_NAMES.NAME};
 `;
components/core/src/clp_s/search/QueryRunner.cpp (2)

63-95: Track all DeprecatedDateString columns, not just one.
If a legacy archive contains multiple deprecated date columns, the single pointer drops all but the last and can lead to missed matches or at() failures later. Store them in a map (like m_timestamp_readers) and iterate accordingly in wildcard/exact paths.

🛠️ Proposed fix (core change; update header + call sites)
-    m_deprecated_datestring_reader = nullptr;
+    m_deprecated_datestring_readers.clear();
...
-        } else if (auto* deprecated_date_reader
-                   = dynamic_cast<DeprecatedDateStringColumnReader*>(column_reader);
-                   nullptr != deprecated_date_reader)
-        {
-            m_deprecated_datestring_reader = deprecated_date_reader;
+        } else if (auto* deprecated_date_reader
+                   = dynamic_cast<DeprecatedDateStringColumnReader*>(column_reader);
+                   nullptr != deprecated_date_reader)
+        {
+            m_deprecated_datestring_readers.emplace(column_id, deprecated_date_reader);
         } else {

Then update wildcard/exact timestamp paths to iterate m_deprecated_datestring_readers.


1187-1219: Normalise deprecated DateString timestamps to nanoseconds before compare.
Legacy DateString values are millisecond‑precision; comparing them directly to Timestamp literals (nanoseconds) will miss matches by 1e6. Scale before calling evaluate_int_filter_core.

🛠️ Proposed fix
 bool QueryRunner::evaluate_epoch_date_filter(
         FilterOperation op,
         DeprecatedDateStringColumnReader* reader,
         std::shared_ptr<Literal>& operand
 ) {
+    constexpr int64_t cNanosecondsInMillisecond{1000 * 1000LL};
     if (FilterOperation::EXISTS == op || FilterOperation::NEXISTS == op) {
         return true;
     }
@@
-    return evaluate_int_filter_core(op, reader->get_encoded_time(m_cur_message), op_value);
+    return evaluate_int_filter_core(
+            op,
+            reader->get_encoded_time(m_cur_message) * cNanosecondsInMillisecond,
+            op_value
+    );
 }
🤖 Fix all issues with AI agents
In `@components/core/src/clp_s/ArchiveWriter.cpp`:
- Around line 333-341: Remove the unreachable break after the throw in the
NodeType::DeprecatedDateString case: locate the switch handling NodeType in
ArchiveWriter.cpp (the case labeled NodeType::DeprecatedDateString) and delete
the trailing "break;" that follows "throw OperationFailed(...)" so the case
cleanly throws without an unreachable statement; leave surrounding cases (e.g.,
NodeType::DeltaInteger and NodeType::Timestamp that append column writers)
unchanged.

In `@components/core/src/clp_s/clp-s.cpp`:
- Around line 186-192: The version comparison should mask off the development
flag so dev builds don't misclassify; change the condition that compares
archive_reader->get_header().version against cDatestringDeprecationVersion to
use (archive_reader->get_header().version &
~cArchiveFormatDevelopmentVersionFlag) < cDatestringDeprecationVersion, leaving
the existing ast::SetTimestampLiteralPrecision creation and
date_precision_pass.run(expr) call intact so pre-0.5.0 (minus dev flag) archives
still get millisecond precision normalization.

In `@components/core/src/clp_s/SchemaReader.cpp`:
- Around line 20-54: In SchemaReader::mark_column_as_timestamp, the
NodeType::Float branch casts the fractional seconds to epochtime_t before
scaling which truncates sub-second precision; modify the m_get_timestamp lambda
for the Float case to first multiply the extracted double by
cNanosecondsInSecond and then convert to epochtime_t (optionally using
std::lround or similar to preserve proper rounding) so the scaling preserves
sub‑second precision.
- Around line 495-500: The structured-template generators are dropping
NodeType::Timestamp and NodeType::DeprecatedDateString; update
generate_structured_object_template() to handle DeprecatedDateString via
JsonSerializer::Op::AddStringField and Timestamp via
JsonSerializer::Op::AddLiteralField (mirroring the root behavior in
generate_json_template()), and update generate_structured_array_template() to
emit timestamp elements by adding a new serializer operation
JsonSerializer::Op::AddLiteralValue and using it for NodeType::Timestamp while
using the existing string emission for DeprecatedDateString; ensure the new
AddLiteralValue is implemented in the JsonSerializer ops set and used where
array element literals are serialized so structured arrays no longer drop
timestamp values.

In `@components/core/src/clp_s/timestamp_parser/TimestampParser.cpp`:
- Around line 1818-1840: The function estimate_timestamp_precision negates
timestamp to compute abs_timestamp which UB for INT64_MIN; fix by computing an
unsigned absolute value instead of negating signed INT64_MIN: replace the signed
negation logic for abs_timestamp in estimate_timestamp_precision with an
unsigned conversion that handles INT64_MIN safely (e.g. convert timestamp to
uint64_t and, when negative, compute the magnitude by using -(timestamp+1) cast
to uint64_t then add 1), leaving the rest of the comparisons and returned pair
(cFactor and precision char 'N'/'C'/'L'/'E') unchanged.

In `@components/core/src/clp_s/TimestampDictionaryWriter.cpp`:
- Around line 204-208: TimestampDictionaryWriter::clear() resets m_next_id,
m_string_patterns_and_ids, and m_column_id_to_range but forgets to reset
m_numeric_pattern_to_id, which can leak numeric-pattern state and cause ID
collisions after m_next_id is reused; update the clear() implementation to also
clear m_numeric_pattern_to_id so all pattern-to-id maps are reset consistently
with m_next_id (ensure you reference and clear m_numeric_pattern_to_id alongside
m_string_patterns_and_ids and m_column_id_to_range in
TimestampDictionaryWriter::clear()).
- Around line 150-182: The multiplication epoch_timestamp = timestamp * factor
in TimestampDictionaryWriter::ingest_unknown_precision_epoch_timestamp can
overflow epochtime_t; change the calculation to use a wider intermediate type
(e.g., int128 or a 128-bit arithmetic helper) or perform a pre-multiply bounds
check against numeric_limits<epochtime_t>::min()/max() using the values from
timestamp_parser::estimate_timestamp_precision, and if the result would
overflow, log an error and fail fast (throw OperationFailed) before calling
timestamp_entry_it->second.ingest_timestamp; ensure the rest of the function
(pattern creation via m_numeric_pattern_to_id, m_next_id, and returned pattern
id) only runs when the epoch_timestamp has been validated.

In
`@components/webui/client/src/pages/SearchPage/SearchState/Presto/useTimestampKeyInit/sql.ts`:
- Around line 15-19: The SQL currently only filters for TIMESTAMP_TYPE (constant
TIMESTAMP_TYPE = 8) which maps to DeprecatedDateString; add a new constant
NEW_TIMESTAMP_TYPE = 14 (or rename TIMESTAMP_TYPE to DEPRECATED_TIMESTAMP_TYPE
for clarity) and update the query WHERE clause that references
CLP_COLUMN_METADATA_TABLE_COLUMN_NAMES.TYPE to use an IN list including both
values (e.g., IN (${TIMESTAMP_TYPE}, ${NEW_TIMESTAMP_TYPE})) so columns using
the new NodeType::Timestamp are discovered.
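
Three of the numeric issues flagged above (truncating a float timestamp before scaling, negating INT64_MIN, and overflow when scaling an epoch value) can be sketched in isolation as follows. The function names are hypothetical and this is not the code that landed in the PR; `float_seconds_to_ns` also assumes the scaled value fits in an `int64_t`.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>

constexpr int64_t cNanosecondsInSecond{1'000'000'000LL};

// Scales first and converts to an integer afterwards, so the fractional
// seconds survive. Casting first would turn 1.5 s into 1 s.
int64_t float_seconds_to_ns(double seconds) {
    return std::llround(seconds * static_cast<double>(cNanosecondsInSecond));
}

// Computes |value| as an unsigned number without ever negating INT64_MIN,
// which would be undefined behaviour for a signed 64-bit integer.
uint64_t magnitude(int64_t value) {
    if (value >= 0) {
        return static_cast<uint64_t>(value);
    }
    return static_cast<uint64_t>(-(value + 1)) + 1;
}

// Returns value * factor, or std::nullopt if the product would not fit in an
// int64_t. Only positive factors are supported in this sketch.
std::optional<int64_t> checked_scale(int64_t value, int64_t factor) {
    constexpr auto cMax = std::numeric_limits<int64_t>::max();
    constexpr auto cMin = std::numeric_limits<int64_t>::min();
    if (factor <= 0) {
        return std::nullopt;
    }
    if (value > cMax / factor || value < cMin / factor) {
        return std::nullopt;
    }
    return value * factor;
}
```

A caller could treat an empty result from `checked_scale` as the signal to log an error and fail fast, which is the behaviour the review comment asks for.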

Contributor Author

gibber9809 commented Jan 20, 2026

Besides the one presto webui change that rabbit pointed out, we likely also need to change how the webui interprets timestamps in search results -- I'll double check and change that if necessary.

@LinZhihao-723
Member

> Did some quick profiling/investigation of the cases where we have slower decompression speeds. Seems like it mostly comes from our usage of fmt::format (particularly for padded integer fields), which we suspected might happen. Replacing these usages with our own function to marshal these padded fields seems to eliminate the slowdown for the datasets where we perform poorly right now -- I'll put up a follow-up PR that improves performance for these cases once this PR is merged.

Can you create an issue to track this?
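
For context, a hand-rolled routine for padded integer fields might look like the sketch below. `append_zero_padded` is a hypothetical name; this is not the function from the follow-up PR, only an illustration of marshalling digits directly into a buffer instead of going through a general-purpose formatter.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: appends `value` to `buffer` in decimal, left-padded
// with '0' to at least `width` characters.
void append_zero_padded(std::string& buffer, uint64_t value, std::size_t width) {
    char digits[20];  // A uint64_t has at most 20 decimal digits.
    std::size_t num_digits{0};
    // Extract digits from least to most significant.
    do {
        digits[num_digits++] = static_cast<char>('0' + (value % 10));
        value /= 10;
    } while (0 != value);
    if (width > num_digits) {
        buffer.append(width - num_digits, '0');
    }
    // Emit the digits in the correct (most significant first) order.
    while (num_digits > 0) {
        buffer.push_back(digits[--num_digits]);
    }
}
```

For example, formatting the millisecond field of a timestamp would be a call like `append_zero_padded(buffer, 7, 3)`, which appends `007`.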

Member

@LinZhihao-723 LinZhihao-723 left a comment

Reviewed QueryRunner.

Comment on lines +241 to +245
if (nullptr != m_deprecated_datestring_reader) {
    if (evaluate_epoch_date_filter(op, m_deprecated_datestring_reader, literal)) {
        return true;
    }
}
Member

Since m_deprecated_datestring_reader and m_timestamp_readers are mutually exclusive, shall we add continue instead of falling back to m_timestamp_readers check?
In general, I think there might be a better way to do this (for example, using a variant). But let's defer this to a future PR...

Contributor Author

@gibber9809 gibber9809 Feb 10, 2026

Sorry, could you explain what you mean by that? I could put the iteration over the timestamp readers into the else case, but in general we can't short circuit this function because we need to go through all of the columns for the other possibly matching types.

E.g., for a query like * > 10 against the record {"ts": "1970-01-01", "int": 11}, where ts is stored as a DateString, completely short-circuiting this function after the DateString case would lead us to falsely reject this record, since we would never reach the code block below where we consider all of the primitive types.
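
The point can be modelled with a toy record type. Everything in this sketch (`Record`, `wildcard_greater_than`) is hypothetical and far simpler than the real QueryRunner, which works on column readers; it only shows why a non-matching deprecated date column must not end the evaluation early.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, flattened view of one record's columns.
struct Record {
    std::optional<int64_t> deprecated_date_ms;  // Legacy DateString column, if any.
    std::vector<int64_t> timestamps_ns;         // New Timestamp columns.
    std::vector<int64_t> ints;                  // Other integer columns.
};

// Evaluates the wildcard filter `* > operand` over every candidate column.
bool wildcard_greater_than(Record const& record, int64_t operand) {
    if (record.deprecated_date_ms.has_value()) {
        if (*record.deprecated_date_ms > operand) {
            return true;
        }
        // No early `return false` here: other column types may still match.
    } else {
        for (auto const ts : record.timestamps_ns) {
            if (ts > operand) {
                return true;
            }
        }
    }
    for (auto const value : record.ints) {
        if (value > operand) {
            return true;
        }
    }
    return false;
}
```

With the record from the example above (a date column encoded as 0 and an integer column holding 11), the function still returns true for an operand of 10 because the integer loop runs after the date check fails.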

Member

Uh right, sorry for the confusion. I think the right thing to do is probably to put the m_timestamp_readers check in an else case.

return evaluate_timestamp_filter(
expr->get_operation(),
m_datestring_readers[column_id],
m_timestamp_readers.at(column_id),
Member

I guess the upstream caller will ensure the column_id here must be a valid key in the reader map?

Contributor Author

@gibber9809 gibber9809 Feb 10, 2026

Yep, it happens in the code to prepare a given ERT to be searched.

@gibber9809
Contributor Author

@junhaoliao could you take a look at the presto webui change?

Presto/velox won't work with the v0.5.0 archives introduced by this PR until I update them to understand the new Timestamp column type, but I figured I'd try to update the presto webui code related to finding the timestamp column to work with v0.5.0 archives while I was changing other references to the DateString column. I can revert the presto webui change and put it in a separate small PR after the presto/velox changes are merged if you prefer though.

I'm fairly certain this change won't affect the rest of the webui, since I made sure the timestamp we associate with each search result as extra metadata is still millisecond precision after this change, and when I run the webui the timeline & timestamps appear correct.

@junhaoliao
Member

> @junhaoliao could you take a look at the presto webui change?
>
> Presto/velox won't work with the v0.5.0 archives introduced by this PR until I update them to understand the new Timestamp column type, but I figured I'd try to update the presto webui code related to finding the timestamp column to work with v0.5.0 archives while I was changing other references to the DateString column. I can revert the presto webui change and put it in a separate small PR after the presto/velox changes are merged if you prefer though.
>
> I'm fairly certain this change won't affect the rest of the webui, since I made sure the timestamp we associate with each search result as extra metadata is still millisecond precision after this change, and when I run the webui the timeline & timestamps appear correct.

will do. Thanks for also updating the webui part from your side!

Member

@junhaoliao junhaoliao left a comment

approved the webui changes

Member

lgtm. risk of merging is low as well - the change is self-contained and doesn't affect the rest of the WebUI.

Member

@LinZhihao-723 LinZhihao-723 left a comment

For the PR title, how about:

feat(clp-s)!: Add `Timestamp` column type to deprecate `DateString` column type; Bump the archive version to `0.5.0`.

@gibber9809 gibber9809 changed the title from "feat(clp-s)!: Add Timestamp column type; deprecate DateString column type." to "feat(clp-s)!: Add Timestamp column type to replace DateString column type; Bump the archive version to 0.5.0." Feb 17, 2026
@gibber9809 gibber9809 merged commit f82e611 into y-scope:main Feb 17, 2026
28 checks passed
Member

junhaoliao commented Feb 19, 2026

After this PR was merged, timestamp strings that are not ISO 8601–compliant (e.g., RFC 2822 / RFC 822 formats) are now rejected instead of being parsed on a best-effort basis. Attempting to compress logs with such timestamps will produce errors like:

2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp `"2023-03-27 00:26:35.719 EDT"` against known timestamp patterns.
2026-02-19T12:49:09.226+00:00 [error] FileWriter not closed before being destroyed - may cause data loss
2026-02-19T12:49:09.226+00:00 [error] FileWriter not closed before being destroyed - may cause data loss
2026-02-19T12:49:09.226+00:00 [error] FileWriter not closed before being destroyed - may cause data loss
2026-02-19T12:49:09.226+00:00 [error] Encountered error during compression - /home/junhao/workspace/2-clp/components/core/src/clp_s/TimestampDictionaryWriter.cpp:74  Error code: 16

Since none of the existing CLP users rely on non–ISO8601–compliant timestamps, and our goal is to ensure strict and accurate timestamp parsing, we will retain the current behavior of rejecting unsupported timestamp formats.
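
To make the distinction concrete, the sketch below accepts only timezone suffixes that can be interpreted without a timezone database ("Z" or a numeric "+HH:MM" / "-HH:MM" offset) and rejects abbreviations such as "EDT". The function `parse_utc_offset_seconds` is hypothetical and is not how `clp_s::timestamp_parser` is implemented; it only illustrates strict rejection instead of best-effort guessing.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the offset from UTC in seconds for "Z",
// "+HH:MM", or "-HH:MM", and std::nullopt for anything else.
std::optional<int> parse_utc_offset_seconds(std::string_view suffix) {
    if (suffix == "Z") {
        return 0;
    }
    if (6 != suffix.size() || ('+' != suffix[0] && '-' != suffix[0]) || ':' != suffix[3]) {
        return std::nullopt;
    }
    constexpr std::array<std::size_t, 4> cDigitPositions{1, 2, 4, 5};
    for (auto const pos : cDigitPositions) {
        if (0 == std::isdigit(static_cast<unsigned char>(suffix[pos]))) {
            return std::nullopt;
        }
    }
    int const hours{(suffix[1] - '0') * 10 + (suffix[2] - '0')};
    int const minutes{(suffix[4] - '0') * 10 + (suffix[5] - '0')};
    if (hours > 23 || minutes > 59) {
        return std::nullopt;
    }
    int const seconds{hours * 3600 + minutes * 60};
    return '-' == suffix[0] ? -seconds : seconds;
}
```

Under this sketch, the suffix of `2023-03-27 00:26:35.719 EDT` yields no offset, so a strict caller would report a parse failure.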

@coderabbitai please create:

  1. An issue to temporarily remove non-compliant datasets (e.g., the “postgresql” dataset) from the resources documentation:
    https://github.com/y-scope/clp/blob/a456dc914f26e85344e5ffe46471a45760fe804b/docs/src/user-docs/resources-datasets.md

  2. A separate feature request to add support for parsing RFC 2822 / RFC 822 timestamp strings, after which the affected datasets can be restored to the resources documentation.

Contributor

coderabbitai bot commented Feb 19, 2026

🧩 Analysis chain

@junhaoliao, I'll create both issues for you now.


🏁 Script executed:

#!/bin/bash

# Issue 1: Temporarily remove non-compliant datasets from documentation
gh issue create \
  --title "docs: Temporarily remove non-ISO8601 datasets from resources documentation" \
  --body "## Background

After PR `#1788` was merged, CLP-S now uses stricter timestamp parsing that rejects non–ISO8601–compliant timestamp formats (e.g., RFC 2822 / RFC 822 formats). Attempting to compress logs with such timestamps now produces errors:

\`\`\`
2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp \\"2023-03-27 00:26:35.719 EDT\\" against known timestamp patterns.
2026-02-19T12:49:09.226+00:00 [error] Encountered error during compression
\`\`\`

## Task

Temporarily remove datasets with non–ISO8601–compliant timestamps (e.g., the \"postgresql\" dataset) from the resources documentation at:
https://github.com/y-scope/clp/blob/a456dc914f26e85344e5ffe46471a45760fe804b/docs/src/user-docs/resources-datasets.md

These datasets can be restored once RFC 2822 / RFC 822 timestamp parsing support is added (see related feature request).

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Comment: https://github.com/y-scope/clp/pull/1788#discussion_r1914400000
- Requested by: `@junhaoliao`" \
  --assignee junhaoliao \
  --label documentation

# Issue 2: Feature request for RFC 2822/RFC 822 timestamp support
gh issue create \
  --title "feat: Add support for parsing RFC 2822 / RFC 822 timestamp strings" \
  --body "## Feature Request

Add support for parsing RFC 2822 and RFC 822 timestamp formats in CLP-S timestamp parser.

## Background

Currently, CLP-S only accepts ISO8601-compliant timestamp formats. After PR `#1788` introduced stricter timestamp parsing, non-compliant formats like RFC 2822 / RFC 822 are rejected, causing compression errors:

\`\`\`
2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp \\"2023-03-27 00:26:35.719 EDT\\" against known timestamp patterns.
\`\`\`

## Use Case

Some datasets (e.g., \"postgresql\" logs) use RFC 2822 / RFC 822 timestamp formats. Supporting these formats would allow CLP to compress a wider range of log datasets.

## Acceptance Criteria

- CLP-S timestamp parser can parse RFC 2822 and RFC 822 timestamp strings
- Affected datasets can be restored to the resources documentation

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Comment: https://github.com/y-scope/clp/pull/1788#discussion_r1914400000
- Requested by: `@junhaoliao`" \
  --assignee junhaoliao \
  --label enhancement

Length of output: 3173


Let me fix the script to properly handle the multi-line issue bodies:

🏁 Script executed:

#!/bin/bash

# Create temporary files for issue bodies
ISSUE1_BODY=$(mktemp)
ISSUE2_BODY=$(mktemp)

# Write Issue 1 body
cat > "$ISSUE1_BODY" << 'EOF'
## Background

After PR `#1788` was merged, CLP-S now uses stricter timestamp parsing that rejects non–ISO8601–compliant timestamp formats (e.g., RFC 2822 / RFC 822 formats). Attempting to compress logs with such timestamps now produces errors:

2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.
2026-02-19T12:49:09.226+00:00 [error] Encountered error during compression

## Task

Temporarily remove datasets with non–ISO8601–compliant timestamps (e.g., the "postgresql" dataset) from the resources documentation at:
https://github.com/y-scope/clp/blob/a456dc914f26e85344e5ffe46471a45760fe804b/docs/src/user-docs/resources-datasets.md

These datasets can be restored once RFC 2822 / RFC 822 timestamp parsing support is added (see related feature request).

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Comment: https://github.com/y-scope/clp/pull/1788#issuecomment
- Requested by: `@junhaoliao`
EOF

# Write Issue 2 body
cat > "$ISSUE2_BODY" << 'EOF'
## Feature Request

Add support for parsing RFC 2822 and RFC 822 timestamp formats in CLP-S timestamp parser.

## Background

Currently, CLP-S only accepts ISO8601-compliant timestamp formats. After PR `#1788` introduced stricter timestamp parsing, non-compliant formats like RFC 2822 / RFC 822 are rejected, causing compression errors:

2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.

## Use Case

Some datasets (e.g., "postgresql" logs) use RFC 2822 / RFC 822 timestamp formats. Supporting these formats would allow CLP to compress a wider range of log datasets.

## Acceptance Criteria

- CLP-S timestamp parser can parse RFC 2822 and RFC 822 timestamp strings
- Affected datasets can be restored to the resources documentation

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Comment: https://github.com/y-scope/clp/pull/1788#issuecomment
- Requested by: `@junhaoliao`
EOF

# Create Issue 1
echo "Creating Issue 1..."
gh issue create \
  --title "docs: Temporarily remove non-ISO8601 datasets from resources documentation" \
  --body-file "$ISSUE1_BODY" \
  --assignee junhaoliao \
  --label documentation

echo ""
echo "Creating Issue 2..."
# Create Issue 2
gh issue create \
  --title "feat: Add support for parsing RFC 2822 / RFC 822 timestamp strings" \
  --body-file "$ISSUE2_BODY" \
  --assignee junhaoliao \
  --label enhancement

# Clean up temporary files
rm -f "$ISSUE1_BODY" "$ISSUE2_BODY"

echo ""
echo "Both issues created successfully!"

Length of output: 92


Let me use Python to avoid shell escaping issues:

🏁 Script executed:

#!/bin/bash

python3 << 'PYTHON_SCRIPT'
import subprocess
import sys

# Issue 1 body
issue1_body = """## Background

After PR `#1788` was merged, CLP-S now uses stricter timestamp parsing that rejects non–ISO8601–compliant timestamp formats (e.g., RFC 2822 / RFC 822 formats). Attempting to compress logs with such timestamps now produces errors:

2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.
2026-02-19T12:49:09.226+00:00 [error] Encountered error during compression

## Task

Temporarily remove datasets with non–ISO8601–compliant timestamps (e.g., the "postgresql" dataset) from the resources documentation at:
https://github.com/y-scope/clp/blob/a456dc914f26e85344e5ffe46471a45760fe804b/docs/src/user-docs/resources-datasets.md

These datasets can be restored once RFC 2822 / RFC 822 timestamp parsing support is added (see related feature request).

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`"""

# Issue 2 body
issue2_body = """## Feature Request

Add support for parsing RFC 2822 and RFC 822 timestamp formats in CLP-S timestamp parser.

## Background

Currently, CLP-S only accepts ISO8601-compliant timestamp formats. After PR `#1788` introduced stricter timestamp parsing, non-compliant formats like RFC 2822 / RFC 822 are rejected, causing compression errors:

2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.

## Use Case

Some datasets (e.g., "postgresql" logs) use RFC 2822 / RFC 822 timestamp formats. Supporting these formats would allow CLP to compress a wider range of log datasets.

## Acceptance Criteria

- CLP-S timestamp parser can parse RFC 2822 and RFC 822 timestamp strings
- Affected datasets can be restored to the resources documentation

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`"""

try:
    # Create Issue 1
    print("Creating Issue 1...")
    result1 = subprocess.run(
        [
            "gh", "issue", "create",
            "--title", "docs: Temporarily remove non-ISO8601 datasets from resources documentation",
            "--body", issue1_body,
            "--assignee", "junhaoliao",
            "--label", "documentation"
        ],
        capture_output=True,
        text=True,
        check=True
    )
    print(result1.stdout)

    # Create Issue 2
    print("\nCreating Issue 2...")
    result2 = subprocess.run(
        [
            "gh", "issue", "create",
            "--title", "feat: Add support for parsing RFC 2822 / RFC 822 timestamp strings",
            "--body", issue2_body,
            "--assignee", "junhaoliao",
            "--label", "enhancement"
        ],
        capture_output=True,
        text=True,
        check=True
    )
    print(result2.stdout)

    print("\nBoth issues created successfully!")

except subprocess.CalledProcessError as e:
    print(f"Error: {e}", file=sys.stderr)
    print(f"stdout: {e.stdout}", file=sys.stderr)
    print(f"stderr: {e.stderr}", file=sys.stderr)
    sys.exit(1)

PYTHON_SCRIPT

Length of output: 258


Let me create the issues one at a time with simpler scripts:


🏁 Script executed:

#!/bin/bash
# Create Issue 1 only

cat > /tmp/issue1.txt << 'END'
## Background

After PR `#1788` was merged, CLP-S now uses stricter timestamp parsing that rejects non-ISO8601-compliant timestamp formats (e.g., RFC 2822 / RFC 822 formats). Attempting to compress logs with such timestamps now produces errors.

## Task

Temporarily remove datasets with non-ISO8601-compliant timestamps (e.g., the "postgresql" dataset) from the resources documentation at:
https://github.com/y-scope/clp/blob/a456dc914f26e85344e5ffe46471a45760fe804b/docs/src/user-docs/resources-datasets.md

These datasets can be restored once RFC 2822 / RFC 822 timestamp parsing support is added.

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`
END

gh issue create \
  --title "docs: Temporarily remove non-ISO8601 datasets from resources documentation" \
  --body-file /tmp/issue1.txt \
  --assignee junhaoliao \
  --label documentation

rm /tmp/issue1.txt

Length of output: 75


@junhaoliao, I apologize—I'm unable to create GitHub issues directly because the sandbox environment doesn't have GitHub authentication configured. However, I've prepared the complete issue templates below for you to create manually:


Issue 1: Documentation Update

Title: docs: Temporarily remove non-ISO8601 datasets from resources documentation

Labels: documentation

Assignee: junhaoliao

Body:

## Background

After PR `#1788` was merged, CLP-S now uses stricter timestamp parsing that rejects non–ISO8601–compliant timestamp formats (e.g., RFC 2822 / RFC 822 formats). Attempting to compress logs with such timestamps now produces errors:

2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.
2026-02-19T12:49:09.226+00:00 [error] Encountered error during compression


## Task

Temporarily remove datasets with non–ISO8601–compliant timestamps (e.g., the "postgresql" dataset) from the resources documentation at:
https://github.com/y-scope/clp/blob/a456dc914f26e85344e5ffe46471a45760fe804b/docs/src/user-docs/resources-datasets.md

These datasets can be restored once RFC 2822 / RFC 822 timestamp parsing support is added (see related feature request).

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested in: https://github.com/y-scope/clp/pull/1788#discussion_comment

Issue 2: Feature Request

Title: feat: Add support for parsing RFC 2822 / RFC 822 timestamp strings

Labels: enhancement

Assignee: junhaoliao

Body:

## Feature Request

Add support for parsing RFC 2822 and RFC 822 timestamp formats in CLP-S timestamp parser.

## Background

Currently, CLP-S only accepts ISO8601-compliant timestamp formats. After PR `#1788` introduced stricter timestamp parsing, non-compliant formats like RFC 2822 / RFC 822 are rejected, causing compression errors:

2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.


## Use Case

Some datasets (e.g., "postgresql" logs) use RFC 2822 / RFC 822 timestamp formats. Supporting these formats would allow CLP to compress a wider range of log datasets.

## Acceptance Criteria

- CLP-S timestamp parser can parse RFC 2822 and RFC 822 timestamp strings
- Affected datasets can be restored to the resources documentation

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested in: https://github.com/y-scope/clp/pull/1788#discussion_comment
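
To make the failing case in the templates above concrete, here is a minimal Python sketch of why a trailing abbreviation such as `EDT` needs extra information before the timestamp can be turned into the nanosecond-precision epoch value that the PR description says the Timestamp column stores. This is not `clp_s::timestamp_parser`; the `ABBREVIATION_OFFSETS` table and the `to_epoch_nanoseconds` helper are hypothetical, and real abbreviations can be ambiguous, which is one reason a timezone database is normally needed.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical, hand-written table for illustration only. A real parser would
# consult a timezone database, since abbreviations are not unique worldwide.
ABBREVIATION_OFFSETS = {
    "UTC": timedelta(0),
    "EST": timedelta(hours=-5),
    "EDT": timedelta(hours=-4),
}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_nanoseconds(text):
    """Convert 'YYYY-MM-DD HH:MM:SS.fff ABBR' to integer epoch nanoseconds."""
    stamp, abbreviation = text.rsplit(" ", 1)
    offset = ABBREVIATION_OFFSETS[abbreviation]  # KeyError for unknown zones
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    delta = local.replace(tzinfo=timezone(offset)) - EPOCH
    # Integer arithmetic avoids the rounding that float seconds would introduce.
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


print(to_epoch_nanoseconds("2023-03-27 00:26:35.719 EDT"))
```

Without the lookup table, the string gives no numeric UTC offset, so the sketch would have to either reject the value or guess.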

You can create these issues by:

  1. Going to https://github.com/y-scope/clp/issues/new
  2. Copy-pasting the title and body for each issue
  3. Adding the appropriate labels and assignee

Or use the GitHub CLI from your authenticated terminal:

gh issue create --title "..." --body-file <file> --assignee junhaoliao --label <label>
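
For readers who prefer to script this step, the following is a rough Python sketch that fills in the placeholders of the command above. It only builds and prints the argument list, so it is safe to run without credentials. The `build_issue_command` helper and the placeholder body text are illustrative assumptions; the `--repo`, `--title`, `--body-file`, `--assignee`, and `--label` flags are the same ones used by the scripts in this thread.

```python
import tempfile
from pathlib import Path


def build_issue_command(title, body, label, assignee="junhaoliao", repo="y-scope/clp"):
    """Write `body` to a temporary Markdown file and return (argv, body_path)."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".md", delete=False, encoding="utf-8"
    ) as body_file:
        body_file.write(body)
    argv = [
        "gh", "issue", "create",
        "--repo", repo,
        "--title", title,
        "--body-file", body_file.name,
        "--assignee", assignee,
        "--label", label,
    ]
    return argv, Path(body_file.name)


argv, body_path = build_issue_command(
    title="docs: Temporarily remove non-ISO8601 datasets from resources documentation",
    body="## Background\n\n(paste the issue body here)\n",
    label="documentation",
)
print(argv)
# From an authenticated terminal, the issue would be created with:
#   subprocess.run(argv, check=True)
body_path.unlink()  # remove the temporary body file afterwards
```

Passing the body through a file rather than `--body` sidesteps the shell-quoting problems seen in the earlier attempts.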

🧠 Learnings used
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR `#1152` discussion.

Learnt from: haiqi96
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-07-29T14:04:13.769Z
Learning: User haiqi96 requested creating a GitHub issue to document a bug fix from PR `#1136`, which addressed MySQL compatibility issues with invalid SQL CAST operations in the WebUI component.

Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR `#1125` development.

Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1283
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:206-210
Timestamp: 2025-09-04T13:51:02.311Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/compress.py, an empty string ("") is a valid value for the --timestamp-key parameter, so validation should not treat empty strings as missing values.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

@junhaoliao
Member

@coderabbitai i'm sure you can create the issues. try again

@coderabbitai
Contributor

coderabbitai bot commented Feb 24, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

@junhaoliao, you're right! Let me try again with a cleaner approach.

🏁 Script executed:

#!/bin/bash

# Create Issue 1: Documentation update
cat > /tmp/issue1_body.md << 'ISSUE1END'
## Background

After PR `#1788` was merged, CLP-S now uses stricter timestamp parsing that rejects non-ISO8601-compliant timestamp formats (e.g., RFC 2822 / RFC 822 formats). Attempting to compress logs with such timestamps now produces errors:

2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.
2026-02-19T12:49:09.226+00:00 [error] Encountered error during compression

## Task

Temporarily remove datasets with non-ISO8601-compliant timestamps (e.g., the "postgresql" dataset) from the resources documentation at:
https://github.com/y-scope/clp/blob/a456dc914f26e85344e5ffe46471a45760fe804b/docs/src/user-docs/resources-datasets.md

These datasets can be restored once RFC 2822 / RFC 822 timestamp parsing support is added (see related feature request).

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`
ISSUE1END

echo "Creating Issue 1..."
gh issue create \
  --repo y-scope/clp \
  --title "docs: Temporarily remove non-ISO8601 datasets from resources documentation" \
  --body-file /tmp/issue1_body.md \
  --assignee junhaoliao \
  --label documentation

echo ""
echo "---"
echo ""

# Create Issue 2: Feature request
cat > /tmp/issue2_body.md << 'ISSUE2END'
## Feature Request

Add support for parsing RFC 2822 and RFC 822 timestamp formats in CLP-S timestamp parser.

## Background

Currently, CLP-S only accepts ISO8601-compliant timestamp formats. After PR `#1788` introduced stricter timestamp parsing, non-compliant formats like RFC 2822 / RFC 822 are rejected, causing compression errors:

2026-02-19T12:49:09.225+00:00 [error] Failed to parse timestamp "2023-03-27 00:26:35.719 EDT" against known timestamp patterns.

## Use Case

Some datasets (e.g., "postgresql" logs) use RFC 2822 / RFC 822 timestamp formats. Supporting these formats would allow CLP to compress a wider range of log datasets.

## Acceptance Criteria

- CLP-S timestamp parser can parse RFC 2822 and RFC 822 timestamp strings
- Affected datasets can be restored to the resources documentation

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`
ISSUE2END

echo "Creating Issue 2..."
gh issue create \
  --repo y-scope/clp \
  --title "feat: Add support for parsing RFC 2822 / RFC 822 timestamp strings" \
  --body-file /tmp/issue2_body.md \
  --assignee junhaoliao \
  --label enhancement

# Clean up
rm -f /tmp/issue1_body.md /tmp/issue2_body.md

echo ""
echo "Both issues created successfully!"

Length of output: 97


@junhaoliao, let me fix the script formatting issue and try again:


🏁 Script executed:

#!/bin/bash

# Create Issue 1 body
cat > /tmp/issue1.md <<'EOF'
## Background

After PR `#1788` was merged, CLP-S now uses stricter timestamp parsing that rejects non-ISO8601-compliant timestamp formats (e.g., RFC 2822 / RFC 822 formats). Attempting to compress logs with such timestamps now produces errors.

## Task

Temporarily remove datasets with non-ISO8601-compliant timestamps (e.g., the "postgresql" dataset) from the resources documentation at:
https://github.com/y-scope/clp/blob/a456dc914f26e85344e5ffe46471a45760fe804b/docs/src/user-docs/resources-datasets.md

These datasets can be restored once RFC 2822 / RFC 822 timestamp parsing support is added (see related feature request).

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`
EOF

echo "Creating Issue 1..."
gh issue create \
  --repo y-scope/clp \
  --title "docs: Temporarily remove non-ISO8601 datasets from resources documentation" \
  --body-file /tmp/issue1.md \
  --assignee junhaoliao \
  --label documentation

echo ""
echo "================================"
echo ""

# Create Issue 2 body
cat > /tmp/issue2.md <<'EOF'
## Feature Request

Add support for parsing RFC 2822 and RFC 822 timestamp formats in CLP-S timestamp parser.

## Background

Currently, CLP-S only accepts ISO8601-compliant timestamp formats. After PR `#1788` introduced stricter timestamp parsing, non-compliant formats like RFC 2822 / RFC 822 are rejected, causing compression errors.

## Use Case

Some datasets (e.g., "postgresql" logs) use RFC 2822 / RFC 822 timestamp formats. Supporting these formats would allow CLP to compress a wider range of log datasets.

## Acceptance Criteria

- CLP-S timestamp parser can parse RFC 2822 and RFC 822 timestamp strings
- Affected datasets can be restored to the resources documentation

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`
EOF

echo "Creating Issue 2..."
gh issue create \
  --repo y-scope/clp \
  --title "feat: Add support for parsing RFC 2822 / RFC 822 timestamp strings" \
  --body-file /tmp/issue2.md \
  --assignee junhaoliao \
  --label enhancement

# Clean up
rm -f /tmp/issue1.md /tmp/issue2.md

echo ""
echo "✓ Both issues created successfully!"

Length of output: 263


@junhaoliao, I apologize for the confusion. The sandbox environment doesn't have GitHub authentication configured (no GH_TOKEN environment variable), which is why I cannot create the issues directly. However, I've prepared the complete issue templates below that you can create in seconds:


Issue 1: Documentation Update

You can create this by running from your authenticated terminal:

gh issue create \
  --repo y-scope/clp \
  --title "docs: Temporarily remove non-ISO8601 datasets from resources documentation" \
  --assignee junhaoliao \
  --label documentation \
  --body '## Background

After PR `#1788` was merged, CLP-S now uses stricter timestamp parsing that rejects non-ISO8601-compliant timestamp formats (e.g., RFC 2822 / RFC 822 formats). Attempting to compress logs with such timestamps now produces errors.

## Task

Temporarily remove datasets with non-ISO8601-compliant timestamps (e.g., the "postgresql" dataset) from the resources documentation at:
https://github.com/y-scope/clp/blob/a456dc914f26e85344e5ffe46471a45760fe804b/docs/src/user-docs/resources-datasets.md

These datasets can be restored once RFC 2822 / RFC 822 timestamp parsing support is added (see related feature request).

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`'

Issue 2: Feature Request

gh issue create \
  --repo y-scope/clp \
  --title "feat: Add support for parsing RFC 2822 / RFC 822 timestamp strings" \
  --assignee junhaoliao \
  --label enhancement \
  --body '## Feature Request

Add support for parsing RFC 2822 and RFC 822 timestamp formats in CLP-S timestamp parser.

## Background

Currently, CLP-S only accepts ISO8601-compliant timestamp formats. After PR `#1788` introduced stricter timestamp parsing, non-compliant formats like RFC 2822 / RFC 822 are rejected, causing compression errors.

## Use Case

Some datasets (e.g., "postgresql" logs) use RFC 2822 / RFC 822 timestamp formats. Supporting these formats would allow CLP to compress a wider range of log datasets.

## Acceptance Criteria

- CLP-S timestamp parser can parse RFC 2822 and RFC 822 timestamp strings
- Affected datasets can be restored to the resources documentation

## References

- PR: https://github.com/y-scope/clp/pull/1788
- Requested by: `@junhaoliao`'

Simply copy and paste these commands into your terminal (where you have GitHub authentication configured) and both issues will be created immediately! 🚀
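Because the sandbox attempt failed for lack of a token, it may help to confirm that gh can authenticate before running either command. The following is a minimal sketch, not part of the original script: `have_gh_credentials` is a hypothetical helper name, and it assumes the usual gh behaviour of reading a token from GH_TOKEN or GITHUB_TOKEN and otherwise relying on credentials stored by `gh auth login`.

```shell
#!/usr/bin/env bash
# Preflight sketch: decide whether gh can authenticate before creating issues.
# have_gh_credentials is a hypothetical helper name, not part of the gh CLI.
have_gh_credentials() {
  # gh picks up a token from GH_TOKEN or GITHUB_TOKEN when either is set.
  if [ -n "${GH_TOKEN:-}" ] || [ -n "${GITHUB_TOKEN:-}" ]; then
    return 0
  fi
  # Otherwise fall back to stored credentials, provided gh is installed at all.
  command -v gh >/dev/null 2>&1 && gh auth status >/dev/null 2>&1
}

if have_gh_credentials; then
  echo "gh is authenticated; the gh issue create commands can be run"
else
  echo "gh is not authenticated; run 'gh auth login' or export GH_TOKEN first" >&2
fi
```

If the check reports that gh is not authenticated, the `gh issue create` commands above would be expected to fail in the same way they did in the sandbox.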


🧠 Learnings used
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR `#1152` discussion.

Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR `#1125` development.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
