
docs: Remove the postgresql dataset from the resources documentation (fixes #2026). #2028

Merged
junhaoliao merged 3 commits into y-scope:main from junhaoliao:remove-postgres-dataset
Feb 26, 2026

Conversation


junhaoliao (Member) commented Feb 25, 2026

Description

Remove the postgresql dataset from the resources documentation page since it
uses non-ISO8601-compliant timestamps ("2023-03-27 00:26:35.719 EDT") that
CLP-S can no longer parse after #1788 introduced stricter timestamp parsing (ISO8601 only).

The postgresql dataset's timestamps use timezone abbreviations (e.g., EDT)
instead of numeric UTC offsets, which the new clp_s::timestamp_parser does not support. This dataset can be restored once RFC 2822 / RFC 822 timezone parsing support is added.
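The distinction can be illustrated with Python's standard library (an illustration only; CLP's parser is C++ and does not use Python). A strict ISO8601-style parser such as `datetime.fromisoformat` accepts a numeric UTC offset but rejects a zone abbreviation:

```python
from datetime import datetime, timedelta

def parse_iso8601(ts: str):
    """Return a datetime for an ISO8601 string, or None if it doesn't parse."""
    try:
        return datetime.fromisoformat(ts)
    except ValueError:
        return None

# Numeric UTC offset: ISO8601-compliant, parses fine.
ok = parse_iso8601("2023-03-27 00:26:35.719-04:00")

# Timezone abbreviation: not ISO8601, rejected.
bad = parse_iso8601("2023-03-27 00:26:35.719 EDT")
```

This mirrors the failure mode described above: the same instant is parseable when written with `-04:00` but not with `EDT`.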

All other datasets were validated and compress successfully:

  • elasticsearch: ISO8601 timestamps (2023-03-28T04:00:00.040Z) — 141.86x compression
  • cockroachdb: Epoch float timestamps (1679711330.570420890) — 27.67x
    compression
  • mongodb: ISO8601 timestamps in MongoDB extended JSON ({"$date":"2023-03-21T23:34:54.576-04:00"}) — 231.02x compression
  • spark-event-logs: Integer epoch timestamps (1633683661085) — 7.03x
    compression
  • hive-24hr (text): Compresses with --unstructured — 30.33x compression
  • openstack-24hr (text): Compresses with --unstructured — 61.72x
    compression

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and it has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

1. Build the CLP package from a clean state

Task: Build the CLP package to test dataset compression.

Command:

task

Output (last lines):

task: [package] echo '0.9.1-dev' > '/home/junhao/workspace/8-clp/build/clp-package/VERSION'

2. Verify postgresql fails to compress (the removed dataset)

Task: Confirm that the postgresql dataset fails with the stricter timestamp
parser.

Commands:

cd build/clp-package
./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key timestamp ~/samples/postgresql/postgresql.log

Output:

2026-02-24T22:08:25.222 INFO [compress] Compression job 1 submitted.
2026-02-24T22:08:26.224 ERROR [compress] Compression failed. One or more compression tasks failed. See the error log at 'user/job_1_task_errors.txt' inside your configured logs directory (`logs_directory`) for more details.

Error log (compression-job-1-task-1-stderr.log):

2026-02-24T22:08:25.406+00:00 [error] Failed to parse timestamp `"2023-03-27 00:26:35.719 EDT"` against known timestamp patterns.
2026-02-24T22:08:25.406+00:00 [error] Encountered error during compression - /home/junhao/workspace/8-clp/components/core/src/clp_s/TimestampDictionaryWriter.cpp:74  Error code: 16

Explanation: The postgresql dataset uses timezone abbreviations (EDT) in its
timestamps, which are not ISO8601-compliant and cannot be parsed by the new
clp_s::timestamp_parser.
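For context on the proposed fix (restoring the dataset once RFC 2822 / RFC 822 timezone parsing is supported): RFC 2822 dates do map common North-American zone abbreviations such as EDT to numeric offsets. Python's `email.utils` demonstrates those semantics; this illustrates the RFC, not any CLP API:

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# RFC 2822 dates may carry a zone abbreviation; EDT resolves to -04:00.
dt = parsedate_to_datetime("Mon, 27 Mar 2023 00:26:35 EDT")
offset = dt.utcoffset()
```

Once equivalent abbreviation-to-offset handling exists in clp_s, timestamps like the postgresql dataset's could be resolved to unambiguous UTC offsets.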

3. Verify elasticsearch compresses successfully

Task: Confirm that elasticsearch (ISO8601 @timestamp field) compresses
without issues.

Commands:

./sbin/stop-clp.sh
# Clean data between tests
rm -rf var/data var/log
./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key '@timestamp' ~/samples/elasticsearch/elasticsearch.log

Output:

2026-02-24T22:09:22.528 INFO [compress] Compression job 1 submitted.
2026-02-24T22:09:40.562 INFO [compress] Compressed 7.98GB into 57.58MB (141.86x). Speed: 454.62MB/s.
2026-02-24T22:09:41.063 INFO [compress] Compression finished.
2026-02-24T22:09:41.063 INFO [compress] Compressed 7.98GB into 57.58MB (141.86x). Speed: 449.45MB/s.

4. Verify spark-event-logs compresses successfully

Task: Confirm that spark-event-logs (integer epoch Timestamp field)
compresses without issues.

Commands:

./sbin/stop-clp.sh && rm -rf var/data var/log && ./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key Timestamp ~/samples/spark-event-logs/app-20211007095008-0000

Output:

2026-02-24T22:10:28.780 INFO [compress] Compression job 1 submitted.
2026-02-24T22:10:31.284 INFO [compress] Compression finished.
2026-02-24T22:10:31.284 INFO [compress] Compressed 234.16KB into 33.32KB (7.03x). Speed: 108.74KB/s.

5. Verify cockroachdb compresses successfully

Task: Confirm that cockroachdb (epoch float timestamp field) compresses
without issues.

Commands:

./sbin/stop-clp.sh && rm -rf var/data var/log && ./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key timestamp ~/samples/cockroachdb/cockroach.node1.log

Output:

2026-02-24T22:11:17.660 INFO [compress] Compression job 1 submitted.
2026-02-24T22:11:54.234 INFO [compress] Compression finished.
2026-02-24T22:11:54.234 INFO [compress] Compressed 9.79GB into 362.07MB (27.67x). Speed: 275.65MB/s.

6. Verify mongodb compresses successfully

Task: Confirm that mongodb (ISO8601 t.$date field in MongoDB extended
JSON) compresses without issues.

Commands:

./sbin/stop-clp.sh && rm -rf var/data var/log && ./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key 't.$date' ~/samples/mongodb/mongod.log.2023-03-22T03-45-46

Output:

2026-02-24T22:14:47.841 INFO [compress] Compression job 1 submitted.
2026-02-24T22:14:49.343 INFO [compress] Compression finished.
2026-02-24T22:14:49.343 INFO [compress] Compressed 256.00MB into 1.11MB (231.02x). Speed: 197.28MB/s.

7. Verify hive-24hr (text) compresses successfully

Task: Confirm that hive-24hr (unstructured text) compresses without issues.

Commands:

./sbin/stop-clp.sh && rm -rf var/data var/log && ./sbin/start-clp.sh
./sbin/compress.sh --unstructured ~/samples/hive-24hr/i-00c90a0f/

Output:

2026-02-24T22:16:37.984 INFO [compress] Compression job 1 submitted.
2026-02-24T22:16:39.487 INFO [compress] Compression finished.
2026-02-24T22:16:39.487 INFO [compress] Compressed 9.68MB into 326.74KB (30.33x). Speed: 9.54MB/s.

8. Verify openstack-24hr (text) compresses successfully

Task: Confirm that openstack-24hr (unstructured text) compresses without
issues.

Commands:

./sbin/stop-clp.sh && rm -rf var/data var/log && ./sbin/start-clp.sh
./sbin/compress.sh --unstructured ~/samples/openstack-24hr/openstack-151/logs/c-vol.log.2016-05-08-072252

Output:

2026-02-24T22:18:43.508 INFO [compress] Compression job 1 submitted.
2026-02-24T22:18:44.511 INFO [compress] Compression finished.
2026-02-24T22:18:44.511 INFO [compress] Compressed 4.99MB into 82.87KB (61.72x). Speed: 9.10MB/s.

9. Verify docs build without warnings

Task: Ensure the updated documentation builds cleanly.

Command:

task docs:serve

Output:

build succeeded.

Summary by CodeRabbit

  • Documentation
    • Removed PostgreSQL dataset entry from the resources documentation.

junhaoliao requested a review from a team as a code owner on February 25, 2026 07:05

coderabbitai bot commented Feb 25, 2026

No actionable comments were generated in the recent review. 🎉

Walkthrough

This pull request removes the PostgreSQL dataset entry from the datasets reference table in the documentation, including its associated footnote citation. The change involves deleting three lines from a single documentation file without altering any other content or logic.

Changes

  • Cohort: Documentation Update
    File(s): docs/src/user-docs/resources-datasets.md
    Summary: Removed PostgreSQL dataset row from the datasets table and deleted the
    corresponding footnote reference [8].



gibber9809 (Contributor) left a comment


LGTM. PR title seems fine as well.

junhaoliao merged commit 8efbadd into y-scope:main on Feb 26, 2026
8 of 10 checks passed
junhaoliao added this to the February 2026 milestone on Feb 26, 2026