Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(clp-s)!: Add support for ingesting auto-generated kv-pairs from kv-ir; Add support for searching auto-generated kv-pairs. #731

Merged
merged 14 commits into from
Mar 7, 2025

Conversation

gibber9809
Copy link
Contributor

@gibber9809 gibber9809 commented Feb 25, 2025

Description

This PR adds support for ingesting and searching on auto-generated keys in kv-ir. Broadly this encompasses the following aspects:

  1. Traverse the auto-generated kv-pairs and auto-generated schema tree while ingesting kv-ir and add it to a namespaced subtree of the archive schema tree
  2. Add support for specifying and escaping namespaces when parsing kql column descriptors
  3. Respect namespaces in column resolution/search
  4. Respect namespaces in projection
  5. Respect namespaces in timestamp matching

Note that this is technically a breaking change because of item 2. In particular archives that contain an authoritative timestamp beginning with an '@' will incorrectly have that '@' interpreted as indicating a namespace per the new parsing rules. We do increment the minor version in this PR, and we could use that to add special casing when loading timestamp dictionaries from older archive versions, but that special casing has not been included in this PR.

The namespacing is achieved by dividing the schema tree as follows:

        /""<metadata>-...
-1:root--"@"<object>-... autogen subtree namespace distinguished by "@" name
        \""<object>-...  regular object subtree with "" name

Column resolution (and similar bits of code) have been changed to resolve a column against either the "@" object subtree or the default "" object subtree depending on the namespace associated with the column. One notable exception is that pure wildcards "*" currently do not respect namespaces: specifically *: value and @*: value will search for any column that matches the value (as long as it isn't in the NodeType::Metadata subtree). It would be possible to get pure wildcards to respect namespaces, but it would require turning a lot of linear scans into tree traversals and using more memory to explicitly track sets of matching nodes, and might not be a necessary feature anyway.

For decompression to JSON we simply marshal only the "" object subtree. We could probably add support for merging the "@" and "" namespace during JSON decompression in one way or another, but that is left as future work.

Once we add support for decompression to kv-ir we will simply need to marshal the "" subtree into the user-generated keys and the "@" subtree into the auto-generated keys.

Once again these new features appear to have slightly decreased ingestion performance when specifying timestamp keys. The results are as follows:

Dataset (new compression time) / (old compression time)
cockroach 1.080
mongod 1.062
elasticsearch 1.017
postgresql 1.058

There don't seem to be any noticeable new hotspots when I run these benchmarks using perf, but I will try to double-check my benchmarking methodology and continue looking into improving performance during the review process.

Checklist

  • [x ] The PR satisfies the contribution guidelines.
  • [x ] This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Added tests for new namespacing and namespace escaping rules
  • Manually validated compression/search/specifying timestamp in auto-generated sub-tree

Summary by CodeRabbit

  • New Features
    • Added support for querying auto-generated key–value pairs with namespace indicators.
  • Improvements
    • Enhanced timestamp and namespace handling improves search accuracy and overall data mapping.
    • Refined query parsing now supports additional special characters for more precise filtering.
  • Documentation
    • Updated the user guide to include a new section on querying auto-generated key–value pairs and guidelines for escaping special characters.

Copy link
Contributor

coderabbitai bot commented Feb 25, 2025

Walkthrough

The changes introduce namespace awareness across several core components. A new timestamp namespace variable is added to the ArchiveWriter, with corresponding logic changes in its methods. Multiple modules are updated to accept and propagate a descriptor namespace, including JSON parsing, schema management, and tokenization routines. Method signatures and constructors are modified to include additional parameters for namespace handling. The changes also update the archive version, adjust error handling, and enhance tests and documentation to validate the new namespace functionalities.

Changes

File(s) Change Summary
components/.../ArchiveWriter.cpp
components/.../ArchiveWriter.hpp
Added new member variable m_authoritative_timestamp_namespace; updated open, close, and add_node methods to incorporate namespace checks in timestamp handling.
components/.../JsonParser.cpp
components/.../JsonParser.hpp
Enhanced timestamp key parsing; added template parameter autogen; introduced m_timestamp_namespace; updated mappings and method signatures for auto-generated versus user-generated nodes.
components/.../SchemaReader.cpp Modified serializer initialization to check for a valid subtree root using a namespace (via constants::cDefaultNamespace) before generating a JSON template; added new includes.
components/.../SchemaTree.cpp
components/.../SchemaTree.hpp
Replaced single m_object_subtree_id with a map m_namespace_to_object_subtree_id; removed get_object_subtree_node_id() in favour of get_object_subtree_node_id_for_namespace(std::string const&).
components/.../SingleFileArchiveDefs.hpp Updated archive minor version from 2 to 3.
components/.../TimestampDictionaryReader.cpp
components/.../TimestampDictionaryReader.hpp
Introduced new variable descriptor_namespace in tokenization; changed the type of m_authoritative_timestamp_tokenized_column from an optional vector to an optional pair (tokens, namespace); updated getter signature accordingly.
components/.../Utils.cpp
components/.../Utils.hpp
Modified the tokenize_column_descriptor function to include a new parameter for descriptor_namespace; added handling for special character '@' in the unescape routine.
components/.../archive_constants.hpp Added two new constants: cAutogenNamespace (value "@") and cDefaultNamespace (empty string) to support namespace logic.
components/.../clp-s.cpp Updated the search_archive function to pass the descriptor_namespace to both the tokenization function and the ColumnDescriptor::create_from_escaped_tokens factory method.
components/.../indexer/IndexManager.cpp Revised schema tree traversal by using get_object_subtree_node_id_for_namespace with the default namespace; added early return if an invalid subtree root (i.e. -1) is encountered.
components/.../search/AddTimestampConditions.cpp
components/.../search/AddTimestampConditions.hpp
Updated the type of timestamp_column from an optional vector to an optional pair (vector, string); adjusted constructor and usage of ColumnDescriptor::create_from_escaped_tokens to reflect the change in expected input.
components/.../search/ColumnDescriptor.cpp
components/.../search/ColumnDescriptor.hpp
Removed the single-token constructor; updated constructors and factory methods to accept an additional descriptor_namespace parameter; added getter (get_namespace) and setter (set_namespace) for namespace management.
components/.../search/Projection.cpp Modified the resolve_column method to retrieve the object subtree node ID using the column’s namespace with get_object_subtree_node_id_for_namespace and added a check for an invalid node ID.
components/.../search/SchemaMatch.cpp Adjusted column mapping logic to use namespace-aware retrieval for the subtree node ID and passed the column's namespace to the ColumnDescriptor::create_from_descriptors method.
components/.../search/kql/Kql.g4 Updated the SPECIAL_CHARACTER fragment by adding the '@' character to the list of recognized special characters.
components/.../search/kql/kql.cpp Modified the prepend_column method and column descriptor creation to accept and enforce namespace handling; updated function signatures to require shared pointers for descriptors and included namespace validation.
components/tests/test-kql.cpp Added assertions to verify proper handling of column namespaces in various KQL filter expressions, including tests for both empty and '@' namespaces.
docs/.../reference-json-search-syntax.md Introduced a new documentation section on querying auto-generated kv-pairs and updated escaping rules to include the '@' character (when it appears as the first character in a key token).

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant AW as ArchiveWriter
    participant Options

    Client->>AW: open(options)
    AW->>Options: Extract m_authoritative_timestamp_namespace
    AW->>AW: Initialize timestamp and namespace members
    Client->>AW: add_node(parent_node_id, node, key)
    AW->>AW: Evaluate node conditions using namespace & update matched prefix
    Client->>AW: close()
    AW->>AW: Clear timestamp and namespace members
Loading
sequenceDiagram
    participant User
    participant KQL as QueryParser
    participant Util as StringUtils::tokenize_column_descriptor
    participant CD as ColumnDescriptor

    User->>KQL: Submit query with column descriptor
    KQL->>Util: Tokenize descriptor (include descriptor_namespace)
    Util-->>KQL: Return tokens and descriptor_namespace
    KQL->>CD: Create ColumnDescriptor with tokens & namespace
    CD-->>KQL: Return descriptor with namespace set
    KQL->>User: Return processed query result
Loading

Possibly related PRs

Suggested reviewers

  • wraymo
  • kirkrodrigues

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@gibber9809 gibber9809 marked this pull request as ready for review February 26, 2025 19:47
@gibber9809 gibber9809 requested review from wraymo and a team as code owners February 26, 2025 19:47
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (9)
docs/src/user-guide/reference-json-search-syntax.md (1)

64-81: Add language specifiers to code blocks

The new section on querying auto-generated kv-pairs is well-documented and provides clear examples. However, the fenced code blocks at lines 69 and 78 should have language specifiers for better syntax highlighting and consistency with other code blocks in the document.

-```
+```text
 @key: value

- +text
@key: value

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

69-69: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


78-78: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

components/core/src/clp_s/SchemaTree.hpp (2)

131-143: Namespace-based subtree retrieval
This new method looks sound for retrieving the object subtree ID by namespace. However, if performance is paramount, consider using absl::flat_hash_map for constant-time lookups.


194-194: Storing namespace to object subtree mapping
Using a std::map<std::string, int32_t> is acceptable, but it might be beneficial to unify data structures in the codebase by using a hashed map (absl::flat_hash_map) consistent with other areas.

components/core/src/clp_s/search/AddTimestampConditions.hpp (1)

26-26: Consider using a struct instead of std::pair.
While this approach is valid, using a struct could improve clarity by making the namespace and token vector more descriptive.

components/core/src/clp_s/search/SchemaMatch.cpp (3)

79-93: Address the FIXME comment when feasible.
The additional commentary about overhead suggests you intend to refactor this code path. If you need help, let us know.


123-125: Maintain namespace consistency in wildcard expansions.
A TODO is present for namespace-aware wildcard expansion. Let me know if you'd like further assistance implementing this enhancement.


142-147: Consider adding a log statement when no subtree root is found.
Early return on -1 is sensible. For completeness, consider logging a message to help debugging.

components/core/src/clp_s/JsonParser.cpp (1)

772-871: Refined subtree parsing for auto-generated and user-generated pairs.
The logic systematically handles each NodeType, populating the parsed message. Consider extracting repeated blocks for simpler maintainability.

components/core/src/clp_s/search/kql/kql.cpp (1)

201-203: Wildcard descriptor namespace hint.
The TODO suggests potential future expansion to match all namespaces.

Could offer a specialized function to handle wildcard namespaces if that becomes necessary.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7202665 and 0435b7d.

📒 Files selected for processing (25)
  • components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (2 hunks)
  • components/core/src/clp_s/JsonParser.cpp (11 hunks)
  • components/core/src/clp_s/JsonParser.hpp (4 hunks)
  • components/core/src/clp_s/SchemaReader.cpp (2 hunks)
  • components/core/src/clp_s/SchemaTree.cpp (1 hunks)
  • components/core/src/clp_s/SchemaTree.hpp (3 hunks)
  • components/core/src/clp_s/SingleFileArchiveDefs.hpp (1 hunks)
  • components/core/src/clp_s/TimestampDictionaryReader.cpp (2 hunks)
  • components/core/src/clp_s/TimestampDictionaryReader.hpp (3 hunks)
  • components/core/src/clp_s/Utils.cpp (2 hunks)
  • components/core/src/clp_s/Utils.hpp (1 hunks)
  • components/core/src/clp_s/archive_constants.hpp (2 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
  • components/core/src/clp_s/indexer/IndexManager.cpp (2 hunks)
  • components/core/src/clp_s/search/AddTimestampConditions.cpp (1 hunks)
  • components/core/src/clp_s/search/AddTimestampConditions.hpp (3 hunks)
  • components/core/src/clp_s/search/ColumnDescriptor.cpp (3 hunks)
  • components/core/src/clp_s/search/ColumnDescriptor.hpp (2 hunks)
  • components/core/src/clp_s/search/Projection.cpp (1 hunks)
  • components/core/src/clp_s/search/SchemaMatch.cpp (4 hunks)
  • components/core/src/clp_s/search/kql/Kql.g4 (1 hunks)
  • components/core/src/clp_s/search/kql/kql.cpp (6 hunks)
  • components/core/tests/test-kql.cpp (7 hunks)
  • docs/src/user-guide/reference-json-search-syntax.md (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • components/core/src/clp_s/SingleFileArchiveDefs.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp_s/search/Projection.cpp
  • components/core/src/clp_s/SchemaReader.cpp
  • components/core/src/clp_s/TimestampDictionaryReader.cpp
  • components/core/src/clp_s/archive_constants.hpp
  • components/core/src/clp_s/SchemaTree.cpp
  • components/core/src/clp_s/ArchiveWriter.hpp
  • components/core/src/clp_s/indexer/IndexManager.cpp
  • components/core/src/clp_s/search/AddTimestampConditions.cpp
  • components/core/tests/test-kql.cpp
  • components/core/src/clp_s/Utils.hpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/search/AddTimestampConditions.hpp
  • components/core/src/clp_s/SchemaTree.hpp
  • components/core/src/clp_s/ArchiveWriter.cpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/src/clp_s/search/SchemaMatch.cpp
  • components/core/src/clp_s/TimestampDictionaryReader.hpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/search/kql/kql.cpp
  • components/core/src/clp_s/search/ColumnDescriptor.hpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/search/ColumnDescriptor.cpp
🧠 Learnings (1)
components/core/src/clp_s/archive_constants.hpp (1)
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/reference-json-search-syntax.md

69-69: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


78-78: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (79)
components/core/src/clp_s/search/kql/Kql.g4 (1)

90-91: KQL grammar now recognizes '@' as a special character

The addition of '@' to the list of special characters in the KQL grammar aligns with the PR objective to support auto-generated key namespaces. This change means the '@' character now needs to be escaped in KQL queries.

Note that this is a breaking change, as mentioned in the PR objectives, which will affect any queries containing unescaped '@' characters, particularly those with timestamps beginning with '@'.

components/core/src/clp_s/SchemaTree.cpp (1)

20-22: Enhancement to support namespace-aware object subtrees

The change replaces a direct subtree ID with a map-based approach that associates namespaces with their corresponding subtree IDs. This implementation aligns with the PR objective to support auto-generated keys in a namespaced subtree.

components/core/src/clp_s/indexer/IndexManager.cpp (3)

103-108: Updated schema tree traversal to be namespace-aware

The code now properly retrieves the object subtree root using the namespace-aware method, with appropriate error handling to check for invalid subtree roots. This change is consistent with the PR's objective to support namespaced keys.

Note the use of the default namespace constant from archive_constants.hpp, which helps maintain consistency across the codebase.


113-113: Schema traversal now uses namespace-specific subtree root

Updated code to use the newly retrieved namespace-specific subtree root for traversal, maintaining compatibility with the namespace-aware changes throughout the codebase.


5-5:

✅ Verification successful

Verify if the string include is necessary

The added include for <string> might be redundant if string is already included through other headers in the file.


🏁 Script executed:

#!/bin/bash
# Check if string is already included through other headers
rg -A 3 "^#include" components/core/src/clp_s/indexer/IndexManager.hpp

Length of output: 260


Action: Retain the <string> include as it is required

After verifying that the header file (components/core/src/clp_s/indexer/IndexManager.hpp) does not include <string>—and noting that none of the other includes in the header provide it—the explicit inclusion of <string> in the corresponding implementation file (IndexManager.cpp) remains necessary. Assuming that IndexManager.cpp utilises std::string (or related functionality), this include is not redundant.

components/core/src/clp_s/archive_constants.hpp (2)

5-5: Added required header for std::string_view

The include for string_view has been added to support the new namespace constants.


32-33: New constants for namespace support

Added two well-named constants that clearly define the namespace structure: auto-generated keys use the '@' prefix, while regular objects use an empty string as the namespace. This implementation aligns perfectly with the PR objectives.

These constants will help maintain consistency in how namespaces are handled throughout the codebase, particularly for the distinction between auto-generated subtrees and regular object subtrees.

docs/src/user-guide/reference-json-search-syntax.md (1)

193-193: LGTM! Good documentation update

The addition of '@' to the list of characters that require escaping is appropriate and aligns with the new functionality for namespace handling.

components/core/src/clp_s/search/AddTimestampConditions.cpp (1)

21-24: LGTM! Updated for namespace support

The method call has been properly updated to pass both the tokens and namespace to the create_from_escaped_tokens method, supporting the new namespace functionality.

components/core/src/clp_s/search/Projection.cpp (1)

55-58: LGTM! Good namespace handling and error check

The modification correctly uses the column's namespace to find the appropriate subtree node. The added early return check for invalid node IDs is a good defensive programming practice.

components/core/src/clp_s/ArchiveWriter.hpp (2)

29-29: LGTM! New field for namespace support

The addition of the authoritative_timestamp_namespace field to the ArchiveWriterOption struct is appropriate for supporting the new namespace functionality.


248-248: LGTM! Member variable for namespace support

The addition of the m_authoritative_timestamp_namespace member variable aligns with the changes to the ArchiveWriterOption struct and supports the namespace functionality.

components/core/src/clp_s/SchemaReader.cpp (3)

4-4: Include for string is now required for std::string usage.

The addition of the <string> header is appropriate as it's now needed for the std::string type used in the updated code.


6-6: Include added for archive constants.

Good addition of the archive_constants.hpp header which provides the namespace constants needed below.


573-581: Improved conditional check to use namespace-aware schema tree access.

The code has been updated to utilize namespace awareness by checking for a specific namespace (constants::cDefaultNamespace) in the schema tree rather than simply checking if the tree is empty. This aligns with the PR objective of adding support for ingesting and searching on auto-generated keys.

components/core/tests/test-kql.cpp (8)

52-52: Added important namespace verification for wildcard key queries.

The test now verifies that pure wildcard key queries correctly have empty namespaces.


70-70: Added namespace verification for basic filters.

Ensures that basic KQL filters have empty namespaces as expected.


89-89: Added namespace verification for NOT filters.

Confirms that NOT filters have empty namespaces, maintaining consistent behavior.


138-138: Added namespace verification for AND expressions.

Ensures that columns in AND expressions maintain empty namespaces.


189-189: Added namespace verification for OR expressions.

Verifies that columns in OR expressions correctly have empty namespaces.


273-273: Added namespace verification for escape sequences in values.

Tests namespace handling for cases with escape sequences in values.


279-294: Added new test section for @ namespace in columns.

This excellent new test case verifies that columns with @ prefix are correctly recognized as being in the "@" namespace, which is a key feature of this PR.


296-311: Added test for escaped @ symbol in column names.

This test ensures that when @ is escaped in a column name, it is treated as part of the column name rather than a namespace designator. Good edge case coverage.

components/core/src/clp_s/Utils.hpp (1)

246-258: Updated method signature to support namespace extraction.

The tokenize_column_descriptor method signature has been updated to include a new descriptor_namespace parameter that will allow the method to return the namespace information along with the tokenized column.

This change supports the PR objective of introducing functionality for specifying and escaping namespaces when parsing KQL column descriptors.

components/core/src/clp_s/TimestampDictionaryReader.cpp (3)

19-19: Added variable to capture namespace information.

Added a new variable to store the namespace part when tokenizing timestamp column descriptors.


24-30: Updated method call to pass the namespace parameter.

The call to StringUtils::tokenize_column_descriptor has been updated to pass the new namespace parameter. The formatting change also improves readability.


41-41: Updated timestamp column storage to include namespace information.

The authoritative timestamp column now stores both the tokenized column and its namespace. This is necessary to support namespace-aware timestamp functionality.

components/core/src/clp_s/ArchiveWriter.cpp (4)

22-22: Add new archival timestamp namespace
No concerns. This assignment cleanly captures the new configuration from the option object.


120-120: Clearing the new namespace
Good to see the new member being cleared on close, consistent with the rest of the class’s reset logic.


244-245: Enhanced condition
The condition includes checking the timestamp size against the prefix length. Confirm that the extra comparison is correct for skipping zero-length or partial matches.


248-249: Matching root node namespace
This newly introduced check ensures that if the parent node is the root node, the code compares the key to the authoritative timestamp namespace. The logic is sound but verify the usage of string_view for exact matching.

components/core/src/clp_s/SchemaTree.hpp (1)

5-5: New header inclusion
The new <map> header is used for m_namespace_to_object_subtree_id. No issues identified.

components/core/src/clp_s/Utils.cpp (3)

533-534: Additional param for descriptor namespace
Introducing descriptor_namespace as a reference parameter is a good approach for capturing the extracted namespace while tokenizing. No issues spotted.


537-549: Namespace extraction logic
The new logic checks if the descriptor begins with @ and sets the namespace. Ensure that descriptor_namespace is validated if future code expects more advanced namespaces.


781-782: Support for unescaping '@'
Allowing the '@' character in unescape_kql_internal fosters the new namespace syntax. Looks correct.

components/core/src/clp_s/search/AddTimestampConditions.hpp (2)

6-6: Include standard library header for std::pair.
No concerns here—this addition is necessary for referencing std::pair.


42-42: Private member aligns with constructor usage.
This field consistently reflects the new timestamp column approach.

components/core/src/clp_s/clp-s.cpp (2)

205-211: Validate handling of empty namespace scenarios.
Adding descriptor_namespace is consistent with the new requirement. Double-check that an empty namespace doesn't cause mismatches or unexpected behaviour.


216-219: Namespace-aware column creation is correct.
Invoking create_from_escaped_tokens with both tokens and namespace correctly matches the new signature.

components/core/src/clp_s/search/SchemaMatch.cpp (2)

94-97: Ensure no risk of infinite loop in the while condition.
Although the preceding logic appears to handle -1 subtree roots, confirm that this sequence always terminates, particularly in edge cases.


106-109: Use of create_from_descriptors with namespace is correct.
No issues found—this properly matches the updated constructor.

components/core/src/clp_s/TimestampDictionaryReader.hpp (3)

6-6: Added appropriate header for std::pair usage.

The inclusion of <utility> is necessary to support the new pair type used in the timestamp column representation.


56-59: Updated return type to include namespace information.

The method signature has been properly updated to return a pair containing both the tokenized column and its namespace, aligning with the namespace support being added throughout the codebase.


75-76: Enhanced member variable to store namespace information.

The member variable type has been correctly updated to store both the tokenized column data and its associated namespace, which is essential for the namespace-aware functionality being implemented.

components/core/src/clp_s/JsonParser.hpp (6)

130-140: Added template parameter to support auto-generated subtrees.

The autogen template parameter clearly indicates whether the node is part of an auto-generated subtree, which aligns with the PR objective of supporting auto-generated keys in kv-ir.


147-156: Template parameter consistently added to node ID resolution method.

The consistent addition of the autogen template parameter to this method ensures proper handling of auto-generated nodes throughout the codebase.


158-169: Added helper method to process subtrees based on their generation type.

The new templated method parse_kv_log_event_subtree nicely encapsulates the logic for processing different types of subtrees, improving code organization and reusability.


174-174: Updated method signature for consistency.

The method signature now explicitly uses the fully qualified name for the parameter type, which improves code clarity and consistency.


222-222: Added namespace support for timestamps.

The new member variable m_timestamp_namespace supports the namespace handling functionality for timestamps, which is a key part of this PR.


234-235: Added separate mapping for auto-generated nodes.

This new hash map properly separates the auto-generated node mappings from regular nodes, ensuring they can be managed independently.

components/core/src/clp_s/search/ColumnDescriptor.cpp (4)

25-36: Updated constructor to support namespaces.

The constructor now properly initializes the namespace member variable, ensuring that column descriptors are namespace-aware.

🧰 Tools
🪛 Cppcheck (2.10-2)

[performance] 31-31: Variable 'm_descriptors' is assigned in constructor body. Consider performing initialization in initialization list.

(useInitializationList)


38-49: Consistently updated second constructor for namespace support.

The descriptor-based constructor has also been updated to handle namespaces, maintaining consistency across the class interface.

🧰 Tools
🪛 Cppcheck (2.10-2)

[performance] 44-44: Variable 'm_descriptors' is assigned in constructor body. Consider performing initialization in initialization list.

(useInitializationList)


51-56: Updated factory method to propagate namespace information.

The factory method now correctly passes the namespace parameter to the constructor, ensuring namespaces are preserved when creating column descriptors from tokens.


58-64: Factory method for descriptors consistently updated.

The second factory method has also been updated to include namespace support, ensuring consistency across the API.

components/core/src/clp_s/search/ColumnDescriptor.hpp (5)

132-135: Updated factory method declaration to include namespace parameter.

The declaration has been properly updated to match the implementation in the cpp file, ensuring interface consistency.


143-146: Consistently updated second factory method declaration.

Both factory methods now have consistent signatures that include the namespace parameter.


276-280: Added accessor methods for the namespace.

These accessor methods provide a proper interface for getting and setting the namespace, following good encapsulation practices.


285-285: Added member variable to store the namespace.

The private member variable correctly stores the namespace as a string, supporting the namespace handling functionality.


292-294: Updated constructor declarations to match the implementations.

The constructor declarations have been updated to match their implementations in the cpp file, ensuring consistency between the interface and implementation.

components/core/src/clp_s/JsonParser.cpp (14)

31-31: Include is appropriate and consistent with usage.
No concerns.


94-98: Validation approach for timestamps is correct.
The check ensures the timestamp descriptor is valid. Good adherence to the guideline preferring false == expression.


104-117: Ensure consistent de-escaping logic.
This step verifies that wildcards are not permitted in timestamps. The error handling is robust.


127-127: Explicitly specifying the authoritative timestamp namespace.
Looks good, as it aligns with the newly introduced namespace handling.


670-695: Implements parent-child mapping with auto-gen logic.
The approach to adjusting the key name for auto-generated subtrees solidly supports namespacing. Ensure that repeated IR node IDs won't overwrite existing mappings if that scenario arises.


699-710: Graceful fallback on existing mappings.
The method properly checks for previously cached nodes to avoid re-adding them. Looks good overall.


712-712: Conditional uses consistent style.
No issues here.


732-735: Intermediate mapping lookup.
Ensures that the parent's node ID is reused if it already exists, preventing duplication.


748-757: Adding node to archive and updating translations for final path component.
This structure is clear and concise.


758-765: Handling interstitial object nodes.
This block creates parent path nodes as needed. The code is consistent with the overall approach.


770-770: Returning the final node ID and timestamp match status.
No specific findings. Implementation looks clean.


875-882: Ingesting auto-generated and user-generated events.
Invoking both parse_kv_log_event_subtree and parse_kv_log_event_subtree is straightforward.


971-971: Clearing the IR-node-to-archive mapping before splitting.
Helps prevent stale node references across archive boundaries.


997-998: Resetting IR mappings after stream processing completes.
Ensures a clean state for the next iteration.

components/core/src/clp_s/search/kql/kql.cpp (6)

9-9: Includes constants header.
This addition ensures consistent usage of shared constants.


112-118: Namespace extraction for column descriptors.
The approach ensures each tokenized column is properly associated with a descriptor namespace.


124-128: Creating ColumnDescriptor from escaped tokens.
Nicely encapsulates descriptor logic, simplifying the remainder of the visitor’s code.


134-134: Prepending descriptor to nested expression.
Ensures the nested query inherits the parent column tokens.


170-171: Binding descriptors to list expressions.
Maintains the column context for each expression in the list.


224-227: Creating an empty descriptor with a default namespace.
This is consistent with the approach to unify parsing logic for columns and values.

Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this design! I’ve left some comments and will review it again tomorrow.

Comment on lines +52 to +60
std::vector<std::string> const& tokens,
std::string_view descriptor_namespace
) {
return std::shared_ptr<ColumnDescriptor>(new ColumnDescriptor(tokens));
return std::shared_ptr<ColumnDescriptor>(new ColumnDescriptor(tokens, descriptor_namespace));
}

std::shared_ptr<ColumnDescriptor> ColumnDescriptor::create_from_descriptors(
DescriptorList const& descriptors
DescriptorList const& descriptors,
std::string_view descriptor_namespace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to set a default value for descriptor_namespace to avoid modifying existing usage? That said, the current approach also works well for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's a breaking change I personally prefer breaking the API in order to force callers to consider their usage.

@@ -179,7 +198,9 @@ class ParseTreeVisitor : public KqlBaseVisitor {

std::any visitValue_expression(KqlParser::Value_expressionContext* ctx) override {
auto lit = unquote_literal(ctx->LITERAL()->getText());
auto descriptor = ColumnDescriptor::create_from_escaped_token("*");
// TODO: consider if this should somehow be allowed to match all namespaces. "*" namespace?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's something we should consider

@@ -28,6 +29,8 @@ constexpr char cRootNodeName[] = "";
constexpr int32_t cRootNodeId = -1;
constexpr char cMetadataSubtreeName[] = "";
constexpr char cLogEventIdxName[] = "log_event_idx";
constexpr std::string_view cAutogenNamespace{"@"};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's better to use char or char[]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go either way. I personally prefer constexpr char[], but our coding guidelines have shifted to forcing constexpr std::string_view for string constants.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/src/user-guide/reference-json-search-syntax.md (1)

64-81: Consider adding commas and language specifiers in the new section.

  1. There's a missing comma around line 75 for improved readability.
  2. Please specify a language identifier (e.g. ```none) for the fenced code blocks at lines 69 and 78 to pass markdown lint checks.

Below is an example fix incorporating both recommendations:

64 ### Querying auto-generated kv-pairs
...
75-To specify a key named "@key" within the default namespace the "@" namespace can be escaped as follows:
+To specify a key named "@key" within the default namespace, the "@" namespace can be escaped as follows:

69-```
+```none
@key: value

78- +none
@key: value

🧰 Tools
🪛 LanguageTool

[uncategorized] ~75-~75: Possible missing comma found.
Context: ...y a key named "@key" within the default namespace the "@" namespace can be escaped as fol...

(AI_HYDRA_LEO_MISSING_COMMA)

🪛 markdownlint-cli2 (0.17.2)

69-69: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


78-78: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

components/core/src/clp_s/search/SchemaMatch.cpp (2)

79-92: Open a new issue or remove the FIXME and TODO if addressed.

These comments highlight the need to handle more than just object subtrees and to reconsider the necessity of fully resolving descriptors. Consider opening a tracked issue to address these concerns or removing the comment if there's already a plan in motion.

Would you like me to create a separate issue with a concrete plan to handle non-object subtrees?


123-124: Review the namespace handling for wildcard expansion.

The TODO suggests extending the loop to respect namespaces for dynamic wildcard expansion. If this enhancement is crucial, consider prioritizing or removing the comment if you have an active plan.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29c936d and 7901487.

📒 Files selected for processing (2)
  • components/core/src/clp_s/search/SchemaMatch.cpp (4 hunks)
  • docs/src/user-guide/reference-json-search-syntax.md (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp_s/search/SchemaMatch.cpp
🪛 LanguageTool
docs/src/user-guide/reference-json-search-syntax.md

[uncategorized] ~75-~75: Possible missing comma found.
Context: ...y a key named "@key" within the default namespace the "@" namespace can be escaped as fol...

(AI_HYDRA_LEO_MISSING_COMMA)

🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/reference-json-search-syntax.md

69-69: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


78-78: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (4)
docs/src/user-guide/reference-json-search-syntax.md (1)

193-193: The added bullet for “@” is consistent and looks good.

No issues found.

components/core/src/clp_s/search/SchemaMatch.cpp (3)

93-97: Ensure the loop terminates safely.

While this loop correctly traverses up until it reaches the namespace root, confirm there are no edge cases where the node hierarchy might be malformed, preventing termination.


106-109: Namespace-based descriptor creation looks properly integrated.

This call to create_from_descriptors is consistent with the new requirement to handle the column's namespace.


143-147: Check subtree root retrieval logic.

If get_object_subtree_node_id_for_namespace returns -1, we exit and ignore the subtree. Ensure this behaviour is correct for all columns, particularly those relying on default or mixed namespaces.

@gibber9809 gibber9809 requested a review from wraymo March 3, 2025 16:31
Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remaining part looks good to me.

Comment on lines +106 to +117
auto column = clp_s::search::ColumnDescriptor::create_from_escaped_tokens(
m_timestamp_column,
m_timestamp_namespace
);
m_timestamp_column.clear();
for (auto it = column->descriptor_begin(); it != column->descriptor_end(); ++it) {
if (it->wildcard()) {
SPDLOG_ERROR("Timestamp key can not contain wildcards: \"{}\"", m_timestamp_key);
throw OperationFailed(ErrorCodeBadParam, __FILENAME__, __LINE__);
}
m_timestamp_column.push_back(it->get_token());
}
Copy link
Contributor

@wraymo wraymo Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we just traverse m_timestamp_column, and check if there's a * token?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do this round trip to ColumnDescriptor because it actually does a bit more unescaping -- i.e. the tokenize_column_descriptor function will leave some things escaped so that ColumnDescriptor can then apply its own escaping rules.

For example \* is left as \* so that ColumnDescriptor can distinguish it from *. During initialization ColumnDescriptor will unescape cases like this, and internally set flags to indicate whether or not a token is a wildcard.

The previous behaviour of directly using the output of tokenize_column_descriptor was incorrect because it resulted in comparing the escaped tokens from tokenize_column_descriptor with the unescaped json keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the important part of this loop is clearing and then re-populating m_timestamp_column off of the ColumnDescriptor. If we only wanted to check for * then traversing m_timestamp_column as you suggested would work.

@gibber9809 gibber9809 requested a review from wraymo March 6, 2025 18:13
wraymo
wraymo previously approved these changes Mar 6, 2025
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed docs.

For the PR title, how about:

feat(clp-s)!: Add support for ingesting auto-generated kv-pairs from kv-ir; Add support for searching auto-generated kv-pairs.

The previous title implies that we can search kv-ir which is not true yet.

@@ -172,6 +190,7 @@ Keys containing the following literal characters must escape the characters usin
* `"`
* `.`
* `*`
* `@` (only when `@` is the first character in a token making up the key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of interest, what happens if the user unnecessarily escapes a @ or some other character that doesn't need escaping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you unnecessarily escape an @ later on the string we'll still unescape it and treat it as a literal @. The same goes for all of the other characters that you can escape but don't have to in certain contexts (e.g. don't need to escape < within a quoted string, but you still can if you want).

If you try to escape some character that we don't support escaping for we treat it as an error and fail search.

@gibber9809 gibber9809 changed the title feat(clp-s)!: Add support for ingesting and searching on auto-generated keys in kv-ir feat(clp-s)!: Add support for ingesting auto-generated kv-pairs from kv-ir; Add support for searching auto-generated kv-pairs. Mar 7, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/src/user-guide/reference-json-search-syntax.md (3)

64-82: Clarity and Completeness of the New "Querying auto-generated kv-pairs" Section

The new section clearly explains how to query auto‐generated key–value pairs by using the “@” prefix to indicate the namespace. It effectively clarifies that the “@” is not part of the key name and explains how to escape it when the key genuinely begins with “@”. For enhanced usability, consider adding a cross-reference to the “Characters that require escaping” section so that users can review the full list of escapable characters in one place.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

69-69: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


79-79: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


69-71: Fenced Code Block Language Specification (Example 1)

The fenced code block showing the query example (lines 69–71) currently does not specify a language. Adding an appropriate language identifier (e.g., text) would improve clarity and remove markdown lint warnings.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

69-69: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


79-81: Fenced Code Block Language Specification (Example 2)

Similarly, the fenced code block provided at lines 79–81 for the escaped key example lacks a language specifier. Including one (for example, text) will ensure consistent formatting and compliance with markdown best practices.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

79-79: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ebd138 and 86bdf96.

📒 Files selected for processing (1)
  • docs/src/user-guide/reference-json-search-syntax.md (2 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/reference-json-search-syntax.md

69-69: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


79-79: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build-macos (macos-13, false)

@gibber9809 gibber9809 requested a review from wraymo March 7, 2025 16:15
@kirkrodrigues kirkrodrigues merged commit 9fdc765 into y-scope:main Mar 7, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants