-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(clp-s)!: Add support for ingesting auto-generated kv-pairs from kv-ir; Add support for searching auto-generated kv-pairs. #731
Conversation
WalkthroughThe changes introduce namespace awareness across several core components. A new timestamp namespace variable is added to the ArchiveWriter, with corresponding logic changes in its methods. Multiple modules are updated to accept and propagate a descriptor namespace, including JSON parsing, schema management, and tokenization routines. Method signatures and constructors are modified to include additional parameters for namespace handling. The changes also update the archive version, adjust error handling, and enhance tests and documentation to validate the new namespace functionalities. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant AW as ArchiveWriter
participant Options
Client->>AW: open(options)
AW->>Options: Extract m_authoritative_timestamp_namespace
AW->>AW: Initialize timestamp and namespace members
Client->>AW: add_node(parent_node_id, node, key)
AW->>AW: Evaluate node conditions using namespace & update matched prefix
Client->>AW: close()
AW->>AW: Clear timestamp and namespace members
sequenceDiagram
participant User
participant KQL as QueryParser
participant Util as StringUtils::tokenize_column_descriptor
participant CD as ColumnDescriptor
User->>KQL: Submit query with column descriptor
KQL->>Util: Tokenize descriptor (include descriptor_namespace)
Util-->>KQL: Return tokens and descriptor_namespace
KQL->>CD: Create ColumnDescriptor with tokens & namespace
CD-->>KQL: Return descriptor with namespace set
KQL->>User: Return processed query result
Possibly related PRs
Suggested reviewers
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (9)
docs/src/user-guide/reference-json-search-syntax.md (1)
64-81
: Add language specifiers to code blocksThe new section on querying auto-generated kv-pairs is well-documented and provides clear examples. However, the fenced code blocks at lines 69 and 78 should have language specifiers for better syntax highlighting and consistency with other code blocks in the document.
-``` +```text @key: value-
+
text
@key: value🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
69-69: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
78-78: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
components/core/src/clp_s/SchemaTree.hpp (2)
131-143
: Namespace-based subtree retrieval
This new method looks sound for retrieving the object subtree ID by namespace. However, if performance is paramount, consider usingabsl::flat_hash_map
for constant-time lookups.
194-194
: Storing namespace to object subtree mapping
Using astd::map<std::string, int32_t>
is acceptable, but it might be beneficial to unify data structures in the codebase by using a hashed map (absl::flat_hash_map
) consistent with other areas.components/core/src/clp_s/search/AddTimestampConditions.hpp (1)
26-26
: Consider using a struct instead ofstd::pair
.
While this approach is valid, using a struct could improve clarity by making the namespace and token vector more descriptive.components/core/src/clp_s/search/SchemaMatch.cpp (3)
79-93
: Address the FIXME comment when feasible.
The additional commentary about overhead suggests you intend to refactor this code path. If you need help, let us know.
123-125
: Maintain namespace consistency in wildcard expansions.
A TODO is present for namespace-aware wildcard expansion. Let me know if you'd like further assistance implementing this enhancement.
142-147
: Consider adding a log statement when no subtree root is found.
Early return on-1
is sensible. For completeness, consider logging a message to help debugging.components/core/src/clp_s/JsonParser.cpp (1)
772-871
: Refined subtree parsing for auto-generated and user-generated pairs.
The logic systematically handles each NodeType, populating the parsed message. Consider extracting repeated blocks for simpler maintainability.components/core/src/clp_s/search/kql/kql.cpp (1)
201-203
: Wildcard descriptor namespace hint.
The TODO suggests potential future expansion to match all namespaces.Could offer a specialized function to handle wildcard namespaces if that becomes necessary.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
components/core/src/clp_s/ArchiveWriter.cpp
(3 hunks)components/core/src/clp_s/ArchiveWriter.hpp
(2 hunks)components/core/src/clp_s/JsonParser.cpp
(11 hunks)components/core/src/clp_s/JsonParser.hpp
(4 hunks)components/core/src/clp_s/SchemaReader.cpp
(2 hunks)components/core/src/clp_s/SchemaTree.cpp
(1 hunks)components/core/src/clp_s/SchemaTree.hpp
(3 hunks)components/core/src/clp_s/SingleFileArchiveDefs.hpp
(1 hunks)components/core/src/clp_s/TimestampDictionaryReader.cpp
(2 hunks)components/core/src/clp_s/TimestampDictionaryReader.hpp
(3 hunks)components/core/src/clp_s/Utils.cpp
(2 hunks)components/core/src/clp_s/Utils.hpp
(1 hunks)components/core/src/clp_s/archive_constants.hpp
(2 hunks)components/core/src/clp_s/clp-s.cpp
(1 hunks)components/core/src/clp_s/indexer/IndexManager.cpp
(2 hunks)components/core/src/clp_s/search/AddTimestampConditions.cpp
(1 hunks)components/core/src/clp_s/search/AddTimestampConditions.hpp
(3 hunks)components/core/src/clp_s/search/ColumnDescriptor.cpp
(3 hunks)components/core/src/clp_s/search/ColumnDescriptor.hpp
(2 hunks)components/core/src/clp_s/search/Projection.cpp
(1 hunks)components/core/src/clp_s/search/SchemaMatch.cpp
(4 hunks)components/core/src/clp_s/search/kql/Kql.g4
(1 hunks)components/core/src/clp_s/search/kql/kql.cpp
(6 hunks)components/core/tests/test-kql.cpp
(7 hunks)docs/src/user-guide/reference-json-search-syntax.md
(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- components/core/src/clp_s/SingleFileArchiveDefs.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/search/Projection.cpp
components/core/src/clp_s/SchemaReader.cpp
components/core/src/clp_s/TimestampDictionaryReader.cpp
components/core/src/clp_s/archive_constants.hpp
components/core/src/clp_s/SchemaTree.cpp
components/core/src/clp_s/ArchiveWriter.hpp
components/core/src/clp_s/indexer/IndexManager.cpp
components/core/src/clp_s/search/AddTimestampConditions.cpp
components/core/tests/test-kql.cpp
components/core/src/clp_s/Utils.hpp
components/core/src/clp_s/Utils.cpp
components/core/src/clp_s/search/AddTimestampConditions.hpp
components/core/src/clp_s/SchemaTree.hpp
components/core/src/clp_s/ArchiveWriter.cpp
components/core/src/clp_s/clp-s.cpp
components/core/src/clp_s/search/SchemaMatch.cpp
components/core/src/clp_s/TimestampDictionaryReader.hpp
components/core/src/clp_s/JsonParser.hpp
components/core/src/clp_s/search/kql/kql.cpp
components/core/src/clp_s/search/ColumnDescriptor.hpp
components/core/src/clp_s/JsonParser.cpp
components/core/src/clp_s/search/ColumnDescriptor.cpp
🧠 Learnings (1)
components/core/src/clp_s/archive_constants.hpp (1)
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/reference-json-search-syntax.md
69-69: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
78-78: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (79)
components/core/src/clp_s/search/kql/Kql.g4 (1)
90-91
: KQL grammar now recognizes '@' as a special characterThe addition of '@' to the list of special characters in the KQL grammar aligns with the PR objective to support auto-generated key namespaces. This change means the '@' character now needs to be escaped in KQL queries.
Note that this is a breaking change, as mentioned in the PR objectives, which will affect any queries containing unescaped '@' characters, particularly those with timestamps beginning with '@'.
components/core/src/clp_s/SchemaTree.cpp (1)
20-22
: Enhancement to support namespace-aware object subtreesThe change replaces a direct subtree ID with a map-based approach that associates namespaces with their corresponding subtree IDs. This implementation aligns with the PR objective to support auto-generated keys in a namespaced subtree.
components/core/src/clp_s/indexer/IndexManager.cpp (3)
103-108
: Updated schema tree traversal to be namespace-awareThe code now properly retrieves the object subtree root using the namespace-aware method, with appropriate error handling to check for invalid subtree roots. This change is consistent with the PR's objective to support namespaced keys.
Note the use of the default namespace constant from
archive_constants.hpp
, which helps maintain consistency across the codebase.
113-113
: Schema traversal now uses namespace-specific subtree rootUpdated code to use the newly retrieved namespace-specific subtree root for traversal, maintaining compatibility with the namespace-aware changes throughout the codebase.
5-5
:✅ Verification successful
Verify if the string include is necessary
The added include for
<string>
might be redundant ifstring
is already included through other headers in the file.
🏁 Script executed:
#!/bin/bash # Check if string is already included through other headers rg -A 3 "^#include" components/core/src/clp_s/indexer/IndexManager.hppLength of output: 260
Action: Retain the
<string>
include as it is requiredAfter verifying that the header file (
components/core/src/clp_s/indexer/IndexManager.hpp
) does not include<string>
—and noting that none of the other includes in the header provide it—the explicit inclusion of<string>
in the corresponding implementation file (IndexManager.cpp
) remains necessary. Assuming thatIndexManager.cpp
utilisesstd::string
(or related functionality), this include is not redundant.components/core/src/clp_s/archive_constants.hpp (2)
5-5
: Added required header for std::string_viewThe include for string_view has been added to support the new namespace constants.
32-33
: New constants for namespace supportAdded two well-named constants that clearly define the namespace structure: auto-generated keys use the '@' prefix, while regular objects use an empty string as the namespace. This implementation aligns perfectly with the PR objectives.
These constants will help maintain consistency in how namespaces are handled throughout the codebase, particularly for the distinction between auto-generated subtrees and regular object subtrees.
docs/src/user-guide/reference-json-search-syntax.md (1)
193-193
: LGTM! Good documentation updateThe addition of '@' to the list of characters that require escaping is appropriate and aligns with the new functionality for namespace handling.
components/core/src/clp_s/search/AddTimestampConditions.cpp (1)
21-24
: LGTM! Updated for namespace supportThe method call has been properly updated to pass both the tokens and namespace to the
create_from_escaped_tokens
method, supporting the new namespace functionality.components/core/src/clp_s/search/Projection.cpp (1)
55-58
: LGTM! Good namespace handling and error checkThe modification correctly uses the column's namespace to find the appropriate subtree node. The added early return check for invalid node IDs is a good defensive programming practice.
components/core/src/clp_s/ArchiveWriter.hpp (2)
29-29
: LGTM! New field for namespace supportThe addition of the
authoritative_timestamp_namespace
field to theArchiveWriterOption
struct is appropriate for supporting the new namespace functionality.
248-248
: LGTM! Member variable for namespace supportThe addition of the
m_authoritative_timestamp_namespace
member variable aligns with the changes to theArchiveWriterOption
struct and supports the namespace functionality.components/core/src/clp_s/SchemaReader.cpp (3)
4-4
: Include for string is now required for std::string usage.The addition of the
<string>
header is appropriate as it's now needed for thestd::string
type used in the updated code.
6-6
: Include added for archive constants.Good addition of the
archive_constants.hpp
header which provides the namespace constants needed below.
573-581
: Improved conditional check to use namespace-aware schema tree access.The code has been updated to utilize namespace awareness by checking for a specific namespace (
constants::cDefaultNamespace
) in the schema tree rather than simply checking if the tree is empty. This aligns with the PR objective of adding support for ingesting and searching on auto-generated keys.components/core/tests/test-kql.cpp (8)
52-52
: Added important namespace verification for wildcard key queries.The test now verifies that pure wildcard key queries correctly have empty namespaces.
70-70
: Added namespace verification for basic filters.Ensures that basic KQL filters have empty namespaces as expected.
89-89
: Added namespace verification for NOT filters.Confirms that NOT filters have empty namespaces, maintaining consistent behavior.
138-138
: Added namespace verification for AND expressions.Ensures that columns in AND expressions maintain empty namespaces.
189-189
: Added namespace verification for OR expressions.Verifies that columns in OR expressions correctly have empty namespaces.
273-273
: Added namespace verification for escape sequences in values.Tests namespace handling for cases with escape sequences in values.
279-294
: Added new test section for @ namespace in columns.This excellent new test case verifies that columns with @ prefix are correctly recognized as being in the "@" namespace, which is a key feature of this PR.
296-311
: Added test for escaped @ symbol in column names.This test ensures that when @ is escaped in a column name, it is treated as part of the column name rather than a namespace designator. Good edge case coverage.
components/core/src/clp_s/Utils.hpp (1)
246-258
: Updated method signature to support namespace extraction.The
tokenize_column_descriptor
method signature has been updated to include a newdescriptor_namespace
parameter that will allow the method to return the namespace information along with the tokenized column.This change supports the PR objective of introducing functionality for specifying and escaping namespaces when parsing KQL column descriptors.
components/core/src/clp_s/TimestampDictionaryReader.cpp (3)
19-19
: Added variable to capture namespace information.Added a new variable to store the namespace part when tokenizing timestamp column descriptors.
24-30
: Updated method call to pass the namespace parameter.The call to
StringUtils::tokenize_column_descriptor
has been updated to pass the new namespace parameter. The formatting change also improves readability.
41-41
: Updated timestamp column storage to include namespace information.The authoritative timestamp column now stores both the tokenized column and its namespace. This is necessary to support namespace-aware timestamp functionality.
components/core/src/clp_s/ArchiveWriter.cpp (4)
22-22
: Add new archival timestamp namespace
No concerns. This assignment cleanly captures the new configuration from the option object.
120-120
: Clearing the new namespace
Good to see the new member being cleared on close, consistent with the rest of the class’s reset logic.
244-245
: Enhanced condition
The condition includes checking the timestamp size against the prefix length. Confirm that the extra comparison is correct for skipping zero-length or partial matches.
248-249
: Matching root node namespace
This newly introduced check ensures that if the parent node is the root node, the code compares the key to the authoritative timestamp namespace. The logic is sound but verify the usage of string_view for exact matching.components/core/src/clp_s/SchemaTree.hpp (1)
5-5
: New header inclusion
The new<map>
header is used form_namespace_to_object_subtree_id
. No issues identified.components/core/src/clp_s/Utils.cpp (3)
533-534
: Additional param for descriptor namespace
Introducingdescriptor_namespace
as a reference parameter is a good approach for capturing the extracted namespace while tokenizing. No issues spotted.
537-549
: Namespace extraction logic
The new logic checks if the descriptor begins with@
and sets the namespace. Ensure thatdescriptor_namespace
is validated if future code expects more advanced namespaces.
781-782
: Support for unescaping '@'
Allowing the '@' character inunescape_kql_internal
fosters the new namespace syntax. Looks correct.components/core/src/clp_s/search/AddTimestampConditions.hpp (2)
6-6
: Include standard library header forstd::pair
.
No concerns here—this addition is necessary for referencingstd::pair
.
42-42
: Private member aligns with constructor usage.
This field consistently reflects the new timestamp column approach.components/core/src/clp_s/clp-s.cpp (2)
205-211
: Validate handling of empty namespace scenarios.
Addingdescriptor_namespace
is consistent with the new requirement. Double-check that an empty namespace doesn't cause mismatches or unexpected behaviour.
216-219
: Namespace-aware column creation is correct.
Invokingcreate_from_escaped_tokens
with both tokens and namespace correctly matches the new signature.components/core/src/clp_s/search/SchemaMatch.cpp (2)
94-97
: Ensure no risk of infinite loop in the while condition.
Although the preceding logic appears to handle-1
subtree roots, confirm that this sequence always terminates, particularly in edge cases.
106-109
: Use ofcreate_from_descriptors
with namespace is correct.
No issues found—this properly matches the updated constructor.components/core/src/clp_s/TimestampDictionaryReader.hpp (3)
6-6
: Added appropriate header for std::pair usage.The inclusion of
<utility>
is necessary to support the new pair type used in the timestamp column representation.
56-59
: Updated return type to include namespace information.The method signature has been properly updated to return a pair containing both the tokenized column and its namespace, aligning with the namespace support being added throughout the codebase.
75-76
: Enhanced member variable to store namespace information.The member variable type has been correctly updated to store both the tokenized column data and its associated namespace, which is essential for the namespace-aware functionality being implemented.
components/core/src/clp_s/JsonParser.hpp (6)
130-140
: Added template parameter to support auto-generated subtrees.The
autogen
template parameter clearly indicates whether the node is part of an auto-generated subtree, which aligns with the PR objective of supporting auto-generated keys in kv-ir.
147-156
: Template parameter consistently added to node ID resolution method.The consistent addition of the
autogen
template parameter to this method ensures proper handling of auto-generated nodes throughout the codebase.
158-169
: Added helper method to process subtrees based on their generation type.The new templated method
parse_kv_log_event_subtree
nicely encapsulates the logic for processing different types of subtrees, improving code organization and reusability.
174-174
: Updated method signature for consistency.The method signature now explicitly uses the fully qualified name for the parameter type, which improves code clarity and consistency.
222-222
: Added namespace support for timestamps.The new member variable
m_timestamp_namespace
supports the namespace handling functionality for timestamps, which is a key part of this PR.
234-235
: Added separate mapping for auto-generated nodes.This new hash map properly separates the auto-generated node mappings from regular nodes, ensuring they can be managed independently.
components/core/src/clp_s/search/ColumnDescriptor.cpp (4)
25-36
: Updated constructor to support namespaces.The constructor now properly initializes the namespace member variable, ensuring that column descriptors are namespace-aware.
🧰 Tools
🪛 Cppcheck (2.10-2)
[performance] 31-31: Variable 'm_descriptors' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
38-49
: Consistently updated second constructor for namespace support.The descriptor-based constructor has also been updated to handle namespaces, maintaining consistency across the class interface.
🧰 Tools
🪛 Cppcheck (2.10-2)
[performance] 44-44: Variable 'm_descriptors' is assigned in constructor body. Consider performing initialization in initialization list.
(useInitializationList)
51-56
: Updated factory method to propagate namespace information.The factory method now correctly passes the namespace parameter to the constructor, ensuring namespaces are preserved when creating column descriptors from tokens.
58-64
: Factory method for descriptors consistently updated.The second factory method has also been updated to include namespace support, ensuring consistency across the API.
components/core/src/clp_s/search/ColumnDescriptor.hpp (5)
132-135
: Updated factory method declaration to include namespace parameter.The declaration has been properly updated to match the implementation in the cpp file, ensuring interface consistency.
143-146
: Consistently updated second factory method declaration.Both factory methods now have consistent signatures that include the namespace parameter.
276-280
: Added accessor methods for the namespace.These accessor methods provide a proper interface for getting and setting the namespace, following good encapsulation practices.
285-285
: Added member variable to store the namespace.The private member variable correctly stores the namespace as a string, supporting the namespace handling functionality.
292-294
: Updated constructor declarations to match the implementations.The constructor declarations have been updated to match their implementations in the cpp file, ensuring consistency between the interface and implementation.
components/core/src/clp_s/JsonParser.cpp (14)
31-31
: Include is appropriate and consistent with usage.
No concerns.
94-98
: Validation approach for timestamps is correct.
The check ensures the timestamp descriptor is valid. Good adherence to the guideline preferringfalse == expression
.
104-117
: Ensure consistent de-escaping logic.
This step verifies that wildcards are not permitted in timestamps. The error handling is robust.
127-127
: Explicitly specifying the authoritative timestamp namespace.
Looks good, as it aligns with the newly introduced namespace handling.
670-695
: Implements parent-child mapping with auto-gen logic.
The approach to adjusting the key name for auto-generated subtrees solidly supports namespacing. Ensure that repeated IR node IDs won't overwrite existing mappings if that scenario arises.
699-710
: Graceful fallback on existing mappings.
The method properly checks for previously cached nodes to avoid re-adding them. Looks good overall.
712-712
: Conditional uses consistent style.
No issues here.
732-735
: Intermediate mapping lookup.
Ensures that the parent's node ID is reused if it already exists, preventing duplication.
748-757
: Adding node to archive and updating translations for final path component.
This structure is clear and concise.
758-765
: Handling interstitial object nodes.
This block creates parent path nodes as needed. The code is consistent with the overall approach.
770-770
: Returning the final node ID and timestamp match status.
No specific findings. Implementation looks clean.
875-882
: Ingesting auto-generated and user-generated events.
Invoking both parse_kv_log_event_subtree and parse_kv_log_event_subtree is straightforward.
971-971
: Clearing the IR-node-to-archive mapping before splitting.
Helps prevent stale node references across archive boundaries.
997-998
: Resetting IR mappings after stream processing completes.
Ensures a clean state for the next iteration.components/core/src/clp_s/search/kql/kql.cpp (6)
9-9
: Includes constants header.
This addition ensures consistent usage of shared constants.
112-118
: Namespace extraction for column descriptors.
The approach ensures each tokenized column is properly associated with a descriptor namespace.
124-128
: Creating ColumnDescriptor from escaped tokens.
Nicely encapsulates descriptor logic, simplifying the remainder of the visitor’s code.
134-134
: Prepending descriptor to nested expression.
Ensures the nested query inherits the parent column tokens.
170-171
: Binding descriptors to list expressions.
Maintains the column context for each expression in the list.
224-227
: Creating an empty descriptor with a default namespace.
This is consistent with the approach to unify parsing logic for columns and values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this design! I’ve left some comments and will review it again tomorrow.
std::vector<std::string> const& tokens, | ||
std::string_view descriptor_namespace | ||
) { | ||
return std::shared_ptr<ColumnDescriptor>(new ColumnDescriptor(tokens)); | ||
return std::shared_ptr<ColumnDescriptor>(new ColumnDescriptor(tokens, descriptor_namespace)); | ||
} | ||
|
||
std::shared_ptr<ColumnDescriptor> ColumnDescriptor::create_from_descriptors( | ||
DescriptorList const& descriptors | ||
DescriptorList const& descriptors, | ||
std::string_view descriptor_namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to set a default value for descriptor_namespace
to avoid modifying existing usage? That said, the current approach also works well for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's a breaking change I personally prefer breaking the API in order to force callers to consider their usage.
@@ -179,7 +198,9 @@ class ParseTreeVisitor : public KqlBaseVisitor { | |||
|
|||
std::any visitValue_expression(KqlParser::Value_expressionContext* ctx) override { | |||
auto lit = unquote_literal(ctx->LITERAL()->getText()); | |||
auto descriptor = ColumnDescriptor::create_from_escaped_token("*"); | |||
// TODO: consider if this should somehow be allowed to match all namespaces. "*" namespace? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's something we should consider
@@ -28,6 +29,8 @@ constexpr char cRootNodeName[] = ""; | |||
constexpr int32_t cRootNodeId = -1; | |||
constexpr char cMetadataSubtreeName[] = ""; | |||
constexpr char cLogEventIdxName[] = "log_event_idx"; | |||
constexpr std::string_view cAutogenNamespace{"@"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's better to use char
or char[]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could go either way. I personally prefer constexpr char[]
, but our coding guidelines have shifted to forcing constexpr std::string_view
for string constants.
Co-authored-by: wraymo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
docs/src/user-guide/reference-json-search-syntax.md (1)
64-81
: Consider adding commas and language specifiers in the new section.
- There's a missing comma around line 75 for improved readability.
- Please specify a language identifier (e.g. ```none) for the fenced code blocks at lines 69 and 78 to pass markdown lint checks.
Below is an example fix incorporating both recommendations:
64 ### Querying auto-generated kv-pairs ... 75-To specify a key named "@key" within the default namespace the "@" namespace can be escaped as follows: +To specify a key named "@key" within the default namespace, the "@" namespace can be escaped as follows: 69-``` +```none @key: value78-
+
none
@key: value🧰 Tools
🪛 LanguageTool
[uncategorized] ~75-~75: Possible missing comma found.
Context: ...y a key named "@key" within the default namespace the "@" namespace can be escaped as fol...(AI_HYDRA_LEO_MISSING_COMMA)
🪛 markdownlint-cli2 (0.17.2)
69-69: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
78-78: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
components/core/src/clp_s/search/SchemaMatch.cpp (2)
79-92
: Open a new issue or remove the FIXME and TODO if addressed.These comments highlight the need to handle more than just object subtrees and to reconsider the necessity of fully resolving descriptors. Consider opening a tracked issue to address these concerns or removing the comment if there's already a plan in motion.
Would you like me to create a separate issue with a concrete plan to handle non-object subtrees?
123-124
: Review the namespace handling for wildcard expansion.The TODO suggests extending the loop to respect namespaces for dynamic wildcard expansion. If this enhancement is crucial, consider prioritizing or removing the comment if you have an active plan.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/search/SchemaMatch.cpp
(4 hunks)docs/src/user-guide/reference-json-search-syntax.md
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
components/core/src/clp_s/search/SchemaMatch.cpp
🪛 LanguageTool
docs/src/user-guide/reference-json-search-syntax.md
[uncategorized] ~75-~75: Possible missing comma found.
Context: ...y a key named "@key" within the default namespace the "@" namespace can be escaped as fol...
(AI_HYDRA_LEO_MISSING_COMMA)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/reference-json-search-syntax.md
69-69: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
78-78: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (4)
docs/src/user-guide/reference-json-search-syntax.md (1)
193-193
: The added bullet for “@” is consistent and looks good.No issues found.
components/core/src/clp_s/search/SchemaMatch.cpp (3)
93-97
: Ensure the loop terminates safely.While this loop correctly traverses up until it reaches the namespace root, confirm there are no edge cases where the node hierarchy might be malformed, preventing termination.
106-109
: Namespace-based descriptor creation looks properly integrated.This call to
create_from_descriptors
is consistent with the new requirement to handle the column's namespace.
143-147
: Check subtree root retrieval logic.If
get_object_subtree_node_id_for_namespace
returns-1
, we exit and ignore the subtree. Ensure this behaviour is correct for all columns, particularly those relying on default or mixed namespaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The remaining part looks good to me.
auto column = clp_s::search::ColumnDescriptor::create_from_escaped_tokens( | ||
m_timestamp_column, | ||
m_timestamp_namespace | ||
); | ||
m_timestamp_column.clear(); | ||
for (auto it = column->descriptor_begin(); it != column->descriptor_end(); ++it) { | ||
if (it->wildcard()) { | ||
SPDLOG_ERROR("Timestamp key can not contain wildcards: \"{}\"", m_timestamp_key); | ||
throw OperationFailed(ErrorCodeBadParam, __FILENAME__, __LINE__); | ||
} | ||
m_timestamp_column.push_back(it->get_token()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we just traverse m_timestamp_column
, and check if there's a *
token?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do this round trip to ColumnDescriptor because it actually does a bit more unescaping -- i.e. the tokenize_column_descriptor
function will leave some things escaped so that ColumnDescriptor
can then apply its own escaping rules.
For example \*
is left as \*
so that ColumnDescriptor
can distinguish it from *
. During initialization ColumnDescriptor
will unescape cases like this, and internally set flags to indicate whether or not a token is a wildcard.
The previous behaviour of directly using the output of tokenize_column_descriptor
was incorrect because it resulted in comparing the escaped tokens from tokenize_column_descriptor
with the unescaped json keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the important part of this loop is clearing and then re-populating m_timestamp_column
off of the ColumnDescriptor
. If we only wanted to check for *
then traversing m_timestamp_column
as you suggested would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed docs.
For the PR title, how about:
feat(clp-s)!: Add support for ingesting auto-generated kv-pairs from kv-ir; Add support for searching auto-generated kv-pairs.
The previous title implies that we can search kv-ir which is not true yet.
@@ -172,6 +190,7 @@ Keys containing the following literal characters must escape the characters usin | |||
* `"` | |||
* `.` | |||
* `*` | |||
* `@` (only when `@` is the first character in a token making up the key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of interest, what happens if the user unnecessarily escapes a @
or some other character that doesn't need escaping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you unnecessarily escape an @
later on the string we'll still unescape it and treat it as a literal @
. The same goes for all of the other characters that you can escape but don't have to in certain contexts (e.g. don't need to escape <
within a quoted string, but you still can if you want).
If you try to escape some character that we don't support escaping for we treat it as an error and fail search.
Co-authored-by: kirkrodrigues <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
docs/src/user-guide/reference-json-search-syntax.md (3)
64-82
: Clarity and Completeness of the New "Querying auto-generated kv-pairs" SectionThe new section clearly explains how to query auto‐generated key–value pairs by using the “@” prefix to indicate the namespace. It effectively clarifies that the “@” is not part of the key name and explains how to escape it when the key genuinely begins with “@”. For enhanced usability, consider adding a cross-reference to the “Characters that require escaping” section so that users can review the full list of escapable characters in one place.
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
69-69: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
79-79: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
69-71
: Fenced Code Block Language Specification (Example 1)The fenced code block showing the query example (lines 69–71) currently does not specify a language. Adding an appropriate language identifier (e.g.,
text
) would improve clarity and remove markdown lint warnings.🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
69-69: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
79-81
: Fenced Code Block Language Specification (Example 2)Similarly, the fenced code block provided at lines 79–81 for the escaped key example lacks a language specifier. Including one (for example,
text
) will ensure consistent formatting and compliance with markdown best practices.🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
79-79: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/src/user-guide/reference-json-search-syntax.md
(2 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/reference-json-search-syntax.md
69-69: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
79-79: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build-macos (macos-13, false)
Description
This PR adds support for ingesting and searching on auto-generated keys in kv-ir. Broadly this encompasses the following aspects:
Note that this is technically a breaking change because of item 2. In particular archives that contain an authoritative timestamp beginning with an '@' will incorrectly have that '@' interpreted as indicating a namespace per the new parsing rules. We do increment the minor version in this PR, and we could use that to add special casing when loading timestamp dictionaries from older archive versions, but that special casing has not been included in this PR.
The namespacing is achieved by dividing the schema tree as follows:
Column resolution (and similar bits of code) have been changed to resolve a column against either the "@" object subtree or the default "" object subtree depending on the namespace associated with the column. One notable exception is that pure wildcards "*" currently do not respect namespaces: specifically
*: value
and@*: value
will search for any column that matches the value (as long as it isn't in the NodeType::Metadata subtree). It would be possible to get pure wildcards to respect namespaces, but it would require turning a lot of linear scans into tree traversals and using more memory to explicitly track sets of matching nodes, and might not be a necessary feature anyway.For decompression to JSON we simply marshal only the "" object subtree. We could probably add support for merging the "@" and "" namespace during JSON decompression in one way or another, but that is left as future work.
Once we add support for decompression to kv-ir we will simply need to marshal the "" subtree into the user-generated keys and the "@" subtree into the auto-generated keys.
Once again these new features appear to have slightly decreased ingestion performance when specifying timestamp keys. The results are as follows:
There don't seem to be any noticeable new hotspots when I run these benchmarks using perf, but I will try to double-check my benchmarking methodology and continue looking into improving performance during the review process.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit