[wip] add subset string view comparator #248
Sorry, I'm having trouble understanding what you want to achieve. It does not help that the proposed name seems to be clashing with what it is actually doing: So before we go into details, can you please explain what is the problem you're trying to solve? Is it ensuring that all member fields are accounted for in a serialization roundtrip? If so, why the |
I want to catch misspelled keys entered by the user that are not used when deserializing an object (without applying that logic in any deserialization code). To solve this problem, I thought of checking whether a node is a subset of another node. This uses the
I meant that there is no decoding: the comparison is done strictly on the string view, i.e. it is a string comparison. I wanted to have
New idea: I could reuse the logic implemented in the PR and try decoding for all fundamental types:
bool is_subset(...) {
    // before checking for equality, try decoding key/val with these fundamental types, in order:
    // string -> eq operator in stringview
    // double -> atod() would work for floats also?
    // int64 -> atoi() would work for all signed integral values
    // uint64 -> atoi() would work for all unsigned integral values
    // bool -> from_chars(csubstr buf, bool * C4_RESTRICT v)
}
// Would it be better if the user could control which fundamental types to decode?
Hope this helps! I can give a more concrete example if you need. |
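The opt-in decoding idea above can be sketched outside the library. This is a minimal, self-contained illustration using only the C++17 standard library; `decode`, `equal_as` and `scalar_equal` are hypothetical names, not rapidyaml or c4core functions, and the accepted spellings (for example treating `1` as a boolean, or the `0o` octal prefix) are assumptions made for the example. The caller picks which fundamental types to try through the template parameter pack; an empty pack falls back to plain string comparison.

```cpp
#include <cctype>
#include <charconv>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical decoders; none of these are rapidyaml or c4core functions.
template<class T> std::optional<T> decode(std::string_view s);

// Assumption: "0"/"1" and any capitalization of true/false count as booleans.
template<> inline std::optional<bool> decode<bool>(std::string_view s)
{
    std::string lower;
    for(char c : s)
        lower += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    if(lower == "true" || lower == "1") return true;
    if(lower == "false" || lower == "0") return false;
    return std::nullopt;
}

// Accepts decimal, 0x (hex) and 0o (octal); the whole string must be consumed.
template<> inline std::optional<std::int64_t> decode<std::int64_t>(std::string_view s)
{
    int base = 10;
    if(s.size() > 2 && s[0] == '0' && (s[1] == 'x' || s[1] == 'o'))
    {
        base = (s[1] == 'x') ? 16 : 8;
        s.remove_prefix(2);
        if(s[0] == '-') return std::nullopt; // no sign allowed after a prefix
    }
    std::int64_t v = 0;
    const char* end = s.data() + s.size();
    const auto res = std::from_chars(s.data(), end, v, base);
    if(res.ec != std::errc() || res.ptr != end) return std::nullopt;
    return v;
}

// Uses strtod on a copy; rejects leading whitespace and trailing characters.
template<> inline std::optional<double> decode<double>(std::string_view s)
{
    if(s.empty() || std::isspace(static_cast<unsigned char>(s.front())))
        return std::nullopt;
    const std::string buf(s);
    char* end = nullptr;
    const double v = std::strtod(buf.c_str(), &end);
    if(end != buf.c_str() + buf.size()) return std::nullopt;
    return v;
}

// True when both scalars decode as T and the decoded values compare equal.
// Note: doubles are compared exactly, with no tolerance.
template<class T>
bool equal_as(std::string_view a, std::string_view b)
{
    const auto da = decode<T>(a);
    const auto db = decode<T>(b);
    return da && db && *da == *db;
}

// String equality first, then each requested type in the order given.
template<class... Ts>
bool scalar_equal(std::string_view a, std::string_view b)
{
    return a == b || (equal_as<Ts>(a, b) || ...);
}
```

With this shape, `scalar_equal<>(a, b)` keeps pure string semantics, while `scalar_equal<bool, std::int64_t>(a, b)` would treat `true`/`True` or `255`/`0xff` as equal. Whether such decoding belongs in the library at all is exactly the open question in this thread.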
So in other words, you want to verify that a given node has child keys only from within a certain set, and/or that any child key must be in that certain set? Then these would be the main predicates (improvised here, untested):
struct Tree
{
    // ....
    // non-recursive predicates:
    /// return true if subject_node has all the keys or indices in refnode from a reftree
    /// @note does not check values, only keys (for maps) or indices (for seqs)
    bool has_all(size_t subject_node, Tree const* reftree, size_t refnode) const;
    bool has_exactly(size_t subject_node, Tree const* reftree, size_t refnode) const;
    // and helper functions to drive the recursive descent:
    bool has_all_recursive(size_t subject_node, Tree const* reftree, size_t refnode) const;
    bool has_exactly_recursive(size_t subject_node, Tree const* reftree, size_t refnode) const;
    // ....
};
and the predicate implementation would look like this:
bool Tree::has_all(size_t subject_node, Tree const* reftree, size_t refnode) const
{
    C4_ASSERT(is_container(subject_node));
    C4_ASSERT(reftree->is_container(refnode));
    if(is_map(subject_node))
    {
        C4_ASSERT(reftree->is_map(refnode));
        // every key of the reference node must exist under the subject node
        for(size_t ch = reftree->first_child(refnode); ch != NONE; ch = reftree->next_sibling(ch))
            if( ! has_child(subject_node, reftree->key(ch)))
                return false;
        return true;
    }
    else if(is_seq(subject_node))
    {
        C4_ASSERT(reftree->is_seq(refnode));
        return num_children(subject_node) >= reftree->num_children(refnode);
    }
    else
    {
        C4_NEVER_REACH();
    }
}
bool Tree::has_exactly(size_t subject_node, Tree const* reftree, size_t refnode) const
{
    C4_ASSERT(is_container(subject_node));
    C4_ASSERT(reftree->is_container(refnode));
    if(is_map(subject_node))
    {
        C4_ASSERT(reftree->is_map(refnode));
        // every key of the reference node must exist under the subject node...
        for(size_t ch = reftree->first_child(refnode); ch != NONE; ch = reftree->next_sibling(ch))
            if( ! has_child(subject_node, reftree->key(ch)))
                return false;
        // ...and every key of the subject node must exist under the reference node
        for(size_t ch = first_child(subject_node); ch != NONE; ch = next_sibling(ch))
            if( ! reftree->has_child(refnode, key(ch)))
                return false;
        return true;
    }
    else if(is_seq(subject_node))
    {
        C4_ASSERT(reftree->is_seq(refnode));
        return num_children(subject_node) == reftree->num_children(refnode);
    }
    else
    {
        C4_NEVER_REACH();
    }
}
As for the values, I don't see the usefulness in the comparison. Comparison of map keys or seq indices makes sense to ensure that a data tree is complete. But if you also have value equality requirements, doesn't that mean that the ref tree is the actual data tree? 😕 And as for the float comparison, let's leave that question for later. It seems very specific, and related to the value question, and I want to make sure the main functionality is there before looking at that side. |
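To experiment with these predicates without building the library, here is a hedged, standalone sketch on a toy node type. `Node`, `leaf`, `map_node` and `seq_node` are hypothetical stand-ins and not rapidyaml's `Tree`; unlike the improvised code above, which asserts that both nodes are containers of the same kind, this sketch returns false on a kind mismatch. The recursive variant is one possible reading of what `has_all_recursive` could do: it descends by key for maps and by position for sequences.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed YAML node; values are not stored
// because the predicates only look at structure.
struct Node
{
    enum Kind { Val, Map, Seq } kind = Val;
    std::string key;            // empty for seq items and for the root
    std::vector<Node> children; // members of a map or seq

    const Node* find(const std::string& k) const
    {
        for(const Node& ch : children)
            if(ch.key == k)
                return &ch;
        return nullptr;
    }
};

inline Node leaf(std::string key = "")
{
    Node n;
    n.key = std::move(key);
    return n;
}
inline Node map_node(std::string key, std::vector<Node> children)
{
    Node n;
    n.kind = Node::Map;
    n.key = std::move(key);
    n.children = std::move(children);
    return n;
}
inline Node seq_node(std::string key, std::vector<Node> children)
{
    Node n;
    n.kind = Node::Seq;
    n.key = std::move(key);
    n.children = std::move(children);
    return n;
}

// Non-recursive: subject has every key of ref (maps), or at least as many
// members as ref (seqs). Values are never compared.
inline bool has_all(const Node& subject, const Node& ref)
{
    if(subject.kind != ref.kind) return false;
    if(subject.kind == Node::Map)
    {
        for(const Node& r : ref.children)
            if(!subject.find(r.key))
                return false;
        return true;
    }
    if(subject.kind == Node::Seq)
        return subject.children.size() >= ref.children.size();
    return true; // two scalars: nothing structural to compare
}

// Non-recursive: same keys in both directions (maps), same count (seqs).
inline bool has_exactly(const Node& subject, const Node& ref)
{
    return has_all(subject, ref) && has_all(ref, subject);
}

// Recursive descent: by key for maps, by position for seqs.
inline bool has_all_recursive(const Node& subject, const Node& ref)
{
    if(!has_all(subject, ref)) return false;
    if(ref.kind == Node::Map)
    {
        for(const Node& r : ref.children)
            if(!has_all_recursive(*subject.find(r.key), r)) // key exists: has_all passed
                return false;
    }
    else if(ref.kind == Node::Seq)
    {
        for(std::size_t i = 0; i < ref.children.size(); ++i)
            if(!has_all_recursive(subject.children[i], ref.children[i]))
                return false;
    }
    return true;
}
```

Note that the non-recursive sequence check only compares member counts, so a sequence can pass `has_all` while a nested map inside it is missing keys; only the recursive variant would catch that.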
Well, I think it depends on the formatters used. These YAML documents would be considered equal, so in that case the ref tree would not be the actual data tree.
---
bool: 1
---
bool: true
---
bool: True
Even for keys, should we not decode them before checking equality? It could be opt-in,
---
1: mario
255: luigi
---
true: mario
0xff: luigi
---
True: mario
0o377: luigi
I'm all for simplicity, but validating keys and/or values, with or without decoding, does not change the code that much... except for tests.
struct Tree
{
    // ....
    // The way I see it there would be
    bool is_subset(size_t subject_node, Tree const* reftree, size_t refnode, bool skipval = true, bool decode = false) const;
    bool is_superset(size_t subject_node, Tree const* reftree, size_t refnode, bool skipval = true, bool decode = false) const;
    // and helper functions to drive the recursive descent:
    bool is_subset_recursive(size_t subject_node, Tree const* reftree, size_t refnode, bool skipval, bool decode) const;
    bool is_superset_recursive(size_t subject_node, Tree const* reftree, size_t refnode, bool skipval, bool decode) const;
    // ....
};
Is |
I think I understand now!
Do we even need a |
EDITED: I think there's some confusion here. Let's leave the
I know you want to get closer to value semantics, but for now that's just not here - only string semantics. That may still be not enough for what you need, but at least this much will be available for you and other users. This is because adding the value semantics is too disruptive for now, and also a really hard problem from multiple points of view. So that one will be best left to a different PR.
Also, focusing on string semantics allows us to focus on the structure, using the plain
So what comes next assumes only the existing string semantics.
And for now, all of those keys are necessarily different, just as the string
Notice that for given nodes a and b,
Further,
That's why the complement to
Now, I agree the names are a bit cryptic. Maybe |
To clarify, I am stating that:
|
Thx for the detailed explanation and for your patience!
Agreed! For my use case it made sense, but that is not true for everyone. I imagine comparing a key anchor/ref is an unsolvable problem too...
The
Fair enough. Should I ensure that the node
# tree_a
--- docval
mario: 1up
# tree_b
---
mario: 1up
# Should `tree_a->has_all( 0, tree_b, 0)` return true ?
# -----------------------------------------------------
# tree_c
mario: &life 1up
# tree_d
mario: 1up
# Should `tree_c->has_all( 0, tree_d, 0)` return true ?
|
Anchors are meta-structure, so:
Assuming that
Ideally, if anchors are playing a role, then you should resolve the trees before making this sort of query. But to clarify, what sort of role do you want anchors to play here? I may be missing something. |
This is what I had in mind, I just wanted your wisdom on this!
It was really meant as a doc value. Maybe we should call these functions
Last question: how do you run a specific test on Linux, e.g. if I wanted to test the |
Forgot to mention: what is also important to me is the structure of a sequence which has keys and nested keys. I don't check values, but the order and type MUST match, e.g.
# tree_a
- key1:
    nested_key: 2
- val1
- val2
# tree_b
- key1:
    nested_key: 5
- dontcare
# `tree_a->has_all( 0, tree_b, 0)` would return false
# `tree_b->has_all( 0, tree_a, 0)` would return true
The second commit accounts for that. I don't know if you had that in mind? |
# tree_a
--- docval
mario: 1up
That's not valid YAML. It cannot be both a doc value and a doc map. Eg, see this. |
This is really helpful! Maybe it has been updated, but I'm getting an error with rapidyaml in the sandbox for this case:
a: 2
- 1 |
The
# tree_a
- key1:
    nested_key: 2
- val1
- val2
# tree_b
- key1:
    nested_key: 5
- dontcare
# tree_c
- key1:
    nested_key: 5
Actually, it's different;
assert( tree_a->has_all(0, tree_b, 0)); // root a has at least as many members as root b
assert( tree_a->has_all(0, tree_c, 0)); // root a has at least as many members as root c
assert( ! tree_b->has_all(0, tree_a, 0)); // root b has less members than root a
assert( tree_b->has_all(0, tree_c, 0)); // root b has at least as many members as root c
assert( ! tree_c->has_all(0, tree_a, 0)); // root c has less members than root a
assert( ! tree_c->has_all(0, tree_b, 0)); // root c has less members than root b |
Actually rapidyaml does NOT error out, and that's the problem. Indeed, that's being worked on. |
To enable the tests, add the CMake flag
Next, just run the target
Or use ctest: go into the build directory, cd into |
OK! I really do understand now! I was wondering why my PR and your proposed code were so different. Is there any reason not to traverse sequences that have keys? This is what I need and implemented. I'm totally fine with keeping this in my code. I don't see any use case for the
Well, my example was simplified, sorry for that. I need to know if a
# tree_a
- key1:
    nested_key1: ~
- key2:
    nested_key1: ~
    nested_key2: ~
    misspelled: ~
# tree_b
- key1:
    nested_key1: ~
    nested_key2: ~
- key2:
    nested_key1: ~
    nested_key2: ~
    nested_key3: ~
// I think my reference of `has_all` is reversed compared to yours. The 'tree_a' is the one that will be looped against the 'tree_b'.
assert( ! tree_a->has_all(0, tree_b, 0)); // root a has at least as many members as root b, but 'misspelled' key is missing from tree_b
assert( ! tree_b->has_all(0, tree_a, 0)); // root b has less members than root a: 'nested_key2' from key1 and 'nested_key3' from key2 are missing from tree_a |
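The misspelled-key use case described here can be illustrated with a small diagnostic walk that reports where the subject has a key the reference does not know. This is a sketch on a toy node type, assuming positional matching of sequence items and the "subject is looped against the reference" direction mentioned in the comment above; `Node`, `find_child`, `unknown_keys` and the path format are all hypothetical and are not part of rapidyaml or of this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed YAML node (structure only, no values).
struct Node
{
    enum Kind { Val, Map, Seq } kind = Val;
    std::string key;            // empty for seq items and for the root
    std::vector<Node> children; // members of a map or seq
};

inline Node leaf(std::string key = "")
{
    Node n;
    n.key = std::move(key);
    return n;
}
inline Node container(Node::Kind kind, std::string key, std::vector<Node> children)
{
    Node n;
    n.kind = kind;
    n.key = std::move(key);
    n.children = std::move(children);
    return n;
}

inline const Node* find_child(const Node& n, const std::string& key)
{
    for(const Node& ch : n.children)
        if(ch.key == key)
            return &ch;
    return nullptr;
}

// Appends to `out` the path of every key found in `subject` but not in `ref`.
// Maps are matched by key, seqs by position (only the common prefix is walked).
inline void unknown_keys(const Node& subject, const Node& ref,
                         const std::string& path, std::vector<std::string>& out)
{
    if(subject.kind != ref.kind)
    {
        out.push_back(path + " (kind mismatch)");
        return;
    }
    if(subject.kind == Node::Map)
    {
        for(const Node& ch : subject.children)
        {
            const std::string child_path = path.empty() ? ch.key : path + "." + ch.key;
            if(const Node* r = find_child(ref, ch.key))
                unknown_keys(ch, *r, child_path, out);
            else
                out.push_back(child_path); // not in the reference: likely a typo
        }
    }
    else if(subject.kind == Node::Seq)
    {
        const std::size_t n = std::min(subject.children.size(), ref.children.size());
        for(std::size_t i = 0; i < n; ++i)
            unknown_keys(subject.children[i], ref.children[i],
                         path + "[" + std::to_string(i) + "]", out);
    }
}
```

Returning paths instead of a single bool is a design choice for the example: for user-facing validation, knowing which key was not recognized is usually more useful than knowing that some key was not.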
This PR implements a subset comparator for string views. It is really helpful for finding misspelled YAML keys, which are usually silently ignored when deserializing objects that have optional keys. The is_subset_strview method is highly dependent on formatters, so it is not really usable, but the is_subset_strview_skipval method is useful in my context because my keys don't contain floating-point numbers.
I have tried to follow your C++ style. You can close this if you don't want this subset implementation; otherwise I will change the code to your liking. Tags are not compared at the moment. Are they resolved before being added to the string views?
I will also implement an is_subset method in another PR, which will need decoding with variadic templates. EDIT: Not needed, can be implemented here.
TODO:
test/test_subset.cpp
(have tested against 15 tests internally)