
json: add hand-rolled json sanitizer #20428

Closed
jmarantz wants to merge 36 commits into envoyproxy:main from jmarantz:json-sanitizer

Conversation

@jmarantz
Contributor

@jmarantz jmarantz commented Mar 20, 2022

Commit Message: Creates a hand-coded JSON string sanitizer, functionally equivalent to creating a protobuf and serializing it to JSON, but 13x faster when there are escapes, and 128x faster without escapes, which will be the common case for stats.

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
BM_ProtoEncoderNoEscape                1151 ns         1150 ns       517465
BM_JsonSanitizerNoEscape               8.76 ns         8.75 ns     78240400
BM_StaticJsonSanitizerNoEscape         10.2 ns         10.2 ns     68492637
BM_ProtoEncoderWithEscape              1313 ns         1313 ns       527792
BM_JsonSanitizerWithEscape              113 ns          113 ns      6205893
BM_StaticJsonSanitizerWithEscape        113 ns          113 ns      6173230

Additional Description: This is intended to be adapted into the JSON renderer introduced in #19693
Risk Level: Low for now; this is not used by anything yet. When we integrate it into a flow it should be 'medium' risk.
Testing: Just the new unit-test and benchmark
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@wbpcode
Member

wbpcode commented Mar 20, 2022

Looks great. Thanks.

Member

@wbpcode wbpcode left a comment

Thanks and some minor comments are added.

Comment on lines +40 to +45
// Character-indexed array of translation strings. If an entry is nullptr then
// the character does not require substitution. This strategy is dependent on
// the property of UTF-8 where all two-byte characters have the high-order bit
// set for both bytes, and don't require escaping for JSON. Thus we can
// consider each character in isolation for escaping.
std::unique_ptr<std::string> char_escapes_[256];
Member

32 bytes for every string and 8 bytes for every pointer, so this would be 10240 bytes of external memory overhead?

Contributor Author

Most of the 256 element array has nullptr but I think your point is the footprint of this can be smaller. I think this could be an array of char[8], to allow fitting \uABCD (6 chars) and avoid the indirection. I'll make that change; then it's only 2k overhead, and will probably be faster too due to better cache behavior.
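The compact table described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `Escape` and `EscapeTable` are hypothetical names, and each entry inlines up to six escape characters (`\uABCD`) plus a length, avoiding the per-character pointer indirection.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical compact escape-table entry: fixed-size inline storage, no heap
// allocation or pointer chase. size == 0 means "no escaping needed".
struct Escape {
  char chars[7]{}; // longest JSON escape is 6 chars: \uABCD
  uint8_t size{0};
};

struct EscapeTable {
  Escape entries[256];

  EscapeTable() {
    auto set = [this](uint8_t c, const char* s) {
      Escape& e = entries[c];
      e.size = static_cast<uint8_t>(strlen(s));
      memcpy(e.chars, s, e.size);
    };
    set('"', "\\\"");
    set('\\', "\\\\");
    set('\n', "\\n");
    // Control characters without a short form get the \u00XX spelling.
    for (uint32_t c = 0; c < 0x20; ++c) {
      if (entries[c].size == 0) {
        char buf[8];
        snprintf(buf, sizeof(buf), "\\u%04X", c);
        set(static_cast<uint8_t>(c), buf);
      }
    }
  }
};
```

With 256 entries of 8 bytes each, the whole table is the 2k footprint mentioned above, and lookups touch contiguous memory rather than scattered heap strings.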

Contributor Author

Indeed this comment helped me to remove the level of indirection, making this code run significantly faster!

I could also just make this array 127 or 128 long, as no characters >= 127 get escaped, at the cost of one extra condition-check per character when doing sanitization. Let me try that.

Contributor Author

Making the array smaller and adding the extra conditional made things significantly slower so I won't commit that. 2k bytes overhead seems OK for something we only need one of.

Comment on lines +11 to +14
auto control_char = [this](char escape, absl::string_view letter) {
char_escapes_[static_cast<uint32_t>(escape)] =
std::make_unique<std::string>(absl::StrCat("\\", letter));
};
Member

Can we construct this char_escapes_ array only once and share it across all JsonSanitizer instances?

Contributor Author

I was thinking there would only be one of these, instantiated by StatsHandler or as a CONSTRUCT_ON_FIRST_USE. JsonSanitizer otherwise has no state; only const methods.

Member

If we can make the char_escapes_ array static via CONSTRUCT_ON_FIRST_USE, then sanitize can be a static method. That should make the method much easier to use.

Contributor Author

Yes we can, but that makes the fast-path a lot slower, because it needs to do an atomic test and construct on every access. There's nothing wrong with using CONSTRUCT_ON_FIRST_USE for the whole object but I think applications will want to save a reference to the JsonSanitizer object and use that in a fast loop, rather than force the atomic operation on every sanitization.

Member

it needs to do an atomic test and construct on every access

I'm a bit confused about this point. Why can't we just construct a static char_escapes_ and reference it in our static sanitize method? The char_escapes_ should only be constructed once. 🤔 Is there something I've missed?

Contributor Author

@jmarantz jmarantz Mar 21, 2022

I'm envisioning what needs to occur when the compiler generates code for CONSTRUCT_ON_FIRST_USE. The first use can come from any thread, so it needs to lazily, atomically, create the object. The perf results confirm this is not free, especially in the context of the streamlined no-escape case, though I didn't look exactly at what gets generated.
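The trade-off being discussed can be illustrated with a small stand-alone sketch (names here are illustrative, not Envoy's). A C++11 function-local static is initialized thread-safely exactly once, but the compiler emits a guard check on every call; hoisting the reference out of the hot loop pays that check once.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a stateless sanitizer with only const methods.
struct Sanitizer {
  std::string sanitize(const std::string& s) const { return s; } // stub
};

// Roughly what CONSTRUCT_ON_FIRST_USE expands to: a "magic static" whose
// thread-safe lazy initialization implies a guard check on each call.
const Sanitizer& sanitizerSingleton() {
  static const Sanitizer instance;
  return instance;
}

std::string hotLoop(const std::vector<std::string>& inputs) {
  // Save a reference once, outside the loop, so the per-call guard check
  // in sanitizerSingleton() is not paid for every string.
  const Sanitizer& sanitizer = sanitizerSingleton();
  std::string out;
  for (const auto& in : inputs) {
    out += sanitizer.sanitize(in);
  }
  return out;
}
```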

Member

A normal static local variable is used for CONSTRUCT_ON_FIRST_USE. The compiler guarantees its initialization is thread-safe, but I don't know whether there is additional overhead or not.

Okay, then I get it. I will also take a look at the generated code afterwards and test whether there is additional overhead after local static variable initialization.

Member

I think if we verify later that CONSTRUCT_ON_FIRST_USE does not cause performance problems, we can consider modifying the implementation here again.

Contributor Author

I already verified it does cause performance problems, and moreover for the use-case I have in mind it's easy to instantiate one in a context that lives for the entire server, or even for the entire request.

Comment on lines +41 to +61
size_t past_escape = absl::string_view::npos;
for (uint32_t i = 0, n = str.size(); i < n; ++i) {
  uint32_t index = static_cast<uint32_t>(static_cast<uint8_t>(str[i]));
  const std::string* escape = char_escapes_[index].get();
  if (escape != nullptr) {
    if (past_escape == absl::string_view::npos) {
      // We only initialize buffer when we first learn we need to add an
      // escape-sequence to the sanitized string.
      if (i == 0) {
        buffer = *escape;
      } else {
        buffer = absl::StrCat(str.substr(0, i), *escape);
      }
    } else if (i == past_escape) {
      absl::StrAppend(&buffer, *escape);
    } else {
      absl::StrAppend(&buffer, str.substr(past_escape, i - past_escape), *escape);
    }
    past_escape = i + 1;
  }
}
Member

Maybe the code can be simpler this way?

Suggested change
size_t past_escape = absl::string_view::npos;
for (uint32_t i = 0, n = str.size(); i < n; ++i) {
  uint32_t index = static_cast<uint32_t>(static_cast<uint8_t>(str[i]));
  const std::string* escape = char_escapes_[index].get();
  if (escape != nullptr) {
    if (past_escape == absl::string_view::npos) {
      // We only initialize buffer when we first learn we need to add an
      // escape-sequence to the sanitized string.
      if (i == 0) {
        buffer = *escape;
      } else {
        buffer = absl::StrCat(str.substr(0, i), *escape);
      }
    } else if (i == past_escape) {
      absl::StrAppend(&buffer, *escape);
    } else {
      absl::StrAppend(&buffer, str.substr(past_escape, i - past_escape), *escape);
    }
    past_escape = i + 1;
  }
}
size_t past_escape = 0;
for (uint32_t i = 0, n = str.size(); i < n; ++i) {
  const std::string* escape = char_escapes_[static_cast<uint8_t>(str[i])].get();
  if (escape != nullptr) {
    absl::StrAppend(&buffer, str.substr(past_escape, i - past_escape), *escape);
    past_escape = i + 1;
  }
}

Contributor Author

I think the above suggestion is part of the code I propose, but it doesn't handle some of the other cases, such as 2 escaped chars in a row, or clearing the buffer between each call to sanitize that requires escaping. I'll add more comments to make each conditional clearer.

Contributor Author

The code could be simpler but I wanted to avoid having the user clear buffer on every call, because that does not optimize for the case where the string does not require escaping, which in the expected usage will be the common case.
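The calling convention being optimized for can be sketched as follows. This is a minimal illustration with hypothetical names, not the PR's code: the sanitizer returns a view of the input when no escaping is needed and only touches the caller's buffer otherwise, so the common no-escape case pays for neither a clear() nor a copy.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical sketch: returns `str` unchanged on the fast path, otherwise
// rebuilds the escaped form in `buffer` and returns a view of it.
std::string_view sanitize(std::string& buffer, std::string_view str) {
  bool needs_escape = false;
  for (char c : str) {
    if (c == '"' || c == '\\' || static_cast<unsigned char>(c) < 0x20) {
      needs_escape = true;
      break;
    }
  }
  if (!needs_escape) {
    return str; // fast path: buffer untouched, may hold a prior call's output
  }
  buffer.clear(); // only paid when escaping is actually required
  for (char c : str) {
    switch (c) {
    case '"':  buffer += "\\\""; break;
    case '\\': buffer += "\\\\"; break;
    case '\n': buffer += "\\n";  break;
    default:   buffer += c;      break; // other control chars omitted here
    }
  }
  return buffer;
}
```

A caller can reuse one buffer across many calls and never needs to reset it between them.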

Member

Sounds reasonable. But string clear just sets the string length to zero and the first char to '\0'; would this bring a significant performance impact? 🤔

Contributor Author

Changing the code to what you suggested definitely made things materially slower, but adding the clear() at the start was only marginally slower (within the margin of noise), I guess due to compilers/CPUs doing branch prediction well (clear() will need to do different things depending on whether the std::string has inlined the bytes or not).

I'll mess around a little more with that.

Member

Get it. If we can simplify the code without paying a performance price, it is surely best to do so. Otherwise, just ignore my suggestion. 😸

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
…lasting with memset.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
…ze one char at a time.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@daixiang0
Member

The benchmark result looks very good!

BTW, I'm not sure whether this is a common optimization for protobuf-to-JSON encoders; if it is, we could upstream it as well.

@jmarantz
Contributor Author

potentially, sure. I think if you already have a protobuf structure and want to generate JSON from it, it might perform OK. But in our case we had a C++ structure, and replicated that as a protobuf, building up a huge parallel structure, and then serialized it.

Serializing to JSON is, I think, mostly trivial if you have a sanitizer you can trust, so this is kind of a building block for a better way to translate C++ structures to JSON.

wbpcode
wbpcode previously approved these changes Mar 21, 2022
Member

@wbpcode wbpcode left a comment

LGTM. Thanks. @envoyproxy/envoy-maintainers for second pass.

@jmarantz
Contributor Author

/assign-from @envoyproxy/senior-maintainers

@repokitteh-read-only

@envoyproxy/senior-maintainers assignee is @htuch


@htuch
Member

htuch commented Mar 22, 2022

@jmarantz how hardened is this security-wise? I imagine we might want to use it in JSON logs as well as admin, so would like to get some clear understanding of how to think about threat model when looking at this code.

@jmarantz
Contributor Author

RE hardening, I was thinking about adding fuzz tests. Differential fuzz tests are possible by comparing sanitized results with the protobuf json encoder. However, the protobuf encoder generates error messages and yields an empty sanitized result whenever an invalid utf8 input is given. So I'm not sure if we can do that much better than the unit tests. What do you recommend?

daixiang0
daixiang0 previously approved these changes Mar 22, 2022
Member

@daixiang0 daixiang0 left a comment

LGTM

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz dismissed stale reviews from daixiang0 and wbpcode via 696cc04 March 22, 2022 13:42
@jmarantz jmarantz marked this pull request as draft March 22, 2022 14:06
@jmarantz
Contributor Author

converting to draft temporarily while I iterate a little on a fuzzing solution.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Contributor Author

I found some incorrect behavior with invalid input. Iterating again.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz marked this pull request as ready for review March 25, 2022 03:04
@jmarantz
Contributor Author

Thanks to @wbpcode, who showed me a fast UTF-8 decoder without branches that I think could crash on invalid input... making me wonder how this sanitizer would do on invalid input. It does help to go ahead and test that :)

Also I borrowed the 'branchless' concept to short-circuit the 99% case where nothing needs to be escaped, bringing it back to 128x the protobuf serializer performance in the fast case.
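The "branchless" short-circuit described above can be sketched like this (illustrative only; `FastPath` and `need_escape_` are hypothetical names, not the PR's). The idea is to OR together a per-character table lookup with no branch inside the loop, then branch once at the end to decide between returning the input unchanged and taking the slow path.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

struct FastPath {
  // 1 for bytes that force the slow path: control chars, quote, backslash,
  // and (in this sketch) anything with the high bit set.
  uint8_t need_escape_[256] = {};

  FastPath() {
    for (uint32_t c = 0; c < 0x20; ++c) {
      need_escape_[c] = 1;
    }
    need_escape_[static_cast<uint8_t>('"')] = 1;
    need_escape_[static_cast<uint8_t>('\\')] = 1;
    for (uint32_t c = 0x80; c < 0x100; ++c) {
      need_escape_[c] = 1; // non-ASCII bytes: defer to the slow path
    }
  }

  bool needsSlowPath(std::string_view str) const {
    uint8_t acc = 0;
    for (char c : str) {
      acc |= need_escape_[static_cast<uint8_t>(c)]; // no branch per char
    }
    return acc != 0; // single branch for the whole string
  }
};
```

For stats names, where escapes are rare, the common case reduces to a tight table-lookup loop with one final test.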

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Contributor Author

I think the prod code is correct, though I'm able to run the fuzzer now beyond my own corpus and it finds discrepancies, so I'll go back to draft mode till those are sorted.

@jmarantz jmarantz marked this pull request as draft March 28, 2022 17:34
…hat protobufs reject.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz marked this pull request as ready for review March 29, 2022 12:21
@jmarantz
Contributor Author

jmarantz commented Mar 29, 2022

I was able to get a local asan-fuzzer built and ran it for a few hours, finding no problems, so I think this is ready for review again.

The fuzzer didn't previously find production problems, but it did find UTF-8 sequences producing what appear to be valid unicode values that protobufs reject, making differential fuzzing find discrepancies rather quickly. We now have a map of which unicode ranges cannot be protobuf-serialized, and a mechanism to re-compute that map if needed in the future.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
}

bool test(Value value) const override {
const auto left_pos = intervals_.lower_bound(Interval(value, value + 1));
Member

Is this assuming an integer representation of intervals? Just trying to figure out why the +1 works.

Contributor Author

Interval is documented as being [left, right). I'd have to think about how this would be used with floats. Maybe it's better just to document this as integer-only.
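The half-open interval membership test being discussed can be illustrated with a small stand-alone sketch (`Interval` and `contains` are hypothetical stand-ins for the PR's class, assuming integer bounds). The `value + 1` makes the probe a non-empty [value, value+1) interval, so lower_bound lands exactly on an interval starting at `value`, while an interval containing `value` but starting earlier sorts just before the probe.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

using Interval = std::pair<uint32_t, uint32_t>; // [first, second)

bool contains(const std::set<Interval>& intervals, uint32_t value) {
  // First stored interval not lexicographically less than [value, value+1).
  auto it = intervals.lower_bound(Interval(value, value + 1));
  // Case 1: an interval starts exactly at value.
  if (it != intervals.end() && it->first == value) {
    return true;
  }
  // Case 2: value falls inside the previous interval, which starts earlier.
  if (it != intervals.begin()) {
    --it;
    if (value >= it->first && value < it->second) {
      return true;
    }
  }
  return false;
}
```

With float bounds the `value + 1` probe would no longer be "the next representable interval", which is why documenting this as integer-only seems reasonable.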

Escape& escape = unicode_escapes_[index];
escape.size_ = escape_str.size();
RELEASE_ASSERT(escape.size_ <= sizeof(escape.chars_), "escaped string too large");
memcpy(escape.chars_, escape_str.data(), escape_str.size()); // NOLINT(safe-memcpy)
Member

Looking at this approach in general, I think we have a lot of complexity in code being executed at runtime. Sure, it's going to build the same table every time, but it's a bit hard to reason about.

Can't we build an offline table in the build process and then just #include it? The advantage of that is it would also allow for easy inspection.

Contributor Author

I was thinking we would instantiate one of these Json tables in the StatsHandler so it would only be on startup.

However I think it's fine to move this code to a generator and either generate during build, or generate manually and check it in as source, annotating the array with comments.

return slowSanitize(buffer, str);
}

absl::string_view JsonSanitizer::slowSanitize(std::string& buffer, absl::string_view str) const {
Member

Have you audited https://nlohmann.github.io/json/ for similar functions? I'm wondering if we have anything in our existing deps that will simplify either sanitize or the UTF-8 piece.

Contributor Author

Good catch -- I didn't know about nlohmann and now I see we are already pulling that in. I'll take a look.

Do you know why we started using protobufs as an IR to generate JSON in the first place, given that?

Contributor Author

@jmarantz jmarantz Mar 31, 2022

RE nlohmann a quick read suggests it has a similar usage model as protobufs: https://github.com/nlohmann/json#serialization--deserialization

i.e., to serialize you need to populate an object and then serialize it, rather than serializing directly. I think this pattern of creating intermediate objects in order to convert a string to a safer string is the core reason the protobuf version is so slow.

Note that even though I named this function slowSanitize it is still 10x faster than protobuf serialization due to not needing an intermediate representation.

Member

The history here was that Protobuf for the longest time was the only reliable dependency here, and we were trying to reduce our dependency on rapidjson. The introduction of nlohmann JSON has changed this dynamic.

Member

Can you run a quick benchmark to confirm that the nlohmann JSON library is not suitable? I think Protobuf is slow for all kinds of reasons; it has most likely never been highly tuned for performance on this path, whereas the nlohmann library is designed for speed.

Basically, what we have in this PR represents a real speed up that seems to justify the complexity, but we should do the due diligence that there isn't something else within reasonable reach in one of our existing dependencies (the only ones I can think of are nlohmann JSON, maybe Abseil?) that does the job.

Contributor Author

Hm! That wasn't too hard, and probably that's the right way to go. I think the fast-path wrapper I have should stay because, for stats, it will be used most of the time, it's still 20x faster than nlohmann, and it's pretty easy to reason about. But nlohmann is only 2x slower than the sanitizer I added here, so it might not be worth adding all the complexity:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
BM_ProtoEncoderNoEscape                1330 ns         1330 ns       458178
BM_JsonSanitizerNoEscape               8.83 ns         8.82 ns     79050443
BM_NlohmannNoEscape                     187 ns          187 ns      3752300
BM_StaticJsonSanitizerNoEscape         9.52 ns         9.52 ns     73460589
BM_ProtoEncoderWithEscape              1483 ns         1483 ns       471766
BM_NlohmannWithEscape                   199 ns          199 ns      3489229
BM_JsonSanitizerWithEscape              103 ns          103 ns      6850721
BM_StaticJsonSanitizerWithEscape        102 ns          102 ns      6859187

Contributor Author

One more point about nlohmann: its output is not at all compatible with protobufs'. It makes all kinds of different (more rational) escaping decisions than protobufs, so it can't be differentially fuzzed against them.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz
Contributor Author

jmarantz commented Apr 1, 2022

Closing this PR for now; I will open a new one that just uses nlohmann as the slowSanitize impl after the test changes are sorted out.

I think I just have to have the tests generally do unicode unpacking when testing equivalence against protobuf sanitization, or we could just punt all that and assume nlohmann is correct due to its own tests.

@jmarantz jmarantz closed this Apr 1, 2022
@jmarantz
Contributor Author

jmarantz commented Apr 3, 2022

This PR is superseded by #20637 which uses the nlohmann json library for the slow sanitization path -- it's about 2x slower than what we have here but still significantly faster, and more correct, than protobuf sanitization.

htuch pushed a commit that referenced this pull request Apr 8, 2022
Creates a JSON sanitizer that delegates to the nlohmann library when there's any escapes required, or characters with the 8th bit set, but otherwise just returns its input.

This PR supersedes #20428 which has a faster hand-rolled sanitizer, but the speed penalty for nlohmann is not too bad, compared with the protobuf sanitizer, and will generally be rarely needed anyway.

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ProtoEncoderNoEscape         1445 ns         1444 ns       455727
BM_NlohmannNoEscape             9.79 ns         9.79 ns     71449511
BM_ProtoEncoderWithEscape       1521 ns         1521 ns       462697
BM_NlohmannWithEscape            215 ns          215 ns      3264218
Additional Description: This is intended to be adapted into the JSON renderer introduced in #19693
Risk Level: low -- this is layered on top of an already well-trusted library.
Testing: //test/...
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Joshua Marantz <jmarantz@google.com>
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022