
Conversation

@carterkozak
Contributor

Most benchmarks use the same set of input data for each iteration, which is heavily biased toward caches. This benchmark is meant to stress canonicalization cache misses in a way that we observe in some production systems.

My implementation isn't ideal because I'm doing a fair bit of work to generate randomized data within the measured component of the benchmark. However, I've profiled the results: most time is not spent in the setup portion, and the setup is comparable between configurations, so the results should be comparable between flags.

Initially I began this investigation based on the InternCache, which I expected to be the primary bottleneck for reasons listed here: https://shipilev.net/jvm/anatomy-quarks/10-string-intern/
There is a measurable impact from disabling interning, particularly with smaller heap sizes, but there's a much larger improvement when we opt out of canonicalization entirely, especially in paths which rely on the ByteQuadsCanonicalizer rather than the CharsToNameCanonicalizer.

My initial results running java -Xmx256m -jar target/perf.jar ".*JsonArbitraryFieldNameBenchmark.*" -wi 4 -w 4 -i 4 -r 4 -f 1 -t max -rf json (28 threads):

(note the score is microseconds per operation where lower is better, not operations per second)

Benchmark                                       (mode)                   (shape)        (type)  Mode  Cnt    Score    Error  Units
JsonArbitraryFieldNameBenchmark.parse          DEFAULT            RANDOM_KEY_MAP  INPUT_STREAM  avgt    4  245.657 ± 77.651  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT            RANDOM_KEY_MAP        READER  avgt    4   22.317 ±  1.231  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT  BEAN_WITH_RANDOM_KEY_MAP  INPUT_STREAM  avgt    4  245.578 ± 54.366  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT  BEAN_WITH_RANDOM_KEY_MAP        READER  avgt    4   26.413 ±  5.014  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN            RANDOM_KEY_MAP  INPUT_STREAM  avgt    4  225.419 ±  4.009  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN            RANDOM_KEY_MAP        READER  avgt    4    9.220 ±  0.082  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN  BEAN_WITH_RANDOM_KEY_MAP  INPUT_STREAM  avgt    4  222.956 ± 20.320  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN  BEAN_WITH_RANDOM_KEY_MAP        READER  avgt    4   10.543 ±  0.051  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE            RANDOM_KEY_MAP  INPUT_STREAM  avgt    4    8.427 ±  0.044  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE            RANDOM_KEY_MAP        READER  avgt    4    8.376 ±  0.024  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE  BEAN_WITH_RANDOM_KEY_MAP  INPUT_STREAM  avgt    4    9.176 ±  0.056  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE  BEAN_WITH_RANDOM_KEY_MAP        READER  avgt    4    9.086 ±  0.053  us/op
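
For context, these modes presumably map onto the existing jackson-core factory features: JsonFactory.Feature.INTERN_FIELD_NAMES and JsonFactory.Feature.CANONICALIZE_FIELD_NAMES are the real toggles, while the Mode enum below is an illustrative sketch rather than the benchmark's exact code.

import com.fasterxml.jackson.core.JsonFactory;

enum Mode {
    DEFAULT {
        @Override JsonFactory apply(JsonFactory f) {
            return f;
        }
    },
    NO_INTERN {
        @Override JsonFactory apply(JsonFactory f) {
            // Keep canonicalization, but skip String.intern() on names.
            return f.disable(JsonFactory.Feature.INTERN_FIELD_NAMES);
        }
    },
    NO_CANONICALIZE {
        @Override JsonFactory apply(JsonFactory f) {
            // Interning is meaningless without canonicalization, so
            // disable both.
            return f.disable(JsonFactory.Feature.INTERN_FIELD_NAMES)
                    .disable(JsonFactory.Feature.CANONICALIZE_FIELD_NAMES);
        }
    };

    abstract JsonFactory apply(JsonFactory f);
}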

@cowtowncoder
Member

Yes, ByteQuadsCanonicalizer definitely takes a heavier hit for decoding in the case of a symbol table miss -- intern()ing is of less concern there (although obviously not helpful either).


public String stringThree;

@JsonCreator(mode = JsonCreator.Mode.PROPERTIES)
Member

@cowtowncoder Apr 18, 2023


Instead of using Creators, maybe use simple setters/getters; that's faster than using Constructors.
Or actually just assign directly into the public fields, since they are there already.

Contributor Author


I considered avoiding databinding altogether, but it's a convenient way to consume the data with some level of validation. Happy to update to getters and setters (or perhaps public property fields?).

Member


Yeah not a huge deal by any measure, either works fine.

@Setup
public void setup() {
    factory = mode.apply(new JsonFactory());
    ObjectMapper mapper = new ObjectMapper(factory)
Member


Similar to JsonFactory, let's use builders here to make the code 3.0 compatible.
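
A minimal sketch of what that would look like (JsonFactory.builder() and JsonMapper.builder() are the 2.10+ entry points that carry over to 3.0):

@Setup
public void setup() {
    factory = mode.apply(JsonFactory.builder().build());
    // JsonMapper.builder(factory) mirrors new ObjectMapper(factory) but
    // follows the builder style that survives the move to the immutable
    // 3.0 mapper design.
    ObjectMapper mapper = JsonMapper.builder(factory).build();
}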

INPUT_STREAM() {
    @Override
    JsonParser create(JsonFactory factory, Supplier<String> jsonSupplier) throws IOException {
        return factory.createParser(new ByteArrayInputStream(jsonSupplier.get().getBytes(StandardCharsets.UTF_8)));
Member


Why not make the supplier provide bytes, to avoid the additional conversion overhead here?

Contributor Author


Any chance you have an example? I reached for the easiest way to build dynamic JSON on the fly. I suspect that String.getBytes(UTF_8) performance isn't far from building the byte array directly with custom key-encoding logic, thanks to the vectorized fast path that converts a JDK 11+ compact string to ASCII-compatible bytes (an array copy without encoding).

Member


The dynamic part may be problematic, but existing tests have a helper class (InputData?) that creates data, extracts bytes, etc. So it's more a question of not doing this conversion for each parse/generation run than of the cost of getting the bytes itself -- that is, keeping it out of the measurement loop. If that makes sense.
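
A sketch of the byte-based variant under discussion (hypothetical shape; with fully dynamic JSON the String-to-byte[] conversion can't be hoisted entirely, but a Supplier<byte[]> would at least keep it out of the parser-creation path being measured):

INPUT_STREAM() {
    @Override
    JsonParser create(JsonFactory factory, Supplier<byte[]> jsonSupplier) throws IOException {
        // The supplier already returns UTF-8 bytes, so no per-call
        // String.getBytes() conversion happens inside the measured region.
        return factory.createParser(new ByteArrayInputStream(jsonSupplier.get()));
    }
}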

@carterkozak
Contributor Author

It's interesting that the UTF8StreamJsonParser implementation seems much more performant at reading randomized string values than ReaderBasedJsonParser, so it would stand to reason that an option to turn off canonicalization in UTF8StreamJsonParser would be more efficient than falling back to ReaderBasedJsonParser when canonicalization is not enabled. Before I investigate/prototype, I'm curious whether this matches your expectation.

No rush, as always I appreciate your help :-)

@cowtowncoder
Member

@carterkozak Yes, UTF8StreamJsonParser in general should be more efficient than the Reader-backed one, since it combines UTF-8 decoding and tokenization. I spent more time optimizing it as well. So having to use the generally slower one (because canonicalization doesn't work with an unbounded key space) is unfortunate. It would of course be possible to design an alternative that was optimized differently (faster decoding for the non-canonicalizing case), but that'd be slower for cases where canonicalization works (going directly from byte[] to String).

@cowtowncoder cowtowncoder merged commit b4d08c1 into FasterXML:2.15 Apr 19, 2023
@cowtowncoder
Member

@carterkozak One other thing: as I implied, decoding for the "miss" case in UTF8StreamJsonParser is not heavily optimized, so another thing that might be useful, with profiling, would be to see if there are wins there. It's quite possible, given the lack of focus so far; and now with a reproduction (the JMH test can be run under a profiler) it'd be relatively easy to get numbers.
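
For reference, JMH's profiler hooks make that easy to check: the bundled gc profiler reports allocation rates alongside the scores, and async-profiler can be attached the same way via -prof async where it is installed. For example:

java -Xmx256m -jar target/perf.jar ".*JsonArbitraryFieldNameBenchmark.*" -prof gc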

@carterkozak
Contributor Author

That makes sense, I’d definitely prefer to improve all configurations rather than exclusively a non-default case. I’ll report back once I’ve done some investigation. Thanks!

@carterkozak
Contributor Author

Initial testing seems to indicate that without canonicalization, UTF8StreamJsonParser (with a very light modification to check _symbols.isCanonicalizing()) outperforms the reader-based alternative in this benchmark (when the reader-based implementation wraps an InputStreamReader, for a like-for-like comparison).
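
Roughly, the modification amounts to an early exit in the field-name decoding path. This is a hand-wavy sketch rather than the actual patch; decodeFieldName() and canonicalizeFieldName() are hypothetical stand-ins for the existing decode and lookup paths:

// Sketch only: bypass the symbol table when canonicalization is off.
private String parseFieldName() throws IOException {
    if (!_symbols.isCanonicalizing()) {
        // Decode a fresh String each time; nothing is stored, so no
        // table growth and no large array copies.
        return decodeFieldName();
    }
    // Existing path: look up / insert via the ByteQuadsCanonicalizer.
    return canonicalizeFieldName();
}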

When canonicalization is enabled, in cases without a bounded set of keys, the performance difference between ByteQuadsCanonicalizer and CharsToNameCanonicalizer appears to come down to buckets: the ByteQuadsCanonicalizer features a flattened hash table, which is fantastic for locality, however any time mutations are required, the array copies are much larger and therefore more expensive.
In the benchmark with canonicalization, the CharsToNameCanonicalizer symbols array grows from 64 elements to 256 elements, and the buckets array grows from 32 to 128 buckets, with a total size of 12000 strings. Copies are individually inexpensive on smaller arrays, and bucket mutations are inexpensive due to the immutable bucket data structure (although traversal of the linked list is fairly expensive).
The ByteQuadsCanonicalizer names array grows from 1024 elements initially up to 16384 before resetting, and the main hash array grows from 4096 elements to 65536, so each time a new key is encountered, we end up copying roughly a third of a megabyte on heap (based on the 16384-length names array times 4 bytes per object reference, assuming compressed references, plus the 65536-length integer array at 4 bytes apiece).
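
Spelling that estimate out: 16384 references × 4 bytes = 64 KiB for the names array, plus 65536 ints × 4 bytes = 256 KiB for the main hash array, i.e. roughly 320 KiB, which matches the "third of a megabyte" figure.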

Aside/pipe dream: I wonder if we could prepopulate the canonicalization table; for example, databind is likely to encounter most property names based on its annotation-scanning models. I suppose in an ideal world we might have contextual canonicalization based on the set of expected fields rather than a global cache.

carterkozak pushed a commit to palantir/conjure-java-runtime that referenced this pull request Apr 20, 2023
We expect unlimited keyspace due to maps keyed by random IDs which
are usually UUID-based. This interacts poorly with string
canonicalization components in Jackson in ways that cause heavy heap
churn. See the discussion here for the details:
FasterXML/jackson-benchmarks#6

The primary risk is that without canonicalization, we allocate
repeated strings much more heavily. Strings used to match setter
methods should be very short-lived and should not escape the
parsing thread, so the JIT may be able to do terribly clever
things to help us out.

Note: it may be possible to canonicalize contextually in jackson
3, which would give us the best of both worlds!
@cowtowncoder
Member

cowtowncoder commented Apr 21, 2023

One thing wrt Jackson 3.0 (I know, I know, far away...) is that known keys are pre-canonicalized per BeanDeserializer. I may have mentioned this. That keeps them away from the "global" ByteQuadsCanonicalizer (and CharsToNameCanonicalizer).

Another smaller possible win is of course considering what symbol table size is big enough for bounded key sets (covering, say, 98% of cases); if that's smaller than the current max, could/should we lower the limit for pathological cases?

But the pre-population idea is not bad either: given that new entries are appended at the end (slower), canonicalizing reused keys is beneficial.
(Or is that how it works? I'd need to double-check.)

@carterkozak
Contributor Author

Ah yes, the PropertyNameMatcher in 3.0 is exactly what I had in mind! I admit I hadn't read through much of the 3.0 code until today.

Another smaller possible win is of course considering what symbol table size is big enough for bounded key sets (covering, say, 98% of cases); if that's smaller than the current max, could/should we lower the limit for pathological cases?

One idea I'm toying with is a compromise between the flat structure and a full copy: allow canonicalizer tables to be stacked. On lookup, the local table is checked first, then the parent on a cache miss; on store, only the child table is updated, without creating a full copy. The two must eventually be merged in some way to avoid stacking more than two tables (degenerating into a linked list), but that update can be done at the end of the parse to avoid concurrent copies, most of which would be thrown away. I need to think a bit more about how this could improve the single-threaded case, since there are no concurrent updates there and the table is always merged back to the parent at the end. A rough sketch of the lookup scheme follows.
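
In pseudo-Java, the stacked lookup would be roughly the following. This is an entirely hypothetical sketch: none of these names exist in jackson-core, plain Maps stand in for the real hash structures, and synchronization of the shared parent is elided.

import java.util.HashMap;
import java.util.Map;

final class StackedSymbolTable {
    private final Map<String, String> local = new HashMap<>(); // per-parser
    private final Map<String, String> parent;                  // shared

    StackedSymbolTable(Map<String, String> parent) {
        this.parent = parent;
    }

    String canonicalize(String name) {
        String hit = local.get(name);      // check the small local table first
        if (hit == null) {
            hit = parent.get(name);        // then the shared table on a miss
        }
        if (hit != null) {
            return hit;
        }
        // Miss in both: store only in the local table, so the hot path
        // never copies the large shared structure.
        local.put(name, name);
        return name;
    }

    // Called once at end-of-parse: merge local entries back so the shared
    // table grows in a single batch instead of per-key copies.
    void release() {
        parent.putAll(local);
    }
}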
