When dumping to IO, dump directly #538

headius · 2023-08-15T00:30:44Z

Note, this PR is a proof-of-concept of direct IO dumping in the json library. It works, but is not as fast as it could be, and there's no CRuby implementation yet.

JSON.dump allows you to pass an IO to which the dump output will be sent, but it still buffers the entire output in memory before sending it to the given IO. This leads to issues on JRuby like jruby/jruby#6265 when it tries to create a byte[] that exceeds the maximum size of a signed int (the JVM's array size limit).

This commit plumbs the IO all the way through the generation logic so that it can be written to directly without filling a temporary memory buffer first. This allows JRuby to dump object graphs that would normally produce more content than the JVM can hold in a single array, providing a workaround for jruby/jruby#6265.
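For reference, the call path this change affects looks roughly like the following. This is a minimal sketch that only uses the existing public `JSON.dump(obj, io)` form; the `data` hash and the `StringIO` sink are illustrative stand-ins for a large object graph and a real file or socket.

```ruby
require "json"
require "stringio"

# Illustrative data; a real case would be a very large object graph.
data = { "name" => "example", "values" => [1, 2, 3] }

# StringIO stands in for a File or socket here.
io = StringIO.new

# Passing an IO as the second argument sends the generated JSON to it.
# Whether the output is buffered in memory first or streamed is an
# internal detail; the call itself is the same either way.
JSON.dump(data, io)

written = io.string
```

The caller-visible behavior is the same before and after the patch; only the peak memory use during generation is expected to differ.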

It is unfortunately a bit slow to dump directly to IO due to the many small writes that all acquire locks and participate in the IO encoding subsystem. A more direct path that can skip some of these pieces could be more competitive with the in-memory version, but functionally it expands the size of graphs that can be dumped when using JRuby.

See #524


headius · 2023-08-15T00:32:43Z

Note that prior to this patch, the script provided in jruby/jruby#6265 required 2-3GB of memory to run. Afterwards it varies between 200-600MB and completes successfully.

segiddins · 2023-08-19T17:28:19Z

For performance, it might help to have a buffered proxy to the underlying IO? That way, you only incur the locking/encoding overhead every buffer block size, vs every substring that gets written?
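A buffered proxy along these lines could look like the sketch below. `BufferedWriter`, its `limit` parameter, and the 16 KiB default are hypothetical names and values chosen for illustration; nothing like this exists in the json library as far as this thread shows.

```ruby
require "stringio"

# Hypothetical proxy: collects small writes and forwards them to the
# underlying IO only when the buffer reaches `limit` bytes.
class BufferedWriter
  def initialize(io, limit = 16 * 1024)
    @io = io
    @limit = limit
    @buffer = String.new(encoding: Encoding::UTF_8)
  end

  # Append to the in-memory buffer; touch the real IO only when full.
  def write(str)
    @buffer << str
    flush if @buffer.bytesize >= @limit
    str.bytesize
  end

  def <<(str)
    write(str)
    self
  end

  # Forward whatever is buffered in a single write call.
  def flush
    return self if @buffer.empty?
    @io.write(@buffer)
    @buffer.clear
    self
  end
end

sink = StringIO.new
writer = BufferedWriter.new(sink, 8)
["[", "1", ",", "2", ",", "3", "]"].each { |piece| writer << piece }
before_flush = sink.string.dup # 7 bytes buffered, below the limit of 8
writer.flush
after_flush = sink.string.dup
```

With this shape, the locking and encoding cost of the underlying IO is paid once per filled buffer rather than once per fragment, at the price of needing an explicit `flush` at the end of the dump.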

headius · 2023-08-19T18:25:36Z

@segiddins Yeah, I hoped that internal buffering in IO would be sufficient for this, but that buffer may simply be too small, or there's enough overhead with character encoding checks that we lose the benefit. I'd like to play with some other buffering strategies, and some of the larger methods that do many tiny writes could do those writes into a temporary string that gets dumped in one go.

headius · 2023-08-19T18:29:48Z

Oh, I did run a profile of this on JRuby, and the other cost we have using IO as the buffer is all the locking that's required to keep it thread safe. So if you're writing individual characters, that's a lot of lock acquisition and release. Batching up those writes into coarser chunks would make a big difference.
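To make the cost difference concrete, here is a small sketch comparing per-fragment writes with one batched write. `CountingIO`, `write_pieces`, and `write_batched` are hypothetical names for illustration; the counter stands in for the lock acquisitions a real IO would perform on each call.

```ruby
require "stringio"

# Hypothetical sink that counts write calls; on a real IO each call
# would also acquire and release a lock.
class CountingIO < StringIO
  attr_reader :write_calls

  def initialize(*)
    super
    @write_calls = 0
  end

  def write(*args)
    @write_calls += 1
    super
  end
end

# One write call per fragment.
def write_pieces(io, numbers)
  io.write("[")
  numbers.each_with_index do |n, i|
    io.write(",") if i > 0
    io.write(n.to_s)
  end
  io.write("]")
end

# Assemble the fragments in a temporary string, then write once.
def write_batched(io, numbers)
  chunk = +"["
  numbers.each_with_index do |n, i|
    chunk << "," if i > 0
    chunk << n.to_s
  end
  chunk << "]"
  io.write(chunk)
end

piecewise = CountingIO.new
write_pieces(piecewise, [1, 2, 3])

batched = CountingIO.new
write_batched(batched, [1, 2, 3])
```

Both sinks end up with the same content, but the piecewise version makes seven write calls against one for the batched version. Batching a whole large collection this way would bring back the memory problem, so a temporary string like this presumably only makes sense for small, bounded pieces of output.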

byroot · 2024-11-05T15:57:41Z

I'm assuming you're not actively working on this, so I'll close.

headius · 2024-11-05T16:38:41Z

@byroot I never stopped working on this but never got feedback from any stakeholders. I still believe it should be done.

headius · 2024-11-05T16:40:37Z

FWIW we have a customer that needed this, and I assume they still do. Dumping directly to an in-memory buffer is prohibitive for very large JSON streams.

byroot · 2024-11-05T16:46:32Z

Yeah, it makes sense, and looking at the C implementation, I think it would be relatively easy to do the same there.

I'm just trying to keep the repo tidy; if you feel strongly about keeping that draft open, feel free to re-open.

> but never got feedback from any stakeholders.

Well, as mentioned previously, from my point of view you are the Java implementation maintainer, so if the feature doesn't require changing the public API, feel free to implement and merge whatever you want.

headius · 2024-11-05T16:49:11Z

@byroot I was not a maintainer at the time, so I was not enthusiastic to do much with it. That's why it sat for a year+ without any work being done; if the maintainer was not on board, there would not be much point in me doing it.

I also did not hear from a single maintainer of the C extension and I do not have the knowledge of that codebase to make a similar change, so that further dampened my enthusiasm.

I will re-open. I believe we should do this for both extensions and json-pure.

byroot · 2024-11-05T16:51:16Z

> I believe we should do this for both extensions and json-pure.

json_pure is gone as of a few minutes ago.

As for the C extension, I can take care of it, but the nice thing about this feature is that AFAICT it doesn't change anything about the public API; it's just an internal implementation detail. So IMO we can perfectly well merge the Java side of it and see about the C side later.

headius force-pushed the streaming_output branch from 2766f66 to 7576f5c Compare August 15, 2023 00:31

headius marked this pull request as draft August 15, 2023 00:31

byroot force-pushed the master branch from ddee95a to 54363cb Compare October 18, 2024 09:52

byroot force-pushed the master branch from 1af38e5 to d54063a Compare October 29, 2024 13:38

byroot closed this Nov 5, 2024

headius reopened this Nov 5, 2024
