
Cannot load a gzipped JSON trace with multiple blocks #872

Closed
vmarkovtsev opened this issue Aug 29, 2024 · 9 comments

Comments

@vmarkovtsev

The Perfetto trace loader doesn't support "FEXTRA" multi-block gzip files. How to reproduce:

  1. Install https://github.com/vinlyx/mgzip
  2. Take any existing JSON trace.
  3. Compress it with mgzip:

```python
import mgzip

with open("trace.json") as fin:
    with mgzip.open("trace.json.gz", "wt", thread=8, blocksize=1 << 16) as fout:
        while buffer := fin.read(1 << 16):
            fout.write(buffer)
```

  4. Try to load the result:

```
../trace_processor --httpd trace.json.gz
JSON trace file is incomplete
```

  5. Recompressing with plain gzip works:

```
gzip -d trace.json.gz
gzip trace.json
../trace_processor --httpd trace.json.gz
```

Why does this obscure gzip format property matter to me? We train 100B-parameter base LLMs in PyTorch and produce profiles of a few hundred megabytes every few minutes. They take considerable time to compress, so compressing them in parallel on the 192 available CPU cores is a considerable win.

@LalitMaganti
Collaborator

LalitMaganti commented Aug 29, 2024

We use zlib to implement decompression. If there's a way to configure zlib to read these gzip streams, I'm happy to add support. Otherwise we would not be able to fix this, as it's a bit too niche to justify adding to trace processor.

We'd need someone external to spend a bit of time figuring out how to configure zlib to read this, and it would be doubly helpful if the fix could be contributed, as it's a very self-contained issue.

@vmarkovtsev
Author

Thanks @LalitMaganti.
I will post my intermediate investigation notes in this issue as I go, and hopefully engineer a PR sooner or later.
For now, I found this: https://stackoverflow.com/questions/65188890/what-gzip-extra-field-subfields-exist
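To see what an mgzip-produced header actually carries, the FEXTRA subfields can be decoded directly from the member header per RFC 1952, section 2.3.1.1. A minimal sketch using only the stdlib; `read_gzip_extra` is a hypothetical helper name, not part of any library:

```python
import struct

def read_gzip_extra(header: bytes):
    """Parse FEXTRA subfields from a gzip member header (RFC 1952, 2.3.1.1)."""
    assert header[:2] == b"\x1f\x8b", "not a gzip header"
    flg = header[3]
    if not flg & 0x04:  # FEXTRA flag not set
        return []
    # The fixed header is 10 bytes; XLEN (little-endian u16) follows it.
    xlen = struct.unpack_from("<H", header, 10)[0]
    extra = header[12:12 + xlen]
    fields, pos = [], 0
    while pos + 4 <= len(extra):
        si = extra[pos:pos + 2]                             # 2-byte subfield ID
        slen = struct.unpack_from("<H", extra, pos + 2)[0]  # subfield length
        fields.append((si, extra[pos + 4:pos + 4 + slen]))
        pos += 4 + slen
    return fields

# Synthetic header with FLG.FEXTRA set and one subfield "BC" of 4 bytes:
hdr = (b"\x1f\x8b\x08\x04" + bytes(6) + struct.pack("<H", 8)
       + b"BC" + struct.pack("<H", 4) + b"\x01\x02\x03\x04")
assert read_gzip_extra(hdr) == [(b"BC", b"\x01\x02\x03\x04")]
```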


@LalitMaganti
Collaborator

In practice, the use of zlib happens in https://github.com/google/perfetto/blob/master/src/trace_processor/util/gzip_utils.cc

@vmarkovtsev
Author

This is what I learned today:

  • The missing feature is support for concatenated gzip streams, which is part of RFC 1952, section 2.2:

> A gzip file consists of a series of "members" (compressed data sets).
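The member behavior is easy to demonstrate from Python's stdlib: `gzip` follows the RFC and decodes every member, while a single zlib inflate stream stops at the first member's trailer, which mirrors what the trace loader currently does. A small sketch:

```python
import gzip
import zlib

# Two independently compressed gzip members concatenated into one byte
# string, as RFC 1952 section 2.2 permits (and as mgzip produces per block).
data = gzip.compress(b'{"traceEvents": [') + gzip.compress(b']}')

# gzip.decompress reads all members...
assert gzip.decompress(data) == b'{"traceEvents": []}'

# ...but a single inflate stream stops at the end of the first member.
d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+ selects the gzip wrapper
first = d.decompress(data)
assert first == b'{"traceEvents": ['
assert d.unused_data[:2] == b"\x1f\x8b"  # the second member is left unread
```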

@vmarkovtsev
Copy link
Author

Therefore, inspired by node.js, I would change this code in gzip_utils.cc:

```cpp
case Z_STREAM_END:
    return Result{ResultCode::kEof, out_size - z_stream_->avail_out};
```

to something along these lines (a sketch; note the magic-byte check belongs on the unread *input*, not on an output offset):

```cpp
case Z_STREAM_END:
    // RFC 1952 allows further members after this one. If the unread input
    // starts with another gzip magic header (0x1f 0x8b), reset the inflate
    // state and keep decompressing instead of reporting EOF.
    if (z_stream_->avail_in >= 2 && z_stream_->next_in[0] == 0x1f &&
        z_stream_->next_in[1] == 0x8b) {
      inflateReset(z_stream_.get());  // next stream detected
      return Result{ResultCode::kOk, out_size - z_stream_->avail_out};
    }
    return Result{ResultCode::kEof, out_size - z_stream_->avail_out};
```
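The same reset-and-continue loop can be prototyped in Python, where creating a fresh `decompressobj` plays the role of zlib's `inflateReset`. A sketch under stdlib-only assumptions; `gunzip_all_members` is a hypothetical name:

```python
import gzip
import zlib

def gunzip_all_members(data: bytes) -> bytes:
    """Decompress every RFC 1952 member in `data`, not just the first."""
    out = bytearray()
    while data:
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # gzip wrapper
        out += d.decompress(data)
        if not d.eof:
            raise ValueError("truncated gzip member")
        data = d.unused_data  # whatever follows this member's trailer
    return bytes(out)

# Usage: a two-member blob round-trips correctly.
blob = gzip.compress(b"hello ") + gzip.compress(b"world")
assert gunzip_all_members(blob) == b"hello world"
```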

@LalitMaganti
Collaborator

Approach seems good to me in that case, patches adding support for this are welcome: please follow https://perfetto.dev/docs/contributing/getting-started#contributing


@LalitMaganti
Collaborator

https://r.android.com/3250057 should solve the edge case I pointed out in your change :)
