Add support for ingesting/synthesizing custom binary data file #15

Open
fretz12 opened this issue Jul 1, 2021 · 7 comments
Labels
core A change which affects the synth core enhancement New feature or request

Comments

@fretz12
Contributor

fretz12 commented Jul 1, 2021

Required Functionality

Binary data comes in many shapes and forms, but the format I'm after is unencoded, uncompressed binary data with fields packed next to each other. The file begins with a header and ends with a footer. In between is the payload, which consists of many repeated entries.

Here is a pictorial of such a format:

Header
Entry 1
Entry 2
...
Entry N
Footer

Each entry has a fixed size and can hold multiple fields of different data types, each occupying a different number of bytes. Example:

timestamp (8 bytes) | my_u32 (4 bytes) | my_bool (1 byte) | my_string (24 bytes)

Proposed Solution

The user will be required to supply additional schema info to tell synth how to parse the fields. A possible format may look something like this (payload_end_offset_bytes is counted in bytes from the end of the file):

  "binary_schema": {
    "entry_size_bytes": 37,
    "is_little_endian": true,
    "payload_start_offset_bytes": 4096,
    "payload_end_offset_bytes": 1024,
    "fields": [
      {
        "name": "timestamp",
        "type": "u64",
        "byte_start": 0,
        "byte_end": 7
      },
      {
        "name": "my_u32",
        "type": "u32",
        "byte_start": 8,
        "byte_end": 11
      },
      {
        "name": "my_bool",
        "type": "bool",
        "byte_start": 12,
        "byte_end": 12
      },
      {
        "name": "my_string",
        "type": "string",
        "byte_start": 13,
        "byte_end": 36
      }
    ]
  }

Such a binary schema could also be extended in the future to cover things like encodings and variable-length data.
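
To make the example schema concrete, here is a minimal Rust sketch of decoding one 37-byte entry by hand. The `Entry` struct and `parse_entry` function are hypothetical names, not part of synth. The sketch assumes `byte_end` is inclusive, that any non-zero byte means `true`, and that `my_string` is UTF-8 padded with NUL bytes; none of these is stated in the proposal.

```rust
/// One decoded entry, mirroring the four fields in the example schema.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
pub struct Entry {
    pub timestamp: u64,
    pub my_u32: u32,
    pub my_bool: bool,
    pub my_string: String,
}

/// Entry size from the example schema: 8 + 4 + 1 + 24 bytes.
pub const ENTRY_SIZE_BYTES: usize = 37;

/// Decode one little-endian entry. Returns `None` if the slice is not
/// exactly one entry long.
pub fn parse_entry(raw: &[u8]) -> Option<Entry> {
    if raw.len() != ENTRY_SIZE_BYTES {
        return None;
    }

    // Bytes 0..=7: timestamp as a little-endian u64.
    let mut ts = [0u8; 8];
    ts.copy_from_slice(&raw[0..8]);
    let timestamp = u64::from_le_bytes(ts);

    // Bytes 8..=11: my_u32 as a little-endian u32.
    let mut n = [0u8; 4];
    n.copy_from_slice(&raw[8..12]);
    let my_u32 = u32::from_le_bytes(n);

    // Byte 12: my_bool. Assumption: any non-zero byte counts as true.
    let my_bool = raw[12] != 0;

    // Bytes 13..=36: my_string. Assumption: UTF-8, NUL-padded on the right.
    let my_string = String::from_utf8_lossy(&raw[13..37])
        .trim_end_matches('\0')
        .to_string();

    Some(Entry { timestamp, my_u32, my_bool, my_string })
}
```

A schema-driven parser would read the offsets from `binary_schema` instead of hard-coding them; the fixed offsets here only show what each field definition implies.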

Synth should be able to take such a schema and a data file, infer from them, and output a variant of the fields. A nice-to-have would be to take the original data file's header and footer and copy them into the generated file as is.
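
The header/footer pass-through could be sketched roughly as below. `split_file` and `rebuild_file` are hypothetical helpers, not synth APIs. `payload_start` and `footer_len` correspond to `payload_start_offset_bytes` and `payload_end_offset_bytes` from the schema, under the assumption that the latter is the footer length counted from the end of the file.

```rust
/// Split a file image into (header, payload, footer).
/// Returns `None` if the offsets do not fit the file or the payload is not
/// a whole number of entries.
pub fn split_file(
    data: &[u8],
    payload_start: usize,
    footer_len: usize,
    entry_size: usize,
) -> Option<(&[u8], &[u8], &[u8])> {
    // The payload ends where the footer begins.
    let payload_end = data.len().checked_sub(footer_len)?;
    if entry_size == 0 || payload_start > payload_end {
        return None;
    }
    let payload = &data[payload_start..payload_end];
    if payload.len() % entry_size != 0 {
        return None;
    }
    Some((&data[..payload_start], payload, &data[payload_end..]))
}

/// Reassemble a file from the original header and footer, copied as is,
/// around a newly generated payload.
pub fn rebuild_file(header: &[u8], new_payload: &[u8], footer: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(header.len() + new_payload.len() + footer.len());
    out.extend_from_slice(header);
    out.extend_from_slice(new_payload);
    out.extend_from_slice(footer);
    out
}
```

The entry count falls out of the split as `payload.len() / entry_size`, which is also a cheap sanity check on the user-supplied offsets.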

Use case
The use case pertains to protocol data files used in the storage industry. NVMe is one example. Other storage and networking protocols typically follow such a format to some degree as well.

@llogiq
Contributor

llogiq commented Jul 1, 2021

I just talked to a former colleague who works in statistical data processing. For interop reasons they work with binary files containing the data as fixed-width rows of little-endian 16-bit integers, and sometimes 32-bit integers for larger value ranges.

They could also make use of such a feature.
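
For illustration, reading such fixed-width rows takes only a few lines of Rust. `parse_i16_rows` is a hypothetical helper; whether the integers are signed is not stated above, so treating them as `i16` is an assumption.

```rust
/// Parse a buffer of fixed-width rows, each holding `columns` little-endian
/// 16-bit integers. Assumption: the values are signed.
/// Returns `None` if the buffer is not a whole number of rows.
pub fn parse_i16_rows(data: &[u8], columns: usize) -> Option<Vec<Vec<i16>>> {
    let row_bytes = columns.checked_mul(2)?;
    if row_bytes == 0 || data.len() % row_bytes != 0 {
        return None;
    }
    Some(
        data.chunks_exact(row_bytes)
            .map(|row| {
                row.chunks_exact(2)
                    .map(|b| i16::from_le_bytes([b[0], b[1]]))
                    .collect()
            })
            .collect(),
    )
}
```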

@christos-h christos-h added enhancement New feature or request core A change which affects the synth core labels Jul 1, 2021
@christos-h
Member

christos-h commented Jul 1, 2021

@fretz12 thanks for this.

This is a really interesting use case that requires some additional core features to be introduced to synth.

Some notes:

  1. We probably need to add a new variant to synth_core::graph::Value which represents binary data. We could use the bytes crate.
  2. How would this data be serialized? Given that synth currently outputs (primarily) JSON, would we need to run this binary data through some encoding? Or does your use case require it to be written directly to a file?
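
As a rough sketch of both notes, the enum below stands in for a value type with an added bytes variant, and hex encoding is shown as one possible answer to the serialization question. This `Value` is a trimmed-down hypothetical, not the actual `synth_core::graph::Value`, and it uses a plain `Vec<u8>` rather than the bytes crate so the example depends only on the standard library.

```rust
/// Hypothetical stand-in for a generated value; not synth's real `Value`.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Bool(bool),
    Number(u64),
    String(String),
    /// The proposed new variant: an opaque run of bytes.
    Bytes(Vec<u8>),
}

/// Hex-encode a `Bytes` value so it can be embedded in text output such as
/// JSON. Returns `None` for every other variant.
pub fn bytes_as_hex(value: &Value) -> Option<String> {
    match value {
        Value::Bytes(bytes) => Some(
            bytes
                .iter()
                .map(|byte| format!("{:02x}", byte))
                .collect::<String>(),
        ),
        _ => None,
    }
}
```

Hex is only one option; base64 would be more compact, and writing the bytes straight to a file avoids the encoding step entirely.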

@fretz12
Contributor Author

fretz12 commented Jul 2, 2021

@christoshadjiaslanis -

Regarding 2, I only need the synthesized data to be written to a file.

Bear with me if I'm making naive suggestions, but I'm thinking there would be a BinaryFileExportStrategy (impl ExportStrategy) that would still extract the synthesized JSON Value. Unlike the other export strategies, which insert the synthesized fields from Value into a DB, it would look at the user-supplied "binary_schema" I mentioned above and serialize the fields from Value into a file in the correct order.
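
The serialization step of such an export strategy might look roughly like this. The signature of synth's `ExportStrategy` trait is not shown in this thread, so the sketch uses a free function instead of an impl. `FieldValue`, `FieldSpec`, and `write_entry` are hypothetical names, and the code assumes little-endian output, inclusive `byte_end`, and zero padding for values shorter than their slot.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

/// Hypothetical generated field value (a stand-in for synth's `Value`).
#[derive(Debug, Clone, PartialEq)]
pub enum FieldValue {
    U64(u64),
    U32(u32),
    Bool(bool),
    Str(String),
}

/// One field from the "binary_schema"; `byte_end` is assumed inclusive.
pub struct FieldSpec {
    pub name: &'static str,
    pub byte_start: usize,
    pub byte_end: usize,
}

fn invalid(msg: String) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidInput, msg)
}

/// Serialize one entry into `out`, placing each field at its schema offset.
pub fn write_entry<W: Write>(
    out: &mut W,
    entry_size: usize,
    fields: &[FieldSpec],
    values: &HashMap<String, FieldValue>,
) -> io::Result<()> {
    // Start from a zeroed buffer so unused bytes act as padding.
    let mut buf = vec![0u8; entry_size];
    for field in fields {
        if field.byte_end < field.byte_start || field.byte_end >= entry_size {
            return Err(invalid(format!("bad byte range for {}", field.name)));
        }
        let value = values
            .get(field.name)
            .ok_or_else(|| invalid(format!("missing field {}", field.name)))?;
        let bytes: Vec<u8> = match value {
            FieldValue::U64(v) => v.to_le_bytes().to_vec(),
            FieldValue::U32(v) => v.to_le_bytes().to_vec(),
            FieldValue::Bool(v) => vec![*v as u8],
            FieldValue::Str(s) => s.as_bytes().to_vec(),
        };
        let width = field.byte_end - field.byte_start + 1;
        if bytes.len() > width {
            return Err(invalid(format!("{} does not fit its slot", field.name)));
        }
        buf[field.byte_start..field.byte_start + bytes.len()].copy_from_slice(&bytes);
    }
    out.write_all(&buf)
}
```

Because `out` is any `Write`, the same function works for a `File`, a `BufWriter`, or an in-memory `Vec<u8>` in tests.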

Though I think our binary format is fairly simple, binary formats in general vary wildly. One thought for hard-to-customize things like SerDes: perhaps use a plugin interface where users write their own (de)serializers and build a .so or .dylib out of them. Synth would then load those dynamic libs to execute serdes, with APIs binding binary -> Value and Value -> binary.
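
The shape of such a plugin API can be sketched as a trait with one method per direction. Everything here is hypothetical: `BinarySerde`, `Record`, and `FixedU32Rows` are illustrative names, and `Record` stands in for synth's `Value`. Loading an implementation from a .so or .dylib would additionally need a C-compatible entry point and a dynamic loader, which this in-process sketch leaves out.

```rust
/// Stand-in for a decoded record; a real interface would use synth's `Value`.
pub type Record = Vec<u32>;

/// Hypothetical plugin interface: binary -> records and records -> binary.
pub trait BinarySerde {
    fn decode(&self, raw: &[u8]) -> Result<Vec<Record>, String>;
    fn encode(&self, records: &[Record]) -> Result<Vec<u8>, String>;
}

/// Example plugin: fixed-width rows of little-endian u32 values.
pub struct FixedU32Rows {
    pub columns: usize,
}

impl BinarySerde for FixedU32Rows {
    fn decode(&self, raw: &[u8]) -> Result<Vec<Record>, String> {
        let row_bytes = self.columns * 4;
        if row_bytes == 0 || raw.len() % row_bytes != 0 {
            return Err("input is not a whole number of rows".to_string());
        }
        Ok(raw
            .chunks_exact(row_bytes)
            .map(|row| {
                row.chunks_exact(4)
                    .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
                    .collect()
            })
            .collect())
    }

    fn encode(&self, records: &[Record]) -> Result<Vec<u8>, String> {
        let mut out = Vec::new();
        for record in records {
            if record.len() != self.columns {
                return Err("record has the wrong number of columns".to_string());
            }
            for value in record {
                out.extend_from_slice(&value.to_le_bytes());
            }
        }
        Ok(out)
    }
}

/// The host only sees the trait object, so any plugin can be swapped in.
pub fn round_trip(plugin: &dyn BinarySerde, raw: &[u8]) -> Result<Vec<u8>, String> {
    let records = plugin.decode(raw)?;
    plugin.encode(&records)
}
```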

@fretz12
Contributor Author

fretz12 commented Jul 2, 2021

@llogiq - thx!

@shuttle-hq shuttle-hq deleted a comment from allcontributors bot Jul 3, 2021
@llogiq
Contributor

llogiq commented Jul 3, 2021

Let's try that again:

@all-contributors please add @fretz12 for awesome ideas.

@allcontributors
Contributor

@llogiq

I've put up a pull request to add @fretz12! 🎉

@christos-h
Member

christos-h commented Jul 9, 2021

@fretz12 yeah I think that for binary serializers we need user-defined serializers.

It's not a fully formed thought yet, but roughly speaking we have our existing schema which defines how data is generated, and a second piece of config which needs to dictate how that data is mapped to a binary serialization format.

I'm not sure how this would work exactly. Perhaps we can create an RFC for this and try to design something that makes sense.
