Strings in the new IR format are not enforced to be UTF8 encoded. #686

Open
LinZhihao-723 opened this issue Jan 21, 2025 · 0 comments
Labels
bug Something isn't working

Bug

The new IR format (the key-value pair IR format) serializes data directly from a msgpack map, and msgpack strings are serialized into string values in our IR format. However, msgpack's type spec does not enforce UTF-8 in string objects, so non-UTF-8 byte sequences can be passed in and serialized without error. The resulting IR can still be deserialized with clp::ffi::ir_stream::Deserializer, but converting the deserialized results into other formats triggers errors, for example when producing a JSON string or a Python dictionary through the Python ffi.
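To illustrate the mechanism outside of CLP, here is a minimal Python sketch using only the standard library. It hand-builds a msgpack fixstr (one header byte `0b101xxxxx` whose low 5 bits give the length, followed by the raw bytes) around a non-UTF-8 payload. The helpers `read_fixstr` and `to_json` are hypothetical names for this illustration and are not part of CLP or any msgpack library; the point is only that the byte-level round trip succeeds while the later text conversion fails.

```python
import json

# A msgpack "fixstr" is one header byte (0b101xxxxx, low 5 bits = length)
# followed by that many raw bytes. The framing itself never inspects the bytes.
payload = b"\xff\xfe"  # not a valid UTF-8 sequence
packed = bytes([0xA0 | len(payload)]) + payload


def read_fixstr(buf: bytes) -> bytes:
    """Hypothetical helper: return the raw bytes of a fixstr without decoding."""
    header = buf[0]
    if header & 0xE0 != 0xA0:
        raise ValueError("not a fixstr")
    length = header & 0x1F
    return buf[1 : 1 + length]


def to_json(raw_str: bytes) -> str:
    """Hypothetical helper: mimic a later conversion step that needs real text."""
    return json.dumps({"message": raw_str.decode("utf-8")})


# The byte-level round trip works, much like serializing and deserializing IR.
raw = read_fixstr(packed)

# The failure only appears when the bytes must become text.
try:
    to_json(raw)
    conversion_failed = False
except UnicodeDecodeError:
    conversion_failed = True
```

In this sketch the error surfaces far from where the bad bytes entered, which mirrors the behavior described above: nothing complains until a consumer needs a real string.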
Solutions:

  • We should enforce UTF-8 validation for string types at some point
  • We should add an additional type to support serializing raw byte sequences (similar to the BINARY type in msgpack)
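The two options above could be sketched as follows. This is a Python illustration under stated assumptions, not CLP code: `is_valid_utf8`, `require_utf8`, and `classify_payload` are hypothetical names, and the `"string"` / `"binary"` tags merely stand in for whatever type tags the IR format would actually use. Validation relies on Python's strict UTF-8 decoder, which rejects overlong encodings and encoded surrogates.

```python
from typing import Tuple, Union


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def require_utf8(data: bytes) -> str:
    """Option 1 (hypothetical): reject invalid strings at serialization time."""
    if not is_valid_utf8(data):
        raise ValueError("string value is not valid UTF-8")
    return data.decode("utf-8")


def classify_payload(data: bytes) -> Tuple[str, Union[str, bytes]]:
    """Option 2 (hypothetical): route invalid strings to a raw-bytes type."""
    if is_valid_utf8(data):
        return ("string", data.decode("utf-8"))
    return ("binary", data)
```

The first option surfaces the problem at the producer, where it is easiest to diagnose; the second keeps the data lossless but requires every consumer (JSON conversion, Python ffi) to decide how to render raw bytes.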

CLP version

0c00a94

Environment

Any

Reproduction steps

  • Create a msgpack map that contains a string with invalid UTF-8 bytes, and serialize it into the IR format using clp::ffi::ir_stream::Serializer
  • Deserialize the stream using clp::ffi::ir_stream::Deserializer
  • Serialize the deserialized log event to a JSON string using clp::ffi::KeyValuePairLogEvent::serialize_to_json; this triggers a JSON exception