Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for validating and escaping UTF-8 strings. #453

Merged
merged 13 commits into from
Jun 27, 2024

Conversation

LinZhihao-723
Copy link
Member

Description

Motivation

When outputting a C++ string to a human-readable (printable) format, we must ensure the string is UTF8 valid, and escape special characters such as the new line character \n or the quote mark ". This is required if we want to serialize string data into a JSON format.
There are third-party libraries that already implement such a validation + escaping mechanism. For example, nlohmann JSON provides string dump to execute the validation and escape. However, using external libraries have the following problems:

  1. Run time overhead. We need to convert a string view into their representation. In our previous benchmark, using nlohmann JSON for escaping purpose can introduce significant overhead.
  2. Library dependencies in ffi libraries: We need extra dependencies on our ffi libraries.

Therefore, we decide to implement our own implementation for validating and escaping UTF8 strings.

Implementation

The functionalities can be breakdown as follows:

  • Validate UTF8 encoding:

    • Validate UTF8 byte sequence
    • Validate code point ranges
  • Escape special characters

    • Escape printable characters by add ''
    • Escape control characters in the form of \\u00xx

    This PR adds a util function to implement the functionalities listed above.

Validation performed

  • Passed clang-tidy checks
  • Added unit tests to cover the functionalities discussed above. Details can be found in the unit test implementations.

components/core/src/clp/ffi/utils.cpp Outdated Show resolved Hide resolved
components/core/src/clp/ffi/utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/src/clp/ffi/utils.cpp Outdated Show resolved Hide resolved
components/core/src/clp/ffi/utils.cpp Outdated Show resolved Hide resolved
components/core/src/clp/ffi/utils.cpp Outdated Show resolved Hide resolved
components/core/src/clp/ffi/utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/src/clp/ffi/utils.hpp Outdated Show resolved Hide resolved
components/core/src/clp/ffi/utils.hpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
components/core/tests/test-ffi_utils.cpp Outdated Show resolved Hide resolved
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. A few remaining renaming suggestions.

components/core/src/clp/utf8_utils.hpp Outdated Show resolved Hide resolved
components/core/src/clp/utf8_utils.hpp Outdated Show resolved Hide resolved
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring fixes

components/core/src/clp/ffi/utils.hpp Outdated Show resolved Hide resolved
@kirkrodrigues kirkrodrigues changed the title Add support for validating and escaping UTF8 strings. Add support for validating and escaping UTF-8 strings. Jun 27, 2024
@LinZhihao-723 LinZhihao-723 merged commit 249816b into y-scope:main Jun 27, 2024
12 of 13 checks passed
jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants