2024-03-15-FirstRelease.md

yaml-rust2's first real release

If you are not interested in how this crate was born and just want to know what differs from yaml-rust, scroll down to "This release".

The why

Sometime in August 2023, an ordinary developer (that's me) felt the urge to start scribbling about an OpenAPI linter. I had worked with the OpenAPI format and tried different linters, but none of them felt right, and needing 3 different linters to lint my OpenAPI documents was a pain. Like any sane person would, I decided to write my own (author's note: you are not insane if you wouldn't). To get things started, I needed a YAML parser.

On August 14th 2023, I forked yaml-rust and started working on it. The crate stated that some YAML features were not yet available and I felt that was an issue I could tackle. I started by getting to know the code, understanding it, adding warnings, refactoring, tinkering, documenting, and so on. Anything I could do that made me feel the codebase was better, I did. I wanted this crate to be as clean as it could be.

Fixing YAML compliance

In my quest to understand YAML better, I found the YAML test suite: a compilation of corner cases and intricate YAML examples with their expected output / behavior. Interestingly enough, there was an open pull request on yaml-rust by tanriol which integrated the YAML test suite as part of the crate tests. Comments mention that the maintainer wasn't around anymore and that new contributions would probably never be accepted.

That, however, was a problem for future-past-me, as I was determined (somehow) to have yaml-rust pass every single test of the YAML test suite. Slowly, over the course of multiple months (from August 2023 to January 2024), I would sometimes pick a test from the test suite, fix it, commit and start again. On the 23rd of January, the last commit fixing a test was created.

According to the YAML test matrix, there is to this day only 1 library that is fully compliant (aside from the Perl parser generated by the reference). This would make yaml-rust2 the second library to be fully YAML-compliant. You really wouldn't believe how much you have to stretch YAML so that it's not valid YAML anymore.

Performance

With so many improvements, the crate was now perfect! Except for performance. Adding conditions for every little bit of compliance had led the code to be much more complex and branch-y, which CPUs hate. The code was around 20% slower than it was when I started.

For a bit over 3 weeks, I stared at flamegraphs and made my CPU repeat the same instructions until it could do it faster. There have been a bunch of improvements for performance since yaml-rust's last commit. Here are a few of them:

  • Avoid putting characters in a VecDeque<char> buffer when we can push them directly into a String.
  • Be a bit smarter about reallocating temporaries: it's best if we know the size in advance, but when we don't we can sometimes avoid pushing characters 1 at a time.
  • The scanner skips over characters one at a time. When skipping them, it needs to check whether they're a linebreak to update the location. Sometimes, we know we skip over a letter (which is not a linebreak). Several "skip" functions have been added for specific uses.
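To make the first and last points concrete, here is a minimal sketch of a general skip next to a specialized one. `MiniScanner`, `Marker`, `skip_non_break` and `read_word` are hypothetical names made up for this illustration; they are not the crate's actual scanner types or functions.

```rust
/// Hypothetical position tracker (not the crate's real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Marker {
    pub index: usize,
    pub line: usize,
    pub col: usize,
}

/// Hypothetical, heavily reduced scanner used only for this sketch.
pub struct MiniScanner<'a> {
    chars: std::str::Chars<'a>,
    pub mark: Marker,
}

impl<'a> MiniScanner<'a> {
    pub fn new(input: &'a str) -> Self {
        Self {
            chars: input.chars(),
            mark: Marker { index: 0, line: 1, col: 0 },
        }
    }

    /// General skip: has to test for a linebreak to keep the location correct.
    pub fn skip(&mut self) {
        if let Some(c) = self.chars.next() {
            self.mark.index += 1;
            if c == '\n' {
                self.mark.line += 1;
                self.mark.col = 0;
            } else {
                self.mark.col += 1;
            }
        }
    }

    /// Specialized skip: the caller already knows the next character is not
    /// a linebreak, so the comparison is dropped.
    pub fn skip_non_break(&mut self) {
        if self.chars.next().is_some() {
            self.mark.index += 1;
            self.mark.col += 1;
        }
    }

    /// Reads a run of ASCII letters, pushing each one directly into `out`
    /// instead of going through an intermediate character buffer.
    pub fn read_word(&mut self, out: &mut String) {
        // `Chars` is cheap to clone, so cloning it gives a one-character peek.
        while let Some(c) = self.chars.clone().next() {
            if !c.is_ascii_alphabetic() {
                break;
            }
            out.push(c);
            // A letter is never a linebreak, so the cheaper skip is safe.
            self.skip_non_break();
        }
    }
}
```

The saving per character is tiny, but a scanner runs this path for nearly every character of the input, so it adds up.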

And the big winner, for a decrease in runtime of around 15%, was: use a statically-sized buffer instead of a dynamically allocated one. (Almost) every character goes from the input stream into the buffer and then gets read from the buffer. This means that VecDeque::push_back and VecDeque::pop_front were called very frequently. The former always has to check for capacity. Using an ArrayDeque removed the need for constant capacity checks, at the cost of a minor decrease in performance if a line is deeply indented. Hopefully, nobody has 42 nested YAML objects.
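As a rough illustration of the idea, here is a minimal fixed-capacity ring buffer. `FixedBuf` is a hypothetical type written for this post; it is not the ArrayDeque implementation and not the crate's code. Its storage is an inline array whose size is known at compile time, so a push never has a growth or reallocation path.

```rust
/// Minimal fixed-capacity ring buffer of `char`s (illustrative sketch only).
pub struct FixedBuf<const N: usize> {
    data: [char; N],
    head: usize, // index of the oldest element
    len: usize,  // number of elements currently stored
}

impl<const N: usize> FixedBuf<N> {
    pub fn new() -> Self {
        Self { data: ['\0'; N], head: 0, len: 0 }
    }

    /// Appends a character. When the buffer is full, the character is handed
    /// back as `Err(c)`; the buffer never grows.
    pub fn push_back(&mut self, c: char) -> Result<(), char> {
        if self.len == N {
            return Err(c);
        }
        let tail = (self.head + self.len) % N;
        self.data[tail] = c;
        self.len += 1;
        Ok(())
    }

    /// Removes and returns the oldest character, if any.
    pub fn pop_front(&mut self) -> Option<char> {
        if self.len == 0 {
            return None;
        }
        let c = self.data[self.head];
        self.head = (self.head + 1) % N;
        self.len -= 1;
        Some(c)
    }

    /// Number of characters currently stored.
    pub fn len(&self) -> usize {
        self.len
    }
}
```

A buffer like this still has to reject a push when it is full, but the caller decides what happens then, and the common path is a couple of index computations on memory that never moves.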

Here is the final performance breakdown:

Comparison of the performance between yaml-rust, yaml-rust2 and the C libfyaml. yaml-rust2 is faster in every test than yaml-rust, but libfyaml remains faster overall.

Here is a short description of what the files contain:

  • big: A large array of records with few fields. One of the fields is a description, a large text block scalar spanning multiple lines. Most of the scanning happens in block scalars.
  • nested: Very short key-value pairs that nest deeply.
  • small_objects: A large array of mappings with 2 key-value pairs each.
  • strings_array: A large array of lipsum one-liners (~150-175 characters in length).

As you can see, yaml-rust2 performs better than yaml-rust on every benchmark. However, when compared against the C libfyaml, we can see that there is still much room for improvement.

I'd like to end this section with a small disclaimer: I am not a benchmark expert. I tried to have a heterogeneous set of files that would highlight how the parser performs when stressed in different ways. I invite you to take a look at the code generating the YAML files and, if you are more knowledgeable than I am, improve upon them. yaml-rust2 performs better with these files because those are the ones I could work with. If you find a file with which yaml-rust2 is slower than yaml-rust, do file an issue!

This release

Improvements from yaml-rust

This release should improve on yaml-rust in 3 major areas:

  • Performance: We all love fast software. I want to help you achieve it. I haven't managed to make this crate twice as fast, but you should notice a 15-20% improvement in performance.
  • Compliance: You may not notice it, since I didn't know most of the bugs I fixed were bugs to begin with, but this crate should now be fully YAML-compliant.
  • Documentation: The documentation of yaml-rust is unfortunately incomplete. Documentation here is not exhaustive, but most items are documented. Notably, private items are documented, making it much easier to understand where something happens. There are also in-code comments that help figure out what is going on under the hood.

Last but not least, I do plan on keeping this crate alive as long as I can. Nobody can make promises in that regard, of course, but I have poured hours of work into this, and I would hate to see it go to waste.

Switching to yaml-rust2

This release is v0.6.0, chosen to explicitly differ in minor from yaml-rust. v0.4.x does not exist in this crate to avoid any confusion between the 2 crates.

Switching to yaml-rust2 should be a very simple process. Change your Cargo.toml to use yaml-rust2 instead of yaml-rust:

-yaml-rust = "0.4.4"
+yaml-rust2 = "0.6.0"

As for your code, you have one of two solutions:

  • Change your imports from use yaml_rust::Yaml to use yaml_rust2::Yaml if you import items directly, or change occurrences of yaml_rust to yaml_rust2 if you use fully qualified paths.
  • Alternatively, you can alias yaml_rust2 with use yaml_rust2 as yaml_rust. This would keep your code working if you use fully qualified paths.

Which one you choose is up to you.

Courtesy of davvid, there is another solution. You can combine both approaches and tell Cargo.toml to add yaml-rust2 and to create a yaml_rust alias for your code with the following:

-yaml-rust = "0.4.4"
+yaml-rust = { version = "0.6", package = "yaml-rust2" }

This allows you to switch to yaml-rust2 while continuing to refer to yaml_rust in your code (e.g. use yaml_rust::YamlLoader; will continue to work so that no Rust code changes are required).

What about API breakage?

Most of what I have changed is in the implementation details. You might notice more documentation appearing on your LSP, but documentation isn't bound by the API. There is only one change I made that could lead to compile errors. It is unlikely you used that feature, but I'd hate to leave this undocumented.

If you use the low-level event parsing API (Parser, EventReceiver / MarkedEventReceiver) and namely the yaml_rust::Event enumeration, there is one change that might break your code. This was needed for tests in the YAML test suite. In yaml-rust, YAML tags are not forwarded from the lower-level Scanner API to the low-level Parser API.

Here is the change that was made in the library:

 pub enum Event {
   // ...
-SequenceStart(usize),
-MappingStart(usize),
+SequenceStart(usize, Option<Tag>),
+MappingStart(usize, Option<Tag>),
   // ...
 }

This means that you may now see YAML tags appearing in your code.
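For code that matches on these variants, the fix is to bind or ignore the extra field. Here is a minimal sketch. The `Event` and `Tag` types below are reduced stand-ins defined locally so the example is self-contained: the `handle` / `suffix` fields of `Tag`, the `Other` variant, the `describe` function and the reading of the `usize` as an anchor id are all assumptions made for this illustration.

```rust
/// Local stand-in for the crate's tag type; the fields are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Tag {
    pub handle: String,
    pub suffix: String,
}

/// Reduced stand-in for the `Event` enum: only the two changed variants,
/// plus a placeholder for everything else.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    SequenceStart(usize, Option<Tag>),
    MappingStart(usize, Option<Tag>),
    Other,
}

/// Hypothetical consumer of events, showing both ways to adapt a match arm.
pub fn describe(ev: &Event) -> String {
    match ev {
        // A pattern like `Event::SequenceStart(anchor)` no longer compiles:
        // the second field has to be matched too.
        Event::SequenceStart(anchor, None) => {
            format!("sequence, anchor {}, untagged", anchor)
        }
        Event::SequenceStart(anchor, Some(tag)) => {
            format!("sequence, anchor {}, tag {}{}", anchor, tag.handle, tag.suffix)
        }
        // Code that does not care about tags can ignore the new field.
        Event::MappingStart(anchor, _) => format!("mapping, anchor {}", anchor),
        Event::Other => String::from("other"),
    }
}
```

If you never looked at tags before, adding `, _` to the affected patterns should be enough to get your code compiling again.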

Closing words

YAML is hard. Much harder than I had anticipated. If you are exploring dark corners of YAML that yaml-rust2 supports but yaml-rust doesn't, I'm curious to know what they are.

Work on this crate is far from over. I will try and match libfyaml's performance. Today is the first time I benched against it, and I wouldn't have guessed it to outperform yaml-rust2 that much.

If you're interested in upgrading your yaml-rust crate, please do take a look at davvid's fork of yaml-rust. Very recent developments on this crate sparked from an issue on advisory-db about the unmaintained state of yaml-rust. I hope that YAML in Rust will improve following this issue.

Thank you for reading through this. If you happen to have issues with yaml-rust2 or suggestions, do drop an issue!

If however you wanted an OpenAPI linter, I'm afraid you're out of luck. Just as much as I'm out of time ;)

-Ethiraric

EDIT(20-03-2024): Add davvid's method of switching to yaml-rust2 by creating a Cargo alias.