
Inconsistency between iterating over byte and char indices #86

Open
Manishearth opened this issue Dec 18, 2022 · 2 comments


Manishearth commented Dec 18, 2022

This is something I kinda discovered whilst working on #85

The majority of the bidi algorithm involves taking the set of original bidi classes, going through them, and overriding some of them. To do this, unicode-bidi passes around a processing_classes: &mut [BidiClass] that starts out as a clone of original_classes: &[BidiClass], a slice mapping byte indices to the classes of each character.

Of course, non-character-boundary byte indices (I'm going to call these "off-bytes") are kinda meaningless in the context of the bidi algorithm. original_classes starts out with copied bidi classes for these, but we have to be careful both to maintain this property and to avoid accidentally treating byte indices as character indices.
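For concreteness, here's a minimal sketch of that layout, assuming the crate's public bidi_class() and BidiClass (illustrative, not the crate's actual construction):

```rust
use unicode_bidi::{bidi_class, BidiClass};

/// Build the byte-indexed classes array described above: every byte of a
/// multibyte character carries a copy of that character's class.
fn original_classes(text: &str) -> Vec<BidiClass> {
    let mut classes = Vec::with_capacity(text.len());
    for c in text.chars() {
        // The character's first byte *and* its off-bytes all get the class.
        classes.extend(std::iter::repeat(bidi_class(c)).take(c.len_utf8()));
    }
    classes
}
```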

Furthermore, the code is inconsistent about iterating over characters versus bytes, and it's tricky to verify that it updates off-bytes consistently.

Analysis

This is a writeup of my process walking through the code to see what breaks because of this. TL;DR: the property is maintained (often rather subtly), but iterating over bytes causes at least one bug (#87), and it also makes things annoying to reason about when editing the code. Feel free to take my word for it and skip this section.

explicit::compute() iterates over char_indices() and performs additional fixups later to fill in the copied classes, maintaining the invariant that off-bytes match the class of their character. Huzzah. We could probably do some kind of deferred update thing where that fixup is not run for each character.
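A sketch of that char_indices()-plus-fixup pattern, with decide standing in for the per-rule logic (a hypothetical closure, not the crate's API):

```rust
use unicode_bidi::BidiClass;

/// Decide a class per character, then copy it onto all of that character's
/// bytes, so the invariant ("off-bytes match their character") keeps holding.
fn apply_rule(text: &str, processing_classes: &mut [BidiClass], decide: impl Fn(char) -> BidiClass) {
    for (i, c) in text.char_indices() {
        let class = decide(c); // `decide` stands in for the per-rule logic
        for slot in &mut processing_classes[i..i + c.len_utf8()] {
            *slot = class;
        }
    }
}
```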

However, resolve_weak() iterates over sequence.runs.iter().cloned().flat_map(id as fn(LevelRun) -> LevelRun)[1], which is a fancy way of saying "every byte index contained in each level run in the level run sequence"[2]. This seems great: we can keep maintaining the aforementioned invariant. But uh oh: this loop has state! It tracks prev_class (and some other stuff), which won't necessarily be correct when processing the off-bytes of a multibyte character.
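To make the hazard concrete, here's an illustrative sketch of the loop shape (field and variable names follow this writeup, not necessarily the crate's exact code):

```rust
use std::ops::Range;
use unicode_bidi::BidiClass;

/// Flattening the level runs yields every *byte* index they contain, so loop
/// state like `prev_class` advances once per byte: for a multibyte character
/// it sees the copied class of each off-byte, not one class per character.
fn walk_runs(runs: &[Range<usize>], processing_classes: &[BidiClass], sos: BidiClass) {
    let mut prev_class = sos;
    for i in runs.iter().cloned().flatten() {
        let class = processing_classes[i];
        // ... per-rule logic consulting `prev_class` would go here ...
        prev_class = class;
    }
    let _ = prev_class;
}
```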

This is fine in the following cases, and maintains the invariant:

  • W1: off-bytes will end up copying whatever happened for the main byte. Seems fine.
  • W2: last_strong_is_al is maintained correctly, so we're fine.
  • W5 & W6: the rule deals with sequences of ENs, and off-bytes are merely additional members of that sequence. W5 does check prev_class, but only at the boundaries of EN sequences.

However, it's broken for W4, which cares about single separators sandwiched between numbers: multibyte separators will simply never match it. This is fixable by changing the code for that rule alone. The invariant is still maintained, though, mostly because W4 is brokenly skipping exactly the cases where the invariant would otherwise have problems. I don't yet have a testcase for this, but writing one ought to be easy. I have a fix at #87.
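Here's a hypothetical worked example of the hazard, using U+00A0 NO-BREAK SPACE, which has Bidi_Class=CS and takes two bytes in UTF-8:

```rust
use unicode_bidi::{bidi_class, BidiClass};

fn main() {
    // The separator in "1\u{00A0}2" covers byte indices 1 and 2, both CS.
    let text = "1\u{00A0}2";
    let classes: Vec<BidiClass> = text
        .chars()
        .flat_map(|c| std::iter::repeat(bidi_class(c)).take(c.len_utf8()))
        .collect();
    // classes == [EN, CS, CS, EN]: no byte has EN on *both* sides, so a
    // per-byte "single separator between numbers" check never fires, even
    // though at the character level this is exactly one CS between two ENs.
    for i in 1..classes.len() - 1 {
        let fires = classes[i] == BidiClass::CS
            && classes[i - 1] == BidiClass::EN
            && classes[i + 1] == BidiClass::EN;
        println!("byte {i}: W4-style check fires: {fires}");
    }
}
```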

On to resolve_neutral(). The code I just added for N0 (#85) does maintain the invariant. And while we've shown that resolve_weak() maintains the invariant, the N0 code is actually rather resilient to invariant-breaking in resolve_weak(), because resolve_weak() never removes strong classes, it only introduces new ones, and that's all we care about in N0: "which strong classes are in this substring" and "what's the first strong class in this substring, starting from the end". When updating the classes for parentheses, N0 updates all of the classes, and the NSM-sequence code also updates everything, since it's dealing with sequences.

While N0 is fine here, it took me a couple passes to get right because I had to pay attention to what everyone else was doing.

Great. What about N1 and N2? These are handled with the same run-sequence iteration, and they also carry some state: they look for NIs, handling them as sequences.

They do not distinguish between sequences and individual characters, and they treat the entire sequence as a single unit to be updated, so they're also fine as long as the invariant has been maintained so far, which it seems to be.

Finally, resolve_levels() iterates over levels, so it has no problems here. After that processing_classes is no longer relevant, thank heavens. assign_levels_to_removed_chars() is short and doesn't really do anything weird with original_classes so we're fine there too.

Moving forward

I think there are a couple issues here. We have at least one bug, and this property is excruciating to reason about and rather fragile. Furthermore, we're iterating over a lot of unnecessary bytes, which might be extra work?

We can do a couple things here:

  • We can switch to using char_indices() everywhere and just let the off-bytes in the classes arrays be dead space. This would mostly simplify the algorithms, but at the cost of character indexing, which is a bit more expensive. On the other hand, not needing to iterate over each and every byte might be a win. Worth looking into from a perf angle.
  • We can wrap processing_classes in a way that makes per-byte mutation impossible (while still allowing easy indexed reads), and also pepper the code with comments so that additional state in loops is maintained correctly. (A sketch of such a wrapper follows this list.)
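A rough sketch of what that wrapper could look like, assuming the crate's BidiClass (the names here are hypothetical):

```rust
use unicode_bidi::BidiClass;

/// Byte-indexed reads stay cheap, but writes must go through set_char(),
/// which stamps the class onto *every* byte of the character so the
/// invariant can't silently break.
struct ProcessingClasses<'a> {
    classes: &'a mut [BidiClass],
    text: &'a str,
}

impl<'a> ProcessingClasses<'a> {
    fn get(&self, byte_index: usize) -> BidiClass {
        self.classes[byte_index]
    }

    fn set_char(&mut self, char_start: usize, class: BidiClass) {
        let c = self.text[char_start..]
            .chars()
            .next()
            .expect("char_start must be a character boundary");
        for slot in &mut self.classes[char_start..char_start + c.len_utf8()] {
            *slot = class;
        }
    }
}
```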

Thoughts? @eggrobin @sffc @mbrubeck @behnam

Footnotes

  1. Written more cleanly as sequence.runs.iter().flat_map(Clone::clone), which is what we do in resolve_neutral()

  2. Are level runs in an isolating run sequence always adjacent? This would be simpler to reason about if we always treated an isolating run sequence as a single range instead of breaking it down further.

@Manishearth (Member Author)

> Are level runs in an isolating run sequence always adjacent? This would be simpler to reason about if we always treated an isolating run sequence as a single range instead of breaking it down further.

The answer is "no", and the spec has something to say about this:

> When applying a rule to an isolating run sequence, the last character of each level run in the isolating run sequence is treated as if it were immediately followed by the first character in the next level run in the sequence, if any.


sffc commented Dec 19, 2022

This thread reminds me a bit of google/diff-match-patch#13 and format_to_parts.md and others. Basically, what is the best way to store character annotations in UTF-8 or UTF-16?

A few general approaches:

  1. Store byte-for-byte annotations.
    • Pro: Efficient lookup and cross-referencing.
    • Pro: Works well with insertions and deletions.
    • Con: Tricky to keep updated / in sync since trail bytes can get forgotten.
    • Con: Wasted space since annotations may occur multiple times for the same code point.
  2. Store annotations for only the first byte, leaving others empty.
    • Pro: Same as above
    • Con: Same as above, but maybe easier to keep updated?
  3. Store annotations on UTF-32 indices, keeping the original string as UTF-8 or UTF-16
    • Pro: Easy to access annotations by themselves
    • Pro: No bias towards UTF-8 versus UTF-16 (still needed for Java and JavaScript compatibility)
    • Pro: No wasted space
    • Con: May require linear-time lookup to cross-reference the string with its annotations (sketched after this list)
  4. Operate on UTF-32 strings instead of UTF-8 strings, converting at the boundary if necessary
    • Pro: Much easier to reason about; less likely for errors to crop up
    • Con: Requires allocating more memory
    • Con: Probably not as efficient (but maybe by not as much as one would think, since the algorithms get a lot simpler when you do it this way)
  5. Store annotations in their own special data structure such as a range list (see format_to_parts.md)
    • Pro: Can result in efficient lookup for annotations
    • Con: Trickier when handling insertions and deletions in the middle of the string; may be an expensive operation in the annotation structure
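For what it's worth, a minimal sketch of approach 3 (hypothetical helper name); the cross-reference from a UTF-8 byte index back to the UTF-32-indexed annotations is the linear-time step noted above:

```rust
/// Annotations are stored per scalar value (UTF-32 index) alongside a UTF-8
/// string; recovering the UTF-32 index from a byte index is a linear scan.
fn annotation_at_byte<'a, A>(text: &str, annotations: &'a [A], byte_index: usize) -> Option<&'a A> {
    // Count the characters preceding `byte_index` to get the UTF-32 index.
    let char_index = text.char_indices().position(|(b, _)| b == byte_index)?;
    annotations.get(char_index)
}
```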
