Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish between invalid and valid-but-incomplete utf-8 #42

Closed
thomcc opened this issue Mar 8, 2020 · 15 comments
Closed

Distinguish between invalid and valid-but-incomplete utf-8 #42

thomcc opened this issue Mar 8, 2020 · 15 comments

Comments

@thomcc
Copy link
Contributor

thomcc commented Mar 8, 2020

I have &[u8] that represents a section of a streaming read of some text. I'd like to convert it (possibly lossily) to a String as if it had all arrived in a single chunk. The issue is that the chunk may end part-way-through a valid utf8 sequence, and naively converting will corrupt any character unlucky enough to get sliced in this manner. To avoid that, I need to be able to distinguish if the final utf8 sequence is invalid or just incomplete.

Right now I do something like this:

let (ch, size) = bstr::decode_last_utf8(incoming_buffer);
let to_process = if ch.is_some() {
    // Ends with valid utf8, no worries.
    cur_buffer
} else {
    // Ends with either invalid or incomplete utf8, figure out which
    let (valid, invalid) = incoming_buffer.split_at(incoming_buffer.len() - size);
    if utf8_valid_prefix(invalid) {
        // Incomplete: Only process the valid part and leave
        // the rest for next time
        add_to_start_of_next_buffer(invalid);
        valid
    } else {
        // Process the whole thing -- we'll replace the invalid bit at the end
        // with the replacement char.
        cur_buffer
    }
};

Where utf8_valid_prefix looks like this:

fn utf8_valid_prefix(s: &[u8]) -> bool {
    let (head, tail) = if let Some((&h, t)) = s.split_first() {
        (h, t)
    } else {
        return true;
    };
    let seqlen = match head {
        0b0000_0000..=0b0111_1111 => 1,
        0b1000_0000..=0b1011_1111 => return false,
        0b1100_0000..=0b1101_1111 => 2,
        0b1110_0000..=0b1110_1111 => 3,
        0b1111_0000..=0b1111_0111 => 4,
        0b1111_1000..=0b1111_1111 => return false,
    };
    s.len() < seqlen
        && tail
            .iter()
            .all(|b| (0b1000_0000..=0b1011_1111).contains(b))
}

This works fine [0], (assuming I didn't mangle it too badly trying to simplify/clean it up for the issue), but it feels redundant given that bstr already done this when decoding (or something equivalent), and it just feels like the library can probably help more.

Anyway, I don't have strong opinions on what an API for this should look like, but I figured I'd bring it up.

One easy option, might be adding e.g. ByteSlice::is_char_boundary(&self, index: usize) -> bool and ByteSlice::utf8_sequence_len(&self, starting_at: usize) -> Option<usize> might be enough. Maybe also a prev_char_boundary? Had these existed, I probably wouldn't be filing this issue, but implementing these for not-necessarily-valid utf8 might end up hard or producing confusing results, and they might not be obvious functions to look for if you aren't familiar with how utf8 is encoded. I'm open to thoughts though.

Also, apologies in advance if something for this exists and I'm just missing it...


[0]: Actually, I suspect the last part (where we check that tail is valid) has a bug in that it can cause us to take a sequence which is both invalid and incomplete, and cause it to turn into two replacement characters, whereas if we had processed the whole buffer in one go we would only have one. That said, I'm not worried about it really, since the data was already invalid at that position. That said, it might be an indication that "distinguish between invalid and incomplete" isn't quite right.

@BurntSushi
Copy link
Owner

@thomcc Does the example on the Utf8Error type help at all? It is supposed to handle this exact use case, but maybe there is a mismatch somewhere?

@thomcc
Copy link
Contributor Author

thomcc commented Mar 10, 2020

😅 (To be fair, I figured this might be the case, hence my apology-in-advance). That said it's a little tricky to use for my case, because I do intend to do convert the non-truncated portion in a lossy way (via to_str_lossy_into). That said, it will probably let me delete utf8_valid_prefix, which is good.

I think the remaining issues I have with the code here would be solved if bstr had APIs like I suggested above, but I'll file another issue for it, since it's not directly related to this problem, and the names need a bit of bikeshed discussion since I think they imply valid UTF-8.

@thomcc
Copy link
Contributor Author

thomcc commented Mar 13, 2020

Had another case where I was slightly frustrated by this in bstr.

Here's the concrete example, with a caveat that it's a pretty frustrating case,
because ultimately it's driven by heuristics, although I'm not sure that part is
actually relevant.

Mouse input has a few styles it can come in to a terminal program. Two methods
(both deprecated, although amusingly one of them because at the time, requiring
utf8 was seen as problematic) are:

  1. \x1b[M followed by 3 bytes containing button state, x position, and y
    position. (AKA X10 mouse mode)
  2. \x1b[M followed by 3 codepoints containing the same data encoded as utf-8.
    This one is capable of representing way larger X/Y positions. (AKA UTF8
    mouse mode 1005 or UTF8 1005 mouse mode).

You don't really know which of these formats the data is in if you turned on
mouse mode via terminfo. In practice you might know, not always. Yeah, this is
ambiguous*.

Anyway, you get this data from nonblocking IO, and you have no idea if it's
complete or not, and there may be other events after it in the buffer, which
have no guarantee of being valid utf8.

So, assuming you don't already know the format for sure, basically what you want
to do is grab the next 3 codepoints:

  • if it turns out they're definitely invalid, then you know you're in format 1.
  • if it turns out they're valid you use some heurstics to decide if they look
    right and decide based on that.
  • if they're valid but incomplete you do some similar heuristics, and ask for
    more data for a short delay, and if none comes in you assume format 1.

The difficulty here is:

  • you can't tell ByteString::to_str() to only run for the first 3 codepoints
    • you can't just assume 12 and then just look at the first 3 because the
      data after these is not guaranteed to be valid utf8.
  • even if there was a version of to_str that took how many codepoints to
    consider -- I want the decoded value of the codepoints, not the value as a
    str, so I'd then have to convert, which is a hassle.
  • so what it feels like what you want is several calls to decode_utf8, but...
    you don't really know the reason it failed, so if it fails you have to check
    with to_str().

So yeah, it's pretty annoying. I ended up punting on it because it was a hassle,
and just handling the first format above, since it's more common (and there's a
3rd better format which seems to have replaced the second for most cases anyway).

That said, the point here isn't so much any specifics about parsing these (it's
pretty bog-standard programming), just another example where the current API
that's exposed isn't very helpful.


* In practice the ambiguity is not a huge problem since heuristics work well,
most of the time they're the same, and if you ever see data that's definitely in
format 1, you know it all must be -- and format 1 is more common in practice.

@BurntSushi
Copy link
Owner

@thomcc Sorry, I might be missing something, but what API would be useful in that context? (It's not clear to me why decode_utf8 is insufficient.)

@thomcc
Copy link
Contributor Author

thomcc commented Mar 15, 2020

A version of decode_utf8 that returns a Utf8Error would be a definite improvement.

@BurntSushi
Copy link
Owner

BurntSushi commented Mar 15, 2020

Hmmm, I'm not sure I quite follow.?

The decode_utf8 API only decodes one codepoint at a time. So if it fails, that implies that valid_up_to would always be zero I think. And error_len is explicitly returned by decode_utf8 as the number of bytes consumed, even in the error case.

@thomcc
Copy link
Contributor Author

thomcc commented Mar 15, 2020

Right, I'm interested in truncation within a codepoint -- I'd like to be able to distinguish a case like decode_utf8(b"\xFF") from decode_utf8(b"\xF0") -- these both return (None, 1)

Utf8Error seems do distinguish these cases -- b"\xFF".to_str() gives me error_len == Some(1), and b"\xF0".to_str() gives me None

The Utf8Error API is just awkward for what I'm doing, for reasons described above (I need the codepoints, and before getting them I don't know where the region that's valid ends)

@BurntSushi
Copy link
Owner

Oh I see. Yeah, that is fundamentally at odds with decode_utf8 I think, since returning that information requires knowledge of what follows. For example, what would decode_utf8(b"\xF0\xFF") return? It sounds like it would have to be different than what decode_utf8(b"\xF0") returns I think.

Maybe there is another kind of helper function that could help here? Perhaps a is_utf8_prefix function that returns true if and only if the given bytes correspond to a valid prefix of a UTF-8 encoded codepoint, and false otherwise? It seems a little weird to me, but I think I'd be more amenable to that than another decoding function.

@thomcc
Copy link
Contributor Author

thomcc commented Mar 15, 2020

Oh I see. Yeah, that is fundamentally at odds with decode_utf8 I think, since returning that information requires knowledge of what follows. For example, what would decode_utf8(b"\xF0\xFF") return? It sounds like it would have to be different than what decode_utf8(b"\xF0") returns I think.

I mean, I'm making the API up but the difference I want between \xf0\xff and \xf0 would just be just why the failure happened in a very broad sense -- e.g. if one says it saw an illegal byte and the other that it saw truncation that would be usable.

E.g. I don't plan on using error_len, just error_len().is_some(). Hell, for my use case, I don't even need to know how long it took before a bad byte was seen. If the bytes are invalid utf8 at all (to be clear: distinct from truncated) that tells me I have to reprocess with a different algorithm.

Anyway I don't think this requires looking ahead, you already know these things at the time you return the data.

Note that the docs in these cases are imprecise in ways that have caused a bit of confusion for me In particular decode_byte (but also see[0])

When unsuccessful, None is returned along with the number of bytes that make up a maximal prefix of a valid UTF-8 code unit sequence. In this case, the number of bytes consumed is always between 0 and 3, inclusive, where 0 is only returned when slice is empty.

You should probably adjust to "a maximal prefix of a valid UTF-8 code unit sequence, or if none exists, 1" which matches the behavior and the description of "maximal subpart" in https://www.unicode.org/review/pr-121.html

Sadly, the way it's written would be more helpful to me for this case than the current function. It feels like the current function is mainly useful when you plan on replacing anything invalid anyway -- which is only in the rarest cases a thing I want to do, and even then, only when I'm absolutely certain it's not caused by truncation.

That said, it's possible I'm on the outside here -- bstr is certainly a library written with the assumption that lossy replacement is an okay default (and for very many cases this is probably true -- I'm not attempting to be critical of the design on the whole). And for the case where you do want to replace, it does seem to give you the ideal result.

Perhaps a is_utf8_prefix function that returns true if and only if the given bytes correspond to a valid prefix of a UTF-8 encoded codepoint, and false otherwise

Hmm, I mean, maybe? I already feel like I'm cobbling together what I need from a bunch of other APIs. It would be possible to use this, but I think it would make the code more awkward.

It's worth noting again: before I decode, I don't know how long the codepoint is going to be (obviously), and I also have no guarantee that the bytes after the codepoint are going to be valid UTF8 at all -- its completely reasonable for them to be non-utf8.

So either is_utf8_prefix would have to only look at one codepoints worth of bytes? It's not like I could pass in &bytes[..decode_utf8(bytes).1] because then it would say "yes" to something like \xf0\xff -- I guess I would have to distinguish this based on testing versus bytes.len()... (but currently I can almost do this, except for the case where len == 1, since len == 1 could mean 'invalid but lets move past it' or 'incomplete'). And so it would only help a very little bit I think.

Would it help if I gave you example code to show you?


[0] This one is dubious also and confused me earlier in this conversation: https://docs.rs/bstr/0.2.8/bstr/struct.Utf8Error.html#method.valid_up_to

Returns the byte index of the position immediately following the last valid UTF-8 byte

Since it goes on to talk about sub-bytes within a codepoint I thought this would mean it would tell me about a partial codepoint read, but what it actually does only seems to consider bytes which lie on codepoint boundaries, or on the boundaries where you'd insert a replacement character.

@BurntSushi
Copy link
Owner

I think I have pretty well lost track of what it is you're trying to do, so yes, I think a short example would help here.

Hmm, I mean, maybe? I already feel like I'm cobbling together what I need from a bunch of other APIs. It would be possible to use this, but I think it would make the code more awkward.

I don't know whether it's worth making your use case completely non-awkward. If your use case is very rare, then I'd consider "possible but awkward" to be a success here. But I don't mean to state any absolutes. I'm generally happy to add more methods or functions that are orthogonal in purpose, assuming a use case for them can be easily articulated in the docs with a short example.

The road I don't want to go down is to provide "alternate" APIs of things like decode_utf8 unless they are exceptionally well motivated. The problem I have with alternate APIs ("decode_utf8 but one that returns Utf8Error") is that they increase the API complexity of the crate in subtle ways. IMO, these sort of alternate APIs create a sort of decision-paralysis that makes it difficult for users of the crate to know what to use.

As far as documentation clarity, I'm always happy to see that improved. I confess that I do not quite grok your confusion on the existing wording. I think the best way to hash that out, if you're willing, is to submit a PR with the docs rewritten to match your ideal.

Thanks for your patience in dealing with me here. :)

@thomcc
Copy link
Contributor Author

thomcc commented Mar 18, 2020

Here's an example. It's not the same as the example I opened this bug with, essentially each time this comes up for me, I've found the way I have to handle it in bstr fairly painful.

That said, I don't know if it's compelling, as real usage would be more complex, and it's vaguely similar to what I'm asking for.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3a760cd1eda574b6978a23c192fa96e3

Anyway, the issues with it are:

  1. Decodes the string multiple times, with multiple decoding functions. If these behave differently, it will probably panic.
  2. Very unclear if it's failing to handle any edge cases. I'd want to test this rather exhaustively if I were going to actually use it. (Actually, I'd almost certainly just do the decode manually rather than use this implementation...)
  3. As I mentioned, it very much feels like it's cobbling together what it needs from existing APIs, when really being able to distinguish between 'invalid' and 'incomplete' feels like it should be supported in a more first class way...

And I guess that's the crux of it -- do you want to support cases where the action on for seeing invalid utf8 is anything other than 'replace' or 'bail out'? IMO the answer should be yes, at least in the case where the reason it's invalid is that it's truncated...

That said, essentially every API in the crate that can make a choice here seems to behave the other way -- either they perform the replacement for you, or they discard the information (e.g. https://github.com/BurntSushi/bstr/blob/master/src/utf8.rs#L601) you'd need in order to handle it without replacing.

It's admittedly a little frustrating, although I truly apologize if I've let my frustration come through or I've come across as impatient.

@BurntSushi
Copy link
Owner

OK, I've finally had a chance to look at your code. Your example helped a lot.
It was really really hard for me to understand what you really wanted without
that.

I'd like to brainstorm some possible solutions for you. Up front, I'd
personally like to think that adding a new decode function is off the table.
I really would like to figure out how to make your use case less awkward
without inciting confusion in the public API, which I think is what will happen
if we add multiple decoding functions. This may mean that your use cases
becomes "less awkward" rather than "not awkward at all."

All right, so here is the relevant code from the playground:

#[derive(Debug, Clone, Copy, PartialEq)]
enum ReadErr {
    Partial,
    Invalid,
}

fn try_decode(b: &[u8]) -> Result<(char, usize), ReadErr> {
    if b.is_empty() {
        return Err(ReadErr::Partial);
    }
    let (ch, l) = bstr::decode_utf8(b);
    if let Some(c) = ch {
        Ok((c, l))
    } else if l < b.len() {
        // If we didn't see all characters, something must have been invalid?
        Err(ReadErr::Invalid)
    } else {
        // Could either be invalid-because-truncate, or invalid-because-invalid
        assert_eq!(l, b.len(), "decode_utf8 returned length greater than input slice?");
        let err = b.to_str().expect_err("should fail here...");
        // sanity check...
        debug_assert_eq!(err.valid_up_to(), 0);
        if err.error_len().is_some() {
            Err(ReadErr::Invalid)
        } else {
            Err(ReadErr::Partial)
        }
    }
}

For me personally, I think that if we could make the else branch less
awkward, then that would be a win. It seems to me like if we had a
is_utf8_preifx method, or perhaps utf8_prefix_len, then your else branch
could be simplified to:

if utf8_prefix_len(b) == 0 {
    Err(ReadErr::Invalid)
} else {
    Err(ReadErr::Partial)
}

And that way, you avoid a weird second decoding step. Specifically, I think
utf8_prefix_len would have the following signature:

/// Returns the number of bytes at the beginning of `slice` that correspond to
/// a valid prefix of a UTF-8 encoded codepoint.
///
/// If slice does not start with a valid prefix, then this returns 0. If slice
/// starts with a complete UTF-8 encoded codepoint, then the prefix length is
/// equivalent to the number of bytes in the encoded codepoint.
///
/// The maximum value this function can return is 4.
fn utf8_prefix_len(slice: &[u8]) -> usize;

That said, essentially every API in the crate that can make a choice here
seems to behave the other way -- either they perform the replacement for you,
or they discard the information

It would help if you could be more specific than "every API," because to me,
that's just not true. As far as I can tell, the decode and decode_last
routines are the only APIs with this problem.

Now, there are many APIs that subtitute in replacement codepoints, but that's
because there really isn't another choice. For example, the chars() iterator
yields char values, so it could 1) panic or 2) make it a Result API or
3) use a sentinel value like the replacement codepoint or 4) invent its own
character type. (3) seems like by far the best choice there, and it's not
really clear how truncation would even work in APIs like that.

Moreover, truncation is precisely the motivation behind the Utf8Error
methods. In particular, the whole "decode one codepoint at a time" should
generally already be a very niche operation. So you're really operating in a
niche of a niche here. Typically, the thing most people are going to want is
to convert as much as they can to a &str as possible. And the Utf8Error
returned by the to_str APIs when invalid UTF-8 is seen is designed to let you
do this in a streaming context where truncation can occur. It just so happens
that it looks like you're working in a context where to_str isn't
appropriate and you care about truncation. I'm not sure how often that
actually occurs in practice.

Since my judgment at the moment (absent further evidence) is that this is a
weird use case, I'd like to help make it less awkward for you, but with as
little disruption to the API as possible. I hope that makes sense.

@BurntSushi BurntSushi reopened this May 10, 2020
@thomcc
Copy link
Contributor Author

thomcc commented May 10, 2020

Up front, I'd personally like to think that adding a new decode function is off the table.

You've expressed as much. My desire is mainly wanting to minimize how much
knowlege about UTF-8 internals I have to add to code in the cases where bstr
behaves differently than how I'd like.

For me personally, I think that if we could make the else branch less
awkward, then that would be a win.

Yes, I agree. Also, thank you for writing the signature and documentation
comment. That clears up #44, I think.

I'll respond to the rest of this just to get my thoughts here down.

It would help if you could be more specific than "every API," because to me,
that's just not true

You're right, of course -- and looking more closely, I had thought it was worse
than it actually is (as one of a few examples: I would have bet that the various
{lower,upper}case functions could insert replacement chars).

I think part of the difficulty is that the functions which can produce
replacement characters are not always sign-posted. As a point of comparison, I
believe that every function in Rust's stdlib that can produce replacement
characters notes this fact by adding _lossy in the name. To be clear, I'm not
suggesting that bstr should do this!

Just that when when many commonly used or desirable functions in the library
will insert replacement chars, it feels like the library is saying that it
thinks this is an okay default, which makes it unclear to me (as a user of the
library) where it will apply this default.

Which I guess says it comes down to the fact that I don't really having a
feeling for which functions will and won't do this. I suspect some of this could
be fixed by docs -- if there's an overarching philosophy here it would help to
internalize it better.

(There's more I could say here (not always a path forward when you realize the
API doesn't handle things how you want, etc), but... yeah).

In particular, the whole "decode one codepoint at a time" should generally
already be a very niche operation

I mean... I've seen s.chars().next() and variations many times in different
codebases as a way to get the next codepoint (or get the codepoint given a byte
index if combined with indexing). This is really just a bad API for decode (IMO,
anyway -- personaly I was very surprised there was nothing more direct than
using the iterator)

In an ideal world this sort of operation would be less common, though... But
that's not bstr's fault / responsibility.

@BurntSushi
Copy link
Owner

BurntSushi commented May 10, 2020

std uses lossy only for decoding stuff, which is what bstr does as well. Otherwise, lossiness doesn't really make sense in str APIs because of the guarantee of valid UTF-8. bstr has to have a fundamentally different design. I think there is really only one sensible implementation of chars() in bstr's design space, for example, so I'm not really sure what more I can do. AFAIK, every API that does Unicode replacement explicitly mentions it in its docs. And there is an entire top-level section of the crate docs dedicated to the topic.

It would really help if you could suggest ways to improve. I personally don't know what to do, but I'm ready and willing to believe it's confusing and I would love to fix it. I just don't know how.

This is really just a bad API for decode (IMO,
anyway -- personaly I was very surprised there was nothing more direct than
using the iterator)

Right. That's why bstr has a top-level decode function, which was influenced by Go's equivalent. But chars works great in most cases and I miss it whenever I write Go. Otherwise, bstr takes a lot of its design inspiration from how Go handles its strings, which are also only conventionally UTF-8.

@thomcc
Copy link
Contributor Author

thomcc commented Aug 14, 2020

I'm going to close this since life stuff happened before I got a chance to do it, and I believe that in the mean time there it looks like #68 is a better approach overall.

I will note that:

Right. That's why bstr has a top-level decode function,

This is one of the initial things I really liked about bstr. That, char_indices returning start and end (So great!), and the deliberate choice not to be a microcrate. (Anyway, deeply sorry if I came off as frustrated or short with you in this issue, it's no excuse but I was going through a lot. I deeply respect you and your work).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants