A few updates to the Unicode documentation. #1771

Merged: parrt merged 1 commit into antlr:master from mike-lischke:master on Mar 20, 2017

Conversation

@mike-lischke (Member)

It should be made clear that the recommended use of CharStreams.fromPath() is a Java-only solution. The other targets just have their ANTLRInputStream class extended to support full Unicode.
@parrt (Member) commented Mar 19, 2017

@bhamiltoncx should we take a whack at proposing a similar API for other targets or leave it Java-only?

parrt added the comp:doc label Mar 19, 2017
parrt added this to the 4.7 milestone Mar 19, 2017
@bhamiltoncx (Contributor) commented Mar 19, 2017 via email

@parrt (Member) commented Mar 19, 2017

Ok, if it's not a huge hassle, could you make a PR the @antlr/antlr-targets people can examine?

@bhamiltoncx (Contributor) commented Mar 19, 2017 via email

@mike-lischke (Member, Author) commented Mar 19, 2017

Uhm, why do you want to add extra input stream code just for handling full Unicode? To me it makes much more sense to just have ANTLRInputStream handle everything.

@bhamiltoncx (Contributor) commented Mar 19, 2017 via email

@parrt (Member) commented Mar 19, 2017 via email

@parrt (Member) commented Mar 19, 2017 via email

@mike-lischke (Member, Author)

Sorry for my ignorance, but why can't you keep backward compatibility and still use ANTLRInputStream? Just because this class would now allow values > 0xFFFF doesn't mean it would break any existing application. Existing applications don't use such values, so there is no behavior change for them. Sounds to me like fixing a problem where none exists.

@parrt (Member) commented Mar 19, 2017 via email

@bhamiltoncx (Contributor) commented Mar 19, 2017

@mike-lischke: So, the situation is that in Java, C#, and JavaScript, the runtime for ANTLRInputStream (or its language equivalent) actually exposed 16-bit UTF-16 code units — not Unicode code points — to the lexer.

So, for example, the Unicode input U+1F600 (😀) when sent through ANTLRInputStream in Java, C#, or JavaScript, would actually expose two values to the lexer (\uD83D\uDE00), not one.

Changing that now would be a fairly serious backwards-compatibility issue (it was perfectly legal to write a lexer for those two values \uD83D\uDE00), so we punted and I made sure to require clients to call a different API in those languages.

You're correct that the C++ runtime didn't have this issue.

In my analysis on #276 (comment), I confirmed that Go, Swift, and Python 3 were also OK.

Later, I found Python 2 was OK on Linux, but not Mac or Windows.

Since more than one language had issues (and the API to the ANTLRInputStream equivalent of each language is pretty different), I figured it'd be a good idea to consolidate everything into CharStreams.
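
For concreteness, here is a tiny, self-contained Java illustration of that code-unit split (the class name is just for this example):

// Shows U+1F600 occupying two UTF-16 code units but only one code point.
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F600)); // 😀
        System.out.println(s.length());                      // 2: two UTF-16 code units
        System.out.printf("%04X %04X%n",
                (int) s.charAt(0), (int) s.charAt(1));       // D83D DE00
        System.out.println(s.codePointCount(0, s.length())); // 1: one Unicode code point
    }
}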

@mike-lischke (Member, Author)

> So, for example, the Unicode input U+1F600 (😀) when sent through ANTLRInputStream in Java, C#, or JavaScript, would actually expose two values to the lexer (\uD83D\uDE00), not one.

Yes, sure, that's why the ANTLRInputStream must be enhanced to return 32-bit values.

> Changing that now would be a fairly serious backwards-compatibility issue (it was perfectly legal to write a lexer for those two values \uD83D\uDE00),

Not at all, if you add encoding support to ANTLRInputStream. You can support UTF-16 (and make that the default to stay compatible), UTF-8 (which is what CharStreams does now), and even UTF-32 (which doesn't require any conversion). That would be a much better implementation because users wouldn't have to learn a new API and wouldn't have to choose between two stream classes depending on how their input happens to be encoded (that's just crazy). Seriously, having to use different stream implementations is unnecessarily confusing.

@bhamiltoncx (Contributor) commented Mar 20, 2017 via email

@parrt (Member) commented Mar 20, 2017

For long-lived libraries, it's common to add a new interface and leave the old one; e.g., Java went ahead and fixed its streams to handle Unicode/UTF-8 properly by creating the Reader interface. Let's just see what @bhamiltoncx comes up with for the various targets and let those target authors discuss. It sounds like there's no point in creating something for C++ as it has no legacy (yet!) :)

By the way, users will not have to choose which stream to use. For new code they just use the new interface and it does the right thing.

@bhamiltoncx (Contributor)

OK, sent out #1775 and #1776 as examples of how the CharStreams API would look for C# and JavaScript.

C# and JS were the other two languages besides Java whose ANTLRInputStream equivalents exposed UTF-16 code units instead of Unicode code points. We wouldn't really need to provide this for other languages unless we wanted consistency.

parrt merged commit 6de2f3f into antlr:master Mar 20, 2017
@mike-lischke (Member, Author) commented Mar 21, 2017

OK, it seems I'm fighting a losing battle here. I'm sorry, but in this case I believe you are not doing the right thing. There have been so many changes since 4.5.x for which you did not create alternative classes to keep "backward compatibility". Backward compatibility is a nice thing, but you can take it too far, and I believe that is happening here. Version 4.7 is not a point release. Nobody expects to just drop it in and forget about it. People have to, and will, check whether their software still works after an upgrade. Surely it's not asking too much to also check whether their solution still works with the enhanced input handling. Most people won't notice anyway.

Now we have half of the targets using a different input solution than the others, which makes documentation more complicated and examples harder to convert. I really disagree with the approach taken here. I'm sorry for that.

@bhamiltoncx (Contributor) commented Mar 21, 2017 via email

@parrt (Member) commented Mar 21, 2017

What say the other @antlr/antlr-targets authors? I think @ericvergnaud had some specific concerns. Maybe we can address those directly. The correctness and speed of the new streams appear to be good.

@parrt (Member) commented Mar 21, 2017

Ok, NOW I remember: the new stream has to use 32 bits per symbol, which literally doubles the memory required to hold a document. Given that I regularly process large corpora and hold them all in memory at once, this is a nontrivial burden. ANTLR-generated parsers are already memory pigs.

I therefore cannot simply replace ANTLRFileStream with one that takes twice as much memory. I think the best solution is to simply add a new ANTLRFileStream32 or some such, and users who are willing to take the hit in order to get 32-bit Unicode can do so. Existing users will not experience any increase in memory, reduction in speed, or loss of backward compatibility. Given this, it doesn't make sense to have a new CharStreams.XXX interface; we should stick with the ctor-based mechanism and do new ANTLRFileStream32(...) or whatever.

@bhamiltoncx (Contributor) commented Mar 21, 2017 via email

@parrt (Member) commented Mar 21, 2017

yeah, I was wondering about encoding it, but it seems like the lookup table would either add a lot of memory or slow down access for LT(i).

I think we should just introduce the new CodePointCharStream class with appropriate constructors. I am open to that class name or ANTLRInputStream32 or some such.

@mike-lischke (Member, Author) commented Mar 21, 2017

I was thinking about how to make the input stream decode UTF-8 on the fly. There would be no need to keep the entire document in memory then. The token stream would keep the start/stop indexes of the chars for seeking in the input. We would lose the 1:1 mapping between index and code point, however.

@parrt (Member) commented Mar 21, 2017

Well, we also have the unbuffered stream. I'd rather not mix the two.

Also, I noticed that the new CodePointCharStream can only handle UTF-8 versus any other encoding. Ben indicated that that is more or less what everyone uses. Can people comment about other locales, such as India/Japan/Laos? In other words, do people need the ability to read non-Unicode code points, possibly with some non-UTF-X encoding? People surely had file formats for Japan before Unicode came around... do we need to worry about this? I ask this out of ignorance.

@bhamiltoncx (Contributor) commented Mar 21, 2017 via email

@bhamiltoncx (Contributor) commented Mar 21, 2017

My main problem with that is I feel pretty strongly that:

  1. The main API should Do The Right Thing (code points) by default (Python, Go, C++, and Swift all do the right thing by default)
  2. Introducing an API with numbers at the end almost guarantees people will not use or understand it
  3. We want some kind of path to deprecate the UTF-16 API eventually for Java, JS, and C#

I don't think introducing an ANTLRInputStream32 really gets us there.

@parrt (Member) commented Mar 21, 2017

I mostly agree with 1 and 2, but I'm not sure using 50% less RAM will go out of style for me. ;) Maybe we make ANTLRFileStream / ANTLRInputStream be the 32-bit code point streams and shift the 16-bit stuff to a new name. Getting those streams to work correctly for chars, tokens, etc. was fiddly. I hesitate to try for an all-in-one class. I don't see how to get a single interface without a factory, which means changing the API from ctor style. That means C++/Python/etc. will be different from Java and the other targets we have PRs for.

@bhamiltoncx (Contributor)

I hear you on improving RAM usage! I can send a diff to update CodePointCharStream to choose one of IntBuffer / CharBuffer / ByteBuffer for storage based on the contents of the input.

I would be OK with shifting the existing 16-bit logic in the runtimes to a legacy name if that's good with everyone else. However, we'd have to rename all the subclasses as well.

@bhamiltoncx (Contributor)

Basically, my strategy would be to assume US-ASCII only, and start with ByteBuffer until we see a code point > U+007F. If we see a 16-bit code point between U+007F and U+FFFF, we upgrade to CharBuffer. If we see a surrogate or code point larger than U+FFFF, we upgrade to IntBuffer.
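
A rough Java sketch of that upgrade strategy (the class name and the fixed capacity are illustrative only, not the actual runtime code):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.IntBuffer;

// Sketch only: start narrow, widen whenever a larger code point appears.
final class AdaptiveCodePointBuffer {
    private static final int CAPACITY = 4096;  // fixed for brevity; real code would grow buffers
    private ByteBuffer bytes = ByteBuffer.allocate(CAPACITY);
    private CharBuffer chars;                  // non-null once a code point > U+007F is seen
    private IntBuffer ints;                    // non-null once a surrogate or > U+FFFF is seen

    void append(int cp) {
        if (ints != null) {
            ints.put(cp);
            return;
        }
        if (chars == null && cp <= 0x7F) {
            bytes.put((byte) cp);              // still pure US-ASCII
            return;
        }
        if (chars == null) {                   // first code point above U+007F: bytes -> chars
            chars = CharBuffer.allocate(CAPACITY);
            bytes.flip();
            while (bytes.hasRemaining()) {
                chars.put((char) (bytes.get() & 0xFF));
            }
        }
        if (cp <= 0xFFFF && !Character.isSurrogate((char) cp)) {
            chars.put((char) cp);
            return;
        }
        ints = IntBuffer.allocate(CAPACITY);   // surrogate or supplementary: chars -> ints
        chars.flip();
        while (chars.hasRemaining()) {
            ints.put(chars.get());
        }
        ints.put(cp);
    }
}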

@millergarym (Contributor)

I'm not sure if this is useful, as I'm not fully across the issue. Here's a perspective from Go land.

In Go all strings are UTF-8 encoded. Not surprising, since the authors of UTF-8 authored Go. All built-in functions and constructs know this: len(somestring) returns the number of 'runes', not bytes; range somestring returns runes; somestring[i] is a rune. For ASCII a rune is a byte, and bigger for code points higher than ASCII (look up the tech details if necessary). Strings are really stored as a byte array, but from the programmer's perspective they are an array of runes.

I don't know how Java handles this. It might get expensive if every byte needs to be boxed.

Does talking about ASCII vs UTF-8 make more sense than talking about 8 or 32 bits?

@bhamiltoncx (Contributor)

@millergarym : Right. The logic to transparently choose between 8, 16, and 32-bit units for Go runes is basically what I'm suggesting.

@ericvergnaud (Contributor) commented Mar 22, 2017 via email

@ericvergnaud (Contributor) commented Mar 22, 2017 via email

@ericvergnaud (Contributor) commented Mar 22, 2017 via email

@bhamiltoncx (Contributor)

@ericvergnaud : I'm just talking about the default behavior of ANTLRInputStream / ANTLRFileStream in Java, which is to read the entire stream into memory. We can make better-behaved ones of course.

The CharStreams code I added doesn't require UTF-8, of course. It supports other encodings, but the Java logic assumes they will never have code points > U+FFFF.

@bhamiltoncx (Contributor)

@parrt: I sent out #1781 to improve memory usage of the new Java runtime.

It should actually be ~ 2x better than the existing ANTLRInputStream for US-ASCII-only content, depending on the initial buffer size we choose.

@bhamiltoncx (Contributor)

> In banks we have all sorts of mature systems with all sorts of encodings.

And yes, @ericvergnaud is totally correct — I just meant it might not be as important to worry about making the non-UTF-8 code path performance-optimal. (It of course needs to be fully functional and bug-free!)

@parrt (Member) commented Mar 22, 2017

@ericvergnaud regarding reading entire input; that's exactly what we've been doing for years! hahaha

@bhamiltoncx (Contributor) commented Mar 22, 2017

@parrt : I do see @ericvergnaud 's point, though — for the factory-based APIs which take in a stream, it makes sense to have a version which also takes a callback which we invoke after each read to ask the client if we should continue reading.

That allows the client to parse multiple objects directly from a stream with some kind of separator (imagine a streaming JSON parser which separates JSON objects by \n).

The tricky bit there is somebody will have to cache any remaining bits after the separator. :)
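
A minimal sketch of what such a callback-taking factory could look like (hypothetical shape; fromReaderUntil and shouldContinue are not an actual ANTLR API):

import java.io.IOException;
import java.io.Reader;
import java.util.function.IntPredicate;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;

public class StreamingFactorySketch {
    // Read until EOF or until the client-supplied predicate says stop
    // (e.g., at a '\n' separator), then wrap that chunk in a buffered CharStream.
    public static CharStream fromReaderUntil(Reader r, IntPredicate shouldContinue)
            throws IOException {
        StringBuilder buf = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            buf.append((char) c);
            if (!shouldContinue.test(c)) {
                break; // client said stop; the next object starts at the following read
            }
        }
        return CharStreams.fromString(buf.toString());
    }
}

Reading one unit at a time sidesteps the "cache the remaining bits" problem at some performance cost; a block-reading version would have to hold on to whatever it read past the separator.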

@parrt (Member) commented Mar 22, 2017

The problem is that the parser does all sorts of lookahead and must itself control how much of the input to buffer, possibly the entire thing.

@bhamiltoncx (Contributor)

Sure, but the socket client can have some other kind of inexpensive delimiter like \n which guarantees the parser will have a full buffer available, after which another full buffer will occur.

@ericvergnaud 's point is that without a way for the client to say "hold off on parsing more", there's no way to implement these types of streaming parsers.

@parrt (Member) commented Mar 22, 2017

There is no guarantee that the parser does not need to see beyond \n. It all depends on the skill of the grammar writer, unfortunately. The best way is to have a reader, outside of antlr stuff, that collects one string at a time and sends that to the parser.

Whatever we have now appears to be working, unless there's a case people are simply not telling me about. The guys at Apple specifically asked me if they could do unbuffered input for reading sockets and that sort of thing and without building a tree. I made sure that is possible.

@bhamiltoncx (Contributor)

I suspect the missing connection here is that the new CharStreams factory API doesn't have an unbuffered input option from which clever parser authors can create such a client.

@parrt (Member) commented Mar 22, 2017

ah. heh, i just found something i wrote down (in the book):

13.8 Unbuffered Character and Token Streams
Because ANTLR recognizers buffer up the entire input character stream and all input tokens by default, they can’t handle input files that are bigger than a computer’s memory and can’t handle infinite streams like socket connections. To overcome this, you can use unbuffered versions of the character and token streams, which keep just a small sliding window into the streams: UnbufferedCharStream and UnbufferedTokenStream.

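// "is" is presumably a java.io.InputStream (e.g., from a file or socket)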
CharStream input = new UnbufferedCharStream(is);
CSVLexer lex = new CSVLexer(input);
// copy text out of sliding buffer and store in tokens
lex.setTokenFactory(new CommonTokenFactory(true));
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);

@ericvergnaud (Contributor) commented Mar 23, 2017 via email

@parrt (Member) commented Mar 23, 2017

@ericvergnaud what do you mean?

@parrt (Member) commented Mar 23, 2017

Per @mike-lischke, we really should make the API consistent across language targets. @bhamiltoncx says:

> C# and JS were the other two languages besides Java whose ANTLRInputStream equivalents exposed UTF-16 code units instead of Unicode code points. We wouldn't really need to provide this for other languages unless we wanted consistency.

This would mean a factory style CharStreams.fromFileName(...) type thing for all targets. Is this even meaningful in C++? I suppose it could be just a static function scoped inside CharStreams::fromFileName(...)? It would appear that @ericvergnaud has no fundamental disagreement with adding this interface for his targets. What about the other @antlr/antlr-targets? I am warming to the new factory style interface.

@bhamiltoncx has removed my size objection to the new Unicode code point streams by making it 50% of the size it used to be. haha. kudos. It is of course using the new factory interface, though.

The new interface would be described in the documentation as the way to get input into the lexer.

I think the only remaining question is what to do with the existing ANTLRInputStream and friends. Choices:

  1. we leave the ANTLRInputStream and friends around as-is so that existing code compiles and runs.
  2. we leave those class names around but replace their guts with the new code point stream. This has the advantage of having just one internal mechanism to handle character streams; the class names would hang around just for backward compatibility.

In light of the fact that the new stream will be 8-bit optimized, it seems like option number 2 is the right choice.

@bhamiltoncx (Contributor)

I like that proposal (adopting the factory-style interface, which could just be namespaced functions for C++). We can make the existing ANTLRInputStream a simple wrapper if we think the perf and memory usage of the new code is good.

I agree the factory-style interface doesn't do a ton for C++ today, but it opens up possibilities for more types of input streams in the future.

If we're cool with getting rid of backwards compatibility with UTF-16 input in Java, C#, and JS, I think it'd be a big step forward.

@parrt (Member) commented Mar 23, 2017

Re UTF-16: oh right. There were a number of grammars that were manually looking for UTF-16 code units in an effort to simulate 32-bit code points right? Hmm... probably okay.

I forgot to ask about different encodings, like the legacy ones Eric mentioned from bank systems. How do we incorporate different encodings than UTF-X? Will this explode the number of methods we need in the common character stream factory interface?

@bhamiltoncx (Contributor) commented Mar 23, 2017

> Re UTF-16: oh right. There were a number of grammars that were manually looking for UTF-16 code units in an effort to simulate 32-bit code points right? Hmm... probably okay.

Right. So, we have three options:

  1. Provide no support for such grammars
  2. Keep ANTLRInputStream around as-is for such grammars but mark it and its equivalents as deprecated in Java, C#, and JavaScript
  3. Provide a new, supported API to create CharStreams which expose UTF-16 code units

> I forgot to ask about different encodings, like the legacy ones Eric mentioned from bank systems. How do we incorporate different encodings than UTF-X? Will this explode the number of methods we need in the common character stream factory interface?

Mostly done. I made the new factory APIs take an optional Charset:

public static CharStream fromPath(Path path, Charset charset) throws IOException

Right now, these APIs delegate to ANTLRInputStream for non-UTF-8 so we can convert from those legacy encodings to UTF-16, but I'll follow up after #1781 lands and make them directly convert to CodePointCharStream.
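
For instance, a usage sketch (the file name and encoding are illustrative):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Paths;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;

public class LegacyEncodingExample {
    public static void main(String[] args) throws IOException {
        // The given charset is decoded while reading the file.
        CharStream input = CharStreams.fromPath(
                Paths.get("legacy-input.txt"), Charset.forName("Shift_JIS"));
        System.out.println(input.size());
    }
}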

@mike-lischke (Member, Author) commented Mar 24, 2017

I see there's really a lot going on here, great, with @bhamiltoncx giving 100% for a good solution. Still, to me the entire thing looks very Java-centric. Languages have streams, and ANTLR should support them (I mean their native streams) via a simple interface (CharStream, probably). The idea should be: "user, give me a stream that returns 32-bit code points, period". The user then has to take care of converting other encodings to UTF-32 if required; ANTLR cannot cope with every encoding used in this world. The factory approach doesn't allow selecting the stream class that fits a specific purpose. IMO ANTLR shouldn't care what native stream is used under the hood. This is also the most flexible approach for us target authors.

For convenience, ANTLR provides one such stream out of the box (the one that takes UTF-8). If backward compatibility is such a big concern, then we can indeed deprecate the current ANTLRInputStream class and provide a new AntlrUtf8InputStream. The new targets just rename their ANTLRInputStream classes and the older ones add a new class with that name. This would be my preferred solution here.

Just a few more thoughts for the general handling as I imagine it:

Whether a stream loads everything into memory should not be the decision of the ANTLR runtime (users might have a [native] stream interface for memory-mapped files that can deal with gigabyte-sized input, which is both unbuffered and randomly seekable). Streams can indicate whether they support random seeking. Most of the time users will probably use a string stream, which allows handling in-memory strings like a stream (some targets, like C++, already have a class for this). This approach would also end the buffered-vs-unbuffered separation, and we wouldn't need the CharStreams stuff either.

@ericvergnaud (Contributor)

@parrt not sure what I had in mind... best to forget about it :-)

@ericvergnaud (Contributor) commented Mar 24, 2017 via email
