A few updates to the Unicode documentation. #1771

Merged: parrt merged 1 commit into antlr:master from mike-lischke:master on Mar 20, 2017

Conversation

@mike-lischke (Member)

It should be made clear that the recommended use of CharStreams.fromPath() is a Java-only solution. The other targets just have their ANTLRInputStream class extended to support full Unicode.
@parrt (Member) commented Mar 19, 2017

@bhamiltoncx should we take a whack at proposing a similar API for other targets or leave it Java-only?

parrt added the comp:doc label Mar 19, 2017
parrt added this to the 4.7 milestone Mar 19, 2017
@bhamiltoncx (Contributor) commented Mar 19, 2017 via email

@parrt (Member) commented Mar 19, 2017

Ok, if it's not a huge hassle, could you make a PR the @antlr/antlr-targets people can examine?

@bhamiltoncx (Contributor) commented Mar 19, 2017 via email

@mike-lischke (Member, Author) commented Mar 19, 2017

Uhm, why do you want to add extra input stream code just for handling full Unicode? To me it makes much more sense to just have ANTLRInputStream handle everything.

@bhamiltoncx (Contributor) commented Mar 19, 2017 via email

@parrt (Member) commented Mar 19, 2017 via email

@parrt (Member) commented Mar 19, 2017 via email

@mike-lischke (Member, Author)

Sorry for my ignorance, but why can't you keep backward compatibility and still use ANTLRInputStream? Just because this class would now allow values > 0xFFFF doesn't mean it would break any existing application. Existing applications don't use such values, so there is no behavior change for them. Sounds to me like fixing a problem where none exists.

@parrt (Member) commented Mar 19, 2017 via email

@bhamiltoncx (Contributor) commented Mar 19, 2017

@mike-lischke: So, the situation is that in Java, C#, and JavaScript, the runtime for ANTLRInputStream (or its language equivalent) actually exposed 16-bit UTF-16 code units — not Unicode code points — to the lexer.

So, for example, the Unicode input U+1F600 (😀) when sent through ANTLRInputStream in Java, C#, or JavaScript, would actually expose two values to the lexer (\uD83D\uDE00), not one.

Changing that now would be a fairly serious backwards-compatibility issue (it was perfectly legal to write a lexer for those two values \uD83D\uDE00), so we punted and I made sure to require clients to call a different API in those languages.

You're correct that the C++ runtime didn't have this issue.

In my analysis on #276 (comment), I confirmed that Go, Swift, and Python 3 were also OK.

Later, I found Python 2 was OK on Linux, but not Mac or Windows.

Since more than one language had issues (and the API to the ANTLRInputStream equivalent of each language is pretty different), I figured it'd be a good idea to consolidate everything into CharStreams.
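
For concreteness, here is a tiny, self-contained Java illustration of that code-unit split (the class name is just for this example):

// Shows U+1F600 occupying two UTF-16 code units but only one code point.
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F600)); // 😀
        System.out.println(s.length());                      // 2: two UTF-16 code units
        System.out.printf("%04X %04X%n",
                (int) s.charAt(0), (int) s.charAt(1));       // D83D DE00
        System.out.println(s.codePointCount(0, s.length())); // 1: one Unicode code point
    }
}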

@mike-lischke (Member, Author)

> So, for example, the Unicode input U+1F600 (😀) when sent through ANTLRInputStream in Java, C#, or JavaScript, would actually expose two values to the lexer (\uD83D\uDE00), not one.

Yes, sure, that's why the ANTLRInputStream must be enhanced to return 32-bit values.

> Changing that now would be a fairly serious backwards-compatibility issue (it was perfectly legal to write a lexer for those two values \uD83D\uDE00),

Not at all, if you add encoding support to ANTLRInputStream. You can support UTF-16 (and make that the default to stay compatible), UTF-8 (which is what CharStreams does now), and even UTF-32 (which doesn't require any conversion). That would be a much better implementation because users wouldn't have to learn a new API and wouldn't have to choose between two stream classes depending on how their input happens to be encoded (that's just crazy). Seriously, having to use different stream implementations is unnecessarily confusing.

@bhamiltoncx (Contributor) commented Mar 20, 2017 via email

@parrt (Member) commented Mar 20, 2017

For long-lived libraries, it's common to add a new interface and leave the old one; e.g., Java went ahead and fixed its streams to handle Unicode/UTF-8 properly by creating the Reader interface. Let's just see what @bhamiltoncx comes up with for the various targets and let those target authors discuss. It sounds like there's no point in creating something for C++ as it has no legacy (yet!) :)

By the way, users will not have to choose which stream to use. For new code they just use the new interface and it does the right thing.

@bhamiltoncx (Contributor)

OK, sent out #1775 and #1776 as examples of how the CharStreams API would look for C# and JavaScript.

C# and JS were the other two languages besides Java whose ANTLRInputStream equivalents exposed UTF-16 code units instead of Unicode code points. We wouldn't really need to provide this for other languages unless we wanted consistency.

parrt merged commit 6de2f3f into antlr:master Mar 20, 2017
@mike-lischke (Member, Author) commented Mar 21, 2017

OK, it seems I'm fighting a losing battle here. I'm sorry, but in this case I believe you are not doing the right thing. There have been so many changes since 4.5.x for which you did not create alternative classes to keep "backward compatibility". Backward compatibility is a nice thing, but you can take it too far, and I believe that is happening here. Version 4.7 is not a point release. Nobody expects to just drop it in and forget about it. People have to, and will, check whether their software still works after an upgrade. Surely it's not asking too much to also check whether their solution still works with the enhanced input handling. Most people won't notice anyway.

Now we have half of the targets using a different input solution than the others, which makes documentation more complicated and examples harder to convert. I really disagree with the approach taken here. I'm sorry for that.

@bhamiltoncx (Contributor) commented Mar 21, 2017 via email

@parrt (Member) commented Mar 21, 2017

What say the other @antlr/antlr-targets authors? I think @ericvergnaud had some specific concerns. Maybe we can address those directly. The correctness and speed of the new streams appear to be good.

@parrt (Member) commented Mar 21, 2017

Ok, NOW I remember: the new stream has to use 32 bits per symbol, which literally doubles the memory required to hold a document. Given that I regularly process large corpora and hold them all in memory at once, this is a nontrivial burden. ANTLR-generated parsers are already memory pigs.

I therefore cannot simply replace ANTLRFileStream with one that takes twice as much memory. I think the best solution is to simply add a new ANTLRFileStream32 or some such, and users who are willing to take the hit in order to get 32-bit Unicode can do so. Existing users will not experience any increase in memory, reduction in speed, or loss of backward compatibility. Given this, it doesn't make sense to have a new CharStreams.XXX interface; we should stick with the ctor-based mechanism and do new ANTLRFileStream32(...) or whatever.

@bhamiltoncx (Contributor) commented Mar 21, 2017 via email

@parrt (Member) commented Mar 21, 2017

yeah, I was wondering about encoding it, but it seems like the lookup table would either add a lot of memory or slow down access for LT(i).

I think we should just introduce the new CodePointCharStream class with appropriate constructors. I am open to that class name or ANTLRInputStream32 or some such.

@mike-lischke (Member, Author) commented Mar 21, 2017

I was thinking about how to make the input stream decode UTF-8 on the fly. There would be no need to keep the entire document in memory then. The token stream would keep the start/stop indexes of the chars for seeking in the input. We would lose the 1:1 mapping between index and code point, however.

@parrt (Member) commented Mar 21, 2017

Well, we also have the unbuffered stream. I'd rather not mix the two.

Also, I noticed that the new CodePointCharStream can only handle UTF-8 versus any other encoding. Ben indicated that that is more or less what everyone uses. Can people comment about other locales, such as India/Japan/Laos? In other words, do people need the ability to read non-Unicode code points, possibly with some non-UTF-X encoding? People surely had file formats for Japan before Unicode came around... do we need to worry about this? I ask this out of ignorance.

@bhamiltoncx (Contributor) commented Mar 21, 2017 via email

@bhamiltoncx (Contributor) commented Mar 21, 2017

My main problem with that is I feel pretty strongly that:

  1. The main API should Do The Right Thing (code points) by default (Python, Go, C++, and Swift all do the right thing by default)
  2. Introducing an API with numbers at the end almost guarantees people will not use or understand it
  3. We want some kind of path to deprecate the UTF-16 API eventually for Java, JS, and C#

I don't think introducing an ANTLRInputStream32 really gets us there.

@parrt (Member) commented Mar 21, 2017

I mostly agree with 1 and 2, but I'm not sure using 50% less RAM will go out of style for me. ;) Maybe we make ANTLRFileStream / ANTLRInputStream be the 32-bit code point streams and shift the 16-bit stuff to a new name. Getting those streams to work correctly for chars, tokens, etc. was fiddly. I hesitate to try for an all-in-one class. I don't see how to get a single interface without a factory, which means changing the API from ctor style. That means C++/Python/etc. will be different from Java and the other targets we have PRs for.

@bhamiltoncx (Contributor)

I hear you on improving RAM usage! I can send a diff to update CodePointCharStream to choose one of IntBuffer / CharBuffer / ByteBuffer for storage based on the contents of the input.

I would be OK with shifting the existing 16-bit logic in the runtimes to a legacy name if that's good with everyone else. However, we'd have to rename all the subclasses as well.

@bhamiltoncx (Contributor)

Basically, my strategy would be to assume US-ASCII only, and start with ByteBuffer until we see a code point > U+007F. If we see a 16-bit code point between U+007F and U+FFFF, we upgrade to CharBuffer. If we see a surrogate or code point larger than U+FFFF, we upgrade to IntBuffer.
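
A rough Java sketch of that upgrade strategy (the class name and the fixed capacity are illustrative only, not the actual runtime code):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.IntBuffer;

// Sketch only: start narrow, widen whenever a larger code point appears.
final class AdaptiveCodePointBuffer {
    private static final int CAPACITY = 4096;  // fixed for brevity; real code would grow buffers
    private ByteBuffer bytes = ByteBuffer.allocate(CAPACITY);
    private CharBuffer chars;                  // non-null once a code point > U+007F is seen
    private IntBuffer ints;                    // non-null once a surrogate or > U+FFFF is seen

    void append(int cp) {
        if (ints != null) {
            ints.put(cp);
            return;
        }
        if (chars == null && cp <= 0x7F) {
            bytes.put((byte) cp);              // still pure US-ASCII
            return;
        }
        if (chars == null) {                   // first code point above U+007F: bytes -> chars
            chars = CharBuffer.allocate(CAPACITY);
            bytes.flip();
            while (bytes.hasRemaining()) {
                chars.put((char) (bytes.get() & 0xFF));
            }
        }
        if (cp <= 0xFFFF && !Character.isSurrogate((char) cp)) {
            chars.put((char) cp);
            return;
        }
        ints = IntBuffer.allocate(CAPACITY);   // surrogate or supplementary: chars -> ints
        chars.flip();
        while (chars.hasRemaining()) {
            ints.put(chars.get());
        }
        ints.put(cp);
    }
}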

@millergarym (Contributor)

I'm not sure if this is useful, as I'm not fully across the issue. Here's a perspective from Go land.

In Go all strings are UTF-8 encoded. Not surprising, since the authors of UTF-8 authored Go. All built-in functions and constructs know this: len(somestring) returns the number of 'runes', not bytes; range somestring returns runes; somestring[i] is a rune. For ASCII a rune is a byte, and bigger for code points higher than ASCII (look up the tech details if necessary). Strings are really stored as a byte array, but from the programmer's perspective they are an array of runes.

I don't know how Java handles this. It might get expensive if every byte needs to be boxed.

Does talking about ASCII vs UTF-8 make more sense than talking about 8 or 32 bits?

@bhamiltoncx (Contributor)

@millergarym : Right. The logic to transparently choose between 8, 16, and 32-bit units for Go runes is basically what I'm suggesting.

@ericvergnaud (Contributor) commented Mar 22, 2017 via email

@ericvergnaud (Contributor) commented Mar 22, 2017 via email

@ericvergnaud (Contributor) commented Mar 22, 2017 via email

@bhamiltoncx (Contributor)

@ericvergnaud : I'm just talking about the default behavior of ANTLRInputStream / ANTLRFileStream in Java, which is to read the entire stream into memory. We can make better-behaved ones of course.

The CharStreams code I added doesn't require UTF-8, of course. It supports other encodings, but the Java logic assumes they will never have code points > U+FFFF.

@bhamiltoncx (Contributor)

@parrt: I sent out #1781 to improve memory usage of the new Java runtime.

It should actually be ~ 2x better than the existing ANTLRInputStream for US-ASCII-only content, depending on the initial buffer size we choose.

@bhamiltoncx (Contributor)

> In banks we have all sorts of mature systems with all sorts of encodings.

And yes, @ericvergnaud is totally correct — I just meant it might not be as important to worry about making the non-UTF-8 code path performance-optimal. (It of course needs to be fully functional and bug-free!)

@parrt (Member) commented Mar 22, 2017

@ericvergnaud regarding reading entire input; that's exactly what we've been doing for years! hahaha

@bhamiltoncx (Contributor) commented Mar 22, 2017

@parrt : I do see @ericvergnaud 's point, though — for the factory-based APIs which take in a stream, it makes sense to have a version which also takes a callback which we invoke after each read to ask the client if we should continue reading.

That allows the client to parse multiple objects directly from a stream with some kind of separator (imagine a streaming JSON parser which separates JSON objects by \n).

The tricky bit there is somebody will have to cache any remaining bits after the separator. :)
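
A minimal sketch of what such a callback-taking factory could look like (hypothetical shape; fromReaderUntil and shouldContinue are not an actual ANTLR API):

import java.io.IOException;
import java.io.Reader;
import java.util.function.IntPredicate;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;

public class StreamingFactorySketch {
    // Read until EOF or until the client-supplied predicate says stop
    // (e.g., at a '\n' separator), then wrap that chunk in a buffered CharStream.
    public static CharStream fromReaderUntil(Reader r, IntPredicate shouldContinue)
            throws IOException {
        StringBuilder buf = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            buf.append((char) c);
            if (!shouldContinue.test(c)) {
                break; // client said stop; the next object starts at the following read
            }
        }
        return CharStreams.fromString(buf.toString());
    }
}

Reading one unit at a time sidesteps the "cache the remaining bits" problem at some performance cost; a block-reading version would have to hold on to whatever it read past the separator.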

@parrt (Member) commented Mar 22, 2017

The problem is that the parser does all sorts of lookahead and must itself control how much of the input to buffer, possibly the entire thing.

@bhamiltoncx (Contributor)

Sure, but the socket client can have some other kind of inexpensive delimiter like \n which guarantees the parser will have a full buffer available, after which another full buffer will occur.

@ericvergnaud 's point is that without a way for the client to say "hold off on parsing more", there's no way to implement these types of streaming parsers.

@parrt (Member) commented Mar 22, 2017

There is no guarantee that the parser does not need to see beyond \n. It all depends on the skill of the grammar writer, unfortunately. The best way is to have a reader, outside of antlr stuff, that collects one string at a time and sends that to the parser.

Whatever we have now appears to be working, unless there's a case people are simply not telling me about. The guys at Apple specifically asked me if they could do unbuffered input for reading sockets and that sort of thing and without building a tree. I made sure that is possible.

@bhamiltoncx (Contributor)

I suspect the missing connection here is that the new CharStreams factory API doesn't have an unbuffered input option from which clever parser authors can create such a client.

@parrt (Member) commented Mar 22, 2017

ah. heh, i just found something i wrote down (in the book):

13.8 Unbuffered Character and Token Streams
Because ANTLR recognizers buffer up the entire input character stream and all input tokens by default, they can’t handle input files that are bigger than a computer’s memory and can’t handle infinite streams like socket connections. To overcome this, you can use unbuffered versions of the character and token streams, which keep just a small sliding window into the streams: UnbufferedCharStream and UnbufferedTokenStream.

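// "is" is presumably a java.io.InputStream (e.g., from a file or socket)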
CharStream input = new UnbufferedCharStream(is);
CSVLexer lex = new CSVLexer(input);
// copy text out of sliding buffer and store in tokens
lex.setTokenFactory(new CommonTokenFactory(true));
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);

@ericvergnaud (Contributor) commented Mar 23, 2017 via email

@parrt (Member) commented Mar 23, 2017

@ericvergnaud what do you mean?

@parrt (Member) commented Mar 23, 2017

Per @mike-lischke, we really should make the API consistent across language targets. @bhamiltoncx says:

> C# and JS were the other two languages besides Java whose ANTLRInputStream equivalents exposed UTF-16 code units instead of Unicode code points. We wouldn't really need to provide this for other languages unless we wanted consistency.

This would mean a factory style CharStreams.fromFileName(...) type thing for all targets. Is this even meaningful in C++? I suppose it could be just a static function scoped inside CharStreams::fromFileName(...)? It would appear that @ericvergnaud has no fundamental disagreement with adding this interface for his targets. What about the other @antlr/antlr-targets? I am warming to the new factory style interface.

@bhamiltoncx has removed my size objection to the new Unicode code point streams by making it 50% of the size it used to be. haha. kudos. It is of course using the new factory interface, though.

The new interface would be described in the documentation as the way to get input into the lexer.

I think the only remaining question is what to do with the existing ANTLRInputStream and friends. Choices:

  1. we leave the ANTLRInputStream and friends around as-is so that existing code compiles and runs.
  2. we leave those class names around but replace their guts with the new code point stream. This has the advantage of having just one internal mechanism to handle character streams; the class names would hang around just for backward compatibility.

In light of the fact that the new stream will be 8-bit optimized, it seems like option number 2 is the right choice.

@bhamiltoncx (Contributor)

I like that proposal (adopting the factory-style interface, which could just be namespaced functions for C++). We can make the existing ANTLRInputStream a simple wrapper if we think the perf and memory usage of the new code is good.

I agree the factory-style interface doesn't do a ton for C++ today, but it opens up possibilities for more types of input streams in the future.

If we're cool with getting rid of backwards compatibility with UTF-16 input in Java, C#, and JS, I think it'd be a big step forward.

@parrt (Member) commented Mar 23, 2017

Re UTF-16: oh right. There were a number of grammars that were manually looking for UTF-16 code units in an effort to simulate 32-bit code points right? Hmm... probably okay.

I forgot to ask about different encodings, like the legacy ones Eric mentioned from bank systems. How do we incorporate different encodings than UTF-X? Will this explode the number of methods we need in the common character stream factory interface?

@bhamiltoncx (Contributor) commented Mar 23, 2017

> Re UTF-16: oh right. There were a number of grammars that were manually looking for UTF-16 code units in an effort to simulate 32-bit code points right? Hmm... probably okay.

Right. So, we have three options:

  1. Provide no support for such grammars
  2. Keep ANTLRInputStream around as-is for such grammars but mark it and its equivalents as deprecated in Java, C#, and JavaScript
  3. Provide a new, supported API to create CharStreams which expose UTF-16 code units

> I forgot to ask about different encodings, like the legacy ones Eric mentioned from bank systems. How do we incorporate different encodings than UTF-X? Will this explode the number of methods we need in the common character stream factory interface?

Mostly done. I made the new factory APIs take an optional Charset:

public static CharStream fromPath(Path path, Charset charset) throws IOException

Right now, these APIs delegate to ANTLRInputStream for non-UTF-8 so we can convert from those legacy encodings to UTF-16, but I'll follow up after #1781 lands and make them directly convert to CodePointCharStream.
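
For instance, a usage sketch (the file name and encoding are illustrative):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Paths;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;

public class LegacyEncodingExample {
    public static void main(String[] args) throws IOException {
        // The given charset is decoded while reading the file.
        CharStream input = CharStreams.fromPath(
                Paths.get("legacy-input.txt"), Charset.forName("Shift_JIS"));
        System.out.println(input.size());
    }
}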

@mike-lischke (Member, Author) commented Mar 24, 2017

I see there's really a lot going on here, great, with @bhamiltoncx giving 100% for a good solution. Still, to me the entire thing looks very Java-centric. Languages have streams, and ANTLR should support them (I mean their native streams) via a simple interface (CharStream, probably). The idea should be: "user, give me a stream that returns 32-bit code points, period". The user then has to take care of converting other encodings to UTF-32 if required; ANTLR cannot cope with every encoding used in this world. The factory approach doesn't allow selecting the stream class that fits a specific purpose. IMO ANTLR shouldn't care what native stream is used under the hood. This is also the most flexible approach for us target authors.

For convenience, ANTLR provides one such stream out of the box (the one that takes UTF-8). If backward compatibility is such a big concern, then we can indeed deprecate the current ANTLRInputStream class and provide a new AntlrUtf8InputStream. The new targets just rename their ANTLRInputStream classes and the older ones add a new class with that name. This would be my preferred solution here.

Just a few more thoughts for the general handling as I imagine it:

Whether a stream loads everything into memory should not be the decision of the ANTLR runtime (users might have a [native] stream interface for memory-mapped files that can deal with gigabyte-sized input, which is both unbuffered and randomly seekable). Streams can indicate whether they support random seeking. Most of the time users will probably use a string stream, which allows handling in-memory strings like a stream (some targets, like C++, already have a class for this). This approach would also end the buffered-vs-unbuffered separation, and we wouldn't need the CharStreams stuff either.

@ericvergnaud (Contributor)

@parrt not sure what I had in mind... best to forget about it :-)

@ericvergnaud (Contributor) commented Mar 24, 2017 via email
