A few updates to the Unicode documentation. #1771
parrt merged 1 commit into antlr:master from mike-lischke:master
Conversation
It should be made clear that the recommended use of CharStreams.fromPath() is a Java-only solution. The other targets just have their ANTLRInputStream class extended to support full Unicode.
@bhamiltoncx should we take a whack at proposing a similar API for other targets or leave as Java only?
Let's standardize the other runtimes if we can!
Ok, if it's not a huge hassle, could you make a PR that the @antlr/antlr-targets people can examine?
Will do! We're having a bit of a wildfire here today, so hopefully we won't
be evacuated. :)
Uhm, why do you want to add extra input stream code just for handling full Unicode? To me it makes much more sense to just have ANTLRInputStream handle everything.
Mike: The worry was that changing ANTLRInputStream's default behavior would
break backwards compatibility.
Read about that! Be safe!
Well, in the case of C++ you don't have any legacy code, so I guess it doesn't make sense for that target. To avoid breaking anything that exists, we are leaving the old streams in place for the other targets and providing a new interface to access the new Unicode functionality.
Sorry for my ignorance, but why can't you have backward compatibility and still use ANTLRInputStream? Just because this class would now allow values > 0xFFFF doesn't mean it would break any existing application. They don't use such values, so there is no behavior change for them. Sounds like fixing a problem where none is, to me.
Once they are 32-bit code points, it's no problem. The work is in properly decoding the input stream, which is encoded as UTF-x, not 32-bit ints.
@mike-lischke: So, the situation is that in Java, C#, and JavaScript, the runtime for ANTLRInputStream hands the lexer UTF-16 code units, not Unicode code points.

So, for example, the Unicode input U+1F600 (😀) when sent through ANTLRInputStream in Java, C#, or JavaScript, would actually expose two values to the lexer (\uD83D\uDE00), not one.

Changing that now would be a fairly serious backwards-compatibility issue (it was perfectly legal to write a lexer matching those two values \uD83D\uDE00).

You're correct that the C++ runtime didn't have this issue. In my analysis on #276 (comment), I confirmed that Go, Swift, and Python 3 were also OK. Later, I found Python 2 was OK on Linux, but not Mac or Windows.

Since more than one language had issues (and the values exposed to the lexer couldn't change without breaking existing grammars), we went with a separate implementation.
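To see the code unit vs. code point distinction concretely, a minimal Java illustration (plain JDK string APIs, nothing ANTLR-specific):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00";  // U+1F600 (😀) as a UTF-16 surrogate pair
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.printf("U+%X%n", s.codePointAt(0));       // U+1F600
    }
}
```

A UTF-16-based lexer sees the two `length()` units; a code-point-based lexer sees the single value.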
> So, for example, the Unicode input U+1F600 (😀) when sent through ANTLRInputStream in Java, C#, or JavaScript, would actually expose two values to the lexer (\uD83D\uDE00), not one.

Yes, sure, that's why the ANTLRInputStream must be enhanced to return the 32-bit values.

> Changing that now would be a fairly serious backwards-compatibility issue (it was perfectly legal to write a lexer for those two values \uD83D\uDE00).

Not at all, if you would add encoding support to ANTLRInputStream. You can support UTF-16 (and make that the default to stay compatible) and also support UTF-8 (what is now in CharStreams) and even UTF-32 (which doesn't require any conversion). That would be a much better implementation, because users don't have to use a new API and don't have to choose between the two just because *their input can be so or so* (that's just crazy). Seriously, I see it as unnecessarily confusing to have to use different stream implementations.
There are two encodings to think about here.

First is the input encoding: when reading streams of bytes, how do we convert those bytes to Unicode?

Second is the internal storage encoding used inside ANTLRInputStream: which Unicode encoding should we use to store the Unicode code points in memory?

ANTLRInputStream already supports many input encodings via subclasses like ANTLRFileStream.

It's true that ANTLRInputStream only supports a single internal storage encoding (UTF-16), and we could add support for other storage encodings. However, ANTLRInputStream is a very leaky abstraction, as it exposes its internals to subclasses via protected members. It is fairly common for code outside the ANTLR runtime itself to manipulate these internals via subclasses.

Changing the internal storage encoding would mean changing not just ANTLR but all external subclasses as well. Not impossible, but when I proposed it, others suggested avoiding breaking backwards compatibility by using a separate class entirely.

Ben
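As a sketch of the first kind of conversion (input encoding), the standard JDK route from bytes to UTF-16 chars; the file name here is just a placeholder:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DecodeDemo {
    public static void main(String[] args) throws IOException {
        // Decode a byte stream into chars with an explicit input encoding.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                Files.newInputStream(Paths.get("input.txt")),
                StandardCharsets.UTF_8))) {
            System.out.println(r.readLine());
        }
    }
}
```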
For long-lived libraries, it's common to change the interface and leave the old one in place; e.g., Java had a big job ahead to make its streams handle Unicode/UTF-8 properly, and so it created the Reader interface alongside the old streams. Let's just see what @bhamiltoncx comes up with for the various targets and let those target authors discuss. It sounds like there's no point in creating something for C++ as it has no legacy (yet!) :)

By the way, users will not have to choose which stream to use. For new code they just use the new interface and it does the right thing.
OK, sent out #1775 and #1776 as examples of how this would look for C# and JS, which were the other two languages besides Java whose input streams exposed UTF-16 code units to the lexer.
OK, it seems I'm fighting a losing battle here. I'm sorry, but in this case I believe you are not doing the right thing. There were so many changes since 4.5.x for which you did not create alternative classes to keep "backward compatibility". Backward compatibility is a nice thing, but you can take it too far, and I believe that happens here. Version 4.7 is not a point release. Nobody expects that you just drop it in and forget about it. People have to and will check if their software still works after an upgrade. I'm sure it's not asking too much to also check if their solution still works with the enhanced input handling. Most people won't notice anyway. Now we have half of the targets using a different input solution than the others, which makes documentation more complicated and examples harder to convert. I really disagree with the approach taken here. I'm sorry for that.
I hear you! Personally, I agree and think we should break compatibility
here as well, but I understand the concerns of Terence and Sam.
What say the other @antlr/antlr-targets authors? I think @ericvergnaud had some specific concerns. Maybe we can address those directly. The correctness and speed of the new streams appear to be good.
Ok, NOW I remember: the new stream has to use 32 bits per symbol, which literally doubles the size requirements to hold a document in memory. Given that I regularly process large corpora and hold them all in memory at once, this is a nontrivial burden. ANTLR-generated parsers are already memory pigs.

I therefore cannot simply replace ANTLRFileStream with one that takes twice as much memory. I think that the best solution is to simply add a new ANTLRFileStream32 or some such, and users that are willing to take the hit in order to do 32-bit Unicode can do so. Existing users will not experience any increase in memory, speed reduction, or backward-compatibility break. Given this, it doesn't make sense to have a new interface CharStreams.XXX; we should stick with the ctor-based mechanism and do new ANTLRFileStream32(...) or whatever.
We could keep the memory use down (or even improve it for ASCII text) in several ways without introducing a new class:

1) Use UTF-8 encoding and build a lookup table of code point byte offsets following any non-ASCII code points
2) Use UTF-16 encoding and build a lookup table of code point offsets > U+FFFF
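A rough sketch of option 2, assuming the input is already in memory (class and field names here are invented for illustration): UTF-16 storage plus a sorted table of the code-point indexes holding supplementary characters, so indexed access stays O(log n) in the number of supplementary characters rather than O(n) in the input.

```java
import java.util.Arrays;

// Sketch only: UTF-16 storage with a lookup table for code points > U+FFFF.
final class Utf16CodePointIndex {
    private final char[] units;  // UTF-16 code units
    private final int[] suppAt;  // code-point indexes of supplementary chars, ascending

    Utf16CodePointIndex(String text) {
        this.units = text.toCharArray();
        int[] tmp = new int[text.length()];
        int n = 0;
        for (int off = 0, cp = 0; off < text.length(); cp++) {
            int c = text.codePointAt(off);
            if (Character.isSupplementaryCodePoint(c)) tmp[n++] = cp;
            off += Character.charCount(c);
        }
        this.suppAt = Arrays.copyOf(tmp, n);
    }

    int codePointAt(int cpIndex) {
        int k = Arrays.binarySearch(suppAt, cpIndex);
        if (k < 0) k = -k - 1;  // count of supplementary chars before cpIndex
        return Character.codePointAt(units, cpIndex + k);
    }
}
```

For pure-ASCII input the table is empty and access is effectively O(1).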
yeah, I was wondering about encoding it, but it seems like the lookup table would either add a lot of memory or slow down random access. I think we should just introduce the new 32-bit stream.
I was thinking about how to make the input stream decode UTF-8 on the fly. No need to keep the entire document in memory then. The token stream will keep the start/stop indexes of the chars for seeking in the input. We lose the 1:1 matching between index and code point then, however.
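A sketch of what on-the-fly decoding could look like in Java using the JDK's incremental CharsetDecoder (names invented; error handling for malformed input omitted):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Sketch only: decode UTF-8 chunk by chunk instead of buffering the whole input.
final class IncrementalUtf8Decoder {
    private final ReadableByteChannel channel;
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private final ByteBuffer bytes = ByteBuffer.allocate(8192);
    private boolean eof = false;

    IncrementalUtf8Decoder(ReadableByteChannel channel) {
        this.channel = channel;
        bytes.limit(0);  // start with an empty buffer
    }

    /** Decodes the next chunk into 'chars'; returns false once input is exhausted. */
    boolean decodeChunk(CharBuffer chars) throws IOException {
        if (eof && !bytes.hasRemaining()) return false;
        bytes.compact();                    // make room for more bytes
        if (channel.read(bytes) == -1) eof = true;
        bytes.flip();
        decoder.decode(bytes, chars, eof);  // leaves partial sequences in 'bytes'
        return true;
    }
}
```

The lexer would consume chars (or code points) from each chunk through its sliding window, much as UnbufferedCharStream already does.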
Well, we also have the unbuffered stream. I'd rather not mix the two.

Also I noticed that the new CodePointCharStream can only handle UTF-8 versus any other encoding. Ben indicated that that is more or less what everyone uses. Can people comment about other locales, such as India/Japan/Laos? In other words, do people need the ability to read non-Unicode code points, possibly with some non-UTF-X encoding? People surely had file formats for Japan before Unicode came around... do we need to worry about this? I ask this out of ignorance.
It's extremely unlikely that anyone using ANTLR will be parsing non-UTF-8 input, except on Windows.

Lookup tables are fine for applications like ANTLR, since non-ASCII input is pretty unlikely.

We could also dynamically change between 8/16/32-bit storage depending on the input.
My main problem with that is I feel pretty strongly that:

1. Handling all of Unicode should be the default behavior going forward, not an opt-in.
2. Memory optimization is something we can solve without forcing users onto a different class.

I don't think introducing an ANTLRFileStream32 gets us there.
I mostly agree with 1 and 2, but I'm not sure using 50% less RAM will go out of style for me. ;) Maybe we make ANTLRFileStream / ANTLRInputStream be the 32-bit code point stuff and shift the 16-bit stuff to a new name. Getting those streams to work correctly for chars, tokens, etc. was fiddly. I hesitate to try for an all-in-one class.

I don't see how to get a single interface w/o a factory, which means changing the API from ctor style. That means C++/Python/etc. will be different from Java and the other ones we have PRs for.
I hear you on improving RAM usage! I can send a diff to update the new code point streams to use more compact storage.

I would be OK with shifting the existing 16-bit logic in the runtimes to a legacy name if that's good with everyone else. However, we'd have to rename all the subclasses as well.
Basically, my strategy would be to assume US-ASCII only, and start with 8 bits per code point, widening the storage only when the input actually requires 16 or 32 bits.
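A minimal sketch of that widening strategy, assuming the input is fully read before the stream is constructed (names invented for illustration):

```java
// Sketch only: pick the narrowest storage that fits the widest code point seen.
final class AdaptiveStorage {
    static Object store(int[] codePoints) {
        int max = 0;
        for (int cp : codePoints) max = Math.max(max, cp);
        if (max <= 0xFF) {            // ASCII/Latin-1: 1 byte per code point
            byte[] out = new byte[codePoints.length];
            for (int i = 0; i < out.length; i++) out[i] = (byte) codePoints[i];
            return out;
        }
        if (max <= 0xFFFF) {          // BMP only: 2 bytes per code point
            char[] out = new char[codePoints.length];
            for (int i = 0; i < out.length; i++) out[i] = (char) codePoints[i];
            return out;
        }
        return codePoints;            // supplementary chars: full 32 bits
    }
}
```

Since buffered streams read everything up front, the choice can be made once at construction time with no per-access cost.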
I'm not sure if this is useful, as I'm not fully across the issue, but here's a perspective from Go land. In Go all strings are UTF-8 encoded. Not surprising, since the authors of UTF-8 authored Go. The language is built around this: range somestring iterates over runes (code points), while len(somestring) and somestring[i] operate on bytes. For ASCII a rune is one byte, and it's bigger for code points beyond ASCII (look up the tech details if necessary). Strings are really stored as byte arrays, but the standard library makes it easy to treat them as sequences of runes. I don't know how Java handles this; it might get expensive if every byte needs to be boxed. Does talking about ASCII vs UTF-8 make more sense than talking about 8 or 32 bits?
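For comparison, the closest Java analogue to iterating runes is iterating code points instead of chars (String.codePoints() has been in the JDK since Java 8, and it avoids boxing by streaming ints):

```java
public class RuneAnalogue {
    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";                 // "a😀b"
        System.out.println(s.length());              // 4 UTF-16 units
        System.out.println(s.codePoints().count());  // 3 code points
        s.codePoints().forEach(cp -> System.out.printf("U+%X ", cp));
        // prints: U+61 U+1F600 U+62
    }
}
```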
@millergarym: Right. The logic to transparently choose between 8-, 16-, and 32-bit units should map naturally onto Go, since strings there are already UTF-8.
We should NOT read the entire contents of the stream. We've seen use cases where the consumer fetches a construct repeatedly from a potentially infinite stream.
…On 22 Mar 2017, Ben Hamilton (Ben Gertzfield) wrote:
> Since we read the entire contents of the stream into memory before constructing the object, we can choose the optimal width for storage once, so there would be no impact at runtime. A lookup table can be made in a number of ways to avoid perf impact.
> On 21 Mar 2017, Terence Parr wrote:
>> Wouldn't it complicate the code tremendously to switch between 3 different fields, or at least decrease performance quite a bit? The lookup table to me seems like it either costs even more memory than just having 32-bit characters, or it would drop performance from O(1) to O(n) for indexed lookup.
Sorry, but this is not unlikely at all. In banks we have all sorts of mature systems with all sorts of encodings. An ANTLR parser running on a Linux box consuming data from these is much more frequent than people might think.
We definitely do!
@ericvergnaud: I'm just talking about the default behavior of the new streams. The default input encoding is UTF-8; other encodings can still be supported.

And yes, @ericvergnaud is totally correct — I just meant it might not be as important to worry about making the non-UTF-8 code path performance-optimal. (It of course needs to be fully functional and bug-free!)
@ericvergnaud regarding reading entire input; that's exactly what we've been doing for years! hahaha |
|
@parrt: I do see @ericvergnaud's point, though — for the factory-based APIs which take in a stream, it makes sense to have a version which also takes a callback which we invoke after each read to ask the client if we should continue reading. That allows the client to parse multiple objects directly from a stream with some kind of separator (imagine a streaming JSON parser which separates JSON objects by newlines). The tricky bit there is somebody will have to cache any remaining bits after the separator. :)
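Something like this hypothetical shape, sketched in Java (none of these names exist in any ANTLR runtime; it's purely illustrative of the callback idea):

```java
import java.io.IOException;
import java.io.Reader;
import java.util.function.IntPredicate;

// Hypothetical sketch: read until the client's callback says to stop.
final class IncrementalCharSource {
    /**
     * Feeds each char to 'keepGoing'; stops at end of input or when the
     * callback returns false (e.g. on a separator). Anything read past
     * the separator must be cached by the caller for the next parse.
     */
    static String readUntil(Reader in, IntPredicate keepGoing) throws IOException {
        StringBuilder buf = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && keepGoing.test(c)) {
            buf.append((char) c);
        }
        return buf.toString();
    }
}
```

For the newline-separated JSON example, the callback would be `c -> c != '\n'`.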
The problem is that the parser does all sorts of lookahead and itself must control how much of the input to buffer, possibly the entire thing.
Sure, but the socket client can have some other kind of inexpensive delimiter. @ericvergnaud's point is that without a way for the client to say "hold off on parsing more", there's no way to implement these types of streaming parsers.
There is no guarantee that the parser does not need to see beyond the delimiter.

Whatever we have now appears to be working, unless there's a case people are simply not telling me about. The guys at Apple specifically asked me if they could do unbuffered input for reading sockets and that sort of thing, without building a tree. I made sure that is possible.
I suspect the missing connection here is that the new factory APIs always read the entire input up front, while the existing unbuffered streams do not.
ah. heh, i just found something i wrote down (in the book): 13.8 Unbuffered Character and Token Streams

```java
CharStream input = new UnbufferedCharStream(is);
CSVLexer lex = new CSVLexer(input);
// copy text out of sliding buffer and store in tokens
lex.setTokenFactory(new CommonTokenFactory(true));
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
```
c'mon!!!
@ericvergnaud what do you mean?
Per @mike-lischke, we really should make the API consistent across language targets. This would mean a factory-style interface for all targets.

@bhamiltoncx has removed my size objection to the new Unicode code point streams by making them 50% of the size they used to be. haha. kudos. It of course is using the new factory interface, though. The new interface would be described in the documentation as the way to get input into the lexer.

I think the only remaining question is what to do with the existing ANTLRInputStream and friends. Choices:

1. Keep them as the documented API and offer the factory as an alternative.
2. Deprecate them in favor of the new factory-style interface.
3. Remove them entirely, breaking backwards compatibility.

In light of the fact that the new stream will be 8-bit optimized, it seems like option number 2 is the right choice.
I like that proposal (adopting the factory-style interface, which could just be namespaced functions for C++). We can make the existing ANTLRInputStream and friends deprecated.

I agree the factory-style interface doesn't do a ton for C++ today, but it opens up possibilities for more types of input streams in the future. If we're cool with getting rid of backwards compatibility with UTF-16 input in Java, C#, and JS, I think it'd be a big step forward.
Re UTF-16: oh right. There were a number of grammars that were manually looking for UTF-16 code units in an effort to simulate 32-bit code points, right? Hmm... probably okay. I forgot to ask about different encodings, like the legacy ones Eric mentioned from bank systems. How do we incorporate encodings other than UTF-X? Will this explode the number of methods we need in the common character stream factory interface?
Right. So, we have a few options there.

Mostly done: I made the new factory APIs take an optional character-encoding argument, defaulting to UTF-8. Right now, these APIs delegate to each runtime's native decoding utilities.
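If that lands as described, Java usage would look roughly like this (a sketch against the proposed factory methods, with placeholder file names):

```java
import java.nio.charset.Charset;
import java.nio.file.Paths;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;

public class FactoryDemo {
    public static void main(String[] args) throws Exception {
        // Default input encoding: UTF-8.
        CharStream utf8 = CharStreams.fromPath(Paths.get("input.txt"));
        // Explicit legacy encoding, decoded by the JDK's own machinery.
        CharStream sjis = CharStreams.fromPath(
                Paths.get("legacy.txt"), Charset.forName("Shift_JIS"));
        System.out.println(utf8.size() + " " + sjis.size());
    }
}
```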
I see there's really a lot going on here, great, with @bhamiltoncx giving 100% for a good solution. Still, to me the entire thing looks very Java-centric. Languages have streams, and ANTLR should support those (I mean their native streams) via a simple interface (CharStream probably). The idea should be: "user, give me a stream that returns 32-bit code points to me, period". The user then has to take care of converting other encodings to UTF-32 if required; ANTLR cannot cope with all the encodings used in this world. The factory approach doesn't allow selecting a stream class that fits a specific purpose. IMO ANTLR shouldn't care what native stream is used under the hood. This is also the most flexible approach for us target authors.

For convenience, ANTLR provides one such stream out of the box (the one that takes UTF-8). If backwards compatibility is such a big concern, then we can indeed deprecate the current ANTLRInputStream class and provide a new AntlrUtf8InputStream. The new targets just rename their ANTLRInputStream classes and the older ones add a new class with that name. This would be my preferred solution here.

Just a few more thoughts on the general handling as I imagine it: whether a stream loads everything into memory should not be the decision of the ANTLR runtime (users might have a [native] stream interface for memory-mapped files that can deal with gigabyte-sized input, which is both not buffered and allows random seeking). Streams can indicate whether they support random seeking. Most of the time users will probably use a string stream, which allows handling in-memory strings like a stream (some targets like C++ already have a class for this). This approach will also end the buffered vs. unbuffered separation, and we also don't need the CharStreams stuff.
@parrt not sure what I had in mind... best to forget about it :-)
Hi,

I'm sure all of us are already comfortable with the below, but just in case, it may be worth restating the difference between character set and encoding.

ASCII, Unicode 3.0 (U+0000 to U+FFFF) and Unicode 3.1 (U+0000 to U+10FFFF) are character sets, not encodings. Technically, ANTLR is character set agnostic, i.e. it makes no assumption about the meaning of a byte or series of bytes coming from the input stream. But since Java chars were defined from Unicode 3.0 and are 16 bits wide, there is a strong inclination to use the Unicode 3.0 character set, and UTF-16, even without knowing it.

UTF-8, UTF-16 and UTF-32 are encodings of Unicode 3.1, i.e. they all support the widest character set. UTF-8 encodes each character using varying lengths (1 to 4 bytes). Great for space, horrible for string manipulation. UTF-16 uses 1 or 2 shorts. UTF-32 uses one 32-bit int. Other examples of encodings are MBCS, DBCS, Shift-JIS etc. They are ways to encode a character set, not the character set itself.

I'm personally comfortable with the idea that ANTLR itself would internally only support the Unicode 3.0 and Unicode 3.1 character sets, since Unicode 3.1 encompasses all glyphs ever imagined, and upcoming ones such as emoticons. For binary streams, character sets have no meaning, so there would be no impact.

Practically, it means that a user of the "factory" should only have to:
- specify the technical source (file, string, buffer, socket, stdin…) and its encoding (currently UTF-8 by default, but it can be specified)
- specify the character set to use internally, i.e. Unicode 3.0 (using 16-bit code points) or Unicode 3.1 (using 32-bit code points)

I'm unable to state what exact conversions are supported by each language, but I'd be surprised if, at a minimum, conversion from all widely used encodings to the language's native encoding was not supported by native utilities. In Java: InputStreamReader, in C#: StreamReader, in Python: codecs.decode, in Node: fs.readFileSync, in Swift: Utils.readFile. All of these accept an encoding as a parameter (and are currently used by ANTLR). This shows that while the conversion is specified through an ANTLR API, it is performed outside ANTLR. So I'm not sure why ANTLR would have to cope with more than the 2 proposed encodings, since the conversion itself is already handled by the native decoders.

In short, all ANTLR has to provide and maintain is conversion from a target's native encoding (coming out of the native conversion utility) to the one to be used internally, i.e. 16- or 32-bit code points. From there, it doesn't seem unachievable to expose the same API across all targets, especially now that @bhamiltoncx has done all the hard work.

I notice that Go and Cpp do not yet support file stream encoding. In the short term, I would recommend adding the parameter and throwing if it is not UTF-8. This would, I believe, clarify the requirement until conversion is supported. In Go: golang.org/x/text/encoding, in Cpp: ICU4C provide the equivalent functionality.

Eric

n.b. I also agree that incoming data should not be buffered totally.
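To make the size trade-offs concrete, a small JDK-only check of how many bytes each encoding spends per character (UTF-32 via Charset.forName, since it isn't in StandardCharsets; the BE variants avoid a BOM):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        for (String s : new String[] {"A", "é", "\uD83D\uDE00"}) {
            System.out.printf("%s: UTF-8=%d, UTF-16=%d, UTF-32=%d bytes%n",
                    s,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length,
                    s.getBytes(Charset.forName("UTF-32BE")).length);
        }
        // A: 1, 2, 4   é: 2, 2, 4   😀: 4, 4, 4
    }
}
```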