Conversation
| let rec as_obj_sub lab t = match promote t with | ||
| | Obj (s, tfs) -> s, tfs | ||
| | Array t -> as_obj_sub lab (array_obj t) | ||
| | Prim Text -> as_obj_sub lab text_obj |
I think this implicit coercion of things to objects is a wart, and I am not sure we want to extend it. Is
`for (c in t.chars)`
really so much better than
`for (c in chars(t))`?
Two reasons:
- Consistency with how the language works now, and
- The first version (with implicit object coercion) is "better" in the sense that it doesn't pollute the prelude namespace with some primitive function called `chars`, or `charsOf`, etc.
Having said that, I don't have a preference, I'm just following the conventions that I see for arrays; I'm trying to be consistent. Text is a special kind of restricted array, with no random access, just iteration. Once arrays act differently, I can follow that same pattern for Text. Why be inconsistent?
I actually would not mind changing it for arrays as well. This is motivated by the backend, which has to compile foo.bar differently from foo.length; in the latter case it must do dynamic dispatch on the kind of object on the heap. This is pretty ugly, and I wish we did not need it.
Also, it is odd to have a few built-in things work this way, without giving the user the ability to extend it likewise.
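To make the dispatch concern concrete, here is a minimal OCaml sketch of a toy value model. It is not the compiler's actual heap layout; the `value` type, the `project` function, and the `"len"` field name are all hypothetical, chosen only to show why projecting a field from an implicitly coerced array or text needs a case split on the kind of value, while an ordinary object needs only a lookup.

```ocaml
(* Toy value model; every name here is hypothetical. *)
type value =
  | Int of int
  | Text of string
  | Array of value array
  | Obj of (string * value) list

(* Field projection. Ordinary objects need a plain lookup; built-in
   kinds need a separate branch per kind, which is the dynamic
   dispatch the backend would have to emit. *)
let project (v : value) (field : string) : value =
  match v, field with
  | Obj fields, _ -> List.assoc field fields
  | Array a, "len" -> Int (Array.length a)
  | Text s, "len" -> Int (String.length s) (* byte length, for illustration only *)
  | _ -> invalid_arg ("no field " ^ field)
```

With a free function such as `chars(t)`, the operand kind is fixed by the function itself, so no such case split on the projected value would be needed.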
But this does not need to hold up this PR: You can do it this way first, and we can switch all over eventually, should we decide to do so.
OK, thanks for the explanation. I suspected that your motivation was about compilation, and that seems fairly compelling. If we can eschew the OO style in the future in favor of something with a simpler or more efficient compilation story, I agree that'd be preferable.
For this PR, what shall I do for the compilation part of this? (it's still missing here).
> For this PR, what shall I do for the compilation part of this? (it's still missing here)
Hmm, that part needs a proper utf8-decoding iterator in the backend, right?
Well, let’s leave it unimplemented in the backend for now, and make an issue for it. I might tackle that soon. It is fine if the backend lags behind the reference interpreter a bit. (You can run `make accept` to record the expected behavior with the feature not yet implemented in the backend.)
I know this is a hack, but could we piggyback AS support of UTF-8 by extending the SystemAPI with the UTF-8 support provided by V8 (or worse, JS)? I guess that might make gas accounting hard for these operations...
Shh, you said the g** word.
A utf8-decoder is not too bad, we can do that once we need it.
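As a rough illustration of how small such a decoder can be, here is a sketch in OCaml. It is not the PR's `decodeUTF8`; the names `decode_utf8_at` and `code_points` are hypothetical, and the sketch assumes well-formed input: it does not check continuation bytes, overlong encodings, or surrogates, and it raises `Invalid_argument` on a truncated sequence.

```ocaml
(* Hypothetical helper: decode the code point starting at byte offset [i].
   Returns the code point and the offset of the next one. *)
let decode_utf8_at (s : string) (i : int) : int * int =
  let byte k = Char.code s.[k] in
  (* Payload of a continuation byte: its low six bits. *)
  let cont k = byte k land 0x3f in
  let b0 = byte i in
  if b0 < 0x80 then (b0, i + 1) (* 1 byte: ASCII *)
  else if b0 land 0xe0 = 0xc0 then (* 2 bytes: 110xxxxx *)
    (((b0 land 0x1f) lsl 6) lor cont (i + 1), i + 2)
  else if b0 land 0xf0 = 0xe0 then (* 3 bytes: 1110xxxx *)
    (((b0 land 0x0f) lsl 12) lor (cont (i + 1) lsl 6) lor cont (i + 2), i + 3)
  else if b0 land 0xf8 = 0xf0 then (* 4 bytes: 11110xxx *)
    ( ((b0 land 0x07) lsl 18)
      lor (cont (i + 1) lsl 12)
      lor (cont (i + 2) lsl 6)
      lor cont (i + 3),
      i + 4 )
  else invalid_arg "decode_utf8_at: invalid leading byte"

(* Decode a whole string into its list of code points. *)
let code_points (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let cp, j = decode_utf8_at s i in
      go j (cp :: acc)
  in
  go 0 []
```

A backend version would do the same bit manipulation on linear memory; the leading byte alone determines how many bytes to consume.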
I agree that we'll probably want to reconsider the object subsumption, but for now it makes sense to treat arrays and text consistently.
|
Actually, can we have some unicode in the example? |
The reason being (as pointed out in the PR comments) the lack of an `iconv`-like library to identify Unicode code points as separate characters. I used an online converter to translate the Russian text to UTF-8 and it spat out "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82\xd1\x81\xd1\x82\xd0\xb2\xd1\x83\xd1\x8e\x2c\x20\xd0\xbc\xd0\xb8\xd1\x80\x21\x0a" which is precisely what the `.ok` files contain at the moment.
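One way to sanity-check such a byte string without an `iconv`-like library is to count code points by skipping continuation bytes (those of the form `10xxxxxx`). The OCaml sketch below does that for the byte string quoted above; `count_code_points` and `greeting` are hypothetical names, and the count is only meaningful if the input is valid UTF-8.

```ocaml
(* Count code points by counting every byte that is NOT a UTF-8
   continuation byte (continuation bytes match 10xxxxxx). *)
let count_code_points (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xc0 <> 0x80 then incr n) s;
  !n

(* The byte string quoted in the comment above. *)
let greeting =
  "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82\xd1\x81\xd1\x82\xd0\xb2\xd1\x83\xd1\x8e\x2c\x20\xd0\xbc\xd0\xb8\xd1\x80\x21\x0a"
```

Here `String.length greeting` is 32 while `count_code_points greeting` is 18, which is the gap between byte-wise and character-wise iteration that the tests need to exercise.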
|
@matthewhammer I took the liberty to update this branch a bit. Please yell if you dislike something! |
Interesting observation: UTF-8 strings made by `Text.lit` are correct, but those that come from the .as file are garbled.
We encode them as Utf-8 first, and print them as strings.
| | t -> Obj (Object Local, List.sort compare_field (immut t)) | ||
|
|
||
| let text_obj = | ||
| let immut = |
Nit: no reason for this let or for sorting the single-element list, see iter_obj above.
|
@ggreif please request a review from me when you are done and this is ready for review. |
* clean up and refactor decodeUTF8 for the compiler
* spruce up the tests a bit
@matthewhammer Text iteration over Unicode contents seems to work now, but at first glance I did not see where you perform the UTF-8 decoding to obtain the code points. I shall examine the code soon.
Edit: Never mind, found it. It is in obj_of_text and looks good.
@nomeata The compiler does not implement for (c in str.chars()) { ... yet. Maybe I can finish that tomorrow. In the meantime I'd be happy to receive comments on the decodeUTF8 functionality. It is a building block for Text iteration.
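To show how a decoder serves as the building block for iteration, here is a sketch of a `chars`-style iterator as an OCaml closure. It is an illustration only, not the compiler's code: the name `chars` mirrors the method discussed in this thread, the `unit -> Uchar.t option` shape is an assumption about what an iterator's `next` could look like, and the sketch relies on the standard library's `String.get_utf_8_uchar` (available from OCaml 4.14) rather than a hand-written decoder.

```ocaml
(* Returns a [next] function: each call yields the next code point,
   or None once the text is exhausted. Requires OCaml >= 4.14. *)
let chars (s : string) : unit -> Uchar.t option =
  let pos = ref 0 in
  fun () ->
    if !pos >= String.length s then None
    else begin
      let d = String.get_utf_8_uchar s !pos in
      (* Advance by however many bytes this code point occupied. *)
      pos := !pos + Uchar.utf_decode_length d;
      (* Invalid bytes decode to U+FFFD rather than raising. *)
      Some (Uchar.utf_decode_uchar d)
    end
```

The only state is a byte offset, which matches the earlier remark that Text supports iteration but no random access.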
src/compile.ml
| if E.mode env = DfinityMode | ||
| then | ||
| G.i Drop ^^ | ||
| Text.lit env "H" ^^ |
This needs to be implemented.
|
Did you see Andreas's cleanup of the lexer? Maybe you can steal some ideas. (didn't look at your code yet) |
Absolutely. That's why I came here, directly :-) So please disregard that complication for now. My brain was set on autopilot by the overly confused rosetta code fragments. Damn. |
(we could also consider `size`) rudimentary method impls for Text in compiler. More to come here.
Co-Authored-By: ggreif <ggreif@gmail.com>
The interpreter does it right, compiler is not done yet. That's why the test still fails. I have to figure out how to do the compiled iteration over code-points. Hopefully today, maybe tomorrow. For getting the UTF-8 bytes one could introduce a method |
|
Ah, okay. If we wanted to add "storage size" then I would more explicitly name it something like |
I am unsure about its utility, maybe we should just push len and call `Text.alloc`. (Easy enough to revert and define allocFixedLen locally.)
to cater for some microoptimisation start getting rid of `printChar`
intro `compile_bitand_const`
to use comparisons
not sure that this is more clean than the previous, though
## Changelog for candid:
Branch: master
Commits: [dfinity/candid@184078e4...b319e249](dfinity/candid@184078e...b319e24)
* [`b319e249`](dfinity/candid@b319e24) refactor type table for deserialization ([dfinity/candid#197](https://github.com/dfinity/candid/issues/197))
Support iteration, character by character, over text values.
For example:
(I also added minimal support for printing individual characters, for testing purposes.)
Not done yet: