Text iteration by matthewhammer · Pull Request #197 · caffeinelabs/motoko

matthewhammer · 2019-02-27T23:35:17Z

Support iteration, character by character, over text values.

For example:

for (c in myText.chars()) {
  printChar c;
}

(I also added minimal support for printing individual characters, for testing purposes.)

Not done yet:

Compilation of these new features
Extensive tests of these new features

nomeata · 2019-02-28T09:42:42Z

src/type.ml

 let rec as_obj_sub lab t = match promote t with
  | Obj (s, tfs) -> s, tfs
  | Array t -> as_obj_sub lab (array_obj t)
+  | Prim Text -> as_obj_sub lab text_obj


I think this implicit coercion of things to objects is a wart, and I am not sure we want to extend it. Is `for (c in t.chars)` really so much better than `for (c in chars(t))`?

Two reasons:

1. Consistency with how the language works now, and
2. The first version (with implicit object coercion) is "better" in the sense that it doesn't pollute the prelude namespace with some primitive function called chars, or charsOf, etc.

Having said that, I don't have a preference, I'm just following the conventions that I see for arrays; I'm trying to be consistent. Text is a special kind of restricted array, with no random access, just iteration. Once arrays act differently, I can follow that same pattern for Text. Why be inconsistent?

I actually would not mind changing it for arrays as well. This is motivated by the backend, which has to compile foo.bar differently from foo.length; in the latter case it must do dynamic dispatch on the kind of object on the heap. This is pretty ugly, and I wish we would not need it.

Also, it is odd to have a few built-in things work this way without giving the user the ability to extend it likewise.

But this does not need to hold up this PR: You can do it this way first, and we can switch all over eventually, should we decide to do so.

OK, thanks for the explanation. I suspected that your motivation was about compilation, and that seems fairly compelling. If we can eschew the OO style in the future in favor of something with a simpler or more efficient compilation story, I agree that'd be preferable.

For this PR, what shall I do for the compilation part of this? (it's still missing here).

Hmm, that part needs a proper utf8-decoding iterator in the backend, right?

Well, let’s leave it unimplemented in the backend for now, and make an issue for it. I might tackle that soon. It is fine if the backend lags behind the reference interpreter a bit. (You can run make accept to record the expected behavior with the feature not yet implemented in the backend.)

I know this is a hack, but could we piggyback AS support of UTF-8 by extending the SystemAPI with the UTF-8 support provided by V8 (or, worse, JS)? I guess that might make gas accounting hard for these operations...

Ssh, you said the g** word.

A utf8-decoder is not too bad, we can do that once we need it.

@crusso @nomeata I am working on the UTF-8 decoder. However, while testing, I found that both Text and Char are wrongly encoded when non-ASCII. Curiously, strings encoded by Text.lit env "<some Unicode>" work just fine. I'll submit a fix for both separately.

nomeata · 2019-02-28T14:56:26Z

Actually, can we have some unicode in the example?

The reason is (as pointed out in the PR comments) the lack of an `iconv`-like library to identify Unicode code points as separate characters. I used an online converter to translate the Russian text to UTF-8 and it spewed out "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82\xd1\x81\xd1\x82\xd0\xb2\xd1\x83\xd1\x8e\x2c\x20\xd0\xbc\xd0\xb8\xd1\x80\x21\x0a" which is precisely what the `.ok` files contain atm.
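For readers who want to check those bytes by hand, here is a minimal OCaml sketch of UTF-8 decoding applied to the same byte string. It is not the `decodeUTF8` code from this PR; the names `decode_at`, `code_points`, and `sample` are made up for illustration, and the decoder assumes well-formed input (it does not validate continuation bytes or reject overlong encodings).

```ocaml
(* Illustrative only: not the PR's decodeUTF8. Assumes well-formed UTF-8. *)

(* Decode the code point that starts at byte offset [i].
   Returns the code point and the number of bytes it occupies. *)
let decode_at (s : string) (i : int) : int * int =
  let byte k = Char.code s.[i + k] in
  (* low six bits of the k-th byte of the sequence *)
  let cont k = byte k land 0x3f in
  let b0 = byte 0 in
  if b0 < 0x80 then (b0, 1)
  else if b0 < 0xc0 then invalid_arg "unexpected continuation byte"
  else if b0 < 0xe0 then (((b0 land 0x1f) lsl 6) lor cont 1, 2)
  else if b0 < 0xf0 then
    (((b0 land 0x0f) lsl 12) lor (cont 1 lsl 6) lor cont 2, 3)
  else if b0 < 0xf8 then
    (((b0 land 0x07) lsl 18) lor (cont 1 lsl 12) lor (cont 2 lsl 6) lor cont 3, 4)
  else invalid_arg "invalid leading byte"

(* Decode a whole string into a list of code points. *)
let code_points (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let cp, width = decode_at s i in
      go (i + width) (cp :: acc)
  in
  go 0 []

(* The byte string quoted in the comment above. *)
let sample =
  "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82\xd1\x81\xd1\x82"
  ^ "\xd0\xb2\xd1\x83\xd1\x8e\x2c\x20\xd0\xbc\xd0\xb8\xd1\x80\x21\x0a"

let () =
  Printf.printf "%d bytes, %d code points\n"
    (String.length sample)
    (List.length (code_points sample))
```

The sample is 32 bytes long but decodes to 18 code points, which is why byte-wise iteration garbles it while code-point iteration does not.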

ggreif · 2019-03-19T11:17:08Z

@matthewhammer I took the liberty to update this branch a bit. Please yell if you dislike something!

rossberg

For the compiler parts I'll have to defer to @nomeata, the others LGTM.

rossberg · 2019-03-21T11:37:46Z

src/type.ml

  | t -> Obj (Object Local, List.sort compare_field (immut t))

+let text_obj =
+  let immut =


Nit: no reason for this let or for sorting the single-element list, see iter_obj above.

rossberg · 2019-03-21T11:39:00Z

src/type.ml

 let rec as_obj_sub lab t = match promote t with
  | Obj (s, tfs) -> s, tfs
  | Array t -> as_obj_sub lab (array_obj t)
+  | Prim Text -> as_obj_sub lab text_obj


I agree that we'll probably want to reconsider the object subsumption, but for now it makes sense to treat arrays and text consistently.

nomeata · 2019-03-21T11:48:11Z

@ggreif please request a review from me when you are done and this is ready for review.

ggreif

@matthewhammer Text iteration over Unicode contents seems to work now, but at first glance I did not see where you perform the UTF-8 decoding to obtain the code points. I shall examine the code soon.

Edit: Never mind, found it. It is in obj_of_text and looks good.

@nomeata The compiler does not implement for (c in str.chars()) { ... yet. Maybe I can finish that tomorrow. In the meantime I'd be happy to receive comments on the decodeUTF8 functionality. It is a building block for Text iteration.
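As a rough picture of how a decoder can serve as the building block for iteration, the sketch below wraps one in a `next`-style iterator. It is an illustration only, not the code in this PR: the function name `chars` is borrowed from the surface syntax, and instead of `decodeUTF8` it relies on `String.get_utf_8_uchar` from the OCaml standard library, which assumes OCaml 4.14 or later.

```ocaml
(* Illustrative only. Requires OCaml >= 4.14 for String.get_utf_8_uchar. *)

(* Hypothetical counterpart of `text.chars()`: returns a `next` function
   that yields one code point per call and None once the text is exhausted.
   Malformed input is not rejected; the stdlib decoder substitutes U+FFFD. *)
let chars (s : string) : unit -> Uchar.t option =
  let pos = ref 0 in
  fun () ->
    if !pos >= String.length s then None
    else begin
      let d = String.get_utf_8_uchar s !pos in
      (* advance by the number of bytes the decoder consumed *)
      pos := !pos + Uchar.utf_decode_length d;
      Some (Uchar.utf_decode_uchar d)
    end

(* Usage: drive the iterator the way a `for (c in t.chars())` loop would. *)
let () =
  let next = chars "h\xc3\xa9!" in
  let rec loop () =
    match next () with
    | None -> ()
    | Some u ->
      Printf.printf "U+%04X\n" (Uchar.to_int u);
      loop ()
  in
  loop ()
```

The iterator keeps only a byte offset as state, which matches the earlier remark that text offers no random access, just iteration.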

ggreif · 2019-03-21T19:10:04Z

src/compile.ml

+    if E.mode env = DfinityMode
+    then
+      G.i Drop ^^
+      Text.lit env "H" ^^


This needs to be implemented.

nomeata · 2019-03-22T07:46:54Z

Did you see Andreas's cleanup of the lexer? Maybe you can steal some ideas. (didn't look at your code yet)

ggreif · 2019-03-22T08:04:50Z

Absolutely. That's why I came here, directly :-)

So please disregard that complication for now. My brain was set on autopilot by the overly confused rosetta code fragments. Damn.

ggreif · 2019-03-22T12:30:29Z

Wouldn't this be the length in UTF-8 bytes? If we provide text length then it would have to be the number of code points, since everything else is an implementation detail. Though I'm not sure it's even needed. What's the use case?

The interpreter does it right, compiler is not done yet. That's why the test still fails. I have to figure out how to do the compiled iteration over code-points. Hopefully today, maybe tomorrow.

For getting the UTF-8 bytes one could introduce a method size or storage_size.

rossberg · 2019-03-22T12:37:55Z

Ah, okay. If we wanted to add "storage size" then I would more explicitly name it something like utf8_len. But the question is why provide it -- and why not also utf16_len, ucs2_len, etc. I think we should leave it out until there is a clear use case.
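To make the two notions of length in this exchange concrete, here is a small OCaml sketch. The names `utf8_len` and `code_point_count` are hypothetical (the first simply echoes the name suggested above), and the counting trick assumes the input is well-formed UTF-8.

```ocaml
(* Illustrative only: hypothetical helpers, not part of this PR. *)

(* "Storage size": the number of bytes in the UTF-8 encoding. *)
let utf8_len (s : string) : int = String.length s

(* Length in code points, assuming [s] is well-formed UTF-8:
   count every byte that is not a continuation byte (bit pattern 10xxxxxx). *)
let code_point_count (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xc0 <> 0x80 then incr n) s;
  !n

let () =
  (* Three Cyrillic letters, two bytes each in UTF-8. *)
  let s = "\xd0\xbc\xd0\xb8\xd1\x80" in
  Printf.printf "bytes: %d, code points: %d\n" (utf8_len s) (code_point_count s)
```

For ASCII-only text the two numbers agree, which is one reason the difference is easy to overlook in tests that contain no Unicode.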

text iteration

73f6d96

matthewhammer requested review from crusso, nomeata and rossberg February 27, 2019 23:35

clean and expand test file

b7a3eb7

nomeata reviewed Feb 28, 2019

ggreif added 3 commits March 19, 2019 10:56

Merge remote-tracking branch 'origin/master' into text-iter

fe4e324

WIP: compile printChar provisionally

a22e618

ggreif added 3 commits March 20, 2019 22:55

WIP: play with UTF-8 char decoding

6d45431

Interesting observation: UTF-8 strings made by `Text.lit` are correct, but those that come from the .as file are garbled.

fix character SR, accept

052d5e2

allow printing of Unicode characters

713d8aa

We encode them as Utf-8 first, and print them as strings.

rossberg reviewed Mar 21, 2019

ggreif added 3 commits March 21, 2019 17:36

Merge branch 'master' into text-iter

53e0bd2

renix

fb6e872

implement decodeUTF8 for the interpreter

12cf851

* clean up and refactor decodeUTF8 for the compiler * spruce up the tests a bit

ggreif reviewed Mar 21, 2019

ggreif requested a review from nomeata March 21, 2019 19:20

remove one kludge

e77e404

nomeata reviewed Mar 22, 2019

ggreif and others added 3 commits March 22, 2019 13:16

intro len method for Text

3743c39

(we could also consider `size`) rudimentary method impls for Text in compiler. More to come here.

review Feedback

0670e2e

Co-Authored-By: ggreif <ggreif@gmail.com>

follow-up

3e9fb8b

ggreif and others added 6 commits March 25, 2019 11:32

refactor

7714ad4

shift-related refactoring

d62e5e9

feedback

2e41bb0

Co-Authored-By: ggreif <ggreif@gmail.com>

renamings

000ee13

define and use allocFixedLen

fdf9204

I am unsure about its utility, maybe we should just push len and call `Text.alloc`. (Easy enough to revert and define allocFixedLen locally.)

refactor

9de17c7

ggreif reviewed Mar 25, 2019

ggreif added 4 commits March 25, 2019 14:15

intro and use Text.unskewed_payload_offset

cd6f8ac

to cater for some microoptimisation start getting rid of `printChar`

move prim_showChar to Text module

aa98b66

remodel printChar to call new showChar primitive

4c15850

better naming

823b6ab

nomeata reviewed Mar 25, 2019

ggreif added 10 commits March 25, 2019 16:04

refactoring, use helpers where possible

cdecce8

intro `compile_bitand_const`

tweak of the char_length_of_UTF8 comment

f127ff6

simplifications

910da2c

review feedback

ce32ccf

rewrite len_UTF8_head

8776e15

to use comparisons

keep the utf-8 byte local to len_UTF8_head

f6cd812

add summary

dbe1f4c

review feedback

7b63466

not sure that this is more clean than the previous, though

add Char <--> Text tests

dca07fd

revert fixed allocation, not worth it

fa32164

nomeata approved these changes Mar 26, 2019

Merge remote-tracking branch 'origin/master' into text-iter

b31b81f

ggreif merged commit 11cfb7d into master Mar 26, 2019

ggreif deleted the text-iter branch March 26, 2019 13:04

ggreif mentioned this pull request Mar 26, 2019

Eliminators for tagged heap objects #273

Merged

ggreif changed the title ~~WIP: Text iteration~~ Text iteration Mar 26, 2019

5 participants