Skip to content

Text iteration#197

Merged
ggreif merged 42 commits intomasterfrom
text-iter
Mar 26, 2019
Merged

Text iteration#197
ggreif merged 42 commits intomasterfrom
text-iter

Conversation

@matthewhammer
Copy link
Contributor

Support iteration, character by character, over text values.

For example:

for (c in myText.chars()) {
  printChar c;
}

(I also added minimal support for printing individual characters, for testing purposes.)

Not done yet:

  • Compilation of these new features
  • Extensive tests of these new features

let rec as_obj_sub lab t = match promote t with
| Obj (s, tfs) -> s, tfs
| Array t -> as_obj_sub lab (array_obj t)
| Prim Text -> as_obj_sub lab text_obj
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this implicit coercion of things to objects is a wart, and I am not sure if we want to extend it. Is

for (c in t.chars)

really so much better than

for (c in chars(t))

Copy link
Contributor Author

@matthewhammer matthewhammer Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two reasons:

  • Consistency with how the language works now, and
  • The first version (with implicit object coercion) is "better" in the sense that the first version doesn't pollute the prelude namespace with some primitive function called chars, or charsOf, etc.

Having said that, I don't have a preference, I'm just following the conventions that I see for arrays; I'm trying to be consistent. Text is a special kind of restricted array, with no random access, just iteration. Once arrays act differently, I can follow that same pattern for Text. Why be inconsistent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually would not mind changing it for arrays as well. This is motivated by the backend, that has to compile foo.bar differently from foo.length; in the latter case it must do dynamic dispatch on the kind of object on the heap. This is pretty ugly, and I wish we would not need it.

Also, it is odd to have a few built-in thing this way, without giving the user the ability to extend it likewise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this does not need to hold up this PR: You can do it this way first, and we can switch all over eventually, should we decide to do so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks for the explanation. I suspected that your motivation was about compilation, and that seems fairly compelling. If we can eschew the OO style in the future in favor of something with a simpler or more efficient compilation story, I agree that'd be preferable.

For this PR, what shall I do for the compilation part of this? (it's still missing here).

Copy link
Contributor

@nomeata nomeata Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR, what shall I do for the compilation part of this? (it's still missing here)

Hmm, that part needs a proper utf8-decoding iterator in the backend, right?

Well, let’s leave it unimplemented in the backend for now, and make an issue for it. I might tackle that then soon. It is fine if the backend lags behind the reference interpreter a bit. (You can run make accept to record the expected behavior with the feature not yet implemented in the backend.)

Copy link
Contributor

@crusso crusso Mar 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is a hack, but could we piggyback AS support of UTF-8 by extending the SystemAPI with the UTF-8 support provided by V8 (or worse JS). I guess that might make gas accounting hard for these operations...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ssh, you said the g** word.

A utf8-decoder is not too bad, we can do that once we need it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crusso @nomeata I am working on the UTF-8 decoder. However, while testing, I found that both Text and Char are wrongly encoded when non-ASCII. Curiously, strings encoded by Text.lit env "<some Unicode>" work just fine. I'll submit a fix for both separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we'll probably want to reconsider the object subsumption, but for now it makes sense to treat arrays and text consistently.

@nomeata
Copy link
Contributor

nomeata commented Feb 28, 2019

Actually, can we have some unicode in the example?

ggreif added 3 commits March 19, 2019 10:56
The reason being (as pointed out in the PR comments), the lack of
`iconv`-like library to identify Unicode code points, as separate
characters.

I used an online converter to translate the russian text to UTF-8
and it spew out "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82\xd1\x81\xd1\x82\xd0\xb2\xd1\x83\xd1\x8e\x2c\x20\xd0\xbc\xd0\xb8\xd1\x80\x21\x0a"
which is precisely what the `.ok` files contain atm.
@ggreif
Copy link
Contributor

ggreif commented Mar 19, 2019

@matthewhammer I took the liberty to update this branch a bit. Please yell if you dislike something!

ggreif added 3 commits March 20, 2019 22:55
Interesting observation: UTF-8 strings made by `Text.lit` are correct,
but those that come from the .as file are garbled.
We encode them as Utf-8 first, and print them as strings.
Copy link
Contributor

@rossberg rossberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the compiler parts I'll have to defer to @nomeata, the others LGTM.

| t -> Obj (Object Local, List.sort compare_field (immut t))

let text_obj =
let immut =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: no reason for this let or for sorting the single-element list, see iter_obj above.

let rec as_obj_sub lab t = match promote t with
| Obj (s, tfs) -> s, tfs
| Array t -> as_obj_sub lab (array_obj t)
| Prim Text -> as_obj_sub lab text_obj
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we'll probably want to reconsider the object subsumption, but for now it makes sense to treat arrays and text consistently.

@nomeata
Copy link
Contributor

nomeata commented Mar 21, 2019

@ggreif please request a review from me when you are done and this is ready for review.

ggreif added 3 commits March 21, 2019 17:36
* clean up and refactor decodeUTF8 for the compiler
* spruce up the tests a bit
Copy link
Contributor

@ggreif ggreif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthewhammer Text iteration over Unicode contents seem to work now, but at first glance I did not see where you perform the Utf-8 decoding to obtain the code points. I shall examine the code soon.

Edit: Never mind, found it. It is in obj_of_text and looks good.

@nomeata The compiler does not implement for (c in str.chars()) { ... yet. Maybe I can finish that tomorrow. In the meantime I'd be happy to receive comments on the decodeUTF8 functionality. It is a building block for Text iteration.

src/compile.ml Outdated
if E.mode env = DfinityMode
then
G.i Drop ^^
Text.lit env "H" ^^
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be implemented.

@ggreif ggreif requested a review from nomeata March 21, 2019 19:20
@nomeata
Copy link
Contributor

nomeata commented Mar 22, 2019

Did you see Andreas cleanup of the lexer? Maybe you can steal some ideas. (didn't look at your code yet)

@ggreif
Copy link
Contributor

ggreif commented Mar 22, 2019

Did you see Andreas cleanup of the lexer? Maybe you can steal some ideas. (didn't look at your code yet)

Absolutely. That's why I came here, directly :-)

So please disregard that complication for now. My brain was set on autopilot by the overly confused rosetta code fragments. Damn.

ggreif and others added 3 commits March 22, 2019 13:16
(we could also consider `size`)
rudimentary method impls for Text in compiler.
More to come here.
Co-Authored-By: ggreif <ggreif@gmail.com>
@ggreif
Copy link
Contributor

ggreif commented Mar 22, 2019

Wouldn't this be the length in UTF-8 bytes? If we provide text length then it would have to be the number of code points, since everything else is an implementation detail. Though I'm not sure it's even needed. What's the use case?

The interpreter does it right, compiler is not done yet. That's why the test still fails. I have to figure out how to do the compiled iteration over code-points. Hopefully today, maybe tomorrow.

For getting the UTF-8 bytes one could introduce a method size or storage_size.

@rossberg
Copy link
Contributor

Ah, okay. If we wanted to add "storage size" then I would more explicitly name it something like utf8_len. But the question is why provide it -- and why not also utf16_len, ucs2_len, etc. I think we should leave it out until there is a clear use case.

ggreif and others added 6 commits March 25, 2019 11:32
Co-Authored-By: ggreif <ggreif@gmail.com>
I am unsure about its utility, maybe we should just push len and call `Text.alloc`.

(Easy enough to revert and define allocFixedLen locally.)
@ggreif ggreif merged commit 11cfb7d into master Mar 26, 2019
@ggreif ggreif deleted the text-iter branch March 26, 2019 13:04
@ggreif ggreif changed the title WIP: Text iteration Text iteration Mar 26, 2019
dfinity-bot added a commit that referenced this pull request Mar 11, 2021
## Changelog for candid:
Branch: master
Commits: [dfinity/candid@184078e4...b319e249](dfinity/candid@184078e...b319e24)

* [`b319e249`](dfinity/candid@b319e24) refactor type table for deserialization ([dfinity/candid⁠#197](http://r.duckduckgo.com/l/?uddg=https://github.com/dfinity/candid/issues/197))
mergify bot pushed a commit that referenced this pull request Mar 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants