Add string.toByte() and string.toCodepoint() #13553

dmk42 · 2019-07-26T00:09:03Z

The deprecated ascii() function was replaced by string.byte(1), which provided extra functionality by varying its argument, but looked a little heavy-weight for the purposes for which ascii() was used. This change adds string.toByte() as a better direct replacement for ascii() (while still retaining the more general string.byte()). This change also adds string.toCodepoint() as the codepoint analog.

The key difference between ascii() and string.toByte() is that the latter will issue an error message if the string is not exactly one byte long. Likewise, string.toCodepoint() will issue an error message if the string is not exactly one codepoint long.

Some existing code needed a param version of ascii(), so a param version of string.toByte() has been created as well. There has not yet been a need for a param string.toCodepoint(), but that could be added in the future. (See https://github.com/Cray/chapel-private/issues/348 .)

Existing Chapel code that had been changed from using ascii() to string.byte(1) is changed here to use string.toByte() wherever it was operating on a string that was required to be one byte long.

Some tests have been added to fill in gaps.

At some point, the String module should be migrated to use HaltWrappers. However, HaltWrappers are not yet able to handle some of the error messages in String, so the migration has not occurred yet. Consequently, this change stays consistent with the rest of String by using halt() and leaves the migration to a future PR. (See https://github.com/Cray/chapel-private/issues/349 .)

Note: modules/packages/LAPACK.chpl is a very large diff. The only change to that file is to change all occurrences of .byte(1) to .toByte(). These were all former uses of ascii().

Passed full local and full GASNet testing on linux64.
Passed Cray XC testing (particularly LAPACK) as well as --fast.
Retested after review.

Closes #13167
Closes https://github.com/Cray/chapel-private/issues/80

dmk42 · 2019-07-26T00:42:08Z

@mppf - Would you mind reviewing this?

cassella · 2019-07-26T05:48:41Z

modules/internal/String.chpl

+      var nbytes: c_int;
+      var multibytes = localThis.buff: c_string;
+      var maxbytes = localThis.len: ssize_t;
+      qio_decode_char_buf(cp, nbytes, multibytes, maxbytes);


OOC, why isn't the return value of qio_decode_char_buf() checked here or in the rest of the file? It looks like it has error cases that can leave nbytes looking reasonable.

In expected operation with UTF8 strings (which is all this is supposed to handle), it will return an error code and set the codepoint to the unicode replacement character. That's reasonable in some ways. But beyond that, the design is that a string can only hold valid UTF-8, so if there is an invalid UTF-8 character in it, that should have been raised as an error somewhere in the process of creating the string.

I thought about checking the return code, but I ended up not doing it because of the reasons Michael gave, and I convinced myself that nbytes would return what I wanted even in an error condition.

mppf

Looks good, thanks! I have some minor requests.

mppf · 2019-07-26T13:47:42Z

compiler/resolution/postFold.cpp

@@ -410,7 +417,14 @@ static Expr* postFoldPrimop(CallExpr* call) {

        INT_ASSERT(ie && ie->symbol()->isParameter() && get_int(ie, &val));


Ugh, get_int is computing something that is used later in the function in the control flow, but it's being called from an INT_ASSERT. That's bad because one might reasonably assume that the INT_ASSERT can be removed at only the cost of less error checking.

Could you move the get_int call to a separate line?

Could you move the get_int call to a separate line?

Done.

mppf · 2019-07-26T13:51:20Z

modules/internal/String.chpl

+      var nbytes: c_int;
+      var multibytes = localThis.buff: c_string;
+      var maxbytes = localThis.len: ssize_t;
+      qio_decode_char_buf(cp, nbytes, multibytes, maxbytes);


In expected operation with UTF8 strings (which is all this is supposed to handle), it will return an error code and set the codepoint to the unicode replacement character. That's reasonable in some ways. But beyond that, the design is that a string can only hold valid UTF-8, so if there is an invalid UTF-8 character in it, that should have been raised as an error somewhere in the process of creating the string.

mppf · 2019-07-26T13:55:52Z

test/types/string/dmk/oobbyte2.prediff

@@ -0,0 +1,4 @@
+#! /bin/bash
+
+sed -e 's/^.*String\.chpl:[0-9]*: //' <$2 >$2.tmp


It seems that these error line numbers should be in the user's code. I believe you can achieve that with pragma "insert line file info" pragma "always propagate line file info" on the function in question but this case might need some other handling to do with the primitive error handling. But couldn't we make the compiler code with the primitive raise only internal errors and make the user-facing compilerErrors occur in the modules code implementing ascii and toByte ? It looks like the -1 case which is handled that way is giving the right line number for the error...

Unfortunately, the user errors must remain within the primitive because we don't know a string's length in param Chapel code.

I tried the pragmas, but they don't seem to do anything in this case.

we don't know a string's length in param Chapel code.

I don't think that's right.

primitive.cpp shows "string_length" next to "ascii", and in fact it is implemented in postFold just above ascii.

And this program works for me:

param len = "aa".length; var tup:len*int; writeln(len); writeln(tup);

Oh, that's very strange. I can't use string.len in param code, but it does accept string.length which is a proc that references string.len. I will see what I can do with this.

I see what's going on. There is a separate param string.length that calls the primitive.

I was trying to avoid string.length in case its meaning changes to count codepoints, which would break this code.

You could call the primitive directly. It's likely that we'll need different primitives for the codepoint vs bytes cases, once we get the compiler working with those.

OK, fixed. Thanks. I will retest today and merge this weekend.

mppf · 2019-07-26T13:57:28Z

test/types/string/dmk/oobcodepoint.suppressif

+# --fast sets --no-checks and this test is explicitly checking that one
+# of our runtime checks is working correctly. There's been debate as to
+# whether tests like this should be suppressed or skipped. For now we
+# suppress with the argument that we're testing that --fast actually


I don't think this argument actually explains why you should suppressif it. I think you should skipif it. I think we should use suppressif for cases that we expect to be resolved on their own at some point (e.g. backend C compiler issues) and .futures for things that will only resolve with some effort on our part. But this case seems to be neither of those - I don't have any expectation that --fast will ever stop throwing --no-checks.

All that said it doesn't bother me if you want to keep this a .suppressif for some reason. I just don't understand the reason to do so from what is written here.

Actually, I don't care which way it goes myself. I copied this suppressif file byte-for-byte from one of Brad's, which is how I decided to suppress instead of skip. I think I'll leave it as a suppression only because it came from Brad. I don't mind if the whole issue gets decided differently later, in which case several suppressif files in various parts of the tests will need to be changed to skipifs.

dmk42 added 4 commits July 23, 2019 13:41

Add string.toByte() and string.toCodepoint()

cf40b6d

extra test

9fa40f6

extra comments, fixed .good files

760a520

improve toCodepoint

fb7fd1c

cassella reviewed Jul 26, 2019

View reviewed changes

mppf approved these changes Jul 26, 2019

View reviewed changes

dmk42 added 2 commits July 26, 2019 10:19

address review comments

1fdca2a

address more review comments

fefb603

dmk42 mentioned this pull request Jul 26, 2019

Add bytes type #13519

Merged

dmk42 added the stat: Ready to Merge label Jul 26, 2019

dmk42 merged commit c5aeaa5 into chapel-lang:master Jul 28, 2019

dmk42 deleted the tobyte branch August 6, 2019 22:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add string.toByte() and string.toCodepoint() #13553

Add string.toByte() and string.toCodepoint() #13553

dmk42 commented Jul 26, 2019 •

edited

Loading

dmk42 commented Jul 26, 2019

cassella Jul 26, 2019

mppf Jul 26, 2019

dmk42 Jul 26, 2019

mppf left a comment

mppf Jul 26, 2019

dmk42 Jul 26, 2019

mppf Jul 26, 2019

mppf Jul 26, 2019

dmk42 Jul 26, 2019

mppf Jul 26, 2019

dmk42 Jul 26, 2019

dmk42 Jul 26, 2019

mppf Jul 26, 2019

dmk42 Jul 26, 2019

mppf Jul 26, 2019

dmk42 Jul 26, 2019

		@@ -410,7 +417,14 @@ static Expr* postFoldPrimop(CallExpr* call) {

		INT_ASSERT(ie && ie->symbol()->isParameter() && get_int(ie, &val));

		@@ -0,0 +1,4 @@
		#! /bin/bash

		sed -e 's/^.String\.chpl:[0-9]: //' <$2 >$2.tmp

Add string.toByte() and string.toCodepoint() #13553

Add string.toByte() and string.toCodepoint() #13553

Conversation

dmk42 commented Jul 26, 2019 • edited Loading

dmk42 commented Jul 26, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mppf left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dmk42 commented Jul 26, 2019 •

edited

Loading