Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add string.toByte() and string.toCodepoint() #13553

Merged
merged 6 commits into from
Jul 28, 2019
Merged

Add string.toByte() and string.toCodepoint() #13553

merged 6 commits into from
Jul 28, 2019

Conversation

dmk42
Copy link
Contributor

@dmk42 dmk42 commented Jul 26, 2019

The deprecated ascii() function was replaced by string.byte(1), which provided extra functionality by varying its argument, but looked a little heavy-weight for the purposes for which ascii() was used. This change adds string.toByte() as a better direct replacement for ascii() (while still retaining the more general string.byte()). This change also adds string.toCodepoint() as the codepoint analog.

The key difference between ascii() and string.toByte() is that the latter will issue an error message if the string is not exactly one byte long. Likewise, string.toCodepoint() will issue an error message if the string is not exactly one codepoint long.

Some existing code needed a param version of ascii(), so a param version of string.toByte() has been created as well. There has not yet been a need for a param string.toCodepoint(), but that could be added in the future. (See https://github.com/Cray/chapel-private/issues/348 .)

Existing Chapel code that had been changed from using ascii() to string.byte(1) is changed here to use string.toByte() wherever it was operating on a string that was required to be one byte long.

Some tests have been added to fill in gaps.

At some point, the String module should be migrated to use HaltWrappers. However, HaltWrappers are not yet able to handle some of the error messages in String, so the migration has not occurred yet. Consequently, this change stays consistent with the rest of String by using halt() and leaves the migration to a future PR. (See https://github.com/Cray/chapel-private/issues/349 .)

Note: modules/packages/LAPACK.chpl is a very large diff. The only change to that file is to change all occurrences of .byte(1) to .toByte(). These were all former uses of ascii().

Passed full local and full GASNet testing on linux64.
Passed Cray XC testing (particularly LAPACK) as well as --fast.
Retested after review.

Closes #13167
Closes https://github.com/Cray/chapel-private/issues/80

@dmk42
Copy link
Contributor Author

dmk42 commented Jul 26, 2019

@mppf - Would you mind reviewing this?

var nbytes: c_int;
var multibytes = localThis.buff: c_string;
var maxbytes = localThis.len: ssize_t;
qio_decode_char_buf(cp, nbytes, multibytes, maxbytes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOC, why isn't the return value of qio_decode_char_buf() checked here or in the rest of the file? It looks like it has error cases that can leave nbytes looking reasonable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In expected operation with UTF8 strings (which is all this is supposed to handle), it will return an error code and set the codepoint to the unicode replacement character. That's reasonable in some ways. But beyond that, the design is that a string can only hold valid UTF-8, so if there is an invalid UTF-8 character in it, that should have been raised as an error somewhere in the process of creating the string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about checking the return code, but I ended up not doing it because of the reasons Michael gave, and I convinced myself that nbytes would return what I wanted even in an error condition.

Copy link
Member

@mppf mppf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks! I have some minor requests.

@@ -410,7 +417,14 @@ static Expr* postFoldPrimop(CallExpr* call) {

INT_ASSERT(ie && ie->symbol()->isParameter() && get_int(ie, &val));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, get_int is computing something that is used later in the function in the control flow, but it's being called from an INT_ASSERT. That's bad because one might reasonably assume that the INT_ASSERT can be removed at only the cost of less error checking.

Could you move the get_int call to a separate line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move the get_int call to a separate line?

Done.

var nbytes: c_int;
var multibytes = localThis.buff: c_string;
var maxbytes = localThis.len: ssize_t;
qio_decode_char_buf(cp, nbytes, multibytes, maxbytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In expected operation with UTF8 strings (which is all this is supposed to handle), it will return an error code and set the codepoint to the unicode replacement character. That's reasonable in some ways. But beyond that, the design is that a string can only hold valid UTF-8, so if there is an invalid UTF-8 character in it, that should have been raised as an error somewhere in the process of creating the string.

@@ -0,0 +1,4 @@
#! /bin/bash

sed -e 's/^.*String\.chpl:[0-9]*: //' <$2 >$2.tmp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that these error line numbers should be in the user's code. I believe you can achieve that with pragma "insert line file info" pragma "always propagate line file info" on the function in question but this case might need some other handling to do with the primitive error handling. But couldn't we make the compiler code with the primitive raise only internal errors and make the user-facing compilerErrors occur in the modules code implementing ascii and toByte ? It looks like the -1 case which is handled that way is giving the right line number for the error...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, the user errors must remain within the primitive because we don't know a string's length in param Chapel code.

I tried the pragmas, but they don't seem to do anything in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't know a string's length in param Chapel code.

I don't think that's right.

primitive.cpp shows "string_length" next to "ascii", and in fact it is implemented in postFold just above ascii.

And this program works for me:

param len = "aa".length;

var tup:len*int;
writeln(len);
writeln(tup);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's very strange. I can't use string.len in param code, but it does accept string.length which is a proc that references string.len. I will see what I can do with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what's going on. There is a separate param string.length that calls the primitive.

I was trying to avoid string.length in case its meaning changes to count codepoints, which would break this code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could call the primitive directly. It's likely that we'll need different primitives for the codepoint vs bytes cases, once we get the compiler working with those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fixed. Thanks. I will retest today and merge this weekend.

# --fast sets --no-checks and this test is explicitly checking that one
# of our runtime checks is working correctly. There's been debate as to
# whether tests like this should be suppressed or skipped. For now we
# suppress with the argument that we're testing that --fast actually
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this argument actually explains why you should suppressif it. I think you should skipif it. I think we should use suppressif for cases that we expect to be resolved on their own at some point (e.g. backend C compiler issues) and .futures for things that will only resolve with some effort on our part. But this case seems to be neither of those - I don't have any expectation that --fast will ever stop throwing --no-checks.

All that said it doesn't bother me if you want to keep this a .suppressif for some reason. I just don't understand the reason to do so from what is written here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't care which way it goes myself. I copied this suppressif file byte-for-byte from one of Brad's, which is how I decided to suppress instead of skip. I think I'll leave it as a suppression only because it came from Brad. I don't mind if the whole issue gets decided differently later, in which case several suppressif files in various parts of the tests will need to be changed to skipifs.

@dmk42 dmk42 mentioned this pull request Jul 26, 2019
@dmk42 dmk42 merged commit c5aeaa5 into chapel-lang:master Jul 28, 2019
@dmk42 dmk42 deleted the tobyte branch August 6, 2019 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should there be an additional way to access the first (or only) byte or codepoint of a string?
3 participants