Added support for UnicodeCombiningMark, fixes #3639. #3645

ctjlewis · 2020-07-19T23:17:23Z

Some notes:

Annotating functions with spec grammar is likely wise (and will help with issues like this in the future), so I did not remove that JSDoc block.
A utils/ directory was added, which contains a script for building the source (without GWT) and a script for running that built version with the given args. This directory will be a good place to put bash scripts in the future, including runtime_tests.sh to execute the jsComp runtime tests (a separate issue I'm working on).
There is more grammar to implement if we want to be consistent with the spec (isConnectorPunctuation, isZeroWidthJoiner, etc.), which will prevent issues like this in the future. I intend to do this over the coming days; I just wanted to get this PR in first.
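
For reference, the remaining grammar helpers mentioned in the last note could look roughly like the sketch below. This is not code from this PR: the class name and the isUnicodeDigit and isZeroWidthNonJoiner helpers are hypothetical, and it assumes Character.getType is available, which holds on a regular JVM but may not hold under J2CL.

```java
// Hypothetical sketch of the remaining identifier-part checks.
// Names are illustrative only; none of this is taken from Scanner.java.
final class SpecGrammarSketch {

  // UnicodeConnectorPunctuation: category "Connector punctuation (Pc)".
  static boolean isConnectorPunctuation(char ch) {
    return Character.getType(ch) == Character.CONNECTOR_PUNCTUATION;
  }

  // UnicodeDigit: category "Decimal number (Nd)".
  static boolean isUnicodeDigit(char ch) {
    return Character.getType(ch) == Character.DECIMAL_DIGIT_NUMBER;
  }

  // Zero width non-joiner, U+200C.
  static boolean isZeroWidthNonJoiner(char ch) {
    return ch == '\u200C';
  }

  // Zero width joiner, U+200D.
  static boolean isZeroWidthJoiner(char ch) {
    return ch == '\u200D';
  }

  public static void main(String[] args) {
    System.out.println(isConnectorPunctuation('_')); // true
    System.out.println(isConnectorPunctuation('a')); // false
    System.out.println(isUnicodeDigit('7'));         // true
    System.out.println(isZeroWidthJoiner('\u200D')); // true
  }
}
```

Keeping one small predicate per grammar production makes it easy to annotate each one with the matching spec name.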

This PR fixes #3639, and you can see that the source:

var bar = {
  İ: "foo"
};

now compiles as expected to:

var bar={"I\u0307":"foo"};

ctjlewis · 2020-07-19T23:43:13Z

Build tests are failing because:

/src/com/google/javascript/jscomp/parsing/parser/Scanner.java:909: error: cannot find symbol
    return Character.getType(ch) == Character.NON_SPACING_MARK;
                    ^
  symbol:   method getType(char)
  location: class Character
/src/com/google/javascript/jscomp/parsing/parser/Scanner.java:909: error: cannot find symbol
    return Character.getType(ch) == Character.NON_SPACING_MARK;

Anybody have recommendations? Not sure why Character.getType would not be defined. I even added a manual import for java.lang.Character, and the getType method has been defined since JDK 1.1.

I noticed the rest of the character checks did not use builtins like Character.NON_SPACING_MARK, but this syntax is much more legible on top of being future-proofed and likely just as performant.

Edit: Is this breaking because it's being fed into J2CL, which can't compile Character.getType and therefore doesn't import it?
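
If the missing Character.getType under J2CL does turn out to be the cause, one possible workaround is a plain range check in the style of the existing Greek-letter checks. The sketch below is only an assumption about how that might look, with hypothetical names. It covers just the Combining Diacritical Marks block (U+0300 to U+036F), which is enough for the U+0307 case in #3639 but is far from the whole category.

```java
// Hypothetical range-based fallback that avoids Character.getType entirely.
// Covers only the Combining Diacritical Marks block, not every combining mark.
final class CombiningMarkRangeSketch {

  static boolean isCombiningDiacriticalMark(char ch) {
    return ch >= 0x0300 && ch <= 0x036F;
  }

  public static void main(String[] args) {
    // U+0307 COMBINING DOT ABOVE, the character from #3639.
    System.out.println(isCombiningDiacriticalMark('\u0307')); // true
    System.out.println(isCombiningDiacriticalMark('I'));      // false
  }
}
```

The trade-off is the one noted above: range checks compile anywhere but have to be extended by hand as more blocks are needed.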

ctjlewis · 2020-07-20T02:07:48Z

Opened an issue at google/j2cl#103 regarding future-proofed Unicode category checks; we just need J2CL to support Character.getType() and the category constants.

ctjlewis · 2020-07-21T07:59:52Z

.gitignore

+/bazel-bin
+/bazel-closure-compiler
+/bazel-out
+/bazel-testlogs


Running ./build_test.sh created bazel output directories that were added to the git tree by default. These lines prevent that.

ctjlewis · 2020-07-21T08:03:53Z

src/com/google/javascript/jscomp/parsing/parser/Scanner.java

@@ -901,8 +901,65 @@ private static boolean isIdentifierStart(char ch) {
        | (ch >= 0x03B1 & ch <= 0x03C9); // Greek lowercase letters
  }

+  // Check if char is Unicode Category "Combining spacing mark (Mc)"


Not compliant (see #3647): per spec, UnicodeCombiningMark also includes the "Non-spacing mark (Mn)" category, but this addresses #3639 and a decent range of similar cases with as small a change as possible.
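
For comparison, a check that accepts both categories named by the spec might look like the sketch below. It is illustrative only, with hypothetical names, and relies on Character.getType, so it assumes a regular JVM rather than J2CL.

```java
// Hypothetical sketch: UnicodeCombiningMark as the union of categories Mn and Mc.
// Not the code in this PR; requires Character.getType to be available.
final class CombiningMarkSketch {

  static boolean isUnicodeCombiningMark(char ch) {
    int type = Character.getType(ch);
    return type == Character.NON_SPACING_MARK          // Mn
        || type == Character.COMBINING_SPACING_MARK;   // Mc
  }

  public static void main(String[] args) {
    System.out.println(isUnicodeCombiningMark('\u0307')); // true: COMBINING DOT ABOVE (Mn)
    System.out.println(isUnicodeCombiningMark('\u0903')); // true: DEVANAGARI SIGN VISARGA (Mc)
    System.out.println(isUnicodeCombiningMark('a'));      // false
  }
}
```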

ctjlewis · 2020-07-21T08:04:48Z

utils/run.sh

+#!/bin/bash
+# Run the local build of CC in target/. Make sure to run build.sh before running
+# this script.
+java -jar target/closure-compiler-1.0-SNAPSHOT.jar $@


Handy for debugging a local build.

utils/build.sh

googlebot added the cla: yes label Jul 19, 2020

ctjlewis mentioned this pull request Jul 20, 2020

J2CL transpiled compiler Scanner.java doesn't understand non-ascii characters. #2383

Open

ctjlewis commented Jul 21, 2020

rishipal self-assigned this Jul 22, 2020

ctjlewis mentioned this pull request Aug 9, 2020

Improve unicode escape in regex #3656

Merged

HenryRLee approved these changes Aug 11, 2020

ctjlewis closed this Aug 12, 2020

ctjlewis force-pushed the master branch from e64655e to 8e70fe7 Compare August 12, 2020 19:27
