Piping Closure compiler stderr output to Python with Unicode characters on Windows problem #4159

Open
juj opened this issue Mar 6, 2024 · 6 comments

@juj

juj commented Mar 6, 2024

STR:

a.py

import subprocess
subprocess.run(['npx', 'google-closure-compiler','--charset=UTF8','--js','a.js','--js_output_file','o.js'], encoding='utf-8', stderr=subprocess.PIPE, shell=True)

a.js

if (4 == NaN) console.log('á');

generates an error

C:\emsdk\emscripten\main>python a.py
Traceback (most recent call last):
  File "C:\emsdk\emscripten\main\a.py", line 2, in <module>
    subprocess.run(['npx', 'google-closure-compiler','--charset=UTF8','--js','a.js','--js_output_file','o.js'], encoding='utf-8', stderr=subprocess.PIPE, shell=True)
  File "C:\Python311\Lib\subprocess.py", line 550, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\subprocess.py", line 1197, in communicate
    stderr = self.stderr.read()
             ^^^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 135: invalid continuation byte

My impression here is that Closure has emitted the ISO-8859-1 encoding value of á to stderr, which has the hex value of 0xe1. However, the encoding='utf-8' argument in Python expects the stderr to be printed out as UTF-8.
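A minimal sketch of that hypothesis, using only Python's built-in codecs. The `sample` bytes are made up to mimic the tail of the warning line; they are not captured from Closure.

```python
# Hypothetical sample: the tail of the warning line, with the accented
# character written as the single byte 0xE1.
sample = b"console.log('\xe1');"

# U+00E1 (a with acute) is one byte in ISO-8859-1 but two bytes in UTF-8.
latin1_bytes = "\u00e1".encode("iso-8859-1")  # b'\xe1'
utf8_bytes = "\u00e1".encode("utf-8")         # b'\xc3\xa1'

try:
    sample.decode("utf-8")
    failure = None
except UnicodeDecodeError as exc:
    # 0xE1 looks like the lead byte of a 3-byte UTF-8 sequence, but the
    # next byte (an apostrophe) is not a continuation byte.
    failure = (exc.start, exc.reason)

# Decoding the same bytes as ISO-8859-1 succeeds.
decoded_as_latin1 = sample.decode("iso-8859-1")

print(failure)
print(decoded_as_latin1)
```

The failure reason matches the "invalid continuation byte" message in the traceback above. Note that Windows-1252 also encodes this character as 0xE1, so this byte alone cannot tell the two encodings apart.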

I could not find a command line directive in https://github.com/google/closure-compiler/wiki/Flags-and-Options to help control Closure stdout/stderr output encoding.

Which encoding does Closure use for stdout/stderr printing? Is it ISO-8859-1 by intent? Or should it have been UTF-8 and Closure accidentally printed out ISO-8859-1?

@brad4d
Contributor

brad4d commented Mar 7, 2024

I cannot tell from the example a.js file in the description whether the á character is correctly encoded as UTF-8 in the file you're actually using when you see this error.

Can you confirm that the input file, a.js, is actually valid UTF-8?

@brad4d
Contributor

brad4d commented Mar 7, 2024

Actually, could you just attach 2 files to this issue?

  1. The actual a.js file.
  2. The exact output from closure-compiler itself. (i.e. the input that python is seeing)

@juj
Author

juj commented Mar 7, 2024

Here are the input files: a.zip

image

C3 A1 is 11000011 10100001, which is of the form 110xxxxx 10yyyyyy, i.e. a leading byte followed by a continuation byte. See e.g. Wikipedia on UTF-8 Encoding. The Unicode code point in this case is xxxxxyyyyyy = 00011 100001 = 0xE1 = https://www.compart.com/en/unicode/U+00E1.
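The bit arithmetic above can be checked with a short sketch. The helper name `decode_two_byte_utf8` is made up for illustration, and it only handles the 2-byte form (it does not reject overlong encodings).

```python
def decode_two_byte_utf8(lead: int, cont: int) -> int:
    """Return the code point encoded by a 2-byte UTF-8 sequence."""
    if lead & 0b11100000 != 0b11000000:
        raise ValueError("not a 2-byte lead byte (expected 110xxxxx)")
    if cont & 0b11000000 != 0b10000000:
        raise ValueError("not a continuation byte (expected 10yyyyyy)")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation.
    return ((lead & 0b00011111) << 6) | (cont & 0b00111111)


code_point = decode_two_byte_utf8(0xC3, 0xA1)
print(hex(code_point))

# Cross-check against Python's own UTF-8 decoder.
assert bytes([0xC3, 0xA1]).decode("utf-8") == chr(code_point)
```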

The exact output from closure-compiler itself. (i.e. the input that python is seeing)

The test case does not produce any JavaScript output from closure-compiler. Python attempts to capture the stderr error message from Closure process, but Python croaks internally since it cannot decode the stderr bytes that Closure is outputting, and so does not produce any output to the calling a.py file.

Executing the following python file instead

import subprocess
ret = subprocess.run(['npx', 'google-closure-compiler','--charset=UTF8','--js','a.js','--js_output_file','o.js'], encoding='iso-8859-1', stderr=subprocess.PIPE, shell=True)
print(ret.stderr)

does not throw an exception, and instead causes Python to print the stderr as expected:

a.js:1:4: WARNING - [JSC_SUSPICIOUS_NAN] Comparison against NaN is always false. Did you mean isNaN()?
  1| if (4 == NaN) console.log('á');
         ^^^^^^^^
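An alternative that avoids guessing the encoding up front is to capture stderr as raw bytes and decode afterwards. This is only a sketch: the helper names `decode_tool_output` and `run_and_capture_stderr` are hypothetical, and the UTF-8-then-ISO-8859-1 fallback order is an assumption, not something Closure documents.

```python
import subprocess


def decode_tool_output(raw: bytes) -> str:
    """Decode captured bytes as UTF-8, falling back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1")


def run_and_capture_stderr(cmd, **kwargs) -> str:
    """Run cmd, capture stderr as raw bytes, and decode it afterwards."""
    # No encoding= argument here, so subprocess returns bytes and never
    # raises UnicodeDecodeError inside communicate().
    proc = subprocess.run(cmd, stderr=subprocess.PIPE, **kwargs)
    return decode_tool_output(proc.stderr)
```

With the command from this issue, usage would look like `run_and_capture_stderr(['npx', 'google-closure-compiler', '--js', 'a.js', '--js_output_file', 'o.js'], shell=True)`.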

@brad4d
Contributor

brad4d commented Mar 11, 2024

What I want to know is this:

Is closure-compiler actually generating an invalid character sequence to stderr, or is something else going on?

One thing that could be happening is that the stderr output from closure-compiler is getting mixed with output from either its own stdout or from some other process that happens to share the same output stream. Due to buffering, the 2-byte sequence for 'á' that closure-compiler sends to stderr could be interrupted by output from somewhere else.

Thanks for providing the a.js file and your command line. We can use that to find out what the actual stderr output from the latest closure-compiler build is for this case.

If this problem is in some way actually tied to Windows, we're unlikely to fix it ourselves as none of the core team uses Windows when working on closure-compiler.

@brad4d
Contributor

brad4d commented Mar 12, 2024

Thank you for supplying the a.js file.

  1. I downloaded it
  2. I checked out and built the latest version of closure-compiler as a Java jar file.
  3. I stored the path to that jar file in $ccjar
  4. I ran the following commands to check the behavior.

First confirm that my terminal / OS is using UTF-8

$ echo $LANG
en_US.UTF-8
$ echo á |xxd
00000000: c3a1 0a  

Yep. c3a1 is the correct byte pair for this UTF-8 character as stated in a previous comment.

Now confirm that the character is correct in a.js

$ xxd a.js
00000000: 6966 2028 3420 3d3d 204e 614e 2920 636f  if (4 == NaN) co
00000010: 6e73 6f6c 652e 6c6f 6728 27c3 a127 293b  nsole.log('..');
00000020: 0d0a                                     ..

Yep.

Now run the compiler with the options as described in earlier comments and save its stderr output into err.out and use xxd to check the contents of that file.

$ java -jar $ccjar --charset=UTF8 --js a.js --js_output_file  o.js 2> err.out
$ xxd err.out
00000000: 612e 6a73 3a31 3a34 3a20 5741 524e 494e  a.js:1:4: WARNIN
00000010: 4720 2d20 5b4a 5343 5f53 5553 5049 4349  G - [JSC_SUSPICI
00000020: 4f55 535f 4e41 4e5d 2043 6f6d 7061 7269  OUS_NAN] Compari
00000030: 736f 6e20 6167 6169 6e73 7420 4e61 4e20  son against NaN 
00000040: 6973 2061 6c77 6179 7320 6661 6c73 652e  is always false.
00000050: 2044 6964 2079 6f75 206d 6561 6e20 6973   Did you mean is
00000060: 4e61 4e28 293f 0a20 2031 7c20 6966 2028  NaN()?.  1| if (
00000070: 3420 3d3d 204e 614e 2920 636f 6e73 6f6c  4 == NaN) consol
00000080: 652e 6c6f 6728 27c3 a127 293b 0d0a 2020  e.log('..');..  
00000090: 2020 2020 2020 205e 5e5e 5e5e 5e5e 5e0a         ^^^^^^^^.
000000a0: 0a30 2065 7272 6f72 2873 292c 2031 2077  .0 error(s), 1 w
000000b0: 6172 6e69 6e67 2873 290a                 arning(s).

Yep. We again see "c3" and "a1" used as the 2-byte encoding in bytes at positions 0x87 and 0x88.

The Java jar executing in Linux is definitely generating stderr using UTF-8 encoding.
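The same check can be done from Python on any captured byte string, which may be handy on Windows where xxd is not always available. A sketch: `non_ascii_bytes` is a made-up helper, and `chunk` is copied from offsets 0x80 to 0x8d of the err.out dump above.

```python
def non_ascii_bytes(raw: bytes):
    """Return (offset, byte) pairs for every byte above 0x7F."""
    return [(offset, byte) for offset, byte in enumerate(raw) if byte > 0x7F]


# Bytes 0x80..0x8d of the err.out dump: the "e.log('..');" part plus CR LF.
chunk = bytes.fromhex("652e6c6f672827c3a127293b0d0a")

# Shift the offsets so they line up with the positions in the full dump.
found = [(0x80 + offset, byte) for offset, byte in non_ascii_bytes(chunk)]

for offset, byte in found:
    print(f"0x{offset:02x}: 0x{byte:02x}")
```

If the Windows build really emits ISO-8859-1, running `non_ascii_bytes` on its raw stderr would presumably show a single 0xE1 at 0x87 instead of the C3 A1 pair.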

Probably the closure-compiler you're running has been converted from a jar file to a native Windows binary using Graal, because I think that's what the google/closure-compiler-npm code that generates the NPM release tries to make the default.

I'm not sure whether the different behavior you see is the result of Windows behavior, the behavior of Java on Windows (as emulated by Graal), or something else.

@juj
Author

juj commented Mar 26, 2024

One simplification/note to the bug test case is that the original a.py was

import subprocess
subprocess.run(['npx', 'google-closure-compiler','--charset=UTF8','--js','a.js','--js_output_file','o.js'], encoding='utf-8', stderr=subprocess.PIPE, shell=True)

However, this bug is unrelated to the --charset=UTF8 parameter; it also occurs with the shorter line

import subprocess
subprocess.run(['npx', 'google-closure-compiler','--js','a.js','--js_output_file','o.js'], encoding='utf-8', stderr=subprocess.PIPE, shell=True)

It is expected that the issue does not occur on Linux or macOS, since those OSes widely default to UTF-8.

In my Windows shell I have changed my active codepage to UTF-8, i.e.

C:\emsdk\emscripten\main>chcp
Active code page: 65001

See chcp 65001.

However, this change does not affect the bug, so this is not a Windows terminal/console issue but something in the libraries involved, either in Closure or somewhere else, as observed above.

We successfully worked around this in Emscripten by passing encoding='iso-8859-1' if WINDOWS else 'utf-8' when invoking Closure.
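A minimal sketch of that workaround, not the actual Emscripten code: the `WINDOWS` check via `sys.platform` and the helper name `closure_stderr_encoding` are assumptions made for illustration.

```python
import sys

# Assumption: a plain platform check. Emscripten's real WINDOWS flag may be
# defined differently.
WINDOWS = sys.platform.startswith("win")


def closure_stderr_encoding(is_windows: bool = WINDOWS) -> str:
    """Pick the encoding used to decode Closure's stderr."""
    return "iso-8859-1" if is_windows else "utf-8"


print(closure_stderr_encoding())
```

The result would then be passed along as `subprocess.run(..., encoding=closure_stderr_encoding(), stderr=subprocess.PIPE)`.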
