
Unclosed strings at EOF sometimes tokenized as T_WHITESPACE by the JS tokenizer #1718

Closed
jrfnl opened this issue Oct 16, 2017 · 4 comments

@jrfnl
Contributor

jrfnl commented Oct 16, 2017

A JS file with the following code (note the missing closing '):

alert('hi);

is being tokenized as follows:

0 :: L001 :: C1 :: T_OPEN_TAG :: (0) ::
1 :: L001 :: C1 :: T_WHITESPACE :: (0) ::
2 :: L002 :: C1 :: T_STRING :: (5) :: alert
3 :: L002 :: C6 :: T_OPEN_PARENTHESIS :: (1) :: (
4 :: L002 :: C7 :: T_WHITESPACE :: (5) :: 'hi);
5 :: L002 :: C12 :: T_CLOSE_TAG :: (0) ::

I believe the fourth token, 'hi);, is incorrectly tagged as T_WHITESPACE. While this is caused by a parse error in the original JS code, the token content is clearly not whitespace.

Maybe the catch-all T_STRING, or even a T_UNKNOWN, would be more appropriate?

Full tokenizer processing log:

        *** START JS TOKENIZING ***
        Process char 0 => \n (buffer: )
        Process char 1 => a (buffer: \n)
        => Added token T_WHITESPACE (\n)
        Process char 2 => l (buffer: a)
        Process char 3 => e (buffer: al)
        Process char 4 => r (buffer: ale)
        Process char 5 => t (buffer: aler)
        Process char 6 => ( (buffer: alert)
        => Added token T_STRING (alert)
                * char is token, looking ahead 8 chars *
                => Looking ahead 1 chars => ('
                => Looking ahead 2 chars => ('h
                => Looking ahead 3 chars => ('hi
                => Looking ahead 4 chars => ('hi)
                => Looking ahead 5 chars => ('hi);
                * look ahead found nothing *
        => Added token T_OPEN_PARENTHESIS (()
        Process char 7 => ' (buffer: )
                * looking for string closer *
                Process char 8 => h (buffer: ')
                Process char 9 => i (buffer: 'h)
                Process char 10 => ) (buffer: 'hi)
                Process char 11 => ; (buffer: 'hi))
        => Added token T_WHITESPACE ('hi);)
        *** END TOKENIZING ***
        *** START TOKEN MAP ***
        *** END TOKEN MAP ***
        *** START SCOPE MAP ***
        *** END SCOPE MAP ***
        *** START LEVEL MAP ***
        Process token 0 on line 1 [col:1;len:0;lvl:0;]: T_OPEN_TAG =>
        Process token 1 on line 1 [col:1;len:0;lvl:0;]: T_WHITESPACE => \n
        Process token 2 on line 2 [col:1;len:5;lvl:0;]: T_STRING => alert
        Process token 3 on line 2 [col:6;len:1;lvl:0;]: T_OPEN_PARENTHESIS => (
        Process token 4 on line 2 [col:7;len:5;lvl:0;]: T_WHITESPACE => 'hi);
        Process token 5 on line 2 [col:12;len:0;lvl:0;]: T_CLOSE_TAG =>
        *** END LEVEL MAP ***
        *** START ADDITIONAL JS PROCESSING ***
        Process token 0: T_OPEN_TAG =>
        Process token 1: T_WHITESPACE => \n
        Process token 2: T_STRING => alert
        Process token 3: T_OPEN_PARENTHESIS => (
        Process token 4: T_WHITESPACE => 'hi);
        Process token 5: T_CLOSE_TAG =>
        *** END ADDITIONAL JS PROCESSING ***
@jrfnl
Contributor Author

jrfnl commented Oct 16, 2017

Also see #1719

@avdg

avdg commented Oct 17, 2017

I suspect there are many JS parser bugs, and I know of some repos with data suitable for brute-force testing; I only wish I had dedicated time to fix all the important issues.

It might be hard to set up the tools correctly, as the JS community stopped maintaining some of them (while pushing through incompatible changes) and others started to create their own incomplete tools.

(The tools would also have to be adapted to our test cases, but since these repos hold batches of test data, writing the tooling might be worth it in the long term.)

The JS grammar (slightly outdated compared to the live spec) can be found here, although it might be a bit overwhelming (click the "lexical grammar" link to see only the lexical part):

https://avdg.github.io/ecmascript.html

@gsherwood
Member

Looks like this only happens if there is no newline char at the end of the file. Otherwise, the tokeniser sees the newline char and bails out.
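That bail-out behaviour can be sketched roughly as follows. This is illustrative JavaScript, not PHP_CodeSniffer's actual PHP implementation; the function name `scanString` is made up, and the token type names simply mirror the log above (with `T_UNKNOWN` used for the unclosed-at-EOF case, per the suggestion in the report):

```javascript
// Minimal sketch of a string-closer scan: it gives up when a newline appears
// before the closing quote, but without an explicit end-of-input branch the
// leftover buffer would fall through to the generic flush and get mislabelled
// (as T_WHITESPACE in the reported bug).
function scanString(chars, start, quote) {
  let buffer = quote;
  for (let i = start + 1; i < chars.length; i++) {
    const c = chars[i];
    if (c === quote && chars[i - 1] !== '\\') {
      // Found an unescaped closer: emit a complete string token.
      return { type: 'T_CONSTANT_ENCAPSED_STRING', content: buffer + c, next: i + 1 };
    }
    if (c === '\n') {
      // Newline before the closer: bail out so the caller can rescan
      // from just after the opening quote.
      return null;
    }
    buffer += c;
  }
  // End of input reached without a closer (no trailing newline either):
  // this is the case that needed an exception, so the buffer is not
  // flushed as whitespace.
  return { type: 'T_UNKNOWN', content: buffer, next: chars.length };
}
```

With input `alert('hi);` (no trailing newline), the scan starting at the quote consumes `hi);`, hits end of input, and returns the whole `'hi);` buffer as a single non-whitespace token instead of falling through.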

@gsherwood gsherwood changed the title JS Tokenizer bug ? Unclosed strings at EOF sometimes tokenized as T_WHITESPACE by the JS tokenizer Nov 8, 2017
gsherwood added a commit that referenced this issue Nov 8, 2017
@gsherwood
Member

Yep, it looked like an exception needed to be made for this specific case. Thanks for reporting it.
