
Unclosed strings at EOF sometimes tokenized as T_WHITESPACE by the JS tokenizer #1718

Closed
jrfnl opened this issue Oct 16, 2017 · 4 comments

@jrfnl
Contributor

jrfnl commented Oct 16, 2017

A JS file with the following code (note the missing closing '):

alert('hi);

is being tokenized as follows:

0 :: L001 :: C1 :: T_OPEN_TAG :: (0) ::
1 :: L001 :: C1 :: T_WHITESPACE :: (0) ::
2 :: L002 :: C1 :: T_STRING :: (5) :: alert
3 :: L002 :: C6 :: T_OPEN_PARENTHESIS :: (1) :: (
4 :: L002 :: C7 :: T_WHITESPACE :: (5) :: 'hi);
5 :: L002 :: C12 :: T_CLOSE_TAG :: (0) ::

I believe the fourth token, 'hi);, is incorrectly tagged as T_WHITESPACE. While this is caused by a parse error in the original JS code, the token content is clearly not whitespace.

Maybe the catch-all T_STRING, or even a T_UNKNOWN, would be more appropriate?

Full tokenizer processing log:

        *** START JS TOKENIZING ***
        Process char 0 => \n (buffer: )
        Process char 1 => a (buffer: \n)
        => Added token T_WHITESPACE (\n)
        Process char 2 => l (buffer: a)
        Process char 3 => e (buffer: al)
        Process char 4 => r (buffer: ale)
        Process char 5 => t (buffer: aler)
        Process char 6 => ( (buffer: alert)
        => Added token T_STRING (alert)
                * char is token, looking ahead 8 chars *
                => Looking ahead 1 chars => ('
                => Looking ahead 2 chars => ('h
                => Looking ahead 3 chars => ('hi
                => Looking ahead 4 chars => ('hi)
                => Looking ahead 5 chars => ('hi);
                * look ahead found nothing *
        => Added token T_OPEN_PARENTHESIS (()
        Process char 7 => ' (buffer: )
                * looking for string closer *
                Process char 8 => h (buffer: ')
                Process char 9 => i (buffer: 'h)
                Process char 10 => ) (buffer: 'hi)
                Process char 11 => ; (buffer: 'hi))
        => Added token T_WHITESPACE ('hi);)
        *** END TOKENIZING ***
        *** START TOKEN MAP ***
        *** END TOKEN MAP ***
        *** START SCOPE MAP ***
        *** END SCOPE MAP ***
        *** START LEVEL MAP ***
        Process token 0 on line 1 [col:1;len:0;lvl:0;]: T_OPEN_TAG =>
        Process token 1 on line 1 [col:1;len:0;lvl:0;]: T_WHITESPACE => \n
        Process token 2 on line 2 [col:1;len:5;lvl:0;]: T_STRING => alert
        Process token 3 on line 2 [col:6;len:1;lvl:0;]: T_OPEN_PARENTHESIS => (
        Process token 4 on line 2 [col:7;len:5;lvl:0;]: T_WHITESPACE => 'hi);
        Process token 5 on line 2 [col:12;len:0;lvl:0;]: T_CLOSE_TAG =>
        *** END LEVEL MAP ***
        *** START ADDITIONAL JS PROCESSING ***
        Process token 0: T_OPEN_TAG =>
        Process token 1: T_WHITESPACE => \n
        Process token 2: T_STRING => alert
        Process token 3: T_OPEN_PARENTHESIS => (
        Process token 4: T_WHITESPACE => 'hi);
        Process token 5: T_CLOSE_TAG =>
        *** END ADDITIONAL JS PROCESSING ***
@jrfnl
Contributor Author

jrfnl commented Oct 16, 2017

Also see #1719

@avdg

avdg commented Oct 17, 2017

I suspect there are many JS parser bugs, and I know of some repos with data suitable for brute-force testing; I only wish I had dedicated time to fix all the important issues.

It might be hard to set up the tools correctly, as the JS community stopped maintaining some of them (while pushing through incompatible changes) and others started to create their own incomplete tools.

(The tools would also have to be adapted to our test cases, but since these repos hold batches of test data, writing the tooling might be worth it in the long term.)

The JS grammar (slightly outdated compared to the live spec) can be found here, although it might be a bit overwhelming (click the "lexical grammar" link to see only the lexical part):

https://avdg.github.io/ecmascript.html

@gsherwood
Member

Looks like this only happens if there is no newline char at the end of the file. Otherwise, the tokeniser sees the newline char and bails out.
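That bail-out behaviour can be sketched roughly as follows. This is illustrative JavaScript, not PHP_CodeSniffer's actual PHP implementation; the function name `scanString` is made up, and the token type names simply mirror the log above (with `T_UNKNOWN` used for the unclosed-at-EOF case, per the suggestion in the report):

```javascript
// Minimal sketch of a string-closer scan: it gives up when a newline appears
// before the closing quote, but without an explicit end-of-input branch the
// leftover buffer would fall through to the generic flush and get mislabelled
// (as T_WHITESPACE in the reported bug).
function scanString(chars, start, quote) {
  let buffer = quote;
  for (let i = start + 1; i < chars.length; i++) {
    const c = chars[i];
    if (c === quote && chars[i - 1] !== '\\') {
      // Found an unescaped closer: emit a complete string token.
      return { type: 'T_CONSTANT_ENCAPSED_STRING', content: buffer + c, next: i + 1 };
    }
    if (c === '\n') {
      // Newline before the closer: bail out so the caller can rescan
      // from just after the opening quote.
      return null;
    }
    buffer += c;
  }
  // End of input reached without a closer (no trailing newline either):
  // this is the case that needed an exception, so the buffer is not
  // flushed as whitespace.
  return { type: 'T_UNKNOWN', content: buffer, next: chars.length };
}
```

With input `alert('hi);` (no trailing newline), the scan starting at the quote consumes `hi);`, hits end of input, and returns the whole `'hi);` buffer as a single non-whitespace token instead of falling through.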

@gsherwood gsherwood changed the title JS Tokenizer bug ? Unclosed strings at EOF sometimes tokenized as T_WHITESPACE by the JS tokenizer Nov 8, 2017
gsherwood added a commit that referenced this issue Nov 8, 2017
@gsherwood
Member

Yep, it looked like an exception needed to be made for this specific case. Thanks for reporting it.
