Literal carriage returns break parser under Python 3.x #153

ssokolow · 2019-11-04T07:36:35Z

This chunk from my .vimrc causes python3 py/vimlparser.py ~/.vimrc to error out on my system.

Since this didn't show up in my initial search for VimL parsers for Python, I started work on a VimL parser of my own and I know why it happens.

The literal <CR>s in this piece of my .vimrc (the ^Ms)...

...get converted to \n by the universal newline support in Python 3.x's text mode. (I assume because a bare \r is the old Classic MacOS line terminator and Python's universal newlines mode is converting all potential newlines rather than limiting itself to whichever kind of newline it encounters first like Vim itself does.)

The quick hack I came up with to make the parser robust was to open the file in binary mode and then manually decode and split lines:

with open(path, 'rb') as fobj:
    process_group(fobj.read().decode('utf8').split('\n'))

The text was updated successfully, but these errors were encountered:

blueyed · 2019-11-06T18:29:36Z

Yeah, I think we should not use universal newlines.
Can you tell where in vimlparser this happens?

ssokolow · 2019-11-08T15:39:52Z

Sorry for the delay. The last couple of days were distracting.

I think I know where the problem is, but I'll need to find a moment to test my hypothesis before I can give you a definitive answer.

Also, I can't say for certain without reading the Vim source code to try to match whatever it does, but I'm guessing that the proper solution for detecting newlines would be to check for \n, \r\n, and \r and assume whichever occurs first is what the file should be split on.

(I'll have to try generating a test .vimrc to see if Vim's parser breaks if I include a \r on the very first line of an otherwise \n file with Vim configured to support classic MacOS-style newlines.)

Failing that, maybe splitting on \n unconditionally and then trimming a maximum of one trailing \r from each line.

ssokolow · 2019-11-09T18:19:12Z

I didn't have time to check vimlparser yet, but I did find a moment to look up Vim's documentation on the line-ending detection algorithm that we probably want to replicate:

From :h 'fileformats':

	- When more than one name is present, separated by commas, automatic
	  <EOL> detection will be done when reading a file.  When starting to
	  edit a file, a check is done for the <EOL>:
	  1. If all lines end in <CR><NL>, and 'fileformats' includes "dos",
	     'fileformat' is set to "dos".
	  2. If a <NL> is found and 'fileformats' includes "unix", 'fileformat'
	     is set to "unix".  Note that when a <NL> is found without a
	     preceding <CR>, "unix" is preferred over "dos".
	  3. If 'fileformat' has not yet been set, and if a <CR> is found, and
	     if 'fileformats' includes "mac", 'fileformat' is set to "mac".
	     This means that "mac" is only chosen when:
	      "unix" is not present or no <NL> is found in the file, and
	      "dos" is not present or no <CR><NL> is found in the file.
	     Except: if "unix" was chosen, but there is a <CR> before
	     the first <NL>, and there appear to be more <CR>s than <NL>s in
	     the first few lines, "mac" is used.
	  4. If 'fileformat' is still not set, the first name from
	     'fileformats' is used.

In other words, assuming that unix, dos, and mac are all listed for detection, it behaves like this:

If a \n is found that isn't preceded by \r, assume UNIX line endings.
Else, if no lone \n was found, but at least one \r\n was found, assume DOS line endings.
Else, if neither \n nor \r\n were found, but \r was found, assume Classic MacOS line endings.
Else, use whatever type of line-endings was listed first in fileencodings. (Defaults to dos,unix for native Windows builds, unix,dos for everything else including Cygwin.)

For step 4, the best way to eliminate surprises would probably be to use os.name == 'nt' to decide whether to split on \r\n or \n. That'll perfectly match Vim's behaviour in every situation except "single-line VimL file being loaded by a Vim with a non-default first entry in fileformats."

ssokolow · 2019-11-09T18:24:08Z

That said, step 4 may not even be necessary because, a little further down, it looks like it's saying that it uses a simplified form of that algorithm for VimL that's sourced or in vimrc:

	For systems with a Dos-like <EOL> (<CR><NL>), when reading files that
	are ":source"ed and for vimrc files, automatic <EOL> detection may be
	done:
	- When 'fileformats' is empty, there is no automatic detection.  Dos
	  format will be used.
	- When 'fileformats' is set to one or more names, automatic detection
	  is done.  This is based on the first <NL> in the file: If there is a
	  <CR> in front of it, Dos format is used, otherwise Unix format is
	  used.

(Which, now that I think about it, would make sense. I'm on Linux and had to manually change the line endings from DOS to Unix on bwHomeEndAdv to get it to load properly.)

In that case, I suppose the best solution would be to just follow that "systems with a Dos-like <EOL>" routine unconditionally. If there's a <CR> in front of the first <NL>, split on \r\n. Otherwise, split on \n.

Sorry about not catching that before I made the previous post. I didn't sleep well and I'm just about to go back to bed.

ssokolow · 2019-11-16T04:41:13Z

Sorry again for the delay.

The problem is this function:

def viml_readfile(path):
    lines = []
    f = open(path)
    for line in f.readlines():
        lines.append(line.rstrip("\r\n"))
    f.close()
    return lines

...and it works if experimentally changed to this:

def viml_readfile(path):
    lines = []

    # Replicate Vim's algorithm for identifying DOS vs. UNIX line endings 
    # in sourced scripts and .vimrc files
    with open(path, 'rb') as f:
        content = f.read().decode('utf-8')
        first_n = content.index('\n')
        if first_n > 0 and content[first_n - 1] == '\r':
            raw_lines = content.split('\r\n')
        else:
            raw_lines = content.split('\n')

    for line in raw_lines:
        lines.append(line.rstrip("\r\n"))
    return lines

...though, to be honest, that's still broken and was always broken, because there's another bug in the Python version of the parser that I noticed while writing that. My unconditional .decode('utf-8') replicates Python 3's behaviour in text mode and, if you try to open a non-UTF8 file in Python's text mode, you'll get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte

The proper solution to match Vim is tricky because, again, it depends on stuff you can reconfigure external to the script but it boils down to "Parse once using the system's default encoding as specified by $LANG, falling back to latin1, and, if a scriptencoding statement is found, then re-parse as whatever encoding it declares. (eg. scriptencoding utf-8 in my case)

My best guess at an algorithm that doesn't require a big table of regional encoding would be something like this:

If os.environ.get('LANG', '').endswith('.utf8'), then attempt to parse as UTF-8.
If that fails, parse as latin1.
If the parse turned up a scriptencoding statement, re-parse as the encoding it declares.

That works for two reasons:

There are no invalid bit patterns under latin1, so it can't return an Exception... just gibberish interpretations of stuff outside 7-bit ASCII.
Nothing needed to recognize and interpret scriptencoding <name> will involve characters outside 7-bit ASCII.

I'd have done that too, but the whole "search through the AST for scriptencoding ... and then redo the parse is far less trivial.

ssokolow · 2019-11-16T04:51:35Z

The closest trivial hack I can think of would be to replace content = f.read().decode('utf-8') in my changed version with:

    content = f.read()
    try:
        content = content.decode('utf-8')
    except UnicodeDecodeError:
        content = content.decode('latin1')

Since all structural elements of VimL are 7-bit ASCII, that'll get you something that parses UTF-8 correctly and, if that fails, you instead get a correct AST but any tokens containing non-ASCII characters will be assumed to be latin1, even if that makes them gibberish.

(Of course, not irretrievable gibberish. Someone could always do gibberish.encode('latin1').decode('correct_encoding') at a later stage.)

whonore · 2020-01-31T16:54:23Z

Does the newline argument of open solve the problem?

with open('nl', 'bw') as f:
    f.write(b'newline\ncarriage\rnewline carriage\r\n')

with open('nl', 'r') as f:
    print(f.readlines())
# -> ['newline\n', 'carriage\n', 'newline carriage\n']

with open('nl', 'r', newline='') as f:
    print(f.readlines())
# -> ['newline\n', 'carriage\r', 'newline carriage\r\n']

ssokolow · 2020-02-06T04:10:51Z

Sorry for the delayed response. I'll try to get you an answer soon.

ssokolow · 2020-02-08T05:41:59Z

No. Substituting open(path) for open(path, 'r', newline='') has no apparent effect.

To avoid this back-and-forthing, here's a test file, containing the relevant two lines bit of my vimrc, plus a few extra lines of context to help make sure things are working properly:

vimrc.txt

Vim parses it properly, as does vimlparser when run with Python 2.x, or when my proposed "Replicate Vim's algorithm for identifying DOS vs. UNIX line endings in sourced scripts and .vimrc files" modifications are run with Python 3.x.

Otherwise, vimlparser with Python 3.x will produce the somewhat amusingly confusing failure mode of appearing to output nothing at all because the error message doesn't use repr() when showing the problem line and the portion after the \r contains a literal ^[ (i.e. ESC) which the terminal dutifully interprets.

If you use less to force the non-printables to not be interpreted by the terminal, you see this:

vimlparser: E492: Not an editor command: JA TODO:: line 4 col 1

For that reason, I'd also suggest changing the %s in this line to a %r:

raise VimLParserException(Err(viml_printf("E492: Not an editor command: %s", self.reader.peekline()), self.ea.cmdpos))

...that removes the need for less to see the message and makes it look like this:

vimlparser: E492: Not an editor command: '\x1bkJA TODO:': line 4 col 1

However, the successful output still looks truncated unless you run it through something like less because it doesn't use repr() on the ^[ either.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Literal carriage returns break parser under Python 3.x #153

Literal carriage returns break parser under Python 3.x #153

ssokolow commented Nov 4, 2019

blueyed commented Nov 6, 2019

ssokolow commented Nov 8, 2019 •

edited

Loading

ssokolow commented Nov 9, 2019

ssokolow commented Nov 9, 2019 •

edited

Loading

ssokolow commented Nov 16, 2019 •

edited

Loading

ssokolow commented Nov 16, 2019 •

edited

Loading

whonore commented Jan 31, 2020

ssokolow commented Feb 6, 2020

ssokolow commented Feb 8, 2020 •

edited

Loading

Literal carriage returns break parser under Python 3.x #153

Literal carriage returns break parser under Python 3.x #153

Comments

ssokolow commented Nov 4, 2019

blueyed commented Nov 6, 2019

ssokolow commented Nov 8, 2019 • edited Loading

ssokolow commented Nov 9, 2019

ssokolow commented Nov 9, 2019 • edited Loading

ssokolow commented Nov 16, 2019 • edited Loading

ssokolow commented Nov 16, 2019 • edited Loading

whonore commented Jan 31, 2020

ssokolow commented Feb 6, 2020

ssokolow commented Feb 8, 2020 • edited Loading

ssokolow commented Nov 8, 2019 •

edited

Loading

ssokolow commented Nov 9, 2019 •

edited

Loading

ssokolow commented Nov 16, 2019 •

edited

Loading

ssokolow commented Nov 16, 2019 •

edited

Loading

ssokolow commented Feb 8, 2020 •

edited

Loading