Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Allow carriage return as valid newline sequence #837

Closed
hukkin opened this issue Jul 22, 2021 · 19 comments
Closed

Proposal: Allow carriage return as valid newline sequence #837

hukkin opened this issue Jul 22, 2021 · 19 comments

Comments

@hukkin
Copy link
Contributor

hukkin commented Jul 22, 2021

Related to some of the discussion in #835

I'd like to propose adding CR (carriage return) character that is not followed by an LF character to the list of allowed newline sequences. This is how e.g. CommonMark defines line ending.

The reasoning for the change is:

  1. Avoid roundtrip instability when an LF is prefixed by more than one CR characters. EDIT: Not an issue as per .abnf

  2. In Python (perhaps other languages?) the standard way of opening files supports something called Universal Newline Support, meaning that LF, CR, and CRLF are all normalised to LF. This normalisation should NOT be used with the current TOML spec because

    • CR characters in strings will be converted to LFs when not allowed to do so. EDIT: Not an issue as per .abnf
    • Files where CRs are used as newline sequence are parsed successfully even though they are invalid TOML (spec v1.0.0).

    As far as I know, all popular Python parsers and their documented API make the mistake of doing this incorrect newline normalisation.

Now for point 2, a valid counter-argument is to "just fix all Python parsers". 😄 However, perhaps other languages do similar normalisation by default or otherwise we agree that a spec change makes more sense here instead.

@hukkin
Copy link
Contributor Author

hukkin commented Jul 23, 2021

If I interpret the .abnf correctly, CRs not followed by an LF character are already prohibited in MLB strings, in which case point 1 in the original post is actually not an issue.

I think this should be clarified in the text spec though. Currently the text spec explicitly mentions that unescaped CRs can be used in MLBs (with no mention of a following LF being required).

@eksortso
Copy link
Contributor

eksortso commented Jul 23, 2021

You're right that clarification is worthwhile. But to be clear, there's no such thing as an "escaped" carriage return in TOML. There's a limited set of characters that can be escaped (with a backslash) in basic strings, and carriage returns aren't among them. (Even inside escaped end-of-line whitespace in MLBs, carriage returns are parts of newline sequences.) And other representations of carriage returns like \r and \u000d use the escape backslash but the sequences aren't considered "eacaped."

@hukkin
Copy link
Contributor Author

hukkin commented Jul 30, 2021

But to be clear, there's no such thing as an "escaped" carriage return

Thanks for the correction.

FWIW, I currently think it may be best to reject this proposal in favor of the clarifying PR #838

@emmatyping
Copy link

Hm shouldn't TOML use the the Unicode line breaking algorithm since documents are utf-8 encoded?
https://www.unicode.org/reports/tr14/tr14-32.html#Algorithm

@hukkin
Copy link
Contributor Author

hukkin commented Jul 30, 2021

@ethanhs Based on quick glance, it seems that the algorithm is an annex that is not part of the core spec. It also seems to do a slightly different thing than what we want, specifying mandatory and optional break opportunities for text display. E.g. how web browsers wrap text at space characters.

@pradyunsg
Copy link
Member

I'd like to propose adding CR (carriage return) character that is not followed by an LF character to the list of allowed newline sequences. This is how e.g. CommonMark defines line ending.

Happy to accept a PR adding this.

@LongTengDao
Copy link
Contributor

LongTengDao commented Aug 14, 2021

TOML is a format for general config.

In currently computer world, Windows use CRLF, Linux and Mac use LF; CR only existed in history (old iOS).

So I think keep EOL as CRLF/LF is good. If all newlines are considered, they will be too many:

  • for JS, EOL is LF CR CRLF LS(U+2028) PS(U+2029)
  • for CSS, EOL is LF FF(U+000C) CR CRLF
  • for Unicode, there are more newline chars... like U+0085 for XML

@ChristianSi
Copy link
Contributor

ChristianSi commented Aug 14, 2021

I have some sympathy for @LongTengDao's viewpoint, but at the other hand I think: If it works for CommonMark, it can work for us too!

The alternative would be fragmentation, since it doesn't sound wise to require all implementations to reject lonely CR's as errors. Hence, if we don't clarify this, some will continue to accept and others to reject them. Not an ideal state of affairs.

@hukkin
Copy link
Contributor Author

hukkin commented Aug 14, 2021

Note that I didn't request this because "I want all newlines" 😄, but mostly because the Markdown spec seems to need either this or the clarifying PR to fix the first struck-through issue item in the original post.

I'm also slightly in favor of the clarifying PR and leaving lone CR in the history books. The nice thing is that it is a clarification, not a spec change, so less disturbing for implementations. E.g. git (as a modern system) doesn't seem to consider lone CR a newline in its text manipulations and people are happy with that.

I don't feel strongly at all however, as long as one of the two proposed changes happen.

@ChristianSi
Copy link
Contributor

ChristianSi commented Aug 14, 2021

Also, from @LongTengDao's comment one can see that both JS and CSS accept lonely CR's as linebreaks, so it seems we're heading in the right direction. (I don't see a reason to allow additional linebreak characters, though. Let's keep it simple, like CommonMark does.).

@pradyunsg
Copy link
Member

pradyunsg commented Aug 14, 2021

As far as I can tell, lone CRs are accepted in most places (think: browsers, editors, language interpreters/compilers etc) as a valid newline character sequence. I don't think I'd be expanding the specification to allow other characters.

@arp242
Copy link
Contributor

arp242 commented Aug 16, 2021

CR was used in the old pre-OSX MacOS and some other long-obsolete systems. Systems and standards that go back to the 80s and 90s support it because of that, but there's really no practical reason to add it in 2021.

@hukkin
Copy link
Contributor Author

hukkin commented Aug 16, 2021

In addition to git's lack of support for lone CRs, also the Golang compiler seems to completely refuse to parse them. So even if we stick to just LF and CRLF, it seems we're not an outlier among other modern projects.

@Lemmingh
Copy link

Lemmingh commented Oct 6, 2021

I don't think lone CR should be accepted as EOL.

LF and CR LF are enough.

Modern things should not repeat the mistakes of predecessors

Ideally, EOL should be a uniform thing in the electronic world.

The first people who made computers created so many kinds of EOL, perhaps for laziness, perhaps for business competition. Anyway, the past decades have been a big lesson.

Nowadays, only LF and CR LF have enough weight. Other options will be eradicated in practice.

Adding dying things to new specs is not wise.

"Lone CR as line ending" will have poor support in future

The world is moving away from chaos slowly.

New editors, such as Atom and VS Code, only accept LF and CR LF:

Compiler designers should manage to get original input

The Python example (Universal Newline Support) is not convincing.

Compilers and similar tools are expected to interpret the original input. Working on tweaked data is a mistake. Those Python implementations should open files in binary mode and perform decoding on their own.

On the other hand, PEP 278 was published in 2002, so there was adequate reason to recognize lone CR for compatibility. The basis no longer exists nowadays.

CommonMark is not suitable as a reference

Also, CommonMark is designed to have great backward compatibility to cater to users from different flavors and backgrounds due to the wild history of Markdown.

@ChristianSi
Copy link
Contributor

ChristianSi commented Oct 6, 2021

@Lemmingh 's comment makes a lot of sense. While I don't have strong opinion on this whole issue, I agree that the rationale for the suggested change seems pretty week indeed.

Which sane (or even insane) editor people might use these days to edit TOML files would use CR to write newlines? Probably none. So, it seems there is no problem and hence no need for a solution/change.

@onerandomusername
Copy link

Given the last few comments were over 2 months ago, is there an official verdict on this yet?

@ChristianSi
Copy link
Contributor

Not yet, evidently. But considering the last two comments and the number of 👍 on them, I guess the proposal will be rejected.

@septatrix
Copy link

I too see no need for this change.

The reasoning for the change is:

  1. [...]

  2. In Python (perhaps other languages?) the standard way of opening files supports something called Universal Newline Support, meaning that LF, CR, and CRLF are all normalised to LF. This normalisation should NOT be used with the current TOML spec because

    • CR characters in strings will be converted to LFs when not allowed to do so. EDIT: Not an issue as per .abnf
    • Files where CRs are used as newline sequence are parsed successfully even though they are invalid TOML (spec v1.0.0).

    As far as I know, all popular Python parsers and their documented API make the mistake of doing this incorrect newline normalisation.

Now for point 2, a valid counter-argument is to "just fix all Python parsers". smile However, perhaps other languages do similar normalisation by default or otherwise we agree that a spec change makes more sense here instead.

Just to invalidate the last of your reasons (after 1 and 2a were already retracted): The newline parsing is a feature of python and not an implementation error in the toml libraries. If you do not want that normalization you are free to open the file in byte-mode. The libraries which I have tested will correctly reject files containing a lone \r if they are opened in byte-mode as you would expect.

@pradyunsg
Copy link
Member

Well, looks like we have concensus within the broader group and... well, I don't care strongly enough to push against that.

Rejecting this proposal, although I do appreciate everyone who has pitched in on these discussions! ^>^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants