Unexpected behaviour when using multiple character newline #2216

Open
OMAlexB opened this issue Dec 21, 2023 · 2 comments
@OMAlexB

OMAlexB commented Dec 21, 2023

We have some files that use |*| as the delimiter and |##|\r\n as the newline. They come from a third-party application, and unfortunately we cannot change the format. We have found three issues while trying to parse these files with CsvHelper.

  1. When the first character of a row is the same as the first character of the newline sequence, the row is treated as a blank line.
    Example: with |##|\r\n as the newline, |a,b,c|##|\r\n is treated as an empty row. I expect it to be parsed as a full, valid row, with no need to wrap the field in quotes, because the field does not contain the full newline. RFC 4180 does not cover custom newline sequences, but if we are simply replacing CRLF with |##|\r\n, a field that contains only part of the newline should not need quoting. In our case we don't control the file, so we can't add quotes regardless.

  2. Similarly, when the first character of the newline appears inside a record, the row is cut off at that point even though the full newline is not present. E.g. a|,b,c|##|\r\n is treated as two separate rows, with the raw record for row 1 being a| and the raw record for row 2 being ,b,c|##|\r\n. The same reasoning as above applies: this should parse normally without quotes.

  3. When the delimiter and the newline begin with the same character, the parser treats every occurrence of the newline as a delimiter, because it checks only the first character (and checks the delimiter first). E.g.

a|*|b|*|c|##|\r\n
d|*|e|*|f|##|\r\n

is treated as a single row with 5 records in it. I expect this should be treated normally as 2 separate rows even though they are strange delimiter/new line characters.

We are looking at fixing these ourselves by peeking at the next few characters to confirm that the entire delimiter or newline is present in these cases, and we hope to open a PR to merge the fix back here.
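To make the expected behaviour concrete, here is a minimal sketch in Python (not CsvHelper's C# code) of a splitter that only accepts a delimiter or newline when the entire sequence matches at the current position. The function name `parse` and its parameters are hypothetical, and quoting and escaping are deliberately ignored.

```python
def parse(text, delimiter="|*|", newline="|##|\r\n"):
    """Split text into rows of fields, matching whole sequences only.

    Hypothetical illustration: no quote or escape handling.
    """
    rows, row, field = [], [], []
    i = 0
    while i < len(text):
        if text.startswith(newline, i):
            # The full newline is present: close the field and the row.
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
            i += len(newline)
        elif text.startswith(delimiter, i):
            # The full delimiter is present: close the field only.
            row.append("".join(field))
            field = []
            i += len(delimiter)
        else:
            # A lone '|' (or any other character) is ordinary field data.
            field.append(text[i])
            i += 1
    if field or row:
        # Final row without a trailing newline.
        row.append("".join(field))
        rows.append(row)
    return rows
```

Under these assumptions, the three inputs from the list above come out as one row containing `|a,b,c`, one row containing `a|,b,c`, and two rows of three fields each.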

@OMAlexB OMAlexB added the bug label Dec 21, 2023
@JoshClose
Owner

Peeking could be a problem due to the buffer. There is currently no peeking in the parser.
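The buffer concern can be illustrated with a small sketch: a multi-character sequence may straddle two buffer reads, so a peek has to be able to refill without losing unread characters. The Python class below is a hypothetical illustration of that idea only; `PeekableReader` and its methods are invented names and do not reflect how CsvHelper's parser or the proposed PR is written.

```python
import io


class PeekableReader:
    """Hypothetical buffered reader that can look ahead n characters."""

    def __init__(self, stream, chunk_size=4096):
        self._stream = stream
        self._chunk_size = chunk_size
        self._buf = ""
        self._pos = 0

    def _fill(self, n):
        # Read more chunks until n characters are available or the stream ends.
        while len(self._buf) - self._pos < n:
            chunk = self._stream.read(self._chunk_size)
            if not chunk:
                break
            # Keep the unread tail so a sequence split across reads survives.
            self._buf = self._buf[self._pos:] + chunk
            self._pos = 0

    def peek(self, n):
        """Return up to n upcoming characters without consuming them."""
        self._fill(n)
        return self._buf[self._pos:self._pos + n]

    def read(self, n):
        """Consume and return up to n characters."""
        text = self.peek(n)
        self._pos += len(text)
        return text


# A 3-character buffer forces the newline sequence to span several reads.
reader = PeekableReader(io.StringIO("a|##|\r\nb", newline=""), chunk_size=3)
first = reader.read(1)
is_newline = reader.peek(6) == "|##|\r\n"
```

Passing `newline=""` keeps Python from translating `\r\n`, which matters when the sequence being matched contains it.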

I think there is an issue when the first char of the delimiter and newline are the same. I also see an issue with blank lines and custom newlines.

For now, could you possibly run a replace on the file first, replacing |##|\r\n with \r\n? That's not ideal, but it may work for the time being.
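For large files, the replace does not have to load everything into memory. A rough Python sketch of a streaming rewrite follows; the function name `rewrite_newlines` and its parameters are hypothetical. It holds back the text after the last complete match so that a |##|\r\n split across two reads is still found.

```python
import io


def rewrite_newlines(src, dst, old="|##|\r\n", new="\r\n", chunk_size=1 << 16):
    """Copy src to dst, replacing every `old` with `new`, chunk by chunk."""
    pending = ""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last piece ended with a complete `old`;
        # the last piece may end with a partial `old`, so keep it for later.
        *complete, pending = pending.split(old)
        for part in complete:
            dst.write(part)
            dst.write(new)
    # Whatever remains had no terminating `old`.
    dst.write(pending)


# Tiny in-memory demo; a 4-character chunk splits the newline across reads.
src = io.StringIO("a|*|b|##|\r\nc|*|d|##|\r\n", newline="")
dst = io.StringIO(newline="")
rewrite_newlines(src, dst, chunk_size=4)
```

With real files, opening both with `open(path, newline="")` should prevent Python from translating line endings. Note that the |*| delimiter is left untouched, so this only addresses the newline part of the problem.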

@OMAlexB
Author

OMAlexB commented Feb 19, 2024

Oops, I didn't see your comment. I have opened a PR that adds support for peeking with the buffer, along with these fixes. It wasn't really feasible to run the full replace, as the files are pretty big and are provided to us by a third party.
