Incorrect Unicode Handling causes XLSX files to be unreadable. #238

mschnee · 2018-09-28T21:40:31Z

Describe the bug
A couple of weeks ago I submitted a PR for #75, which updated excel4node to accept XML 1.1-valid characters as cell content. This worked, mostly, until a customer of mine attempted to download XLSX reports that had input data from certain multi-code-point emoji (the skin-tone emoji), which resulted in Microsoft Excel being unable to open the resulting file, with the message:

We found a problem with some content in 'filename.xlsx.  Do you want us to try to recover as much as we can?

Further investigation and reading into the ECMA-376 specification, and the Microsoft Excel extensions to the Office Open XML SpreadsheetML File Format, revealed a problem with my PR: XLSX files are strictly XML 1.0 compliant, not 1.1.

I have a PR to go along with this issue, based on the documentation, which addresses the following:

Greatly expanded the unit tests for cells containing unicode strings.
Added some additional tests to include XML control characters that should be escaped.
Changed Cell.stringSetter to evaluate all codepoints instead of using regular expressions on entire strings (faster and more reliable to maintain)

To Reproduce
Create a worksheet from a set of data that includes various multi-point unicode characters above U+FFFF, and other characters that are valid in XML 1.1 but not XML 1.0, and then attempt to open it in Excel / LibreOffice / etc.

Create a worksheet from a set of data that includes invalid control characterts such as \u000b (vertical tab).

Expected behavior
The document should be valid, and open, and the characters should be presented correctly by the system in their encoded form.

Environment (please complete the following information):

Mac OSX Mojave 10. 14
Node v8.11.2
excel4node Version 1.6.0

Additional context
Add any other context about the problem here. Log entries related to the issue are good things.

Update post-release for clarity

The list of XML valid characters is well known:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That regex would look like /[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u100000-\u10FFFF]/u, but that final character range is currently an invalid regular expression in Node 8

> '☕️'.match(/\u2615/u);
[ '☕', index: 0, input: '☕️' ]
> '😂'.match(/\u1f602/u);
null

I have updated stringSetter to use a much simplified removeInvalidXml function, which numerically tests for codepoints 65536-1114111, and then regex tests the rest of the valid range. I have also updated the tests, updated the sample, and ran it against my own customers' funky data.

Under the hood, xmlbuilder handles almost all the encoding and escaping properly: several control charactes such as vertical tab are legal in XML but not in the OfficeML specification, which is why the codepoint evaluation was added.

The text was updated successfully, but these errors were encountered:

Fix for #238 - bug in new emoji support allowed some illegal characters.

natergj · 2018-10-07T18:45:45Z

PR 239 has been released as part of version 1.7.0. Thanks, again @mschnee

mschnee mentioned this issue Sep 28, 2018

Fix for #238 - bug in new emoji support allowed some illegal characters. #239

Merged

natergj added a commit that referenced this issue Oct 5, 2018

Merge pull request #239 from mschnee/fix-xml-1.0-unicode-escapes

1a54652

Fix for #238 - bug in new emoji support allowed some illegal characters.

natergj added this to the 1.7.0 milestone Oct 7, 2018

natergj closed this as completed Oct 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect Unicode Handling causes XLSX files to be unreadable. #238

Incorrect Unicode Handling causes XLSX files to be unreadable. #238

mschnee commented Sep 28, 2018 •

edited

Loading

natergj commented Oct 7, 2018

Incorrect Unicode Handling causes XLSX files to be unreadable. #238

Incorrect Unicode Handling causes XLSX files to be unreadable. #238

Comments

mschnee commented Sep 28, 2018 • edited Loading

Update post-release for clarity

natergj commented Oct 7, 2018

mschnee commented Sep 28, 2018 •

edited

Loading