Skip to content
This repository has been archived by the owner on Jul 22, 2022. It is now read-only.

Incorrect Unicode Handling causes XLSX files to be unreadable. #238

Closed
mschnee opened this issue Sep 28, 2018 · 1 comment
Closed

Incorrect Unicode Handling causes XLSX files to be unreadable. #238

mschnee opened this issue Sep 28, 2018 · 1 comment
Milestone

Comments

@mschnee
Copy link
Contributor

mschnee commented Sep 28, 2018

Describe the bug
A couple of weeks ago I submitted a PR for #75, which updated excel4node to accept XML 1.1-valid characters as cell content. This worked, mostly, until a customer of mine attempted to download XLSX reports that had input data from certain multi-code-point emoji (the skin-tone emoji), which resulted in Microsoft Excel being unable to open the resulting file, with the message:

We found a problem with some content in 'filename.xlsx.  Do you want us to try to recover as much as we can?

Further investigation and reading into the ECMA-376 specification, and the Microsoft Excel extensions to the Office Open XML SpreadsheetML File Format, revealed a problem with my PR: XLSX files are strictly XML 1.0 compliant, not 1.1.

I have a PR to go along with this issue, based on the documentation, which addresses the following:

  • Greatly expanded the unit tests for cells containing unicode strings.
  • Added some additional tests to include XML control characters that should be escaped.
  • Changed Cell.stringSetter to evaluate all codepoints instead of using regular expressions on entire strings (faster and more reliable to maintain)

To Reproduce
Create a worksheet from a set of data that includes various multi-point unicode characters above U+FFFF, and other characters that are valid in XML 1.1 but not XML 1.0, and then attempt to open it in Excel / LibreOffice / etc.

Create a worksheet from a set of data that includes invalid control characterts such as \u000b (vertical tab).

Expected behavior
The document should be valid, and open, and the characters should be presented correctly by the system in their encoded form.

Environment (please complete the following information):

  • Mac OSX Mojave 10. 14
  • Node v8.11.2
  • excel4node Version 1.6.0

Additional context
Add any other context about the problem here. Log entries related to the issue are good things.

Update post-release for clarity

The list of XML valid characters is well known:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That regex would look like /[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u100000-\u10FFFF]/u, but that final character range is currently an invalid regular expression in Node 8

> '☕️'.match(/\u2615/u);
[ '☕', index: 0, input: '☕️' ]
> '😂'.match(/\u1f602/u);
null

I have updated stringSetter to use a much simplified removeInvalidXml function, which numerically tests for codepoints 65536-1114111, and then regex tests the rest of the valid range. I have also updated the tests, updated the sample, and ran it against my own customers' funky data.

Under the hood, xmlbuilder handles almost all the encoding and escaping properly: several control charactes such as vertical tab are legal in XML but not in the OfficeML specification, which is why the codepoint evaluation was added.

natergj added a commit that referenced this issue Oct 5, 2018
Fix for #238 - bug in new emoji support allowed some illegal characters.
@natergj natergj added this to the 1.7.0 milestone Oct 7, 2018
@natergj
Copy link
Owner

natergj commented Oct 7, 2018

PR 239 has been released as part of version 1.7.0. Thanks, again @mschnee

@natergj natergj closed this as completed Oct 7, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants