You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Jul 22, 2022. It is now read-only.
Describe the bug
A couple of weeks ago I submitted a PR for #75, which updated excel4node to accept XML 1.1-valid characters as cell content. This worked, mostly, until a customer of mine attempted to download XLSX reports that had input data from certain multi-code-point emoji (the skin-tone emoji), which resulted in Microsoft Excel being unable to open the resulting file, with the message:
We found a problem with some content in 'filename.xlsx. Do you want us to try to recover as much as we can?
Further investigation and reading into the ECMA-376 specification, and the Microsoft Excel extensions to the Office Open XML SpreadsheetML File Format, revealed a problem with my PR: XLSX files are strictly XML 1.0 compliant, not 1.1.
I have a PR to go along with this issue, based on the documentation, which addresses the following:
Greatly expanded the unit tests for cells containing unicode strings.
Added some additional tests to include XML control characters that should be escaped.
Changed Cell.stringSetter to evaluate all codepoints instead of using regular expressions on entire strings (faster and more reliable to maintain)
To Reproduce
Create a worksheet from a set of data that includes various multi-point unicode characters above U+FFFF, and other characters that are valid in XML 1.1 but not XML 1.0, and then attempt to open it in Excel / LibreOffice / etc.
Create a worksheet from a set of data that includes invalid control characterts such as \u000b (vertical tab).
Expected behavior
The document should be valid, and open, and the characters should be presented correctly by the system in their encoded form.
Environment (please complete the following information):
Mac OSX Mojave 10. 14
Node v8.11.2
excel4node Version 1.6.0
Additional context
Add any other context about the problem here. Log entries related to the issue are good things.
That regex would look like /[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u100000-\u10FFFF]/u, but that final character range is currently an invalid regular expression in Node 8
I have updated stringSetter to use a much simplified removeInvalidXml function, which numerically tests for codepoints 65536-1114111, and then regex tests the rest of the valid range. I have also updated the tests, updated the sample, and ran it against my own customers' funky data.
Under the hood, xmlbuilder handles almost all the encoding and escaping properly: several control charactes such as vertical tab are legal in XML but not in the OfficeML specification, which is why the codepoint evaluation was added.
The text was updated successfully, but these errors were encountered:
Describe the bug
A couple of weeks ago I submitted a PR for #75, which updated
excel4node
to accept XML 1.1-valid characters as cell content. This worked, mostly, until a customer of mine attempted to download XLSX reports that had input data from certain multi-code-point emoji (the skin-tone emoji), which resulted in Microsoft Excel being unable to open the resulting file, with the message:Further investigation and reading into the ECMA-376 specification, and the Microsoft Excel extensions to the Office Open XML SpreadsheetML File Format, revealed a problem with my PR: XLSX files are strictly XML 1.0 compliant, not 1.1.
I have a PR to go along with this issue, based on the documentation, which addresses the following:
Cell.stringSetter
to evaluate all codepoints instead of using regular expressions on entire strings (faster and more reliable to maintain)To Reproduce
Create a worksheet from a set of data that includes various multi-point unicode characters above
U+FFFF
, and other characters that are valid in XML 1.1 but not XML 1.0, and then attempt to open it in Excel / LibreOffice / etc.Create a worksheet from a set of data that includes invalid control characterts such as
\u000b
(vertical tab).Expected behavior
The document should be valid, and open, and the characters should be presented correctly by the system in their encoded form.
Environment (please complete the following information):
Additional context
Add any other context about the problem here. Log entries related to the issue are good things.
Update post-release for clarity
The list of XML valid characters is well known:
That regex would look like
/[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u100000-\u10FFFF]/u
, but that final character range is currently an invalid regular expression in Node 8I have updated
stringSetter
to use a much simplifiedremoveInvalidXml
function, which numerically tests for codepoints 65536-1114111, and then regex tests the rest of the valid range. I have also updated the tests, updated the sample, and ran it against my own customers' funky data.Under the hood,
xmlbuilder
handles almost all the encoding and escaping properly: several control charactes such asvertical tab
are legal in XML but not in the OfficeML specification, which is why the codepoint evaluation was added.The text was updated successfully, but these errors were encountered: