Fix for #238 - bug in new emoji support allowed some illegal characters. #239

mschnee · 2018-09-28T21:42:58Z

This fix is for #238

A previous PR #75 updated excel4node to support XML 1.1-valid characters. Unfortunately, the XLSX / Office Open XML SpreadsheetML File Format is not a XML 1.1 format, only XML 1.0. This caused resultant files to be unreadable when certain categories of multipoint unicode characters (like skin-tone emoji) were rendered into it incorrectly.

Greatly expanded the unit tests for cells containing unicode strings.
Added some additional tests to include XML control characters that should be escaped.
Changed Cell.stringSetter to evaluate all codepoints instead of using regular expressions on entire strings (faster and more reliable to maintain), with the following logic:
- Escape control characters '<>"& with the respective &escaped; versions.
- Use the original XML 1.0 regex to validate a codepoint. If it does not match, escape it using the numeric form of the unicode character, ex 😂 for 😂, 🤞🏼 for 🤞🏼 and so on.
- Return any codepoints unmodified that match the valid set.

Please comment on other ways I may be able to increase coverage, quality, or test sampling :) .

(PS sorry for breaking XLSX files)

natergj · 2018-09-30T18:26:24Z

source/lib/cell/index.js

@@ -6,6 +6,48 @@ const Style = require('../style/style.js');
 const utils = require('../utils.js');
 const util = require('util');

+const ampEncoded = {
+    '"': '&quot;',


Why replace these characters with their HTML equivalents? When opening docs with Excel or LibreOffice, these are shown as their HTML literals and not their expected value.
e.g. I would expect that when I pass Cats & Dogs into the cell string function, "Cats & Dogs" is displayed when opening the document with Excel, however, "Cats & Dogs" is the actual result. I imagine Goole Sheets would display the correct value, but not Excel or Libre Office.

I was mostly following the ECMA-376, but it may be unnecessary with xmlbuilder. I will update the sample and do some actual comparison.

natergj · 2018-09-30T18:49:18Z

Thanks for looking into this @mschnee!
One thing that may be helpful for troubleshooting or quality is to add a sheet to the workbook generated by the sample.js file in the repo. You can run though a bunch of emojis and characters and then verify that when the document is opened there are no errors and everything is displayed as you would expect. When I was first starting with this project I used https://archive.codeplex.com/?p=ooxmlvalidator to validate the package. That project is no longer maintained and I do need to find a replacement.

natergj · 2018-09-30T19:54:30Z

When I was first starting with this project I used https://archive.codeplex.com/?p=ooxmlvalidator to validate the package. That project is no longer maintained and I do need to find a replacement.

I've added a validate.sh script that can be used to validate generated workbooks. It wouldn't have actually caught the unicode issue, though, but if you work on providing other fixes, it would help with testing.

mschnee · 2018-10-01T19:11:59Z

The list of XML valid characters is well known:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That regex would look like /[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u100000-\u10FFFF]/u, but I don't think that final character range is currently a valid regular expression- at least node 8.11.3 didn't match emoji against it.

> '☕️'.match(/\u2615/u);
[ '☕', index: 0, input: '☕️' ]
> '😂'.match(/\u1f602/u);
null

@natergj

I have updated stringSetter to use a much simplified removeInvalidXml function, which numerically tests for codepoints 65536-1114111, and then regex tests the rest of the valid range. I have also updated the tests, updated the sample, and ran it against my own customers' funky data.

mschnee · 2018-10-04T23:25:53Z

I've renamed the PR. This wasn't about adding tests: it was about fixing a bug I introduced in a previous PR @natergj

natergj · 2018-10-05T22:42:00Z

This looks great! Thanks, @mschnee! Sorry for the delay in getting it merged. I'll have a new version published to npm by the end of the weekend and this ill be part of that.

Update tests

55ddcae

natergj reviewed Sep 30, 2018

View reviewed changes

mschnee added 2 commits October 1, 2018 12:02

Second attempt at escaping, also updates sample.js

7fe4690

Clean up spacing

05de693

Fix comparison for final codepoint

c4e0272

mschnee changed the title ~~Update tests~~ Fix for #238 - bug in new emoji support allowed some illegal characters. Oct 4, 2018

natergj merged commit 1a54652 into natergj:master Oct 5, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for #238 - bug in new emoji support allowed some illegal characters. #239

Fix for #238 - bug in new emoji support allowed some illegal characters. #239

mschnee commented Sep 28, 2018 •

edited

Loading

natergj Sep 30, 2018

mschnee Oct 1, 2018

natergj commented Sep 30, 2018

natergj commented Sep 30, 2018

mschnee commented Oct 1, 2018 •

edited

Loading

mschnee commented Oct 4, 2018

natergj commented Oct 5, 2018

Fix for #238 - bug in new emoji support allowed some illegal characters. #239

Fix for #238 - bug in new emoji support allowed some illegal characters. #239

Conversation

mschnee commented Sep 28, 2018 • edited Loading

natergj Sep 30, 2018

Choose a reason for hiding this comment

mschnee Oct 1, 2018

Choose a reason for hiding this comment

natergj commented Sep 30, 2018

natergj commented Sep 30, 2018

mschnee commented Oct 1, 2018 • edited Loading

mschnee commented Oct 4, 2018

natergj commented Oct 5, 2018

mschnee commented Sep 28, 2018 •

edited

Loading

mschnee commented Oct 1, 2018 •

edited

Loading