Skip to content
This repository has been archived by the owner on Jul 22, 2022. It is now read-only.

Fix for #238 - bug in new emoji support allowed some illegal characters. #239

Merged
merged 4 commits into from
Oct 5, 2018

Conversation

mschnee
Copy link
Contributor

@mschnee mschnee commented Sep 28, 2018

This fix is for #238

A previous PR #75 updated excel4node to support XML 1.1-valid characters. Unfortunately, the XLSX / Office Open XML SpreadsheetML File Format is not a XML 1.1 format, only XML 1.0. This caused resultant files to be unreadable when certain categories of multipoint unicode characters (like skin-tone emoji) were rendered into it incorrectly.

  • Greatly expanded the unit tests for cells containing unicode strings.
  • Added some additional tests to include XML control characters that should be escaped.
  • Changed Cell.stringSetter to evaluate all codepoints instead of using regular expressions on entire strings (faster and more reliable to maintain), with the following logic:
    • Escape control characters '<>"& with the respective &escaped; versions.
    • Use the original XML 1.0 regex to validate a codepoint. If it does not match, escape it using the numeric form of the unicode character, ex &#x1f602; for 😂, &#x1f91e;&#x1f3fc; for 🤞🏼 and so on.
    • Return any codepoints unmodified that match the valid set.

Please comment on other ways I may be able to increase coverage, quality, or test sampling :) .

(PS sorry for breaking XLSX files)

@@ -6,6 +6,48 @@ const Style = require('../style/style.js');
const utils = require('../utils.js');
const util = require('util');

const ampEncoded = {
'"': '&quot;',
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why replace these characters with their HTML equivalents? When opening docs with Excel or LibreOffice, these are shown as their HTML literals and not their expected value.
e.g. I would expect that when I pass Cats & Dogs into the cell string function, "Cats & Dogs" is displayed when opening the document with Excel, however, "Cats & Dogs" is the actual result. I imagine Goole Sheets would display the correct value, but not Excel or Libre Office.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mostly following the ECMA-376, but it may be unnecessary with xmlbuilder. I will update the sample and do some actual comparison.

@natergj
Copy link
Owner

natergj commented Sep 30, 2018

Thanks for looking into this @mschnee!
One thing that may be helpful for troubleshooting or quality is to add a sheet to the workbook generated by the sample.js file in the repo. You can run though a bunch of emojis and characters and then verify that when the document is opened there are no errors and everything is displayed as you would expect. When I was first starting with this project I used https://archive.codeplex.com/?p=ooxmlvalidator to validate the package. That project is no longer maintained and I do need to find a replacement.

@natergj
Copy link
Owner

natergj commented Sep 30, 2018

When I was first starting with this project I used https://archive.codeplex.com/?p=ooxmlvalidator to validate the package. That project is no longer maintained and I do need to find a replacement.

I've added a validate.sh script that can be used to validate generated workbooks. It wouldn't have actually caught the unicode issue, though, but if you work on providing other fixes, it would help with testing.

@mschnee
Copy link
Contributor Author

mschnee commented Oct 1, 2018

The list of XML valid characters is well known:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

That regex would look like /[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u100000-\u10FFFF]/u, but I don't think that final character range is currently a valid regular expression- at least node 8.11.3 didn't match emoji against it.

> '☕️'.match(/\u2615/u);
[ '☕', index: 0, input: '☕️' ]
> '😂'.match(/\u1f602/u);
null

@natergj

I have updated stringSetter to use a much simplified removeInvalidXml function, which numerically tests for codepoints 65536-1114111, and then regex tests the rest of the valid range. I have also updated the tests, updated the sample, and ran it against my own customers' funky data.

@mschnee mschnee changed the title Update tests Fix for #238 - bug in new emoji support allowed some illegal characters. Oct 4, 2018
@mschnee
Copy link
Contributor Author

mschnee commented Oct 4, 2018

I've renamed the PR. This wasn't about adding tests: it was about fixing a bug I introduced in a previous PR @natergj

@natergj
Copy link
Owner

natergj commented Oct 5, 2018

This looks great! Thanks, @mschnee! Sorry for the delay in getting it merged. I'll have a new version published to npm by the end of the weekend and this ill be part of that.

@natergj natergj merged commit 1a54652 into natergj:master Oct 5, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants