Skip to content

[core-http] Update XML serialization to support Atom XML operations#4916

Merged
ramya0820 merged 83 commits intoAzure:masterfrom
ramya0820:sb-atom-p5
Oct 16, 2019
Merged

[core-http] Update XML serialization to support Atom XML operations#4916
ramya0820 merged 83 commits intoAzure:masterfrom
ramya0820:sb-atom-p5

Conversation

@ramya0820
Copy link
Member

@ramya0820 ramya0820 commented Aug 27, 2019

EDIT:

This PR is for changes in core-http required for supporting Atom XML serialization required of Service Bus (and other SDKs in future such as Event Hubs). Code was mostly moved over from older JS only azure-sb implementation and has been significantly updated to fit into TS best practices.
Most of the initial work has been moved to Service Bus #4922 as per latest discussion.

Also includes

  • refactoring of general XML utilities
  • bug fix for XML serialization in browser

@ramya0820 ramya0820 requested a review from daviwil as a code owner August 27, 2019 22:15
@ramya0820 ramya0820 self-assigned this Aug 27, 2019
@ramya-rao-a ramya-rao-a requested review from bterlson and chradek and removed request for KarishmaGhiya and daviwil August 28, 2019 00:02
eslint-plugin-promise: 4.2.1
events: 3.0.0
express: 4.17.1
fast-xml-parser: 3.12.20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a canonical implementation of an XML parser? Is this the default used in the community?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dependencies to review would be in package.json file.
This is a temporary artifact that gets generated on local and will be rebased on master. (After which this line would go away since it's not being used in implementation)

About the dependency in itself, I tried using it but it doesn't allow for XML manipulation in memory. So, not sure if it's a good choice for an alternative just yet.

Copy link
Member

@bterlson bterlson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level:

  • Remove each util
  • Remove unused utils
  • Remove any types
  • Remove dependency on cryptojs in the browser (consult with Johnathan and Daniel on what they did with keyvault)
  • Remove isUndefined helper (x === undefined is better). In general the helpers seem less useful with modern JS so take a stab at removing as many as makes sense :)
  • Some specific comments were left, but I did not point out all the places where the feedback applies. Take these comments as general feedback.
  • Replace vars with let/const.

@ramya0820 ramya0820 requested a review from sadasant August 29, 2019 22:25
@ramya0820
Copy link
Member Author

Thanks @sadasant for helping with the WebCrypto stuff! Let me know if there's anything else you suggest around it. The changes we looked into are present in crypto.browser.ts file

@ramya0820 ramya0820 changed the title [core-http] Implement Atom XML support in core-http [core-http] Update XML serialization to support Atom XML operations Oct 15, 2019
@ramya0820
Copy link
Member Author

ramya0820 commented Oct 15, 2019

PR updated to not house the atomSerializationPolicy or any Atom XML specific things per discussions yesterday.
In addition to refactoring, changes so far include few more bug fixes (with XML serialization in browser) and unit tests (#5548)
@ramya-rao-a @bterlson @chradek @daviwil

Copy link
Contributor

@AlexGhiondea AlexGhiondea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes overall look good, but I have some concerns around throwing new exceptions when we didn't throw them before.

I left a couple of comments, please take a look.


/**
* @ignore
* Helper utility to clean up unintended characters that get appended by OS.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the OS won't automatically append the BOM to a file as it doesn't know the content of the file. It is up to the application to add those bytes or not.

In our scenario, who is going to put the BOM into the stream that contains the XML doc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the comment that was added by original authors in azure-sb

Based on https://www.w3.org/International/questions/qa-byte-order-mark.en, looks like this gets added commonly by editors and any app that's used to generate the text content.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the scenario where applications put this header. What I am trying to understand is how is the service bus SDK receiving XML objects that contain this BOM.

I don't mind adding this in if there is a scenario in which we know the BOM is being returned by the service -- what I am asking is for is that scenario.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there is no test scenario that was written earlier that surfaced this as being sent by the service.
As discussed earlier, our understanding is that the OS on which the code generating files (if any) would end up appending this.

So no, I haven't encountered a concrete scenario supporting this other than the fact that this check was added with a comment describing reason by authors in azure-sb

Notes from offline discussion yesterday afternoon for ref -
• Handling of unwanted BOM characters - This was being taken care of in azure-sb implementation and seems less understood by us overall.
https://en.wikipedia.org/wiki/Byte_order_mark
https://www.screenaware.com/en/blog/xml2js-sax-js-non-whitespace-before-first-tag
Based on research and testing things, looks like it's a general problem that users face and must have been added for a reason in azure-sb. The documentation and intent is pretty clear too.
In Node, while xml2js handles this internally for XML parsing, it doesn't do so for XML building.
In browser, both building and parsing require BOM handling.

*/
export function removeBOM(str: any) {
if (typeof str == "string") {
if (str.charCodeAt(0) === 0xfeff || str.charCodeAt(0) === 0xffef) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this checks for the BOM header when the encoding is expected to be UTF16.
Would the correct check be for the following sequence of bytes EF BB BF ?

https://www.w3.org/International/questions/qa-byte-order-mark.en

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is for utf-8, utf-16 seems to have some other considerations added in a different section

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another resource -- there seems to be a difference between UTF8 and UTF16.

https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF

https://en.wikipedia.org/wiki/Byte_order_mark#UTF-16

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "non character" that should never appear in the text.

Can you share more information about the section you mention?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the earlier link at https://www.w3.org/International/questions/qa-byte-order-mark.en#bytheway
It says

If you use a UTF-16 encoding for your page (and we strongly recommend that you don't), there are some additional considerations.

So the directions mentioned in entirety of article seemed to refer to UTF-8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the same document, they have this:

With a binary editor capable of displaying the hexadecimal byte values in the file, the UTF-8 signature displays as EF BB BF.

At this point, this is irrelevant as we don't have the BOM code anymore, but this is something to keep in mind for the SB changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created 3 files using notepad and saved them with different encodings (UTF8, UTF16 big endian, UTF16 little endian).

Here is how they look like in a binary editor
UTF8
image
UTF16 Little Endian
image
UTF16 Big Endian
image

Copy link
Member Author

@ramya0820 ramya0820 Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since SB doesn't directly feed any files to input, it's not going to be necessary.
But when it comes to dealing with output sent by service, such as for deserializationPolicy and maybe others where the response content from service is being parsed directly, would be good to have these in core-http I think.
Especially cause xml2js handles it in Node, but for in browser we are manually parsing and this isn't being handled implicitly.
So it seems to do with general text stream handling things, and, wonder if Storage handles the BOM characters explicitly in the SDK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the actual characters to use, above is interesting - unclear why the utf-16 code was being added in azure-sb - it might be possible that this specific character was noted in output. (Or maybe it was before xml2js (or the service itself) was updated to implicitly handle the BOM characters)

export function parseXML(str: string, opts?: { includeRoot?: boolean }): Promise<any> {
try {
if (str == undefined || str == "") {
throw new Error("Document is empty");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new error case -- what used to happen when you passed and empty XML string before?

I would prefer that we match the existing behavior so that we don't break existing scenarios.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually the same error that gets thrown by parser in Node - same for all errors in .browser.ts file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message is not what I was wondering. If another SDK was using this browser code before and they passed either undefined or and empty document, what would they get back?

I am asking because changing that behavior (and throwing in that case) could be a breaking change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on usage in Storage SDK, this scenario doesn't seem to exist
It seems a very less likely possibility since the SDKs are all targeting the same Azure services. And for an agnostic thing such as empty/undefined document being disallowed in one SDK, it seems like the behavior should be same for others as well.

For instance, if parsing of undefined/empty documents is allowed in browser, it would be a bug if it doesn't work from the Node SDK?
Similarly, if it is not allowed in browser, then the tests and checks would have helped catch that type of bug early on.

But I think more investigation is needed on this, since maybe we should allow handling of empty/undefined documents if that is a valid scenario? Ideally, this discrepancy wouldn't be there if we are using the same parser for Node and browser, but for most part the behaviors would be expected to be same unless very explicitly required as otherwise I think..
cc @bterlson @daviwil

Copy link
Contributor

@ramya-rao-a ramya-rao-a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 2 main concerns left

  • I would have the consumer of the library worry about bom and not have the bom handling here in this library
  • I would remove the new validation errors that are being thrown when in browser mode. These are made to match the errors coming from xml2js library when run in node which can change over time and we have no control over.

@ramya0820
Copy link
Member Author

@ramya-rao-a @AlexGhiondea
About the concern that browser tests would fail if xml2js changes their error messages, Node tests would fail as well since they are the same simple strings.

However, reduced the test coverage and removed BOM handling to unblock PR. We can continue discussing/learning more later..

});
/**
* Converts given XML string into JSON
* @param str String containg the XML content to be parsed into JSON
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param str String containg the XML content to be parsed into JSON
* @param str String containing the XML content to be parsed into JSON

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

);
});

it.only("with element with child undefined element", async function() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
it.only("with element with child undefined element", async function() {
it("with element with child undefined element", async function() {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Contributor

@ramya-rao-a ramya-rao-a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 more things

  • Remove the .only on one of the test cases
  • A spelling mistake in one of the jsdocs.

I have added suggested edits for both that you can use

@ramya-rao-a
Copy link
Contributor

/azp run js - client - ci

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants