Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML parser #3412

Closed
Closed

Conversation

bbert
Copy link
Contributor

@bbert bbert commented Sep 23, 2020

The goal of this PR is to provide a new XML parser in order to speed up parsing time.
Parsing time may be criticial for long contents with SegmentTimelines.

What has been achieved in this PR:

  • integration of tXml parser: https://github.com/TobiasNickel/tXml
  • modification of the tXml parser in order to apply simplification process (https://github.com/TobiasNickel/tXml#xmlsimplify-txml_dom_object) on-the-fly during parsing, while also applying matchers on the attributes values and managing children that shall be stored as arrays
  • remove the StringMatcher since it is not useful anymore as every attribute is parsed by default as a string (BTW simply removing this matcher could reduce up to 30-40% the original parsing processing time)
  • update of the NumericMatcher in order to avoid convert into numbers attributes that shall be in string format but that can be filled as a numeric value ("id" attribute for example)
  • remove all references to '_asArray' notation (no more duplication of child nodes as a single object and as an array of object)

Here are some numbers for the XML parsing processing time (in ms) for a 240mn content having one AdaptationSet with a SegmentTimeline without repeat pattern:

  Chrome desktop  (core i7 2x2.7GHz 16Go) Chromecast 3rd Gen Chromecast Ultra
dash.js 110 2 500 1 800
dash.js + tXml 30 500 350

@bbert
Copy link
Contributor Author

bbert commented Feb 10, 2021

@technogeek00 can you please check this PR on your side regarding all the changes from your PR #3451?
Can you check if does not break anything? Also can you check the unit tests since some of them do not success anymore? But I don't know how to fix them.

@technogeek00
Copy link
Contributor

@bbert I can take a look, will take me a few days to get to it.

@bbert
Copy link
Contributor Author

bbert commented Mar 26, 2021

@technogeek00 Did you have time to check this PR and unit tests?

@dsilhavy dsilhavy modified the milestones: 3.2.2, 4.0.0 Mar 29, 2021
@technogeek00
Copy link
Contributor

Reviewing tonight

Copy link
Contributor

@technogeek00 technogeek00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First I apologize for the delay in my review, but hopefully this is insightful feedback.

I pulled this branch and attempted to run the player and the tests locally, both appeared broken as indicated, but upon further investigation there seems to be deeper problems with the library proposed. The points I would highlight for discussion:

  1. The ordering of sibling elements of different tag names is not persisted, this is critical to patching operations as the add/remove/replace operations operate on the consecutively modified document
  2. There appears to be little to no proper support for xml namespacing, the prefixes are directly preserved, but these are arbitrary in a document and need to be properly handled

As a note on the first point, this is technically only critical for the patch document, the non-modified format of the tXml library does appear to support parsing and simplification independently. If that version was taken directly the DashParser.js file could apply simplification to the MPD element to keep footprints small while the Patch could be left untouched to persist the ordering information for usage in application.

While I agree the parsing and internal structuring of the current library should be replaced with a more performant option, this particular library seems very purpose built for a specific use case and may not be fully conformant. If this library was used, I'd recommend reverting back to the original version, re-porting the attribute matcher framework, and then utilize the simplified version for mpds and lossless simplified version for patches instead of manually modifying the original parse process.

Outside the library we may need to take a close look at document usage cases to ensure single vs array element usage is done consistently and accurately with the updated parsed structure. I highlighted a couple bugs in this review, but I'd worry about areas that did not reference the _asArray and still always expected it or others that always assumed standalone elements.

function tXml(S, options, attrMatchers, arrayChildNames) {
"use strict";
options = options || {};
attrMatchers = attrMatchers || {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type mismatch, attrMatchers used as array but defaulted to object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


* @return {tNode[]}
*/
function tXml(S, options, attrMatchers, arrayChildNames) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the differences between this file and the main tXML attrMatches and arrayChildNames appear to be extensions, it may be better to integrate them through the libraries existing conventions of the options object.

The attribute matchers are a nice addition, but generally why introduce a sometimes array type child function like arrayChildNames instead of using the full tXML file and passing options.simplify: true to have the all children are arrays simplified form without file modification?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attrMatches and attrMatches can indeed be passed in the options object.
So do you suggest to have systematically all children being stored as arrays? This has the disadvantage to access some child nodes as an array while DASH spec indicates for some nodes you can have only (for example SegmentBase)

DashConstants.SUPPLEMENTAL_PROPERTY,
DashConstants.METRICS,
DashConstants.REPORTING
];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing elements preventing patching from working:

  • PatchLocation
  • replace
  • add
  • remove

Example patch documents can be found in the DASH Schema repo:

Two other random notes discovered in my personal testing:

  1. The parsing library does not handle namespaced elements properly which means future inclusion will need to build this out in the library. I highlight this because the Annex I query parameter work may soon be included from the DASH-IF Ad Insertion work which requires namespacing. Note that with XML namespaces the namespace identifier in documents are arbitrary and part of the node definition. Example namespaced node also in the DASH Schema repo example_I1.mpd

  2. The parsing library does not correctly unescape XML characters, i.e. & => & does not occur rendering URLs invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed these ones. I did fix

@@ -1043,7 +1043,7 @@ function DashManifestModel() {
if (manifest && manifest.hasOwnProperty(Constants.LOCATION)) {
// for now, do not support multiple Locations -
// just set Location to the first Location.
manifest.Location = manifest.Location_asArray[0];
manifest.Location = manifest.Location[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unsafe re-assignment, it will cause the Location to be truncated on subsequent calls. With Location always an array this line should be omitted and the return should be shifted to return manifest.Location[0]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, did replace asArray without paying attention to this.
I fixed it

@@ -1055,7 +1055,7 @@ function DashManifestModel() {
function getPatchLocation(manifest) {
if (manifest && manifest.hasOwnProperty(DashConstants.PATCH_LOCATION)) {
// only include support for single patch location currently
manifest.PatchLocation = manifest.PatchLocation_asArray[0];
manifest.PatchLocation = manifest.PatchLocation[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

}

mimeTypeRegEx = (type !== Constants.TEXT) ? new RegExp(type) : new RegExp('(vtt|ttml)');

if (adaptation.Representation_asArray && adaptation.Representation_asArray.length && adaptation.Representation_asArray.length > 0) {
let essentialProperties = getEssentialPropertiesForRepresentation(adaptation.Representation_asArray[0]);
if (adaptation.Representation && adaptation.Representation.length && adaptation.Representation.length > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A theme I've observed across the review, there are a number of places that the dash.js code was guarding against the possibility of the element not being an array. With the forced array representation only an existence check is needed:

if(adaptation.Representation && adaptation.Representation.length > 0) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I did update these checks

@@ -106,7 +106,7 @@ class SimpleXPath {

// stop one early if this is the last element and an attribute
if (level !== this.path.length - 1 || !name.startsWith('@')) {
let children = parent[name + '_asArray'] || [];
let children = parent[name] || [];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change assumes that all children are now arrays, but only a limited set are based on the parsing logic. While some SimpleXPath tests are failing because the test expects need to be updated to use array notation: Period.AdaptationSet => Period[0].AdaptationSet[0].

The remainder still fail because of nodes that are not arrays, such as SegmentTemplate and SegmentTimeline. For those objects, this line will take the object direction and it will have fail the length check as it is a strict equality to zero causing the node search to fail.

Copy link
Contributor Author

@bbert bbert Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can fix it, for example by considering the list of node names that should be as array, as done in the parser.
But I would need some test streams to check. Do you have some?

@@ -26,7 +25,7 @@ function staticSegmentTimeline() {
function staticSegmentTemplate() {
let timeline = staticSegmentTimeline();
return {
SegmentTimeline: timeline, // purposely omit the _asArray to ensure single node case captured
SegmentTimeline: timeline,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem on this exact line, but need to highlight in this file

Note that the PatchHelper is explicitly generating the patches with well formed and ordered __children arrays because ordering of elements is significant for patch operations, they must be supplied sequentially not based on node type. Actual code reference: https://github.com/Dash-Industry-Forum/dash.js/blob/development/test/unit/helpers/PatchHelper.js#L87

The simplified output of the updated parsing library does not appear capable of representing this information.

@dsilhavy dsilhavy added this to the 4.1.0 milestone Jun 27, 2021
@dsilhavy dsilhavy removed this from the 4.1.0 milestone Sep 15, 2021
@dsilhavy dsilhavy added this to the 4.6.0 milestone Oct 10, 2022
@dsilhavy dsilhavy modified the milestones: 4.5.2, 4.6.0 Dec 8, 2022
@dsilhavy dsilhavy modified the milestones: 4.6.0, 5.0.0 Jan 26, 2023
// if we did have a positional reference we need to purge from array set and restore X2JS proper semantics
if (relativePosition != -1) {
let targetArray = target[name + '_asArray'];
if (typeof target[name] === Object) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (typeof target[name] === Object) {
if (typeof target[name] === 'object') {

typeof returns a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in PR #4180

@bbert bbert mentioned this pull request May 5, 2023
@bbert
Copy link
Contributor Author

bbert commented May 5, 2023

Closed, see #4180

@bbert bbert closed this May 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants