node-html-parser currently uses the following regex pattern to parse the tag name:
This is incorrect, because a tag name is not necessarily a custom element name; it can be the name of any element. The relevant part of the spec for parsing tag names is the tag name state: https://html.spec.whatwg.org/multipage/parsing.html#tag-name-state
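To make the spec rule concrete, here is a minimal sketch of how the tag open and tag name states read a start tag name. The helper name readTagName is hypothetical and is not part of node-html-parser, parse5 or htmlparser2; it ignores end tags, attributes and parse-error reporting, and only illustrates which characters terminate a name.

```javascript
// Hypothetical helper for illustration only; not an API of any library.
// Reads the name of a start tag that begins at html[start] === '<'.
function readTagName(html, start = 0) {
  if (html[start] !== '<') return null
  let i = start + 1
  // Tag open state: a start tag name must begin with an ASCII letter.
  if (i >= html.length || !/[A-Za-z]/.test(html[i])) return null
  let name = ''
  for (; i < html.length; i++) {
    const ch = html[i]
    // Tag name state: tab, LF, FF, space, "/" and ">" end the name.
    if (ch === '\t' || ch === '\n' || ch === '\f' || ch === ' ' || ch === '/' || ch === '>') {
      return name
    }
    if (ch === '\0') {
      // NULL is replaced with U+FFFD.
      name += '\uFFFD'
    } else {
      // ASCII uppercase letters are lowercased; anything else is appended as-is.
      name += /[A-Z]/.test(ch) ? ch.toLowerCase() : ch
    }
  }
  // Input ended inside the tag: no tag is emitted.
  return null
}

console.log(readTagName('<h@1>'))
```

Under this reading, characters such as "@" are ordinary tag name characters once the first letter has been consumed, which is why h@1 is accepted by spec-compliant parsers.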
Test case:
const parse = require('parse5').parse
const Parser = require('htmlparser2').Parser
const { parse: parseNhp } = require('node-html-parser')

// parse5: document > html > body > first child
const root2 = parse('<h@1>')
console.log('parse5:', root2.childNodes[0].childNodes[1].childNodes[0].nodeName)

// htmlparser2: log the name of every opening tag
const parser = new Parser({
  onopentag(name) {
    console.log('htmlparser2:', name)
  }
})
parser.write('<h@1>')
parser.end()

// node-html-parser: first child of the parsed root
const root = parseNhp('<h@1>')
console.log('node-html-parser:', root.childNodes[0].rawTagName)
Output:
parse5: h@1
htmlparser2: h@1
node-html-parser:
HTML:
<h@1>
As shown above, the tag name h@1 is parsed correctly by parse5, htmlparser2, Chrome and Firefox, but not by node-html-parser.
As for whether code containing h@1 is 'broken' or 'malformed' HTML: it is not. Although h@1 is not permitted by any content model, it is permitted inside elements with the 'nothing' content model.
The following code:
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>test</title>
  </head>
  <body>
    <template>
      <h@1>Smile!</h@1>
    </template>
  </body>
</html>
passes the HTML5 validator: