HTML detector function #75

Closed
notriddle opened this issue Oct 24, 2017 · 10 comments

@notriddle
Member

As nice as it is to be able to remove the blatantly bad stuff, sometimes you don't want the user to be able to enter any HTML at all. You could do this by escaping the markup, but if having a database with `&lt;` in it doesn't appeal to you, or you're worried about double-escaping or similarly nasty accidents, you could use a function that just tells you if a string has any HTML tags in it. And in case you wonder why anybody would want a library for that, it's not actually that easy to detect HTML without any false positives or false negatives.

We should be able to do this pretty easily:

  • Parse the string as an HTML snippet
  • Return true if the resulting HTML fragment contains exactly one node, and that node is of type TEXT
@notriddle notriddle added this to the 1.2 milestone Oct 24, 2017
@lnicola
Member

lnicola commented Oct 24, 2017

Another similar use case might be "strip HTML and keep only the text", e.g. go from `This is <span>some</span> text` to `This is some text`, though I'm not sure how that should work for all elements.

@notriddle
Member Author

notriddle commented Oct 24, 2017

Just use the regular mode and set the allowed_elements to the empty set.

Maybe we should add a Builder::empty() constructor?

@lnicola
Member

lnicola commented Oct 24, 2017

Right, that works. I thought that would remove the text in those elements, but it looks like it's kept.

@stanciuadrian
Contributor

It seems that html5ever gives the same tree structure for `s` and `<html>s</html>`:

 NodeData::Document
 +-+ NodeData::Element with name:html
   +-- NodeData::Text with contents:s

@notriddle
Member Author

You need to parse it as a document fragment instead of as the whole document. That'll prevent html5ever from inferring the document structure.

@lnicola
Member

lnicola commented Oct 29, 2017

@notriddle That happens even with parse_fragment as used in make_parser.

@notriddle
Member Author

That's annoying. It's supposed to be parsing the fragment as if it was inside a <div> tag.

@lnicola
Member

lnicola commented Oct 30, 2017

I opened servo/html5ever#323 about this.

@notriddle
Member Author

notriddle commented Oct 30, 2017

We should probably use https://play.rust-lang.org/?gist=dd917d5f859ac115971d8355e5ab3dd2&version=stable instead, then. I was wrong about fragment parsers. Sorry.

@Ygg01

Ygg01 commented Oct 31, 2017

Just a small warning. The code I whipped up is pretty basic, feel free to change it around. E.g. remove println, maybe use StrTendril instead of ByteTendril, etc.

@notriddle notriddle removed this from the 1.2 milestone Jul 18, 2018
bors bot added a commit that referenced this issue Jun 3, 2021
139: Detect html (see #75) r=notriddle a=mozfreddyb

Having looked at the TokenSink example in #75, this seemed mostly straightforward.
I found a couple of other nits along my way that I figured I might just pick up and do within this pull request, but I'm happy to drop them or turn them into individual PRs, if I must.



Co-authored-by: Frederik Braun
Co-authored-by: Michael Howell
@bors bors bot closed this as completed in ddb5159 Jun 3, 2021
bors bot added a commit that referenced this issue Apr 4, 2022
157: Add empty() constructor r=notriddle a=nrempel

Hey there,

This pull request adds a new `empty` constructor. This was mentioned in #75 (comment) but never implemented.

I need this so that I can obtain `&Builder` and not `&mut Builder` which is returned when modifying the tags with `tags()`.

I'm happy to add an additional test as well if you like but the doctest seemed sufficient to me.

What do you think about this change? Thanks.

Co-authored-by: Nicholas Rempel