Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<noscript> causes selector to fail #139

Open
nathan-osman opened this issue Dec 5, 2016 · 9 comments
Open

<noscript> causes selector to fail #139

nathan-osman opened this issue Dec 5, 2016 · 9 comments

Comments

@nathan-osman
Copy link

Consider the following program:

package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

const data = `<noscript><a href="http://example.org">click this link</a></noscript>`

func main() {
	d, err := goquery.NewDocumentFromReader(strings.NewReader(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	a, ok := d.Find("noscript a").Attr("href")
	fmt.Printf("URL: '%s', %t\n", a, ok)
}

The expected output is:

URL: 'http://example.org', true

But instead the output is:

URL: '', false

Changing noscript to div in both the document and selector causes the expected output, so the problem seems to affect only <noscript> elements.

@nathan-osman
Copy link
Author

After further investigation, the problem appeared to originate in Cascadia:

package main

import (
	"fmt"
	"strings"

	"github.com/andybalholm/cascadia"
	"golang.org/x/net/html"
)

const data = `<noscript><a href="http://example.org">click</a></noscript>`

func main() {
	n, err := html.Parse(strings.NewReader(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	s, err := cascadia.Compile("noscript a")
	if err != nil {
		fmt.Println(err)
	}
	fmt.Println(len(s.MatchAll(n)))
}

Before I could file a bug there, however, I came across this: andybalholm/cascadia#14

"The net/html parser parses the document as if javascript were enabled. Because of that, the contents of noscript elements are just a single text node, not parsed HTML elements."

Now it looks like the bug exists in the golang.org/x/net/html package. Indeed, there is an open bug there for this very problem: golang/go#16318

Sadly, it hasn't been fixed yet. 😢

@mna
Copy link
Member

mna commented Dec 5, 2016

Hello Nathan,

Thanks for looking into this. Makes sense that this is at the html parser level, would be nice if it provided the option to set javascript on or off for parsing. I'll keep the issue open until some decision is made in the parser.

Martin

@Arnold1
Copy link

Arnold1 commented Jan 8, 2017

just noticed the same issue :)

@machinae
Copy link

For those looking for a workaround, re-parsing the content of the noscript tag seems to do the trick.

s.Find("noscript").SetHtml(s.Find("noscript").Text())

@Arnold1
Copy link

Arnold1 commented Oct 23, 2018

@machinae cool thanks i will try it :) do you have an example which i can run?

@machinae
Copy link

machinae commented Oct 24, 2018

s in my example is any *goquery.Selection. Just add that line after loading the document

package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

const data = `<noscript><a href="http://example.org">click this link</a></noscript>`
func main() {
	d, err := goquery.NewDocumentFromReader(strings.NewReader(data))
	if err != nil {
		fmt.Println(err)
		return
	}
        d.Find("noscript").SetHtml(d.Find("noscript").Text())
	a, ok := d.Find("noscript a").Attr("href")
	fmt.Printf("URL: '%s', %t\n", a, ok)
}

@xeoncross
Copy link

@machinae wouldn't this set the contents of the first noscript as the text of all noscript tags combined? I would think getting the instance of the tag would be safer? (not tested)

d.Find("noscript").Each(func(i int, s *goquery.Selection) { 
    s.ReplaceWithHtml(s.Text())
})

(I don't use goquery so the above is just a guess)

@orestonce
Copy link

I resolved it with code below:

root := doc.Selection
root.Find(`noscript`).Each(func(i int, selection *goquery.Selection) {
	selection.SetHtml(selection.Text())
})

@mikestead
Copy link

Looks like there was a partial fix in the referenced issue, i.e. ParseOptionEnableScripting(bool) which would support disabling script emulation mode. From the last issue comment it only work when noscript is inside the head though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants