getElementsByTagName doesn't work on some sites #1035

Open
sgehrman opened this issue Aug 25, 2020 · 2 comments

Comments

@sgehrman

I'm scraping websites to get the title, description, and other metadata, but it doesn't work on all sites.

For example:
https://www.youtube.com/watch?v=3AIZAGwMRg8

final List elements = document.head.getElementsByTagName('title');

elements is an empty list ([]).

But other sites work just fine, like https://apple.com

I'm also using:

  final List<Element> metas = document.head.getElementsByTagName('meta');

On that site, I'm also not seeing all of the meta tags.

@TheYuriG

TheYuriG commented Jan 15, 2021

It won't work because all of that is rendered through JavaScript, which this library does not run.
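To illustrate the point, here is a minimal sketch using `package:html`. The helper name `headTitles` and both HTML strings are made up for illustration. It only shows that the parser treats a `<script>` as inert text, so an element that a script would create at runtime never appears in the parsed tree.

```dart
import 'package:html/parser.dart' show parse;

/// Returns the text of every <title> element found in the parsed <head>.
/// (Hypothetical helper, not part of package:html.)
List<String> headTitles(String html) {
  final head = parse(html).head;
  if (head == null) return const [];
  return head.getElementsByTagName('title').map((e) => e.text).toList();
}

void main() {
  // The title is present in the markup itself.
  const staticPage =
      '<html><head><title>Static title</title></head><body></body></html>';

  // The title would only exist after a browser executed the script.
  const scriptedPage = '<html><head><script>'
      'const t = document.createElement("title");'
      't.textContent = "Scripted title";'
      'document.head.appendChild(t);'
      '</script></head><body></body></html>';

  print(headTitles(staticPage)); // [Static title]
  print(headTitles(scriptedPage)); // []
}
```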

Disable JavaScript before loading a page and then you can see what can be scraped and what cannot.

I installed a Chrome extension to do this (https://chrome.google.com/webstore/detail/toggle-javascript/cidlcjdalomndpeagkjpnefhljffbnlo), but you can also do it by pressing F12 to open DevTools and then pressing Ctrl + Shift + P to open the Command Menu. Type "javascript" and the Disable JavaScript option will show up.

If you NEED JavaScript, I recommend running a library like puppeteer first and then parsing the post-rendered HTML.

YouTube also has an API you can use instead of scraping their site. See if that fits your needs.
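A rough sketch of that render-then-parse approach is below. It assumes the Dart `puppeteer` package (the comment only says "puppeteer", so the choice of package is an assumption) and uses `https://example.com` as a placeholder URL. It needs a local Chromium download, so treat it as an outline rather than a tested recipe.

```dart
import 'package:html/parser.dart' show parse;
import 'package:puppeteer/puppeteer.dart';

Future<void> main() async {
  // Launch a headless browser so the page's scripts actually run.
  final browser = await puppeteer.launch();
  try {
    final page = await browser.newPage();

    // Placeholder URL; wait until network activity settles so that
    // script-generated markup has a chance to appear.
    await page.goto('https://example.com', wait: Until.networkIdle);

    // Serialized DOM after rendering, handed to package:html as usual.
    final renderedHtml = await page.content;
    final document = parse(renderedHtml);

    final titles = document.head?.getElementsByTagName('title') ?? [];
    print(titles.map((e) => e.text).toList());
  } finally {
    await browser.close();
  }
}
```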

@shawnlauzon

If you set the User-Agent header to that of a bot when retrieving the document, the server will return all of the tags.
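A minimal sketch of this, using `dart:io` for the request, might look like the following. The helper names `fetchHtml` and `metaTags` and the `ExampleBot/1.0` string are hypothetical. Which User-Agent values a given site treats as a bot is not stated in this thread, so you would need to substitute one that works for your target.

```dart
import 'dart:convert';
import 'dart:io';

import 'package:html/parser.dart' show parse;

/// Downloads [url] as text, sending [userAgent] as the User-Agent header.
Future<String> fetchHtml(Uri url, {required String userAgent}) async {
  final client = HttpClient()..userAgent = userAgent;
  try {
    final request = await client.getUrl(url);
    final response = await request.close();
    return await response.transform(utf8.decoder).join();
  } finally {
    client.close();
  }
}

/// Collects <meta> tags into a map keyed by their `name` or `property`.
Map<String, String> metaTags(String html) {
  final result = <String, String>{};
  for (final meta in parse(html).getElementsByTagName('meta')) {
    final key = meta.attributes['name'] ?? meta.attributes['property'];
    final content = meta.attributes['content'];
    if (key != null && content != null) result[key] = content;
  }
  return result;
}

Future<void> main() async {
  // Hypothetical bot identifier; replace with one the site recognizes.
  const botUserAgent = 'ExampleBot/1.0 (+https://example.com/bot)';
  final html = await fetchHtml(
    Uri.parse('https://www.youtube.com/watch?v=3AIZAGwMRg8'),
    userAgent: botUserAgent,
  );
  metaTags(html).forEach((key, value) => print('$key: $value'));
}
```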

@mosuem mosuem transferred this issue from dart-archive/html Oct 29, 2024