Downloading from Areena's Audio pages returns Document empty #261

Closed
idiootti opened this issue Jan 26, 2021 · 9 comments · Fixed by #300

Comments

@idiootti

idiootti commented Jan 26, 2021

I was trying to download https://areena.yle.fi/audio/1-50674174.

On OS X, this returned:

Traceback (most recent call last):
  File "/usr/local/Cellar/yle-dl/20201022/libexec/bin/yle-dl", line 11, in <module>
    load_entry_point('yle-dl==20201022', 'console_scripts', 'yle-dl')()
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/yledl.py", line 461, in main
    res = execute_action(url, action, io, httpclient, title_formatter,
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/yledl.py", line 300, in execute_action
    return download_clips(clips(), dl, io, title_formatter, stream_filters)
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/yledl.py", line 282, in clips
    return extractor.extract(url, stream_filters.latest_only,
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/extractors.py", line 244, in extract
    playlist = self.get_playlist(url)
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/extractors.py", line 264, in get_playlist
    playlist = self.get_playlist_old_style_url(
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/extractors.py", line 292, in get_playlist_old_style_url
    html = self.httpclient.download_html_tree(url)
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/http.py", line 57, in download_html_tree
    return lxml.html.fromstring(page)
  File "/usr/local/Cellar/yle-dl/20201022/libexec/vendor/lib/python3.9/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/Cellar/yle-dl/20201022/libexec/vendor/lib/python3.9/site-packages/lxml/html/__init__.py", line 763, in document_fromstring
    raise etree.ParserError(
lxml.etree.ParserError: Document is empty

I've updated everything, checked that everything's in place, and tested with both Homebrew and pip.

Downloading video works fine, so I assume this is an issue with lxml handling Areena's new audio pages.
Trying the addresses of single audio episodes didn't work either; it returned the same error.

@aajanki
Owner

aajanki commented Jan 27, 2021

Thanks for the report and the backtrace! Based on the error message, it looks like yle-dl got an empty HTTP response from the Areena server instead of the web page content it expected.

Can you try downloading again just to make sure it wasn't some kind of temporary problem at Areena? I'm asking this because downloading your example program works for me on Linux.

Are you able to download TV episodes or do you get a similar error also on videos?
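
To illustrate the diagnosis above: the backtrace shows lxml.html.fromstring raising "Document is empty" when it is handed an empty page. Below is a minimal sketch of how a caller might guard against that case. It is not yle-dl's actual code; `parse_if_not_empty` is a hypothetical helper, and `parse` is a stand-in for whatever HTML parser is in use.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def parse_if_not_empty(page: Optional[str], parse: Callable[[str], T]) -> Optional[T]:
    """Return parse(page), or None when the body is missing or blank.

    Hypothetical helper: `parse` stands in for a real HTML parser
    (for example lxml.html.fromstring), which rejects empty input.
    """
    # An empty or whitespace-only body has nothing to parse, so skip the parser.
    if page is None or not page.strip():
        return None
    return parse(page)
```

A caller could then treat a `None` result as "the server sent no content" and report that directly, rather than surfacing a parser exception.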

@idiootti
Author

I tried over a period of days and before and after re-installation, same response every time.

Video downloads, both individual and series work just fine.

wget downloads the complete page; only lxml seems to be playing up.

Info that may or may not be helpful is that I'm running yle-dl on Mac OS X 10.13.

@aajanki
Owner

aajanki commented Jan 29, 2021

Another strange thing is that the backtrace you posted shows a call to self.get_playlist_old_style_url(). It shouldn't reach that branch at all on audio URLs. So yle-dl has somehow misrecognized the URL.

I'm really struggling to come up with an explanation for why this would happen. The only reason I can think of is that you have an invisible control character somewhere in the URL. For example: https://areena.yle.fi/ audio/1-50674174 but with an invisible control character instead of the whitespace before "audio".
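
One way to check for such a character is sketched below. It uses only the standard library's unicodedata module; `find_invisible_chars` is a hypothetical helper written for this illustration and is not part of yle-dl.

```python
import unicodedata
from typing import List, Tuple


def find_invisible_chars(url: str) -> List[Tuple[int, str]]:
    """Return (index, name) for each control, format, or separator character."""
    suspicious = []
    for index, char in enumerate(url):
        category = unicodedata.category(char)
        # Cc = control, Cf = format (e.g. zero-width space), Z* = spaces/separators
        if category in ("Cc", "Cf") or category.startswith("Z"):
            # Control characters have no Unicode name, so fall back to the code point.
            name = unicodedata.name(char, f"U+{ord(char):04X}")
            suspicious.append((index, name))
    return suspicious
```

An empty list means the URL contains no characters from those categories; any entries give the position and name of each offending character.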

@idiootti
Author

idiootti commented Feb 1, 2021

No extra characters are present there. I've tried both copy-pasting the URL and typing it by hand, since I expected there would be something there. Nothing has worked so far.

This is actually odd: now that I look at other requests, the traceback no longer contains the self.get_playlist_old_style_url() call and instead shows this:

Traceback (most recent call last):
  File "/usr/local/Cellar/yle-dl/20201022/libexec/bin/yle-dl", line 11, in <module>
    load_entry_point('yle-dl==20201022', 'console_scripts', 'yle-dl')()
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/yledl.py", line 461, in main
    res = execute_action(url, action, io, httpclient, title_formatter,
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/yledl.py", line 300, in execute_action
    return download_clips(clips(), dl, io, title_formatter, stream_filters)
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/yledl.py", line 282, in clips
    return extractor.extract(url, stream_filters.latest_only,
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/extractors.py", line 244, in extract
    playlist = self.get_playlist(url)
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/extractors.py", line 846, in get_playlist
    if self.is_playlist(url):
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/extractors.py", line 853, in is_playlist
    html_tree = self.httpclient.download_html_tree(url)
  File "/usr/local/Cellar/yle-dl/20201022/libexec/lib/python3.9/site-packages/yledl/http.py", line 57, in download_html_tree
    return lxml.html.fromstring(page)
  File "/usr/local/Cellar/yle-dl/20201022/libexec/vendor/lib/python3.9/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/Cellar/yle-dl/20201022/libexec/vendor/lib/python3.9/site-packages/lxml/html/__init__.py", line 763, in document_fromstring
    raise etree.ParserError(
lxml.etree.ParserError: Document is empty

This happens with the new-style URL https://areena.yle.fi/audio/1-50674174.
The older-style URL seems to produce the traceback that goes through the function for old-style URLs.

@aajanki
Owner

aajanki commented Feb 7, 2021

Can you try with the latest version from the GitHub master branch? I fixed one issue that could potentially cause this problem.

@idiootti
Author

idiootti commented Feb 8, 2021

Thanks, that solves the issue. It still gives me WARNING: HTML parsing error: Document is empty but it downloads anyway.

One small issue remains: opening the playlist page https://areena.yle.fi/audio/1-50674174 only downloads the first episode for me; it doesn't continue after that. Individual episodes can be downloaded, though.

@aajanki
Owner

aajanki commented Feb 8, 2021

Thanks for testing. It seems that I still didn't manage to fix the error properly since it's still showing the warning and not downloading the full playlist.

I'll try to figure out a more correct fix, but it's challenging because I can't test it myself. If any Mac user with Python debugging skills wants to dive into this, help would be appreciated. :)

@akx
Contributor

akx commented Mar 3, 2022

Just bumped into this too (with https://areena.yle.fi/1-61070264).

yle-dl 20211213 says "WARNING: HTML parsing error: Document is empty" but downloads one episode anyway.
yle-dl 20220213 doesn't yield the same warning, but still downloads only one episode.

I can try to take a look :)

EDIT: curiously, visiting https://areena.yle.fi/1-61070264 in a browser redirects you to https://areena.yle.fi/audio/1-61070264; yle-dl parses that URL fine and downloads all episodes. I assume this has to do with the extractor not realizing it has been redirected.
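
A minimal sketch of the idea in the EDIT above: classify the URL the server finally serves rather than the one that was typed. This is not the actual fix; the function names, the "audio"/"other" labels, and the injected `resolve` callable are all assumptions made for illustration.

```python
from typing import Callable
from urllib.parse import urlparse


def extractor_kind(url: str) -> str:
    """Classify an Areena URL by its path (hypothetical labels)."""
    path = urlparse(url).path
    return "audio" if path.startswith("/audio/") else "other"


def extractor_kind_after_redirect(url: str, resolve: Callable[[str], str]) -> str:
    """Classify the final URL after redirects, not the URL as given.

    `resolve` maps a URL to the address it redirects to (or returns it unchanged).
    """
    return extractor_kind(resolve(url))
```

In practice `resolve` could be something like `lambda u: urllib.request.urlopen(u).geturl()`, which returns the URL after redirects have been followed. It is passed in here so that the sketch can be exercised without network access.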

akx added a commit to akx/yle-dl that referenced this issue Mar 3, 2022
@aajanki
Owner

aajanki commented Mar 3, 2022

This is now fixed thanks to @akx !
