Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

So long lxml, and thanks for all the fish #329

Merged
merged 4 commits into from
May 22, 2012

Conversation

FiloSottile
Copy link
Collaborator

There was a lot of confusion over HTML and XML parsing:

  • _unescapeHTML did the same thing as htmlentity_transform but using the undocumented HTMLParser.unescape
  • sometimes _unescapeHTML was used, sometimes the complex to read syntax of htmlentity_transform was used instead
  • we depended on lxml (non public domain, MIT licensed) only to do what in Javascript is getElementById
  • there wasn't a consistent way to handle HTML to make it human-readable

So here are my changes:

  • now we have get_element_by_id that uses a custom HTMLParser to extract the HTML tag with a specific id, so lxml is gone
  • _unescapeHTML now uses htmlentity_transform and I applied (and tested) it across the code
  • clean_html does the tag-stripping and newline-handling needed to get a pretty text out of a bunch of HTML

Have a look at ./youtube-dl --get-description http://www.youtube.com/watch?v=BQ80NtYmktQ

Before (without lxml):

Il nuovo pezzo di Mattia Sterpi Deejay Seguici su Facebook https://www.facebook.com/pages/Mattia-Sterpi-Deejay/121854024543063 e Twitter https://twitter.com/...

Before (with lxml):

Il nuovo pezzo di Mattia Sterpi DeejaySeguici su Facebook https://www.facebook.com/pages/Mattia-Sterpi-Deejay/121854024543063 e Twitter https://twitter.com/DjSterpiCompra il singolo su iTunes:http://itunes.apple.com/it/album/heart-of-mine-single/id494530985?ls=1Pubblicato: 21/01/2012℗ 2012 Il Chiosco

Now:

Il nuovo pezzo di Mattia Sterpi Deejay

Seguici su Facebook https://www.facebook.com/pages/Mattia-Sterpi-Deejay/121854024543063 e Twitter https://twitter.com/DjSterpi

Compra il singolo su iTunes:
http://itunes.apple.com/it/album/heart-of-mine-single/id494530985?ls=1

Pubblicato: 21/01/2012
℗ 2012 Il Chiosco

@rmunn
Copy link

rmunn commented Apr 14, 2012

This code will fail on descriptions that contain non-ASCII characters, such as this video whose description is mostly in Japanese:

rmunn@localhost:~/test$ ./youtube-dl -l -c --skip-download --write-info-json http://www.youtube.com/watch?v=9wMwplzBK-Y
[youtube] Setting language
[youtube] 9wMwplzBK-Y: Downloading video webpage
[youtube] 9wMwplzBK-Y: Downloading video info webpage
[youtube] 9wMwplzBK-Y: Extracting video information
Traceback (most recent call last):
  File "./youtube-dl", line 4760, in <module>
    main()
  File "./youtube-dl", line 4751, in main
    _real_main()
  File "./youtube-dl", line 4735, in _real_main
    retcode = fd.download(all_urls)
  File "./youtube-dl", line 964, in download
    ie.extract(url)
  File "./youtube-dl", line 1230, in extract
    return self._real_extract(url)
  File "./youtube-dl", line 1496, in _real_extract
    video_description = get_element_by_id("eow-description", video_webpage)
  File "./youtube-dl", line 256, in get_element_by_id
    parser.loads(html)
  File "./youtube-dl", line 210, in loads
    self.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 393, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

@FiloSottile
Copy link
Collaborator Author

Thanks. Already got it and fixed in my master branch. Backporting soon.
But I think we need a standardized way to decode downloaded pages
automatically.

@phihag phihag merged commit 7a8501e into ytdl-org:master May 22, 2012
joedborg referenced this pull request in joedborg/youtube-dl Nov 17, 2020
[pull] master from ytdl-org:master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants