So long lxml, and thanks for all the fish #329

FiloSottile · 2012-04-10T21:37:47Z

There was a lot of confusion over HTML and XML parsing:

_unescapeHTML did the same thing as htmlentity_transform but using the undocumented HTMLParser.unescape
sometimes _unescapeHTML was used, sometimes the complex to read syntax of htmlentity_transform was used instead
we depended on lxml (non public domain, MIT licensed) only to do what in Javascript is getElementById
there wasn't a consistent way to handle HTML to make it human-readable

So here are my changes:

now we have get_element_by_id that uses a custom HTMLParser to extract the HTML tag with a specific id, so lxml is gone
_unescapeHTML now uses htmlentity_transform and I applied (and tested) it across the code
clean_html does the tag-stripping and newline-handling needed to get a pretty text out of a bunch of HTML

Have a look at ./youtube-dl --get-description http://www.youtube.com/watch?v=BQ80NtYmktQ

Before (without lxml):

Il nuovo pezzo di Mattia Sterpi Deejay Seguici su Facebook https://www.facebook.com/pages/Mattia-Sterpi-Deejay/121854024543063 e Twitter https://twitter.com/...

Before (with lxml):

Il nuovo pezzo di Mattia Sterpi DeejaySeguici su Facebook https://www.facebook.com/pages/Mattia-Sterpi-Deejay/121854024543063 e Twitter https://twitter.com/DjSterpiCompra il singolo su iTunes:http://itunes.apple.com/it/album/heart-of-mine-single/id494530985?ls=1Pubblicato: 21/01/2012℗ 2012 Il Chiosco

Now:

Il nuovo pezzo di Mattia Sterpi Deejay

Seguici su Facebook https://www.facebook.com/pages/Mattia-Sterpi-Deejay/121854024543063 e Twitter https://twitter.com/DjSterpi

Compra il singolo su iTunes:
http://itunes.apple.com/it/album/heart-of-mine-single/id494530985?ls=1

Pubblicato: 21/01/2012
℗ 2012 Il Chiosco

…HTML; fixed a bug in the use of _unescapeHTML (missing _, from d6a9615)

rmunn · 2012-04-14T08:46:38Z

This code will fail on descriptions that contain non-ASCII characters, such as this video whose description is mostly in Japanese:

rmunn@localhost:~/test$ ./youtube-dl -l -c --skip-download --write-info-json http://www.youtube.com/watch?v=9wMwplzBK-Y
[youtube] Setting language
[youtube] 9wMwplzBK-Y: Downloading video webpage
[youtube] 9wMwplzBK-Y: Downloading video info webpage
[youtube] 9wMwplzBK-Y: Extracting video information
Traceback (most recent call last):
  File "./youtube-dl", line 4760, in <module>
    main()
  File "./youtube-dl", line 4751, in main
    _real_main()
  File "./youtube-dl", line 4735, in _real_main
    retcode = fd.download(all_urls)
  File "./youtube-dl", line 964, in download
    ie.extract(url)
  File "./youtube-dl", line 1230, in extract
    return self._real_extract(url)
  File "./youtube-dl", line 1496, in _real_extract
    video_description = get_element_by_id("eow-description", video_webpage)
  File "./youtube-dl", line 256, in get_element_by_id
    parser.loads(html)
  File "./youtube-dl", line 210, in loads
    self.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 393, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

FiloSottile · 2012-04-14T09:43:08Z

Thanks. Already got it and fixed in my master branch. Backporting soon.
But I think we need a standardized way to decode downloaded pages
automatically.

[pull] master from ytdl-org:master

FiloSottile and others added 4 commits April 10, 2012 16:31

standardized the use of unescapeHTML; added clean_html()

d6a9615

removed dependency from lxml: added IDParser

c6f45d4

removed the undocumented HTMLParser.unescape, replaced with _unescape…

781cc52

…HTML; fixed a bug in the use of _unescapeHTML (missing _, from d6a9615)

ignore parsing errors in get_element_by_id()

7a8501e

FiloSottile mentioned this pull request May 1, 2012

Re-organized code and a lot of other stuff. #342

Merged

phihag merged commit 7a8501e into ytdl-org:master May 22, 2012

yan12125 mentioned this pull request Apr 21, 2016

why not using python builtin html unescape #9270

Closed

8 tasks

joedborg referenced this pull request in joedborg/youtube-dl Nov 17, 2020

Merge pull request #329 from ytdl-org/master

dfe23bf

[pull] master from ytdl-org:master

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

So long lxml, and thanks for all the fish #329

So long lxml, and thanks for all the fish #329

FiloSottile commented Apr 10, 2012

rmunn commented Apr 14, 2012

FiloSottile commented Apr 14, 2012

So long lxml, and thanks for all the fish #329

So long lxml, and thanks for all the fish #329

Conversation

FiloSottile commented Apr 10, 2012

rmunn commented Apr 14, 2012

FiloSottile commented Apr 14, 2012