Kankids extractor #32825

Open
wants to merge 8 commits into master
1 change: 1 addition & 0 deletions youtube_dl/extractor/extractors.py
@@ -560,6 +560,7 @@
from .kakao import KakaoIE
from .kaltura import KalturaIE
from .kankan import KankanIE
from .kankids import KanKidsIE
from .karaoketv import KaraoketvIE
from .karrierevideos import KarriereVideosIE
from .keezmovies import KeezMoviesIE
76 changes: 76 additions & 0 deletions youtube_dl/extractor/kankids.py
@@ -0,0 +1,76 @@
# coding: utf-8
from __future__ import unicode_literals

from .common import InfoExtractor
import re

CONTENT_DIR = r'/content/kids/'
DOMAIN = r'kankids.org.il'
Comment on lines +7 to +8 (Contributor):

  1. You don't need raw strings unless there's a \ in the string.
  2. Possibly CONTENT_DIR is only ever used when preceded by DOMAIN and these could be one constant?
  3. Possibly the constant(s) could be class variables rather than module variables? (See the sketch after this list.)
  4. But maybe this will be irrelevant.
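
For points 2 and 3, a minimal sketch of how the constants might be merged and moved onto the class (the combined name _CONTENT_URL_BASE is hypothetical, and this is one possible arrangement rather than a requirement):

import re

from .common import InfoExtractor


class KanKidsIE(InfoExtractor):
    # plain (non-raw) string constant, scoped to the extractor class;
    # the merged name is hypothetical
    _CONTENT_URL_BASE = 'kankids.org.il/content/kids/'
    _VALID_URL = (r'https?://(?:www\.)?' + re.escape(_CONTENT_URL_BASE)
                  + r'(?P<category>[a-z]+)-main/(?P<id>[\w-]+)/(?P<season>\w+)?/?$')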



class KanKidsIE(InfoExtractor):
Contributor:

You've written the KanKidsPlaylistIE extractor!

There should also be a KanKidsIE that extracts single video pages. At the moment I think this is relying on the ld+json extraction in GenericIE, but that's a bit clumsy.

If there were a support request issue, I expect (my inference as a non-Hebrew speaker) that the requester would be asking to support your test playlist pages

And presumably also

And also, from those playlist pages, video pages like

I'll discuss the two cases in the next two posts.

Contributor:

Looking at the playlist pages, it seems that the video links are in anchor tags like this:

<a href="/content/kids/ktantanim-main/p-11732/s1/100026/" class="card-link" title="בית ספר לקוסמים | קורעים עיתונים | פרק 7">

The tactic for this sort of link is:

  1. find each tag
  2. pass the tag to extract_attributes() (so the order of the attributes isn't significant)
  3. urljoin() the page URL with the value of the href attribute
  4. skip the item if that URL is None
  5. extract any title
  6. pass the URL and any title to url_result()

But as the links belong to the site, we can do better, like this:
6a. check the URL with KanKidsIE._match_valid_url()
6b. skip the item if no match (or just do 6. as before)
6c. pass the URL with the ID from the match and KanKidsIE.ie_key() and any title to url_result()

At 1. we should have a utility function equivalent to JS document.querySelector(".card-link"), but instead we have to roll a pattern like:

# match group 1
r'(<a\s[^>]*\bclass\s*=\s*"card-link"[^>]*>)'
# or as the `class` attribute value can be a space-separated list
r'(<a\s[^>]*\bclass\s*=\s*("|\')\s*(?:\S+\s+)*?card-link(?:\s+\S+)*\s*\2[^>]*>)'

At 5. maybe factor out a pattern or routine for removing unwanted suffixes.
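
Putting steps 1 to 6c together, a minimal sketch (extract_attributes(), urljoin() and url_result() are the existing helpers named above; the method name _entries, the suffix trimming, and KanKidsIE standing for the single-video extractor discussed below are illustrative assumptions, not the required implementation):

import re

from .common import InfoExtractor
from ..utils import extract_attributes, urljoin


class KanKidsPlaylistIE(InfoExtractor):
    # sketch of a playlist helper; KanKidsIE here stands for the single-video
    # extractor discussed in the following comments (hypothetical at this point)
    def _entries(self, webpage, url):
        # 1. find each anchor tag carrying class="card-link"
        for tag in re.findall(r'(<a\s[^>]*\bclass\s*=\s*"card-link"[^>]*>)', webpage):
            # 2. parse the tag so the order of its attributes doesn't matter
            attrs = extract_attributes(tag)
            # 3./4. resolve href against the page URL; skip items without one
            video_url = urljoin(url, attrs.get('href'))
            if not video_url:
                continue
            # 5. take any title, trimming a trailing " | ..." suffix (illustrative pattern)
            title = re.sub(r'\s*\|.*$', '', attrs.get('title') or '') or None
            # 6a./6b. only keep links the single-video extractor would accept
            mobj = KanKidsIE._match_valid_url(video_url)
            if not mobj:
                continue
            # 6c. hand the URL, ID and title to url_result() with the target IE key
            yield self.url_result(
                video_url, ie=KanKidsIE.ie_key(),
                video_id=mobj.group('id'), video_title=title)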

Contributor:

Looking at the video pages, there is a <script> tag like this:

<script type="application/ld+json">
    {
        "@context": "https://schema.org",
        "@type": "VideoObject",

                "name": "בית ספר לקוסמים | הקלף החתוך | פרק 36",
                "description": "דיקו הקוסם ויואב השוליה מלמדים איך מנבאים בדיוק היכן יבחר המתנדב לחתוך את הקלפים",
                "thumbnailUrl": "https://kan-media.kan.org.il/media/omhdwt20/2019-2-25-imgid-5344_b.jpeg",
                "uploadDate": "2023-04-22T22:05:32+03:00",
                "contentUrl": "https://cdnapisec.kaltura.com/p/2717431/sp/271743100/playManifest/entryId/1_en2re1iu/format/applehttp/protocol/https/desiredFileName.m3u8",
                "embedUrl": "https://www.kankids.org.il/content/kids/ktantanim-main/p-11732/s1/100574/"
                }
</script>

The tactic for this is just to call self._search_json_ld(webpage, video_id, expected_type='VideoObject'). From the above JSON this will return a dict with url, timestamp, title, description and thumbnail set. To make a good return value, just update it with {'id': video_id}. (A sketch putting this and the optional steps together follows below.)

Optionally

  1. apply the common suffix removal process to the title
  2. pass the url to determine_ext(); if it's 'm3u8' (as above), pass the popped url to _extract_m3u8_formats(), pass the resulting formats to _sort_formats(), and update the info-dict with {'formats': formats}
  3. update the info-dict with any other known attributes available in the page, maybe from this JS fragment:
        dataLayer.push({
            ...
            season: '1',
            episode_number: '36',
            episode_title: 'בית ספר לקוסמים | הקלף החתוך | פרק 36',
            genre_tags: '',
            item_duration: '0',
            program_type: 'טלויזיה',
            program_genre: 'קטנטנים',
            program_format: 'סדרה',
            article_type: '',
            page_name: 'בית ספר לקוסמים | הקלף החתוך | פרק 36',
            program_name: 'בית ספר לקוסמים',
            channel_name: 'חינוכית - קטנטנים'
            ...
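
Putting the above together, a minimal, hedged sketch of a single-video KanKidsIE (the class shape, the _VALID_URL pattern and the suffix trimming are assumptions for illustration; _search_json_ld(), determine_ext(), _extract_m3u8_formats() and _sort_formats() are the existing helpers named above):

import re

from .common import InfoExtractor
from ..utils import determine_ext


class KanKidsIE(InfoExtractor):
    # hypothetical single-video pattern: .../content/kids/<category>-main/<series>/<season>/<episode>/
    _VALID_URL = r'https?://(?:www\.)?kankids\.org\.il/content/kids/[a-z]+-main/[\w-]+/\w+/(?P<id>[0-9]+)/?'

    def _real_extract(self, url):
        video_id = self._match_id(url)
        webpage = self._download_webpage(url, video_id)

        # the ld+json block above supplies title, description, thumbnail, timestamp and url
        info = self._search_json_ld(webpage, video_id, expected_type='VideoObject')
        info['id'] = video_id

        # optional 1: trim the common " | ..." suffix from the title (illustrative pattern)
        if info.get('title'):
            info['title'] = re.sub(r'\s*\|.*$', '', info['title'])

        # optional 2: expand an m3u8 contentUrl into concrete formats
        media_url = info.pop('url', None)
        if media_url and determine_ext(media_url) == 'm3u8':
            formats = self._extract_m3u8_formats(
                media_url, video_id, 'mp4', entry_protocol='m3u8_native', fatal=False)
            self._sort_formats(formats)
            info['formats'] = formats
        elif media_url:
            info['url'] = media_url

        return info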

    _VALID_URL = r'https?://(?:www\.)?' +\
        DOMAIN.replace('.', '\\.') + CONTENT_DIR +\
Contributor:

There's a more general solution to this:

Suggested change
DOMAIN.replace('.', '\\.') + CONTENT_DIR +\
re.escape(DOMAIN + CONTENT_DIR) +\

        r'(?P<category>[a-z]+)-main/(?P<id>[\w\-0-9]+)/(?P<season>\w+)?/?$'
    _TESTS = [
        {
            'url': 'https://www.kankids.org.il/content/kids/ktantanim-main/p-11732/',
            'info_dict': {
                '_type': 'playlist',
                'id': 'p-11732',
                'title': 'בית ספר לקוסמים',
            },
            'playlist_count': 60,
        },
        {
            'url': 'https://www.kankids.org.il/content/kids/hinuchit-main/cramel_main/s1/',
            'info_dict': {
                '_type': 'playlist',
                'id': 'cramel_main',
                'title': 'כראמל - עונה 1',
            },
            'playlist_count': 21,
        },
    ]

    def _real_extract(self, url):
        m = super()._match_valid_url(url)
Contributor:

  1. As we don't hate Python2, super() has to mention the class and instance: super(KanKidsIE, self), but...
  2. Just say this, since Python knows how to find the method to run from the class and its base classes:
Suggested change
m = super()._match_valid_url(url)
m = self._match_valid_url(url)

        series_id = m.group('id')
        category = m.group('category')
        playlist_season = m.group('season')
Comment on lines +38 to +40 (Contributor):

You can do this in one call:

Suggested change
series_id = m.group('id')
category = m.group('category')
playlist_season = m.group('season')
series_id, category, playlist_season = m.group('id', 'category', 'season')


        webpage = self._download_webpage(url, series_id)

        title_pattern = r'<title>(?P<title>.+) \|'
        series_title = re.search(title_pattern, webpage)
        if not series_title:
            series_title = re.search(title_pattern[:-1] + r'-', webpage)
        if series_title:
            series_title = series_title.group('title')
Comment on lines +44 to +49 (Contributor):

  1. Use _search_regex(), or when matching displayed content rather than inside HTML tags, _html_search_regex().
  2. Then alternative patterns can be in a tuple or list, if, unlike this case, the alternates can't be expressed easily in the pattern.

Like so:

Suggested change
title_pattern = r'<title>(?P<title>.+) \|'
series_title = re.search(title_pattern, webpage)
if not series_title:
series_title = re.search(title_pattern[:-1] + r'-', webpage)
if series_title:
series_title = series_title.group('title')
series_title = self._html_search_regex(
    r'<title\b[^>]*>([^>]+)(?:\s+[|-][^>]*)?</title', webpage, 'title')

I'll break that regex down using the x flag that allows comments and swallows insignificant white-space, in a multi-line string with '''s:

r'''(?x)
    <title\b[^>]*>   # allow the site to send attributes in the `title` tag
                     # `\b` matches only if the next character doesn't match `[a-zA-Z0-9_]`
                     # strictly we should exclude `-` too: `(?!-)\b`
    ([^>]+)          # `_html_search_regex()` finds group 1 by default
                     # match characters that are not (`^`) `<`: the tag content should end at `</title>`
                     # to skip everything after the first separator rather than the last, use `([^>]+?)`
                     # could allow for other embedded tags by using `(?:(?!</title)[\s\S])+` but:
                     # 1. the content of a `<title>` is just text
                     # 2. actually we (probably) never do that anyway
    (?:              # an unnumbered ("non-captured") group  
       \s+           # allow the site to send one or more white-space before any separator 
       [|-]          # separator is either `|` or `-`: inside `[...]`:
                     # 1. no need to escape special regex characters `|` (also `? . * + { } ( )`)
                     # 2. `-` must be first or last
       [^>]*         # again match not `>`
    )?               # optionally match group: maybe this title has no, or an unexpected, separator
    </title          # end of content
'''

But if/as the suffix removal should be a common pattern or routine, it would be better to extract the entire <title> content and then remove the suffix, perhaps with re.sub().
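
For instance, a small sketch of such a shared routine (the helper name _trim_title and the exact separator handling are assumptions for illustration; re.sub() is the only essential ingredient):

import re


def _trim_title(title):
    # drop everything from the first " | " separator onwards (illustrative choice;
    # the real separator handling should follow the site's conventions)
    return re.sub(r'\s*\|.*$', '', title or '').strip() or None


# usage, once the whole <title> content has been extracted:
# series_title = _trim_title(self._html_search_regex(
#     r'<title\b[^>]*>([^<]+)</title', webpage, 'title'))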


        season = playlist_season if playlist_season else r'(?P<season>\w+)'
        content_dir = CONTENT_DIR + category + r'-main/'
        playlist = set(re.findall(
Contributor:

set() is a good solution if you're concerned about duplicate URLs but won't help if there are different URLs for the same video (say, with different query parameters). In that case you have to extract a video ID for each URL and process the ID with set().
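
A minimal sketch of deduplicating on the extracted ID rather than on the raw URL (the tuple shape matches the patch's (season, id, title) matches; episode_pattern is a hypothetical name for the href pattern built below, and the dict-based approach is just one way to do it):

seen = {}
for season, video_id, title in re.findall(episode_pattern, webpage):
    # keep only the first match per video ID, so query-string variants collapse to one entry
    seen.setdefault(video_id, (season, video_id, title))
playlist = list(seen.values())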

            r'href="' + content_dir  # Content dir
Contributor:

See above for a preferable tactic here. Maybe a string format would have been better than concatenation, though.
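
For example, roughly (illustrative only, keeping the patch's group names and the variables already built above):

episode_href = (r'href="{0}{1}/{2}/(?P<id>[0-9]+)/"'
                r'.+title="(?P<title>.+)"').format(content_dir, series_id, season)
playlist = set(re.findall(episode_href, webpage))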

            + series_id + r'/'  # Series
            + season + r'/'  # Season
            + r'(?P<id>[0-9]+)/"'  # Episode
            + r'.+title="(?P<title>.+)"',  # Title
            webpage))

        entries = []
        content_dir = r'https://www.' + DOMAIN + content_dir
        for season, video_id, title in playlist if not playlist_season else map(lambda episode: (playlist_season,) + episode, playlist):
            entries.append(self.url_result(
                content_dir + season + r'/' + video_id + r'/',
                ie='Generic',
                video_id=video_id,
                video_title=title,
            ))

        return {
            '_type': 'playlist',
            'id': series_id,
            'title': series_title,
            'entries': entries,
        }