Kankids extractor #32825
@@ -0,0 +1,76 @@
# coding: utf-8
from __future__ import unicode_literals

from .common import InfoExtractor
import re

CONTENT_DIR = r'/content/kids/'
DOMAIN = r'kankids.org.il'


class KanKidsIE(InfoExtractor):

Review comment:
You've written the […] There should also be a […]
If there were a support request issue, I expect (my inference as a non-Hebrew speaker) that the requester would be asking to support your test playlist pages […]
And presumably also […]
And also, from those playlist pages, video pages like […]
I'll discuss the two cases in the next two posts.

Review comment:
Looking at the playlist pages, it seems that the video links are in anchor tags like this:
<a href="/content/kids/ktantanim-main/p-11732/s1/100026/" class="card-link" title="בית ספר לקוסמים | קורעים עיתונים | פרק 7">
The tactic for this sort of link is: […]
But as the links belong to the site, we can do better, like this: […]
At 1. we should have a utility function equivalent to JS […]
# match group 1
r'(<a\s[^>]*\bclass\s*=\s*"card-link"[^>]*>)'
# or as the `class` attribute value can be a space-separated list
r'(<a\s[^>]*\bclass\s*=\s*("|\')\s*(?:\S+\s+)*?card-link(?:\s+\S+)*\s*\2[^>]*>)'
At 5. maybe factor out a pattern or routine for removing unwanted suffixes.

Review comment:
Looking at the video pages, there is a
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "VideoObject",
"name": "בית ספר לקוסמים | הקלף החתוך | פרק 36",
"description": "דיקו הקוסם ויואב השוליה מלמדים איך מנבאים בדיוק היכן יבחר המתנדב לחתוך את הקלפים",
"thumbnailUrl": "https://kan-media.kan.org.il/media/omhdwt20/2019-2-25-imgid-5344_b.jpeg",
"uploadDate": "2023-04-22T22:05:32+03:00",
"contentUrl": "https://cdnapisec.kaltura.com/p/2717431/sp/271743100/playManifest/entryId/1_en2re1iu/format/applehttp/protocol/https/desiredFileName.m3u8",
"embedUrl": "https://www.kankids.org.il/content/kids/ktantanim-main/p-11732/s1/100574/"
}
</script>
The tactic for this is just to call […] Optionally […]
dataLayer.push({
...
season: '1',
episode_number: '36',
episode_title: 'בית ספר לקוסמים | הקלף החתוך | פרק 36',
genre_tags: '',
item_duration: '0',
program_type: 'טלויזיה',
program_genre: 'קטנטנים',
program_format: 'סדרה',
article_type: '',
page_name: 'בית ספר לקוסמים | הקלף החתוך | פרק 36',
program_name: 'בית ספר לקוסמים',
channel_name: 'חינוכית - קטנטנים'
...
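
As a rough illustration of the JSON-LD approach, the sketch below pulls the `VideoObject` out of a page using only the standard library. The helper name `extract_video_object`, the field mapping, and the trimmed sample page are assumptions made for this example; inside an extractor one would presumably rely on the project's own JSON-LD helper rather than hand-rolled parsing.

```python
import json
import re

# Trimmed sample modelled on the markup quoted in the comment above.
SAMPLE_PAGE = '''
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "Example episode",
  "uploadDate": "2023-04-22T22:05:32+03:00",
  "contentUrl": "https://cdnapisec.kaltura.com/p/2717431/sp/271743100/playManifest/entryId/1_en2re1iu/format/applehttp/protocol/https/desiredFileName.m3u8"
}
</script>
'''

# Capture the body of every ld+json script element.
JSON_LD_RE = re.compile(
    r'<script\b[^>]*\btype\s*=\s*["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE)


def extract_video_object(webpage):
    """Return a small info dict from the first VideoObject block, or None."""
    for raw in JSON_LD_RE.findall(webpage):
        try:
            data = json.loads(raw)
        except ValueError:
            # Skip blocks that are not valid JSON.
            continue
        if isinstance(data, dict) and data.get('@type') == 'VideoObject':
            return {
                'title': data.get('name'),
                'description': data.get('description'),
                'thumbnail': data.get('thumbnailUrl'),
                # Keep only the date part and drop the dashes: YYYYMMDD.
                'upload_date': (data.get('uploadDate') or '')[:10].replace('-', '') or None,
                'url': data.get('contentUrl'),
            }
    return None


info = extract_video_object(SAMPLE_PAGE)
print(info['title'], info['upload_date'])
```

The `contentUrl` here is an HLS manifest, so a real extractor would still need to expand it into formats; that step is outside this sketch.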
    _VALID_URL = r'https?://(?:www\.)?' +\
        DOMAIN.replace('.', '\\.') + CONTENT_DIR +\

Review comment:
There's a more general solution to this:
Suggested change: […]

        r'(?P<category>[a-z]+)-main/(?P<id>[\w\-0-9]+)/(?P<season>\w+)?/?$'
    _TESTS = [
        {
            'url': 'https://www.kankids.org.il/content/kids/ktantanim-main/p-11732/',
            'info_dict': {
                '_type': 'playlist',
                'id': 'p-11732',
                'title': 'בית ספר לקוסמים',
            },
            'playlist_count': 60,
        },
        {
            'url': 'https://www.kankids.org.il/content/kids/hinuchit-main/cramel_main/s1/',
            'info_dict': {
                '_type': 'playlist',
                'id': 'cramel_main',
                'title': 'כראמל - עונה 1',
            },
            'playlist_count': 21,
        },
    ]

    def _real_extract(self, url):
        m = super()._match_valid_url(url)

Review comment:
Suggested change: […]

        series_id = m.group('id')
        category = m.group('category')
        playlist_season = m.group('season')

Review comment (on lines +38 to +40):
You can do this in one call:
Suggested change: […]
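
A minimal sketch of the "one call" idea, assuming the elided suggestion relied on the fact that a match object's `group()` accepts several names at once and returns a tuple. The pattern below is a simplified stand-in for `_VALID_URL`, and plain `re.match` stands in for the extractor's own URL-matching helper.

```python
import re

# Simplified stand-in for the extractor's _VALID_URL (assumption for illustration).
VALID_URL = (
    r'https?://(?:www\.)?kankids\.org\.il/content/kids/'
    r'(?P<category>[a-z]+)-main/(?P<id>[\w-]+)/(?P<season>\w+)?/?$'
)

url = 'https://www.kankids.org.il/content/kids/hinuchit-main/cramel_main/s1/'
m = re.match(VALID_URL, url)

# group() with several names returns a tuple, so the three
# separate assignments collapse into a single unpacking.
series_id, category, playlist_season = m.group('id', 'category', 'season')
print(series_id, category, playlist_season)
```

A group that did not participate in the match (here, a missing season) comes back as `None` in the tuple, which matches the behaviour of the three separate calls.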

        webpage = self._download_webpage(url, series_id)

        title_pattern = r'<title>(?P<title>.+) \|'
        series_title = re.search(title_pattern, webpage)
        if not series_title:
            series_title = re.search(title_pattern[:-1] + r'-', webpage)
        if series_title:
            series_title = series_title.group('title')

Review comment (on lines +44 to +49):
[…]
Like so:
Suggested change: […]
I'll break that regex down using the […]
r'''(?x)
<title\b[^>]*> # allow the site to send attributes in the `title` tag
# `\b` matches only if the next character doesn't match `[a-zA-Z0-9_]`
# strictly we should exclude `-` too: `(?!-)\b`
([^<]+) # `_html_search_regex()` finds group 1 by default
# match characters that are not (`^`) `<`: the tag content should end at `</title>`
# to skip everything after the first separator rather than the last, use `([^<]+?)`
# could allow for other embedded tags by using `(?:(?!</title)[\s\S])+` but:
# 1. the content of a `<title>` is just text
# 2. actually we (probably) never do that anyway
(?: # an unnumbered ("non-captured") group
\s+ # allow the site to send one or more white-space before any separator
[|-] # separator is either `|` or `-`: inside `[...]`:
# 1. no need to escape special regex characters `|` (also `? . * + { } ( )`)
# 2. `-` must be first or last
[^>]* # again match not `>`
)? # optionally match group: maybe this title has no, or an unexpected, separator
</title # end of content
'''
But if/as the suffix removal should be a common pattern or routine, it would be better to extract the entire […]
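
To see the breakdown in action outside the extractor, here is a small stand-alone sketch using plain `re.search`. It uses the lazy `+?` form of group 1: in Python's `re`, a greedy group 1 followed by an optional group captures the whole title text, so the suffix would not be dropped. The helper name `page_title` and the sample titles are invented for this example.

```python
import re

# Lazy variant of the pattern discussed above: `+?` stops at the first separator.
TITLE_RE = re.compile(r'''(?x)
    <title\b[^>]*>      # opening tag, possibly with attributes
    ([^<]+?)            # group 1: the title text, as short as possible
    (?:
        \s+[|-]         # white-space followed by a `|` or `-` separator
        [^<]*           # the site suffix to discard
    )?
    </title             # end of content
''')


def page_title(webpage):
    """Return the <title> text without a trailing ' | suffix' or ' - suffix'."""
    m = TITLE_RE.search(webpage)
    return m.group(1).strip() if m else None


print(page_title('<title>Magic School | Example Site</title>'))
```

Because the separator must be preceded by white-space, a hyphen inside a word (as in `Spider-Man`) is not treated as a separator. A title that legitimately contains ` - ` would be cut at that point, which is the first-versus-last separator trade-off mentioned in the comment.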

        season = playlist_season if playlist_season else r'(?P<season>\w+)'
        content_dir = CONTENT_DIR + category + r'-main/'
        playlist = set(re.findall(

Review comment:
[…]

            r'href="' + content_dir  # Content dir

Review comment:
See above for a preferable tactic here. Maybe a string format would have been better than concatenation, though.

            + series_id + r'/'  # Series
            + season + r'/'  # Season
            + r'(?P<id>[0-9]+)/"'  # Episode
            + r'.+title="(?P<title>.+)"',  # Title
            webpage))

        entries = []
        content_dir = r'https://www.' + DOMAIN + content_dir
        for season, video_id, title in playlist if not playlist_season else map(lambda episode: (playlist_season,) + episode, playlist):
            entries.append(self.url_result(
                content_dir + season + r'/' + video_id + r'/',
                ie='Generic',
                video_id=video_id,
                video_title=title,
            ))

        return {
            '_type': 'playlist',
            'id': series_id,
            'title': series_title,
            'entries': entries,
        }
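
The review comment on playlist pages proposes matching the `card-link` anchors rather than concatenating a path pattern. A hedged sketch of that tactic with the standard library only follows; it uses the first (exact `class="card-link"`) pattern from the comment. The list-aware variant is left out here because, as written, its `\S+` can run past the end of a tag. The names `card_links` and `ATTR_RE` and the sample markup are made up for this sketch.

```python
import re
from urllib.parse import urljoin

# First pattern from the review comment: group 1 is the whole opening tag.
CARD_LINK_RE = re.compile(r'(<a\s[^>]*\bclass\s*=\s*"card-link"[^>]*>)')
# Hypothetical helper pattern: pick href/title attributes out of one tag.
ATTR_RE = re.compile(r'\b(href|title)\s*=\s*("|\')(.*?)\2')

SAMPLE = '''
<a href="/content/kids/ktantanim-main/p-11732/s1/100026/" class="card-link" title="Episode 7">
<a href="/about/" class="nav-link" title="About">
<a class="card-link" title="Episode 8" href="/content/kids/ktantanim-main/p-11732/s1/100027/">
'''


def card_links(webpage, base_url):
    """Yield (absolute_url, title) for each card-link anchor, in page order."""
    for tag in CARD_LINK_RE.findall(webpage):
        attrs = dict((name, value) for name, _quote, value in ATTR_RE.findall(tag))
        if 'href' in attrs:
            # Resolve site-relative links against the playlist page URL.
            yield urljoin(base_url, attrs['href']), attrs.get('title')


base = 'https://www.kankids.org.il/content/kids/ktantanim-main/p-11732/'
for link, title in card_links(SAMPLE, base):
    print(link, title)
```

Resolving each `href` against the page URL avoids rebuilding the absolute URL from `DOMAIN`, the content directory, season and episode id by hand.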

Review comment:
[…] `\` in the string. `CONTENT_DIR` is only ever used when preceded by `DOMAIN`, and these could be one constant?
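
The two remarks on `_VALID_URL` (escaping the domain, and merging `DOMAIN` with `CONTENT_DIR`) could be combined roughly as below. This assumes the elided "more general solution" was the standard `re.escape`, which is a guess; the constant name `BASE_PATH` is hypothetical.

```python
import re

# Hypothetical single constant replacing DOMAIN + CONTENT_DIR.
BASE_PATH = 'kankids.org.il/content/kids/'

# re.escape() quotes every regex-special character, not just '.',
# so the constant can be written as a plain string.
VALID_URL = (
    r'https?://(?:www\.)?%s'
    r'(?P<category>[a-z]+)-main/(?P<id>[\w\-0-9]+)/(?P<season>\w+)?/?$'
) % re.escape(BASE_PATH)

print(VALID_URL)
```

With the dots escaped, a look-alike host such as `kankidsXorg.il` no longer matches, while the two test URLs from the diff still do.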