add extractors for fantia and fanbox #1459

thatfuckingbird · 2021-04-10T19:54:25Z

Implemented post and user extractors for Fantia and Fanbox. Both use cookies for auth.
Doing it in 1 PR because these are pretty similar.

I added tests for both, however I don't know how to pass cookies to test_results.py so I didn't manage to run the Fantia tests.
The tests use freely available content, but registration (and for Fantia, subscription to the free plan) is likely needed.

Note that the test urls contain NSFW content (it's hard to find freely available + sfw on these sites).

Closes #849.
Closes #1260.
Closes #739.

gallery_dl/extractor/fanbox.py

mikf

A huge thank you to you for adding extractors for both sites!
I hope I'm not too nitpicky about some of this ...

gallery_dl/extractor/fanbox.py

gallery_dl/extractor/fantia.py

thatfuckingbird · 2021-04-13T12:48:38Z

> I hope I'm not too nitpicky about some of this ...

Don't worry about it, it's helpful.

thatfuckingbird · 2021-04-13T18:03:14Z

ok, I think everything above is addressed

abslamp · 2021-04-15T15:04:01Z

Do you plan to support saving video links (maybe as metadata) in Fanbox? For some posts, the "type" in the JSON is "video", and the inner "body" object contains a "video" key with content like: "video":{"serviceProvider":"youtube","videoId":"youtubeId"}. This could be useful for invoking a post-processing action such as youtube-dl.

If you are interested I can share a dump with paid contents stripped, but so far I found no public sources.

thatfuckingbird · 2021-04-15T15:35:55Z

@abslamp Sure, I didn't know fanbox had this feature. If you can share the whole post response (with the paid urls removed) then I can add it.

abslamp · 2021-04-15T15:48:26Z

> @abslamp Sure, I didn't know fanbox had this feature. If you can share the whole post response (with the paid urls removed) then I can add it.

@thatfuckingbird Thanks for considering this!

URL: https://www.fanbox.cc/@ayumasayu/posts/1737774

Formatted JSON with paid contents removed:

{
   "body":{
      "id":"1737774",
      "title":"\u7483\u5948\u3061\u3083\u3093\u30bf\u30a4\u30e0\u30e9\u30d7\u30b9",
      "coverImageUrl":null,
      "feeRequired":864,
      "publishedDatetime":"2020-12-28T17:51:48+09:00",
      "updatedDatetime":"2021-01-02T09:31:12+09:00",
      "type":"video",
      "body":{
         "text":[REDACTED],
         "video":{
            "serviceProvider":"youtube",
            "videoId":[REDACTED, the last part of youtube's video url]
         }
      },
      "tags":[],
      "excerpt":[REDACTED],
      "isLiked":false,
      "likeCount":18,
      "commentCount":1,
      "restrictedFor":null,
      "user":{
         "userId":"2473967",
         "name":"\u3042\u3086\u307e\u7d17\u7531",
         "iconUrl":"https:\/\/pixiv.pximg.net\/c\/160x160_90_a2_g5\/fanbox\/public\/images\/user\/2473967\/icon\/fA0s7GFQgdcE21pwISHHmzvg.jpeg"
      },
      "creatorId":"ayumasayu",
      "hasAdultContent":true,
      "commentList": [REDACTED],
      "nextPost":{
         "id":"1745343",
         "title":"\u306e\u3063\u304b\u308a\u3042\u3059\u3061\u3083\u3093",
         "publishedDatetime":"2020-12-30 19:24:10"
      },
      "prevPost":{
         "id":"1734302",
         "title":"\u8868\u7d19\u69cb\u56f3\u6848\u30e9\u30d5",
         "publishedDatetime":"2020-12-27 17:53:37"
      },
      "imageForShare":"https:\/\/pixiv.pximg.net\/c\/1200x630_90_a2_g5\/fanbox\/public\/images\/creator\/2473967\/cover\/02RjvKWibZct6SWzEqUJ5Sms.jpeg"
   }
}

Please note that it does not provide the whole URL but splits it into two parts. For example, for https://www.youtube.com/watch?v=VjHMmF7HsoU (a random public video that appeared in my YouTube suggestions), it would look like:

"video":{
      "serviceProvider":"youtube",
      "videoId":"VjHMmF7HsoU"
}

In the actual post, the video is shown as an embedded YouTube video.
Maybe just add the two fields to the metadata and leave the rest for the user to decide?
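
As a rough illustration of rebuilding a full link from the two fields, here is a minimal sketch. The helper name `embed_url` and the template table are hypothetical. Only the YouTube form is confirmed by the example above; the Vimeo template is an assumption, and SoundCloud is left out because its ID format has not been established in this thread.

```python
# Hypothetical URL templates keyed by Fanbox's "serviceProvider" value.
# Only the YouTube form is confirmed by the example in this thread;
# the Vimeo template is an assumption.
EMBED_URL_TEMPLATES = {
    "youtube": "https://www.youtube.com/watch?v={}",
    "vimeo": "https://vimeo.com/{}",
}


def embed_url(video):
    """Return a full URL for a Fanbox "video" object, or None if the provider is unknown."""
    template = EMBED_URL_TEMPLATES.get(video.get("serviceProvider"))
    if template is None:
        return None
    return template.format(video["videoId"])


print(embed_url({"serviceProvider": "youtube", "videoId": "VjHMmF7HsoU"}))
```

Returning None for an unknown provider keeps the decision about unsupported services with the caller.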

thatfuckingbird · 2021-04-15T18:40:29Z

@abslamp Hmm yea I would just add the "video" object to the metadata, if present. Problem is, in this example gallery-dl would produce no files for this post, so there would be no metadata written at all in the end...

Maybe I could yield the youtube URL and worst case it could be captured with the 'write unsupported URLs to file' option, but that won't work for other serviceProviders....

abslamp · 2021-04-15T18:56:27Z

> @abslamp Hmm yea I would just add the "video" object to the metadata, if present. Problem is, in this example gallery-dl would produce no files for this post, so there would be no metadata written at all in the end...
>
> Maybe I could yield the youtube URL and worst case it could be captured with the 'write unsupported URLs to file' option, but that won't work for other serviceProviders....

@thatfuckingbird You can use this page for testing, but I don't suggest adding it to the tests because I will remove the page later. https://mkx-bot.fanbox.cc/posts/2130801

From the creator panel I found that currently fanbox supports YouTube, Vimeo, and SoundCloud only.

thatfuckingbird · 2021-04-15T19:45:36Z

@abslamp Thanks, I will look at it in the coming days.
Can you put up a vimeo and soundcloud link too? Or actually just checking what the videoId for those looks like would be enough, so I can generate a URL from it

mikf

I've been playing around with this a bit more and found a few things in regards to URL patterns and fetching results. In summary:

- it should be possible to save a lot of requests to https://api.fanbox.cc/post.info
- use self.request(url, params=params) instead of building URLs yourself
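
To illustrate why passing a params dict is preferable to string concatenation, here is a small standard-library sketch. It assumes that the HTTP layer behind self.request escapes query values in roughly the way `urlencode` does; `build_url` and the creator ID `a&b` are hypothetical and exist only for the demonstration.

```python
from urllib.parse import urlencode

API = "https://api.fanbox.cc/post.listCreator"


def build_url(url, params):
    """Append params as an escaped query string (roughly what an HTTP client does)."""
    return url + "?" + urlencode(params)


creator = "a&b"  # hypothetical creator ID containing a reserved character

# Manual concatenation: the "&" inside the ID starts a bogus extra parameter.
manual = API + "?creatorId=" + creator + "&limit=10"

# Passing a dict: the value is escaped and stays a single parameter.
encoded = build_url(API, {"creatorId": creator, "limit": 10})

print(manual)
print(encoded)
```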

gallery_dl/extractor/fanbox.py

mikf · 2021-04-15T19:04:50Z

gallery_dl/extractor/fanbox.py

+    def _pagination(self, url):
+        headers = {"Origin": self.root}
+
+        while url:
+            url = text.ensure_http_scheme(url)
+            body = self.request(url, headers=headers).json()["body"]
+            for item in body["items"]:
+                yield self._get_post_data(item["id"])
+
+            url = body["nextUrl"]


The items returned from https://api.fanbox.cc/post.listCreator?creatorId=USER&limit=10 appear to be, at least for xub.fanbox.cc, more or less the same as the single-item results from https://api.fanbox.cc/post.info?postId=ID (*) It's only missing comments, imageForShare, and the entries about next/previous posts.

If that's true in general (posts with videos, embeds, etc), we don't need to fetch data from /post.info for every post and can use the items returned from /post.listCreator directly.

(*) Diff "/post.listCreator" - "/post.info" for post 2059366

   "creatorId": "xub",
-  "hasAdultContent": true,
-  "commentList": {
-    "items": [],
-    "nextUrl": null
-  },
-  "nextPost": {
-    "id": "2085876",
-    "title": "Skeb Commission",
-    "publishedDatetime": "2021-04-03 04:56:12"
-  },
-  "prevPost": {
-    "id": "2009099",
-    "title": "ﾒｽｶﾞｷ〇〇★〇〇〇〇〇をわからせる絵",
-    "publishedDatetime": "2021-03-12 19:01:08"
-  },
-  "imageForShare": "https://pixiv.pximg.net/c/1200x630_90_a2_g5/fanbox/public/images/post/2059366/cover/JuTXNtvo1BRN93cLW371vVd6.jpeg"
+  "hasAdultContent": true
 }

thatfuckingbird · 2021-04-21

Seems like you are right; at least I couldn't find any posts where listCreator didn't have all the content. Updated the code to use that directly. It's easy enough to change it back if it turns out to be needed.
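
The approach agreed on here, yielding the listed items directly while following nextUrl, can be sketched without any network access. This is a simplified model, not the extractor's code: `paginate` takes a hypothetical `fetch_body` callable in place of self.request, and the page data is made up.

```python
def paginate(fetch_body, url):
    """Yield every item from a chain of pages linked by "nextUrl"."""
    while url:
        body = fetch_body(url)
        # Use the listed items as they are, with no extra request per post.
        yield from body["items"]
        url = body["nextUrl"]


# Made-up pages standing in for /post.listCreator responses.
PAGES = {
    "page1": {"items": [{"id": "1"}, {"id": "2"}], "nextUrl": "page2"},
    "page2": {"items": [{"id": "3"}], "nextUrl": None},
}

posts = list(paginate(PAGES.__getitem__, "page1"))
print([post["id"] for post in posts])
```

The trade-off is that this only works while the list endpoint returns complete post bodies; a later commit referenced on this pull request went back to fetching each post individually.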

gallery_dl/extractor/fanbox.py

gallery_dl/extractor/fantia.py

gallery_dl/extractor/fanbox.py

mikf · 2021-04-15T19:53:11Z

@thatfuckingbird @abslamp There should also be something useful in the test data from PixivUtil2: https://github.com/Nandaka/PixivUtil2/tree/master/test

abslamp · 2021-04-18T13:42:21Z

@thatfuckingbird SoundCloud and Vimeo links added. Seems that SoundCloud is not working correctly.

thatfuckingbird · 2021-04-21T21:01:04Z

Added support for embedded videos and addressed the comments above.

Looking at the pixivutil test json files, I realized that article type posts and the imageMap/fileMap things are not handled correctly in the fanbox downloader. I also want to do another round of testing with the new changes. Going to do these soon.

mikf · 2021-04-22T20:54:36Z

docs/configuration.rst

+extractor.fanbox.videos
+-----------------------
+Type
+    ``bool`` or ``string``
+Default
+    ``true``
+Description
+    Control behavior on videos embedded from external sites.
+    Recognizes embeds from YouTube, Vimeo and SoundCloud.
+
+    * ``true``: Extract video URLs (videos are not downloaded, as
+      gallery-dl does not support these sites natively)
+    * ``"ytdl"``: Download videos and let `youtube-dl`_ handle all of
+      video extraction and download
+    * ``false``: Ignore videos


I think this option should be simplified to just

- true: Download embedded media with youtube-dl (what is currently happening for "ytdl")
- false: Ignore videos

with just "Download embedded media from YouTube, Vimeo, and SoundCloud with youtube-dl" as description.
The current true would download the HTML content of those external sites and I don't think anyone would want that.

thatfuckingbird · 2021-04-22

Hmm, true shouldn't download HTML; instead the links show up in the unsupported URLs file (the --write-unsupported flag),
though I can remove it, I don't particularly mind either way.

mikf · 2021-04-22

You are right, they would land in the "unsupported" file. I didn't realize it's using Message.Queue.
But that wouldn't use youtube-dl to download them, either. It's Message.Url to download and Message.Queue to potentially spawn a new extractor.

I would still only have true or false as options (because it's easier).
It'd need to use Message.Url for ytdl:… URIs if you want to keep all three.
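
The three behaviours under discussion can be modelled roughly as follows. This is only a sketch: `Message` is a stand-in enum defined here rather than imported from gallery-dl, and `embed_message` is a hypothetical helper, not a function from the pull request.

```python
from enum import Enum


class Message(Enum):
    """Stand-in for gallery-dl's message types (not the real class)."""
    Url = "url"      # download this URL
    Queue = "queue"  # hand the URL to another extractor, if one matches


def embed_message(url, option):
    """Map the option value to the message an extractor would yield, or None."""
    if option == "ytdl":
        # "ytdl:" prefix plus Message.Url, as described in the comment above.
        return (Message.Url, "ytdl:" + url)
    if option:
        # Plain true: queue the URL; unsupported sites end up in the
        # "unsupported URLs" file instead of being downloaded.
        return (Message.Queue, url)
    return None  # false: ignore the embed


print(embed_message("https://www.youtube.com/watch?v=VjHMmF7HsoU", "ytdl"))
```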

thatfuckingbird · 2021-04-23T19:01:30Z

OK, almost done. Looked at pixivutil and based on the examples found there I added support for the remaining post types:

- "entry" type posts just contain a "html" entry with raw HTML in it, so images have to be extracted from that (I followed what pixivutil does here for exactly which types of image URLs are downloaded)

- "article" type posts seem to be similar in purpose but use a newer format, with paragraphs (plus styling info, etc.) as JSON objects. This is where fileMap/imageMap is used. Pixivutil actually parses the paragraphs from the JSON and generates HTML with the written post content, but I think it is better if gallery-dl just saves the whole article JSON into the metadata, so it can be postprocessed later by the user. Fortunately files/images can simply be downloaded based on the contents of imageMap/fileMap.
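
A minimal sketch of the "article" handling described above: collect download URLs from imageMap and fileMap and leave the article JSON untouched for the metadata. The helper name `article_urls` is hypothetical, and the per-entry field names "originalUrl" and "url" are assumptions about the API response that this thread does not confirm. The sample data is made up.

```python
def article_urls(body):
    """Collect download URLs from an article body's imageMap and fileMap."""
    urls = []
    for image in body.get("imageMap", {}).values():
        urls.append(image["originalUrl"])  # assumed field name
    for file in body.get("fileMap", {}).values():
        urls.append(file["url"])           # assumed field name
    return urls


# Made-up article body; "blocks" stands for the paragraph objects that
# would be saved to the metadata as they are.
body = {
    "blocks": [{"type": "p", "text": "hello"}],
    "imageMap": {"img1": {"originalUrl": "https://example.org/a.png"}},
    "fileMap": {"file1": {"url": "https://example.org/b.zip"}},
}

print(article_urls(body))
```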

Handling of embeds was also extended; it turns out there are other types of embeds, some of which gallery-dl can handle itself (e.g. twitter). I renamed the "videos" option to "embeds" but the idea is mostly the same. @abslamp You can delete the test posts now, thanks for the help.

Fortunately some of the posts that pixivutil tests use are public so I added some tests for the above.

Only one problem remains, that is the handling of Fanbox embeds (a fanbox post embedded in another). For some reason, if I yield a fanbox URL while parsing a fanbox post, it gets ignored (written into the unsupported urls file) despite the same code working with yielding twitter URLs (and if I manually run gallery-dl with the yielded Fanbox url then it is recognized correctly).
@mikf Can you take a look at this? The last test URL has a fanbox embed.

mikf · 2021-04-23T20:54:27Z

> if I yield a fanbox URL while parsing a fanbox post, it gets ignored

You need to specify the expected Extractor class as _extractor in the metadata dict:

             final_post["_extractor"] = FanboxCreatorExtractor

When there's no _extractor field or it is set to None, blacklist/whitelist apply, which ignores URLs for the same category by default.
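
The rule described here can be modelled with a short sketch. It is a simplified illustration, not gallery-dl's actual job code: the extractor class, `find_extractor`, and `resolve_child` are all stand-ins defined for this example, and the default filtering is reduced to "skip the parent's own category".

```python
class FanboxCreatorExtractor:
    """Stand-in for the real extractor class."""
    category = "fanbox"


def find_extractor(url):
    """Stand-in URL matcher: only recognizes fanbox.cc URLs."""
    return FanboxCreatorExtractor if "fanbox.cc" in url else None


def resolve_child(url, metadata, parent_category):
    """Pick the extractor class for a queued URL, or None if it is skipped."""
    extractor = metadata.get("_extractor")
    if extractor is not None:
        return extractor  # explicit class given: no filtering applies
    extractor = find_extractor(url)
    if extractor is None or extractor.category == parent_category:
        return None  # unsupported, or same category: ignored by default
    return extractor


url = "https://xub.fanbox.cc/"
print(resolve_child(url, {}, "fanbox"))
print(resolve_child(url, {"_extractor": FanboxCreatorExtractor}, "fanbox"))
```

In this model the first call is skipped because the queued URL belongs to the parent's category, while the second is accepted because the class was named explicitly.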

thatfuckingbird · 2021-04-23T21:46:50Z

Thanks, that did it. I think everything's done now.

mo-han · 2021-05-25T03:07:35Z

@thatfuckingbird
Could you please add parsing and downloading of those "profile" and "plans" images?
Currently I am saving those pictures by drag & drop, since there are just a few, but it would be much more convenient if it were automated.
Thanks.

thatfuckingbird added 3 commits April 10, 2021 21:41

add extractors for fantia and fanbox

1413e10

appease linter

3502643

make docstrings unique

9fac87a

mikf reviewed Apr 12, 2021

thatfuckingbird added 7 commits April 13, 2021 18:44

[fantia] refactor post extraction

7750441

[fantia] capitalize

bf6d6d9

[fantia] improve regex pattern

e0fa068

code style

2694c1e

capitalize

916a7f8

[fanbox] use BASE_PATTERN for url regexes

a4c8cc8

[fanbox] refactor metadata and post extraction

0f56a5e

mikf reviewed Apr 15, 2021

thatfuckingbird and others added 7 commits April 21, 2021 19:14

[fanbox] improve url base pattern

abdcea7

Co-authored-by: Mike Fährmann <[email protected]>

[fanbox] accept creator page links ending with /posts

aa4bdb8

Co-authored-by: Mike Fährmann <[email protected]>

[fanbox] more tests

4d6b94a

Co-authored-by: Mike Fährmann <[email protected]>

[fantia] improved pagination #1

f6006f0

Co-authored-by: Mike Fährmann <[email protected]>

[fantia] improved pagination #2

d23835c

Co-authored-by: Mike Fährmann <[email protected]>

[fantia] improved pagination #3

7d45676

Co-authored-by: Mike Fährmann <[email protected]>

[fanbox] misc. code logic improvements

afd0ee4

Co-authored-by: Mike Fährmann <[email protected]>

thatfuckingbird added 4 commits April 21, 2021 20:30

[fantia] finish restructuring pagination code

b5c81ed

[fanbox] avoid making a request for each individual post when processing a creator page

e21c3a4

[fanbox] support embedded videos

de0fd08

[fanbox] fix errors

f149242

[fanbox] document extractor.fanbox.videos

cac1316

mikf reviewed Apr 22, 2021

[fanbox] handle "article" and "entry" post types, all embeds

599a790

[fanbox] fix downloading of embedded fanbox posts

7c9e000

mikf merged commit e47952a into mikf:master Apr 25, 2021

mikf added a commit that referenced this pull request Mar 11, 2022

[fanbox] fetch data for each individual post (fixes #2388)

f31ab0d

Posts from 'https://api.fanbox.cc/post.listCreator' do not contain a 'body' with all images anymore. #1459 (comment)

mikf mentioned this pull request Jul 8, 2022

Fanbox gallery images downloading out of order (alphabetical hash) #2718

Open
