add extractors for fantia and fanbox #1459

Merged
merged 24 commits into from
Apr 25, 2021

Conversation

thatfuckingbird
Contributor

@thatfuckingbird thatfuckingbird commented Apr 10, 2021

Implemented post and user extractors for Fantia and Fanbox. Both use cookies for auth.
Doing it in 1 PR because these are pretty similar.

I added tests for both, however I don't know how to pass cookies to test_results.py so I didn't manage to run the Fantia tests.
The tests use freely available content, but registration (and for Fantia, subscription to the free plan) is likely needed.

Note that the test urls contain NSFW content (it's hard to find freely available + sfw on these sites).

Closes #849.
Closes #1260.
Closes #739.

Owner

@mikf mikf left a comment


A huge thank you for adding extractors for both sites!
I hope I'm not too nitpicky about some of this ...

@thatfuckingbird
Contributor Author

I hope I'm not too nitpicky about some of this ...

Don't worry about it, it's helpful.

@thatfuckingbird
Contributor Author

ok, I think everything above is addressed

@abslamp

abslamp commented Apr 15, 2021

Do you plan to support saving video links (maybe as metadata) in Fanbox? I think for some posts the type in the json is "video", and it contains a "video" key in body's body, with content like: "video":{"serviceProvider":"youtube","videoId":"youtubeId"}. Maybe useful to invoke a post action like youtube-dl.

If you are interested I can share a dump with paid contents stripped, but so far I found no public sources.

@thatfuckingbird
Contributor Author

@abslamp Sure, I didn't know fanbox had this feature. If you can share the whole post response (with the paid urls removed) then I can add it.

@abslamp

abslamp commented Apr 15, 2021

@abslamp Sure, I didn't know fanbox had this feature. If you can share the whole post response (with the paid urls removed) then I can add it.

@thatfuckingbird Thanks for considering this!

URL: https://www.fanbox.cc/@ayumasayu/posts/1737774

Formatted JSON with paid contents removed:

{
   "body":{
      "id":"1737774",
      "title":"\u7483\u5948\u3061\u3083\u3093\u30bf\u30a4\u30e0\u30e9\u30d7\u30b9",
      "coverImageUrl":null,
      "feeRequired":864,
      "publishedDatetime":"2020-12-28T17:51:48+09:00",
      "updatedDatetime":"2021-01-02T09:31:12+09:00",
      "type":"video",
      "body":{
         "text":[REDACTED],
         "video":{
            "serviceProvider":"youtube",
            "videoId":[REDACTED, the last part of YouTube's video URL]
         }
      },
      "tags":[
         
      ],
      "excerpt":[REDACTED],
      "isLiked":false,
      "likeCount":18,
      "commentCount":1,
      "restrictedFor":null,
      "user":{
         "userId":"2473967",
         "name":"\u3042\u3086\u307e\u7d17\u7531",
         "iconUrl":"https:\/\/pixiv.pximg.net\/c\/160x160_90_a2_g5\/fanbox\/public\/images\/user\/2473967\/icon\/fA0s7GFQgdcE21pwISHHmzvg.jpeg"
      },
      "creatorId":"ayumasayu",
      "hasAdultContent":true,
      "commentList": [REDACTED],
      "nextPost":{
         "id":"1745343",
         "title":"\u306e\u3063\u304b\u308a\u3042\u3059\u3061\u3083\u3093",
         "publishedDatetime":"2020-12-30 19:24:10"
      },
      "prevPost":{
         "id":"1734302",
         "title":"\u8868\u7d19\u69cb\u56f3\u6848\u30e9\u30d5",
         "publishedDatetime":"2020-12-27 17:53:37"
      },
      "imageForShare":"https:\/\/pixiv.pximg.net\/c\/1200x630_90_a2_g5\/fanbox\/public\/images\/creator\/2473967\/cover\/02RjvKWibZct6SWzEqUJ5Sms.jpeg"
   }
}

Please note that it does not provide the whole URL, but breaks it into 2 parts. For example, for https://www.youtube.com/watch?v=VjHMmF7HsoU (a random public video that appeared in my YouTube suggestions), it would look like:

"video":{
      "serviceProvider":"youtube",
      "videoId":"VjHMmF7HsoU"
}

In the actual post, the video is shown as an embedded YouTube video.
Maybe just add the 2 fields to the metadata and leave the rest for the user to decide?
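
A minimal sketch of how the two fields could be joined back into a full URL. The YouTube pattern follows the example above; the Vimeo and SoundCloud patterns and the helper name `embed_url` are assumptions for illustration, not something taken from the Fanbox API.

```python
# Hypothetical helper: rebuild a media URL from Fanbox's "video" object.
# Only the YouTube pattern is confirmed by the example above; the other
# two templates are assumptions about how those providers' IDs look.
URL_TEMPLATES = {
    "youtube": "https://www.youtube.com/watch?v={}",
    "vimeo": "https://vimeo.com/{}",            # assumption
    "soundcloud": "https://soundcloud.com/{}",  # assumption
}


def embed_url(video):
    """Return a full URL for a {"serviceProvider", "videoId"} dict, or None."""
    template = URL_TEMPLATES.get(video.get("serviceProvider"))
    video_id = video.get("videoId")
    if template is None or not video_id:
        return None  # unknown provider or missing ID: leave it to the user
    return template.format(video_id)


print(embed_url({"serviceProvider": "youtube", "videoId": "VjHMmF7HsoU"}))
```

Returning `None` for an unknown provider keeps the raw fields available in the metadata without guessing at a URL.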

@thatfuckingbird
Contributor Author

thatfuckingbird commented Apr 15, 2021

@abslamp Hmm yea I would just add the "video" object to the metadata, if present. Problem is, in this example gallery-dl would produce no files for this post, so there would be no metadata written at all in the end...

Maybe I could yield the youtube URL and worst case it could be captured with the 'write unsupported URLs to file' option, but that won't work for other serviceProviders....

@abslamp

abslamp commented Apr 15, 2021

@abslamp Hmm yea I would just add the "video" object to the metadata, if present. Problem is, in this example gallery-dl would produce no files for this post, so there would be no metadata written at all in the end...

Maybe I could yield the youtube URL and worst case it could be captured with the 'write unsupported URLs to file' option, but that won't work for other serviceProviders....

@thatfuckingbird You can use this page for testing, but I suggest not adding it to the tests because I will remove the page later: https://mkx-bot.fanbox.cc/posts/2130801

From the creator panel I found that currently fanbox supports YouTube, Vimeo, and SoundCloud only.

@thatfuckingbird
Contributor Author

@abslamp Thanks, I will look at it in the coming days.
Can you put up a vimeo and soundcloud link too? Or actually just checking what the videoId for those looks like would be enough, so I can generate a URL from it

Owner

@mikf mikf left a comment


I've been playing around with this a bit more and found a few things regarding URL patterns and fetching results. In summary:

  • it should be possible to save a lot of requests to https://api.fanbox.cc/post.info
  • use self.request(url, params=params) instead of building URLs yourself
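
To illustrate what a `params` dict buys you in the second point: quoting and joining are left to the library instead of being done by hand. This is a minimal standard-library sketch of the same idea; `build_url` is a hypothetical helper, and `self.request` itself is not reproduced here.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Join a base URL and a dict of query parameters, with proper escaping."""
    return base + "?" + urlencode(params)


# Endpoint and parameter names as they appear in this thread.
url = build_url(
    "https://api.fanbox.cc/post.listCreator",
    {"creatorId": "xub", "limit": 10},
)
print(url)
```

With a dict, values containing spaces or `&` are escaped automatically, which is easy to get wrong with string concatenation.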

gallery_dl/extractor/fanbox.py
Comment on lines 43 to 52
    def _pagination(self, url):
        headers = {"Origin": self.root}

        while url:
            url = text.ensure_http_scheme(url)
            body = self.request(url, headers=headers).json()["body"]
            for item in body["items"]:
                yield self._get_post_data(item["id"])

            url = body["nextUrl"]
Owner


The items returned from https://api.fanbox.cc/post.listCreator?creatorId=USER&limit=10 appear to be, at least for xub.fanbox.cc, more or less the same as the single-item results from https://api.fanbox.cc/post.info?postId=ID (*). They are only missing comments, imageForShare, and the entries about next/previous posts.

If that's true in general (posts with videos, embeds, etc), we don't need to fetch data from /post.info for every post and can use the items returned from /post.listCreator directly.

(*) Diff "/post.listCreator" - "/post.info" for post 2059366
   "creatorId": "xub",
-  "hasAdultContent": true,
-  "commentList": {
-    "items": [],
-    "nextUrl": null
-  },
-  "nextPost": {
-    "id": "2085876",
-    "title": "Skeb Commission",
-    "publishedDatetime": "2021-04-03 04:56:12"
-  },
-  "prevPost": {
-    "id": "2009099",
-    "title": "メスガキ〇〇★〇〇〇〇〇をわからせる絵",
-    "publishedDatetime": "2021-03-12 19:01:08"
-  },
-  "imageForShare": "https://pixiv.pximg.net/c/1200x630_90_a2_g5/fanbox/public/images/post/2059366/cover/JuTXNtvo1BRN93cLW371vVd6.jpeg"
+  "hasAdultContent": true
 }

Contributor Author


Seems like you are right, at least I couldn't find any posts where listCreator didn't have all the content. Updated the code to use that directly, it's easy enough to change it back if it turns out it's needed.
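
A rough sketch of the simplified pagination, assuming the list items are complete enough to use as-is. `fetch` is a hypothetical callable standing in for the extractor's request method, and the in-memory `PAGES` dict only imitates the `items`/`nextUrl` shape discussed above.

```python
def paginate(fetch, url):
    """Yield post items page by page, following "nextUrl" until it is null.

    `fetch` is a hypothetical callable that takes a URL and returns the
    decoded JSON response.
    """
    while url:
        body = fetch(url)["body"]
        # Use the list items directly instead of one /post.info call per post.
        yield from body["items"]
        url = body["nextUrl"]


# Tiny in-memory stand-in for the API, for demonstration only.
PAGES = {
    "page1": {"body": {"items": [{"id": "1"}, {"id": "2"}], "nextUrl": "page2"}},
    "page2": {"body": {"items": [{"id": "3"}], "nextUrl": None}},
}

posts = list(paginate(PAGES.__getitem__, "page1"))
print([post["id"] for post in posts])
```

This issues one request per page rather than one per post; switching back would only mean wrapping each yielded item in an extra lookup.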

@mikf
Owner

mikf commented Apr 15, 2021

@thatfuckingbird @abslamp There should also be something useful in the test data from PixivUtil2: https://github.com/Nandaka/PixivUtil2/tree/master/test

@abslamp

abslamp commented Apr 18, 2021

@thatfuckingbird SoundCloud and Vimeo links added. Seems that SoundCloud is not working correctly.

@thatfuckingbird
Contributor Author

Added support for embedded videos and addressed the comments above.

Looking at the pixivutil test json files, I realized that article type posts and the imageMap/fileMap things are not handled correctly in the fanbox downloader. I also want to do another round of testing with the new changes. Going to do these soon.

Comment on lines 950 to 964
extractor.fanbox.videos
-----------------------
Type
    ``bool`` or ``string``
Default
    ``true``
Description
    Control behavior on videos embedded from external sites.
    Recognizes embeds from YouTube, Vimeo and SoundCloud.

    * ``true``: Extract video URLs (videos are not downloaded, as
      gallery-dl does not support these sites natively)
    * ``"ytdl"``: Download videos and let `youtube-dl`_ handle all of
      video extraction and download
    * ``false``: Ignore videos
Owner


I think this option should be simplified to just

  • true: Download embedded media with youtube-dl (what is currently happening for "ytdl")
  • false: Ignore videos

with just "Download embedded media from YouTube, Vimeo, and SoundCloud with youtube-dl" as description.
The current true would download the HTML content of those external sites and I don't think anyone would want that.

Contributor Author


Hmm, true shouldn't download HTML; instead, the links show up in the unsupported URLs file (the --write-unsupported flag).
I can remove it though, I don't particularly mind either way.

Owner


You are right, they would land in the "unsupported" file. I didn't realize it's using Message.Queue.
But that wouldn't use youtube-dl to download them, either. It's Message.Url to download and Message.Queue to potentially spawn a new extractor.

I would still only have true or false as options (because it's easier).
It'd need to use Message.Url for ytdl:… URIs if you want to keep all three.
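
The distinction can be sketched with a toy model of the three-way option. The constants `URL` and `QUEUE` are placeholders standing in for Message.Url and Message.Queue, and `embed_messages` is a hypothetical function name; none of this is gallery-dl's actual code.

```python
# Stand-ins for gallery-dl's message types; the values are placeholders.
URL, QUEUE = "url", "queue"


def embed_messages(embed_url, embeds):
    """Return the (message type, URL) pairs to emit for one embedded media URL.

    `embeds` mirrors the three-way option discussed above:
    False, True, or "ytdl".
    """
    if not embeds:
        return []                             # ignore embeds entirely
    if embeds == "ytdl":
        return [(URL, "ytdl:" + embed_url)]   # download through youtube-dl
    return [(QUEUE, embed_url)]               # hand off to another extractor


print(embed_messages("https://www.youtube.com/watch?v=VjHMmF7HsoU", "ytdl"))
```

Under this model, a queued URL that no extractor recognizes is what ends up in the "unsupported" file.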

@thatfuckingbird
Contributor Author

thatfuckingbird commented Apr 23, 2021

OK, almost done. Looked at pixivutil and based on the examples found there I added support for the remaining post types:

- "entry" type posts just contain an "html" entry with raw HTML in it, so images have to be extracted from that (I followed what pixivutil does here for exactly which types of image URLs are downloaded)

- "article" type posts seem to be similar in purpose but use a newer format, with paragraphs (plus styling info, etc.) as JSON objects. This is where fileMap/imageMap is used. Pixivutil actually parses the paragraphs from the JSON and generates an HTML file with the written post content, but I think it is better if gallery-dl just saves the whole article JSON into the metadata, so it can be postprocessed later by the user. Fortunately, files/images can simply be downloaded based on the contents of imageMap/fileMap.
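
A minimal sketch of collecting download URLs from an "article" body. The key names `originalUrl` (for imageMap entries) and `url` (for fileMap entries), the helper name `article_urls`, and the sample data are all assumptions for illustration.

```python
def article_urls(body):
    """Collect download URLs from an "article" post body.

    Assumes imageMap entries carry "originalUrl" and fileMap entries
    carry "url"; both key names are assumptions, not confirmed here.
    """
    urls = [img["originalUrl"] for img in body.get("imageMap", {}).values()]
    urls += [f["url"] for f in body.get("fileMap", {}).values()]
    return urls


# Made-up sample body; the IDs and URLs are placeholders.
sample = {
    "blocks": [{"type": "p", "text": "hello"}],
    "imageMap": {"img1": {"originalUrl": "https://example.com/a.png"}},
    "fileMap": {"file1": {"url": "https://example.com/b.zip"}},
}

print(article_urls(sample))
```

The paragraph blocks are left untouched here, matching the idea of saving the whole article JSON into the metadata for later postprocessing.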

Handling of embeds was also extended. It turns out there are other types of embeds, some of which gallery-dl can handle itself (e.g. twitter). I renamed the "videos" option to "embeds" but the idea is mostly the same. @abslamp You can delete the test posts now, thanks for the help.

Fortunately some of the posts that pixivutil tests use are public so I added some tests for the above.

Only one problem remains, that is the handling of Fanbox embeds (a fanbox post embedded in another). For some reason, if I yield a fanbox URL while parsing a fanbox post, it gets ignored (written into the unsupported urls file) despite the same code working with yielding twitter URLs (and if I manually run gallery-dl with the yielded Fanbox url then it is recognized correctly).
@mikf Can you take a look at this? The last test URL has a fanbox embed.

@mikf
Owner

mikf commented Apr 23, 2021

if I yield a fanbox URL while parsing a fanbox post, it gets ignored

You need to specify the expected Extractor class as _extractor in the metadata dict:

             final_post["_extractor"] = FanboxCreatorExtractor

When there's no _extractor field or it is set to None, blacklist/whitelist apply, which ignores URLs for the same category by default.
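
As a toy model of that rule (explicitly not gallery-dl's code): an explicit `_extractor` bypasses filtering, while a missing or `None` value falls back to a lookup that skips the current category. The classes, `find`, and `pick_extractor` are all hypothetical stand-ins.

```python
class FanboxPost:          # stand-in classes, for illustration only
    category = "fanbox"


class TwitterTweet:
    category = "twitter"


def find(url):
    """Hypothetical URL-to-extractor lookup."""
    if "fanbox.cc" in url:
        return FanboxPost
    if "twitter.com" in url:
        return TwitterTweet
    return None


def pick_extractor(url, metadata, current_category):
    """Decide which extractor-like class should handle a queued URL."""
    forced = metadata.get("_extractor")
    if forced is not None:
        return forced                  # explicit class bypasses filtering
    extractor = find(url)
    if extractor is not None and extractor.category == current_category:
        return None                    # same category is ignored by default
    return extractor


print(pick_extractor("https://x.fanbox.cc/posts/1", {}, "fanbox"))
```

This reproduces the symptom from the question: a Twitter URL yielded from a Fanbox post is picked up, while a Fanbox URL is dropped unless `_extractor` is set.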

@thatfuckingbird
Contributor Author

Thanks, that did it. I think everything's done now.

@mikf mikf merged commit e47952a into mikf:master Apr 25, 2021
@mo-han
Contributor

mo-han commented May 25, 2021

@thatfuckingbird
Could you please add parsing and downloading of the "profile" and "plans" images?
Currently I am saving those pictures by drag and drop, since there are just a few, but it would be much more convenient if it were automated.
Thanks.

mikf added a commit that referenced this pull request Mar 11, 2022
5 participants