Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for instagram.com user profiles and pages #134

Merged
merged 4 commits into from
Dec 9, 2018
Merged

Add support for instagram.com user profiles and pages #134

merged 4 commits into from
Dec 9, 2018

Conversation

iamleot
Copy link
Contributor

@iamleot iamleot commented Dec 8, 2018

The extractor scrapes instagram.com/<user>' timelines and instagram.com/p/' by mimicking the behaviour of a web
browser and extracting the sharedData JSON of the single pages.

Please note that this mean that for user timelines we also do an
extra request to the `instagram.com/p/' page but this
permit to have consistent (and all) information about the media
fetched.

The MD5 logic used for X-Instagram-GIS was documented in

https://stackoverflow.com/questions/49786980/

The current known (and documented via TODO comment) limitation of
the extractor is that it treats GraphSidecar' media (that contains multiple images) like GraphImage'. So, for `GraphSidecar' media - at the
moment -only an image is retrieved, not all of them.

(I would prefer to add support for it once this patch is reviewed
and imported!)

Thank you very much!

@iamleot
Copy link
Contributor Author

iamleot commented Dec 8, 2018

Whoops, sorry for that! (I've run the single tests for instagram but not test_extractor.py (`p/' is actually reserved but there are single letter users so we need to accept them).

I will fix that ASAP!

The extractor scrapes `instagram.com/<user>' timelines and
`instagram.com/p/<shortcode>' by mimicking the behaviour of a web
browser and extracting the sharedData JSON of the single pages.

Please note that this mean that for user timelines we also do an
extra request to the `instagram.com/p/<shortcode>' page but this
permit to have consistent (and all) information about the media
fetched.

The MD5 logic used for X-Instagram-GIS was documented in

 <https://stackoverflow.com/questions/49786980/>
@iamleot
Copy link
Contributor Author

iamleot commented Dec 8, 2018

I have fixed the pattern in InstagramProfilepageExtarctor and text_extractor.py tests now passes!

Unfortunately it seems that the first two tests of InstagramPostpageExtractor fails.
I guess that the URLs are not stable. Any suggestion on how to address that?
(I think that we can check their ID (media_id) and shortcodes (shortcode) instead).

URLs returned by instagram seems not stable so avoid testing for
them and instead test for keyword returned.
Also check the count of media returned.
@iamleot
Copy link
Contributor Author

iamleot commented Dec 8, 2018

I have adjusted the tests in order to avoid checking the URL for GraphImage and GraphSidecar. While here I have also added checks for keywords in GraphVideo too (in that case the URL is stable because it is a ytdl: one with just the shortcode).

I have added a second commit to also check the count of InstagramProfilepageExtractor.

Copy link
Owner

@mikf mikf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the instagram support! It mostly it looks really solid, but there some small things that should be changed.

For example extractor- and subcategory names: the current ones (postpage, profilepage) are a bit too verbose, I think. Just post and profile are better, but I would even consider calling them image and user to be in line with other extractors. Your choice.

I guess that the URLs are not stable. Any suggestion on how to address that?

Use pattern instead of url and specify a regex that should match all returned URLs. (example)

gallery_dl/extractor/instagram.py Outdated Show resolved Hide resolved
gallery_dl/extractor/instagram.py Outdated Show resolved Hide resolved
gallery_dl/extractor/instagram.py Outdated Show resolved Hide resolved
gallery_dl/extractor/instagram.py Outdated Show resolved Hide resolved
gallery_dl/extractor/instagram.py Outdated Show resolved Hide resolved
gallery_dl/extractor/instagram.py Outdated Show resolved Hide resolved
- Change description, subcategories to generate a better description in
  docs/supportedsite.rst
- Remove not needed InstagramExtractor.__init__()
- Use text.parse_int() instead of directly using int() (the former is more
  robust)
- Use self.request().json() instead of using json.loads() the
  self.request().text()
- Add `pattern:' to check the URLs where we do not have a stable URLs.
  It seems that only the subdomain is not stable.

Thanks to @mikf!
@iamleot
Copy link
Contributor Author

iamleot commented Dec 9, 2018 via email

@mikf mikf merged commit 2655a2e into mikf:master Dec 9, 2018
@Hrxn
Copy link
Contributor

Hrxn commented Dec 9, 2018

Very nice, thank you very much, both of you!

What you call GraphSidecar media here (that contains multiple images), doesn't this also apply to videos as well? I don't really use Instagram, but as far as I know, these kinds of post (I think Instagram originally called them carousel posts) allow up to 10 pages, i.e. 10 different items per single post page (and URL) which can either be pictures or videos, I think you can completely mix them at will.

Additional bonus comment:
I actually think that support for Insta in gallery-dl is a good idea, because if mikf would decide to run a poll on here including sites not yet implemented, I think this would come out on top 😅 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants