[reddit] Support of standalone submissions on personal pages of users #2301

Ailothaen · 2022-02-13T15:35:15Z

While trying to download a submission on a user's personal page (these are submissions that are not bound to a subreddit but to an user, such as this one: https://www.reddit.com/user/TheSpiritTree/comments/srilyf/this_is_a_random_image_posted_for_automation/ ), I noticed that there was no sub-extractor of Reddit for this purpose (there is one for a whole user's page section, but not for single submissions)

This commit adds a class RedditUserSubmissionExtractor which takes care of this, providing the pattern for this situation.

Examples before then after the changes:

root@hestia:~ # python3 -m gallery_dl "https://www.reddit.com/user/TheSpiritTree" --verbose
[gallery-dl][debug] Version 1.20.4
[gallery-dl][debug] Python 3.7.3 - Linux-5.4.51-v7+-armv7l-with-debian-10.6
[gallery-dl][debug] requests 2.27.1 - urllib3 1.26.8
[gallery-dl][debug] Starting DownloadJob for 'https://www.reddit.com/user/TheSpiritTree'
[reddit][debug] Using RedditUserExtractor for 'https://www.reddit.com/user/TheSpiritTree'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): oauth.reddit.com:443
[urllib3.connectionpool][debug] https://oauth.reddit.com:443 "GET /user/TheSpiritTree/.json?limit=100&raw_json=1 HTTP/1.1" 200 5734
./gallery-dl/reddit/u_TheSpiritTree/srilyf This is a random image posted for automation testing purposes..jpg

root@hestia:~ # python3 -m gallery_dl "https://www.reddit.com/user/TheSpiritTree/comments/srilyf/this_is_a_random_image_posted_for_automation/" --verbose
[gallery-dl][debug] Version 1.20.4
[gallery-dl][debug] Python 3.7.3 - Linux-5.4.51-v7+-armv7l-with-debian-10.6
[gallery-dl][debug] requests 2.27.1 - urllib3 1.26.8
[gallery-dl][debug] Starting DownloadJob for 'https://www.reddit.com/user/TheSpiritTree/comments/srilyf/this_is_a_random_image_posted_for_automation/'
[reddit][debug] Using RedditUserExtractor for 'https://www.reddit.com/user/TheSpiritTree/comments/srilyf/this_is_a_random_image_posted_for_automation/'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): oauth.reddit.com:443
[urllib3.connectionpool][debug] https://oauth.reddit.com:443 "GET /user/TheSpiritTree/comments/.json?limit=100&raw_json=1 HTTP/1.1" 200 4488
[reddit][info] No results for https://www.reddit.com/user/TheSpiritTree/comments/srilyf/this_is_a_random_image_posted_for_automation/

root@hestia:~ # python3 -m gallery_dl "https://www.reddit.com/user/TheSpiritTree" --verbose
[gallery-dl][debug] Version 1.20.4
[gallery-dl][debug] Python 3.7.3 - Linux-5.4.51-v7+-armv7l-with-debian-10.6
[gallery-dl][debug] requests 2.27.1 - urllib3 1.26.8
[gallery-dl][debug] Starting DownloadJob for 'https://www.reddit.com/user/TheSpiritTree'
[reddit][debug] Using RedditUserExtractor for 'https://www.reddit.com/user/TheSpiritTree'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): oauth.reddit.com:443
[urllib3.connectionpool][debug] https://oauth.reddit.com:443 "GET /user/TheSpiritTree/.json?limit=100&raw_json=1 HTTP/1.1" 200 5734
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): i.redd.it:443
[urllib3.connectionpool][debug] https://i.redd.it:443 "GET /8fpgv17yqlh81.jpg HTTP/1.1" 200 975588
./gallery-dl/reddit/u_TheSpiritTree/srilyf This is a random image posted for automation testing purposes..jpg

root@hestia:~ # python3 -m gallery_dl "https://www.reddit.com/user/TheSpiritTree/comments/srilyf/this_is_a_random_image_posted_for_automation/" --verbose
[gallery-dl][debug] Version 1.20.4
[gallery-dl][debug] Python 3.7.3 - Linux-5.4.51-v7+-armv7l-with-debian-10.6
[gallery-dl][debug] requests 2.27.1 - urllib3 1.26.8
[gallery-dl][debug] Starting DownloadJob for 'https://www.reddit.com/user/TheSpiritTree/comments/srilyf/this_is_a_random_image_posted_for_automation/'
[reddit][debug] Using RedditUserSubmissionExtractor for 'https://www.reddit.com/user/TheSpiritTree/comments/srilyf/this_is_a_random_image_posted_for_automation/'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): oauth.reddit.com:443
[urllib3.connectionpool][debug] https://oauth.reddit.com:443 "GET /comments/srilyf/.json?limit=0&raw_json=1 HTTP/1.1" 200 1676
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): i.redd.it:443
[urllib3.connectionpool][debug] https://i.redd.it:443 "GET /8fpgv17yqlh81.jpg HTTP/1.1" 200 975588

I had to add a $ at the end of the regex for RedditUserExtractor since its regex was catching the URL before the new class otherwise.

Please note that I filled the test variable for this class, but:

I am unsure about how to launch the tests and how to fill them as well, so I could not test it, this surely has to be changed
The URL I implemented is bound to the account "TheSpiritTree" (which I own). I am not going to remove this submission if it is going to be used in the code, but still feel free to change the URL to something more "resilient".

mikf · 2022-02-13T17:16:29Z

You can run tests with make test in the root directory, or by directly running each test_*.py file in the test directory. Running extractor result tests only works like that (running all takes more than 30 minutes), but you can select specific extractor categories with the second argument:

$ python test/test_results.py reddit
$ python test/test_results.py reddit:usersubmission

The one test you added runs successfully when testing it on my machine.

Also, maybe you could just extend the regex pattern of the existing submission extractor instead of creating a new one using basically the same code.

I think that would just be changing

gallery-dl/gallery_dl/extractor/reddit.py

Lines 199 to 201 in b17e2dc

    
           pattern = (r"(?:https?://)?(?:" 
        
                      r"(?:\w+\.)?reddit\.com/(?:r/[^/?#]+/comments|gallery)" 
        
                      r"|redd\.it)/([a-z0-9]+)")

to

    pattern = (r"(?:https?://)?(?:"
               r"(?:\w+\.)?reddit\.com/(?:(?:r|u|user)/[^/?#]+"
               "/comments|gallery)|redd\.it)/([a-z0-9]+)")

(r to (?:r|u|user))

The tests on GitHub failed because the comment in https://github.com/mikf/gallery-dl/pull/2301/files#diff-638fcf254368bdeef37f0ad5e3150e8575e2d5619f4815134f6ea2d6e880a923R249 is too long. flake8 complains about all lines longer than 79 characters.

Ailothaen · 2022-02-13T18:23:06Z

Thank you for the suggestions. I definitely agree that modifying the pattern of RedditSubmissionExtractor is a better design, since in the end, a user submission still remains a submission. (The only point of concern would be that one would not currently be able to distinguish between subreddits and usernames, so it may cause confusion if someone downloads from both usernames and subreddits with the same names)

I am going to push a commit that goes on this design (and corrects the linter issue)

Thank you as well for the guidance on how to run tests. I feel like this should be useful to put on a wiki page/documentation related to development (such as the answer of rachmadaniHaryono related to my question with the Extractor class back then)

[reddit] Support of submissions on personal pages of users

c4d3afc

[reddit] Design improvement for user submissions

e077344

[reddit] Removed functions declared twice

b9f5273

mikf merged commit 203a04a into mikf:master Feb 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[reddit] Support of standalone submissions on personal pages of users #2301

[reddit] Support of standalone submissions on personal pages of users #2301

Ailothaen commented Feb 13, 2022 •

edited

Loading

mikf commented Feb 13, 2022

Ailothaen commented Feb 13, 2022

[reddit] Support of standalone submissions on personal pages of users #2301

[reddit] Support of standalone submissions on personal pages of users #2301

Conversation

Ailothaen commented Feb 13, 2022 • edited Loading

mikf commented Feb 13, 2022

Ailothaen commented Feb 13, 2022

Ailothaen commented Feb 13, 2022 •

edited

Loading