bug : files on chan archives with four letter extension saved as three letter extensions 404 out #5116

Closed
stubkan opened this issue Jan 26, 2024 · 4 comments

stubkan commented Jan 26, 2024

I noticed that scraping some 4chan archives that host WebM files causes the downloader to fail with a 404 on those files. It occurs on thebarchive.com.

However, those files do exist on the archives; for some reason they are saved as .web rather than .webm.

https://thebarchive.com/b/full_image/9999999999999.webm

causes the downloader to fail with a 404 error, but the file exists as

https://thebarchive.com/b/full_image/9999999999999.web

which can be downloaded (and then renamed manually afterwards).

Owner

mikf commented Jan 26, 2024

I downloaded ~50 .web WEBMs from /b/ and none of them caused a 404.
Could you post a URL where this error happens?

Regarding renaming those files, you could use the extension-map option for that.
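
For files that have already been downloaded, the same kind of extension mapping can also be applied by hand. The following is a minimal Python sketch of that idea, not gallery-dl's implementation; the function name `rename_extensions` and the `downloads` directory in the usage line are hypothetical.

```python
from pathlib import Path


def rename_extensions(directory, mapping):
    """Rename files in `directory` whose extension is a key in `mapping`.

    `mapping` maps old extensions to new ones without the leading dot,
    e.g. {"web": "webm"}. Returns the list of new paths.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        old_ext = path.suffix[1:].lower()  # "" when there is no extension
        new_ext = mapping.get(old_ext)
        if new_ext is None:
            continue
        target = path.with_suffix("." + new_ext)
        if target.exists():
            # Never overwrite an existing file; skip instead.
            continue
        path.rename(target)
        renamed.append(target)
    return renamed
```

A call such as `rename_extensions("downloads", {"web": "webm"})` would rename only the truncated `.web` files in that one directory and leave everything else untouched.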

Author

stubkan commented Jan 26, 2024

Let's use this thread as an example: https://archived.moe/b/thread/912594917/

There are two WebMs in it. When you attempt to open the links:

[image]

they don't exist. I see now that this is a failure of the archive sites to provide the correct URL, since you can reproduce it in a browser without gallery-dl. The files are actually accessible if the ending is changed to .web, and they load then.
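
The workaround described above (retrying with the extension cut to three letters) can be sketched in Python. This is only an illustration of the idea under the assumption that the archive truncates extensions to three characters; it is not how gallery-dl handles the problem, and the helper name `fallback_urls` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def fallback_urls(url, truncated_length=3):
    """Return `url`, followed by a variant whose file extension is cut to
    `truncated_length` characters when the extension is longer than that."""
    parts = urlsplit(url)
    head, dot, ext = parts.path.rpartition(".")
    candidates = [url]
    # Only treat the text after the last dot as an extension if it belongs
    # to the final path segment.
    if dot and "/" not in ext and len(ext) > truncated_length:
        short_path = head + "." + ext[:truncated_length]
        candidates.append(urlunsplit(parts._replace(path=short_path)))
    return candidates
```

A downloader could try the candidates in order and keep the first one that does not return a 404; for the `.webm` URL from the first comment, the second candidate is the `.web` URL that actually exists.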

archived.moe is the only /b/ archive I know of that allows you to search /b/; the others don't, so one is more or less forced to go through it to wade through the content.

Were you getting the webms from the 4chan site, and not an archive site?

Owner

mikf commented Jan 26, 2024

Oh, you are using archived.moe as the input URL. I'd been downloading from thebarchive.com directly, and that works just fine, even for your example thread:

$ gallery-dl https://thebarchive.com/b/thread/912594917/
thebarchive/b/912594917 WEBMs?/1705625299234839 pepe-agony.gif
thebarchive/b/912594917 WEBMs?/1705625431133806 754247383421.webm
thebarchive/b/912594917 WEBMs?/1705626190307840 find me when you wake up.webm

Well, at least I've got a reproducible error now. I'll look into it.

Author

stubkan commented Jan 26, 2024

Yes, all the content on archived.moe is hosted on thebarchive; archived.moe only saves the HTML and thumbnails.

But people are forced to use archived.moe to find content anyway, since it is the only /b/ archive that has indexed searching.

mikf closed this as completed Jan 27, 2024
bradenhilton pushed a commit to bradenhilton/gallery-dl that referenced this issue Feb 5, 2024