A Go CLI application to scrap various websites in order to get high-quality movie snapshots. See the list of Supported Websites.
There are various ways to install or use the application:
Download the latest binary from the releases page for your OS. Then you can simply execute the binary like this:
# list all the implemented scrapers
./moviestills --list
# You can also use environment variables
# instead of CLI arguments.
# scrap the blubeaver website with default settings
WEBSITE=blubeaver ./moviestills
See Usage to check what settings you can pass through CLI arguments or environment variables.
Docker comes to the rescue, providing an easy way how to run moviestills
on most platforms.
The "latest" image is built from the master branch on every push. You can see all the other tags (releases) available here.
docker run \
--name moviestills \
--pull=always \
--volume "${PWD}/cache:/app/cache" \
--volume "${PWD}/data:/app/data" \
--rm ghcr.io/kinoute/moviestills:latest \
--website movie-screencaps \
--async
You can can see all the image tags available on the Docker Hub here.
docker run \
--name moviestills \
--pull=always \
--volume "${PWD}/cache:/app/cache" \
--volume "${PWD}/data:/app/data" \
--env WEBSITE=blubeaver \
--env ASYNC=true \
--rm hivacruz/moviestills:latest
As you can see, you can also use environment variables instead of CLI arguments.
By default, the docker run
command above will always pull
before running to get the latest image changes for the specified tag. If you don't like this behavior, remove the --pull=always
flag from the command.
Output of ./moviestills --help
:
Usage: moviestills [--website WEBSITE] [--list] [--parallel PARALLEL] [--delay DELAY] [--async] [--timeout TIMEOUT] [--proxy PROXY] [--cache-dir CACHE-DIR] [--data-dir DATA-DIR] [--hash] [--debug] [--no-colors] [--no-style]
Options:
--website WEBSITE, -w WEBSITE
Website to scrap movie stills on [env: WEBSITE]
--list, -l List all available scrapers implemented [default: false, env: LIST]
--parallel PARALLEL, -p PARALLEL
Limit the maximum parallelism [default: 2, env: PARALLEL]
--delay DELAY, -r DELAY
Add some random delay between requests [default: 1s, env: RANDOM_DELAY]
--async, -a Enable asynchronus running jobs [default: false, env: ASYNC]
--timeout TIMEOUT, -t TIMEOUT
Set the default request timeout for the scraper [default: 15s, env: TIMEOUT]
--proxy PROXY, -x PROXY
The proxy URL to use for scraping [env: PROXY]
--cache-dir CACHE-DIR, -c CACHE-DIR
Where to cache scraped websites pages [default: cache, env: CACHE_DIR]
--data-dir DATA-DIR, -f DATA-DIR
Where to store movie snapshots [default: data, env: DATA_DIR]
--hash Hash image filenames with md5 [default: false, env: HASH]
--debug, -d Set Log Level to Debug to see everything [default: false, env: DEBUG]
--no-colors Disable colors from output [default: false, env: NO_COLORS]
--no-style Disable styling and colors entirely from output [default: false, env: NO_STYLE]
--help, -h display this help and exit
--version display version and exit
Note: CLI arguments will always override environment variables. Therefore, if you set WEBSITE
as an environment variable and also use —website
as a CLI argument, only the latter will be passed to the app.
For boolean arguments such as --async
or --debug
, their equivalent as environment variables is, for example, ASYNC=false
or ASYNC=true
.
The simple goal of this CLI app is to scrap movie snapshots from a few supported websites. You can use either the binary or the Docker image to run moviestills
.
To see the list of supported websites on which we can download movie snapshots, you can do:
# see all supported websites
./moviestills --list
# or
LIST=true ./moviestills
The list of supported websites with way more details can also be seen below.
Asynchronous jobs make the scraping faster. You can also increase the maximum number of simultaneous requests by changing the --parallel
flag or the PARALLEL
environment variable (set to 2
by default).
# with CLI arguments
./moviestills --website film-grab --async
# with environment variables
WEBSITE=film-grab ASYNC=true ./moviestills
# increase to 10 the maximum number of simultaneous requests
./moviestills --website dvdbeaver --async --parallel 10
A complete example that scraps a website with:
- Docker ;
- Asynchronous jobs ;
- Random delays between requests ;
- Hashed filenames ;
- No colors on the terminal output.
# docker with CLI arguments
docker run \
--name moviestills \
--pull=always \
--volume "${PWD}/cache:/app/cache" \
--volume "${PWD}/data:/app/data" \
--rm ghcr.io/kinoute/moviestills:latest \
--website movie-screencaps \
--async \
--delay 5s \
--hash \
--no-colors
With Docker, settings can also be set with environment variables instead of CLI arguments with the —env
or -e
flag.
You can set up a proxy URL to use for scraping using the --proxy
CLI agument or the PROXY
environment variable. At the moment, you can set only one proxy but the app might support multiple proxies in a round robin fashion later.
By default, every scraped page will be cached in the cache
folder. You can change the name or path to the folder through the options with —cache-dir
or the CACHE_DIR
environment variable. This is an important folder as it stores everything that was scraped.
It avoids requesting again some websites pages when there is no need to. It is a nice thing as we don't want to flood these websites with thousands of useless requests. It is also handy to continue an early-stopped scraping job.
In case you are using our Docker image to run moviestills
, don't forget to change the volume path to the new internal cache
folder, if you set up a custom internal cache
folder. But you should not bother editing this internal cache
folder anyway, since you have volumes and can set the desired path on your host machine for the cache folder.
By default, each scraped website will have its own subfolder in the data
folder. Inside, every movie will have its own folder with the scraped movie snapshots found on the website.
Example:
data # where to store movie snapshots
├── blubeaver # website's name
│ ├── 12\ Angry\ Men # movie's title
│ │ ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_1.jpg
│ │ ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_2.jpg
│ │ ├── film3_blu_ray_reviews55_12_angry_men_blu_ray_large_large_12_angry_men_blu_ray_3.jpg
You can change the default data
folder with the —data-dir
CLI argument or the DATA_DIR
environment variable.
If you use our Docker image to run moviestills
, don't forget to change the volume path in case you edited the internal data
folder. Again, you should not even bother editing the internal data
folder's path or name anyway as you have volumes to store and get access to these files on the host machine.
To get some consistency, you can use the MD5 hash function to normalize image filenames. All images will then use 32 hexadecimal digits as filenames. To enable the hashing, use the —hash
CLI argument or the HASH=true
environment variable.
As today, scrapers were implemented for the following websites in moviestills
:
Website | Simplified Name 1 | Description | Movies 3 |
---|---|---|---|
BluBeaver | blubeaver | Extension of DVDBeaver, this time dedicated to Blu-Ray reviews only. Reviews are great to check the quality of BD releases with lot of technical details. Only snapshots on "free" access are scraped. | ~5069 |
BlusScreens | bluscreens | Website with high resolution screen captures taken directly from different Blu-ray releases by Blusscreens. | ~452 |
DVDBeaver | dvdbeaver | Not recommended. 2 A massive list of DVD reviews with a lot of movie snapshots. This task only includes the DVD reviews. BD snapshots are available in BluBeaver. It is advised to use the latter instead. | ~4098 |
EvanERichards | evanerichards | A short but interesting list of movies with a lot of snapshots for each. Also includes some TV Series but they are ignored by the scraper. | ~245 |
Film-Grab | film-grab | A great list of movies with a few snapshots for each. Snapshots were cherry-picked and show nice cinematography. | ~2829 |
HighDefDiscNews | highdefdiscnews | A few hundreds movies featured with high-quality snapshots (png, lossless) in native resolution. | ~209 |
Movie-Screencaps | movie-screencaps | Website with DVD, BD, and 4K BD movie snapshots. Since thousands of snapshots are available for each movie (one per second or so), we only take some of them per paginated page. | ~715 |
ScreenMusings | screenmusings | A small list of movies but with nice cherry-picked snapshots. | ~260 |
StillsFrmFilms | stillsfrmfilms | A very small list of movies but, again, the snapshots were nicely chosen and depict perfectly the atmosphere of each movie. | ~63 |
1 : The simplified name column displays the website's name to use with moviestills
to start the scraping job. Eg —website blubeaver
for the BluBeaver.ca website.
2 : While DVDBeaver provides a lot of movie snapshots from great DVD reviews, it is harder to filter correctly the images on the reviews pages. Expect a lot of false positives (DVD covers, banners etc) and average quality overall.
3 : Approximate number of movies calculated on October 5th, 2021.
Contribute: If you want to fix or add a new website to the scraper, please read how to set up a development workflow and how to contribute.
If you already have Go installed – 1.22+ supported –, you can simply clone the repo and run the application from the folder:
# clone the repo
git clone https://github.com/kinoute/moviestills
# go inside the project
cd moviestills
Then:
# download the dependencies
go mod download
go mod tidy
# you can run the app without compiling
go run . --help
# optional: compile the binary
go build -v
# use the compiled app (example on Linux)
./moviestills --help
If you don't have Go installed, you can build a Docker Image to start developing in a Go environment. To do that, clone the repository and inside the folder:
# build development image
docker build --tag moviestills-dev . \
--target base
Then start and go inside the container:
# start the development container
# and go inside it
docker run \
--name moviestills-dev \
--volume "${PWD}:/app" \
--interactive \
--tty \
--rm moviestills-dev
A volume is created to report any change you make to the code inside the container.
To run your code, you can use go run .
inside the container, test it, build it etc. Do note that these commands might be slow at first.
You can also build locally the final image of the app to test it. To do that, run in the repository:
# build final image with binary
docker build -t moviestills .
# test it
docker run \
--volume "${PWD}/cache:/app/cache" \
--volume "${PWD}/data:/app/data" \
--rm moviestills:latest \
--website film-grab \
--debug
This is my first public project in Golang. Therefore, pull requests, suggestions or bug reports are appreciated. A major refactoring is not excluded since I'm still learning the language.
A lot of things included here might look overkill for such a small app (linting, packages etc) but I thought setting a complete workflow would be interesting. Don't hesitate to make it better!
You can contribute to this scraper by adding a new website which provides high-quality movie snapshots. To do that, there are four steps:
- Create a new file in the
websites
folder with the simplified name of the website (eg.yahoo.go
). Check how other websites were implemented and scraped with the Colly library. You need to define a main URL as a constant (the first URL that will get visited) and a main function such asyahooScraper()
where your Colly logic will do the work. - Once you created a scraper for a website, you need to add this new scraper in the available options of the app in
main.go
. Edit thesites
map by adding the simplified name of the website as a key and the scraper's function as a value (eg.websites.yahooScraper
). - Create a unit test for the website, eg
yahoo_test.go
. For that test, we are not going to test with Colly but only with GoQuery, a library that makes HTML/CSS parsing easy, on which Colly is based. We just want to make sure the CSS selectors we use in our scraper are still up-to-date and are still filtering correctly the data we are looking for. - Edit the Supported Websites table in the
README
file and write detailed informations about the website you added – please make sure websites are sorted alphabetically in the table.
Most of the websites we are scraping are owned by individuals who just want to share nice movie snapshots. Please don't abuse these websites and limit your scraping activity to not arm them in any way.
You can support some of the webmasters behind these nice galleries:
- The owner of DVDBeaver/BluBeaver has a Patreon page where you can support him and also get access to a lot more movie snapshots in great quality, which we don't scrap. Here is the link: https://www.patreon.com/dvdbeaver ;
- The man behind Film-Grab also has a Patreon page where you can send him some love: https://www.patreon.com/filmgrab.
I couldn't find any support page for the other websites but you can support some of them by using their affiliated links for example.
Just be gentle while scraping: some are hosting the images on their own servers!
- Created by Yann Defretin.