This project crawls information about the IMDb Top 250 movies using Scrapy, a "powerful web crawling framework" for Python.
spiders/IMDb_spider.py # parse crawled website into scrapy items
movie_item.py
- MovieItem # Movie Item class
- MovieReview # Review Item for each Movie
- MovieStar # Movie star Item
pipelines.py
- handle the items and save each item type to a separate file
- convert each item object to a Python dictionary (the two are essentially the same)
Outputs (not uploaded to Git):
- MovieItem.csv # save Movie Items
- MovieReview.csv # save Movie Reviews
- MovieStar.csv # save Movie star info
spiders/settings.py
# set for pipelines
ITEM_PIPELINES = {
'IMDB_Spider.pipelines.Pipeline': 300,
}
Movie_Analysis.ipynb # Jupyter analysis report
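The pipeline described above (one CSV file per item type, with each item converted to a dictionary first) might look roughly like the sketch below. It is not the project's actual pipelines.py. The item classes are plain dataclasses standing in for the Scrapy items, their fields (title, rating, review) are hypothetical, and the out_dir argument is an assumption added so the example can write to a temporary folder. The open_spider, process_item and close_spider methods are the hooks Scrapy calls on a pipeline.

```python
import csv
import os
import tempfile
from dataclasses import dataclass, asdict


# Stand-ins for the project's item classes; the fields are hypothetical.
@dataclass
class MovieItem:
    title: str
    rating: float


@dataclass
class MovieReview:
    title: str
    review: str


class Pipeline:
    """Route each item type to its own CSV file, e.g. MovieItem.csv."""

    def __init__(self, out_dir="."):
        self.out_dir = out_dir
        self.files = {}    # item class name -> open file handle
        self.writers = {}  # item class name -> csv.DictWriter

    def open_spider(self, spider):
        # Called once when the spider starts.
        os.makedirs(self.out_dir, exist_ok=True)

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        row = asdict(item)  # with scrapy.Item objects this would be dict(item)
        name = type(item).__name__
        if name not in self.writers:
            path = os.path.join(self.out_dir, name + ".csv")
            handle = open(path, "w", newline="", encoding="utf-8")
            writer = csv.DictWriter(handle, fieldnames=list(row))
            writer.writeheader()
            self.files[name] = handle
            self.writers[name] = writer
        self.writers[name].writerow(row)
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        for handle in self.files.values():
            handle.close()


# Drive the pipeline by hand, without running a crawl.
demo_dir = tempfile.mkdtemp()
demo = Pipeline(demo_dir)
demo.open_spider(spider=None)
demo.process_item(MovieItem("Example Movie", 9.0), spider=None)
demo.process_item(MovieReview("Example Movie", "great film"), spider=None)
demo.close_spider(spider=None)
print(sorted(os.listdir(demo_dir)))
```

Keying the writers by class name means a new item type such as MovieStar gets its own file without any change to the pipeline.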
Open a console in the project folder, then run "scrapy crawl imdbspider", where imdbspider is the spider name defined in IMDb_spider.py:
name = 'imdbspider'
allowed_domains = ['imdb.com']
start_urls = ['http://www.imdb.com/chart/top',]
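As an illustration of what allowed_domains is for: Scrapy drops requests to hosts outside the listed domains, while subdomains are still allowed. The check below is only a simplified sketch of that idea using the standard library, not Scrapy's own code. The helper name is_allowed is hypothetical.

```python
from urllib.parse import urlparse

allowed_domains = ['imdb.com']


def is_allowed(url, domains=allowed_domains):
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in domains)


print(is_allowed('http://www.imdb.com/chart/top'))
print(is_allowed('http://example.com/'))
```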
- Engine gets request objects from the spider (IMDb_spider.py)
- Engine hands the request objects to the scheduler
- Engine gets the next request from the scheduler
- Engine sends the request to the downloader through the middleware
- Downloader sends the response back to the engine through the middleware
- Engine passes the response to the spider for parsing
- Spider creates Scrapy items and sends new requests to the engine
- Engine sends the items to the pipelines
In this process, the Engine keeps receiving requests from the scheduler until it is empty. The framework starts with start_urls and ends with the pipelines. The Engine, Downloader and Scheduler are already implemented by the framework; we need to code the spiders and pipelines, and do some configuration.
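The loop described above can be imitated with a small toy model. This is a sketch only: real Scrapy is asynchronous and much richer. The FAKE_WEB dictionary, the /title/1 and /title/2 paths and the function names are all hypothetical stand-ins for the downloader, spider and pipelines.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on the page).
FAKE_WEB = {
    "http://www.imdb.com/chart/top": ("chart", ["/title/1", "/title/2"]),
    "/title/1": ("Movie One", []),
    "/title/2": ("Movie Two", []),
}


def download(url):
    """Downloader stand-in: return a response for a request."""
    text, links = FAKE_WEB[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Spider stand-in: yield new requests (str URLs) or items (dicts)."""
    if response["links"]:
        for link in response["links"]:
            yield link
    else:
        yield {"title": response["text"]}


def run_engine(start_urls):
    """Engine stand-in: move requests and items between the components."""
    scheduler = deque(start_urls)  # requests waiting to be downloaded
    seen = set(start_urls)         # avoid scheduling the same URL twice
    pipeline_output = []           # items that reached the "pipeline"
    while scheduler:               # stop once the scheduler is empty
        request = scheduler.popleft()
        response = download(request)
        for result in parse(response):
            if isinstance(result, dict):
                pipeline_output.append(result)
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)
    return pipeline_output


items = run_engine(["http://www.imdb.com/chart/top"])
print(items)  # [{'title': 'Movie One'}, {'title': 'Movie Two'}]
```

Only parse and the pipeline step correspond to code we write ourselves; the rest mirrors what the framework already provides.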
startproject: create a new project
genspider: create a spider
settings: get setting values
crawl: run a spider
list: list the spiders available in the project
shell: start an interactive scraping console for a URL
For more information and code, see here.
- Star analysis
According to the brief table below, Robert De Niro appeared in the most movies in the Top 250 list, followed by Harrison, Tom and Leonardo.
165 of the Top 250 movies feature at least one of the 100 "best" stars, defined here as stars who appeared in more than one movie in the list. We picked these 100 movie stars for further star research. 83% of movie stars appeared in only one movie in the list.
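The counts above can be derived from the crawled cast lists with a simple tally. The snippet below is a minimal sketch on made-up data: the movie and star names are placeholders, and the real input would come from MovieStar.csv.

```python
from collections import Counter

# Hypothetical cast lists; the real data would come from MovieStar.csv.
casts = {
    "Movie A": ["Star 1", "Star 2"],
    "Movie B": ["Star 1", "Star 3"],
    "Movie C": ["Star 2", "Star 4"],
    "Movie D": ["Star 5"],
}

# Number of listed movies each star appears in.
appearances = Counter(star for cast in casts.values() for star in cast)

# Stars with more than one movie in the list.
repeat_stars = {star for star, n in appearances.items() if n > 1}

# Share of stars that appear in exactly one movie.
single_share = sum(1 for n in appearances.values() if n == 1) / len(appearances)

# Movies that feature at least one repeat star.
covered = [movie for movie, cast in casts.items() if repeat_stars & set(cast)]

print(appearances.most_common(2))
print(sorted(repeat_stars), single_share, covered)
```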
I picked a few stars who appeared in more than 2 movies in the Top 250 list and created a relationship network for them. We can see 5 major blocks; if we loosen the filter, we may find more.
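One way to build such a network is to link two stars whenever they share a movie and then look for connected groups. The sketch below assumes that a "block" means a connected component, which may not match how the notebook draws it. The data, the function names and the min_movies parameter are hypothetical. The toy call uses a threshold of 2 so that the small example has something to show.

```python
from collections import Counter
from itertools import combinations


def build_network(casts, min_movies=3):
    """Link stars that share a movie, keeping stars with at least min_movies movies."""
    counts = Counter(star for cast in casts.values() for star in set(cast))
    keep = {star for star, n in counts.items() if n >= min_movies}
    graph = {star: set() for star in keep}
    for cast in casts.values():
        for a, b in combinations(sorted(set(cast) & keep), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def components(graph):
    """Return the connected groups of the graph as a list of sets."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Made-up cast lists for illustration only.
toy_casts = {
    "M1": ["A", "B"], "M2": ["A", "B"],
    "M3": ["C", "D"], "M4": ["C", "D"],
    "M5": ["E", "F"], "M6": ["E"],
}
toy_graph = build_network(toy_casts, min_movies=2)
print(components(toy_graph))
```

Lowering min_movies keeps more stars in the graph, which is the "loosen the filter" step mentioned above.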
Of the 100 selected movie stars, most were born between the 1930s and the 1970s. California, Illinois and New Jersey are the states with the most movie stars. Even so, no state or region is predominant.
- Movie review analysis
I used NLTK to stem the words and kept only adjectives and nouns for the word cloud, to see which words are most frequently mentioned in reviews of the best movies.
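The adjective and noun filter can be sketched as follows. To keep the example free of NLTK data downloads, it starts from hand-tagged (word, tag) pairs in the Penn Treebank style that a tagger such as nltk.pos_tag produces. The sample words are made up, the function name keyword_counts is hypothetical, and stemming is left out.

```python
from collections import Counter

# Hand-tagged sample standing in for tagger output; the words are made up.
tagged = [
    ("great", "JJ"), ("movie", "NN"), ("is", "VBZ"), ("Great", "JJ"),
    ("story", "NN"), ("really", "RB"), ("movie", "NN"),
]


def keyword_counts(tagged_words):
    """Count adjectives (JJ*) and nouns (NN*, including proper nouns)."""
    return Counter(
        word.lower()
        for word, tag in tagged_words
        if tag.startswith(("JJ", "NN"))
    )


print(keyword_counts(tagged).most_common())
```

The resulting frequency table is the kind of input a word cloud is drawn from.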
I didn't do sentiment analysis in this project, but you can find it in my other project, here.
- add the movie type and release year for each movie when crawling, along with a corresponding analysis block.
- find each movie star's "well done" areas by movie type (category)