IMDB_Spider_By_Scrapy

This project crawls information about the IMDb Top 250 movies using Scrapy, a "powerful web crawling framework" for Python.

Structure

spiders/IMDb_spider.py # parses the crawled pages into Scrapy items

movie_item.py

MovieItem # Movie Item class

MovieReview # Review Item for each Movie

MovieStar # Movie star Item

pipelines.py

Handles the items and saves each item type to a separate file.

Converts each item object to a Python dictionary (the two are essentially the same).

Outputs (not uploaded to Git):

MovieItem.csv # save Movie Items

MovieReview.csv # save Movie Reviews

MovieStar.csv # save Movie star info

spiders/settings.py

# set for pipelines
ITEM_PIPELINES = {
   'IMDB_Spider.pipelines.Pipeline': 300,
}

Movie_Analysis.ipynb # Jupyter analysis report
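
As a rough illustration of the pipeline described under pipelines.py, the sketch below routes each item type to its own CSV file. It is not the project's actual code: the class name `CsvPerItemPipeline`, the `out_dir` parameter, and the dict-based stand-ins for the three item classes are assumptions made so the example runs without Scrapy installed. Only the method names `open_spider`, `process_item`, and `close_spider` follow Scrapy's pipeline interface.

```python
import csv
import os


# Hypothetical stand-ins for the Scrapy items in movie_item.py.
# Real Scrapy items are dict-like, so plain dict subclasses are enough here.
class MovieItem(dict):
    pass


class MovieReview(dict):
    pass


class MovieStar(dict):
    pass


class CsvPerItemPipeline:
    """Write each item type to <ClassName>.csv (assumed naming scheme)."""

    def __init__(self, out_dir="."):
        self.out_dir = out_dir  # hypothetical parameter, not from the project

    def open_spider(self, spider):
        self._files = {}
        self._writers = {}

    def process_item(self, item, spider):
        name = type(item).__name__
        row = dict(item)  # item -> plain Python dictionary
        if name not in self._writers:
            path = os.path.join(self.out_dir, name + ".csv")
            f = open(path, "w", newline="", encoding="utf-8")
            # Columns are taken from the first item of each type;
            # later items with extra keys have those keys ignored.
            writer = csv.DictWriter(f, fieldnames=list(row), extrasaction="ignore")
            writer.writeheader()
            self._files[name] = f
            self._writers[name] = writer
        self._writers[name].writerow(row)
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        for f in self._files.values():
            f.close()
```

Taking the column names from the first item keeps the sketch short; a real pipeline would more likely declare the columns per item type up front.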

How to use

From a console, change to the project folder and run "scrapy crawl imdbspider", where imdbspider is the spider name defined in IMDb_spider.py.

name = 'imdbspider'
allowed_domains = ['imdb.com']
start_urls = ['http://www.imdb.com/chart/top',]

How scrapy works

Engine gets a request object from the spider (IMDb_spider.py)
Engine hands the request object to the scheduler
Engine gets the next request from the scheduler
Engine sends the request to the downloader through the middleware
Downloader sends the response back to the engine through the middleware
Engine passes the response to the spider for parsing
Spider creates Scrapy items and sends new requests to the engine
Engine sends the items to the pipelines

During this process, the engine keeps taking requests from the scheduler until it is empty. The framework starts from start_urls and ends at the pipelines. The Engine, Downloader, and Scheduler are already implemented by the framework; we only need to write the spiders and pipelines and do some configuration.
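
The loop above can be mimicked with a toy, synchronous sketch. This is not how Scrapy is implemented (the real engine is asynchronous and has middleware layers); the names `FAKE_WEB`, `download`, `parse`, and `run_engine` are all made up for illustration, and the "web" is an in-memory dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on that page).
FAKE_WEB = {
    "start": ("list page", ["movie/1", "movie/2"]),
    "movie/1": ("first movie page", []),
    "movie/2": ("second movie page", []),
}


def download(url):
    """Downloader stand-in: build a response for the requested URL."""
    text, links = FAKE_WEB[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Spider stand-in: yield one item per page plus follow-up requests."""
    yield {"item": response["text"]}
    for link in response["links"]:
        yield {"request": link}


def run_engine(start_urls):
    """Engine stand-in: drain the scheduler, routing results as it goes."""
    scheduler = deque(start_urls)  # pending requests
    seen = set(start_urls)         # avoid scheduling the same URL twice
    collected = []                 # stands in for the item pipelines
    while scheduler:               # stop when the scheduler is empty
        url = scheduler.popleft()
        response = download(url)
        for result in parse(response):
            if "request" in result:
                if result["request"] not in seen:
                    seen.add(result["request"])
                    scheduler.append(result["request"])
            else:
                collected.append(result["item"])
    return collected


print(run_engine(["start"]))
# ['list page', 'first movie page', 'second movie page']
```

The part a project author writes corresponds to `parse` (the spider) and to whatever consumes `collected` (the pipelines); the rest is provided by the framework.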

Some commands in scrapy

startproject: create a new project

genspider: create a spider

settings: get the project's setting values

crawl: start crawling with a spider

list: show all spider names in the project

shell: start an interactive shell for parsing a URL

Report

For more information and code, see here.

Star analysis

According to the brief table below, Robert De Niro appears in the most movies in the top 250 list, followed by Harrison, Tom and Leonardo.

165 of the top 250 movies feature at least one of the 100 best stars, defined as stars who appear in more than one movie in the list. We picked these 100 movie stars for further star research. 83% of movie stars appear in only one movie in the list.

I picked a few stars who appear in more than 2 movies in the top 250 list and created a relationship network for them. We can see 5 major blocks; if we loosen the filter, we may find more.

Most of the 100 picked movie stars were born between the 1930s and the 1970s. California, Illinois, and New Jersey are the states with the most movie stars. Even so, no state or region is predominant.
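
The counts in this paragraph can be derived from (movie, star) pairs along the following lines. The rows and the function names are placeholders invented for this sketch; the real input would come from MovieStar.csv, whose columns are not shown here, and the notebook may compute these numbers differently.

```python
from collections import Counter

# Hypothetical (movie, star) rows, each pair listed once.
credits = [
    ("Movie 1", "Star A"),
    ("Movie 2", "Star A"),
    ("Movie 2", "Star B"),
    ("Movie 3", "Star C"),
]


def repeat_stars(rows):
    """Stars who appear in more than one movie, with their movie counts."""
    counts = Counter(star for _, star in rows)
    return {star: n for star, n in counts.items() if n > 1}


def single_movie_share(rows):
    """Fraction of stars who appear in exactly one movie."""
    counts = Counter(star for _, star in rows)
    singles = sum(1 for n in counts.values() if n == 1)
    return singles / len(counts)


def movies_covered(rows, stars):
    """Movies that feature at least one of the given stars."""
    return {movie for movie, star in rows if star in stars}


picked = repeat_stars(credits)
print(picked)                                   # {'Star A': 2}
print(sorted(movies_covered(credits, picked)))  # ['Movie 1', 'Movie 2']
```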
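
One way to build such a relationship network is to link two stars whenever they appear in the same movie. The sketch below is an assumption about the approach, not the notebook's code: `costar_edges`, its `min_movies` filter, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def costar_edges(rows, min_movies=3):
    """Count shared movies between stars who have at least min_movies credits.

    rows is an iterable of (movie, star) pairs; min_movies=3 corresponds to
    "more than 2 movies". Returns {(star_a, star_b): shared_movie_count}.
    """
    movies_by_star = defaultdict(set)
    for movie, star in rows:
        movies_by_star[star].add(movie)
    picked = {s for s, m in movies_by_star.items() if len(m) >= min_movies}

    cast = defaultdict(set)  # movie -> picked stars appearing in it
    for movie, star in rows:
        if star in picked:
            cast[movie].add(star)

    edges = defaultdict(int)
    for stars in cast.values():
        for a, b in combinations(sorted(stars), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Hypothetical placeholder data.
rows = [
    ("M1", "Star A"), ("M2", "Star A"), ("M3", "Star A"),
    ("M1", "Star B"), ("M2", "Star B"), ("M4", "Star B"),
    ("M1", "Star C"),
]
print(costar_edges(rows))  # {('Star A', 'Star B'): 2}
```

Lowering `min_movies` loosens the filter and lets more stars, and therefore more edges, into the network.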

Movie Review Analysis

I used NLTK to stem the words and picked only adjectives and nouns for the word cloud, to see which words are frequently referenced in the best movies.

I didn't do sentiment analysis in this project, but you can find it in my other project, here.

Future Improvement

Add the movie type and release year for each movie when crawling, along with a corresponding analysis block.
Find the areas where each movie star does well, by movie type (category).
