An open source, RESTful, distributed crawler engine.
- Persistence
- Dynamic Master
I first wrote a crawler engine named ants in Python, based on Scrapy. But a dynamic language can get chaotic, so I started rewriting it in a compiled language.
I designed the crawler framework by imitating Scrapy: the downloader, the scraper, and the way users write custom spiders, but in a compiled language.
I designed the distributed architecture by imitating Elasticsearch. It inspired me to build an engine for distributed crawling.
go get
go get
go get
go install
cd bin
./ants-go
curl 'http://localhost:8200/cluster'
curl 'http://localhost:8200/spiders'
curl 'http://localhost:8200/crawl?spider=spiderName'
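The same three requests can also be issued from Go. This is only a minimal sketch: it assumes a node is listening on the default HTTP port 8200, and because the response format is not described here, it prints the raw body without decoding it. The helper names endpointURL and fetch and the 5-second timeout are illustrative choices, not part of ants-go.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// endpointURL joins a node's base URL with one of the REST paths shown above.
// An empty spider name yields a URL without a query string.
// (Hypothetical helper, not part of ants-go.)
func endpointURL(base, path, spider string) string {
	u := base + path
	if spider != "" {
		q := url.Values{}
		q.Set("spider", spider)
		u += "?" + q.Encode()
	}
	return u
}

// fetch performs a GET request and returns the raw response body.
// The body format is an unknown here, so no decoding is attempted.
func fetch(client *http.Client, u string) (string, error) {
	resp, err := client.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	base := "http://localhost:8200" // default HTTP port used by the curl commands
	client := &http.Client{Timeout: 5 * time.Second}

	urls := []string{
		endpointURL(base, "/cluster", ""),
		endpointURL(base, "/spiders", ""),
		endpointURL(base, "/crawl", "spiderName"),
	}
	for _, u := range urls {
		body, err := fetch(client, u)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		fmt.Println(u, "->", body)
	}
}
```

Replace spiderName with the name of a spider returned by the /spiders request.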
To test a cluster on one computer, run each node on different ports in a separate terminal.
One node uses the default ports (TCP 8300, HTTP 8200):
cd bin
./ants-go
The other node sets its own TCP and HTTP ports:
cd bin
./ants-go -tcp 9300 -http 9200
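When several nodes share one machine, every TCP and HTTP port has to be unique. The sketch below checks a planned set of ports before printing the matching commands. The node type and the portConflict helper are hypothetical and exist only for this illustration. The -tcp and -http flags and the port numbers are the ones shown above.

```go
package main

import "fmt"

// node holds the two ports one ants-go process listens on.
// (Hypothetical type for this sketch only.)
type node struct {
	TCP  int
	HTTP int
}

// portConflict reports the first port that is claimed more than once
// across all nodes, or false if every port is unique.
func portConflict(nodes []node) (int, bool) {
	seen := make(map[int]bool)
	for _, n := range nodes {
		for _, p := range []int{n.TCP, n.HTTP} {
			if seen[p] {
				return p, true
			}
			seen[p] = true
		}
	}
	return 0, false
}

func main() {
	nodes := []node{
		{TCP: 8300, HTTP: 8200}, // first node: default ports
		{TCP: 9300, HTTP: 9200}, // second node: ports passed as flags
	}
	if p, clash := portConflict(nodes); clash {
		fmt.Printf("port %d is used more than once\n", p)
		return
	}
	// Print one command per node; run each in its own terminal.
	for _, n := range nodes {
		fmt.Printf("./ants-go -tcp %d -http %d\n", n.TCP, n.HTTP)
	}
}
```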
There are more flags you can set. Check the help message:
./ants-go -h
./ants-go -help
- Go to the spiders directory.
- Write your spider, following the example in deap_loop_spider.go, or see the spider page.
- Add your spider to spiderMap, following the example in LoadAllSpiders in load_all_spider.go.
- Run go install again.
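The registration step above amounts to building each spider and indexing it by name. The sketch below illustrates that pattern with stand-in types. The Spider struct, its fields, makeExampleSpider, and loadAllSpiders are all assumptions made for illustration and are not the real ants-go definitions. Use deap_loop_spider.go and load_all_spider.go as the authoritative reference.

```go
package main

import (
	"fmt"
	"sort"
)

// Spider is a hypothetical stand-in, NOT the ants-go type.
// The real struct and its fields may differ.
type Spider struct {
	Name      string
	StartURLs []string
}

// makeExampleSpider plays the role of the constructor you would
// write in your own spider file. (Hypothetical name and contents.)
func makeExampleSpider() *Spider {
	return &Spider{
		Name:      "example_spider",
		StartURLs: []string{"http://example.com/"},
	}
}

// loadAllSpiders mirrors the idea behind LoadAllSpiders: construct
// every spider and register it in a map keyed by its name.
func loadAllSpiders() map[string]*Spider {
	spiderMap := make(map[string]*Spider)
	for _, s := range []*Spider{
		makeExampleSpider(),
		// add further spider constructors here
	} {
		spiderMap[s.Name] = s
	}
	return spiderMap
}

func main() {
	spiderMap := loadAllSpiders()
	names := make([]string, 0, len(spiderMap))
	for name := range spiderMap {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random, so sort for stable output
	fmt.Println("registered spiders:", names)
}
```

Keying the map by name is what lets a request such as /crawl?spider=spiderName look up the spider to run.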