ptt-web-crawler (PTT 網路版爬蟲)

Live demo

特色

支援單篇及多篇文章抓取
過濾資料內空白、空行及特殊字元
JSON 格式輸出
支援 Python 2.7 - 3.4

輸出 JSON 格式

{
    "article_id": 文章 ID,
    "article_title": 文章標題 ,
    "author": 作者,
    "board": 板名,
    "content": 文章內容,
    "date": 發文時間,
    "ip": 發文位址,
    "message_conut": { # 推文
        "all": 總數,
        "boo": 噓文數,
        "count": 推文數-噓文數,
        "neutral": → 數,
        "push": 推文數
    },
    "messages": [ # 推文內容
        {
            "push_content": 推文內容,
            "push_ipdatetime": 推文時間及位址,
            "push_tag": 推/噓/→ ,
            "push_userid": 推文者 ID
        },
        ...
    ]
}

執行方式

python crawler.py -b 看板名稱 -i 起始索引 結束索引 (設為 -1 則自動計算最後一頁)
python crawler.py -b 看板名稱 -a 文章ID

範例

* 直接執行腳本:`python crawler.py -b PublicServan -i 100 200`
* 呼叫package:
    * install:`python setup.py install`
    * 直接:`python -m PttWebCrawler -b PublicServan -i 100 200`

會爬取 PublicServan 板第 100 頁 (https://www.ptt.cc/bbs/PublicServan/index100.html) 到第 200 頁 (https://www.ptt.cc/bbs/PublicServan/index200.html) 的內容，輸出至 PublicServan-100-200.json

python crawler.py -b PublicServan -a M.1413618360.A.4F0

會爬取 PublicServan 板文章 ID 為 M.1413618360.A.4F0 (https://www.ptt.cc/bbs/PublicServan/M.1413618360.A.4F0.html) 的內容，輸出至 PublicServan-M.1413618360.A.4F0.json

scrapy 範例

cd ptt_web_crawler
scrapy crawl ptt-web -a board=Gossiping -a article_id=M.1413618360.A.4F0
scrapy crawl ptt-web -a board=Gossiping -a pages=100,200
scrapy crawl ptt-web -a board=Gossiping -a pages=100,-1

測試

python test.py

ptt-web-crawler is a crawler for the web version of PTT, the largest online community in Taiwan.

usage: python crawler.py [-h] -b BOARD_NAME (-i START_INDEX END_INDEX | -a ARTICLE_ID) [-v]
optional arguments:
  -h, --help                  show this help message and exit
  -b BOARD_NAME               Board name
  -i START_INDEX END_INDEX    Start and end index
  -a ARTICLE_ID               Article ID
  -v, --version               show program's version number and exit

Output would be BOARD_NAME-START_INDEX-END_INDEX.json (or BOARD_NAME-ID.json)

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
PttWebCrawler		PttWebCrawler
ptt_web_crawler		ptt_web_crawler
web		web
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py
test.py		test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ptt-web-crawler (PTT 網路版爬蟲)

Live demo

執行方式

範例

scrapy 範例

測試

About

Releases

Packages

Languages

License

afunTW/ptt-web-crawler

Folders and files

Latest commit

History

Repository files navigation

ptt-web-crawler (PTT 網路版爬蟲)

Live demo

執行方式

範例

scrapy 範例

測試

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages