News Crawler

A repository for crawling news and then grouping similar articles. It currently uses Python with the Scrapy framework for crawling and Jieba for Chinese word segmentation.

Version 1.0 of the code is in the current directory.
Version 2.0 of the code is in the `reconstruction` directory.

TODO

~~Find a proper Chinese segmentation tool~~
~~Split a JSON file into small files, every file contains only one piece of news~~
~~Read some articles about SVM~~ (did not use SVM in the end, but TF-IDF and cosine similarity)
~~Try to categorize different news~~
Build another IDF dictionary from web news
~~Categorize similar passages that appear in Sina and NetEase but not in Tencent~~
~~Make a website to display the results and show the comments~~
~~Improve categorization performance~~
Make the website more beautiful!
~~Rewrite this project in PHP~~
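
The TODO list notes that similar news is matched with TF-IDF and cosine similarity rather than SVM, and that an IDF dictionary is to be built from web news. Below is a minimal sketch of that idea, not the code in this repo. The function names (`build_idf`, `tfidf_vector`, `cosine_similarity`) and the smoothed IDF formula are assumptions made for illustration. The input is assumed to be already segmented into tokens; in this project that step would presumably be done with Jieba (for example `jieba.lcut(text)`).

```python
import math
from collections import Counter


def build_idf(tokenized_docs):
    """Build an IDF dictionary from a corpus of token lists.

    Uses a smoothed formula (an assumption, not taken from this repo):
    idf(t) = ln((1 + N) / (1 + df(t))) + 1
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    """Return a sparse TF-IDF vector (dict of term -> weight) for one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Terms missing from the IDF dictionary get weight 0 and are ignored.
    return {
        term: (count / total) * idf.get(term, 0.0)
        for term, count in counts.items()
    }


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy corpus, already segmented; real input would be crawled news text.
    corpus = [
        ["stock", "market", "rises"],
        ["stock", "market", "falls"],
        ["football", "team", "wins"],
    ]
    idf = build_idf(corpus)
    vectors = [tfidf_vector(doc, idf) for doc in corpus]
    print(cosine_similarity(vectors[0], vectors[1]))
    print(cosine_similarity(vectors[0], vectors[2]))
```

Two articles could then be treated as the same story when their similarity exceeds a chosen threshold; that threshold would have to be tuned on real data.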
