A repository for crawling news and then grouping similar articles together.
For now, it is written in Python, using the Scrapy framework for crawling and Jieba for Chinese word segmentation.
Version 1.0 is in the current directory.
Version 2.0 is in the reconstruction directory.
- Find a proper Chinese segmentation tool
- Split a JSON file into small files, each containing only one piece of news
- Read some articles about SVM (did not use SVM in the end, but TF-IDF and cosine similarity)
- Try to categorize different news
- Build another IDF dictionary from web news
- Categorize similar articles that appear on Sina and NetEase but not on Tencent
- Make a website that displays the results and shows the comments
- Improve categorization performance
- Make the website more beautiful!
- Rewrite this project in PHP
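
The list above mentions TF-IDF and cosine similarity as the approach used instead of SVM. The following is a minimal sketch of how such grouping could work, not the code in this repository. All function names, the greedy grouping strategy, and the 0.5 threshold are assumptions made for illustration. The placeholder tokenizer splits on whitespace so the example runs without extra dependencies; for Chinese text, a segmenter such as `jieba.lcut` could be passed as the `tokenizer` argument instead.

```python
import math
from collections import Counter


def tokenize(text):
    # Placeholder tokenizer (assumption): lowercases and splits on whitespace.
    # For Chinese news, a word segmenter would be needed here instead.
    return text.lower().split()


def build_idf(token_lists):
    # Smoothed inverse document frequency computed from the given documents.
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict: term -> weight.
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf.get(t, 0.0) for t, c in counts.items()}


def cosine_similarity(a, b):
    # Cosine of the angle between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(texts, threshold=0.5, tokenizer=tokenize):
    # Greedy grouping (assumption): each text joins the first group whose
    # representative (its first member) is similar enough, else starts a new one.
    token_lists = [tokenizer(t) for t in texts]
    idf = build_idf(token_lists)
    vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine_similarity(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


if __name__ == "__main__":
    news = [
        "stock market rises sharply today",
        "stock market rises again today",
        "football team wins final match",
    ]
    # Prints lists of indices into `news`, one list per group.
    print(group_similar(news))
```

The threshold controls how aggressively articles are merged, so it would likely need tuning on real crawled data. Building the IDF dictionary from a larger news corpus, as one of the list items suggests, would amount to calling `build_idf` on that corpus rather than on the articles being grouped.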