A tool to download and parse web pages from search engines like Google. Developed and tested on Windows/Python 3.5.
Derived from Eric's gopage package.

Requirements
- A network connection via a VPN outside mainland China.
- Python 3
- textblob package. To install:

$ pip install -U textblob
$ python -m textblob.download_corpora

- BeautifulSoup package. To install:

$ pip install beautifulsoup4

Usage
- Edit a list of keywords and store it in a '.txt' file.
- Use PageSearcher to download the corresponding HTML pages for the given queries.
- Use EntitySearcher to extract the text and pull the relevant sentences out of the HTML files generated in Step 2 (a minimal end-to-end sketch follows this list).
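
The three steps above can be chained together. The following is a minimal sketch of that pipeline, not the package's canonical usage: it assumes that keywords.txt holds one keyword per line and that the first argument of EntitySearcher is the directory of pages produced by PageSearcher (here 'search_result'); both assumptions should be checked against the actual code.

from page_searcher.page_searcher import PageSearcher
from search_helper.search_helper import GoogleHelper
from entity_sentence_search import EntitySearcher

# Step 1: read the keyword list (assumed format: one keyword per line).
with open('keywords.txt', encoding='utf-8') as f:
    keyword_list = [line.strip() for line in f if line.strip()]

# Step 2: download the search-result pages for each query into 'search_result'.
searcher = PageSearcher('search_result', keyword_list, GoogleHelper())
searcher.get_page()

# Step 3: extract the matching sentences from the downloaded pages and write
# them to 'sentence_result.txt' (assumes the first argument is the page directory).
entity_searcher = EntitySearcher('search_result', 'sentence_result.txt', 'keywords.txt')
entity_searcher.search_entity()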

EntitySearcher

Description
- Given a list of query keywords, extract the sentences that contain a given keyword pair.

Example:

from textblob import TextBlob
from web_helper import WebHelper
from google_item_parser import GoogleItemParser
from bs4 import BeautifulSoup
from entity_sentence_search import EntitySearcher

# Search-result directory, output file, and keyword file.
entitySearcher = EntitySearcher('search_result', 'sentence_result.txt', 'keywords.txt')

# Find the sentences in a piece of text that contain both keywords.
text = 'I like to eat apples. Me too. Let\'s go buy some apples.'
results = entitySearcher.search_sentences(['buy', 'some apples'], text)

# Run the search over the HTML files generated in Step 2 (see Usage above).
entitySearcher.search_entity()

PageSearcher

Description
- Given a list of query keywords, download the corresponding web pages from search engines.

PageSearcher(output_dir, keyword_list, search_helper)
- output_dir: where the downloaded pages will be stored; PageSearcher creates the folder if needed.
- keyword_list: a list of keywords, one per query.
- search_helper: the helper for the search engine you want to use (GoogleHelper, BaiduHelper, SogouHelper, etc.).

Example:

from page_searcher.page_searcher import PageSearcher
from search_helper.search_helper import GoogleHelper

# One query per keyword; the pages are saved under 'output_dir'.
keyword_list = ['Tsinghua', 'PKU', 'hello world']
searcher = PageSearcher('output_dir', keyword_list, GoogleHelper())
searcher.get_page()
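
To target a different search engine, pass a different helper to PageSearcher. The sketch below assumes that BaiduHelper can be imported from the same search_helper.search_helper module as GoogleHelper; that import path is not shown in this README, so check the package before using it.

from page_searcher.page_searcher import PageSearcher
from search_helper.search_helper import BaiduHelper  # import path is an assumption

# Same workflow as above, but the queries go to Baidu instead of Google.
searcher = PageSearcher('output_dir_baidu', ['Tsinghua', 'PKU'], BaiduHelper())
searcher.get_page()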

GoogleItemParser

Description
- Given the content of a web page from Google, parse the page into a list of items (snippets).
- Each item is a dict with keys such as 'title', 'content', and 'cite_url'.

Example:

from google_item_parser import GoogleItemParser

with open('test.html') as f:
    content = f.read()

# Feed the raw HTML to the parser and collect the parsed snippets.
parser = GoogleItemParser()
parser.feed(content)
item_list = parser.get_items()
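
Continuing from item_list in the example above, the loop below prints a short summary of each parsed snippet. It only relies on the keys named in the description ('title', 'cite_url'); dict.get() is used in case a particular item is missing a field, which is an assumption about the parser's output.

# item_list comes from parser.get_items() in the example above.
for item in item_list:
    # 'title' and 'cite_url' are keys documented above; .get() avoids a
    # KeyError if a snippet happens to lack one of them.
    print(item.get('title'), '->', item.get('cite_url'))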