easIE

easy Information Extraction: is an easy-to-use information extraction framework that extracts data about companies from heterogeneous Web sources in a semi-automatic manner. It allows admin users to extract data about companies from heterogeneous Web sources in a semi-automatic manner by only defining a configuration file. The framework is quickly and simply generating Web Information Extractors and Wrappers. easIE offers a set of wrappers for obtaining content from Static and Dynamic HTML pages by pointing to the html elements using css Selectors.

Note: Here you can find the web support of easIE

Getting started

Each extractor extends AbstractHTMLExtractor and implements the extractFields(List<ScrapableField> fields) and extractTable(String table_selector, List<ScrapableField> fields) methods. There are four objects that extend AbstractHTMLExtractor:

StaticHTMLExtractor is responsible for extracting content from static HTML pages:

   StaticHTMLExtractor extractor = new StaticHTMLExtractor(base_url, relative_url);
   extractor.extractFields(fields);

DynamicHTMLExtractor is responsible for executing a number of events to a dynamic HTML page and extracting the defined contents:

   DynamicHTMLExtractor extractor = new DynamicHTMLWrapper(base_url, relative_url, chrome_driver_path);
   extractor.browser_emulator.clickEvent(css_selector);
   extractor.extractFields(fields);

GroupHTMLExtractor is responsible for extracting content from a group of static HTML pages with similar structure:

   GroupHTMLExtractor extractor = new GroupHTMLExtractor(group_of_pages);
   extractor.extractFields(fields);

PaginationIterator is responsible for extracting data that are distributed in different pages:

   PaginationIterator extractor = new PaginationIterator(base_url, relative_url, next_page_selector);
   extractor.extractFields(fields);

Requirements:

Java17-jdk or Java1.8-jdk
chromedriver and Chrome Web broswer according to your system [For Dynamic Extraction].
ConfigurationSchema.json
A website to crawl from, json schema check example using css selectors.

How To Use:

1 arg execution:

    >$ java -jar easIE.jar website2crawl.json
    
    *easIE.jar, ConfigurationSchema.json and chromedriver MUST be in the same folder.*

2 args execution:

    >$ java -jar easIE.jar website2crawl.json path2chromedriver
    
    *easIE.jar, ConfigurationSchema.json MUST be in the same folder.

Example:

1.| Download or Get this example json religiousgreece_example_group_url2.json

2.| Edit line 71, changing to your path, where to store the results.

3.| Get the easIE.jar, ConfigSchema.json, and the appropriate chromedriver (according to your Chrome Browser version).

4.| Execution:

      $> java -jar easIE.jar religiousgreece_example_group_url2.json

NOTICE:

easIE.jar, ConfigSchema.json, religiousgreece_example_group_url2.json and chromedriver are on the same directory.

More Info:

Project developed using Intellij IDE and Maven project manager.

Name		Name	Last commit message	Last commit date
Latest commit History 119 Commits
configuration_files		configuration_files
src/main		src/main
.gitattributes		.gitattributes
.gitignore		.gitignore
ConfigurationSchema.json		ConfigurationSchema.json
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

easIE

Getting started

Requirements:

How To Use:

Example:

More Info:

About

Releases

Packages

Languages

License

estathop/easIE

Folders and files

Latest commit

History

Repository files navigation

easIE

Getting started

Requirements:

How To Use:

Example:

More Info:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages