-
Notifications
You must be signed in to change notification settings - Fork 0
webcorpus pipeline
License
LGPL-3.0, GPL-3.0 licenses found
Licenses found
LGPL-3.0
COPYING.LESSER
GPL-3.0
COPYING
zseder/webcorpus
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This project is a collection of scripts and programs for creating a webcorpus from crawled data. The input data is extracted by the wire crawler (http://www.cwr.cl/projects/WIRE/) and the output is a text file with document separators and raw text A sample output data and the published article can be found at the homepage of our research group at http://hlt.sztaki.hu/resources/webcorpora.html Dependencies: - gcc - flex - libtextcat - http://software.wise-guys.nl/libtextcat/ - WARNING: libtextcat unfortunately uses a predefined confidence parameter, which cannot be changed at runtime. we created a little patch fixing this, so before compiling and installing libtextcat, please apply our patch at src/libtextcat-2.2.patch with - patch -p1 < /path/to/libtextcat-2.2.patch # while being in a directory that contains original libtextcat-2.2; or - patch -p2 < /path/to/libtextcat-2.2.patch # while being in the actual libtextcat-2.2 directory - libhunspell - we only used libhunspell-1.3 for testing Usage: - src/Makefile is for building everything and move binaries to bin/ folder make for building; make clean for cleaning - dat/Makefile is for processing data, see Makefile for details example: make finnish.raw; make finnish.parsed; make finnish.senfiltered; etc. - some scenarios: - crawl ended without a problem and data is extracted with wire-info-extract - result is in mylang.wire - run these commands in dat/: - make mylang.raw - make mylang.parsed - make mylang.senfiltered - make mylang.langfiltered - make mylang.dedup - make mylang.neardedup - make mylang.tok - make mylang.freq - make mylang.stemdict - OR simply (because of make discovers dependencies): - make mylang.tokenized - there were some troubles in crawling - rename Data/text/storage.raw in crawling directory to dat/mylang.not_extractable_wire and run these commands: - make mylang.raw - make mylang.parsed - make mylang.senfiltered - make mylang.langfiltered - make mylang.cleaned # see WIRE troubles later in this file - make mylang.cleaned_langfiltered # see WIRE troubles later in this file - make mylang.dedup - make mylang.neardedup - make mylang.tok - make mylang.freq - make mylang.stemdict - this cannot be shortened with "make mylang.tokenized" because then optional cleaning won't run; this depends on user needs and data - if processing a language, one has to edit dat/Makefile and add hunspell dictionaries in the same format in which there are already some, or if there is no hunspell dictionary for a given language (like finnish), then use libtextcat (dat/Makefile is prepared to do that automatically, when there is no dictionary given) and edit dat/Makefile in TEXTCAT_CONFIG an TEXTCAT_CONF_LIMIT variables. WIRE troubles - when setting maxdoc in xml config too high, crawling really slows down after a few rounds. See crawling tutorial on wiki page at github for details - There are some encoding issues in wire (especially when the run didn't end correctly) such as: - html is not in utf-8 even though it is supposed to be (set in wire.conf) - some htmls lie about their encoding, so some characters will get messed up, and it cannot be fixed with a simple iconv anymore The main problem is when a multibyte utf-8 character is handled as more characters in a 1-byte encoding, so that ugly things happen - when WIRE crashes while running gatherer or seeder, crawling cannot be continued and data can only be extracted in a hard way. wire-data/text/storage.raw has to be renamed to mylang.not_extractable_wire and process data afterwards (see one scenario above) - because of these problems, there is a "clean encoding" phase in dat/Makefile after language filtering, and the reason of its location is that our cleaning procedure uses clean data for character statistics, and it is extracted from data that is already in a specific language judged by hunspell or textcat. After cleaning, a second language filtering runs because it now will give you more good data in the given language.
About
webcorpus pipeline
Resources
License
LGPL-3.0, GPL-3.0 licenses found
Licenses found
LGPL-3.0
COPYING.LESSER
GPL-3.0
COPYING
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published