This repository has been archived by the owner on Nov 18, 2021. It is now read-only.
Repository for the Interest Group on Text

CLARIAH/IG-Text

Note: This repo is deprecated and has been moved to the central CLARIAH PLUS repository!

CLARIAH Interest Group on Text

This repository is intended to organize the work, output and documentation of the CLARIAH Interest Group (IG) on Text Processing.

(Note: at the current stage, all of this should be interpreted as a proposal that is open for discussion.)

Introduction

There is a CLARIAH-wide need for robust text processing technologies that can handle historical as well as contemporary Dutch texts. Partners like VU, INT and RU have contributed different components in WP3 and WP6.

Aims of the Interest Group

The aims of the IG on Text are:

  • foster discussion and knowledge sharing regarding automatic text processing
  • enhance interoperability between various text processing solutions
  • develop and share best practices
  • inform development of CLARIAH text processing tools and services

Scope of the Interest Group

Our scope is automatic text processing, which roughly encompasses the following fields:

  • Natural Language Processing
    • automatic linguistic enrichment for multiple languages and multiple time periods
      • named entity extraction & linking
      • dependency parsing, syntactic parsing, morphological analysis
      • part-of-speech tagging
      • lemmatisation
      • sentiment analysis
      • tokenisation and sentence segmentation
    • text normalisation (including post-OCR/HTR correction)
    • optical character recognition & handwriting recognition
    • machine translation
    • language modelling
  • Text Mining
  • Text Search & Retrieval (raw text only; querying of annotations is covered by the annotation group)

Though our scope is not limited to Dutch, it is probably fair to say that Dutch, Flemish and Frisian merit the most attention, as we are a project in the Netherlands.

Aspects that are outside the scope of this Interest Group (because they are covered by other IGs):

  • manual text annotation (covered by the annotation group)
  • annotation models and formats (covered by the annotation group)
  • speech recognition (covered by the AV group)

Communication

We use the following communication channel:

  • Slack (if you don't have access yet, please contact one of the coordinators)

Tasks

  1. Provide an inventory of current text processing tools, services and models in CLARIAH, whether developed in CLARIAH (WP3 or WP6) or third-party projects adopted as solutions.
  2. Identify connections that can be made between various tools (specific workflows/pipelines) to serve specific ends desired by the research community.
  3. Specify the requirements that text processing solutions in CLARIAH should adhere to, in order to facilitate interoperability between tools/services. Indicate to what extent the existing solutions adhere to these requirements.

Group Members

  • Maarten van Gompel (KNAW Humanities Cluster) (coordinator)
  • Jesse de Does (INT)
  • Hennie Brugman (KNAW Humanities Cluster)
  • Roeland Ordelman (Netherlands Institute for Sound and Vision)
  • Martin Reynaert (DCA - Tilburg University / ILLC - Universiteit van Amsterdam)
  • Dirk Roorda (DANS)
  • Eduard Drenth (Fryske Akademy)
  • Piek Vossen (CLTL, Vrije Universiteit Amsterdam)
  • Sophie Arnoult (CLTL, Vrije Universiteit Amsterdam)
  • Jan Wijffels (Vrije Universiteit Brussel)
  • Enno Meijers (Koninklijke Bibliotheek)
  • Rana Klein (Netherlands Institute for Sound and Vision)
  • Jan Niestadt (INT)

The group is open to new members.
