conversationkg

Building knowledge graphs from dialogue and analyzing them.

Goals and Functions

  • parsing email conversations in (standard) JSON format into hierarchies of Python objects in order to make tasks such as inspection, iteration and extraction convenient

  • exposing interfaces for common information extraction tasks, such as NER, topic modelling and keyword extraction on email data

  • implementing basic versions of two types of knowledge graphs of email conversations:

    • EmailKG, a ground-truth graph based on the emails' meta-data
    • TextKG, a graph obtained purely by information extraction on emails' textual bodies
  • facilitating and testing the feasibility of machine-learning experiments based on conversational data

Contents of the Repository

conversationkg is both a package and a repository. For a guide to the directories in this repository, see the README in each of them.

Working with the conversationkg package

Installation

Requires Python >= 3.6 and pip; no virtual environment is needed (nor has one been tested), since this creates a local site-packages installation by default. For the Python dependencies, see requirements.txt. Installation steps:

  1. clone this repository (e.g. by running git clone https://github.com/pgroth/conversationkg.git in your command-line interface)
  2. navigate to the cloned repository and run python -m pip install .

Note: The installation copies and extracts the contents of email_data_compressed into the package, which will occupy up to a gigabyte of disk space. The mailinglist data in email_data_compressed can subsequently be loaded as part of the package, which should make development with this data easier.
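
To verify the installation, you can import the package and list the bundled example data (a minimal sanity check; conversationkg.example_mailinglists is described in the Usage section below):

import conversationkg

# names of the mailing lists shipped with the package
print(conversationkg.example_mailinglists)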

Contribute

Once you have cloned a local copy and made changes you wish to upload to the main branch of this repository, follow these (standard) steps:

  1. use git pull to update your local copy with remote changes (git will alert and abort if local changes would be overwritten)
  2. to add all changes at once, run git add . in the root directory (git add [some folder] only adds the changes made in [some folder])
  3. commit the added changes with git commit -m "[your message]" where [your message] is hopefully a meaningful description of the changes
  4. finally, upload the changes to the repository by running git push

You can check what your changes are, if any, by running git status. You can, of course, also create a new branch that co-exists with the main branch and push your changes to that, or open a pull request if you want your changes approved by the repository's owners before they are integrated. Please refer to GitHub's documentation.

The package's documentation (located in the docs folder) is autogenerated by pdoc3. If you have made changes to the package (and re-installed it), please re-generate the documentation by running pdoc --html conversationkg, copy the contents of the produced html folder into docs and upload the updated documentation as described above.

Usage

For the full API documentation, go to https://indelab.github.io/conversationkg.

Basics

Once installed, the package can be imported by:

import conversationkg

There are two subpackages (for the two subtasks, corpus parsing and knowledge graph extraction):

import conversationkg.conversations
import conversationkg.kgs

Load the example mailinglist ietf-http-wg included in the package installation:

from conversationkg import load_example_data_as_raw_JSON

json_data = load_example_data_as_raw_JSON("ietf-http-wg")

The names of all included mailing lists can be found in conversationkg.example_mailinglists.
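
For example, each bundled mailinglist can be loaded by name (a short sketch using only the two names introduced above):

from conversationkg import example_mailinglists, load_example_data_as_raw_JSON

# load every bundled mailinglist and report how many top-level entries it has
for name in example_mailinglists:
    json_data = load_example_data_as_raw_JSON(name)
    print(name, len(json_data))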

Import and instantiate a corpus object:

from conversationkg.conversations import EmailCorpus

corpus = EmailCorpus.from_email_dicts(json_data)

The corpus object can alternatively be instantiated via a list of conversation objects:

from conversationkg.conversations import EmailCorpus, Conversation

conversations = [Conversation.from_email_dicts(subject, email_dicts) for subject, email_dicts in json_data]

corpus = EmailCorpus(conversations)

Applying Factories

The most common and basic information extraction tasks are implemented in the conversationkg package as so-called factories (named so because they produce objects, mainly entities, given emails or conversations). conversationkg implements factories for text vectorisation, topic modelling, NER and keyword extraction and has base classes for these so that new factories can easily be added. Factories are applied to a corpus like so:

from conversationkg.conversations import EmailCorpus
from conversationkg.conversations.factories import SKLearnLDA, SpaCyNER, StanzaNER, RakeKeyWordExtraction

corpus = EmailCorpus.from_email_dicts(json_data)

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
corpus.vectorise(TfidfVectorizer)  # CountVectorizer works analogously


factories = [SKLearnLDA(corpus, 13, max_iter=10), 
             SpaCyNER(),  # alternatively StanzaNER
             RakeKeyWordExtraction()]

for factory in factories:
    factory(corpus)

Beware that some downstream functionality, most importantly parts of the KG extraction, relies on factories having already been run on the corpus and may produce useless results otherwise.
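
Since factories are applied by calling them on a corpus (factory(corpus), as above), a new factory can be prototyped as a plain callable. The sketch below is hypothetical: the class name and the email attributes are illustrative assumptions, not the package's API; for real extensions, use the factory base classes provided in conversationkg.conversations.factories.

# hypothetical sketch of a custom factory; prefer subclassing the
# base classes in conversationkg.conversations.factories
class ShoutedWordExtractor:
    """Toy factory: collect fully upper-cased words from each email body."""

    def __call__(self, corpus):
        # assumptions: the corpus iterates over conversations and each
        # conversation iterates over emails exposing a .body attribute
        results = []
        for conversation in corpus:
            for email in conversation:
                words = str(email.body).split()
                results.append([w for w in words if w.isupper() and len(w) > 1])
        return results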

Instantiating KGs

Instantiating either a TextKG or EmailKG object from a corpus object is as simple as:

from conversationkg.kgs import EmailKG, TextKG

emailkg = EmailKG(corpus)

textkg = TextKG(corpus)

Supplying your own Data: Required Format

The scraped W3C mailing lists are stored as JSON dict objects and the current implementation of conversation_building.declarations.corpus.EmailCorpus expects such a format. Below is an example of the public-credentials mailing list, namely the first email of the first conversation (subject Use-Case: Deaths) of the first period (2015Aug, i.e. August 2015):

public_credentials = """
{'2015Aug': 
     {'Use-Case: Deaths': 
         [  
             {'body': '\nAppearing at the infamous annual Def Con <https://www.defcon.org/> IT\nsecurity conference in Las Vegas this week, Mr Rock demonstrated gaping\nflaws that have surfaced in the rush to go digital\n<https://4a5b508b5f92124e39ff-ccd8d0b92a93a9c1ab1bc91ad6c9bfdb.ssl.cf4.rackcdn.com/2015/01/150122-Births-Deaths-Marriages-Records-To-Go-Online.pdf>\nwith\nthe process of registering births and deaths in Australia.\n\nRead more:\nhttp://www.theage.com.au/digital-life/consumer-security/meet-chris-rock-the-man-with-the-power-to-kill-off-any-australian-20150809-giuuxd.html\n<http://www.theage.com.au/digital-life/consumer-security/meet-chris-rock-the-man-with-the-power-to-kill-off-any-australian-20150809-giuuxd.html?utm_campaign=echobox&utm_medium=Social&utm_source=Facebook#ixzz3iIqYrCHc>\n',
              'author': 'Timothy Holborn ([email protected])',
              'subject_from_meta': 'Use-Case: Deaths',
              'date': '2015-08-09',
              'isoreceived': '20150809080538',
              'isosent': '20150809080430',
              'sent': 'Sun, 9 Aug 2015 18:04:30 +1000',
              'name': 'Timothy Holborn',
              'email': '[email protected]',
              'subject': 'Use-Case: Deaths',
              'id': 'CAM1Sok0TpowLbun83N+_QRyd14amti3ME0uPTRtodtnNGx87-Q@mail.gmail.com',
              'charset': 'UTF-8',
              'inreplyto': None,
              'from': 'Timothy Holborn <[email protected]>',
              'date_from_body': 'Sun, 9 Aug 2015 18:04:30 +1000',
              'to': 'W3C Credentials Community Group <[email protected]>',
              'id_from_body': None,
              'original_path': ['public-credentials', '2015Aug', 'Use-Case: Deaths', '0021.html']
            }
        ]
    }
}
"""            

Notice that some of the meta-data entries duplicate information, such as author and from; conversation_building.declarations.emails defines how such duplicated information is resolved. All meta-data keys need to be present for parsing, but their values may be empty (""). However, some meta-data entries are necessary for certain functionality when extracting the KG; for instance, the "sent" entry is used to obtain a temporal ordering of the conversations in the corpus and of the emails within each conversation.
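
Since all of the keys need to be present, a quick pre-flight check of your own data can save debugging time. Below is a plain-Python sketch (not part of the package); the key list is copied from the example above and the traversal assumes the period -> subject -> list-of-emails nesting it shows:

# keys the parser expects on every email dict (values may be "")
REQUIRED_KEYS = {
    "body", "author", "subject_from_meta", "date", "isoreceived",
    "isosent", "sent", "name", "email", "subject", "id", "charset",
    "inreplyto", "from", "date_from_body", "to", "id_from_body",
    "original_path",
}

def check_mailinglist(data):
    """Report required keys missing from any email dict in the nested data."""
    for period, conversations in data.items():
        for subject, emails in conversations.items():
            for i, email_dict in enumerate(emails):
                missing = REQUIRED_KEYS - email_dict.keys()
                if missing:
                    print(f"{period} / {subject} [{i}] missing: {sorted(missing)}")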

Citation

You can cite us as:

Valentin Vogelmann, Paul Groth, & Brian Grier. (2022). conversationkg. Zenodo. https://doi.org/10.5281/zenodo.5883258

Acknowledgements

This material is based upon work supported by the Air Force Office of Scientific Research under award number FA8655-20-1-7005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force.
