Core Data Modules

A library of reusable AVF functionality and data structures.

It includes the primary data structure in which all runs are stored, helper functions for IO/hashing/time/logging, a cleaning library, and standardised analysis functions.

Installation

CoreDataModules is not available on PyPI, so must instead be installed from this GitHub repository. Do so with the following command, setting <version> to the tag to install e.g. v0.0.1:

$ pipenv install -e git+https://www.github.com/AfricasVoices/CoreDataModules@<version>#egg=CoreDataModules

To use the core_data_modules.analysis.mapping module, include the extra dependencies for mapping:

$ pipenv install -e git+https://www.github.com/AfricasVoices/CoreDataModules@<version>#egg=CoreDataModules[mapping]

API Overview

TracedData

The data format for a batch of runs in the data pipeline is always an iterable of TracedData.

A TracedData object can be thought of as a "dictionary with data provenance". It provides the same read interface as a Python dict, but writes must be performed using the append_data method, which requires some additional metadata about each update when it is called. TracedData objects maintain a complete record of all previous updates.

To construct a TracedData object, read a value, and update value(s):

import time
from core_data_modules.traced_data import TracedData, Metadata

USER = "test_user"

td = TracedData({"age": "20", "language": "english"}, Metadata(USER, Metadata.get_call_location(), time.time()))
print(td["age"])  # 20

td.append_data({"age": "21"}, Metadata(USER, Metadata.get_call_location(), time.time()))
print(td["age"])  # 21

CoreDataModules also provides utilites for converting between in-memory iterables of TracedData and JSON, CSV, Coda, Coding CSV, and The Interface files. For example, to save and load a list of TracedData objects by serializing to and from JSON:

from core_data_modules.traced_data.io import TracedDataJsonIO

data =  # TODO: assign iterable of TracedData objects

with open("output.json", "w") as f:
    TracedDataJsonIO.export_traced_data_iterable_to_json(data, f)
    
with open("input.json", "r") as f:
    data = TracedDataJsonIO.import_json_to_traced_data_iterable(f)

Cleaning

All reusable cleaning functions may be used by importing functionality from core_data_modules.cleaners. For example:

from core_data_modules.cleaners.english.demographic_cleaner import DemographicCleaner

DemographicCleaner.clean_gender("woman")  # 'female'
DemographicCleaner.clean_gender("aoeu")  # Codes.NOT_CODED

Note that cleaners which fail to assign a cleaned value return Codes.NOT_CODED.

Development

Testing

All tests are to be written using unittest, and to be placed in the tests/ directory. The structure of the tests/ directory should match that of core_data_modules, with all test scripts prefixed with test_. For example, if there is a file at core_data_modules/package/file.py, there should be an associated test file at tests/package/test_file.py.

To run all tests: $ pipenv run python setup.py test --addopts doctest.

To run a specific test or test suite: $ pipenv run python setup.py test --addopts path/to/test_file.py::TestClass::test_method e.g. $ pipenv run python setup.py test --addopts tests/cleaners/english/test_demographic_cleaner.py.

Circle CI automatically runs all tests in Python 3.6 whenever a commit is pushed to this repository.

Name		Name	Last commit message	Last commit date
Latest commit History 1,154 Commits
.circleci		.circleci
core_data_modules		core_data_modules
tests		tests
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Core Data Modules

Installation

API Overview

TracedData

Cleaning

Development

Testing

About

Releases 52

Packages

Contributors 5

Languages

License

AfricasVoices/CoreDataModules

Folders and files

Latest commit

History

Repository files navigation

Core Data Modules

Installation

API Overview

TracedData

Cleaning

Development

Testing

About

Resources

License

Stars

Watchers

Forks

Releases 52

Packages 0

Contributors 5

Languages

Packages