Dataset: Format and Usage

Classification

We currently provide TextDataset for text classification, RelationDataset for relation classification, and NumericDataset for tabular data classification. Note that any other type of dataset can be converted into a NumericDataset by using numeric features/pre-trained embeddings.

General

Format

In general, data are stored in JSON files. We highly recommend storing the train/valid/test splits in three separate JSON files. Each JSON file contains a dictionary that looks like

{
    "0":{ # data id
        "label": 0, # (int) ground truth label, start from 0
        "weak_labels": [0, -1, 0, 1], # (List(int)) weak supervision labels, start from -1. -1 is ABSTAIN, 0...k are labels
        "data": { # dataset specific raw data dictionary
            ...
        },
    },
    ...
}

In addition, a label.json file is required, which contains a dictionary mapping each label id (starting from 0) to its surface name.
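
For example, a label.json for a binary sentiment task could look like the following (the label names here are illustrative):

{
    "0": "negative",
    "1": "positive"
}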

The file structure should be like

datasets
 |-- data name (for example, yelp)
      |-- label.json
      |-- train.json
      |-- valid.json
      |-- test.json

Usage

Then the dataset can be loaded by

from wrench.dataset import TextDataset  # assuming the package's dataset module

dataset_path = 'datasets/yelp'
train_data = TextDataset(path=dataset_path, split='train')
valid_data = TextDataset(path=dataset_path, split='valid')
test_data = TextDataset(path=dataset_path, split='test')

For classification, we also support multiple feature extractors. We take the BERT embedding extractor as an example:

extractor_fn = train_data.extract_feature(extract_fn='bert', model_name='bert-base-uncased', return_extractor=True)
valid_data.extract_feature(extract_fn=extractor_fn, return_extractor=False)
test_data.extract_feature(extract_fn=extractor_fn, return_extractor=False)

After calling extract_feature, the extracted features are stored in the dataset's features attribute, for example, train_data.features.
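
As a quick sanity check (a sketch, assuming features holds one array-like vector per example):

import numpy as np

# assumes extract_feature has been called on train_data as above
features = np.asarray(train_data.features)
print(features.shape)  # expected: (num_train_examples, feature_dim)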

NumericDataset

Each train/valid/test json file contains a dictionary:

{
    "0":{
        "label": 0,
        "weak_labels": [0, -1, 0, 1],
        "data": {
            "feature": [0, 1, 0.1]
        },
    },
    "1":{
        "label": 1,
        "weak_labels": [-1, 1, -1, 1],
        "data": {
            "feature": [1, 0, 0.2]
        },
    },
    ...
}

The NumericDataset has only a default feature extractor, which directly copies the feature stored in the JSON file to the dataset's features attribute.

data.extract_feature(extract_fn=None, return_extractor=False)
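
To convert another dataset into a NumericDataset, one option is to serialize its numeric features or pre-trained embeddings into the format above. A minimal sketch (the embeddings, labels, and output path here are all hypothetical):

import json
import numpy as np

# hypothetical inputs: one embedding per example, plus gold and weak labels
embeddings = np.random.rand(2, 3)
labels = [0, 1]
weak_labels = [[0, -1, 0, 1], [-1, 1, -1, 1]]

# build the documented structure: data id -> {label, weak_labels, data.feature}
out = {
    str(i): {
        "label": labels[i],
        "weak_labels": weak_labels[i],
        "data": {"feature": embeddings[i].tolist()},
    }
    for i in range(len(labels))
}

with open('datasets/my_numeric/train.json', 'w') as f:
    json.dump(out, f)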

TextDataset

Each train/valid/test json file contains a dictionary:

{
    "0":{
        "label": 0,
        "weak_labels": [0, -1, 0, 1],
        "data": {
            "text": "this is an example"
        },
    },
    "1":{
        "label": 1,
        "weak_labels": [-1, 1, -1, 1],
        "data": {
            "text": "this is another example"
        },
    },
    ...
}

The TextDataset supports the following feature extractors (passed via the extract_fn argument); a usage sketch follows the list:

'bow': bag-of-words feature extractor, based on sklearn.feature_extraction.text.CountVectorizer.

'tfidf': TF-IDF feature extractor, based on sklearn.feature_extraction.text.TfidfVectorizer.

'sentence_transformer': sentence transformer feature extractor, based on SentenceTransformers.

'bert': BERT-based feature extractor, based on HuggingFace.
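
For example, the TF-IDF extractor can be fit on the training split and then reused on the other splits, following the same pattern as the BERT example above:

# fit the TF-IDF extractor on the training split and reuse it elsewhere
extractor_fn = train_data.extract_feature(extract_fn='tfidf', return_extractor=True)
valid_data.extract_feature(extract_fn=extractor_fn, return_extractor=False)
test_data.extract_feature(extract_fn=extractor_fn, return_extractor=False)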

RelationDataset

Each train/valid/test json file contains a dictionary:

{
    "0":{
        "label": 1,
        "weak_labels": [1, -1, 1, 1],
        "data": {
            "text": "AA is a BB.",
            "entity1": "AA",
            "entity2": "BB",
            "span1": [0, 2],  # character-level span
            "span2": [8, 10], # character-level span
        },
    },
    ...
}
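
The spans are character offsets into text, so slicing should recover the entity strings; a quick check in Python for the example above:

# character-level spans slice directly into the raw text
text = "AA is a BB."
span1, span2 = [0, 2], [8, 10]
assert text[span1[0]:span1[1]] == "AA"  # entity1
assert text[span2[0]:span2[1]] == "BB"  # entity2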

The RelationDataset supports the following feature extractor (passed via the extract_fn argument); a usage sketch follows:

'bert': BERT-based feature extractor, based on HuggingFace and R-BERT.
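
Usage mirrors the TextDataset example above (the import path and dataset path here are assumptions):

from wrench.dataset import RelationDataset  # assuming the package's dataset module

train_data = RelationDataset(path='datasets/your_relation_data', split='train')
extractor_fn = train_data.extract_feature(extract_fn='bert', model_name='bert-base-uncased', return_extractor=True)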

Sequence Tagging

SeqDataset

Each train/valid/test json file contains a dictionary:

{
    "0":{
        "label": ["B-PER", "O", "O", "O", "O", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O", "B-ORG", "I-ORG", "O"],
        "weak_labels": [
            ["B-PER", "B-PER", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["B-LOC", "B-LOC", "O", "O", "O", "B-LOC", "B-LOC", "B-MISC", "O", "O", "B-LOC", "B-LOC", "B-LOC", "B-LOC", "B-LOC", "B-LOC"],
            ["I-LOC", "I-LOC", "O", "O", "O", "I-LOC", "I-LOC", "I-MISC", "O", "O", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC"],
            ["O", "I-LOC", "O", "O", "O", "I-LOC", "I-LOC", "I-MISC", "O", "O", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC"],
            ["O", "I-LOC", "O", "O", "O", "O", "O", "I-MISC", "O", "O", "O", "O", "O", "O", "O", "O"],
            ["B-ORG", "B-ORG", "B-ORG", "B-ORG", "O", "O", "O", "I-MISC", "B-ORG", "B-ORG", "O", "O", "B-LOC", "B-LOC", "B-LOC", "B-LOC"],
            ["I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O", "O", "I-MISC", "I-ORG", "I-ORG", "O", "O", "I-LOC", "I-LOC", "I-LOC", "I-LOC"],
            ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
        ],
        "data": {
            "text": ["Gradsky", "has", "also", "performed", "as", "a", "tenor", "at", "New", "York", "City", "'s", "Carnegie", "Hall", "."],
            "len": 15
        },
    },
    ...
}

Note that, unlike the classification datasets, each example in SeqDataset is a sentence consisting of a sequence of tokens. Hence, label here is a list whose length equals the sequence length. weak_labels is a list of lists, where the i-th inner list holds the weak labels for the i-th token (one entry per labeling function). len stands for the length of the sequence.
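
These fields can be checked for consistency; a sketch in Python (the dataset path is a placeholder):

import json

# placeholder path; substitute your own SeqDataset directory
with open('datasets/conll/train.json') as f:
    data = json.load(f)

example = data["0"]
seq_len = example["data"]["len"]
assert len(example["label"]) == seq_len        # one tag per token
assert len(example["weak_labels"]) == seq_len  # one weak-label list per token
num_lf = len(example["weak_labels"][0])        # one weak label per labeling function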

In addition, a meta.json file is required, which contains metadata (e.g., weak label sources, entity types, maximum number of tokens). An example meta.json (for the CoNLL03 dataset) is shown below.

{
    "train_size": 14041,
    "valid_size": 3250,
    "test_size": 3453,
    "num_labels": 9,
    "max_length": 124,
    "lf": [
        "BTC",
        "core_web_md",
        "crunchbase_cased",
        "crunchbase_uncased",
        "full_name_detector",
        "geo_cased",
        "geo_uncased",
        "misc_detector",
        "multitoken_crunchbase_cased",
        "multitoken_crunchbase_uncased",
        "multitoken_geo_cased",
        "multitoken_geo_uncased",
        "multitoken_wiki_cased",
        "multitoken_wiki_uncased",
        "wiki_cased",
        "wiki_uncased"
    ],
    "num_lf": 16,
    "entity_types": [
        "PER",
        "LOC",
        "ORG",
        "MISC"
    ],
    "lf_rec": [
        "BTC",
        "core_web_md",
        "crunchbase_cased",
        "crunchbase_uncased",
        "full_name_detector",
        "geo_cased",
        "geo_uncased",
        "misc_detector",
        "multitoken_crunchbase_cased",
        "multitoken_crunchbase_uncased",
        "multitoken_geo_cased",
        "multitoken_geo_uncased",
        "multitoken_wiki_cased",
        "multitoken_wiki_uncased",
        "wiki_cased",
        "wiki_uncased"
    ],
    "num_lf_rec": 16
}
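
Assuming train_size counts the records in train.json and num_lf matches the lf list, the metadata can be cross-checked against the splits:

import json

with open('datasets/conll/meta.json') as f:
    meta = json.load(f)
with open('datasets/conll/train.json') as f:
    train = json.load(f)

assert meta["num_lf"] == len(meta["lf"])  # one entry per labeling function
assert meta["train_size"] == len(train)   # one record per training example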