hazylavender/fl_llm_benchmark_dataset

fl_llm_benchmark_dataset

Congressional Dataset

This code collects congressional and parliamentary speeches from the US, UK, and Canada. The dataset is hosted on Hugging Face at https://huggingface.co/datasets/hazylavender/CongressionalDataset

Biorxiv

This code collects bioRxiv abstracts released under the licenses 'cc_by_nc_nd', 'cc_by_nd', 'cc_by_nc', 'cc_by', and 'cc0'. The dataset is hosted on Hugging Face at https://huggingface.co/datasets/hazylavender/biorxiv-abstract
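
As a rough illustration of the license restriction, the sketch below keeps only records whose license is in the list above. The `license` field name, the `is_allowed` helper, and the placeholder records are assumptions made for this example; they are not taken from the collection code.

```python
# License identifiers listed above.
ALLOWED_LICENSES = {"cc_by_nc_nd", "cc_by_nd", "cc_by_nc", "cc_by", "cc0"}


def is_allowed(record: dict) -> bool:
    """True if the record's `license` field (assumed name) is in the allow-list."""
    return record.get("license") in ALLOWED_LICENSES


# Placeholder records for illustration only; not real bioRxiv entries.
candidates = [
    {"title": "placeholder A", "license": "cc_by"},
    {"title": "placeholder B", "license": "unlisted_license"},
    {"title": "placeholder C"},  # no license field at all
]

# Records with an unlisted or missing license are dropped.
kept = [r for r in candidates if is_allowed(r)]
```

Using `record.get` means a record without a license field is dropped rather than raising an error, which seems the safer default when the goal is to respect license terms.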

The Congressional dataset contains speeches extracted from government institutions of the US, UK, and Canada. Each speech is a single partition and carries metadata such as speaker name, date, country, and source URL.

The raw dataset is a list of dictionaries, each like the following:

{
    "url": "https://www.ourcommons.ca/Content/House/441/Debates/335/HAN335-E.XML",
    "date_str": "2024-06-19",
    "title": "Prime Minister's Award for Excellence in Early Childhood Education",
    "speaker": "Mr. Mike Kelloway",
    "data": "it is a privilege to rise in the House today to recognize a constituent of mine who has received the Prime Minister's Award for Teaching Excellence….",
    "chamber": "HOUSE OF COMMONS",
    "country": "CA"
},
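
As a minimal sketch of working with records in this shape, the code below parses `date_str` into a `datetime.date` and counts speeches per country. The `sample` list holds the one record from the example (with the speech text shortened), and the `parse_record` and `count_by_country` helpers are hypothetical names introduced here, not part of the repository.

```python
from collections import Counter
from datetime import date

# One record copied from the example; the speech text is shortened here.
sample = [
    {
        "url": "https://www.ourcommons.ca/Content/House/441/Debates/335/HAN335-E.XML",
        "date_str": "2024-06-19",
        "title": "Prime Minister's Award for Excellence in Early Childhood Education",
        "speaker": "Mr. Mike Kelloway",
        "data": "it is a privilege to rise in the House today ...",
        "chamber": "HOUSE OF COMMONS",
        "country": "CA",
    },
]


def parse_record(record: dict) -> dict:
    """Return a copy of the record with an added `date` field (hypothetical helper)."""
    parsed = dict(record)
    # date_str is in ISO format (YYYY-MM-DD), so fromisoformat can parse it directly.
    parsed["date"] = date.fromisoformat(record["date_str"])
    return parsed


def count_by_country(records: list) -> Counter:
    """Count speeches per country code (hypothetical helper)."""
    return Counter(r["country"] for r in records)


parsed = [parse_record(r) for r in sample]
counts = count_by_country(parsed)
```

Copying each record before adding the parsed date leaves the raw dictionaries unchanged, so they can still be serialized exactly as loaded.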

Using the dataset

You can load this dataset with any library you like. For example, with mlcroissant:

from mlcroissant import Dataset

ds = Dataset(jsonld="https://huggingface.co/api/datasets/hazylavender/CongressionalDataset/croissant")
records = ds.records("default")

To filter on dates (or any other field), you can do the following:

import pandas as pd

# Keep records dated on or after 2024-01-01.
# Field values arrive as bytes, hence .decode(); ISO dates compare correctly as strings.
df = pd.DataFrame(
    [r for r in records if r['default/date_str'].decode() >= '2024-01-01']
)
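
The same pattern extends to other fields. The sketch below is an offline illustration that needs no download: it builds two stand-in records with bytes values under `default/<field>` keys, mirroring what the date filter implies, and keeps those matching a country and a start date. The `default/country` key, the `matches` helper, and the stand-in records are assumptions for this example, not documented behavior of the dataset.

```python
# Stand-in records shaped like the loader output implied by the date filter:
# keys are prefixed with "default/" and values are bytes.
# The second record is a made-up placeholder for illustration only.
records = [
    {"default/date_str": b"2024-06-19", "default/country": b"CA"},
    {"default/date_str": b"2023-05-01", "default/country": b"US"},
]


def matches(record: dict, country: str, start: str) -> bool:
    """True if the record is from `country` and dated on or after `start` (YYYY-MM-DD)."""
    return (
        record["default/country"].decode() == country
        and record["default/date_str"].decode() >= start
    )


# Keep Canadian speeches from 2024 onward.
filtered = [r for r in records if matches(r, country="CA", start="2024-01-01")]
```

The resulting list can be passed to `pd.DataFrame` in the same way as the date-filtered records.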

Data sources and license information

US

Canada

UK

Biorxiv
