# corpus-data

Here are 164 public repositories matching this topic...

esbatmop / MNBVC

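MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, dialogue scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.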

nlp chinese chinese-nlp corpus-data chinese-simplified nlp-machine-learning chinese-language

Updated Jan 13, 2025

PlexPt / chatgpt-corpus

ChatGPT Chinese corpus: dialogue, fiction, and customer-service corpora for training large models.

awesome corpus question-answering corpus-data

Updated May 15, 2024

shijiebei2009 / CEC-Corpus

📚 Chinese Emergency Corpus, Semantic Intelligence Laboratory, Shanghai University

corpus-data

Updated Sep 26, 2019

sheepzh / poetry

The most complete corpus of modern Chinese-language poetry on Earth: 3k+ poets, 80K+ poems, 15M+ characters.

nlp poetry literature corpus-data chinese-corpus

Updated Jan 3, 2025
Python

gkiril / oie-resources

A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.

Updated Oct 25, 2022

guhhhhaa / 4675-scifi

Chinese NLP corpus of Chinese science fiction: about 4675 Chinese science fiction novels. A Chinese science fiction corpus, text corpus, and text database for natural language processing.

nlp corpus science-fiction scifi chinese-nlp corpus-data datasets nlp-resources nlp-machine-learning nlp-datasets

Updated Oct 22, 2022

grammarly / ua-gec

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

natural-language-processing corpus dataset corpus-data corpus-tools gec nlp-datasets grammatical-error-correction ukrainian-language

Updated Feb 11, 2024
Macaulay2

guhhhhaa / wula-scifi

Chinese NLP corpus of Chinese science fiction: an archive of the Ark Plan of the Ula Science Fiction Website (乌拉科幻小说网). A Chinese science fiction corpus, text corpus, and text database for natural language processing.

nlp corpus science-fiction scifi chinese-nlp corpus-data datasets nlp-resources nlp-machine-learning nlp-datasets

Updated Oct 22, 2022

NathanDuran / Switchboard-Corpus

Utilities for Processing the Switchboard Dialogue Act Corpus

dialogue corpus corpus-data corpus-tools switchboard dialogues corpus-processing dialogue-data switchboard-corpus dialogue-act

Updated Jan 24, 2021
Python

aplmikex / deduplication_mnbvc

Text deduplication

nlp chinese chinese-nlp corpus-data chinese-simplified nlp-machine-learning chinese-language

Updated May 23, 2024
Python

dataset-vn / DANeS

DANeS is an open-source e-newspaper dataset created in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn).

open-source machine-learning natural-language-processing corpus artificial-intelligence dataset newspaper corpus-data text-sentiment danes datasetvn aivgroup

Updated May 11, 2022
Python

zonghui0228 / BioMedical-NLP-corpus

Biomedical NLP Corpus or Datasets.

nlp natural-language-processing text-mining bioinformatics dataset named-entity-recognition corpus-data medical-informatics

Updated Apr 26, 2022

LemonAttn / bilibili_comment_crawl

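Crawls the comments under bilibili videos. Newly released! ⚠ This code is for learning purposes only; the author accepts no responsibility for any other use!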

python crawler spider requests bilibili corpus-data

Updated Jan 11, 2025
Python

johentsch / ms3

A parser for annotated MuseScore 3 files.

Updated Sep 26, 2024
Python

shijiebei2009 / CEEC-Corpus

📚 Chinese Environment Emergency Corpus, Semantic Intelligence Laboratory, Shanghai University

corpus-data

Updated Nov 3, 2015

hailiang-wang / egret-wenda-corpus

A Public Corpus for Machine Learning

qa corpus corpus-data

Updated Jul 3, 2018
JavaScript

KehaoWu / Jinyong-Corpus

Dictionary of Jin Yong's 15 novels

nlp corpus-data

Updated Nov 17, 2018

jaaack-wang / ccnc

CCNC: A Comprehensive Chinese Name Corpus, a large-scale corpus of Chinese names (3.65M name samples).

names chinese corpus-data webscraping

Updated Jun 28, 2021
Jupyter Notebook

uma-pi1 / OPIEC

Reading the data from OPIEC - an Open Information Extraction corpus

nlp natural-language-processing wiki wikipedia corpus information-extraction dataset corpora corpus-data nlp-resources wikipedia-dump corpus-tools natural-language-understanding open-information-extraction dataset-interface wikipedia-corpus corpus-processing nlp-datasets

Updated Jun 12, 2019
Java

CanCLID / canto-filter

Cantonese text filter for screening Cantonese corpus data

nlp data corpus cantonese corpus-data cantonese-language

Updated Dec 17, 2024
Python
