Here’s a download link for all of bookcorpus as of Sept 2020 #27

shawwn · 2020-09-05T03:35:56Z

You can download it here: https://twitter.com/theshawwn/status/1301852133319294976?s=21

It contains 18k plain text files. The results are very high quality. I spent about a week fixing the epub2txt script, which you can find at https://github.com/shawwn/scrap under the name “epub2txt-all” (not “epub2txt”).

The new script:

Correctly preserves structure, matching the table of contents very closely;
Correctly renders tables of data (by default html2txt produces mostly garbage-looking results for tables);
Correctly preserves code structure, so that source code and similar things are visually coherent;
Converts numbered lists from “1\.” to “1.”;
Runs the full text through ftfy.fix_text() (which is what OpenAI does for GPT), replacing Unicode apostrophes with ASCII apostrophes;
Expands Unicode ellipses to “...” (three separate ASCII characters).
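
For readers who want a feel for the last three clean-up steps, here is a rough sketch using only the standard library. It is not the epub2txt-all code: `normalize_text` is a hypothetical helper, and the small replacement table is only a stand-in for what ftfy.fix_text() does far more thoroughly.

```python
import re

# Hypothetical stand-in for part of the clean-up described above.
# The real script runs ftfy.fix_text(); this table covers only a few characters.
_REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (typographic apostrophe)
    "\u2026": "...",  # horizontal ellipsis -> three ASCII dots
}

# Matches an escaped list number such as "1\." at the start of a line.
_ESCAPED_LIST_NUMBER = re.compile(r"^([ \t]*\d+)\\\.", flags=re.MULTILINE)


def normalize_text(text: str) -> str:
    """Replace a few Unicode characters and unescape numbered-list markers."""
    for source, target in _REPLACEMENTS.items():
        text = text.replace(source, target)
    return _ESCAPED_LIST_NUMBER.sub(r"\1.", text)


sample = "1\\. Don\u2019t stop\u2026\n2\\. Keep going"
print(normalize_text(sample))
```

If ftfy is installed, calling ftfy.fix_text() on the text first and then applying only the list-number regex would be closer to what the post describes.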

The tarball download link (see tweet above) also includes the original ePub URLs, updated for September 2020, which ended up being about 2k more than the URLs in this repo. But they’re hard to crawl. I do have the epub files, but I’m reluctant to distribute them for obvious reasons.

soskek · 2020-09-05T07:32:37Z

@shawwn Excellent work! It seems great, and I added the reference to it in the README in this repo!

ZonglinY · 2020-09-24T01:40:57Z

@shawwn Thanks for your efforts! However, I ran into a 'network error' when using the link. Has anyone succeeded in downloading from it?

richarddwang · 2020-09-30T11:54:16Z

@shawwn This is exciting!
But I also encountered a failed download.

shawwn · 2020-10-02T04:23:17Z

@ZonglinY @richarddwang

Sorry for the download problems. It should be fixed now. My server was running out of space due to 128GB of Google Cloud logs.

Ideally the zip file could be mirrored elsewhere. I'd set up a torrent, but I've never done that before. If someone has a good walkthrough, feel free to link it, else I'll research it someday.

SeanVody · 2020-10-19T21:11:22Z

@shawwn This seems excellent and I can't wait to snag a copy of the files!

Unfortunately I'm running into failed downloads now as well (likely due to log proliferation again I'd presume -- incidentally, while I know nothing about setting up torrents, I'd be happy to help out with a stop-gap scripted daemon that cleans logs to keep them in check if that appeals).

shawwn · 2020-10-25T07:08:14Z

@SeanVody and everyone else:

I am delighted to announce that, in cooperation with the-eye.eu, bookcorpus now has a reliable, stable download link that I expect will work for years to come:

https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz

(It's bit-for-bit identical to the file in my original tweet.)
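
Once the tarball is on disk, something like the following sketch can walk its text files without unpacking everything first. The helper name `iter_text_members` and the assumption that the books are stored as `.txt` members are mine; the exact directory layout inside books1.tar.gz is not described in this thread.

```python
import tarfile
from typing import Iterator, Tuple


def iter_text_members(tar_path: str, suffix: str = ".txt") -> Iterator[Tuple[str, str]]:
    """Yield (member_name, text) for each regular file ending in `suffix`."""
    # "r:*" lets tarfile detect the compression itself (gzip for a .tar.gz).
    with tarfile.open(tar_path, mode="r:*") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            with handle:
                # Assumes UTF-8; undecodable bytes are replaced rather than raising.
                yield member.name, handle.read().decode("utf-8", errors="replace")


# Example usage (assumes the file was already downloaded to the working directory):
#   for name, text in iter_text_members("books1.tar.gz"):
#       print(name, len(text))
```

Reading members one at a time keeps memory use bounded by the largest single book rather than by the whole archive.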

However, anyone who is looking for bookcorpus will undoubtedly be interested in everything else. I urge you to take a peek: https://the-eye.eu/public/AI/pile_preliminary_components

In addition to bookcorpus (books1.tar.gz), it also has:

books3.tar.gz (37GB), aka "all of bibliotik in plain .txt form", aka 197,000 books processed in exactly the same way as I did for bookcorpus here. So basically 11x bigger.
github.tar (100GB), a huge amount of code for training purposes
Many other delightful datasets, all of which are extremely high quality.

This is possible thanks to two organizations. First and foremost, thank you to the-eye.eu. They have a wonderful community (see discord), and they are extremely interested in archiving data for the benefit of humanity.

Secondly, thank you to "The Pile", which is the project that has been meticulously gathering and preparing this training data. Join their discord if you're interested in ML: https://www.eleuther.ai/get-involved

You now have OpenAI-grade training data at your fingertips; do with it as you please.

books3.tar.gz seems to be similar to OpenAI's mysterious "books2" dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it's "all of libgen", but it's purely conjecture. Nonetheless, books3 is "all of bibliotik", which is possibly useful to anyone doing NLP work.

I have tried to carefully and rigorously prepare the data in books3; e.g. all of the files are already preprocessed with ftfy.fix_text(), as OpenAI does.

If you have high quality datasets that you wish to make available to ML researchers, please DM me (@theshawwn) or reach out to The Pile.

jorditg · 2020-10-26T07:09:39Z

Great!

Do we have any information about the language percentages of the dataset, or should it be considered a "mainly English" dataset?

shawwn · 2020-10-26T09:22:54Z

@jorditg It's mostly English, but if anyone discovers a trove of foreign .epub files, please DM me. I am quite interested in doing various foreign language versions.

By the way, you can use the epub to txt converter on your own .epub files. I would be curious if it works well enough on foreign epubs, since sadly I speak only southern Texas, ya'll.


turnkit · 2020-11-07T14:08:06Z

+1 torrent.

shawwn · 2020-11-17T16:47:38Z

Happy to announce that bookcorpus was just merged into huggingface's Datasets library as bookcorpusopen, thanks to @vblagoje: huggingface/datasets#856

vblagoje · 2020-11-21T16:09:22Z

Small correction @shawwn - it is bookcorpusopen. Whoever wants to use Shawn's bookcorpus in HuggingFace Datasets simply has to:

from datasets import load_dataset
d = load_dataset('bookcorpusopen', split="train")

And then continue to use dataset d as any other HF dataset. See the manual for more details or the dataset card for this version of bookcorpus.
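
As a small illustration of "use it like any other HF dataset", the sketch below counts books and whitespace-separated words over any iterable of records. `corpus_stats` is a hypothetical helper, and it assumes each record exposes the book body under a `"text"` key, which is what I understand the bookcorpusopen dataset card to describe; check the card before relying on it.

```python
from typing import Dict, Iterable


def corpus_stats(records: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count records and whitespace-separated words in their "text" field."""
    books = 0
    words = 0
    for record in records:
        books += 1
        # Assumption: the book body lives under the "text" key.
        words += len(record["text"].split())
    return {"books": books, "words": words}


# Example usage with the dataset loaded above (needs the `datasets` package
# and a full download of the corpus):
#   from datasets import load_dataset
#   d = load_dataset("bookcorpusopen", split="train")
#   print(corpus_stats(d))
```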

ilyalasy · 2021-06-09T12:49:01Z

Hey, I'm experiencing failed downloads with the link mentioned in @shawwn's the-eye.eu announcement above. Am I the only one?

lucaguarro · 2021-06-11T23:36:15Z

Is there a way to get the authors and titles for the books in any of those download links (in a machine readable format)?

shawwn · 2023-02-28T13:43:12Z

You can now download the original epub files for bookcorpus:

https://battle.shawwn.com/bookcorpus-epub.tar

It's 14.2GB with 17,876 epub files.

The tarball also contains bookcorpus/2020-08-27-epub_urls.txt which is a file containing the original URLs I scraped the epubs from back in 2020. Many of the urls are dead as of 2023, but it might still be useful for gathering metadata.

Quoting @lucaguarro: "Is there a way to get the authors and titles for the books in any of those download links (in a machine readable format)?"

@lucaguarro if you're still interested in this, you can extract that info from the epub files via the above download link.
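
One way to pull titles and authors out of the epub files is to read each book's package metadata directly. The sketch below relies only on the standard EPUB layout (a `META-INF/container.xml` pointing at an OPF package file with Dublin Core `dc:title` and `dc:creator` elements); `read_epub_metadata` is a hypothetical helper, and individual files in the tarball may have missing or messy metadata.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict, List

# XML namespaces used by the EPUB container file and Dublin Core metadata.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_metadata(epub_path: str) -> Dict[str, List[str]]:
    """Return the dc:title and dc:creator values found in an EPUB's OPF file."""
    with zipfile.ZipFile(epub_path) as book:
        # container.xml tells us where the OPF package document lives.
        container = ET.fromstring(book.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None:
            raise ValueError(f"no rootfile entry in {epub_path}")
        package = ET.fromstring(book.read(rootfile.attrib["full-path"]))

    def texts(tag: str) -> List[str]:
        return [
            element.text.strip()
            for element in package.iterfind(f".//dc:{tag}", NS)
            if element.text and element.text.strip()
        ]

    return {"titles": texts("title"), "authors": texts("creator")}


# Example usage (the path is illustrative):
#   print(read_epub_metadata("bookcorpus/some-book.epub"))
```

Writing one JSON line per book from this dictionary would give the machine-readable listing asked about above.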

ofou · 2023-03-28T13:19:30Z

Does it include books3.tar.gz the LibGen db (in *.txt)?

shawwn · 2023-03-28T16:24:22Z

Quoting @ofou: "Does it include books3.tar.gz the LibGen db (in *.txt)?"

@ofou No, but I do have a copy of those epub files. Someday I'll get around to packaging them up.

ofou · 2023-03-28T20:09:19Z

Thanks for answering, @shawwn. The cool thing about libgen is that they maintain a database dump with all the metadata and a torrent health tracker, and they host around 4 million non-fiction books plus 4.2 million fiction books. That is around 248.39 TB of content if you count scientific papers and other material. I wonder what the size of all of libgen would be as .txt.

And it seems to keep growing!

| Topic | Total files | Total filesize | Middle filesize | Files added in the last day | Size of files added in the last day | Files added in the last month | Size of files added in the last month |
| --- | --- | --- | --- | --- | --- | --- | --- |
| libgen | 4052645 | 58.96 TB | 15.51 MB | 16 | 252.195 MB | 34428 | 624.3 GB |
| fiction | 4243562 | 5.62 TB | 1.382 MB | 1609 | 9.8 GB | 11056 | 41.9 GB |
| fiction_rus | 1437450 | 2.7 TB | 1.794 MB | 0 | 0 B | 1 | 35.042 MB |
| scimag | 86726617 | 80.62 TB | 989.711 kB | 0 | 0 B | 19 | 61.958 MB |
| magazines | 381044 | 7.66 TB | 21.092 MB | 0 | 0 B | 0 | 0 B |
| comics | 2372361 | 93.09 TB | 41.144 MB | 2 | 26.945 MB | 8105 | 880 GB |
| standarts | 228799 | 529.4 GB | 2.369 MB | 0 | 0 B | 11 | 30.581 MB |

ofou · 2023-05-05T06:07:54Z

Quoting my earlier question: "Does it include books3.tar.gz the LibGen db (in *.txt)?"

Quoting @shawwn's reply: "@ofou No, but I do have a copy of those epub files. Someday I'll get around to packaging them up."

Can't stop thinking about this. I think it'd be awesome to collect all this data in txt, per language, etc. Maybe I'll give it a shot!

iliemihai · 2023-06-28T19:19:28Z

I am also interested in this corpus

ofou · 2023-06-29T10:24:06Z

Quoting @iliemihai: "I am also interested in this corpus"

let's join forces to download it

LazurasLong · 2023-07-14T05:24:09Z

Has anyone else noticed that the links to Books3 are dead?

alerque · 2023-07-16T19:18:50Z

I'm sure some threads linking to it going viral on social media factored into that ;-)

There are a number of torrents out there, including The Pile v1, that contain books3. The whole thing is 773 GB in total, so you probably want to use a torrent client that lets you download only some of the embedded files / directories.

Edit: Evidently magnet links don't work from GH Issues, so here is the whole thing:

magnet:?xt=urn:btih:0d366035664fdf51cfbe9f733953ba325776e667&dn=EleutherAI_ThePile_v1&tr=https%3A%2F%2Facademictorrents.com%2Fannounce.php&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
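
If you want to check what a magnet link like the one above actually points at before handing it to a client, its parts can be read with the standard library. This is only a sketch: `parse_magnet` is a hypothetical helper that handles the `xt`, `dn`, and `tr` parameters seen in this link and nothing else.

```python
from typing import Dict, List, Optional, Union
from urllib.parse import parse_qs, urlsplit

MAGNET = (
    "magnet:?xt=urn:btih:0d366035664fdf51cfbe9f733953ba325776e667"
    "&dn=EleutherAI_ThePile_v1"
    "&tr=https%3A%2F%2Facademictorrents.com%2Fannounce.php"
    "&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969"
    "&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce"
)


def parse_magnet(uri: str) -> Dict[str, Union[Optional[str], List[str]]]:
    """Split a magnet URI into its BitTorrent info hash, name, and trackers."""
    parts = urlsplit(uri)
    if parts.scheme != "magnet":
        raise ValueError("not a magnet URI")
    params = parse_qs(parts.query)  # also decodes the percent-escapes
    exact_topic = params.get("xt", [""])[0]
    prefix = "urn:btih:"
    info_hash = exact_topic[len(prefix):] if exact_topic.startswith(prefix) else None
    return {
        "info_hash": info_hash,
        "name": params.get("dn", [None])[0],
        "trackers": params.get("tr", []),
    }


info = parse_magnet(MAGNET)
print(info["name"], info["info_hash"])
for tracker in info["trackers"]:
    print(tracker)
```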

u7390 · 2023-07-23T21:26:38Z

How to use the epub to txt converter at https://github.com/shawwn/scrap/blob/master/epub2txt-all on my own .epub files?

LazurasLong · 2023-07-23T23:21:04Z

Just use calibre
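
For a folder of .epub files, the calibre route could be scripted roughly as below. This is a sketch under assumptions: it expects calibre's `ebook-convert` command-line tool to be on the PATH, the helper names are made up for illustration, and the output will follow calibre's own text formatting rather than the table and code handling of epub2txt-all.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_commands(epub_dir: str, out_dir: str) -> List[List[str]]:
    """Build one `ebook-convert input.epub output.txt` command per .epub file."""
    commands = []
    for epub in sorted(Path(epub_dir).glob("*.epub")):
        target = Path(out_dir) / (epub.stem + ".txt")
        commands.append(["ebook-convert", str(epub), str(target)])
    return commands


def convert_all(epub_dir: str, out_dir: str) -> None:
    """Run the conversions; fails early if calibre's CLI is not installed."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; install calibre first")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in build_convert_commands(epub_dir, out_dir):
        subprocess.run(command, check=True)


# Example usage (directory names are illustrative):
#   convert_all("my_epubs", "my_txts")
```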


1os3 · 2023-08-24T14:04:05Z

The link seems to be dead.

the-superpirate · 2023-09-08T19:59:45Z

We recently began extracting the text layers of scholarly publications and books to include in our database. This encompasses sources such as scimag, libgen, and the latest zlib leaks. Our project, named the Standard Template Construct, also features a distributed search engine and incorporates various AI routines to handle the text corpus.

Today we released our first dataset, STC230908. This dataset contains approximately 75,000 book texts, 1.3 million scholarly paper texts, and 24 million abstracts, including the years from 2021 to 2023.

We're currently in the process of preparing the next version of the dataset, which will include an additional 300,000 books.

How to Access

Short Instructions:

Install IPFS and launch it.
pip3 install stc-geck && geck - documents

More details: the dataset is released on IPFS and replicated to multiple nodes. It is in the format of the database for the search engine that we use in STC. GECK is the library that embeds this search engine and lets you stream all of the contained data easily.

Even more detailed Instructions: STC GitHub Repository

cocopete · 2024-05-08T02:31:03Z

so weird... downloads deleted

richarddwang mentioned this issue Sep 30, 2020: Bookcorpus data contains pretokenized text huggingface/datasets#486 (Closed)

szha mentioned this issue Oct 26, 2020: [nlp_data] Add BookCorpus dmlc/gluon-nlp#1406 (Open)

Aurametrix added a commit to Aurametrix/Alg that referenced this issue Oct 28, 2020: "books" (45e70f9), see also https://nanowrimo.org/ soskek/bookcorpus#27 (comment)

vblagoje mentioned this issue Nov 16, 2020: Add open book corpus huggingface/datasets#856 (Merged)

bearlike mentioned this issue Dec 7, 2022: Possible to share the books corpus hhexiy/pungen#1 (Open)

kibitzing mentioned this issue Jun 27, 2024: GPT Pre-training Data kibitzing/awesome-llm-data#3 (Open)
