daniel-z-kaplan/ProteinData

First, run download-data to fetch the raw data.
Then run read_fasta (you may want to change the file names it uses). read_fasta parses the FASTA file and handles tokenization, as sketched below.
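The exact implementation of read_fasta lives in this repo, but since the preprocessing step below reads ../fasta_parsed.txt with --json-key "seq", the parsed output is presumably one JSON object per line with the sequence under a "seq" key. A minimal sketch of such a FASTA-to-JSONL conversion (the function and file names are illustrative assumptions, not the repo's actual code):

import json

def fasta_to_jsonl(fasta_path, out_path):
    # Parse a FASTA file and write one {"seq": ...} JSON object per line,
    # matching the --json-key "seq" expected by preprocess_data.py below.
    with open(fasta_path) as src, open(out_path, "w") as dst:
        chunks = []
        for line in src:
            line = line.strip()
            if line.startswith(">"):        # a header line starts a new record
                if chunks:
                    dst.write(json.dumps({"seq": "".join(chunks)}) + "\n")
                chunks = []
            elif line:                      # sequence data may span several lines
                chunks.append(line)
        if chunks:                          # flush the final record
            dst.write(json.dumps({"seq": "".join(chunks)}) + "\n")

fasta_to_jsonl("proteins.fasta", "fasta_parsed.txt")  # hypothetical file names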
Then cd into tools and run:

python3 preprocess_data.py --input ../fasta_parsed.txt --output-prefix my-gpt --vocab ../model-vocab.json --dataset-impl mmap --tokenizer-type GPT2BPETokenizer --merge-file ../model-merges.txt --workers 6 --chunk-size 100 --append-eod --json-key "seq"
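These flags match the Megatron/GPT-NeoX-style preprocess_data.py, so with --dataset-impl mmap the run should emit an indexed binary dataset (in that convention, something like my-gpt_seq_document.bin and my-gpt_seq_document.idx). Since a full run can be slow, a quick sanity check that fasta_parsed.txt really is JSONL with a non-empty "seq" field can save a wasted pass (the relative path is an assumption based on the command above):

import json
import sys

# Sanity-check fasta_parsed.txt before the (slow) preprocessing run.
# Assumed layout: run from tools/, with fasta_parsed.txt one directory up.
total = bad = 0
with open("../fasta_parsed.txt") as f:
    for lineno, line in enumerate(f, 1):
        total += 1
        try:
            rec = json.loads(line)
            if not isinstance(rec.get("seq"), str) or not rec["seq"]:
                raise ValueError("missing or empty 'seq'")
        except (json.JSONDecodeError, ValueError) as err:
            bad += 1
            print(f"line {lineno}: {err}", file=sys.stderr)
print(f"{bad} malformed line(s) out of {total}")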

About

Data for RWKV
