The ChemNLP project aims to
- create an extensive chemistry dataset and
- use it to train large language models (LLMs) that can leverage the data for a wide range of chemistry applications.
For more details, see our information material below:
- Introduction presentation
- Project proposal
- Task board
- awesome-chemistry-datasets repository to collect interesting chemistry datasets
- Weekly meetings will be set up soon! Please join our Discord community for more information.
Feel free to join the #chemnlp channel on our OpenBioML Discord server to discuss in more detail.
ChemNLP is an open-source project - your involvement is warmly welcome! If you're excited to join us, we recommend the following steps:
- Join our Discord server.
- Have a look at our contributing guide.
- Looking for ideas? Check our task board to see what we may need help with.
- Have an idea? Create an issue!
Our OpenBioML ChemNLP project is not affiliated with the ChemNLP library from NIST; we use "ChemNLP" as a general term to highlight our project's focus. The datasets and models we create through our project will have a unique and recognizable name when we release them.
See https://openbioml.org, especially our approach and partners.
Create a new conda environment with Python 3.8:
```bash
conda create -n chemnlp python=3.8
conda activate chemnlp
```
To install the `chemnlp` package (and required dependencies):

```bash
pip install chemnlp
```
If working on developing the Python package:

```bash
pip install -e "chemnlp[dev]"  # to install development dependencies
```
If extra dependencies are required (e.g. for dataset creation) but are not needed for the main package, please add them to the `dataset_creation` variable in the `pyproject.toml` and ensure this is reflected in the `conda.yml` file.
Note: If working on model training, request access to the wandb project `chemnlp` and log in to wandb with your API key per here.
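Once you have an API key, a minimal sketch of authenticating and logging a run from Python (assuming the `wandb` package is installed; running `wandb login` from the command line works equally well):

```python
import wandb

# Reads the API key from the WANDB_API_KEY environment variable or the local
# wandb credentials file; pass key="..." explicitly if neither is set.
wandb.login()

# Runs can then be logged against the shared project.
run = wandb.init(project="chemnlp")
run.finish()
```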
We specify datasets by creating a new function here, named after the dataset on Hugging Face. At present the function must accept a tokenizer and return the tokenized train and validation datasets, as sketched below.
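For illustration only, a minimal sketch of such a function; the dataset identifier, column name, and use of the Hugging Face `datasets` library are assumptions, not the project's actual code:

```python
from datasets import load_dataset


def example_chemistry_dataset(tokenizer):
    """Hypothetical dataset function, named after the corresponding Hugging Face dataset."""
    raw = load_dataset("example_chemistry_dataset")  # assumed Hub identifier

    def tokenize(batch):
        # The "text" column is an assumption; adjust to the dataset's schema.
        return tokenizer(batch["text"], truncation=True)

    tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
    return tokenized["train"], tokenized["validation"]
```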