The ChemNLP project aims to
- create an extensive chemistry dataset and
- use it to train large language models (LLMs) that can leverage the data for a wide range of chemistry applications.
For more details, see our information material below:
- Introduction presentation
- Project proposal
- Task board
- awesome-chemistry-datasets repository to collect interesting chemistry datasets
- Weekly meetings will be set up soon! Please join our Discord community for more information.
Feel free to join the #chemnlp channel on our OpenBioML Discord server to discuss in more detail.
ChemNLP is an open-source project - your involvement is warmly welcome! If you're excited to join us, we recommend the following steps:
- Join our Discord server.
- Have a look at our contributing guide.
- Looking for ideas? Check our task board to see what we may need help with.
- Have an idea? Create an issue!
Our OpenBioML ChemNLP project is not affiliated with the ChemNLP library from NIST; we use "ChemNLP" as a general term to highlight our project's focus. The datasets and models we create through our project will have a unique and recognizable name when we release them.
See https://openbioml.org, especially our approach and partners.
Create a new conda environment with Python 3.8:
```bash
conda create -n chemnlp python=3.8
conda activate chemnlp
```
To install the `chemnlp` package (and required dependencies):

```bash
pip install chemnlp
```
If working on developing the Python package:

```bash
pip install -e "chemnlp[dev]"  # to install development dependencies
```
If extra dependencies are required (e.g. for dataset creation) but are not needed for the main package, please add them to the `dataset_creation` variable in the `pyproject.toml` and ensure this is reflected in the `conda.yml` file.
Note: If working on model training, request access to the wandb project `chemnlp` and log in to wandb with your API key per here.
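Once you have an API key, a minimal sketch of authenticating and logging a run from Python (assuming the `wandb` package is installed; running `wandb login` from the command line works equally well):

```python
import wandb

# Reads the API key from the WANDB_API_KEY environment variable or the local
# wandb credentials file; pass key="..." explicitly if neither is set.
wandb.login()

# Runs can then be logged against the shared project.
run = wandb.init(project="chemnlp")
run.finish()
```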
We specify datasets by creating a new function here, named after the dataset on Hugging Face. At present the function must accept a tokenizer and return the tokenized train and validation datasets, as sketched below.
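For illustration only, a minimal sketch of such a function; the dataset identifier, column name, and use of the Hugging Face `datasets` library are assumptions, not the project's actual code:

```python
from datasets import load_dataset


def example_chemistry_dataset(tokenizer):
    """Hypothetical dataset function, named after the corresponding Hugging Face dataset."""
    raw = load_dataset("example_chemistry_dataset")  # assumed Hub identifier

    def tokenize(batch):
        # The "text" column is an assumption; adjust to the dataset's schema.
        return tokenizer(batch["text"], truncation=True)

    tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
    return tokenized["train"], tokenized["validation"]
```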