Add Papyrus 3 Million data point pchembl for 7k protein #340

Open
wants to merge 2 commits into
base: main

phalem
Contributor

@phalem phalem commented Jun 29, 2023

Add all Papyrus data points that have a pchembl value, taken from https://doi.org/10.4121/16896406.v3. For details on what each column means, see the README at the reference link. The data was cleaned and uploaded to Hugging Face, as the original data is difficult to upload from a normal computer. Link: https://huggingface.co/datasets/phalem/awesome_chem_clean_data/resolve/main/pchembl_papyrus.csv.gz (size: 105 MB).
Please note: I filled NA values in the fields other than pchembl with "unknown". Please look at the other fields if possible and revise the columns as well.
Examples include:
What is the this mention at?
what is the of the or on ?
what <activity_type> of the reported on ? Ka for example.
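
The NA handling mentioned in the note above could look roughly like the following. This is only a minimal sketch using plain Python dictionaries, not the code used to build the uploaded file; the column name `pchembl_value` and the helper name `fill_missing` are assumptions made for illustration.

```python
def fill_missing(row, target="pchembl_value", placeholder="unknown"):
    """Return a copy of `row` with empty fields replaced by `placeholder`.

    The `target` column (assumed here to be named "pchembl_value") is
    left untouched, so a missing activity value is never overwritten.
    """
    return {
        key: (placeholder if key != target and value in ("", None) else value)
        for key, value in row.items()
    }


# Example row with one empty metadata field.
example = {"SMILES": "CCO", "activity_type": "", "pchembl_value": "6.5"}
cleaned = fill_missing(example)
```

Keeping the pchembl column out of the fill means rows without an activity value can still be told apart from rows with incomplete metadata.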

If possible, this needs some enhancement.
@MicPie Can you help me with this?
The data is large, and Hugging Face raises a problem when loading it with load_dataset.
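
One possible way around the loading problem, sketched with only the Python standard library instead of load_dataset, is to stream the gzip-compressed CSV in batches so the whole file never has to sit in memory. The helper name `iter_batches`, the local file path, and the batch size are assumptions, not part of this PR.

```python
import csv
import gzip
from itertools import islice


def iter_batches(path, batch_size=100_000):
    """Yield lists of row dicts from a gzip-compressed CSV file.

    Each list holds at most `batch_size` rows, so memory use is bounded
    by the batch size rather than by the size of the file.
    """
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch
```

Usage would be along the lines of `for batch in iter_batches("pchembl_papyrus.csv.gz"): ...`, assuming the file has already been downloaded locally.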

For the 60 million data points, we will need to check whether each compound is active or not, as I found that compounds without a pchembl value are inactive. However, I did not check all of the data or the other data. I will look for a way to do that.

Thank you.

@phalem phalem mentioned this pull request Jun 30, 2023
@MicPie MicPie self-requested a review July 26, 2023 13:13
@MicPie
Contributor

MicPie commented Jul 26, 2023

Hi @phalem thank you for looking into the Papyrus data, this looks very interesting!

For this dataset you used the data from https://data.4tu.nl/file/ca10bf7d-f508-4d54-9c9a-5a9e9c1adef9/36feebfc-4703-4290-90f2-f3e41261f0c4, right?
If so, we don't have to go the HF Hub route at all, or maybe I'm missing something?

PS: I just merged with the latest main and applied the pre-commit hooks.

@MicPie
Contributor

MicPie commented Jul 26, 2023

Ok, I'm currently trying to get the data from the direct source but the data is very big and the transform.py script needs a lot of RAM. Let's see how this works out. Depending on that we can discuss how we best approach that.
But this seems to be a great and big dataset! :-)
