Add Papyrus 3 Million data point pchembl for 7k protein #340

phalem · 2023-06-29T13:26:10Z

Add all Papyrus data points that have a pchembl value, taken from https://doi.org/10.4121/16896406.v3. For a detailed explanation of what the columns mean, see the README at the reference link. The data was cleaned and uploaded to Hugging Face because the original data is difficult to handle on a normal computer. Link: https://huggingface.co/datasets/phalem/awesome_chem_clean_data/resolve/main/pchembl_papyrus.csv.gz (size: 105 MB)
Please note: I filled NA values in every field other than pchembl with "unknown". Please look at the other fields if possible and revise the columns as well.
Examples include:
What is the this mention at?
what is the of the or on ?
what <activity_type> of the reported on ? Ka for example.
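A minimal sketch of the NA handling described above, assuming pandas is used for cleaning. The function name `fill_non_target_na` and the toy column names (`SMILES`, `activity_type`, `pchembl`) are illustrative assumptions, not the actual Papyrus column names or the code in this PR:

```python
import pandas as pd


def fill_non_target_na(
    df: pd.DataFrame, target: str, placeholder: str = "unknown"
) -> pd.DataFrame:
    """Fill missing values with `placeholder` in every column except `target`."""
    out = df.copy()
    other_cols = [c for c in out.columns if c != target]
    # Cast to object first so a string placeholder can also sit in numeric columns.
    out[other_cols] = out[other_cols].astype(object).fillna(placeholder)
    return out


# Toy frame with hypothetical column names, for illustration only.
example = pd.DataFrame(
    {
        "SMILES": ["CCO", None],
        "activity_type": [None, "Ki"],
        "pchembl": [6.5, None],
    }
)
filled = fill_non_target_na(example, target="pchembl")
```

Leaving the target column untouched keeps missing pchembl values distinguishable from real measurements, so rows without a value can still be filtered out later.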

Please note that it still needs some enhancement, if possible.
@MicPie Can you help me with this?
The data is large. Hugging Face raises a problem when loading it with load_dataset.

For the 60 million data points, we will need to check whether each compound is active or not, as I found that compounds without a pchembl value are inactive. However, I did not check this across all of the data or against other data sources. I will look for a way to do that.

Thank you.


MicPie · 2023-07-26T13:19:18Z

Hi @phalem thank you for looking into the Papyrus data, this looks very interesting!

For this dataset you used the data from https://data.4tu.nl/file/ca10bf7d-f508-4d54-9c9a-5a9e9c1adef9/36feebfc-4703-4290-90f2-f3e41261f0c4, right?
If so, we don't have to go through the HF Hub route at all, or maybe I'm missing something?

PS: I just merged with the latest main and applied the pre-commit hooks.

MicPie · 2023-07-26T15:35:47Z

Ok, I'm currently trying to get the data from the direct source but the data is very big and the transform.py script needs a lot of RAM. Let's see how this works out. Depending on that we can discuss how we best approach that.
But this seems to be a great and big dataset! :-)
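One possible way to keep memory bounded for a file of this size is to stream it through pandas in chunks. This is only a sketch under assumptions: the function name `transform_in_chunks`, the per-chunk cleaning step, and the demo file names are hypothetical and do not reflect what transform.py currently does.

```python
import os
import tempfile

import pandas as pd


def transform_in_chunks(src: str, dst: str, chunksize: int = 100_000) -> int:
    """Stream a (possibly gzipped) CSV in fixed-size chunks and append each to `dst`."""
    rows = 0
    first = True
    # Compression is inferred from the ".gz" extension by pandas.
    for chunk in pd.read_csv(src, chunksize=chunksize):
        # Placeholder cleaning step (assumption): drop rows that are entirely empty.
        chunk = chunk.dropna(how="all")
        # Write the header only once, then append the remaining chunks.
        chunk.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
        rows += len(chunk)
    return rows


# Small self-contained demo on a temporary gzipped CSV with made-up values.
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "demo.csv.gz")
dst_path = os.path.join(tmpdir, "demo_clean.csv")
pd.DataFrame(
    {"SMILES": list("ABCDE"), "pchembl": [5.0, 6.0, 7.0, 8.0, 9.0]}
).to_csv(src_path, index=False)
n_rows = transform_in_chunks(src_path, dst_path, chunksize=2)
```

Only one chunk is held in memory at a time, so peak RAM depends on `chunksize` rather than on the total file size.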

phalem mentioned this pull request Jun 30, 2023

Dataset TODO list #75

Open

Merge branch 'main' into add_papyrus_pchembl

7f9f0de

MicPie self-requested a review July 26, 2023 13:13

MicPie assigned phalem Jul 26, 2023
