add antibody developability from TDC #99
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
--- | ||
name: SAbDab_Chen | ||
description: "Antibody data from Chen et al, where they process from the SAbDab. \n From an initial dataset of 3816 antibodies, they retained 2426\ | ||
\ antibodies\n that satisfy the following criteria: 1. \n have both sequence (FASTA) and Protein Data Bank (PDB) structure files,\n \ | ||
\ 2. contain both a heavy chain and a light chain, and 3. \n have crystal structures with resolution < 3 Å. \n The DI label is derived\ | ||
\ from BIOVIA's pipelines." | ||
targets: | ||
- id: developability | ||
description: functional antibody candidate to be developed into a manufacturable(1), or not(0) | ||
units: '' | ||
type: categorical | ||
names: | ||
- antibody developability | ||
- monoclonal antibody | |
- functional antibody candidate | ||
- manufacturable, stable, safe, and effective antibody drug | ||
uris: | ||
- https://rb.gy/idkdqp | ||
- https://rb.gy/b8cx8i | ||
Comment: With URIs we mean links to ontologies, such as the ones you can find here https://bioportal.bioontology.org/ontologies/BAO?p=classes&conceptid=http://purl.obolibrary.org/obo/NCIT_C20604 Reply: I removed those as they were not fitting our setup. |
||
identifiers: | ||
- id: antibody_pdb_ID | ||
type: Other | ||
Comment: Are those IDs chemically meaningful or just some identifier number? Reply: Yeah, the PDB ID. Reply: So should we keep them or remove them? |
||
description: antibody pdb id | |
- id: heavy_chain | |
type: Other | |
description: antibody heavy chain amino acid sequence in FASTA | |
- id: light_chain | |
type: Other | |
description: antibody light chain amino acid sequence in FASTA | |
license: CC BY 4.0 | ||
links: | ||
- url: https://doi.org/10.1101/2020.06.18.159798 | ||
description: corresponding publication | ||
- url: https://doi.org/10.1093/nar/gkt1043 | ||
description: corresponding publication | ||
- url: https://www.3ds.com/products-services/biovia/products/data-science/pipeline-pilot/ | ||
description: corresponding tools used | ||
- url: https://tdcommons.ai/single_pred_tasks/develop/#sabdab-chen-et-al | ||
description: data source | ||
num_points: 2409 | ||
bibtex: | ||
- "@article{Chen2020,\n doi = {10.1101/2020.06.18.159798},\n url = {https://doi.org/10.1101/2020.06.18.159798},\n year =\ | ||
\ {2020},\n month = jun,\n publisher = {Cold Spring Harbor Laboratory},\n author = {Xingyao Chen and Thomas Dougherty and\ | ||
\ \n Chan Hong and Rachel Schibler and Yi Cong Zhao and \n Reza Sadeghi and Naim Matasci and Yi-Chieh Wu and Ian Kerman},\n \ | ||
\ title = {Predicting Antibody Developability from Sequence \n using Machine Learning}}" | ||
- "@article{Dunbar2013,\n doi = {10.1093/nar/gkt1043},\n url = {https://doi.org/10.1093/nar/gkt1043},\n year = {2013},\n\ | ||
\ month = nov,\n publisher = {Oxford University Press ({OUP})},\n volume = {42},\n number = {D1},\n pages\ | ||
\ = {D1140--D1146},\n author = {James Dunbar and Konrad Krawczyk and Jinwoo Leem \n and Terry Baker and Angelika Fuchs and Guy Georges\ | ||
\ and Jiye Shi and\n Charlotte M. Deane},\n title = {{SAbDab}: the structural antibody database},\n journal = {Nucleic\ | ||
\ Acids Research}}" | ||
Comment: I'm also surprised by the linebreaks here. Reply: I guess this is also due to the Å? Anyway, fixed! |
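A possible explanation for the escaped linebreaks, offered as an assumption rather than something confirmed in this thread: PyYAML only emits the literal block style (`|`) when the scalar has no whitespace directly before a newline and, unless `allow_unicode=True` is passed, no non-ASCII characters such as `Å`. Otherwise it falls back to a double-quoted scalar with `\n` escapes. The sketch below reproduces this on a short string; `BlockDumper` and `strip_trailing_whitespace` are hypothetical names, not part of this PR.

```python
import yaml


def str_presenter(dumper, data):
    """Request literal block style ("|") for multiline strings."""
    if "\n" in data:
        return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
    return dumper.represent_scalar("tag:yaml.org,2002:str", data)


class BlockDumper(yaml.SafeDumper):
    """Hypothetical dumper subclass, so the global PyYAML dumpers stay untouched."""


BlockDumper.add_representer(str, str_presenter)


def strip_trailing_whitespace(text):
    """Remove whitespace at the end of every line (hypothetical helper)."""
    return "\n".join(line.rstrip() for line in text.splitlines())


# A trailing space before the newline plus a non-ASCII character,
# mirroring the dataset description in this PR.
raw = (
    "have crystal structures with resolution < 3 Å. \n"
    "The DI label is derived from BIOVIA's pipelines."
)

# Block style is requested, but PyYAML falls back to a double-quoted scalar.
quoted = yaml.dump({"description": raw}, Dumper=BlockDumper, sort_keys=False)

# With trailing whitespace stripped and unicode allowed, block style is used.
block = yaml.dump(
    {"description": strip_trailing_whitespace(raw)},
    Dumper=BlockDumper,
    sort_keys=False,
    allow_unicode=True,
)

print(quoted)
print(block)
```

If this assumption holds, stripping trailing whitespace from the multiline strings and passing `allow_unicode=True` to `yaml.dump` would keep the `Å` and still produce readable block scalars.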
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,153 @@ | ||
import pandas as pd | ||
import yaml | ||
from tdc.single_pred import Develop | ||
|
||
|
||
def get_and_transform_data(): | |
    # get raw data | |
    target_folder = "SAbDab_Chen" | |
    target_subfolder = "SAbDab_Chen" | |
    data = Develop(name=target_subfolder) | |
|
||
|
||
    # process raw data | |
    df = data.get_data() | |
    fields_orig = df.columns.tolist() | |
    assert fields_orig == ["Antibody_ID", "Antibody", "Y"] | |
|
||
    fn_data_original = "data_original.csv" | |
|
||
    # split the stringified [heavy, light] list into two separate columns | |
    antibody_list = df.Antibody.tolist() | |
    s2l = lambda list_string: list( | |
        map(str.strip, list_string.strip("][").replace("'", "").split(",")) | |
    ) | |
    df["heavy_chain"] = [s2l(x)[0] for x in antibody_list] | |
    df["light_chain"] = [s2l(x)[1] for x in antibody_list] | |
    df = df[["Antibody_ID", "heavy_chain", "light_chain", "Y"]] | |
    df.to_csv(fn_data_original, index=False) | |
|
||
    # load raw data and assert columns | |
    df = pd.read_csv(fn_data_original, sep=",") | |
    fields_orig = df.columns.tolist() | |
    assert fields_orig == ["Antibody_ID", "heavy_chain", "light_chain", "Y"] | |
    fields_clean = ["antibody_pdb_ID", "heavy_chain", "light_chain", "developability"] | |
    df.columns = fields_clean | |
    assert not df.duplicated().sum() | |
|
||
    # save to csv | |
    fn_data_csv = "data_clean.csv" | |
    df.to_csv(fn_data_csv, index=False) | |
|
||
    meta = { | |
        "name": f"{target_folder}",  # unique identifier, we will also use this for directory names | |
        "description": """Antibody data from Chen et al, where they process from the SAbDab. | |
        From an initial dataset of 3816 antibodies, they retained 2426 antibodies | |
        that satisfy the following criteria: 1. | |
        have both sequence (FASTA) and Protein Data Bank (PDB) structure files, | |
        2. contain both a heavy chain and a light chain, and 3. | |
        have crystal structures with resolution < 3 Å. | |
        The DI label is derived from BIOVIA's pipelines.""", | |
        "targets": [ | |
            { | |
                "id": "developability",  # name of the column in a tabular dataset | |
                "description": "functional antibody candidate to be developed into a manufacturable(1), or not(0)", | |
                "units": "",  # units of the values in this column (leave empty if unitless) | |
                "type": "categorical",  # can be "categorical", "ordinal", "continuous" | |
                "names": [  # names for the property (to sample from for building the prompts) | |
                    "antibody developability", | |
                    "monoclonal antibody", | |
                    "functional antibody candidate", | |
                    "manufacturable, stable, safe, and effective antibody drug", | |
                ], | |
                "uris": [ | |
                    "https://rb.gy/idkdqp", | |
                    "https://rb.gy/b8cx8i", | |
                ], | |
            }, | |
        ], | |
        "identifiers": [ | |
            { | |
                "id": "antibody_pdb_ID",  # column name | |
                "type": "Other",  # can be "SMILES", "SELFIES", "IUPAC", "Other" | |
                "description": "antibody pdb id",  # description (optional, except for "Other") | |
            }, | |
            { | |
                "id": "heavy_chain",  # column name | |
                "type": "Other",  # can be "SMILES", "SELFIES", "IUPAC", "Other" | |
                "description": "antibody heavy chain amino acid sequence in FASTA",  # description (optional, except for "Other") | |
            }, | |
            { | |
                "id": "light_chain",  # column name | |
                "type": "Other",  # can be "SMILES", "SELFIES", "IUPAC", "Other" | |
                "description": "antibody light chain amino acid sequence in FASTA",  # description (optional, except for "Other") | |
            }, | |
        ], | |
        "license": "CC BY 4.0",  # license under which the original dataset was published | |
        "links": [  # list of relevant links (original dataset, other uses, etc.) | |
            { | |
                "url": "https://doi.org/10.1101/2020.06.18.159798", | |
                "description": "corresponding publication", | |
            }, | |
            { | |
                "url": "https://doi.org/10.1093/nar/gkt1043", | |
                "description": "corresponding publication", | |
            }, | |
            { | |
                "url": "https://www.3ds.com/products-services/biovia/products/data-science/pipeline-pilot/", | |
                "description": "corresponding tools used", | |
            }, | |
            { | |
                "url": "https://tdcommons.ai/single_pred_tasks/develop/#sabdab-chen-et-al", | |
                "description": "data source", | |
            }, | |
        ], | |
        "num_points": len(df),  # number of datapoints in this dataset | |
        "bibtex": [ | |
            """@article{Chen2020, | |
            doi = {10.1101/2020.06.18.159798}, | |
            url = {https://doi.org/10.1101/2020.06.18.159798}, | |
            year = {2020}, | |
            month = jun, | |
            publisher = {Cold Spring Harbor Laboratory}, | |
            author = {Xingyao Chen and Thomas Dougherty and | |
            Chan Hong and Rachel Schibler and Yi Cong Zhao and | |
            Reza Sadeghi and Naim Matasci and Yi-Chieh Wu and Ian Kerman}, | |
            title = {Predicting Antibody Developability from Sequence | |
            using Machine Learning}}""", | |
            """@article{Dunbar2013, | |
            doi = {10.1093/nar/gkt1043}, | |
            url = {https://doi.org/10.1093/nar/gkt1043}, | |
            year = {2013}, | |
            month = nov, | |
            publisher = {Oxford University Press ({OUP})}, | |
            volume = {42}, | |
            number = {D1}, | |
            pages = {D1140--D1146}, | |
            author = {James Dunbar and Konrad Krawczyk and Jinwoo Leem | |
            and Terry Baker and Angelika Fuchs and Guy Georges and Jiye Shi and | |
            Charlotte M. Deane}, | |
            title = {{SAbDab}: the structural antibody database}, | |
            journal = {Nucleic Acids Research}}""", | |
        ], | |
    } | |
|
||
    def str_presenter(dumper, data): | |
        """configures yaml for dumping multiline strings | |
        Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data | |
        """ | |
        if data.count("\n") > 0:  # check for multiline string | |
            return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|") | |
        return dumper.represent_scalar("tag:yaml.org,2002:str", data) | |
|
||
    yaml.add_representer(str, str_presenter) | |
    yaml.representer.SafeRepresenter.add_representer( | |
        str, str_presenter | |
    )  # to use with safe_dump | |
    fn_meta = "meta.yaml" | |
    with open(fn_meta, "w") as f: | |
        yaml.dump(meta, f, sort_keys=False) | |
|
||
    print(f"Finished processing {meta['name']} dataset!") | |
|
||
|
||
if __name__ == "__main__": | |
    get_and_transform_data() |
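The `s2l` lambda in the script above splits the stringified chain list by stripping brackets and quotes by hand. A minimal alternative sketch, assuming each `Antibody` entry is a Python-style list literal of two strings (heavy chain first, light chain second, as the indexing in the script implies), is to parse it with `ast.literal_eval` from the standard library. `parse_chains` is a hypothetical helper name, not part of this PR, and the sequences are made-up, shortened placeholders.

```python
import ast


def parse_chains(antibody_field):
    """Parse a stringified ['heavy', 'light'] list into a (heavy, light) tuple."""
    chains = ast.literal_eval(antibody_field)
    if not (isinstance(chains, list) and len(chains) == 2):
        raise ValueError(f"expected a list of two sequences, got: {antibody_field!r}")
    heavy, light = (str(chain).strip() for chain in chains)
    return heavy, light


# Made-up, shortened sequences for illustration only.
example = "['QVQLVQSG', 'DIQMTQSP']"
heavy_chain, light_chain = parse_chains(example)
print(heavy_chain, light_chain)
```

Unlike the manual split, this raises an error when an entry does not contain exactly two chains instead of silently producing misaligned columns.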
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
--- | ||
name: TAP | ||
description: "Immunogenicity, instability, self-association, \n high viscosity, polyspecificity, or poor expression can all preclude\n an\ | ||
\ antibody from becoming a therapeutic. Early identification of these\n negative characteristics is essential. Akin to the Lipinski guidelines,\n\ | ||
\ which measure druglikeness in small molecules, \n Therapeutic Antibody Profiler (TAP) highlights antibodies \n that possess characteristics\ | ||
\ that are rare/unseen in \n clinical-stage mAb therapeutics." | ||
targets: | ||
- id: CDR_Length | ||
description: CDR Complementarity-determining regions length | ||
units: '' | ||
type: continuous | ||
names: | ||
- Antibody Complementarity-determining regions length | ||
- Therapeutic Antibody Profiler | ||
- antibody developability | ||
- monoclonal antibody | |
uris: | ||
- https://rb.gy/s9gv88 | ||
- https://rb.gy/km77hq | ||
- https://rb.gy/b8cx8i | ||
- id: PSH | ||
description: patches of surface hydrophobicity | ||
units: '' | ||
type: continuous | ||
names: | ||
- antibody patches of surface hydrophobicity | ||
- Therapeutic Antibody Profiler | ||
- antibody developability | ||
- monoclonal antibody | |
uris: | ||
- https://rb.gy/bchhaa | ||
- https://rb.gy/2irr4l | ||
- https://rb.gy/b8cx8i | ||
- id: PPC | ||
description: patches of positive charge | ||
units: '' | ||
type: continuous | ||
names: | ||
- patches of positive charge | ||
- Therapeutic Antibody Profiler | ||
- antibody developability | ||
- monoclonal antibody | |
uris: | ||
- https://rb.gy/b8cx8i | ||
- id: PNC | ||
description: patches of negative charge | ||
units: '' | ||
type: continuous | ||
names: | ||
- antibody patches of negative charge | |
- Therapeutic Antibody Profiler | |
- antibody developability | |
- monoclonal antibody | |
uris: | ||
- https://rb.gy/b8cx8i | ||
- id: SFvCSP | ||
description: structural Fv charge symmetry parameter | ||
units: '' | ||
type: continuous | ||
names: | ||
- antibody structural Fv charge symmetry parameter | ||
- Therapeutic Antibody Profiler | ||
- antibody developability | ||
- monoclonal antibody | |
uris: | ||
- https://rb.gy/uxyhc3 | ||
- https://rb.gy/b8cx8i | ||
identifiers: | ||
- id: antibody_name | ||
type: Other | ||
description: antibody name | |
- id: heavy_chain | |
type: Other | |
description: antibody heavy chain amino acid sequence | |
- id: light_chain | |
type: Other | |
description: antibody light chain amino acid sequence | |
license: CC BY 4.0 | ||
links: | ||
- url: https://doi.org/10.1073/pnas.1810576116 | ||
description: corresponding publication | ||
- url: https://tdcommons.ai/single_pred_tasks/develop/#tap | ||
description: data source | ||
num_points: 241 | ||
bibtex: | ||
- |- | ||
@article{Raybould2019, | ||
doi = {10.1073/pnas.1810576116}, | ||
url = {https://doi.org/10.1073/pnas.1810576116}, | ||
year = {2019}, | ||
month = feb, | ||
publisher = {Proceedings of the National Academy of Sciences}, | ||
volume = {116}, | ||
number = {10}, | ||
pages = {4025--4030}, | ||
author = {Matthew I. J. Raybould and Claire Marks and Konrad Krawczyk and Bruck Taddese and Jaroslaw Nowak and Alan P. Lewis and Alexander Bujotzek and Jiye Shi and Charlotte M. Deane}, | ||
title = {Five computational developability guidelines for therapeutic antibody profiling}, | ||
journal = {Proceedings of the National Academy of Sciences}} |
The linebreaks seem a bit awkward; do you have an idea where they come from?
I guess that was the Ångström Å!
I converted to nm.
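For reference, the conversion mentioned here is a factor of ten: 1 Å = 0.1 nm, so the resolution cutoff of 3 Å in the dataset description corresponds to 0.3 nm. A small sketch of the arithmetic follows; `angstrom_to_nm` is a hypothetical helper, not part of this PR.

```python
def angstrom_to_nm(value_in_angstrom):
    """Convert a length from angstrom to nanometres (1 angstrom = 0.1 nm)."""
    return value_in_angstrom / 10


resolution_cutoff_nm = angstrom_to_nm(3)
print(f"resolution < {resolution_cutoff_nm} nm")
```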