The design of AI systems for health is a grand achievement of the science and technology of our times. Nevertheless, such systems learn to perform specific tasks by processing extensive amounts of data produced and stored in large biomedical repositories. The quality and content of these data have an immense impact on what and how AI learns. If the data contain biases, such as skewed representation of certain categories or missing information, the application of AI can lead to discriminatory outcomes and propagate them into society, as we recently pointed out (Cirillo et al., NPJ Digit. Med., 2020, doi:10.1038/s41746-020-0288-5). The aim of our project is to determine the extent of biases in the available demographic categories (sex, age, race) of ELIXIR biomedical data repositories, which are widely used in the community to train AI systems. We aim to quantify bias and provide recommendations on how to use the data properly to develop fair and trustworthy AI, including solutions and best practices. We have recently collected endorsements and support for this project from representatives of several ELIXIR platforms, communities and focus groups, namely the Data Platform, Human Data Communities, the Diversity, Equity & Inclusion group, the Impact group, the Industry group and Communication.
Keywords: Cancer, Data Platform, Federated Human Data, Human Copy Number Variation, Machine Learning, Rare Disease
Project Number: 35
EasyChair Number: 61
Davide Cirillo ([email protected]), Nataly Buslón ([email protected])
Task 1. Quantification of bias in selected resources (see the sketch below)
Task 2. Evaluation of social and ethical impact
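As a first illustration of Task 1, the following minimal sketch compares observed demographic counts against a reference distribution with a chi-square goodness-of-fit test. The category names, counts and 50/50 reference are hypothetical placeholders, not figures from any ELIXIR resource.

```python
# Hypothetical example: test whether the sex distribution in a study
# deviates from an assumed 50/50 reference. All numbers are placeholders.
from scipy.stats import chisquare

observed = {"female": 3200, "male": 6800}   # hypothetical study counts
reference = {"female": 0.5, "male": 0.5}    # assumed reference proportions

total = sum(observed.values())
f_obs = [observed[k] for k in observed]
f_exp = [reference[k] * total for k in observed]

stat, pvalue = chisquare(f_obs=f_obs, f_exp=f_exp)
print(f"chi-square = {stat:.1f}, p = {pvalue:.3g}")
# A small p-value flags a statistically significant skew in representation.
```

The same test extends directly to age bins or other categorical variables; the choice of reference distribution (population census, disease prevalence, etc.) is itself a methodological decision the project will need to document.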
- ELIXIR data resource representatives, especially designers, developers and data miners
- Computer scientists with database skills, including development and data management
- Researchers in computational biology with a strong programming background
- Researchers in social sciences with interests in biomedicine and technology
- Data scientists with strong analytical and statistical knowledge
- Bioinformaticians with knowledge of biological data resources
- Biostatisticians with interests in bias and data mining
- Researchers and practitioners in academic or industrial fields devoted to social equity
- Nataly Buslón, subgroup spokesperson
- Gemma Holliday
- Atia Cortés
- FTP access to the dataset: http://ftp.ncbi.nlm.nih.gov/dbgap/studies (see the listing sketch after this list)
- Study Submission Guide: https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/ and https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf
- Non-NIH funded "expectations": https://osp.od.nih.gov/wp-content/uploads/Expectations_for_Non-NIH_Funded_Submission_Requests.pdf
- Basic Requirements: https://osp.od.nih.gov/wp-content/uploads/Non-NIH-Funded_Basic_Study_Information.pdf
- Template files: https://ftp.ncbi.nlm.nih.gov/dbgap/dbGaP_Submission_Guide_Templates/Individual_Submission_Templates/
- Data Access: https://www.ncbi.nlm.nih.gov/books/NBK5294/ and https://osp.od.nih.gov/wp-content/uploads/NIH_Best_Practices_for_Controlled-Access_Data_Subject_to_the_NIH_GDS_Policy.pdf
- Quality Control Errors: https://www.ncbi.nlm.nih.gov/gap/public_utils/messages/ and for the QC process: https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/#aqcchecks
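For instance, the public study directories under the FTP path listed above can be enumerated with the Python standard library alone. A minimal sketch (not official dbGaP tooling) follows:

```python
# List dbGaP study directories over anonymous FTP (read-only, public listing).
from ftplib import FTP

with FTP("ftp.ncbi.nlm.nih.gov") as ftp:
    ftp.login()                          # anonymous login
    studies = ftp.nlst("/dbgap/studies")
    print(f"{len(studies)} entries under /dbgap/studies")
    for path in studies[:5]:             # show a small sample
        print(path)
```

Note that this only exposes the public directory structure; individual-level data remain controlled-access and must be requested through the Data Access procedure linked above.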
- Davide Cirillo, subgroup spokesperson
- María Morales
- Alejandro Muñoz
- Camila Pontes
- Olivier Philippe
- API Metadata documentation: https://ega-archive.org/metadata/how-to-use-the-api (a query sketch follows this list)
- Policy documentation: https://ega-archive.org/submission/dac/documentation
- Submitter Portal: https://ega-archive.org/submission/tools/submitter-portal
- Quality Control Reports: https://ega-archive.org/about/quality-control-reports
- Implementation of the EU General Data Protection Regulation (GDPR): https://ega-archive.org/privacy-notice
- Data Access: https://ega-archive.org/access/data-access
- Download Client V3: https://ega-archive.org/download/downloader-quickguide-APIv3
- Metadata REST endpoints: https://ega-archive.org/metadata/how-to-use-the-api
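To give a concrete picture of the API route, the sketch below queries dataset metadata over HTTP with the third-party requests library. The /metadata/v2/datasets path, the response layout and the egaStableId field are assumptions based on the legacy endpoint names and should be verified against the documentation linked above.

```python
# Hedged sketch: fetch a few dataset records from the EGA metadata REST API.
# Endpoint path and response structure are assumptions; check the docs above.
import requests

BASE = "https://ega-archive.org/metadata/v2"  # assumed base URL

resp = requests.get(f"{BASE}/datasets", params={"limit": 5}, timeout=30)
resp.raise_for_status()
payload = resp.json()

# Parse defensively in case the payload layout differs from this assumption.
results = payload.get("response", {}).get("result", [])
print(f"retrieved {len(results)} dataset records")
for dataset in results:
    print(dataset.get("egaStableId", dataset))  # accession field name assumed
```

As with dbGaP, only public metadata are reachable this way; access to the underlying human data is governed by the Data Access Committees documented above.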
- Aina Jené, subgroup spokesperson
- Babita Singh
- Mauricio Moldes
- Victoria Ruiz
- Diego Saby