sayadennis edited this page Aug 24, 2023 · 4 revisions

Welcome to the BBCAR project wiki! Here, I document the most up-to-date workflow of the project.

Background

Breast cancer remains a formidable global health challenge affecting women. Timely identification and prevention are pivotal in reducing mortality rates associated with the disease. Benign breast disease (BBD) diagnoses are common among women, with around one-third of BBD cases eventually progressing to breast cancer. However, BBD alone is typically not a strong enough risk factor for patients to take up preventive therapy.

This study aims to refine breast cancer risk stratification in women diagnosed with BBD. Predictive models were built from whole-exome sequencing of BBD biopsy tissues, complemented by a subset of germline sequencing data, and were trained and evaluated with machine learning techniques for their capacity to predict breast cancer risk.

Project goals

  • Characterization of genomic aberrations in BBD that distinguish individuals by risk for breast cancer.
  • Development of a risk-stratification ML model using genomic aberrations as input features.

Data description

Sequencing reads files

  • Originally taken from:
    • BBD tissue: /projects/b1122/Zexian/Alignment/BBCAR/RAW_data/
    • Germline: /projects/b1122/Zexian/Alignment/Germline_37/RAW_data/
  • Copied to:
    • BBD tissue: /projects/b1122/saya/raw/bbb_tissue/
    • Germline: /projects/b1122/saya/raw/germline/
  • Currently using:
    • BBD tissue: /projects/b1131/saya/bbcar/data/00_raw/tissue/
    • Germline: /projects/b1131/saya/bbcar/data/00_raw/germline/

Clinical data files

  • Clinical data location: /projects/b1131/saya/bbcar/data/clinical/
  • Originally taken from:
    • Gannon shared a local file with me: /Users/sayadennis/Projects/bbcar_project/GATK_Analysis_Sample_Status.xlsx
    • Files with names starting with BBCaRDatabaseNU09B2-* were downloaded from the BBCAR REDCap database. NOTE: the outcome labels in REDCap are apparently not always correct! Gannon and Natalie double-checked the outcome for each patient and recorded the corrected labels in bbcar_label_studyid_from_gatk_filenames.csv.
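
Because the curated labels supersede the REDCap outcomes, any code that combines the two should let the curated file win and report where they disagree. The following is a minimal sketch of that rule, not the project's actual code. The function name `resolve_labels`, the `{study_id: label}` dictionary layout, and the toy IDs and labels in the usage example are all hypothetical; the real column names and label encoding in the CSV are not documented here.

```python
def resolve_labels(redcap_labels, curated_labels):
    """Merge two {study_id: label} mappings, letting curated labels win.

    Both arguments are hypothetical in-memory representations of the
    REDCap export and the curated label file, respectively.
    Returns (resolved, disagreements), where disagreements is the sorted
    list of study IDs whose REDCap label differs from the curated one.
    """
    resolved = dict(redcap_labels)
    resolved.update(curated_labels)  # curated values overwrite REDCap values
    shared_ids = redcap_labels.keys() & curated_labels.keys()
    disagreements = sorted(
        sid for sid in shared_ids if redcap_labels[sid] != curated_labels[sid]
    )
    return resolved, disagreements


# Toy example with made-up IDs and labels (not real study data).
redcap = {"A1": "control", "A2": "case", "A3": "control"}
curated = {"A1": "control", "A2": "control"}
labels, changed = resolve_labels(redcap, curated)
```

Here `changed` lists the IDs whose REDCap label was overridden, which is worth reviewing before any downstream analysis.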

Sequencing metadata files

  • A subset of samples were sequenced at University of Chicago, and the rest were sequenced at Indiana.
    • Which samples were sequenced where?
      • U Chicago sample IDs can be found at /projects/b1131/saya/bbcar/data/sample_ids_uchicago.txt; all other samples were sequenced at Indiana.
    • What is the difference?
      • U Chicago samples: use the exome intervals in /projects/b1122/gannon/bbcar/RAW_data/int_lst/SureSelect_v5/
      • Indiana samples: use the exome intervals in /projects/b1122/gannon/bbcar/RAW_data/int_lst/SureSelect_v6/
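
The mapping from sequencing site to exome interval directory can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, the ID file is assumed to hold one sample ID per line, and, going by its file name, it is assumed to list the U Chicago samples, so that any ID not in it is treated as an Indiana sample.

```python
from pathlib import Path

# Paths as listed on this page.
UCHICAGO_IDS_FILE = Path("/projects/b1131/saya/bbcar/data/sample_ids_uchicago.txt")
INTERVAL_DIRS = {
    "uchicago": Path("/projects/b1122/gannon/bbcar/RAW_data/int_lst/SureSelect_v5/"),
    "indiana": Path("/projects/b1122/gannon/bbcar/RAW_data/int_lst/SureSelect_v6/"),
}


def parse_sample_ids(lines):
    """Return the set of non-empty, stripped IDs (assumes one ID per line)."""
    return {line.strip() for line in lines if line.strip()}


def load_uchicago_ids(path=UCHICAGO_IDS_FILE):
    """Read the U Chicago sample ID file from disk."""
    with open(path) as handle:
        return parse_sample_ids(handle)


def sequencing_site(sample_id, uchicago_ids):
    """Classify a sample: listed IDs are U Chicago, everything else Indiana."""
    return "uchicago" if str(sample_id) in uchicago_ids else "indiana"


def interval_dir_for(sample_id, uchicago_ids):
    """Return the exome interval directory to use for this sample."""
    return INTERVAL_DIRS[sequencing_site(sample_id, uchicago_ids)]
```

On the cluster, `interval_dir_for(sample_id, load_uchicago_ids())` would return the SureSelect directory for a given sample. Keeping `parse_sample_ids` separate from the file read makes the logic easy to check without access to the project storage.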

Workflow

  1. Process data
  2. Create data summary
  3. Statistical characterization of features
  4. Predict breast cancer risk