Downloading and Creating labels for the TCGA lung Whole Slide Images

GeorgeBatch/TCGA-lung-histology-download

TCGA Lung Dataset

This repository contains instructions for downloading the diagnostic slides for the lung portion of the TCGA dataset. The download requires ~800GB of disk space.

TCGA lung also contains tissue slides that are not diagnostic. The experimental strategy can be Tissue Slide (non-diagnostic) and/or Diagnostic Slide.

Important note

  • Patient ID is the first 12 characters of the slide name, e.g. TCGA-50-5066
  • Case ID is the first 15 characters of the slide name, e.g. TCGA-50-5066-01 or TCGA-50-5066-02
  • Full slide name, e.g. TCGA-50-5066-01Z-00-DX1.e161df31-84a4-40a4-a6a2-748b60820f77, consists of the short slide ID TCGA-50-5066-01Z-00-DX1 and a uid e161df31-84a4-40a4-a6a2-748b60820f77, separated by a dot. All short slide IDs in the downloaded dataset are unique, and so are the uids, so there is a one-to-one mapping between the short slide IDs and the uids.

Source: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/#creating-barcodes
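The naming rules above can be sketched in a few lines of Python. This is a minimal illustration, not code from this repository: the names `SlideName` and `parse_slide_name` are hypothetical, and the function assumes the `.svs` extension has already been removed.

```python
from typing import NamedTuple


class SlideName(NamedTuple):
    patient_id: str      # first 12 characters, e.g. TCGA-50-5066
    case_id: str         # first 15 characters, e.g. TCGA-50-5066-01
    slide_id_short: str  # everything before the first dot
    uid: str             # everything after the first dot


def parse_slide_name(slide_name: str) -> SlideName:
    """Split a full slide name (without the .svs extension) into its parts."""
    slide_id_short, _, uid = slide_name.partition(".")
    return SlideName(slide_name[:12], slide_name[:15], slide_id_short, uid)


parsed = parse_slide_name(
    "TCGA-50-5066-01Z-00-DX1.e161df31-84a4-40a4-a6a2-748b60820f77"
)
print(parsed.patient_id)      # TCGA-50-5066
print(parsed.case_id)         # TCGA-50-5066-01
print(parsed.slide_id_short)  # TCGA-50-5066-01Z-00-DX1
print(parsed.uid)             # e161df31-84a4-40a4-a6a2-748b60820f77
```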

This explains why the web page lists 478 cases for LUAD and 478 cases for LUSC, while the manifest files contain 479 cases for LUAD and 478 cases for LUSC: the web page counts patients, while the manifest files count cases.

Patient TCGA-50-5066 has 2 cases for LUAD. The case IDs are:

  • TCGA-50-5066-01 with diagnostic slide: TCGA-50-5066-01Z-00-DX1
  • TCGA-50-5066-02 with diagnostic slide: TCGA-50-5066-02Z-00-DX1

Every other patient has exactly 1 case.
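To find such patients in a list of slide names, one option is to group case IDs by patient ID. The sketch below is only an illustration: `patients_with_multiple_cases` is a hypothetical helper, and the input list holds just three slide IDs that appear in this README rather than the full dataset.

```python
from collections import defaultdict


def patients_with_multiple_cases(slide_names):
    """Map each patient ID to its case IDs, keeping only patients with more than one case."""
    cases_by_patient = defaultdict(set)
    for name in slide_names:
        # Patient ID = first 12 characters, case ID = first 15 characters.
        cases_by_patient[name[:12]].add(name[:15])
    return {
        patient: sorted(cases)
        for patient, cases in cases_by_patient.items()
        if len(cases) > 1
    }


slides = [
    "TCGA-50-5066-01Z-00-DX1",
    "TCGA-50-5066-02Z-00-DX1",
    "TCGA-05-4384-01Z-00-DX1",
]
multi = patients_with_multiple_cases(slides)
print(multi)  # {'TCGA-50-5066': ['TCGA-50-5066-01', 'TCGA-50-5066-02']}
```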

Instructions for Downloading the Dataset

  1. Make sure you have enough disk space: ~800GB.
  2. Download the gdc-client Data Transfer Tool binary from https://gdc.cancer.gov/access-data/gdc-data-transfer-tool and add it to your PATH, e.g. by placing it in /usr/local/bin if you have this directory (it is usually already on the PATH). I did not have any problems with it on Linux; on Apple devices, however, you can run into "MacOS cannot verify app is free from malware", which can be solved as described here: https://gadgetstouse.com/blog/2021/04/08/fix-macos-cannot-verify-app-is-free-from-malware/
  3. Clone this repository and cd into it.

If you want to make sure that the manifests have not changed, download new ones from the TCGA data portal and compare them with the ones in this repository. The example below compares the versions from 2023-10-03 and 2021-11-03; the manifests have not changed.

cd tcga-download

# Sort each manifest by filename (column 2) so that row order does not affect the diff.
for file in gdc_manifest.2023-10-03-TCGA-LUSC.txt gdc_manifest.2023-10-03-TCGA-LUAD.txt gdc_manifest.2021-11-03-TCGA-LUSC.txt gdc_manifest.2021-11-03-TCGA-LUAD.txt; do
    sorted_file="${file%.txt}-sorted.txt"
    # Write the header first, then append the sorted data rows.
    echo -e "id\tfilename\tmd5\tsize\tstate" > "$sorted_file"
    tail -n +2 "$file" | sort -k2,2 >> "$sorted_file"
done

# No output from diff means the two versions are identical.
diff gdc_manifest.2023-10-03-TCGA-LUAD-sorted.txt gdc_manifest.2021-11-03-TCGA-LUAD-sorted.txt
diff gdc_manifest.2023-10-03-TCGA-LUSC-sorted.txt gdc_manifest.2021-11-03-TCGA-LUSC-sorted.txt

cd ..
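The same order-independent comparison can also be sketched in Python, which avoids creating the sorted copies. This is an illustrative alternative, not part of the repository: `manifest_rows` and `manifest_changes` are hypothetical helpers, the manifests are assumed to be tab-separated with a single header line, and the rows in the example are placeholders rather than real manifest entries.

```python
import csv
import io


def manifest_rows(text):
    """Return the data rows of a tab-separated manifest as a set of tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader, None)  # skip the header line
    return {tuple(row) for row in reader if row}


def manifest_changes(old_text, new_text):
    """Return (rows only in old, rows only in new); both empty means no change."""
    old_rows, new_rows = manifest_rows(old_text), manifest_rows(new_text)
    return sorted(old_rows - new_rows), sorted(new_rows - old_rows)


# Placeholder manifests: same rows, different order.
header = "id\tfilename\tmd5\tsize\tstate\n"
old = header + "id-a\ta.svs\tmd5a\t1\treleased\nid-b\tb.svs\tmd5b\t2\treleased\n"
new = header + "id-b\tb.svs\tmd5b\t2\treleased\nid-a\ta.svs\tmd5a\t1\treleased\n"

removed, added = manifest_changes(old, new)
print(removed, added)  # [] []
```

For real files, the two texts could be read with `pathlib.Path(...).read_text()` before being passed to `manifest_changes`.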
  4. Create ./WSI/LUSC/ and ./WSI/LUAD/ folders.
  5. If you choose to change the folder structure, make changes to
    1. ./tcga-download/config-LUSC.dtt
    2. ./tcga-download/config-LUAD.dtt
  6. Run:
bash ./0-download-LUSC-and-LUAD.sh

to download the files. It will take a while. Restarting the download is not advisable: I am not sure, but I think the manifest file would need to be modified so that already downloaded files are excluded. See: https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/#resuming-a-failed-download
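If the manifest does need to be trimmed before a restart, the filtering could look roughly like the sketch below. It is untested against a real interrupted download and rests on several assumptions: `remaining_manifest` is a hypothetical helper, the manifest is tab-separated with the filename in the second column, and the set of downloaded filenames contains only files that finished downloading (a partially downloaded file must not be listed). The rows are placeholders.

```python
def remaining_manifest(manifest_text, downloaded_filenames):
    """Return a manifest that keeps the header and only the rows not yet downloaded."""
    lines = manifest_text.splitlines()
    header, rows = lines[0], lines[1:]
    kept = [
        row
        for row in rows
        if row.strip() and row.split("\t")[1] not in downloaded_filenames
    ]
    return "\n".join([header, *kept]) + "\n"


# Placeholder manifest with three files, two of which are already on disk.
manifest = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "id-a\ta.svs\tmd5a\t1\treleased\n"
    "id-b\tb.svs\tmd5b\t2\treleased\n"
    "id-c\tc.svs\tmd5c\t3\treleased\n"
)
trimmed = remaining_manifest(manifest, {"a.svs", "b.svs"})
print(trimmed)
```

Running the md5 check from the next step on the files already on disk is a reasonable way to decide which filenames are safe to treat as downloaded.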

Tip: I used tmux so that the process continued on a remote server after I closed the connection. See this how-to on StackExchange.

  7. Check that the downloaded slides were not corrupted during the download:
md5sum ./WSI/*/*.svs > downloaded_md5sum_hashes.txt

The hashes should match the ones in the ./tcga-download/ manifest files for LUAD and LUSC.

The code that parses the manifest files and downloaded_md5sum_hashes.txt and checks that the hashes match is in 2-check-names.ipynb.

If you already have a file with md5 checksums, you can use a trick shown here: https://askubuntu.com/questions/318530/generate-md5-checksum-for-all-files-in-a-directory
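For orientation, the comparison can be sketched as follows. This is not the code from 2-check-names.ipynb, only a minimal illustration with hypothetical helper names. It assumes the usual md5sum output format (hash, whitespace, path) and a tab-separated manifest with columns id, filename, md5, and it matches files by basename. All hashes and filenames in the example are placeholders.

```python
from pathlib import PurePosixPath


def parse_md5sum_output(text):
    """Map file basename -> md5 from md5sum output lines ('<hash>  <path>')."""
    hashes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, path = line.split(maxsplit=1)
        # md5sum prefixes the path with '*' in binary mode; drop it if present.
        hashes[PurePosixPath(path.lstrip("*")).name] = digest
    return hashes


def parse_manifest(text):
    """Map filename -> md5 from a tab-separated manifest with a header line."""
    hashes = {}
    for line in text.splitlines()[1:]:
        if line.strip():
            _file_id, filename, md5 = line.split("\t")[:3]
            hashes[filename] = md5
    return hashes


def compare_hashes(expected, actual):
    """Return (files missing on disk, files whose md5 differs from the manifest)."""
    missing = sorted(set(expected) - set(actual))
    mismatched = sorted(
        name for name in expected if name in actual and expected[name] != actual[name]
    )
    return missing, mismatched


# Placeholder inputs: b.svs has a wrong hash and c.svs was never downloaded.
md5sum_text = "aaa  ./WSI/LUAD/a.svs\nbbb  ./WSI/LUSC/b.svs\n"
manifest_text = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "id-a\ta.svs\taaa\t1\treleased\n"
    "id-b\tb.svs\tXXX\t2\treleased\n"
    "id-c\tc.svs\tccc\t3\treleased\n"
)
missing, mismatched = compare_hashes(
    parse_manifest(manifest_text), parse_md5sum_output(md5sum_text)
)
print(missing)     # ['c.svs']
print(mismatched)  # ['b.svs']
```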

Corrupted Slides Excluded in DSMIL-WSI work

Some files appear to be corrupted and were excluded from the dataset in the DSMIL-WSI work; see the issue that gives a Google Drive link to the TCGA-lung dataset. When using the code from the dsmil-wsi repo to download pre-trained features for TCGA-lung, the excluded set is different. The names of the folders within the Google Drive folder have changed; however, the slide names still contain the patient ID (first 12 characters) and case ID (first 15 characters). See ./classes_extended_info.csv. Use the code in ./2-check-names.ipynb to investigate and choose which slides you want to exclude.

My investigation results:

  1. All of the slides have a significantly darker background around the tissue.

  2. In the Google Drive version, 11 LUAD patients (1 case per patient, 1 slide per case) were excluded:

patient_id case_id slide_id_short
TCGA-05-4384 TCGA-05-4384-01 TCGA-05-4384-01Z-00-DX1
TCGA-05-4390 TCGA-05-4390-01 TCGA-05-4390-01Z-00-DX1
TCGA-05-4410 TCGA-05-4410-01 TCGA-05-4410-01Z-00-DX1
TCGA-05-4425 TCGA-05-4425-01 TCGA-05-4425-01Z-00-DX1
TCGA-05-5420 TCGA-05-5420-01 TCGA-05-5420-01Z-00-DX1
TCGA-05-5423 TCGA-05-5423-01 TCGA-05-5423-01Z-00-DX1
TCGA-05-5425 TCGA-05-5425-01 TCGA-05-5425-01Z-00-DX1
TCGA-05-5428 TCGA-05-5428-01 TCGA-05-5428-01Z-00-DX1
TCGA-05-5429 TCGA-05-5429-01 TCGA-05-5429-01Z-00-DX1
TCGA-05-5715 TCGA-05-5715-01 TCGA-05-5715-01Z-00-DX1
TCGA-44-7661 TCGA-44-7661-01 TCGA-44-7661-01Z-00-DX1
  3. In the GitHub version, 7 of the 11 patients excluded in the Google Drive version (7 cases, 7 slides) were excluded, as listed below. The remaining 4 patients (4 cases, 4 slides) were not excluded.
patient_id case_id slide_id_short
TCGA-05-4384 TCGA-05-4384-01 TCGA-05-4384-01Z-00-DX1
TCGA-05-4410 TCGA-05-4410-01 TCGA-05-4410-01Z-00-DX1
TCGA-05-4425 TCGA-05-4425-01 TCGA-05-4425-01Z-00-DX1
TCGA-05-5420 TCGA-05-5420-01 TCGA-05-5420-01Z-00-DX1
TCGA-05-5423 TCGA-05-5423-01 TCGA-05-5423-01Z-00-DX1
TCGA-05-5425 TCGA-05-5425-01 TCGA-05-5425-01Z-00-DX1
TCGA-05-5715 TCGA-05-5715-01 TCGA-05-5715-01Z-00-DX1
  4. The test set from Google Drive contains 1 slide that is also in the excluded set from Google Drive: TCGA-05-4390-01Z-00-DX1. However, all slides from the Google Drive test set are included in the slides on GitHub.

Decision: I will use the GitHub version of the dataset (which excludes 7 patients with 7 cases and 7 slides). I will use the test set from Google Drive as the test set, so that I can make a direct comparison with the DSMIL-WSI results; this test set is fully included in the slides on GitHub.
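The relationship between the two exclusion lists can be reproduced with plain set operations. The sketch below copies the short slide IDs from the two tables above; the variable names are illustrative only.

```python
# Slides excluded in the Google Drive version (11 slides).
excluded_google_drive = {
    "TCGA-05-4384-01Z-00-DX1", "TCGA-05-4390-01Z-00-DX1",
    "TCGA-05-4410-01Z-00-DX1", "TCGA-05-4425-01Z-00-DX1",
    "TCGA-05-5420-01Z-00-DX1", "TCGA-05-5423-01Z-00-DX1",
    "TCGA-05-5425-01Z-00-DX1", "TCGA-05-5428-01Z-00-DX1",
    "TCGA-05-5429-01Z-00-DX1", "TCGA-05-5715-01Z-00-DX1",
    "TCGA-44-7661-01Z-00-DX1",
}

# Slides excluded in the GitHub version (7 slides).
excluded_github = {
    "TCGA-05-4384-01Z-00-DX1", "TCGA-05-4410-01Z-00-DX1",
    "TCGA-05-4425-01Z-00-DX1", "TCGA-05-5420-01Z-00-DX1",
    "TCGA-05-5423-01Z-00-DX1", "TCGA-05-5425-01Z-00-DX1",
    "TCGA-05-5715-01Z-00-DX1",
}

# The GitHub exclusions are a subset of the Google Drive exclusions.
print(excluded_github <= excluded_google_drive)  # True

# Slides excluded on Google Drive but kept on GitHub.
kept_on_github = sorted(excluded_google_drive - excluded_github)
print(kept_on_github)
# ['TCGA-05-4390-01Z-00-DX1', 'TCGA-05-5428-01Z-00-DX1',
#  'TCGA-05-5429-01Z-00-DX1', 'TCGA-44-7661-01Z-00-DX1']
```

The test-set slide TCGA-05-4390-01Z-00-DX1 is among the 4 slides kept on GitHub, which is consistent with the Google Drive test set being fully available there.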
