TCGA data download #16

Open
LITTLEKKKK opened this issue Sep 4, 2021 · 21 comments

Comments

@LITTLEKKKK

When I come to the website, it says: “All slide and diagnostic images from the TCGA program are currently unavailable for download”. Could you share the lung datasets by using a Google Cloud link? : )

@binli123
Owner

binli123 commented Sep 4, 2021

I believe the Google Drive link is posted in the readme. I have emphasized the link and updated the readme file. Could you check if the link in the section Processing raw WSI data->Download WSIs->From Google Drive works for you?

@GeorgeBatch
Contributor

Hi Bin,

Do you have any advice on how to download the Google Drive folder with the TCGA files from a terminal?
I tried using gdown, but it only allows downloading folders with at most 50 files.

Best wishes,
George
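
One possible workaround for the folder limit is to fetch the files one at a time by their Drive file IDs. The sketch below assumes the IDs have already been collected by some other means (`file_ids` and `out_dir` are hypothetical inputs) and that `gdown` is installed; it uses `gdown.download`, which fetches a single file from a `uc?id=` URL. Drive may still rate-limit many consecutive requests.

```python
import os
from pathlib import Path


def drive_file_url(file_id: str) -> str:
    """Build the single-file URL form that gdown accepts."""
    return f"https://drive.google.com/uc?id={file_id}"


def download_files(file_ids, out_dir):
    """Download each Drive file ID into out_dir, one request per file."""
    import gdown  # third-party: pip install gdown

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for file_id in file_ids:
        # An output path ending in the OS separator makes gdown keep the remote file name.
        gdown.download(drive_file_url(file_id), output=str(out) + os.sep, quiet=False)
```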

@binli123
Owner

I have never tried it with a terminal. But I think one of the appropriate ways to download a large number of files is to use the Google Drive desktop app and select the folder to sync to your local device.

@GeorgeBatch
Contributor

Thanks! How large is the Google Drive folder that you provided?

@binli123
Owner

I think it is around 800GB.

@GeorgeBatch
Contributor

Thanks, I'll try using the desktop app, but I do not have 800GB of free disk space on my machine.
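
Before starting a sync of this size, the free space on the target drive can be checked from Python. This is a minimal sketch using only the standard library; the 800 GB figure is the rough estimate given in this thread, and the function name is illustrative.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1000**3


if __name__ == "__main__":
    # Roughly 800 GB for the raw WSIs, per the estimate in this thread.
    print(has_free_space(".", 800))
```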

@binli123
Owner

You could also just use the cropped patches I uploaded; they are less than 100GB.

@GeorgeBatch
Contributor

Thank you!

@GeorgeBatch
Contributor

Hi Bin,

I am trying to understand which of the files from the Google Drive folder I actually need.

In the TCGA-lung-WSI folder, all the .svs files are enclosed in folders, e.g. the ffa686dc-0f3c-4fb8-af3b-ee82a940752a folder for the ffa686dc-0f3c-4fb8-af3b-ee82a940752a.svs WSI. Each of them also seems to have a corresponding logs folder. Can you please explain what is in there and why it is needed?

A similar thing is true of the TCGA-lung-WSI-corrupt folder, but here each of the WSI subfolders also has an annotations.txt file. Can you please also explain why the corrupted WSIs have annotations, while all the other WSIs don't?

Many thanks,
George

@binli123
Owner

Those are just download logs that are automatically generated when you download something from the NCI data portal. A small portion of the WSIs have coarse annotations that come with the slide, and those low-quality ones (also scanned at a lower magnification) just happen to have them. I guess those were uploaded by a specific facility that also annotated the slides.
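
Given that layout (one folder per slide, plus logs folders and the occasional annotations.txt), only the .svs files are needed for processing. The following is a minimal sketch of collecting them with the standard library; the function name is illustrative, and skipping logs folders is a defensive assumption rather than something the download tool requires.

```python
from pathlib import Path


def find_slides(root):
    """Return sorted .svs paths under root, skipping anything inside a logs folder."""
    slides = []
    for path in Path(root).rglob("*.svs"):
        # Ignore files that live under any directory named "logs".
        if "logs" in path.relative_to(root).parts:
            continue
        slides.append(path)
    return sorted(slides)
```

For example, `find_slides("TCGA-lung-WSI")` would return one path per slide folder, ignoring the logs and annotation files.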

@GeorgeBatch
Contributor

Makes sense, thank you!

@LITTLEKKKK
Author

I didn't find the cropped patches in the Google Drive folder. Where is the link? Thanks.

@binli123
Owner

binli123 commented Nov 1, 2021

https://drive.google.com/file/d/17zCn-WRNzxxxh8kkdBTbDLDZy0XZ3RIu/view?usp=sharing

@LITTLEKKKK
Author

Thanks a lot. The download of the cropped patches zip file often breaks off and is not stable. Did you upload the unzipped files of the cropped patches as well? : (
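
For an unstable download like this, it may help to resume the transfer and then verify the archive before unpacking it. The sketch below is one way to do that, assuming a `gdown` version whose `download` function supports `resume=True`; the output file name is hypothetical. The integrity check itself uses only the standard `zipfile` module.

```python
import zipfile


def zip_is_intact(zip_path) -> bool:
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(zip_path) as archive:
            # testzip() returns the first corrupt member name, or None if all are fine.
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        # A truncated or missing file fails to open as a zip archive.
        return False


def download_patches(file_id: str, output: str) -> bool:
    """Fetch one Drive file with gdown, resuming a partial download if one exists."""
    import gdown  # third-party: pip install gdown

    gdown.download(id=file_id, output=output, resume=True)
    return zip_is_intact(output)
```

With the file ID from the link posted above, a call could look like `download_patches("17zCn-WRNzxxxh8kkdBTbDLDZy0XZ3RIu", "patches.zip")`; a `False` result suggests the transfer should be retried.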

@GeorgeBatch
Contributor

Also, it looks like the commands should include the download subcommand:

  $ cd tcga-download
  $ gdc-client download -m gdc_manifest.2020-09-06-TCGA-LUAD.txt --config config-LUAD.dtt
  $ gdc-client download -m gdc_manifest.2020-09-06-TCGA-LUSC.txt --config config-LUSC.dtt

instead of

  $ cd tcga-download
  $ gdc-client -m gdc_manifest.2020-09-06-TCGA-LUAD.txt --config config-LUAD.dtt
  $ gdc-client -m gdc_manifest.2020-09-06-TCGA-LUSC.txt --config config-LUSC.dtt
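
If the downloads are scripted, the corrected invocation can be wrapped in Python. This is a minimal sketch that assumes `gdc-client` is on the PATH and that the manifest and config files sit in the `tcga-download` directory; the function names are illustrative, and only the flags shown in the commands above are used.

```python
import subprocess


def gdc_download_cmd(manifest: str, config: str) -> list:
    """Build the gdc-client call with the explicit `download` subcommand."""
    return ["gdc-client", "download", "-m", manifest, "--config", config]


def run_downloads(jobs, cwd="tcga-download"):
    """Run one gdc-client download per (manifest, config) pair, stopping on failure."""
    for manifest, config in jobs:
        subprocess.run(gdc_download_cmd(manifest, config), cwd=cwd, check=True)
```

For example, `run_downloads([("gdc_manifest.2020-09-06-TCGA-LUAD.txt", "config-LUAD.dtt"), ("gdc_manifest.2020-09-06-TCGA-LUSC.txt", "config-LUSC.dtt")])` would run both downloads in turn.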

@binli123
Owner

binli123 commented Nov 3, 2021

They also updated the download client; I might just remove this part from the readme.

@binli123
Owner

binli123 commented Nov 3, 2021

> Thanks a lot. The download of the cropped patches zip file often breaks off and is not stable. Did you upload the unzipped files of the cropped patches as well? : (

Which operating system do you use?

@LITTLEKKKK
Author

Windows. I use IDM to download the file.

@GeorgeBatch
Contributor

TCGA slides are back online, but I needed to generate the manifest files from scratch. I originally wanted to use yours, but some of the file names were not found; maybe they were changed.
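
To see which entries of an older manifest are missing from a freshly generated one, the two files can be compared by file name. The sketch below assumes the manifests are tab-separated text with a header row that includes a `filename` column; the function names are illustrative.

```python
import csv
import io


def manifest_filenames(text: str) -> set:
    """Collect the `filename` column from tab-separated manifest text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["filename"] for row in reader}


def missing_from_new(old_text: str, new_text: str) -> list:
    """File names listed in the old manifest but absent from the new one."""
    return sorted(manifest_filenames(old_text) - manifest_filenames(new_text))
```

The manifest contents would be read with `open(path).read()` and passed in as strings; an empty result means every old file name is still present.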

Are these TCGA-LUAD (541 slides) and TCGA-LUSC (512 slides) the links you used to get the manifest files?

I ended up there by clicking on "diagnostic slides" from the main links:

@LITTLEKKKK
Author

Is there a Google Drive link for Camelyon 16 cropped patches? Thanks.

> I didn't find the cropped patches in the Google Drive folder. Where is the link? Thanks.

> https://drive.google.com/file/d/17zCn-WRNzxxxh8kkdBTbDLDZy0XZ3RIu/view?usp=sharing

@Raymvp

Raymvp commented Feb 12, 2024

> Thanks! How large is the Google Drive folder that you provided?

> I think it is around 800GB.

> Thanks, I'll try using the desktop app, but I do not have 800GB of free disk space on my machine.

> You could also just use the cropped patches I uploaded; they are less than 100GB.

What is the magnification of these patches: 20x or 5x?
The pictures look blurry.
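
The thread does not state the magnification of the uploaded patches. For slides downloaded as .svs files, the scanner's objective power can be read from the slide metadata, and the effective magnification of a downsampled level follows from it. This is a minimal sketch assuming the `openslide-python` package is installed and the slide records its objective power; the function names are illustrative.

```python
def effective_magnification(objective_power: float, downsample: float) -> float:
    """Magnification of a pyramid level that is `downsample` times smaller than the base."""
    return objective_power / downsample


def slide_objective_power(slide_path):
    """Read the recorded objective power from a slide, or None if it is absent."""
    import openslide  # third-party: pip install openslide-python

    with openslide.OpenSlide(slide_path) as slide:
        value = slide.properties.get(openslide.PROPERTY_NAME_OBJECTIVE_POWER)
    return float(value) if value is not None else None
```

For instance, a slide scanned at 40x and read at a downsample factor of 2 corresponds to 20x, and a 20x slide at a factor of 4 corresponds to 5x.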

4 participants