4. Creating input data folder

Akshay Paropkari edited this page Nov 13, 2021 · 8 revisions

The main script of RNAseq is the pipeline.sh file. This single bash script contains all the preprocessing steps: QC filtering with bbduk, a QC check with FastQC and, finally, alignment and gene counting with STAR. pipeline.sh takes a single input, which is a folder/directory containing -

  1. all raw FASTQ sequence files AND
  2. the sample metadata Excel file.

The raw FASTQ sequence files may be either compressed (gzipped) or uncompressed. Each file name must start with the sample ID, followed by an underscore and the rest of the file name. For example, projectname_date_L001.fastq.gz should be renamed to sampleid_projectname_date_L001.fastq.gz. The script uses the part of the file name before the first underscore to identify which sample it is processing.

The sample metadata file contains all metadata associated with the input samples, including sample ID, genotype, condition, treatment, time, etc. For this repository, the sample metadata file must contain at least two columns, SampleID and Condition. The table below is an example of a sample metadata file, where the first two columns, SampleID and Condition, are required, and the third column, FASTQ_file, and any columns after it are optional but highly recommended. A comprehensive metadata file also makes it convenient to submit samples to SRA once your manuscript is published.

SampleID   Condition   FASTQ_file                         Other_Sample_Info
Sample1A   WT          Sample1A_S8_L001_R1_001.fastq.gz   ...
Sample1B   Mutant      Sample1B_S8_L001_R1_001.fastq.gz   ...
Sample2A   WT          Sample2A_S8_L001_R1_001.fastq.gz   ...
Sample2B   Mutant      Sample2B_S8_L001_R1_001.fastq.gz   ...
Sample3A   WT          Sample3A_S8_L001_R1_001.fastq.gz   ...
Sample3B   Mutant      Sample3B_S8_L001_R1_001.fastq.gz   ...
...        ...         ...                                ...
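
The naming rule above (sample ID is everything before the first underscore) can be illustrated with a short bash sketch. The function name sample_id_of is hypothetical and is not taken from pipeline.sh; it only shows one way such a prefix could be extracted, which may differ from how the pipeline does it.

```shell
#!/usr/bin/env bash
# Sketch only: sample_id_of is a hypothetical helper, not part of pipeline.sh.
# It returns the part of a FASTQ file name before the first underscore.
sample_id_of() {
    local file_name
    file_name="$(basename "$1")"        # drop any leading directory
    printf '%s\n' "${file_name%%_*}"    # remove everything from the first "_" onward
}

sample_id_of "Sample1A_S8_L001_R1_001.fastq.gz"   # prints Sample1A
```

A file name without any underscore is returned unchanged, so such a file would not match a SampleID in the metadata table unless the whole name is the sample ID.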

NOTE: The input directory must contain the raw FASTQ files and a sample metadata Excel file. More broadly, users may set up their own project structure to organize their RNA-seq data. Please go through the Cookiecutter Data Science project and a published guide for ideas on how to organize computational data.
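
Before running the pipeline, it may help to confirm that the input folder holds both kinds of files. The sketch below is not part of the repository. The function name check_input_dir is hypothetical, and it assumes that FASTQ files end in .fastq, .fq, .fastq.gz or .fq.gz and that the metadata file ends in .xlsx or .xls.

```shell
#!/usr/bin/env bash
# Sketch only: check_input_dir is a hypothetical helper, not part of pipeline.sh.
# Assumed extensions: .fastq/.fq (optionally .gz) for reads, .xlsx/.xls for metadata.
check_input_dir() {
    local dir="$1" fastq_count excel_count
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # count FASTQ files directly inside the folder (no subfolders)
    fastq_count=$(find "$dir" -maxdepth 1 -type f \
        \( -name '*.fastq' -o -name '*.fq' -o -name '*.fastq.gz' -o -name '*.fq.gz' \) \
        | wc -l | tr -d ' ')
    # count Excel files directly inside the folder
    excel_count=$(find "$dir" -maxdepth 1 -type f \
        \( -name '*.xlsx' -o -name '*.xls' \) | wc -l | tr -d ' ')
    if [ "$fastq_count" -eq 0 ]; then
        echo "no FASTQ files found in $dir" >&2
        return 1
    fi
    if [ "$excel_count" -eq 0 ]; then
        echo "no Excel metadata file found in $dir" >&2
        return 1
    fi
    echo "found $fastq_count FASTQ file(s) and $excel_count Excel file(s) in $dir"
}

# Example call:
# check_input_dir /full/path/to/input_folder
```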


Transferring data between MERCED and a local machine via the command line

Below is the common usage of the secure copy command, scp, which is one of the commands used for transferring files to/from MERCED. The other command is the secure file transfer protocol command, sftp. Please refer to the MERCED wiki for detailed instructions on sftp.

scp syntax -

scp FROM TO

Users can copy an individual file to MERCED using the following command on their machine -

# copy a file named descriptive_file_name.txt from your local machine
scp /full/path/to/descriptive_file_name.txt <username>@merced.ucmerced.edu:/full/path/to/destination/

# enter your MERCED password when prompted
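
Because the syntax is scp FROM TO, the same command can copy a file in the other direction by swapping the two arguments. The sketch below wraps that in a function. The name fetch_from_merced and its three parameters are hypothetical and only illustrate the argument order.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_from_merced is a hypothetical wrapper around scp.
# It copies a file from MERCED to the local machine (FROM = remote, TO = local).
fetch_from_merced() {
    local user="$1" remote_path="$2" local_dest="$3"
    scp "${user}@merced.ucmerced.edu:${remote_path}" "${local_dest}"
}

# Example call, run on your local machine (enter your MERCED password when prompted):
# fetch_from_merced <username> /full/path/to/descriptive_file_name.txt /full/path/to/local/destination/
```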

To copy a folder of files from your local machine to MERCED -

# Ordered steps to transfer a directory with multiple FASTQ files to MERCED

# ---------------------------------------
# on your local machine run steps 1 and 2
# ---------------------------------------

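# 1. create a gzipped tarball of the FASTQ files
# (tar stores each file under the path given here, minus the leading "/",
# so in step 4 the files are extracted into full/path/to/ inside the destination directory)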
tar -cvzf descriptive_folder_name.tar.gz /full/path/to/FASTQ_file_1.fastq.gz /full/path/to/FASTQ_file_2.fastq.gz /full/path/to/FASTQ_file_3.fastq.gz ... /full/path/to/FASTQ_file_N.fastq.gz

# 2. run scp command on descriptive_folder_name.tar.gz
scp /full/path/to/descriptive_folder_name.tar.gz <username>@merced.ucmerced.edu:/full/path/to/destination/
# enter MERCED password when prompted

# ----------------------------
# on MERCED, run steps 3 and 4
# ----------------------------

# 3. log in to MERCED and create a folder called descriptive_directory_name
ssh <username>@merced.ucmerced.edu
# enter MERCED password when prompted
mkdir -p /full/path/to/descriptive_directory_name/

# 4. extract tarball contents to descriptive_directory_name
tar -xzvf /full/path/to/descriptive_folder_name.tar.gz -C /full/path/to/descriptive_directory_name/
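
As an alternative to listing every FASTQ file, a whole folder can be packed with tar's -C option, which stores the files by name only, so they are extracted directly into the destination directory. The sketch below shows that round trip. The function names pack_dir and unpack_into are hypothetical, and it assumes the tarball is written outside the folder being packed.

```shell
#!/usr/bin/env bash
# Sketch only: pack_dir and unpack_into are hypothetical helper names.

# Pack every file in src_dir; -C changes into src_dir first, so members are
# stored relative to it (./file) rather than under their full path.
# Write the tarball somewhere outside src_dir.
pack_dir() {
    local src_dir="$1" tarball="$2"
    tar -czf "$tarball" -C "$src_dir" .
}

# Extract a tarball into dest_dir, creating the directory if needed.
unpack_into() {
    local tarball="$1" dest_dir="$2"
    mkdir -p "$dest_dir"
    tar -xzf "$tarball" -C "$dest_dir"
}

# To list the contents of a tarball without extracting it:
# tar -tzf descriptive_folder_name.tar.gz
```

On the local machine, pack_dir would replace step 1. On MERCED, unpack_into would replace the mkdir and tar commands in steps 3 and 4. The scp command in step 2 stays the same.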

Third party GUI apps

Users can also use third-party clients to transfer files to/from MERCED. FileZilla (Linux and Windows) and Cyberduck (macOS and Windows) are drag-and-drop alternatives to using scp or sftp.