A collection of random bits and pieces of Ruby scripts, written here and there to parse data files

agbiotec/bioparsing


(1). GET THE DATA

- Download the latest releases of UniRef100, the UniProt ID mappings, and the UniProt Swiss-Prot and TrEMBL flat files:

wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz

gunzip *

- Concatenate Swiss-Prot and TrEMBL into the single flat file read by the first Hadoop job in step (3), then remove the large originals, which are not needed anymore:
cat uniprot_sprot.dat uniprot_trembl.dat > uniprot_sprot_trembl.dat
rm uniprot_sprot.dat uniprot_trembl.dat




(2). PRE-FORMAT THE DATA FOR UPLOAD TO HADOOP

- Join each header line with its sequence, to avoid splitting a multi-line sequence across the Hadoop file chunks
  (separating the header and sequence again is taken care of by the Hadoop code). A Ruby equivalent of the awk
  one-liner is sketched after the commands below.

  awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0"\t":$0 }' uniref100.fasta > uniref100.fasta.joined
  (6 min, 52 sec)
  rm uniref100.fasta
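
For reference, a minimal Ruby equivalent of the awk one-liner above (each output line holds one record: the header,
a tab, then the full sequence with its line breaks removed; filenames are the same as in the commands above):

  #!/usr/bin/env ruby
  # Join every FASTA header with its (possibly multi-line) sequence on a single
  # tab-separated line, mirroring the awk one-liner above.
  out = File.open('uniref100.fasta.joined', 'w')
  first = true
  File.foreach('uniref100.fasta') do |line|
    line.chomp!
    if line.start_with?('>')
      out.puts unless first      # terminate the previous record
      out.print "#{line}\t"      # header, then a tab before the sequence
      first = false
    else
      out.print line             # append sequence chunks with no separator
    end
  end
  out.puts
  out.close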


- Run the Perl script that extracts only the cross-references and lineage mappings from UniProt that pertain to UniRef100 (a rough sketch of the idea follows the file list below):

subset_idmappings.pl uniref100.fasta.joined idmapping_selected.tab  uniref100_cross_refs uniref100_cluster_proteins_taxons 
(4 min, 36 sec)

Input files:
uniref100.fasta.joined - the joined UniRef100 FASTA produced in the previous step
idmapping_selected.tab - cross-references of UniProt (Swiss-Prot/TrEMBL) entries to other databases

Output files:
uniref100_cross_refs - cross-references to other databases, restricted to proteins that appear in UniRef100 (*)
uniref100_cluster_proteins_taxons - used to find the members of each UniRef100 cluster (**)

(*) Will be merged with the UniRef100 data and used in the FINAL HADOOP JOB for the UniRef100-to-PANDA conversion
(**) Will be used as input to the SECOND HADOOP JOB; the output of that job will be merged with the UniRef100 data
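
A rough Ruby sketch of the subsetting idea (illustration only: the real output formats are defined by subset_idmappings.pl,
only the cluster-membership output is sketched, and the column layout of idmapping_selected.tab is an assumption here,
with column 1 = UniProtKB accession, column 8 = UniRef100 cluster, column 13 = NCBI taxon ID):

  #!/usr/bin/env ruby
  # Illustration only: keep the ID-mapping rows that belong to a UniRef100
  # cluster and emit cluster / protein / taxon triples. Column positions are
  # assumptions (1 = accession, 8 = UniRef100 cluster, 13 = NCBI taxon ID).
  taxons = File.open('uniref100_cluster_proteins_taxons', 'w')
  File.foreach('idmapping_selected.tab') do |line|
    cols = line.chomp.split("\t")
    accession, cluster, taxon = cols[0], cols[7], cols[12]
    next if cluster.nil? || cluster.empty?
    taxons.puts [cluster, accession, taxon].join("\t")
  end
  taxons.close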




(3). FIRST HADOOP JOB, GETS KINGDOMS FOR ALL PROTEINS IN UniProt/TrEMBL

- Copy the data into the Hadoop filesystem (first path is the local file, second is the Hadoop directory):
/opt/hadoop/bin/hadoop fs -put uniprot_sprot_trembl.dat /data/kkrampis/uniref2panda/

- Run the job:
/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input /data/kkrampis/uniref2panda/uniprot_sprot_trembl.dat -output /data/kkrampis/uniref2panda/uniref100_lineages -file /home/kkrampis/devel/BioDoop/uniprot_taxon_lineage_map.rb -mapper /home/kkrampis/devel/BioDoop/uniprot_taxon_lineage_map.rb
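
The mapper logic lives in uniprot_taxon_lineage_map.rb. A rough sketch of what such a streaming mapper over the
UniProt flat file might look like (illustration only: it assumes each entry's kingdom is the first element of its
OC organism-classification lines):

  #!/usr/bin/env ruby
  # Illustration only: emit "accession<TAB>kingdom" for every UniProt entry,
  # taking the kingdom from the first element of the OC lines. The actual
  # mapper is uniprot_taxon_lineage_map.rb.
  accession = nil
  kingdom = nil
  STDIN.each_line do |line|
    case line
    when /\AAC\s+(\w+);/
      accession ||= Regexp.last_match(1)   # first accession of the entry
    when /\AOC\s+([^;]+)/
      kingdom ||= Regexp.last_match(1).strip
    when %r{\A//}                          # "//" closes a UniProt entry
      puts "#{accession}\t#{kingdom}" if accession && kingdom
      accession = kingdom = nil
    end
  end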

- Get output from Hadoop back to the local file system (former is Hadoop dir, latter is local dir):
/opt/hadoop/bin/hadoop fs -get /data/kkrampis/uniref2panda/uniref100_lineages uniref100_lineages



(4). SECOND HADOOP JOB, FINDS ALL MEMBERS OF EACH CLUSTER FROM UniRef100

- Copy the data into the Hadoop filesystem (first path is the local file, second is the Hadoop directory):
/opt/hadoop/bin/hadoop fs -put uniref100_cluster_proteins_taxons /data/kkrampis/uniref2panda/

- Run the job:
/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input /data/kkrampis/uniref2panda/uniref100_cluster_proteins_taxons -output /data/kkrampis/uniref2panda/uniref100_cluster_proteins_taxons_output  -file /home/kkrampis/devel/BioDoop/uniprot_taxon_clusters_map.rb  -mapper /home/kkrampis/devel/BioDoop/uniprot_taxon_clusters_map.rb -file /home/kkrampis/devel/BioDoop/uniprot_taxon_clusters_reduce.rb -reducer /home/kkrampis/devel/BioDoop/uniprot_taxon_clusters_reduce.rb -numReduceTasks 128
(3 min, 30 sec)
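
The map/reduce pair lives in uniprot_taxon_clusters_map.rb and uniprot_taxon_clusters_reduce.rb. A rough sketch of
the reduce side (illustration only, assuming the mapper's key is the UniRef100 cluster ID; Hadoop streaming delivers
the mapper output sorted by key, so all members of a cluster arrive consecutively):

  #!/usr/bin/env ruby
  # Illustration only: collect the consecutive members of each cluster and
  # emit one "cluster<TAB>member1,member2,..." line per cluster. The actual
  # reducer is uniprot_taxon_clusters_reduce.rb.
  current, members = nil, []
  flush = lambda { puts "#{current}\t#{members.join(',')}" unless current.nil? }
  STDIN.each_line do |line|
    cluster, member = line.chomp.split("\t", 2)
    if cluster != current
      flush.call
      current, members = cluster, []
    end
    members << member
  end
  flush.call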

- Get output from Hadoop back to the local file system (former is Hadoop dir, latter is local dir), and concatenate the parts:

/opt/hadoop/bin/hadoop fs -get /data/kkrampis/uniref2panda/uniref100_cluster_proteins_taxons_output  uniref100_cluster_proteins_taxons_output 

cat uniref100_cluster_proteins_taxons_output/part-* >> uniref100_clusters




(5). FINAL HADOOP JOB, INTEGRATES THE CLUSTER MEMBERS AND CROSS-REFERENCES INTO THE UniRef100 HEADERS AND CONVERTS THE HEADERS TO PANDA FORMAT

- Merge the cross-reference, cluster-member, and lineage files into the joined UniRef100 file:
cat uniref100_cross_refs uniref100_clusters uniref100_lineages/part-00000 >> uniref100.fasta.joined

- Copy the data to the Hadoop filesystem:
/opt/hadoop/bin/hadoop fs -put uniref100.fasta.joined /data/kkrampis/uniref2panda/

-- Run the job
/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input /data/kkrampis/uniref2panda/uniref100.fasta.joined -output /data/kkrampis/uniref2panda/uniref100_panda -file /home/kkrampis/devel/BioDoop/uniref2panda_map_2.pl  -mapper /home/kkrampis/devel/BioDoop/uniref2panda_map_2.pl -file /home/kkrampis/devel/BioDoop/uniref2panda_reduce_2.pl  -reducer /home/kkrampis/devel/BioDoop/uniref2panda_reduce_2.pl -numReduceTasks 128
(min, sec)
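
Conceptually this is a reduce-side join: the merged input holds four kinds of lines (joined FASTA records,
cross-references, cluster members, lineages) that share an identifier with their UniRef100 record, and the reducer
sees all lines for one identifier together and folds them into a single PANDA record. The real key handling and
PANDA header layout are in uniref2panda_map_2.pl / uniref2panda_reduce_2.pl; the Ruby sketch below is an
illustration only and uses a hypothetical extract_key helper:

  #!/usr/bin/env ruby
  # Illustration only: tag every merged input line with the identifier it
  # shares with its UniRef100 record, so the shuffle/sort phase brings all
  # lines for one record to the same reducer. extract_key is hypothetical;
  # the real logic lives in the Perl map/reduce scripts.
  def extract_key(line)
    line.split("\t").first   # assumption: the join key is the first field
  end

  STDIN.each_line do |line|
    line.chomp!
    puts "#{extract_key(line)}\t#{line}"
  end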


-- Copy output to local filesystem and concatenate file parts
/opt/hadoop/bin/hadoop fs -get /data/kkrampis/uniref2panda/uniref100_panda uniref100_panda
cat uniref100_panda/part-* >> uniref100_PANDA_Hadoop.fasta
cp uniref100_PANDA_Hadoop.fasta /usr/local/projects/PANDA/PANDA_uniref/all/


-- Split by Kingdom
grep -A 1 "\-\-\-Archaea" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@Archaea//' > /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Archaea/uniref100_PANDA_Archaea.fasta  &
grep -A 1 "\-\-\-Bacteria" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@Bacteria//' > /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Bacteria/uniref100_PANDA_Bacteria.fasta  &
grep -A 1 "\-\-\-Eukaryota" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@Eukaryota//' > /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Eukaryota/uniref100_PANDA_Eukaryota.fasta  &
grep -A 1 "\-\-\-Viruses" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@Viruses//' > /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Viruses/uniref100_PANDA_Viruses.fasta  &
grep -A 1 "\-\-\-noKingdom" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@noKingdom//' > /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/noKingdom/uniref100_PANDA_noKingdom.fasta  &
grep -A 1 "\-\-\-unclassifiedsequences" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@unclassifiedsequences//' >> /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/unclassified/uniref100_PANDA_unclassified.fasta  &
grep -A 1 "\-\-\-unclassified" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@unclassified//' >> /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/unclassified/uniref100_PANDA_unclassified.fasta  &
grep -A 1 "\-\-\-othersequences" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@othersequences//' >> /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/unclassified/uniref100_PANDA_unclassified.fasta  &


-- Clean up the kingdom identifiers in the full database
sed -e 's/@@@[A-Za-z]*//' /usr/local/projects/PANDA/PANDA_uniref/all/uniref100_PANDA_Hadoop.fasta > /usr/local/projects/PANDA/PANDA_uniref/all/uniref100_PANDA.fasta

-- Format DB
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Archaea formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Archaea/uniref100_PANDA_Archaea.fasta -o T
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Bacteria formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Bacteria/uniref100_PANDA_Bacteria.fasta -o T
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Eukaryota formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Eukaryota/uniref100_PANDA_Eukaryota.fasta -o T
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Viruses formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Viruses/uniref100_PANDA_Viruses.fasta -o T
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/noKingdom formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/noKingdom/uniref100_PANDA_noKingdom.fasta -o T
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/unclassified formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/unclassified/uniref100_PANDA_unclassified.fasta -o T

qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/all formatdb -i /usr/local/projects/PANDA/PANDA_uniref/all/uniref100_PANDA_Hadoop.fasta -o T
