A collection of random bits and pieces of Ruby scripts, written here and there to parse data files

agbiotec/bioparsing


(1). GET THE DATA

- Download the latest releases of UniRef100, the UniProt ID mappings, and the UniProt Swiss-Prot and TrEMBL flat files:

wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz

gunzip *

- Concatenate Swiss-Prot and TrEMBL into the single flat file read by the first Hadoop job in step (3), then remove the large originals, which are not needed anymore:
cat uniprot_sprot.dat uniprot_trembl.dat > uniprot_sprot_trembl.dat
rm uniprot_sprot.dat uniprot_trembl.dat




(2). PRE-FORMAT THE DATA FOR UPLOAD TO HADOOP

- Join each header line with its sequence, to avoid splitting a multi-line sequence across the Hadoop file chunks
  (separating the header and sequence again is taken care of by the Hadoop code). A Ruby equivalent of the awk
  one-liner is sketched after the commands below.

  awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0"\t":$0 }' uniref100.fasta > uniref100.fasta.joined
  (6 min, 52 sec)
  rm uniref100.fasta
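
For reference, a minimal Ruby equivalent of the awk one-liner above (each output line holds one record: the header,
a tab, then the full sequence with its line breaks removed; filenames are the same as in the commands above):

  #!/usr/bin/env ruby
  # Join every FASTA header with its (possibly multi-line) sequence on a single
  # tab-separated line, mirroring the awk one-liner above.
  out = File.open('uniref100.fasta.joined', 'w')
  first = true
  File.foreach('uniref100.fasta') do |line|
    line.chomp!
    if line.start_with?('>')
      out.puts unless first      # terminate the previous record
      out.print "#{line}\t"      # header, then a tab before the sequence
      first = false
    else
      out.print line             # append sequence chunks with no separator
    end
  end
  out.puts
  out.close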


- Run the Perl script that extracts only the cross-references and lineage mappings from UniProt that pertain to UniRef100 (a rough sketch of the idea follows the file list below):

subset_idmappings.pl uniref100.fasta.joined idmapping_selected.tab  uniref100_cross_refs uniref100_cluster_proteins_taxons 
(4 min, 36 sec)

Input files:
uniref100.fasta.joined - the joined UniRef100 FASTA produced in the previous step
idmapping_selected.tab - cross-references of UniProt (Swiss-Prot/TrEMBL) entries to other databases

Output files:
uniref100_cross_refs - cross-references to other databases, restricted to proteins that appear in UniRef100 (*)
uniref100_cluster_proteins_taxons - used to find the members of each UniRef100 cluster (**)

(*) Will be merged with the UniRef100 data and used in the FINAL HADOOP JOB for the UniRef100-to-PANDA conversion
(**) Will be used as input to the SECOND HADOOP JOB; the output of that job will be merged with the UniRef100 data
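
A rough Ruby sketch of the subsetting idea (illustration only: the real output formats are defined by subset_idmappings.pl,
only the cluster-membership output is sketched, and the column layout of idmapping_selected.tab is an assumption here,
with column 1 = UniProtKB accession, column 8 = UniRef100 cluster, column 13 = NCBI taxon ID):

  #!/usr/bin/env ruby
  # Illustration only: keep the ID-mapping rows that belong to a UniRef100
  # cluster and emit cluster / protein / taxon triples. Column positions are
  # assumptions (1 = accession, 8 = UniRef100 cluster, 13 = NCBI taxon ID).
  taxons = File.open('uniref100_cluster_proteins_taxons', 'w')
  File.foreach('idmapping_selected.tab') do |line|
    cols = line.chomp.split("\t")
    accession, cluster, taxon = cols[0], cols[7], cols[12]
    next if cluster.nil? || cluster.empty?
    taxons.puts [cluster, accession, taxon].join("\t")
  end
  taxons.close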




(3). FIRST HADOOP JOB, GETS KINGDOMS FOR ALL PROTEINS IN UniProt/TrEMBL

- Copy the data into the Hadoop filesystem (first path is the local file, second is the Hadoop directory):
/opt/hadoop/bin/hadoop fs -put uniprot_sprot_trembl.dat /data/kkrampis/uniref2panda/

- Run the job:
/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input /data/kkrampis/uniref2panda/uniprot_sprot_trembl.dat -output /data/kkrampis/uniref2panda/uniref100_lineages -file /home/kkrampis/devel/BioDoop/uniprot_taxon_lineage_map.rb -mapper /home/kkrampis/devel/BioDoop/uniprot_taxon_lineage_map.rb
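
The mapper logic lives in uniprot_taxon_lineage_map.rb. A rough sketch of what such a streaming mapper over the
UniProt flat file might look like (illustration only: it assumes each entry's kingdom is the first element of its
OC organism-classification lines):

  #!/usr/bin/env ruby
  # Illustration only: emit "accession<TAB>kingdom" for every UniProt entry,
  # taking the kingdom from the first element of the OC lines. The actual
  # mapper is uniprot_taxon_lineage_map.rb.
  accession = nil
  kingdom = nil
  STDIN.each_line do |line|
    case line
    when /\AAC\s+(\w+);/
      accession ||= Regexp.last_match(1)   # first accession of the entry
    when /\AOC\s+([^;]+)/
      kingdom ||= Regexp.last_match(1).strip
    when %r{\A//}                          # "//" closes a UniProt entry
      puts "#{accession}\t#{kingdom}" if accession && kingdom
      accession = kingdom = nil
    end
  end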

- Get output from Hadoop back to the local file system (former is Hadoop dir, latter is local dir):
/opt/hadoop/bin/hadoop fs -get /data/kkrampis/uniref2panda/uniref100_lineages uniref100_lineages



(4). SECOND HADOOP JOB, FINDS ALL MEMBERS OF EACH CLUSTER FROM UniRef100

- Copy the data into the Hadoop filesystem (first path is the local file, second is the Hadoop directory):
/opt/hadoop/bin/hadoop fs -put uniref100_cluster_proteins_taxons /data/kkrampis/uniref2panda/

- Run the job:
/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input /data/kkrampis/uniref2panda/uniref100_cluster_proteins_taxons -output /data/kkrampis/uniref2panda/uniref100_cluster_proteins_taxons_output  -file /home/kkrampis/devel/BioDoop/uniprot_taxon_clusters_map.rb  -mapper /home/kkrampis/devel/BioDoop/uniprot_taxon_clusters_map.rb -file /home/kkrampis/devel/BioDoop/uniprot_taxon_clusters_reduce.rb -reducer /home/kkrampis/devel/BioDoop/uniprot_taxon_clusters_reduce.rb -numReduceTasks 128
(3 min, 30 sec)
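
The map/reduce pair lives in uniprot_taxon_clusters_map.rb and uniprot_taxon_clusters_reduce.rb. A rough sketch of
the reduce side (illustration only, assuming the mapper's key is the UniRef100 cluster ID; Hadoop streaming delivers
the mapper output sorted by key, so all members of a cluster arrive consecutively):

  #!/usr/bin/env ruby
  # Illustration only: collect the consecutive members of each cluster and
  # emit one "cluster<TAB>member1,member2,..." line per cluster. The actual
  # reducer is uniprot_taxon_clusters_reduce.rb.
  current, members = nil, []
  flush = lambda { puts "#{current}\t#{members.join(',')}" unless current.nil? }
  STDIN.each_line do |line|
    cluster, member = line.chomp.split("\t", 2)
    if cluster != current
      flush.call
      current, members = cluster, []
    end
    members << member
  end
  flush.call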

- Get output from Hadoop back to the local file system (former is Hadoop dir, latter is local dir), and concatenate the parts:

/opt/hadoop/bin/hadoop fs -get /data/kkrampis/uniref2panda/uniref100_cluster_proteins_taxons_output  uniref100_cluster_proteins_taxons_output 

cat uniref100_cluster_proteins_taxons_output/part-* >> uniref100_clusters




(5). FINAL HADOOP JOB, INTEGRATES THE CLUSTER MEMBERS AND CROSS-REFERENCES INTO THE UniRef100 HEADERS AND CONVERTS THE HEADERS TO PANDA FORMAT

- Merge the cross-reference, cluster-member, and lineage files into the joined UniRef100 file:
cat uniref100_cross_refs uniref100_clusters uniref100_lineages/part-00000 >> uniref100.fasta.joined

- Copy the data to the Hadoop filesystem:
/opt/hadoop/bin/hadoop fs -put uniref100.fasta.joined /data/kkrampis/uniref2panda/

-- Run the job
/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -input /data/kkrampis/uniref2panda/uniref100.fasta.joined -output /data/kkrampis/uniref2panda/uniref100_panda -file /home/kkrampis/devel/BioDoop/uniref2panda_map_2.pl  -mapper /home/kkrampis/devel/BioDoop/uniref2panda_map_2.pl -file /home/kkrampis/devel/BioDoop/uniref2panda_reduce_2.pl  -reducer /home/kkrampis/devel/BioDoop/uniref2panda_reduce_2.pl -numReduceTasks 128
(min, sec)
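
Conceptually this is a reduce-side join: the merged input holds four kinds of lines (joined FASTA records,
cross-references, cluster members, lineages) that share an identifier with their UniRef100 record, and the reducer
sees all lines for one identifier together and folds them into a single PANDA record. The real key handling and
PANDA header layout are in uniref2panda_map_2.pl / uniref2panda_reduce_2.pl; the Ruby sketch below is an
illustration only and uses a hypothetical extract_key helper:

  #!/usr/bin/env ruby
  # Illustration only: tag every merged input line with the identifier it
  # shares with its UniRef100 record, so the shuffle/sort phase brings all
  # lines for one record to the same reducer. extract_key is hypothetical;
  # the real logic lives in the Perl map/reduce scripts.
  def extract_key(line)
    line.split("\t").first   # assumption: the join key is the first field
  end

  STDIN.each_line do |line|
    line.chomp!
    puts "#{extract_key(line)}\t#{line}"
  end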


-- Copy output to local filesystem and concatenate file parts
/opt/hadoop/bin/hadoop fs -get /data/kkrampis/uniref2panda/uniref100_panda uniref100_panda
cat uniref100_panda/part-* >> uniref100_PANDA_Hadoop.fasta
cp uniref100_PANDA_Hadoop.fasta /usr/local/projects/PANDA/PANDA_uniref/all/


-- Split by Kingdom
grep -A 1 "\-\-\-Archaea" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@Archaea//' > /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Archaea/uniref100_PANDA_Archaea.fasta  &
grep -A 1 "\-\-\-Bacteria" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@Bacteria//' > /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Bacteria/uniref100_PANDA_Bacteria.fasta  &
grep -A 1 "\-\-\-Eukaryota" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@Eukaryota//' > /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Eukaryota/uniref100_PANDA_Eukaryota.fasta  &
grep -A 1 "\-\-\-Viruses" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@Viruses//' > /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Viruses/uniref100_PANDA_Viruses.fasta  &
grep -A 1 "\-\-\-noKingdom" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@noKingdom//' > /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/noKingdom/uniref100_PANDA_noKingdom.fasta  &
grep -A 1 "\-\-\-unclassifiedsequences" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@unclassifiedsequences//' >> /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/unclassified/uniref100_PANDA_unclassified.fasta  &
grep -A 1 "\-\-\-unclassified" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@unclassified//' >> /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/unclassified/uniref100_PANDA_unclassified.fasta  &
grep -A 1 "\-\-\-othersequences" uniref100_PANDA_Hadoop.fasta | sed -e 's/@@@othersequences//' >> /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/unclassified/uniref100_PANDA_unclassified.fasta  &


-- Clean up the kingdom identifiers in the full database
sed -e 's/@@@[A-Za-z]*//' /usr/local/projects/PANDA/PANDA_uniref/all/uniref100_PANDA_Hadoop.fasta > /usr/local/projects/PANDA/PANDA_uniref/all/uniref100_PANDA.fasta

-- Format DB
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Archaea formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Archaea/uniref100_PANDA_Archaea.fasta -o T
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Bacteria formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Bacteria/uniref100_PANDA_Bacteria.fasta -o T
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Eukaryota formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Eukaryota/uniref100_PANDA_Eukaryota.fasta -o T
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Viruses formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/Viruses/uniref100_PANDA_Viruses.fasta -o T
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/noKingdom formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/noKingdom/uniref100_PANDA_noKingdom.fasta -o T
qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/unclassified formatdb -i /usr/local/projects/PANDA/PANDA_uniref/split_by_kingdom/unclassified/uniref100_PANDA_unclassified.fasta -o T

qsub -P 08020 -wd /usr/local/projects/PANDA/PANDA_uniref/all formatdb -i /usr/local/projects/PANDA/PANDA_uniref/all/uniref100_PANDA_Hadoop.fasta -o T
