v.1.1 with one new feature
nylander committed Jan 10, 2018
1 parent b592ad9 commit f0207a2
Showing 7 changed files with 117 additions and 100 deletions.
102 changes: 51 additions & 51 deletions README.md
@@ -1,13 +1,14 @@
# translate\_fasta\_headers
# Translate fasta headers

Translate long fasta headers to short - and back!

Your alignment program X doesn't allow strings longer than n characters, but all your info is
in the fasta headers of your file. What to do? Use `translate_fasta_headers.pl` on your fasta file
to create short labels and a translation table. Run your program X, and then back translate your
fasta headers running `translate_fasta_headers.pl` again!
in the fasta headers of your file. What to do?

And if you created a tree with the short labels, try to back translate using `replace_taxon_labels_in_newick.pl`!
Use `translate_fasta_headers.pl` on your fasta file to create short labels and a translation
table. Run your program X, and then back-translate your fasta headers by running `translate_fasta_headers.pl` again!

And if you created a tree with the short labels, try to back-translate using `replace_taxon_labels_in_newick.pl`!


## DESCRIPTION
@@ -16,61 +17,63 @@ Replace fasta headers with headers taken from tab delimited file. If no tab file
the (potentially long) fasta headers are replaced by short labels "Seq\_1", "Seq\_2", etc, and
the short and original headers are printed to a translation file.

The script for translating labels in newick trees is somewhat limited in capacity due to the
restrictions of the newick tree format. Use with caution.
If you wish, you may choose your own prefix (instead of `Seq_`). This could be handy if, for
example, you wish to concatenate files.

The script for translating labels in Newick trees is somewhat limited in capacity due to the
restrictions of the Newick tree format. Use with caution.
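
The long-to-short relabelling described in this section can be sketched as follows. This is a minimal illustration, not the code in `translate_fasta_headers.pl`: the helper name `shorten_headers` and the example headers are assumptions made up for this sketch, and only the default prefix `Seq_` and the short-left, long-right table layout come from the README.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not taken from the script): replace every fasta
# header with "<prefix><n>" and collect "short<TAB>long" table rows.
sub shorten_headers {
    my ($lines_ref, $prefix) = @_;
    $prefix = 'Seq_' unless defined $prefix;    # default documented in the README
    my $counter = 1;
    my (@out, @table);
    for my $line (@{$lines_ref}) {
        if ($line =~ /^>\s*(.*?)\s*$/) {        # header line: capture the long label
            my $short = $prefix . $counter++;
            push @table, "$short\t$1";          # short label to the left, long to the right
            push @out,   ">$short";
        }
        else {
            push @out, $line;                   # sequence lines pass through unchanged
        }
    }
    return (\@out, \@table);
}

# Made-up example input, one element per line (newlines already removed).
my @fasta = (
    '>Homo sapiens voucher ABC123 gene X', 'ACGT',
    '>Pan troglodytes isolate 7',          'ACGA',
);
my ($short_fasta, $table) = shorten_headers(\@fasta, 'Own_');
print "$_\n" for @{$short_fasta};
print "$_\n" for @{$table};
```

Using a distinct prefix per input file keeps the short labels unique when several relabelled files are concatenated, which is the use case the README mentions.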


## USAGE

./translate_fasta_headers.pl [--tabfile=<tabfile.tab>] [--in=<in.fas>] [--out=<out.fas>] <in.fas>
./translate_fasta_headers.pl [options] <file>

#### From long to short labels:

./translate_fasta_headers.pl --out=out.fas in.fas

#### An back, using a translation table:

./translate_fasta_headers.pl --tabfile=out.fas.translation.tab out.fas

#### Slightly shorter version (see note about the '--out' option below):

./translate_fasta_headers.pl in.fas > out.fas
./translate_fasta_headers.pl -t in.fas.translation.tab out.fas

#### Translate short seq labels in Newick tree to long
./translate_fasta_headers.pl --out=short.fas long.fas

./replace_taxon_labels_in_newick.pl -t out.fas.translation.tab out.fas.phy
#### And back, using a translation table:

./translate_fasta_headers.pl --tabfile=short.fas.translation.tab short.fas

## OPTIONS
#### Slightly shorter version (see note about the `--out` option below):

* `translate_fasta_headers.pl`
./translate_fasta_headers.pl long.fas > short.fas
./translate_fasta_headers.pl -t long.fas.translation.tab short.fas

* `-t, --tabfile=<filename>` -- Specify tab-separated translation file with unique "short" labels to the left,
and "long" names to the right. Translation will be from left to right.
#### Use your own prefix:

* `-o, --out=<filename>` -- Specify output file for the fasta sequences. Note: If `--out=<filename>` is
specified, the translation file will be named `<filename>.translation.tab`.
This simplifies back translation. If `--out` is not used, the translation
file will be named after the infile!
./translate_fasta_headers.pl --prefix='Own_' long.fas

* `-i, --in=<filename>` -- Specify name of fasta file. Can be skipped as script reads files from STDIN.
#### Translate short seq labels in Newick tree to long:

* `-n, --notab` -- Do not create a translation file.
./replace_taxon_labels_in_newick.pl -t long.fas.translation.tab short.fas.phy

* `-f, --forceorder` -- [NOT YET IMPLEMENTED] translate in order of appearance in the fasta file, and use
the same order as in the tabfile - without rigid checking of the names! This
allows non-unique labels in the left column.

* `-h, --help` -- Show this help text and quit.
## OPTIONS

* `replace_taxon_labels_in_newick.pl`
### `translate_fasta_headers.pl`

* `-t, --table=<translation.tab>` -- file with table describing what will be translated with what.
* `-t, --tabfile=<filename>` Specify tab-separated translation file with unique "short" labels to the left,
and "long" names to the right. Translation will be from left to right.
* `-o, --out=<filename>` Specify output file for the fasta sequences.
**Note**: If `--out=<filename>` is specified, the translation file will be named
`<filename>.translation.tab`. This simplifies back translation.
If, on the other hand, `--out` is not used, the translation file will be named after the infile!
* `-i, --in=<filename>` Specify name of fasta file. Can be skipped as script reads files from STDIN.
* `-n, --notab` Do not create a translation file.
* `-p, --prefix=<string>` Use your own prefix (default is `Seq_`). A number is appended to the
prefix to form each label (e.g. `Own_1`, `Own_2`, ...)
* `-f, --forceorder` [NOT YET IMPLEMENTED!] translate in order of appearance in the fasta file, and use
the same order as in the tabfile - without rigid checking of the names! This
allows non-unique labels in the left column.
* `-h, --help` Show this help text and quit.

* `-h, --help` -- Help text.
### `replace_taxon_labels_in_newick.pl`

* `-o, --out=<out.file>` -- Print to outfile `out.file`, else to STDOUT.
* `-t, --table=<translation.tab>` file with table describing what will be translated with what.
* `-h, --help` Help text.
* `-o, --out=<out.file>` Print to outfile `out.file`, else to STDOUT.
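
One reason back-translation in Newick trees calls for caution is that long labels may contain characters the format reserves (parentheses, commas, colons, semicolons, whitespace). The sketch below shows one possible way to handle that, by single-quoting such labels. It is an assumption for illustration only and not necessarily what `replace_taxon_labels_in_newick.pl` does; the helper names `relabel_newick` and `quote_label` and the example tree are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: wrap a label in single quotes if it contains
# characters that are special in Newick; inner quotes are doubled.
sub quote_label {
    my ($label) = @_;
    return $label unless $label =~ /[\s(),:;\[\]']/;
    (my $quoted = $label) =~ s/'/''/g;
    return "'$quoted'";
}

# Hypothetical helper: replace every token found in the table
# (key: short label, value: long label); other tokens, such as
# branch lengths, are left untouched.
sub relabel_newick {
    my ($newick, $table_ref) = @_;
    $newick =~ s{([^\s(),:;]+)}
                { exists($table_ref->{$1}) ? quote_label($table_ref->{$1}) : $1 }gex;
    return $newick;
}

# Made-up example data.
my %table = (
    'Seq_1' => 'Homo sapiens (human)',
    'Seq_2' => 'Pan_troglodytes',
);
my $relabelled = relabel_newick('(Seq_1:0.1,Seq_2:0.2);', \%table);
print "$relabelled\n";    # prints ('Homo sapiens (human)':0.1,Pan_troglodytes:0.2);
```

Whether a downstream tree viewer accepts quoted labels varies, so checking the relabelled tree in the target program is advisable.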


AUTHOR
@@ -82,23 +85,20 @@ Johan.Nylander\@nbis.se
FILES
-----

* translate\_fasta\_headers.pl -- Perl script

* replace\_taxon\_labels\_in\_newick.pl -- Perl script

* in.fas -- example file with long fasta headers

* out.fas.translation.tab -- example translation table

* out.fas -- example output with short fasta headers

* out.fas.phy -- example newick tree with short labels
* `translate_fasta_headers.pl` Perl script
* `replace_taxon_labels_in_newick.pl` Perl script
* `long.fas` Example file with long fasta headers
* `short.fas.translation.tab` Example translation table
* `short.fas` Example output with short fasta headers
* `short.fas.phy` Example Newick tree with short labels
* `README.md` Documentation, markdown format
* `README.pdf` Documentation, PDF format


LICENSE AND COPYRIGHT
---------------------

Copyright (c) 2013, 2014, 2015, 2016, 2017 Johan Nylander. All rights reserved.
Copyright (c) 2013, 2014, 2015, 2016, 2017, 2018 Johan Nylander. All rights reserved.

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
Binary file added README.pdf
Binary file not shown.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
115 changes: 66 additions & 49 deletions translate_fasta_headers.pl
@@ -1,71 +1,87 @@
#!/usr/bin/perl
#!/usr/bin/env perl
#===============================================================================
=pod
=head2
FILE: translate_fasta_headers.pl
FILE: translate_fasta_headers.pl
USAGE: ./translate_fasta_headers.pl [--tabfile=tabfile.tab] [--in=in.fas] [--out=out.fas] in.fas
USAGE: ./translate_fasta_headers.pl [options] <file>
# From long to short labels:
./translate_fasta_headers.pl --out=out.fas in.fas
# From long to short labels:
./translate_fasta_headers.pl --out=short.fas long.fas
# An back, using a translation table:
./translate_fasta_headers.pl --tabfile=out.fas.translation.tab out.fas
# And back, using a translation table:
./translate_fasta_headers.pl --tabfile=short.fas.translation.tab short.fas
# Slightly shorter/different version:
./translate_fasta_headers.pl in.fas > out.fas
./translate_fasta_headers.pl -t in.fas.translation.tab out.fas > back.fas
# Slightly shorter/different version:
./translate_fasta_headers.pl long.fas > short.fas
./translate_fasta_headers.pl -t long.fas.translation.tab short.fas > back.fas
# Use your own prefix:
./translate_fasta_headers.pl --prefix='Own_' long.fas > short.fas
DESCRIPTION: Replace fasta headers with headers taken from tab delimited file. If no tab file is given,
the (potentially long) fasta headers are replaced by short labels "Seq_1", "Seq_2", etc, and
the short and original headers are printed to a translation file.
OPTIONS: tabfile=<filename> -- Specify tab-separated translation file with unique "short" labels to the left,
and "long" names to the right. Translation will be from left to right.
DESCRIPTION: Replace fasta headers with headers taken from tab delimited file.
If no tab file is given, the (potentially long) fasta headers are replaced
by short labels "Seq_1", "Seq_2", etc, and the short and original headers
are printed to a translation file.
in=<filename> -- Specify name of fasta file. Can be skipped as script reads files from STDIN.
OPTIONS: -t, --tabfile=<filename> -- Specify tab-separated translation file with
unique "short" labels to the left, and "long"
names to the right. Translation will be from
left to right.
out=<filename> -- Specify output file for the fasta sequences. Note: If --out=<filename> is
specified, the translation file will be named <filename>.translation.tab.
This simplifies back translation. If '--out' is not used, the translation
file will be named after the infile!
-i, --in=<filename> -- Specify name of fasta file. Can be skipped as
script reads files from STDIN.
notab -- Do not create a translation file.
-o, --out=<filename> -- Specify output file for the fasta sequences.
Note: If --out=<filename> is specified, the
translation file will be named
<filename>.translation.tab. This simplifies
back translation. If '--out' is not used,
the translation file will be named after
the infile!
forceorder -- [NOT IMPLEMENTED] translate in order of appearance in the fasta file, and use
the same order as in the tabfile - without rigid checking of the names! This
allows non-unique labels in the left column.
-n, --notab -- Do not create a translation file.
help -- Show this help text and quit.
-p, --prefix=<string> -- Prefix for short label. Defaults to 'Seq_'.
-f, --forceorder -- [NOT IMPLEMENTED] translate in order of
appearance in the fasta file, and use
the same order as in the tabfile - without
rigid checking of the names! This allows
non-unique labels in the left column.
-h, --help -- Show this help text and quit.
REQUIREMENTS: ---
BUGS: ---
NOTES: ---
AUTHOR: Johan.Nylander\@bils.se
AUTHOR: Johan.Nylander\@nbis.se
COMPANY: BILS/NRM
COMPANY: NBIS/NRM
VERSION: 1.0
VERSION: 1.0.1
CREATED: 03/13/2013 01:52:28 PM
REVISION: 03/14/2013 11:42:59 PM
REVISION: 01/10/2018 12:59:31 PM
TODO: Handle non-unique values in the left tabfile column (can't use hash):
TODO: Handle non-unique values in the left tabfile column
(can't use hash):
Test if values in translation table are unique. If so,
use read_tabfile. If not, read into two arrays, check
for same lengths, and then use an iterator while reading
the infile. Warn if number of sequences doesn't match
number of entries in the tab file. Plus give a warning
that labels were not unique. Use the array approach when '--forceorder'.
that labels were not unique. Use the array approach
when '--forceorder'.
LICENSE AND COPYRIGHT: Copyright (c) 2013 Johan Nylander. All rights reserved.
LICENSE AND COPYRIGHT: Copyright (c) 2013-2018 Johan Nylander. All rights reserved.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
@@ -87,17 +103,18 @@
use Getopt::Long;

## Globals
my $prefix = 'Seq_'; # Prefix for short names
my $tabfile = q{};
my $in = q{};
my $out = q{};
my $notab = q{};
#my $forceorder = q{};
my $help = q{};
my %header_hash = (); # Key: short, value: long.
my @short_array = (); # Array with short headers.
my @long_array = (); # Array with long headers.
my $PRINT; # Print file handle. Using the typeglob notation below
my $IN; # in order to use STDOUT as a variable.
my %header_hash = (); # Key: short, value: long.
my @short_array = (); # Array with short headers.
my @long_array = (); # Array with long headers.
my $PRINT; # Print file handle. Using the typeglob notation below
my $IN; # in order to use STDOUT as a variable.
#my $forceorder = q{};

## If no arguments
exec("perldoc", $0) unless (@ARGV);
@@ -107,6 +124,7 @@
"out=s" => \$out,
"notab" => \$notab,
"in=s" => \$in,
"prefix=s" => \$prefix,
#"forceorder" => \$forceorder,
"help" => sub { exec("perldoc", $0); exit(0); },
);
@@ -126,11 +144,11 @@

## If infile
if ($in) {
read_infile($in);
read_infile($in, $prefix);
}
else {
while (my $infile = shift(@ARGV)) {
read_infile($infile);
read_infile($infile, $prefix);
}
}

@@ -139,19 +157,18 @@

#=== FUNCTION ================================================================
# NAME: read_infile
# VERSION: 03/14/2013 10:07:26 PM
# VERSION: 01/10/2018 01:03:55 PM
# DESCRIPTION: Reads a tab separated file and returns a hash. Expects all values
# in left column ("short") to be unique
# PARAMETERS: filename
# in left column ("short") to be unique!
# PARAMETERS: filename, prefix
# RETURNS: hash: key:short, value:long
# TODO:
# TODO:
#===============================================================================
sub read_infile {

my ($file) = @_;
my @in_headers_array = ();
my $counter = 1;
my $shortlabel = 'Seq_';
my ($file, $shortlabel) = (@_);
my @in_headers_array = ();
my $counter = 1;
my $OUTTAB;
my $outtabfile;

@@ -230,7 +247,7 @@ sub read_tabfile {
open my $TAB, "<", $file or die "Could not open $file for reading : $! \n";
while(<$TAB>) {
chomp;
next if (/^\s+$/);
next if (/^\s*$/);
my ($short, $long) = split /\t/, $_;
$short = trim_white_space($short);
$long = trim_white_space($long);
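
The change from `/^\s+$/` to `/^\s*$/` in `read_tabfile` matters because `chomp` has already removed the newline: a completely empty line is then the empty string, which `\s+` (one or more) cannot match, so it would fall through to the `split`. A minimal sketch of the difference, using made-up table lines:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Made-up translation table lines, including an empty line and a
# whitespace-only line.
my @lines = (
    "Seq_1\tlong name one\n",
    "\n",
    "   \n",
    "Seq_2\tlong name two\n",
);

my (@old, @new);
for my $line (@lines) {
    chomp(my $copy = $line);
    push @old, $copy unless $copy =~ /^\s+$/;    # old test: misses the empty string
    push @new, $copy unless $copy =~ /^\s*$/;    # new test: skips empty and blank lines
}

printf "old keeps %d lines, new keeps %d lines\n", scalar @old, scalar @new;
# prints: old keeps 3 lines, new keeps 2 lines
```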

1 comment on commit f0207a2

@nylander

Should read v.1.0.1 (not 1.1)
