Skip to content

alignproteome added #875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 33 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
33 commits
Select commit Hold shift + click to select a range
1cb2074
alignproteome added
Gyuuul2 Aug 16, 2024
12c0c81
add workflow, revise based on feedback
Gyuuul2 Aug 19, 2024
df80f07
Change createdb.cpp so that it takes in ".txt" file containing paths …
elpis51613 Aug 25, 2024
1073ec9
Merge remote-tracking branch 'refs/remotes/origin/master'
elpis51613 Aug 25, 2024
3617ea7
Change filepath format from txt to tsv. Also add explanations of its …
elpis51613 Aug 26, 2024
3fc7fcd
createtsv - handle source identifier
Gyuuul2 Aug 27, 2024
0487a1f
proteomcluster - change writer / way to select repProteome
Gyuuul2 Aug 27, 2024
3f4fc91
easy proteomecluster workflow
Gyuuul2 Aug 27, 2024
eeabad3
indexreader - handle source identifier
Gyuuul2 Aug 27, 2024
9996821
update mmseqsbase and change module name into proteomecluster
Gyuuul2 Aug 27, 2024
27d9f68
set alignment mode default
Gyuuul2 Aug 28, 2024
e2ef229
latest
Gyuuul2 Aug 28, 2024
fd1f5d6
original index
Gyuuul2 Aug 28, 2024
0d7db0f
use sync_fetch_and_add when increment memcount
Gyuuul2 Aug 28, 2024
1fe489e
change submat
Gyuuul2 Aug 28, 2024
86f6b15
add timer
Gyuuul2 Aug 28, 2024
8344e5b
Merge branch 'pr-879'
Gyuuul2 Aug 28, 2024
be5a53d
Remove __packed__ to resolve sync_fetch_and_add error
Gyuuul2 Aug 28, 2024
8d3a38f
change chmod in createdb
Gyuuul2 Aug 29, 2024
bfa1911
update proteome-similarity threshold parameter
Gyuuul2 Aug 29, 2024
7e912d2
Sort redundant proteomes by similarity score in proteome_cluster.tsv
Gyuuul2 Sep 8, 2024
618e2bf
delete redundant code - checking proteome is singleton(no redudant me…
Gyuuul2 Sep 9, 2024
e990773
stop
Gyuuul2 Sep 9, 2024
e9ffe22
jaccardindexTrial
Gyuuul2 Sep 8, 2024
d14217f
add normalized scoring
Gyuuul2 Sep 9, 2024
5e6ee6d
add proteome relative(normalized) similarity threshold parameter
Gyuuul2 Sep 9, 2024
743370d
relative similarity threshold
Gyuuul2 Sep 9, 2024
44a6fca
latest version-bugfix,relativeProteomeSimilarity added, singleton pro…
Gyuuul2 Sep 9, 2024
910bb7f
Change Relative Score formula
Gyuuul2 Sep 10, 2024
e24b9e3
Remove Debug printout
Gyuuul2 Sep 10, 2024
ad1acdf
style change
Gyuuul2 Sep 10, 2024
e026260
Latest Version
Gyuuul2 Sep 10, 2024
d9213ef
Add cluster count report
Gyuuul2 Sep 19, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions data/workflow/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ set(GENERATED_WORKFLOWS
workflow/easycluster.sh
workflow/easytaxonomy.sh
workflow/easyrbh.sh
workflow/easyalignproteome.sh
workflow/blastp.sh
workflow/blastpgp.sh
workflow/map.sh
Expand Down
53 changes: 53 additions & 0 deletions data/workflow/easyalignproteome.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
#!/bin/sh -e
fail() {
echo "Error: $1"
exit 1
}

notExists() {
[ ! -f "$1" ]
}


if notExists "${TMP_PATH}/input.dbtype"; then
# shellcheck disable=SC2086
"$MMSEQS" createdb "$@" "${TMP_PATH}/input" ${CREATEDB_PAR} \
|| fail "query createdb died"
fi

if notExists "${TMP_PATH}/clu.dbtype"; then
# shellcheck disable=SC2086
"$MMSEQS" linclust "${TMP_PATH}/input" "${TMP_PATH}/clu" "${TMP_PATH}/clu_tmp" ${CLUSTER_PAR} \
|| fail "Search died"
fi

if notExists "${TMP_PATH}/aln.dbtype"; then
# shellcheck disable=SC2086
"$MMSEQS" alignproteome "${TMP_PATH}/input" "${TMP_PATH}/clu" "${TMP_PATH}/aln_protein" "${TMP_PATH}/aln_proteome" ${ALIGNPROTEOME_PAR} \
|| fail "Convert Alignments died"
fi

if notExists "${RESULTS}_protein_cluster.tsv"; then
# shellcheck disable=SC2086
"$MMSEQS" createtsv "${TMP_PATH}/input" "${TMP_PATH}/clu" "${RESULTS}_protein_cluster.tsv" ${THREADS_PAR} \
|| fail "createtsv protein died"
fi

if notExists "${RESULTS}_proteome_cluster.tsv"; then
# shellcheck disable=SC2086
"$MMSEQS" createtsv "${TMP_PATH}/input" "${TMP_PATH}/aln_proteome" "${RESULTS}_proteome_cluster.tsv" ${THREADS_PAR} \
|| fail "createtsv proteome died"
fi

if [ -n "${REMOVE_TMP}" ]; then
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/input" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/input_h" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/clu" ${VERBOSITY_PAR}
# shellcheck disable=SC2086
"$MMSEQS" rmdb "${TMP_PATH}/aln" ${VERBOSITY_PAR}
rm -rf "${TMP_PATH}/clu_tmp"
rm -f "${TMP_PATH}/easyproteomecluster.sh"
fi
1 change: 1 addition & 0 deletions src/CommandDeclarations.h
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ extern int diffseqdbs(int argc, const char **argv, const Command& command);
extern int easycluster(int argc, const char **argv, const Command& command);
extern int easyrbh(int argc, const char **argv, const Command& command);
extern int easylinclust(int argc, const char **argv, const Command& command);
extern int easyalignproteome(int argc, const char **argv, const Command& command);
extern int easysearch(int argc, const char **argv, const Command& command);
extern int easylinsearch(int argc, const char **argv, const Command& command);
extern int tsv2exprofiledb(int argc, const char **argv, const Command& command);
Expand Down
10 changes: 8 additions & 2 deletions src/MMseqsBase.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,12 @@ std::vector<Command> baseCommands = {
CITATION_MMSEQS2|CITATION_LINCLUST, {{"fastaFile[.gz|.bz2]", DbType::ACCESS_MODE_INPUT, DbType::NEED_DATA|DbType::VARIADIC, &DbValidator::flatfileAndStdin },
{"clusterPrefix", DbType::ACCESS_MODE_OUTPUT, DbType::NEED_DATA, &DbValidator::flatfile },
{"tmpDir", DbType::ACCESS_MODE_OUTPUT, DbType::NEED_DATA, &DbValidator::directory }}},
{"easy-alignproteome", easyalignproteome, &par.easyalignproteome, COMMAND_EASY,
"Proteome clustering and alignemnt",
"<i:fastaFile1[.gz|.bz2]> ... <i:fastaFileN[.gz|.bz2]>"
// CITATION_MMSEQS2,

},
{"easy-taxonomy", easytaxonomy, &par.easytaxonomy, COMMAND_EASY,
"Taxonomic classification",
"# Assign taxonomic labels to FASTA sequences\n"
Expand Down Expand Up @@ -633,9 +639,9 @@ std::vector<Command> baseCommands = {
{"resultDB", DbType::ACCESS_MODE_INPUT, DbType::NEED_DATA, &DbValidator::resultDb },
{"alignmentDB", DbType::ACCESS_MODE_OUTPUT, DbType::NEED_DATA, &DbValidator::alignmentDb }}},
{"alignproteome", alignproteome, &par.alignproteome, COMMAND_ALIGNMENT,
"Within-result all-vs-all gapped local alignment",
"Proteome clustering and alignment",
NULL,
"Martin Steinegger <[email protected]>",
"Martin Steinegger <[email protected]> & Gyuri Kim <[email protected]>",
"<i:sequenceDB> <i:resultDB> <o:alignmentDB>",
CITATION_MMSEQS2, {{"sequenceDB", DbType::ACCESS_MODE_INPUT, DbType::NEED_DATA, &DbValidator::sequenceDb },
{"resultDB", DbType::ACCESS_MODE_INPUT, DbType::NEED_DATA, &DbValidator::resultDb },
Expand Down
1 change: 0 additions & 1 deletion src/alignment/Matcher.h
Original file line number Diff line number Diff line change
Expand Up @@ -221,7 +221,6 @@ class Matcher{


static size_t resultToBuffer(char * buffer, const result_t &result, bool addBacktrace, bool compress = true, bool addOrfPosition = false);
static size_t resultToBuffer_str(char * buffer, const result_t &result, bool addBacktrace, bool compress, bool addOrfPosition=false);
static int computeAlnLength(int anEnd, int start, int dbEnd, int dbStart);

static void updateResultByRescoringBacktrace(const char *querySeq, const char *targetSeq, const char **subMat, EvalueComputation &evaluer,
Expand Down
2 changes: 0 additions & 2 deletions src/commons/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,6 @@ set(commons_header_files
commons/Command.h
commons/CommandCaller.h
commons/Concat.h
# commons/ClusterSpecies.h
commons/DBConcat.h
commons/DBReader.h
commons/DBWriter.h
Expand Down Expand Up @@ -53,7 +52,6 @@ set(commons_source_files
commons/BaseMatrix.cpp
commons/Command.cpp
commons/CommandCaller.cpp
# commons/ClusterSpecies.cpp
commons/DBConcat.cpp
commons/DBReader.cpp
commons/DBWriter.cpp
Expand Down
32 changes: 0 additions & 32 deletions src/commons/ClusterSpecies.cpp
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file looks incomplete

This file was deleted.

42 changes: 0 additions & 42 deletions src/commons/ClusterSpecies.h
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

This file was deleted.

17 changes: 0 additions & 17 deletions src/commons/DBReader.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -137,8 +137,6 @@ template <typename T> bool DBReader<T>::open(int accessType){
}
}
if (dataMode & USE_LOOKUP || dataMode & USE_LOOKUP_REV) {
Debug(Debug::INFO) << "ReadLookup file: " << dataFileName << "\n"; //gyuri

std::string lookupFilename = (std::string(dataFileName) + ".lookup");
MemoryMapped lookupData(lookupFilename, MemoryMapped::WholeFile, MemoryMapped::SequentialScan);
if (lookupData.isValid() == false) {
Expand All @@ -147,9 +145,7 @@ template <typename T> bool DBReader<T>::open(int accessType){
}
char* lookupDataChar = (char *) lookupData.getData();
size_t lookupDataSize = lookupData.size();
Debug(Debug::INFO) << "Lookup Data size is " << lookupDataSize << "\n"; //gyuri
lookupSize = Util::ompCountLines(lookupDataChar, lookupDataSize, threads);
Debug(Debug::INFO) << "Lookup size is " << lookupSize << "\n"; //gyuri
lookup = new(std::nothrow) LookupEntry[this->lookupSize];
incrementMemory(sizeof(LookupEntry) * this->lookupSize);
readLookup(lookupDataChar, lookupDataSize, lookup);
Expand Down Expand Up @@ -1025,19 +1021,6 @@ size_t DBReader<T>::getOffset(size_t id) {
return index[id].offset;
}

template<typename T>
size_t DBReader<T>::getIndexLen(size_t id) {
if (id >= size){
Debug(Debug::ERROR) << "Invalid database read for id=" << id << ", database index=" << indexFileName << "\n";
Debug(Debug::ERROR) << "getOffset: local id (" << id << ") >= db size (" << size << ")\n";
EXIT(EXIT_FAILURE);
}
if (local2id != NULL) {
id = local2id[id];
}
return index[id].length;
}

template<typename T>
size_t DBReader<T>::findNextOffsetid(size_t id) {
size_t idOffset = getOffset(id);
Expand Down
4 changes: 1 addition & 3 deletions src/commons/DBReader.h
Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,7 @@ class DBReader : public MemoryTracker {

size_t getSize() const;

unsigned int getProteomeTotalLen(size_t id); //gyuri
unsigned int getProteomeTotalLen(size_t id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would name this getSourceTotalLen or getSetTotalLen to be consistent with the .source naming


unsigned int getMaxSeqLen(){
return (Parameters::isEqualDbtype(dbtype, Parameters::DBTYPE_HMM_PROFILE ) ) ?
Expand Down Expand Up @@ -446,8 +446,6 @@ class DBReader : public MemoryTracker {

size_t findNextOffsetid(size_t id);

size_t getIndexLen(size_t id);

int isCompressed(){
return isCompressed(dbtype);
}
Expand Down
1 change: 0 additions & 1 deletion src/commons/DBWriter.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -283,7 +283,6 @@ size_t DBWriter::writeAdd(const char* data, size_t dataSize, unsigned int thrIdx
}
size_t totalWriten = 0;
if(isCompressedDB && (state[thrIdx] == INIT_STATE || state[thrIdx] == COMPRESSED) ) {
Debug(Debug::INFO) << "Compressing data in thread " << thrIdx << "\n"; //gyuri
state[thrIdx] = COMPRESSED;
// zstd seems to have a hard time with elements < 60
ZSTD_inBuffer input = { data, dataSize, 0 };
Expand Down
6 changes: 5 additions & 1 deletion src/commons/Parameters.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1349,6 +1349,10 @@ Parameters::Parameters():
// easylinclustworkflow
easylinclustworkflow = combineList(linclustworkflow, createdb);

// easyalignproteomeworkflow
easyalignproteome = combineList(easylinclustworkflow, alignproteome);


// clustering workflow
clusterworkflow = combineList(prefilter, align);
clusterworkflow = combineList(clusterworkflow, rescorediagonal);
Expand Down Expand Up @@ -2353,7 +2357,7 @@ void Parameters::setDefaults() {
maxRejected = INT_MAX;
maxAccept = INT_MAX;
seqIdThr = 0.0;
proteomeSimThr = 0.0;
proteomeSimThr = 0.9;
alnLenThr = 0;
altAlignment = 0;
gapOpen = MultiParam<NuclAA<int>>(NuclAA<int>(11, 5));
Expand Down
1 change: 1 addition & 0 deletions src/commons/Parameters.h
Original file line number Diff line number Diff line change
Expand Up @@ -1118,6 +1118,7 @@ class Parameters {
std::vector<MMseqsParameter*> kmersearch;
std::vector<MMseqsParameter*> countkmer;
std::vector<MMseqsParameter*> easylinclustworkflow;
std::vector<MMseqsParameter*> easyalignproteome;
std::vector<MMseqsParameter*> linclustworkflow;
std::vector<MMseqsParameter*> easysearchworkflow;
std::vector<MMseqsParameter*> searchworkflow;
Expand Down
2 changes: 0 additions & 2 deletions src/util/alignall.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,6 @@ int alignall(int argc, const char **argv, const Command &command) {
}
unsigned int swMode = Alignment::initSWMode(par.alignmentMode, par.covThr, par.seqIdThr);

// DBReader<unsigned int> tdbr(par.db1.c_str(), par.db1Index.c_str(), par.threads, DBReader<unsigned int>::USE_DATA|DBReader<unsigned int>::USE_INDEX);
DBReader<unsigned int> tdbr(par.db1.c_str(), par.db1Index.c_str(), par.threads, DBReader<unsigned int>::USE_DATA|DBReader<unsigned int>::USE_INDEX|DBReader<unsigned int>::USE_LOOKUP);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is not needed anymore


tdbr.open(DBReader<unsigned int>::NOSORT);
Expand Down Expand Up @@ -58,7 +57,6 @@ int alignall(int argc, const char **argv, const Command &command) {
EvalueComputation evaluer(tdbr.getAminoAcidDBSize(), subMat, gapOpen, gapExtend);
const size_t flushSize = 100000000;
size_t iterations = static_cast<int>(ceil(static_cast<double>(dbr_res.getSize()) / static_cast<double>(flushSize)));
Debug(Debug::INFO) << "Number of iterations: " << iterations << " Size of linclust dbr_res : " << dbr_res.getSize() << '\n';

for (size_t i = 0; i < iterations; i++) {
size_t start = (i * flushSize);
Expand Down
10 changes: 5 additions & 5 deletions src/util/alignproteome.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ struct __attribute__((__packed__)) ProteomeEntry{
totalSeqId += seqId;
}

void addSeqLength(unsigned int seqLength) {
void addSeqLen(unsigned int seqLength) {
proteomeAALen += seqLength;
}

Expand All @@ -64,14 +64,14 @@ struct __attribute__((__packed__)) ProteomeEntry{
}
};

struct __attribute__((__packed__)) memberProteinEntry{
struct __attribute__((__packed__)) MemberProteinEntry{
unsigned int proteomeKey;
unsigned int proteinKey;
};

struct ClusterEntry {
bool isAvailable;
std::vector<memberProteinEntry> memberProteins;
std::vector<MemberProteinEntry> memberProteins;

ClusterEntry() : isAvailable(false) {}

Expand All @@ -89,7 +89,7 @@ void calculateProteomeLength(std::vector<ProteomeEntry>& ProteomeList, DBReader<
for (size_t i = 0; i < lookupSize; i++) {
const unsigned int ProteomeId = lookup[i].fileNumber;
const unsigned int ProteinId = lookup[i].id;
ProteomeList[ProteomeId].addSeqLength(tProteinDB.getIndexLen(ProteinId) - 2);
ProteomeList[ProteomeId].addSeqLen(tProteinDB.getSeqLen(ProteinId));
}
}

Expand All @@ -110,7 +110,7 @@ void initLocalClusterReps(size_t& id, std::vector<ClusterEntry>& localClusterRep
for (auto& eachMemberKey : memberKeys){
const unsigned int proteinId = tProteinDB.getId(eachMemberKey);
const unsigned int proteomeKey = tProteinDB.getLookupFileNumber(proteinId);
memberProteinEntry mem;
MemberProteinEntry mem;
mem.proteomeKey = proteomeKey;
mem.proteinKey = proteinId;
eachClusterRep.memberProteins.push_back(mem);
Expand Down
1 change: 1 addition & 0 deletions src/workflow/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ set(workflow_source_files
workflow/Search.cpp
workflow/Taxonomy.cpp
workflow/EasyTaxonomy.cpp
workflow/EasyAlignproteome.cpp
workflow/CreateIndex.cpp
PARENT_SCOPE
)
41 changes: 41 additions & 0 deletions src/workflow/EasyAlignproteome.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
#include <cassert>

#include "FileUtil.h"
#include "CommandCaller.h"
#include "Util.h"
#include "Debug.h"
#include "Parameters.h"

#include "easyalignproteome.sh.h"

void setEasyAlignproteomeDefaults(Parameters *p) {
p->proteomeSimThr = 0.9;
}

// void setEasyAlignproteomeMustPassAlong(Parameters *p){

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to implement this whenever you use the createParameterString(..., true) parameter. Or it won't get passed along.

// }

int easyalignproteome(int argc, const char **argv, const Command &command) {
Parameters &par = Parameters::getInstance();
setEasyAlignproteomeDefaults(&par);
par.parseParameters(argc, argv, command, true, Parameters::PARSE_VARIADIC, 0);
// setEasyAlignproteomeMustPassAlong(&par);
std::string tmpDir = par.filenames.back();
std::string hash = SSTR(par.hashParameter(command.databases, par.filenames, *command.params));
if (par.reuseLatest) {
hash = FileUtil::getHashFromSymLink(tmpDir + "/latest");
}
tmpDir = FileUtil::createTemporaryDirectory(tmpDir, hash);
par.filenames.pop_back();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to pop_back() the output path too and pass it as an environment variable. Or else it will also be passed to createdb.


CommandCaller cmd;
cmd.addVariable("ALIGNPROTEOME_PAR", par.createParameterString(par.alignproteome,true).c_str()); // what?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLUSTER_PAR, THREADS_PAR etc are missing

std::string program = tmpDir + "/easyalignproteome.sh";
FileUtil::writeFile(program, easyalignproteome_sh, easyalignproteome_sh_len);
cmd.execProgram(program.c_str(), par.filenames);

// Should never get here
assert(false);
return 0;
}
Loading