add kaiju #690

Open
pmenzel opened this issue Mar 12, 2024 · 11 comments

@pmenzel

pmenzel commented Mar 12, 2024

Hi!

I was wondering if it is possible to make kaiju available for public usage on usegalaxy.eu.
The current public web server for kaiju is always overloaded and will likely be out of service soon, so it would be very nice to make the program available at usegalaxy.eu. Similar tools like kraken2 are already there.

The main computational cost is the high memory requirement, which depends on the reference database used and might be prohibitive.
See the kaiju download page for the memory requirements: as of June 2023, they range from 49GB for the smallest reference database to 204GB for the most comprehensive one.

If it is possible to add kaiju to the service, @pvanheus suggested adding it to the list of tools at https://github.com/galaxyproject/tools-iuc, which I could try to do, though I would probably need some help.

Thanks!

@pvanheus
Contributor

@pmenzel as mentioned on Slack, first there needs to be a Galaxy wrapper for kaiju and also a data manager to manage its databases. Input from the usegalaxy.eu admins on those resource requirements would be useful, though, to get some idea of whether it could be accommodated if a wrapper was available.

@bgruening
Member

@pmenzel we can help you get this tool into Galaxy. Do you know if the tool and its database will be maintained when they shut down their server? Maybe they are interested in helping you as well?

Here are a few links and some useful literature to check out for Galaxy tool and workflow development:

  1. please use Visual Studio Code with this extension: https://github.com/galaxyproject/galaxy-language-server
  2. this is the Galaxy SDK with a lot of documentation: https://planemo.readthedocs.io/
  3. there is also https://docs.galaxyproject.org/en/latest/dev/schema.html <-- which is already integrated into the VSC extension above, but it might be nice to read, browse, search
  4. there is also the Best practice guide: https://galaxy-iuc-standards.readthedocs.io <-- which is already integrated into the VSC extension above, but it might be nice to read, browse, search

A complete (4h) tutorial with everything you need is available in the planemo docs. It will save you a lot of time later, so please do this tutorial: https://planemo.readthedocs.io/en/latest/writing.html

If you want to read a publication about planemo have a look here: https://genome.cshlp.org/content/33/2/261.long

Last but not least there is this short blog post from David, showing the 3 major steps that you need to do if you want to get your tool into the European Galaxy server: https://usegalaxy-eu.github.io/posts/2020/08/22/three-steps-to-galaxify-your-tool/

@pmenzel
Author

pmenzel commented Mar 12, 2024

@pvanheus Yes, exactly. Given the high resource requirements, I would need to know if it is possible at all before starting to make a wrapper. @bgruening and other admins, what do you think?

@bgruening I am the only maintainer and I plan to keep updating the downloadable reference databases once per year as before.

@bgruening
Member

Ah I see! :)
Yes, we can run this tool on our infrastructure. If you could help us predict the memory requirement of a job, that would help as well.

Thanks, let us know if you need any help.

@pmenzel
Author

pmenzel commented Mar 12, 2024

The memory requirement comes from loading the reference index and does not depend on the size of input fastq/fasta files, so it is easy to predict. :) It currently ranges from 49GB to 204GB depending on the reference database (numbers from June 2023).

However, as sequence databases grow, these numbers will continue to increase and might well reach 230GB this year (I will build the 2024 databases in the summer again).

From the CPU perspective, 10 parallel threads are enough for the program to chug along.
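
A minimal sketch of how that prediction could look on the scheduling side, assuming the published per-database memory figure is the only input. The function name `estimate_job_resources` and the 10% `headroom` margin are hypothetical illustrations, not part of kaiju or Galaxy:

```python
import math


def estimate_job_resources(db_memory_gb: float,
                           headroom: float = 0.10,
                           threads: int = 10) -> dict:
    """Estimate the resources to request for one kaiju job.

    db_memory_gb: memory requirement published for the chosen reference
                  database (49 to 204 as of June 2023).
    headroom:     hypothetical safety margin on top of that figure.
    threads:      10 parallel threads are reported to be sufficient.

    The size of the input fastq/fasta files is deliberately not a
    parameter, because memory use comes from loading the reference index.
    """
    if db_memory_gb <= 0:
        raise ValueError("db_memory_gb must be positive")
    if headroom < 0:
        raise ValueError("headroom must not be negative")
    # Round up so the request never falls below the estimate.
    mem_gb = math.ceil(db_memory_gb * (1 + headroom))
    return {"mem_gb": mem_gb, "threads": threads}


if __name__ == "__main__":
    # Smallest and most comprehensive databases from the June 2023 figures.
    for db_memory_gb in (49, 204):
        print(db_memory_gb, estimate_job_resources(db_memory_gb))
```

Since the figure is fixed per database, the request could be looked up once per database release rather than computed per job.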

@bgruening
Member

That all looks ok to me!

@pmenzel
Author

pmenzel commented Mar 12, 2024

Great!

For many users, it's not possible to run kaiju on their own hardware due to the large RAM requirement, so it would be really nice to have it available as a web service on usegalaxy.eu!

@bgruening
Member

@pmenzel do you need any help?

@pmenzel
Author

pmenzel commented Jul 8, 2024

@bgruening Unfortunately, I haven't gotten around to delving into this yet. If the open issue is a bother, we can close it for now and I will comment again once I have started working on it.

@bgruening
Member

It's not a bother, just keep us updated :) Thanks a lot!

@mycojon

mycojon commented Sep 26, 2024

I would be VERY interested to see the KAIJU program on the Galaxy EU site. I find that it can classify species that aren't identified by nucleotide mapping. So if I question results, I can run them on KAIJU and compare results. If I find a species on KAIJU, I can usually find it in my sequences.

I have had only two issues regarding the databases used by KAIJU, and it's the same issue on Kraken2, etc. I have three species in the samples I'm running that aren't represented.

Mycobacterium - (Mycobacterium 1100029.7) It's not recognized by Kraken2, as it's not in the database. KAIJU catches it as Tuberculosis, as I believe one of its genes is the same as in Tuberculosis.

Plasmodium (Plasmodium Ovale Wallikeri and Plasmodium Ovale Curtisi) BOTH of these are human pathogens, yet almost none of the available databases include them. The NCBI has them, so they show up in the KAIJU results, but they are not in the Kraken2 databases, so Kraken2 doesn't even recognize them as Plasmodium. However, new references were recently published: POW222 (Plasmodium Ovale Wallikeri) and POC221 (Plasmodium Ovale Curtisi).

I was using KAIJU through Kbase, but its use on Galaxy would be much better.

Thanks!!
