Standardize Data Manager filesystem layout #19013
-
This all seems good to me. Thanks for the thoughtful documentation and detailed write-up. If we're introducing more structure, I would really love to capture it in a way that Pulsar can consume and reason about (e.g. https://github.com/galaxyproject/pulsar/blob/master/docs/files/file_actions_sample_1.yaml#L18). We haven't had to use unstructured path actions because we happen to use CVMFS on all our clients, but that is a high bar, and you can imagine much more restricted clients might not allow it. It would be lovely if only the job file server needed the mount and we provided a default list of actions that worked for all DM-generated data. It is just globs and expected paths, right? We could use the same list to provide something like a linter to ensure that all the IDC-tracked files are things we expect. We could get cleaner, more structured, better-documented IDC contents while also making the data more accessible. Can I get you on board? I would be happy to expand the tool action syntax to catch things more exactly, write the file linter, extend the syntax to include stuff we might want to provide in the web version, etc., but I think I need your buy-in and commitment to the use cases.
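To make the linter idea concrete, here is a minimal sketch in Python. It is only an illustration: the `EXPECTED_GLOBS` list and the `lint_paths` helper are hypothetical names, the patterns are loosely based on the layout proposed in this thread, and nothing here reproduces Pulsar's actual path-action syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical allow-list of globs, relative to tool_data_path.
# Note that fnmatch's "*" also matches "/", so these patterns are
# deliberately loose; a real linter would likely want stricter matching.
EXPECTED_GLOBS = [
    "genomes/*/seq/*.fa",
    "genomes/*/len/*.len",
    "genomes/*/*_index/v*/*",
]


def lint_paths(relative_paths, patterns=EXPECTED_GLOBS):
    """Return the paths that match none of the expected patterns."""
    return [
        path
        for path in relative_paths
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    tracked = [
        "genomes/hg38/seq/hg38.fa",
        "genomes/hg38/len/hg38.len",
        "hg38/stray.txt",
    ]
    for unexpected in lint_paths(tracked):
        print(f"unexpected file: {unexpected}")
```

The same pattern list could, in principle, feed both the linter and a default set of transfer actions, which is the appeal of keeping it as plain data.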
-
Has our plan of using normal datasets been abandoned at this point? Is that going to be parallel work? Or is this really just about how you want to organize CVMFS?
-
Related issue: some DMs symlink the reference genome and then leave the symlink in place upon completion, even when the tool does not actually need the reference genome to function. When running DMs in normal mode this is fine, but in bundle mode it duplicates the reference genome for every indexer DM. Additionally, importing those bundles to another server makes a copy of the reference genome for every imported bundle (which can also be quite slow for large genomes). So I propose that we remove the symlink from bowtie 1/2 and bwa-mem 1/2 on tool completion, after I verify it is not needed for any of them.
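As a rough sketch of the cleanup step, assuming the symlink sits directly inside the index output directory: the `remove_symlinks` helper below is hypothetical and is not taken from any existing DM, where this would more likely be a line in the tool's own command block.

```python
from pathlib import Path


def remove_symlinks(index_dir):
    """Delete symlinks directly inside index_dir and return their names.

    Only the links themselves are removed; the files they point to
    (e.g. the reference genome) are left untouched.
    """
    removed = []
    for entry in sorted(Path(index_dir).iterdir()):
        if entry.is_symlink():
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Calling this after the indexer finishes would leave only real index files in the directory, so a bundle built from it would not carry a copy of the reference genome.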
-
I am in the process of fetching and indexing nearly a thousand genomes for a project and will be doing the same for additional projects, and would like to apply the lessons learned to finally get the IDC moving again. We're also discussing how a website would render the available data in a more user-friendly format and construct links to it. The biggest question at the moment is the data layout.
Current state of affairs
Where data are stored is entirely at the discretion of the DM and varies considerably:
{tool_data_path}/{dbkey}/{dm_name}_index/seq/{value}.fa
{tool_data_path}/{dbkey}/{dm_name}_index/len/{value}.len
{tool_data_path}/{dbkey}/{dm_name}_index/{value}/{value}.*
{tool_data_path}/{dbkey}/{dm_name}_index/{value}.*
{tool_data_path}/{star_version}/{dbkey}/{value}/{dataset_id}/*
Where dbkey is the build identifier (typically from UCSC) and value is the potential variant build ID of a dbkey (e.g. the female build of hg38), but typically value == dbkey.
As a tree this looks like:
Thus there is a lot of inconsistency in the layout. In addition, non-genomic DMs typically use a DM-named subdirectory at the root of tool_data_path the same way that rnastar does (e.g. {tool_data_path}/kraken2_databases), but because the other genomic indexers do not, the root of tool_data_path is littered with a mixture of genome directories and non-genomic DM directories.
Proposal
IMO keeping genomic indexes together under the dbkey is a useful construct for browsing e.g. on datacache.galaxyproject.org, so I propose the following changes to DMs:
- {tool_data_path}/genomes/{dbkey}/. Inside this dir:
  - seq/{value}.fa and chrom lengths at len/{value}.len (as before)
  - {unversioned_dm_name}_index/v{version}/{dbkey}/{value}/ where, for DMs that don't have an internal concept of versions, version is 1 for bowtie1, 2 for bowtie2, etc.
- {tool_data_path}/{dm_name}/
  - v{version}/ if the DM is versioned, else v1?
Here is what that layout looks like in practice:
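To illustrate the proposed scheme (this is not the original listing), the path templates can be written out as small Python helpers. The function names are hypothetical, and defaulting unversioned non-genomic DMs to v1 is still an open question in the proposal.

```python
from pathlib import PurePosixPath


def sequence_path(tool_data_path, dbkey, value=None):
    """{tool_data_path}/genomes/{dbkey}/seq/{value}.fa"""
    value = value or dbkey  # value usually equals dbkey
    return PurePosixPath(tool_data_path) / "genomes" / dbkey / "seq" / f"{value}.fa"


def genomic_index_dir(tool_data_path, dbkey, unversioned_dm_name, version, value=None):
    """{tool_data_path}/genomes/{dbkey}/{unversioned_dm_name}_index/v{version}/{dbkey}/{value}"""
    value = value or dbkey
    return (
        PurePosixPath(tool_data_path)
        / "genomes"
        / dbkey
        / f"{unversioned_dm_name}_index"
        / f"v{version}"
        / dbkey
        / value
    )


def non_genomic_dir(tool_data_path, dm_name, version=1):
    """{tool_data_path}/{dm_name}/v{version}; v1 is an assumed default."""
    return PurePosixPath(tool_data_path) / dm_name / f"v{version}"


if __name__ == "__main__":
    print(sequence_path("/data", "hg38"))
    print(genomic_index_dir("/data", "hg38", "bowtie", 2))
    print(non_genomic_dir("/data", "kraken2_databases"))
```

Keeping the templates in one place like this would also make it easier for a website or linter to construct and check links against the same rules.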
Caveat
There is the question of what to do with old data and existing servers. Essentially, all data built using old DMs will remain at the old paths; only data installed by updated DMs will be placed under the new layout. Admins can choose either to leave everything as it is or to move it to the new structure (which I will invariably end up writing a script to do). However, if old and new DMs are mixed, you could end up with some dirs containing mixtures of old and new layouts (primarily under non-genomic DMs, since all genomic DMs are moved to {tool_data_path}/genomes/). Thus the recommendation to admins would be to change tool_data_path to a clean dir before running the new DMs. Galaxy will continue to find data at the old paths thanks to the existing entries in shed_tool_data_table_conf.xml.