Standardize Data Manager filesystem layout #19013

natefoo · 2024-10-16T18:38:54Z

natefoo
Oct 16, 2024
Maintainer

I am in the process of fetching and indexing nearly a thousand genomes for a project and will be doing the same for additional projects, and would like to apply the lessons learned to finally get the IDC moving again. We're also discussing how a website would render the available data in a more user-friendly format and construct links to it. The biggest question at the moment is the data layout.

Current state of affairs

Where data are stored is entirely at the discretion of the DM and varies considerably:

DM Name	Data Path
fetch (fasta)	`{tool_data_path}/{dbkey}/seq/{value}.fa`
fetch (dbkey)	`{tool_data_path}/{dbkey}/len/{value}.len`
bowtie2, bwa_mem, bwa_mem2, hisat2	`{tool_data_path}/{dbkey}/{dm_name}_index/{value}/{value}.*`
bowtie1	`{tool_data_path}/{dbkey}/{dm_name}_index/{value}.*`
rnastar	`{tool_data_path}/rnastar/{star_version}/{dbkey}/{value}/{dataset_id}/*`

Here, dbkey is the build identifier (typically from UCSC) and value identifies a variant build of that dbkey (e.g. the female build of hg38); typically value == dbkey.

As a tree this looks like:

/cvmfs/brc.galaxyproject.org
├── config
│   ├── all_fasta.loc
│   ├── bowtie2_indices.loc
│   ├── bowtie_indices.loc
│   ├── bwa_mem2_index.loc
│   ├── bwa_mem_index.loc
│   ├── dbkeys.loc
│   ├── hisat2_indexes.loc
│   ├── lastz_seqs.loc
│   ├── rnastar_index2x_versioned.loc
│   └── twobit.loc
└── data
    ├── GCA_013358835.2
    │   ├── bowtie2_index
    │   │   └── GCA_013358835.2
    │   │       ├── Bowtie2_index_data_manager_json.html
    │   │       ├── GCA_013358835.2.1.bt2
    │   │       ├── GCA_013358835.2.2.bt2
    │   │       ├── GCA_013358835.2.3.bt2
    │   │       ├── GCA_013358835.2.4.bt2
    │   │       ├── GCA_013358835.2.fa
    │   │       ├── GCA_013358835.2.rev.1.bt2
    │   │       ├── GCA_013358835.2.rev.2.bt2
    │   │       └── _gx_data_bundle_index.json
    │   ├── bowtie_index
    │   │   ├── Bowtie_index_data_manager_json.html
    │   │   ├── GCA_013358835.2.fa
    │   │   ├── GCA_013358835.2.fa.1.ebwt
    │   │   ├── GCA_013358835.2.fa.2.ebwt
    │   │   ├── GCA_013358835.2.fa.3.ebwt
    │   │   ├── GCA_013358835.2.fa.4.ebwt
    │   │   ├── GCA_013358835.2.fa.rev.1.ebwt
    │   │   ├── GCA_013358835.2.fa.rev.2.ebwt
    │   │   └── _gx_data_bundle_index.json
    │   ├── bwa_mem2_index
    │   │   └── GCA_013358835.2
    │   │       ├── Build_BWA-MEM2_indexes_data_manager_json.html
    │   │       ├── GCA_013358835.2.fa
    │   │       ├── GCA_013358835.2.fa.0123
    │   │       ├── GCA_013358835.2.fa.amb
    │   │       ├── GCA_013358835.2.fa.ann
    │   │       ├── GCA_013358835.2.fa.bwt.2bit.64
    │   │       ├── GCA_013358835.2.fa.pac
    │   │       └── _gx_data_bundle_index.json
    │   ├── bwa_mem_index
    │   │   └── GCA_013358835.2
    │   │       ├── BWA-MEM_index_data_manager_json.html
    │   │       ├── GCA_013358835.2.fa
    │   │       ├── GCA_013358835.2.fa.amb
    │   │       ├── GCA_013358835.2.fa.ann
    │   │       ├── GCA_013358835.2.fa.bwt
    │   │       ├── GCA_013358835.2.fa.pac
    │   │       ├── GCA_013358835.2.fa.sa
    │   │       └── _gx_data_bundle_index.json
    │   ├── hisat2_index
    │   │   └── GCA_013358835.2
    │   │       ├── GCA_013358835.2.1.ht2
    │   │       ├── GCA_013358835.2.2.ht2
    │   │       ├── GCA_013358835.2.3.ht2
    │   │       ├── GCA_013358835.2.4.ht2
    │   │       ├── GCA_013358835.2.5.ht2
    │   │       ├── GCA_013358835.2.6.ht2
    │   │       ├── GCA_013358835.2.7.ht2
    │   │       ├── GCA_013358835.2.8.ht2
    │   │       ├── GCA_013358835.2.fa
    │   │       ├── HISAT2_index_data_manager_json.html
    │   │       └── _gx_data_bundle_index.json
    │   ├── len
    │   │   └── GCA_013358835.2.len
    │   └── seq
    │       ├── GCA_013358835.2.2bit
    │       └── GCA_013358835.2.fa
    └── rnastar
        └── 2.7.4a
            └── GCA_013358835.2
                └── GCA_013358835.2
                    └── dataset_dd0a61dc-202c-4aaf-ab56-4af6fb8a9ea3_files
                        ├── Genome
                        ├── SA
                        ├── SAindex
                        ├── _gx_data_bundle_index.json
                        ├── chrLength.txt
                        ├── chrName.txt
                        ├── chrNameLength.txt
                        ├── chrStart.txt
                        ├── genomeParameters.txt
                        └── rnastar_index_versioned_data_manager_json.html

Thus there is a lot of inconsistency in the layout. In addition, non-genomic DMs typically use a DM-named subdirectory at the root of tool_data_path the same way that rnastar does (e.g. {tool_data_path}/kraken2_databases), but because the other genomic indexers do not, the root of tool_data_path is littered with a mixture of genome directories and non-genomic DM directories.
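
To make the inconsistency concrete, here is a minimal Python sketch that expands the layouts seen in the tree above for a single build. The template strings are transcribed from that tree (relative to tool_data_path); the dictionary keys and the function names `expand_current` and `top_level_dirs` are illustrative, not Galaxy identifiers.

```python
# Layouts transcribed from the tree above, relative to tool_data_path.
# Keys are illustrative labels only.
CURRENT_LAYOUT = {
    "fetch (fasta)": "{dbkey}/seq/{value}.fa",
    "fetch (len)": "{dbkey}/len/{value}.len",
    "bowtie2": "{dbkey}/bowtie2_index/{value}/{value}.*",
    "bowtie1": "{dbkey}/bowtie_index/{value}.*",
    "rnastar": "rnastar/{star_version}/{dbkey}/{value}/{dataset_dir}/*",
}


def expand_current(dbkey, value=None, star_version="2.7.4a",
                   dataset_dir="dataset_<uuid>_files"):
    """Fill in each template for one build; value defaults to dbkey."""
    value = value or dbkey
    return {
        name: template.format(dbkey=dbkey, value=value,
                              star_version=star_version,
                              dataset_dir=dataset_dir)
        for name, template in CURRENT_LAYOUT.items()
    }


def top_level_dirs(paths):
    """Return the distinct first path components under tool_data_path."""
    return sorted({path.split("/", 1)[0] for path in paths.values()})


if __name__ == "__main__":
    expanded = expand_current("GCA_013358835.2")
    for name, path in expanded.items():
        print(f"{name:15} {path}")
    print(top_level_dirs(expanded))
```

For one build this yields two different top-level directories (the dbkey itself and `rnastar`), and bowtie1 is the only indexer without a per-value subdirectory.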

Proposal

IMO keeping genomic indexes together under the dbkey is a useful construct for browsing e.g. on datacache.galaxyproject.org, so I propose the following changes to DMs:

All genomic DMs (fetch, bowtie*, bwa*, hisat2, rnastar) should store data in {tool_data_path}/genomes/{dbkey}/. Inside this dir:
1. The fetch (sequence, dbkey) DM should store the sequence at seq/{value}.fa and chrom lengths at len/{value}.len (as before)
2. Indexer DMs (bowtie*, bwa*, hisat2, rnastar) should store data at {unversioned_dm_name}_index/v{version}/{value}/. For DMs that don't have an internal concept of versions, version is 1 for bowtie1, 2 for bowtie2, etc.
All non-genomic DMs should store data in {tool_data_path}/{dm_name}/
1. The next directory level should be the version, v{version}/, if the DM is versioned; otherwise v1?
2. Anything underneath this point is probably DM-specific and can't be dictated.

Here is what that layout looks like in practice:

{tool_data_path}/genomes/{dbkey}/bowtie_index/v1/{value}/
{tool_data_path}/genomes/{dbkey}/bowtie_index/v2/{value}/
{tool_data_path}/genomes/{dbkey}/bwa_mem_index/v1/{value}/
{tool_data_path}/genomes/{dbkey}/bwa_mem_index/v2/{value}/
{tool_data_path}/genomes/{dbkey}/hisat_index/v2/{value}/
{tool_data_path}/genomes/{dbkey}/rnastar_index/v2.7.4a/{value}/
{tool_data_path}/kraken_databases/v1/...
{tool_data_path}/kraken_databases/v2/...
{tool_data_path}/busco_databases/v5/...
{tool_data_path}/bakta_databases/v1/...
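
The proposed layout can be sketched as a few path-building helpers. This is a minimal illustration, assuming the directory names listed above; the function names (`genomic_index_dir`, `fetch_paths`, `non_genomic_dir`) are hypothetical and not part of Galaxy or any DM.

```python
from pathlib import PurePosixPath


def genomic_index_dir(tool_data_path, dbkey, unversioned_dm_name, version, value=None):
    """Directory for an indexer DM under the proposed layout.

    value defaults to dbkey, matching the common case value == dbkey.
    """
    value = value or dbkey
    return (PurePosixPath(tool_data_path) / "genomes" / dbkey
            / f"{unversioned_dm_name}_index" / f"v{version}" / value)


def fetch_paths(tool_data_path, dbkey, value=None):
    """Sequence and chrom-length paths written by the fetch DM."""
    value = value or dbkey
    base = PurePosixPath(tool_data_path) / "genomes" / dbkey
    return base / "seq" / f"{value}.fa", base / "len" / f"{value}.len"


def non_genomic_dir(tool_data_path, dm_name, version=1):
    """Versioned root for a non-genomic DM; v1 is assumed when unversioned."""
    return PurePosixPath(tool_data_path) / dm_name / f"v{version}"


if __name__ == "__main__":
    print(genomic_index_dir("/data", "hg38", "bowtie", 2))
    print(genomic_index_dir("/data", "hg38", "rnastar", "2.7.4a"))
    print(non_genomic_dir("/data", "kraken_databases", 2))
```

Everything below the directories returned here would remain DM-specific.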

Caveat

There is the question of what to do with old data and existing servers. Essentially, all data built using old DMs will remain at the old paths; only data installed by updated DMs will be placed under the new layout. Admins can choose either to leave everything as it is or to move it to the new structure (which I will inevitably end up writing a script to do). However, if old and new DMs are mixed, you could end up with some dirs containing mixtures of old and new layouts (primarily under non-genomic DMs, since all genomic DMs are moved to {tool_data_path}/genomes/). Thus the recommendation to admins would be to change tool_data_path to a clean dir before running the new DMs. Galaxy will continue to find data at the old paths thanks to the existing entries in shed_tool_data_table_conf.xml.
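
As a rough idea of what such a migration script might start from, here is a dry-run sketch that only computes where an old path would land under the new layout. It is an assumption-laden illustration, not the eventual script: the `OLD_TO_NEW` table and `plan_move` are hypothetical, bowtie1 and rnastar are deliberately skipped because their old layouts differ (no per-value directory, and a root outside the dbkey, respectively), and nothing here updates the .loc entries that would also need to point at the new locations.

```python
from pathlib import PurePosixPath

# Old directory name -> (unversioned name, version), following the proposal's
# convention. Assumed mapping; bowtie1 and rnastar need special handling.
OLD_TO_NEW = {
    "bowtie2_index": ("bowtie_index", "2"),
    "bwa_mem_index": ("bwa_mem_index", "1"),
    "bwa_mem2_index": ("bwa_mem_index", "2"),
    "hisat2_index": ("hisat_index", "2"),
}


def plan_move(old_rel):
    """Map an old path (relative to tool_data_path) to its proposed location.

    Returns None for anything this sketch does not know how to place.
    """
    parts = PurePosixPath(old_rel).parts
    if len(parts) < 3:
        return None
    dbkey, second, rest = parts[0], parts[1], parts[2:]
    if second in ("seq", "len"):
        return str(PurePosixPath("genomes", dbkey, second, *rest))
    if second in OLD_TO_NEW:
        new_dm, version = OLD_TO_NEW[second]
        return str(PurePosixPath("genomes", dbkey, new_dm, f"v{version}", *rest))
    return None


if __name__ == "__main__":
    for old in ("hg38/bowtie2_index/hg38/hg38.1.bt2", "hg38/seq/hg38.fa"):
        print(old, "->", plan_move(old))
```

Printing the plan before touching any files makes it easy to review which paths would be left behind.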

jmchilton · 2024-10-17T14:31:04Z

jmchilton
Oct 17, 2024
Maintainer

This all seems good to me. Thanks for the thoughtful documentation and detailed write-up. If we're introducing more structure, I would really love to capture it in a way that Pulsar can consume and reason about (e.g. https://github.com/galaxyproject/pulsar/blob/master/docs/files/file_actions_sample_1.yaml#L18). We haven't had to use unstructured path actions because we happen to use CVMFS on all our clients, but that is a high bar and you can imagine that much more restricted clients might not allow it. It would be lovely if only the job file server needed the mount and we provided a default list of actions that worked for all DM-generated data. It is just globs and expected paths, right? We could use the same list to provide something like a linter to ensure that all the IDC-tracked files are things we expect. We could get cleaner, more structured, better-documented IDC contents while also making the data more accessible. Can I get you on board? I would be happy to expand the tool action syntax to catch things more exactly, write the file linter, extend the syntax to include stuff we might want to provide in the web version, etc., but I think I need your buy-in and commitment to the use cases.
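
The "globs and expected paths" linter idea could look roughly like the sketch below. It assumes the proposed genomes/ layout and a made-up pattern list; `EXPECTED_PATTERNS`, `matches`, and `unexpected_files` are hypothetical names, and this is not Pulsar's file action syntax. Matching is done per path segment because `fnmatch`'s `*` would otherwise also match `/`.

```python
from fnmatch import fnmatchcase

# Hypothetical glob list for the proposed genomic layout, relative to
# tool_data_path. A real list would be maintained alongside the DMs.
EXPECTED_PATTERNS = [
    "genomes/*/seq/*.fa",
    "genomes/*/len/*.len",
    "genomes/*/*_index/v*/*/*",
]


def matches(path, pattern):
    """Match segment by segment so that '*' never spans a '/'."""
    path_parts, pattern_parts = path.split("/"), pattern.split("/")
    return len(path_parts) == len(pattern_parts) and all(
        fnmatchcase(part, glob) for part, glob in zip(path_parts, pattern_parts)
    )


def unexpected_files(paths, patterns=EXPECTED_PATTERNS):
    """Return the paths that no expected pattern accounts for."""
    return [p for p in paths if not any(matches(p, pat) for pat in patterns)]


if __name__ == "__main__":
    tracked = [
        "genomes/hg38/seq/hg38.fa",
        "genomes/hg38/bowtie_index/v2/hg38/hg38.1.bt2",
        "genomes/hg38/notes.txt",
        "hg38/seq/hg38.fa",
    ]
    for path in unexpected_files(tracked):
        print("unexpected:", path)
```

The same pattern list could in principle feed both the linter and a default set of path actions, which is the reuse suggested above.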

1 reply

natefoo Oct 17, 2024
Maintainer Author

I'm on board... what do you need me to do? 😄

mvdbeek · 2024-10-17T14:43:54Z

mvdbeek
Oct 17, 2024
Maintainer

Has our plan of using normal datasets been abandoned at this point? Is that going to be parallel work? Or is this really just about how you want to organize CVMFS?

9 replies

bgruening Oct 17, 2024
Maintainer

I like that as well, but imho this will not replace general DMs, unfortunately. We have a lot of DMs that download large chunks of stuff, and that is unfortunately not cacheable as far as I understand.

mvdbeek Oct 17, 2024
Maintainer

Can you be more specific? Reference data is nothing but a cache.

natefoo Oct 17, 2024
Maintainer Author

Regardless of that plan I don't see why that should prevent us from improving the layout used by DMs?

mvdbeek Oct 17, 2024
Maintainer

Given you'd have to change the data managers, this seems like a lot of work if we want to start handling reference data differently anyway, but this is of course not a problem.

jmchilton Oct 18, 2024
Maintainer

Does this require changing the data managers? FWIW, I am still committed to helping with the new path forward on cached data, but realistically I think a lot of tools still use data tables and I don't think we can just abandon them, so I want to make sure whatever we're doing works with Pulsar.

natefoo · 2024-10-28T19:31:01Z

natefoo
Oct 28, 2024
Maintainer Author

Related issue: some DMs symlink the reference genome and then leave the symlink in place upon completion, even when the tool does not actually need the reference genome to function. When running DMs in normal mode this is fine, but in bundle mode this duplicates the reference genome for every indexer DM. Additionally, when importing those bundles to another server, it makes a copy of the reference genome for every imported bundle (which can also be quite slow for large genomes).

So I propose that we remove the symlink from bowtie 1/2 and bwa-mem 1/2 on tool completion, after I verify it is not needed for any of them.
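
As an aid for that verification step, a small sketch like the following could list leftover reference symlinks under an index directory without deleting anything. It assumes the leftovers are symlinks with a FASTA-like suffix; `find_reference_symlinks` and the suffix list are illustrative choices, not part of any DM.

```python
from pathlib import Path


def find_reference_symlinks(index_root, suffixes=(".fa", ".fasta")):
    """List symlinks under index_root whose name looks like a reference FASTA.

    Read-only: nothing is removed, so the result can be reviewed first.
    """
    root = Path(index_root)
    return sorted(
        path for path in root.rglob("*")
        if path.is_symlink() and path.suffix in suffixes
    )


if __name__ == "__main__":
    import sys

    for link in find_reference_symlinks(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(link, "->", link.readlink() if hasattr(link, "readlink") else "")
```

Regular files are ignored, so indexes that hold a real copy of the FASTA would not be reported.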

0 replies
