User-based ObjectStore #4840

VJalili · 2017-10-20T21:17:02Z

(1) Initially started at #4314; (2) all the commits of this PR were squashed into a single commit on Dec 11, 2019; the history of the changes is preserved via this branch.

Introduction

This PR extends Galaxy's ObjectStore to let users bring their own resources: users can plug in a storage media (e.g., an Amazon S3 bucket) on which Galaxy will persist their datasets.

Motivations

unlimited storage: users on Galaxy instances with limited storage resources (e.g., a storage quota) can potentially have unlimited storage by plugging their own (cloud-based) storage into Galaxy;
data sharing: having datasets generated by Galaxy stored on the user’s cloud-based storage makes it easier to share analysis results with collaborators;
flexible persistence location: members of different labs using a common Galaxy instance hosted at their institute can have their data stored on their lab’s network attached storage (NAS).

Highlights

For users without a plugged storage media, Galaxy will continue to use the instance-wide configuration for their data storage needs;
A user's storage media (e.g., an S3 bucket) will be used for that user's data storage needs only, and will not be accessed for other users' storage needs;
A storage media can be a local path or an Amazon S3 bucket;
Users can plug multiple media (e.g., two different local paths and three Amazon S3 buckets) and assign an order and a quota attribute to each; Galaxy will use them in the given order and will fall back from one to the next when a quota limit is reached;
Leveraging the order attribute of storage media, users can use both instance-wide storage and their own media. For instance, they can direct Galaxy to use the instance-wide storage until their quota limit is reached (e.g., 250GB on Galaxy Main), then use their own media for the rest of their data storage needs.
Storage media are defined leveraging Galaxy’s cloud authorization model; hence, Galaxy does not ask for users’ credentials.
This PR implements all the necessary models, managers, functions, and APIs; a separate PR will add the UI;
The functionality builds on the Hierarchical ObjectStore; hence, it works only if the Hierarchical ObjectStore is configured. However, the hierarchy is applied instance-wide only and does not affect users’ plugged media configuration;
Each storage media has its own staging path (mainly used for the S3 backend), independent of the instance-wide ObjectStore and of other storage media; an admin can define a default staging path.
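The order-and-quota fallback described in the highlights can be sketched roughly as follows. This is only an illustration, not the code in this PR: the `StorageMedia` class and `pick_media` function are hypothetical names, and the sketch assumes that a lower `order` value is tried first and that quota and usage are in the same unit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StorageMedia:
    """Hypothetical stand-in for a user's plugged media (not Galaxy's model)."""
    name: str
    order: int          # assumption: lower values are tried first
    quota: float        # maximum amount this media may hold
    usage: float = 0.0  # amount already stored on it

    def has_room_for(self, size: float) -> bool:
        return self.usage + size <= self.quota


def pick_media(media: Sequence[StorageMedia], size: float) -> Optional[StorageMedia]:
    """Return the first media, by ascending order, that can hold `size`."""
    for candidate in sorted(media, key=lambda m: m.order):
        if candidate.has_room_for(size):
            return candidate
    # Nothing fits; a caller could fall back to instance-wide storage here.
    return None


media = [
    StorageMedia("s3-bucket", order=2, quota=500.0),
    StorageMedia("lab-nas", order=1, quota=100.0, usage=90.0),
]
# "lab-nas" comes first but cannot hold 20 more units, so "s3-bucket" is chosen.
chosen = pick_media(media, size=20.0)
```

Returning `None` rather than raising keeps the decision about falling back to instance-wide storage with the caller, which matches the mixed instance-wide/user-media setup described above.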

What's next?

We're aiming to keep this PR "minimally functional"; hence, features such as the ability to mount cloud-based storage and the user interfaces will be implemented in subsequent PRs.

How to use

Configure the ObjectStore to use the hierarchical backend, e.g.:

<?xml version="1.0"?>
<object_store type="hierarchical">
    <backends>
        <object_store type="distributed" id="primary" order="0">
            <backends>
                <backend id="files1" type="disk" weight="1">
                    <files_dir path="database/files1"/>
                    <extra_dir type="temp" path="database/tmp1"/>
                    <extra_dir type="job_work" path="database/job_working_directory1"/>
                </backend>
                <backend id="files2" type="disk" weight="1">
                    <files_dir path="database/files2"/>
                    <extra_dir type="temp" path="database/tmp2"/>
                    <extra_dir type="job_work" path="database/job_working_directory2"/>
                </backend>
            </backends>
        </object_store>
        <object_store type="disk" id="secondary" order="1">
            <files_dir path="database/files3"/>
            <extra_dir type="temp" path="database/tmp3"/>
            <extra_dir type="job_work" path="database/job_working_directory3"/>
        </object_store>
    </backends>
</object_store>

Log in and get your API key;
POST a payload like the following to /api/storage_media (you may use Postman to send API requests):

{
    "category": "local",
    "path": "A_PATH_ON_LOCAL_DISK",
    "order": "1",
    "quota": "1000.0",
    "usage": "0.0"
}
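As an alternative to Postman, the same request could be scripted. The sketch below only builds the request with the standard library. The base URL and the key value are placeholders, and passing the API key as the `key` query parameter is an assumption based on how Galaxy's API generally accepts keys. The helper name `build_storage_media_request` is made up for this example.

```python
import json
import urllib.parse
import urllib.request


def build_storage_media_request(galaxy_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/storage_media."""
    query = urllib.parse.urlencode({"key": api_key})
    return urllib.request.Request(
        url=f"{galaxy_url.rstrip('/')}/api/storage_media?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "category": "local",
    "path": "A_PATH_ON_LOCAL_DISK",
    "order": "1",
    "quota": "1000.0",
    "usage": "0.0",
}
request = build_storage_media_request("http://localhost:8080", "YOUR_API_KEY", payload)

# To send it against a running instance with this PR applied:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```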

Any dataset you create afterwards will be stored under A_PATH_ON_LOCAL_DISK, e.g.:

.
└── d
    └── b
        └── 1
            └── dataset_db1b29ae-524a-46c1-af8d-e3e9e6861a4e.dat
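The layout in the example tree appears to use the first three characters of the dataset UUID as nested directories. The following sketch reproduces that layout. It is inferred from the example alone, and `dataset_relative_path` is a hypothetical helper rather than a function from this PR.

```python
import posixpath


def dataset_relative_path(dataset_uuid: str) -> str:
    """Mirror the example layout: one directory per leading UUID character."""
    prefix_dirs = list(dataset_uuid[:3])  # e.g. ["d", "b", "1"]
    return posixpath.join(*prefix_dirs, f"dataset_{dataset_uuid}.dat")


relative_path = dataset_relative_path("db1b29ae-524a-46c1-af8d-e3e9e6861a4e")
# relative_path == "d/b/1/dataset_db1b29ae-524a-46c1-af8d-e3e9e6861a4e.dat"
```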

jgoecks · 2017-10-24T05:16:22Z

I started a branch with fixes here: https://github.com/jgoecks/galaxy/tree/UserBasedObjectStore2

Specifically, there are fixes for anonymous access. I can't seem to find your fork to initiate a pull request however—perhaps because your repo is restricted somehow and/or is so far behind the main repo?

VJalili · 2017-10-24T05:48:46Z

Thanks for the updates @jgoecks. Please see if you can make a PR against this branch; if not, I can update this branch. Besides, I guess we could avoid your last commit.

jgoecks · 2017-10-24T16:56:24Z

@VJalili I still cannot find your fork to make a PR against. I'll try to look into this more soon.

VJalili · 2017-10-24T17:12:05Z

@jgoecks I applied the changes you made on your branch on this branch.

qiagu · 2018-03-23T15:45:02Z

It will be nice to have user-based storage. @VJalili I wonder whether you use SQL tables to manage users and their corresponding storage. I haven't looked deep into this project yet, but my first feeling is to build a table on top of the current storage management system.

dannon · 2019-03-05T13:28:01Z

@VJalili I opened a PR that I think will fix tests for this PR. It's actually, I think, an issue we have always had and it was just never surfaced until this PR.

VJalili#6

VJalili · 2019-03-05T18:36:52Z

@dannon Thank you! I guess that has fixed it as all tests passed locally.

VJalili · 2019-03-05T19:29:19Z

@dannon I think the patch works fine for integration tests, but it breaks CI unit tests.

dannon · 2019-03-05T20:47:37Z

@VJalili Ahh, sure enough. I was laser focused on that one issue, let me dig deeper since there's more to the picture here.

Yeah, the error here has popped up again:
Parent instance <HistoryDatasetAssociation at 0x7fe8e45b0250> is not bound to a Session; lazy load operation of attribute 'history' cannot proceed

I'll try to figure out how we're getting an hda handle that's no longer bound.

VJalili · 2019-03-05T22:11:58Z

The orphan HDA handle is the issue causing the integration test's failure; I guess that is happening when Galaxy is writing metadata to a file.

test/integration/objectstore/test_plug_media.py

jmchilton · 2019-07-29T15:10:12Z

Thanks for refactoring the concept of ownership out of the dataset instance level (HDA/LDDA) and for the integration tests. These are serious improvements I believe.

Can you add an integration test of copying data on storage media between users? I assume based on the reading if a user copies my data and then I delete the storage media - the data will disappear for the user but I want that verified and stated explicitly with a test case. Is that fair?

VJalili · 2019-07-29T17:06:26Z

@jmchilton given the challenges that using this feature for shared data may introduce (e.g., authorization issues), we last decided to postpone support for shared data. Do you think we should add some warnings for users who attempt to use this feature for shared data?

jmchilton · 2019-08-01T14:11:06Z

Do you think we should add some warnings for users who attempt to use this feature for shared data?

Yes, ideally. I'm not sure yet if that should be required for this PR but that is a good idea in general if we're going to impose that restriction.

VJalili · 2019-08-05T22:06:35Z

@jmchilton I disabled sharing for user storage media (a history that contains a dataset stored on user-owned storage cannot be shared); please see a272454. Any other thoughts?

jmchilton · 2019-08-21T16:18:03Z

lib/galaxy/jobs/__init__.py

@@ -1402,6 +1429,7 @@ def _set_object_store_ids(self, job):
        # afterward. State below needs to happen the same way.
        for dataset_assoc in job.output_datasets + job.output_library_datasets:
            dataset = dataset_assoc.dataset
+            self.__assign_media(job, dataset.dataset)


I've spent months of my life trying to optimize this process of initializing the output datasets. Can we have some property on app that we can check to see if this method would ever do anything - and skip it if there is no possibility of assigning media?

VJalili · 2020-03-06

Sure! This config property has been added; if it is disabled, this method will not do anything. Would that address your concerns?

lib/galaxy/config/__init__.py

jmchilton · 2020-09-10T15:36:23Z

One sticking point at a time - I think User._calculate_or_set_disk_usage is going to be a problem here, right? As soon as the user's disk usage is recalculated, all the dataset usage for all the attached disks is going to be added to the user's quota, even though you very carefully prevented it from being initially added.

I'm trying to work on this in the context of creating like scratch storage object stores - I think what we need is more abstractions around quota calculation that ties them closer to object stores and is extensible for applications like this. I'll see if I can come up with something.
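To make the concern concrete: any full recalculation needs the same exclusion rule that the incremental path applies. A minimal sketch of such a recalculation, assuming each dataset record knows whether it lives on user-owned media, is shown below. `DatasetRecord` and `recalculate_disk_usage` are hypothetical names, not Galaxy's model or the quota abstraction proposed here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DatasetRecord:
    """Hypothetical, simplified view of a dataset for quota purposes."""
    size: int            # bytes on disk
    on_user_media: bool  # True if stored on the user's own plugged media


def recalculate_disk_usage(datasets: Iterable[DatasetRecord]) -> int:
    """Sum only the datasets that live on instance-wide storage."""
    return sum(d.size for d in datasets if not d.on_user_media)


datasets = [
    DatasetRecord(size=100, on_user_media=False),
    DatasetRecord(size=250, on_user_media=True),  # must not count against quota
    DatasetRecord(size=50, on_user_media=False),
]
usage = recalculate_disk_usage(datasets)
# usage == 150
```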

Not used in Galaxy core yet, but useful for applications where you want object store selection to be based on user in some way. This code was taken from galaxyproject#4840. Part of this is trying to reduce the number of files that branch touches to make review easier - but I'm confident this extension point is good regardless. Also it makes it clear we need to keep the user object in the picture when assigning the object store ID in the future.

jmchilton · 2020-09-10T16:11:40Z

lib/galaxy/webapps/galaxy/controllers/history.py

@@ -1057,7 +1063,12 @@ def purge_deleted_datasets(self, trans):
                if not hda.deleted or hda.purged:
                    continue
                if trans.user:
-                    trans.user.adjust_total_disk_usage(-hda.quota_amount(trans.user))
+                    if not hda.dataset.has_active_storage_media():
+                        trans.user.adjust_total_disk_usage(-hda.quota_amount(trans.user))


I think we can get around duplicating this code in the controllers with #10208.

That other PR has been merged so I think this should be rebased now along with the fix for https://github.com/galaxyproject/galaxy/pull/4840/files#r486551671.

jmchilton · 2020-09-10T18:29:46Z

lib/galaxy/model/__init__.py

-        """Sets and gets the size of the data on disk"""
-        return self.dataset.set_size(**kwds)
+        """Sets the size of the data on disk"""
+        self.dataset.set_size(**kwds)


Is this a broken rebase or is it important to drop the return here?

jmchilton · 2020-09-14T20:24:34Z

lib/galaxy/tools/actions/__init__.py

@@ -359,7 +359,7 @@ def execute(self, tool, trans, incoming=None, return_job=False, set_output_hid=T
        # datasets first, then create the associations
        parent_to_child_pairs = []
        child_dataset_names = set()
-        object_store_populator = ObjectStorePopulator(app)
+        object_store_populator = ObjectStorePopulator(app, user=trans.user)


Can you open a new PR with these changes - https://github.com/galaxyproject/galaxy/compare/dev...jmchilton:user_objectstore_populator?expand=1. This continues a theme along with #10208 and #10212 of trying to establish Galaxy abstractions that restrict the code needed to implement this functionality just to object store, quota, and model code.

VJalili · 2020-09-15

Sure: #10231

jmchilton · 2020-12-15T17:20:08Z

lib/galaxy/webapps/galaxy/controllers/history.py

@@ -969,7 +969,13 @@ def _populate_restricted(self, trans, user, histories, send_to_users, action, se
                else:
                    # Only deal with datasets that have not been purged
                    for hda in history.activatable_datasets:
-                        if trans.app.security_agent.can_access_dataset(send_to_user.all_roles(), hda.dataset):
+                        if len(hda.dataset.storage_media_associations) > 0:


This is not sufficient to prevent sharing at all I don't think. There are other paths to share datasets that don't hit this controller, library datasets should be prohibited from being such datasets, etc...

I think #10840 is what we want. It is much more general - it allows any objectstore to be marked as private - and it is much more comprehensive in how it prevents sharing. It has test cases, it prevents such datasets from even showing up when, say, importing history datasets into libraries, etc...

I think this portion of the PR should be dropped when that other PR is merged and instead just ensure that your user-based objectstores are marked as private - I think better APIs and UIs will pretty cleanly fall out from that.

jmchilton · 2020-12-15T17:25:46Z

lib/galaxy/jobs/handler.py

+        # exception(s).
+        if state == JOB_READY and self.app.config.enable_quotas and \
+                (job.user is not None and
+                 (job.user.active_storage_media is None or not job.user.has_active_storage_media())):


I think this should be redone on top of #10221 - which I think abstracts out the quota checking into nice optimizable functions. I think rather than checking if has_active_storage_media we should build on the abstractions in that PR to just ask if the configured objectstore we're talking to has quota left and then we can disable quota on objectstores that use storage media. The ability to disable quota on an objectstore is included in that PR.
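The suggested shape, asking the target objectstore whether quota applies instead of checking the user's storage media, might look roughly like this. It is a sketch under stated assumptions only: `ObjectStoreInfo` and `job_over_quota` are hypothetical and do not reflect the actual interface in #10221.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectStoreInfo:
    """Hypothetical descriptor of the objectstore a job will write to."""
    store_id: str
    quota_enabled: bool = True  # would be False for user-owned storage media


def job_over_quota(store: ObjectStoreInfo, usage: int, quota_limit: Optional[int]) -> bool:
    """Return True only if quota applies to this store and is exhausted."""
    if not store.quota_enabled or quota_limit is None:
        return False
    return usage >= quota_limit


instance_store = ObjectStoreInfo("primary")
user_store = ObjectStoreInfo("user_media", quota_enabled=False)
```

With this shape, the job handler never needs to know about storage media; a store backed by user media simply reports that quota does not apply.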

jmchilton · 2024-06-21T20:45:42Z

Went with alternate implementation #18127

galaxybot added triage status/WIP labels Oct 20, 2017

VJalili changed the title ~~[WIP] User-based object store~~ [WIP] User-based ObjectStore Nov 3, 2017

VJalili mentioned this pull request Dec 7, 2017

The Roadmap #1928

Closed

martenson added kind/feature area/objectstore and removed triage labels Jan 1, 2018

VJalili changed the title ~~[WIP] User-based ObjectStore~~ User-based ObjectStore Mar 12, 2019

VJalili mentioned this pull request Mar 14, 2019

Human-readable/user-defined filename (& path) for datasets #7525

Open

martenson reviewed Jul 24, 2019

test/integration/objectstore/test_plug_media.py (outdated)

VJalili added status/review and removed status/WIP labels Jul 29, 2019

galaxybot added this to the 19.09 milestone Jul 29, 2019

hexylena mentioned this pull request Aug 12, 2019

Per-user or per-group object storage #3561

Closed

jmchilton reviewed Aug 21, 2019

lib/galaxy/config/__init__.py (outdated)

mvdbeek removed this from the 20.09 milestone Sep 8, 2020

galaxybot added this to the 20.09 milestone Sep 8, 2020

jmchilton reviewed Sep 10, 2020

jmchilton mentioned this pull request Sep 11, 2020

Implement quota interface - with better logic isolation and unit tests. #10212

Merged

jmchilton reviewed Sep 14, 2020

This was referenced Sep 15, 2020

Pass user to ObjectStorePopulator. #10223

Closed

Pass user to the ObjectStorePopulator #10231

Merged

mvdbeek modified the milestones: 20.09, 21.01 Sep 16, 2020

jmchilton mentioned this pull request Sep 29, 2020

Implement quota tracking options per ObjectStore. #10221

Closed

jmchilton mentioned this pull request Dec 1, 2020

[WIP] Implement abstractions to annotate non-sharable datasets & objectstores. #10840

Closed

jmchilton reviewed Dec 15, 2020

jmchilton mentioned this pull request Dec 21, 2020

[WIP] Implement quota tracking options per ObjectStore. #10977

Closed

mvdbeek modified the milestones: 21.01, 21.05 Jan 6, 2021

jmchilton mentioned this pull request Apr 5, 2021

Directions to Improve Distributed Data Handling in Galaxy #11787

Open

mvdbeek modified the milestones: 21.05, 21.09 Apr 7, 2021

jmchilton mentioned this pull request Apr 12, 2021

Executive summary of 2021 Q2 Backend Goals #11824

Closed

12 tasks

mvdbeek removed this from the 21.09 milestone Sep 8, 2021

This was referenced Jun 9, 2022

[WIP] Implement abstractions to annotate non-sharable datasets & objectstores. #14044

Closed

Implement quota tracking options per ObjectStore. #14047

Closed

Empower Users to Select Storage Destination #14073

Merged

mvdbeek marked this pull request as draft October 17, 2022 08:59

jmchilton closed this Jun 21, 2024
