
DOS connect

concept

Exercise the GA4GH data-object-schemas


epics

  • As a researcher, in order to maximize the amount of data I can process from disparate repositories, I can use DOS to harmonize those repositories.
  • As a researcher, in order to minimize cloud costs and processing time, I can use DOS-harmonized data to decide which platform/region to download from or where my code should execute.
  • As an informatician, in order to ingest from disparate repositories, I need to ingest an existing repository into DOS.
  • As an informatician, in order to keep DOS up to date, I need to observe changes to the repository and automatically update DOS.
  • As a developer, in order to enable DOS, I need to integrate DOS into my backend stack.

capabilities

This project provides two high-level capabilities:

  • observation: long-lived services that watch an object store and populate a webserver with data-object-schema records. These observations capture adds, moves, and deletes in the object store.
  • inventory: on-demand commands that capture a snapshot of an object store as data-object-schema records (see the example record below).
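Both capabilities emit the same kind of record. For orientation, a minimal sketch of one, assuming the field names defined by the GA4GH data-object-schemas (id, size, checksums, urls); the exact fields dos_connect stores may differ:

# a minimal data_object record; field names follow the GA4GH
# data-object-schemas, values are illustrative only
data_object = {
    'id': 'abc123',  # system-of-record id; client-driven by default
    'name': 'sample.bam',
    'size': '2515738',
    'created': '2018-01-01T00:00:00Z',
    'updated': '2018-01-01T00:00:00Z',
    'checksums': [{'checksum': 'd41d8cd98f00b204e9800998ecf8427e', 'type': 'md5'}],
    'urls': [{'url': 's3://my-bucket/sample.bam'}],
}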


customizations

The data-object-schema is 'unopinionated' in several areas:

  • authentication and authorization are unspecified.
  • no specific backend is specified.
  • the 'system-of-record' for id, if unspecified, is driven by the client.

dos_connect addresses these on both the server and the client with duck-typed plugins.

Server plugins (a sketch of the plugin shape follows the list):

  • BACKEND: storage. Implementations: in-memory and elasticsearch, e.g. BACKEND=dos_connect.server.elastic_backend
  • AUTHORIZER: authentication and authorization. Implementations: noop, keystone, and basic, e.g. AUTHORIZER=dos_connect.server.keystone_api_key_authorizer
  • REPLICATOR: downstream consumers. Implementations: noop and kafka, e.g. REPLICATOR=dos_connect.server.kafka_replicator
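Because plugins are duck-typed, a backend is any module that exposes the functions the server calls; no registration step is needed. A minimal in-memory sketch; the function names below are assumptions for illustration, so check the bundled in-memory backend for the real contract:

# my_backend.py: illustrative duck-typed BACKEND plugin.
# save/update/delete/get are assumed hook names, not the
# verified dos_connect contract.
_store = {}

def save(data_object):
    """Persist a new record, keyed by id."""
    _store[data_object['id']] = data_object
    return data_object

def update(data_object_id, data_object):
    """Replace an existing record."""
    _store[data_object_id] = data_object
    return data_object

def delete(data_object_id):
    """Remove a record; return it if it existed."""
    return _store.pop(data_object_id, None)

def get(data_object_id):
    """Fetch a single record, or None."""
    return _store.get(data_object_id)

Pointing the server at the module (e.g. BACKEND=my_backend) would then be the only wiring required.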

Client plugins:

All observers and inventory tasks leverage a middleware plugin capability.

  • user_metadata(): customize the collection of metadata
  • before_store(): modify the data_object before persisting
  • md5sum(): calculate the md5 of the file
  • id(): customize id e.g. CUSTOMIZER=dos_connect.apps.aws_customizer

To specify your own customizer, set the CUSTOMIZER environment variable.

For example: AWS S3 returns a special hash (ETag) for multipart uploads rather than a true md5, so the aws_customizer uses a lambda to calculate the true md5 hash of multipart files. Other client customizers include noop, url_as_id, and smmart (which obfuscates paths and associates user metadata).
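A minimal sketch of a custom customizer module, assuming the hook signatures below match the hook names listed above (the bundled customizers in dos_connect.apps show the real contract):

# my_customizer.py: illustrative CUSTOMIZER plugin; the hook
# signatures are assumptions for illustration.
import hashlib

def user_metadata(path):
    """Attach extra metadata to each observation."""
    return {'source': 'my-lab'}

def before_store(data_object):
    """Adjust the data_object before it is persisted."""
    data_object.setdefault('aliases', []).append(data_object['name'])
    return data_object

def md5sum(path):
    """Stream the file in chunks so large objects do not exhaust memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()

def id(data_object):  # shadows the builtin on purpose, to match the hook name
    """Use the first url as the system-of-record id."""
    return data_object['urls'][0]['url']

Then export CUSTOMIZER=my_customizer before starting an observer or inventory task.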

setup

see here

server

Setup: .env file

# ******* webserver
# http port
DOS_CONNECT_WEBSERVER_PORT=<port-number>
# configure backend
BACKEND=dos_connect.server.elasticsearch_backend
ELASTIC_URL=<url>
# configure authorizer
AUTHORIZER=dos_connect.server.keystone_api_key_authorizer
# keystone auth url (include the /v3 suffix)
DOS_SERVER_OS_AUTH_URL=<url>
AUTHORIZER_PROJECTS=<project_name>
# replicator
REPLICATOR=dos_connect.server.kafka_replicator
KAFKA_BOOTSTRAP_SERVERS=<url>
KAFKA_DOS_TOPIC=<topic-name>

Server Startup:

$ alias web='docker-compose -f docker-compose-webserver.yml'
$ web build ; web up -d
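Once the webserver is up, it can be smoke-tested with a single GET; a sketch using requests, assuming the standard GA4GH DOS GetDataObject route (/ga4gh/dos/v1/dataobjects/<id>) and an id that already exists:

# smoke test: fetch one record from the webserver
import os
import requests

server = os.environ['DOS_SERVER']  # e.g. http://localhost:<port-number>
resp = requests.get('{}/ga4gh/dos/v1/dataobjects/{}'.format(server, 'abc123'))
resp.raise_for_status()
# GetDataObject responses wrap the record under 'data_object'
print(resp.json()['data_object'])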

Client Startup (note: run source <openstack-openrc.sh> first):

# webserver endpoint
$ export DOS_SERVER=<url>
# seconds to sleep between inventory runs
$ export SLEEP=<seconds-to-sleep>
# bucket to monitor
$ export BUCKET_NAME=<existing-bucket-name>
$ alias client='docker-compose -f docker-compose-swift.yml'
$ client build ; client up -d

ohsu implementation:

  • see swagger

  • note: you will need an OpenStack account and must provide a token obtained via openstack token issue


  • see the kafka topic 'dos-events' for the event stream

  • the kafka queue is populated with

    {'method': method, 'doc': doc}
    

    where doc is a data_object and method is one of ['CREATE', 'UPDATE', 'DELETE']
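Any Kafka client can consume that stream; a minimal sketch using the kafka-python package, with the topic and bootstrap servers taken from the .env settings above:

# downstream consumer for the replicator stream (pip install kafka-python)
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'dos-events',
    bootstrap_servers='<url>',  # KAFKA_BOOTSTRAP_SERVERS
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)
for message in consumer:
    event = message.value
    # method is one of CREATE, UPDATE, DELETE; doc is the data_object
    print(event['method'], event['doc'].get('id'))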

next steps

  • testing
  • evangelization
  • swagger improvements (403, 401 status codes)
