
Possible future development: towards a centralized storage infrastructure? #262

Open
lucasgautheron opened this issue Aug 4, 2021 · 2 comments
Labels
enhancement New feature or request

lucasgautheron (Collaborator) commented Aug 4, 2021

(This is WIP)

Is your feature request related to a problem? Please describe.

The current design is the one described in Managing, storing, and sharing long-form recordings and their annotations.

It can be summed up this way:

  • ChildProject is a python+CLI interface for interacting with corpora of child-centered daylong recordings
  • DataLad is a python+CLI interface for managing scientific datasets (such as these corpora)

There are a few issues that remain unsolved by this design:

  1. Neither of these tools provides the infrastructure to store and process the data, which can be challenging for data of this kind.
  2. Although it is great in many respects (especially versioning and reproducibility), DataLad may be too technical for some of the users interested in managing such corpora. And although it is not required by ChildProject - which works as long as the files are structured properly - it is the only solution our design proposes for retrieving and uploading data.
  3. Not all storage backends handle complex permissions well, and DataLad does not cope well with a large number of groups either.

Although there are advantages to decentralization, these limitations call for (at least one) centralized database of daylong recordings.

I'll discuss two alternatives:

  1. A web-based database
  2. A DataLad-based approach
lucasgautheron added the enhancement label Aug 4, 2021
lucasgautheron (Collaborator, Author) commented Aug 10, 2021

Towards a web-oriented, git-less database

Here is a description of the long-term goals.

ChildProject should be able to interact with corpora using different storage backends:

  1. Locally, using the current standards (metadata as CSV dataframes), which we'll call the CSV interacting mode
  2. Locally, using a database (e.g. SQLite, PostgreSQL, etc.), i.e. the database interacting mode
  3. Remotely, through an API, i.e. the API interacting mode

For instance, the third option would apply to users of a centralized database.
Most functionality should work the same regardless of the storage backend (i.e. one client API for all storages).
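
As an illustration of "one client API for all storages", here is what selecting a mode could look like. This is a sketch only: the Project signature and the store classes (drafted in the Implementation section below) are assumptions, not the current ChildProject API.

# Hypothetical examples of the three interacting modes; the Project
# signature shown here is an assumption, not existing ChildProject code.
from sqlalchemy import create_engine

# 1. CSV interacting mode (current standard, local files)
project = Project(store=CSVStore("/data/corpora/my-corpus"))

# 2. Database interacting mode (local or institutional database)
engine = create_engine("postgresql://localhost/daylong")
project = Project(store=SQLStore(engine, corpus="my-corpus"))

# 3. API interacting mode (centralized daylong-db instance)
project = Project(store=APIStore("https://daylong-db.example.org/api"))

# the rest of the code would be identical in all three cases:
children = project.store.get_children()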

The centralized database (let's call it daylong-db) would include two packages:

  • daylong-db-server: an API server complying with the standards of ChildProject's API storage interface, e.g. to serve the requests made through the API interacting mode of ChildProject. It could be built with Flask, and it could itself use ChildProject through the database interacting mode, for instance (see the sketch below this list).
  • daylong-db-frontend: a web-based graphical interface for interacting with the corpora (as an alternative to ChildProject which is a python/CLI interface)
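
For illustration, a minimal daylong-db-server endpoint could look like this. This is a sketch assuming Flask and the SQLStore class drafted further down; the route layout is hypothetical and authentication is omitted.

# Minimal sketch of a daylong-db-server endpoint built with Flask.
# Route names and store wiring are hypothetical.
from flask import Flask
from sqlalchemy import create_engine

app = Flask(__name__)
engine = create_engine("postgresql://localhost/daylong")

@app.route("/corpora/<corpus>/children")
def get_children(corpus):
    # the server itself uses ChildProject's database interacting mode
    store = SQLStore(engine, corpus=corpus)
    children = store.get_children()
    return app.response_class(
        children.to_json(orient="records"), mimetype="application/json"
    )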

It should also be possible to run processing pipelines remotely. The API would return a handler which could be used to check the status of the job at any time and to retrieve the results (see the sketch below). On the server side, the jobs could be run with Slurm on a local infrastructure, or on a cloud computing provider such as AWS.
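
A rough idea of what the client-side handler could look like (the JobHandle class and the endpoint paths are assumptions):

# Hypothetical client-side handle for jobs running remotely.
import time
import requests

class JobHandle:
    def __init__(self, api_url: str, job_id: str):
        self.api_url = api_url
        self.job_id = job_id

    def status(self) -> str:
        # e.g. "pending", "running", "done", "failed"
        r = requests.get(f"{self.api_url}/jobs/{self.job_id}")
        r.raise_for_status()
        return r.json()["status"]

    def wait(self, poll: float = 30.0):
        # check the status of the job at regular intervals
        while self.status() not in ("done", "failed"):
            time.sleep(poll)

    def results(self) -> bytes:
        # retrieve the output once the job has completed
        r = requests.get(f"{self.api_url}/jobs/{self.job_id}/results")
        r.raise_for_status()
        return r.content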

Note that it should be possible to convert corpora from any storage format to any other (e.g. export/import from CSV to DB, etc.).
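
Since every store would expose the same interface, conversion could be a generic copy from one store to another. A naive sketch (ignoring the recordings' raw files and other details that would need more care):

# Naive conversion between two storage backends; it relies only on the
# abstract Store interface sketched below, so any (source, destination)
# pair of stores works.
def convert(source: "Store", destination: "Store"):
    for child in source.get_children().to_dict(orient="records"):
        destination.add_child(child)

    for recording in source.get_recordings().to_dict(orient="records"):
        destination.add_recording(recording)

    destination.add_annotations(source.get_annotations())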

Roadmap

  1. Develop ChildProject's database interacting mode (one in which CSVs are replaced with tables)
  2. Design specifications for an API interacting mode
  3. Implement it into ChildProject, ignoring processing pipelines at first
  4. Implement a server capable of serving these requests (i.e. daylong-db-server)
  5. Factor out the pipelines code
  6. Implement processing pipelines into daylong-db (using Slurm to begin with)
  7. Develop a web-based graphical interface for daylong-db

Implementation

Stores

A store is an object that can fetch/update data from a given storage backend (local or remote, CSV vs. SQL, etc.).

We'd have:

  • CSVStore
  • SQLStore
  • APIStore

all of which would inherit from an abstract Store class, e.g.:

from abc import ABC, abstractmethod
from os.path import join
from typing import List, Optional

import pandas as pd
from sqlalchemy import text
from sqlalchemy.engine import Engine


class Store(ABC):

    def __init__(self):
        pass

    @abstractmethod
    def get_children(self):
        pass

    @abstractmethod
    def get_recordings(self):
        pass

    @abstractmethod
    def get_annotations(self, sets: Optional[List[str]] = None):
        pass

    @abstractmethod
    def add_child(self, child: dict):
        pass

    @abstractmethod
    def update_child(self, child: dict):
        pass

    @abstractmethod
    def delete_child(self, child: str):
        pass

    @abstractmethod
    def add_recording(self, recording: dict):
        pass

    @abstractmethod
    def update_recording(self, recording: dict):
        pass

    @abstractmethod
    def delete_recording(self, recording: str):
        pass

    @abstractmethod
    def add_annotations(self, annotations: pd.DataFrame):
        pass

    @abstractmethod
    def update_annotations(self, annotations: pd.DataFrame):
        pass

    @abstractmethod
    def delete_annotations(self, annotations: pd.DataFrame):
        pass


class CSVStore(Store):
    def __init__(self, path):
        super().__init__()
        self.path = path
    
    def get_children(self):
        children = pd.read_csv(join(self.path, 'metadata/children.csv'))
        return children

    # etc.
      
class SQLStore(Store):
    def __init__(self, engine: Engine, corpus: str):
        super().__init__()
        self.engine = engine
        self.conn = engine.connect()
        self.corpus = corpus

    def get_children(self):
        query = text("SELECT * FROM children WHERE corpus = :corpus")
        children = pd.read_sql(query, self.conn, params={"corpus": self.corpus})
        return children
    # etc.
     

Project and AnnotationManager would always access or modify the data through their store instance, so that their code does not depend on the choice of store.
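
For instance, a sketch of how Project could delegate to its store (hypothetical; the current Project class does not take a store argument):

# Project delegates every data access to its store, so its behavior is
# identical whether the store is CSV-, SQL-, or API-backed.
# (Assumes the imports and Store class from the block above.)
class Project:
    def __init__(self, store: Store):
        self.store = store

    def get_children(self) -> pd.DataFrame:
        return self.store.get_children()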

Pros

  • Low technicality for the user
  • Easy to combine data from different datasets
  • All corpora in the DB always have the same format (even if the structure of the DB changes)

Cons

  • No versioning unless we build one ad hoc
  • It is unclear what belongs to the client side and what does not
  • It is not clear how this would scale up performance-wise: what if someone retrieves millions of records in one query?
  • Backing up large SQL databases is more difficult than backing up files
  • Unlike with dataframes, we can't really use wide tables. This means we need relational tables for custom fields (see the sketch below this list), which will make performance even worse.
  • Data owners don't really own the data (at best they can export it from the database)
  • If users retrieve some data from the DB to do some work, they have no easy way of knowing later on whether their data are still up to date
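
To make the wide-table point concrete: with CSV dataframes a corpus can simply add a column, whereas a shared database would need something like an entity-attribute-value table for custom fields. An illustrative schema (all table and column names hypothetical), expressed as SQL inside Python:

# Illustrative entity-attribute-value (EAV) layout for custom fields;
# everything here is hypothetical.
CUSTOM_FIELDS_SCHEMA = """
CREATE TABLE custom_fields (
    corpus     TEXT NOT NULL,
    table_name TEXT NOT NULL,  -- 'children', 'recordings', ...
    row_id     TEXT NOT NULL,  -- e.g. child_id or recording_filename
    field      TEXT NOT NULL,  -- name of the custom column
    value      TEXT            -- everything stored as text
);
"""
# Reconstructing one logical row then requires a join and a pivot,
# instead of a plain column access on a wide dataframe.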
