applications.tex

\chapter{Applications}
\label{chap:applications}

This chapter presents three applications built with Syndicate and Gaia.
In all cases, the ability to control end-to-end semantics within SDS
(instead of the application) enables
developers to tackle difficult data management
techniques, in ways that both preserve backwards-compatibility with existing
applications and preserve forward-compatibility with future storage features.
Applications do not need to be modified to leverage
new commodity services, and data flows and gateway placement let developers
consistently solve data management problems across multiple applications.

\section{Serverless Groupware}

% Groupware with Gaia
Groupware is a common category of Web application that allow users to
collaborate via data-sharing.  Groupware applications include shared to-do lists, calendars,
documents, contact lists, and so on.  Multiple users read and write to the same
storage medium in order to coordinate their activities.

The data storage story for groupware today requires each user to be able to see
a consistent view of her data, regardless of which of her devices read or write
it.  Since groupware is often used in sensitive
settings such as corporations, users have an expectation of privacy---by
default, their state is only visible to their devices.  Users must
\emph{explicitly} share data with other users (or the public), and if they do
so, their shared data is visible to all other users on all of their devices.

In conventional groupware software, this is achieved by running a shared server.
The users in the same user group have read and write access to the server's
state, and the server resolves conflicts between writes and enforces access
controls.  In addition, the server takes advantage of its global view of the
users' state to build up derived state like edit histories and
backups.  From a data policy perspective, all users trust one
organization composed of the server and all of the user groups'
devices.

In multi-organization settings, or in settings where users do not directly know
one another, implementing shared groupware is more challenging.  Each user (or
subgroups of users) have different policies regarding how their data is to be shared.
For example, a user's personal calendar should not be shared with work
colleagues.  What is needed is a groupware system where users can \emph{self-organize} into
user groups with which to share data, in a way where users can easily
authenticate one another and establish trust relationships with minimal
coordination.  This is achieved with a Gaia groupware library.

The groupware library differs from existing groupware software in two key ways.
First, it lets each user host their data on whichever cloud services (or
servers) they choose, while preserving end-to-end storage semantics for
groupware applications.  Second, it gives each user the ability to vet each
other user in the system by having users prove ownership of existing social
media accounts.  This latter feature allows users to self-organize into their
own per-application organizations with minimal coordination.  By posting
machine-checkable proofs-of-ownership on social media that are cryptographically
linked to accounts in Gaia's SSI system (henceforth referred to as ``social
proofs''), a user can easily vet other users when deciding to share groupware
data with them.  For example, users can leverage social proofs to prove that
they work in the same company, or go to the same school, or have the same shared
interests.

\subsection{Motivation}

Groupware software falls into two categories:  in-house groupware servers that
the users of an organization must maintain themselves, or outsources groupware
servers that run in third party servers.  There are undesirable 
trade-offs for both types of groupware.  In the first case, users incur an
ongoing operational cost for keeping the software up-to-date and keeping the
server running.  The advantage, however, is that they unilaterally control all
aspects of the server's data storage---including how often it gets backed up,
who can view the data, what kinds of derived state it makes, what version(s) of
the software it runs, and so on.

The second type of groupware is increasingly popular.  Companies like Microsoft
and Google each have suites of software-as-a-service offerings that take the
operational responsibilities out of the user's
hands~\cite{gapps}~\cite{microsoft-apps}.  The advantage is that the
SaaS offerings have potentially higher uptime and are managed by experts, and
are available at a predictable cost to users no matter how easy or hard it is to
maintain it.  The downside, however, is that the SaaS provider has global
visibility into the users' data, regardless of the users' desired privacy
settings.  If the SaaS provider is hacked, their groupware data can be exposed
to the public.  If the SaaS provider goes out of business, the groupware data
can be lost forever.  If the SaaS provider changes its API, then any custom
integrations with the platform break.

There does not exist a middle ground where users can share their data in a way
that is convenient for them (like what SaaS offers), but with the policy
controls they would get by running an in-house groupware server.  The serverless
groupware library for Gaia fulfills this need.

\subsection{Role of SDS}

Gaia enables the best of both worlds.  Users get all of the
operational convenience of SaaS with the privacy and data controls of having
their own servers.  Importantly, Gaia allows users to select whichever storage
providers they want without affecting the design of the groupware software. 
In addition, ancillary functionality like search indexing can be
implemented in Gaia gateways and reused in other applications by way of the
global relational database design pattern described in the previous chapter.

The users rely in Gaia's SSI system to bootstrap data confidentiality and
authenticity.  The gateways in Gaia ensure that all data is signed and encrypted
when it leaves the device, such that only the user's designated recipients (if
any) can view it.  In addition, the groupware software uses Gaia to ensure that
applications are isolated from one another at the volume level---an application
client can only access application-specific state.

A key operational concern of groupware systems is that they must only allow
users to view one another's data \emph{with the owner's permission}.  Gaia's
gateways enable this by allow users to implement data-specific checks when sharing
data.  This is achieved by giving users the ability to create and vet one
another's social proofs.  Importantly, the social proofs are verified
automatically by the software and presented to the user as part of the
permission-granting user experience.

\subsection{Design}

The groupware software is designed to run within the Web browser.  The
application logic runs as a Web page, and loads and stores the user's
credentials and data via a co-located Gaia node.  This allows decouples the
user experience and application functionality from the user's shared
storage concerns.  For example, one user can store their data on Dropbox and
another user can store theirs on Google Drive, but the application can access
each user's data regardless via the Gaia node.  A system overview is given
in Figure~\ref{fig:chap4-gaia-groupware}.

\begin{figure}[h]
   \centering
   \includegraphics[width=0.9\textwidth,page=23]{figures/dissertation-figures}
   \caption{Design of serverless groupware with the Gaia SDS system.  Alice
   lists signed certificate graphs in her SSI user account data, as well as the
   list of her personal devices' public keys and social proofs.  While Alice can
   write to her storage from her private Gaia nodes, she can make her data
   available via a public Gaia node as long as her SSI account contains enough
   social proofs that she is a valid application user.  Bob
   uses this public gateway to discover and read her shared data.}
   \label{fig:chap4-gaia-groupware}
\end{figure}

\subsubsection{Setup}

A user receives a volume for each groupware application she uses.  When she signs up for a specific
application, the groupware software inserts an application-specific set of keys
into the user's SSI account information, indexed under the application's name.  To provide
confidentiality, the user has the option of encrypting this routing information
such that only her trusted peers can discover that she uses it.  Her other
devices and other users' devices inspect her account in the SSI system to determine
which keys to use to authenticate the data she writes, as well as discover how
to access her storage (i.e. which Gaia nodes to contact, which storage providers
to contact, etc.).

\subsubsection{Sign-in}

The groupware software employs device-specific keypairs to allow the user to
sign in via multiple devices.  When the user signs in for the first time, her
device creates a volume for her and registers \emph{all} of her devices as
belonging to the same volume owner.  Then, when the user signs in from a different device, she can
still read and write data to her existing volume and administrate it.

The software ensures that her devices are aware of each other via a
``delegation record'' in her SSI account.  The delegation record lists all of
the user's device IDs and their public keys.  This way, when the user creates a
new volume, the software automatically grants all devices the volume owner
privileges.  To the user, it appears that they simply began using the app from a
separate device, just as they would have had it been a conventional Web
groupware application.

If the user wants to add or remove a device, she must re-generate her delegation
record with the current set of device public keys.
To do this securely, the software requires a quorum of signatures from a trusted
subset of her devices (configurable by the user).
A delegation record will only be considered valid if it
is accompanied by a sufficient number of signatures from this trusted device
set.  For example, a user might require a signature from two of three of her
devices in order to add a fourth device or remove the third, and in doing so
tolerate the loss of one of her three devices.  This way, the user can control
which devices are allowed to write to her data while tolerating the loss or
compromise of a pre-configured set of them.

Both the quorum threshold and the public keys of the
trusted devices are listed in the user's SSI zone file.  Since changing the zone
file requires a blockchain transaction in the SSI system, there will be a
widely-replicated auditable log of each user's device key rotations.  This makes it
easy for users (and their collaborators) to check key lifetimes, and makes it
risky for attackers to attempt to change keys (since they cannot do so
silently).

When the user signs in, the groupware library creates a gateway for the device
she is using if one does not exist already.  Her device will sign the
new certificate graph for the app's volume and make it available in her SSI account.  The
software authenticates data from the user by (1) looking up the user's ID in the
SSI system, (2) extracting the trusted device public keys and quorum threshold
from the zone file, (3) validating the delegation record, and (4) validating the
certificate graph against the delegation record.  The software caches
monotonically-increasing version numbers for the certificate graphs locally to
prevent stale certificate graphs from being reused.

\subsubsection{Reading and Writing Data}

Since a user gives each application its own volume, a groupware application like
a shared calendar spans the set of users' devices.  Gaia ensures that when the
application client is loaded, it only has visibility into the
application-specific volumes the users have created (i.e. so a malicious
or buggy application cannot read another application's state).

The groupware storage interface references data by its volume key and owner user.  For
example, to read Bob's file \texttt{today.cal}, Alice's application client would call
\texttt{get(``today.cal'', ``bob.id'')}, where \texttt{bob.id} is Bob's username
in the underlying SSI system.  All the while, Gaia
ensures that Alice's calendar application only discovers the routing information 
to Bob's calendar volume.

\hfill \break
\noindent{\textbf{Read Authorization}}
\hfill \break

When writing shared data, the user must ensure that it is readable by a given
set of other users.  How does the writer identify these other users,
and how can the software identify users as belonging to particular
organizations?  The groupware library addresses these problems by both allowing
the writer to specify other individual readers, and by allowing the writer
to specify which social proofs a reader must have (as well as a way to vet
them).  

The user is free to choose which proofs are required for their
application, depending on the application.  For example, a cryptocurrency
investment application could require a user to produce a signed KYC
(know-your-customer) attestations
from the government and the user's bank that prove that the user is an
accredited investor.  This proof would be signed and stored in a social media
platform that the groupware library can crawl (such as
AngelList~\cite{angellist}).

Once a Gaia gateway knows which social media proofs are required to read
a key value, it will only accept read requests from users who present
the requisite proofs.  To facilitate this check, users insert URLs to the
proofs within their SSI account linked to their names in the SSI system (which
the Gaia gateway looks up on-the-fly).

\hfill \break
\noindent{\textbf{Searching}}
\hfill \break

Public groupware data is readily indexed by anyone who wishes to stand up a
Gaia database instance to crawl the set of application-specific
volumes. In addition, private groupware data can still be indexed---either by a
trusted, private Gaia database, or by downstream user groups.

To implement private search in a user group,
the groupware software ensures that the local device's Push
stage indexes the contents of the file before encrypting and replicating it.
The Push stage encrypts the index data with the viewers' public keys, so
the viewers will be able to search for the file by keyword.

The index itself is application-specific, but can do things such as
associate search terms to file names and word counts.
The index data is structured as a per-user prefix tree, so
that a search query only needs to fetch a narrow subset of the index to find
files with the search term.

A global untrusted relational database can accelerate delivery of encrypted
index files to downstream readers.  Trusted readers asynchronously fetch,
decrypt, and incrementally reconstruct the writer's index locally to service
search queries.  Depending on the sizes of the index and the number of users,
the application may take different strategies for fetching the encrypted
index---for example, a large user group may employ a private trusted instance of
a Gaia relational database that can eagerly build up a search index, whereas a
small user group may simply fetch and reconstruct each other users' indexes as
needed.

\subsection{Implementation}

The groupware library implementation is the work of multiple contributors.
It is implemented in two parts: Javascript library that facilitates user
sign-ins and application-specific volume creation, discovery, reads, and writes,
and a UI that allows users to manage their social proofs.
It was developed in collaboration with Blockstack Public Benefit Corporation~\cite{blockstack-pbc}.

Several applications have been independently built by Blockstack community members
with the groupware library.  Examples include:

\begin{itemize}
   \item \textbf{Blockstack To-Dos}:  This is a private to-do list application
      that uses single-reader Gaia volumes to store private user to-do lists.
   \item \textbf{Graphite}:  This is a Google Docs work-alike~\cite{graphite-docs}.  Users store and share documents and
      spreadsheets via multi-reader Gaia volumes.  The data is encrypted by
      default, so that only the designated readers can access it.  It makes use
      of a Gaia database to facilitate secure document discovery---the database
      discovers encrypted pointers to the encrypted document, so that only the
      intended recipient can access the data.  It also offers end-to-end
      encrypted messaging, where messages are replicated to Gaia volumes for
      long-term storage.
  \item \textbf{Blockstagram}:  This is an Instagram work-alike that allows
     users to securely share photos via multi-reader Gaia
      volumes~\cite{blockstagram}.  Photos are
     encrypted with the recipients' public keys before being replicated, thereby
      providing end-to-end confidentiality.  It was developed by a team of eight
      Web application developers with no prior experience with Gaia (or
      Blockstack, Gaia's SSI system) in less than 36 hours at a hackathon in
      Berlin~\cite{patrick-tweet-blockstagram}.  % https://twitter.com/PatrickWStanley/status/970307376690626561
  \item \textbf{Stealthy.im}:  This is an end-to-end encrypted chat application,
     where users can securely send text and pictures
      real-time~\cite{stealthy.im}.  It uses
      multi-reader Gaia volumes to store chat data, and uses a Gaia database to
      discover and invite users to chat.  A similar Gaia-powered application is
      \textbf{Hermes}~\cite{hi-hermes}.
  \item \textbf{Coins}:  This is a private cryptocurrency portfolio application
     that uses single-reader Gaia volumes to securely and confidentially store
      the user's cryptocurrency holdings~\cite{coins}.  It allows the user to track the worth
      of their holdings without exposing them to anyone outside of the user's
      computer.
  \item \textbf{Publik}:  This is a microblogging application that uses
     multi-reader Gaia volumes to share blog posts~\cite{publik}.  A Gaia
      database for indexing hashtags and user posts is under development.
  \item \textbf{Bellweathr}:  This is a business analytics program that uses
     machine learning in the user's Web browser to help a business owner
      identify patterns in customer purchases~\cite{bellweathr}.  Business
      owners use Gaia to load and store encrypted copies of their customer data
      and trained models, thereby ensuring that it will remain private.
      Equivalent applications today require business owners to expose their customer
      data to third parties, which puts both they and their customers at risk 
      to hackers and security mishaps.
\end{itemize}

All of these applications use Gaia and its SSI system to load, store, and share
user data.  The SSI system implementation (the Blockstack Naming
Service~\cite{bns}) removes the need for per-app password databases and per-app
identity services, and Gaia removes the need for per-app data silos.  Users can
share data from one application to
another~\cite{blockstack-technical-faq-share-data} without the application's
permission or cooperation.

The applications Graphite, Blockstagram, Stealthy.im, and Hermes all rely on a
global database instance to discover other application users.  They are not
coupled to a specific instance; anyone can deploy a new global database if the
default instance misbehaves or is not trusted.

\subsection{Discussion}

The usefulness of SDS is apparent in its ability to implement its users'
data-hosting policies independently of the applications.  Each user can keep their groupware data on
the storage providers of their choice, and in doing so, control their
availability, durability, and access control independently of one another and independently of
the applications.  For example, a user's Gaia node can programmatically delete
old Stealthy.im messages without Stealthy.im's permission.  As another example,
a user's Gaia node can limit access to its owner's Graphite documents by denying
reads from hosts outside its local area network.

At the same time, application developers do not need to care
about hosting user data, and do not need to worry about coupling their data to
specific storage systems.  All of the third-party applications above do not rely
on application servers.

As an optimization, their respective developers deploy
Gaia global databases to help users discover one another.  For example,
Stealthy.im implements an invite mechanism using a Gaia global database, and
Graphite uses a Gaia global database to help users discover shared files.
However, the developer is not required to deploy and maintain a global database.
Gaia global databases only host soft-state in the application, and any user can
instantiate their own global database and derive the same 
database state.  This means that as long as at least one user is interested in
preserving Stealthy.im's invite system or Graphite's
document discovery system, they can do so without the developer's help.

The expressive power given to developers by the aggregation driver model is
apparent in the ability to control read and write access based on whether or not
the requesting user has made particular social proofs.  The social proof check
code only needed to be written once, and it now works across all groupware
applications and all cloud services.  The expressive power is also apparent in
the ability to automatically generate private search indexes in response to reads and
writes.

The main difficulty with giving users direct control over their groupware data
today is that it forced them to run a shared groupware server (or collectively
trust someone to do so on their behalf).  By instead
implementing what used to be server-side functionality as aggregation driver
stages, the library removed the need for a shared server while preserving each
user's control over their data.

\section{End-to-End Encrypted Email}

The ability for SDS systems to instantiate application-specific data flows gives
users the power to enforce data transmission and storage concerns in
\emph{existing} protocols as well.  This is demonstrated by using Syndicate to construct
end-to-end encrypted email that addresses long-standing
usability concerns that impede the widespread use of PGP~\cite{pgp}.

\subsection{Motivation}

Encrypted email is not a new concept.  However, it has proven notoriously difficult to
deploy~\cite{why-johnny-cant-encrypt}
~\cite{why-johnny-still-still-cant-encrypt} due to the need for users to manage
private keys.  In addition, deploying end-to-end encrypted email over legacy
SMTP servers and clients leaves users vulnerable to two security flaws:  users
can only achieve end-to-end encryption if they all share keys, and users can
accidentally leak other users' cleartext when including new users in an email
thread.

\subsubsection{Using Private Keys}

Even if users had a good understanding of public key cryptography, they must still contend
with key distribution and key revocation.  Key distribution is not addressed by
the encrypted email systems studied.  However, existing methods---key escrows,
certificate authorities (e.g. S/MIME~\cite{smime}, DANE~\cite{dane},
x.509~\cite{x509}), and webs-of-trust are difficult to use securely, and easy to
use incorrectly.

Key escrows and certificate authorities are ``centralized''
entities that often live outside of a users' organizations, which makes it
difficult for users to reason about their trustworthiness.  Only organizations
whose data policies admit a trusted third party can make use of these services.
Trusting a third party for such a task carries the risk of compromise: if a
widely-used certificate authority is compromised, it can lead to widespread
data exposure.  Users may not discover until after harm has been done to
them, such as identity theft.

Webs of trust do a better job than centralized key servers at preserving
organizational autonomy because they allow each organization to unilaterally
decide which other organizations to trust.  However, there is a high
coordination cost in maintaining them.  This
is because trust is \emph{not} transitive by nature---if Alice trusts Bob and Bob
trusts Charlie, it does not follow that Alice trusts Charlie.  Users in each
organization need to be wary of the degree to which to trust their peers, and wary of the trust
judgments their peers will make.  Moreover, they must curate their webs of
trust to account for changes in the organization.  For example, if Bob is fired
from his job, then all of Bob's coworkers must update their webs of trust to stop trusting his
email signing key.

Key revocation adds another layer of complexity.  Key revocation certificates
and signed key expiration dates do not go far enough in making encrypted email
usable.  If a user loses both their private key and their key revocation
certificate, then they have to get other users to re-establish trust in them
from scratch.  If the user's private key is compromised, then the attacker can
send arbitrary emails before the user can transmit their key revocation
certificate.  If the user loses their revocation certificate, or if the attacker
can stop the certificate from reaching the victims, then the user cannot stop an
attacker with their compromised private key.

\subsubsection{Contacting other Users}

Even if users could reliably distribute and revoke public keys, conventional
email clients still allow users to communicate with others in insecure ways.
Users can bring harm to themselves by accidentally sending email in the clear
when they meant to encrypt it.  Also, users can bring harm to others
by accidentally divulging their communications by carbon copying
their cleartext in an email to a user who does not use encryption.

Neither existing SMTP clients (including Web clients) nor
SMTP servers address these problems.  SMTP clients do not help users with key
distribution or revocation, and they do not help the user discover whether or
not they have the right key.  Web SMTP clients are even less secure, because the
Web client offloads transmission to a remote server (which now must be trusted
by the user).  If the user wants to use another device to send an email, such as
a public terminal, they have to divulge a private key to the device.

SMTP is already ill-suited for encrypted communications because at a minimum the email's
sender and recipient must be readable by all SMTP servers between the sender and
recipients.  Also, due to its store-and-forward architecture, any messages
accidentally sent in the clear will be stored by the servers for an
indefinite amount of time.  Users do not get to choose which servers store and forward
messages, and users cannot ``unsend'' messages if they discover that they sent
them to the wrong recipient.

\subsection{Role of SDS}

This thesis presents a backwards-compatible mail system built on top of
Syndicate.  Unlike conventional email, the Syndicate email system
automatically encrypts data end-to-end and ensures that users
discover each other's \emph{current} public keys by way of its SSI
system.  User can do the following with this system:

\begin{itemize}
\item \textbf{Automate key management}.  Users do not need to interact with keys
at all.  Users do not need to trust external key escrows or certificate
authorities, and they do not need to participate in webs of trust.  Instead,
users rely on Syndicate's blockchain-powered SSI system to discover each other's
current public keys.

\item \textbf{Control where emails are hosted and who can request them}.
A user's message contents will
not be relayed through the SMTP network, but will instead be hosted in one or
more storage hosts of the user's choosing.  Recipients will instead download and
decrypt the message once they have discovered where it is hosted and have
obtained sufficient permission.

\item \textbf{Support sending to legacy users}.  The Syndicate email system does \emph{not}
require both sender and recipient to use the same client in order to achieve
better security than legacy email.  If the recipient does not use this new system, the sender has
the ability to contact the receiver while
preserving sender-chosen security properties.  For example, 
the sender can share the message body via a trusted private shared cloud storage folder
that only the sender and receiver can access, and send the URL to the message
body via SMTP.  Only the recipient will be able to access the data.

\item \textbf{Safely use untrusted devices}.  This secure email system uses Syndicate's SSI system to
allow users to derive short-lived throw-away keys for signing and encrypting
messages on untrusted devices, like public terminals.  The keys are
automatically distributed and revoked.
\end{itemize}

\subsection{Design}

The Syndicate email system follows a similar design to the Internet Mail
2000~\cite{internet-mail-2000} proposal.  Users store their
encrypted emails in a Syndicate volume, which they
use to selectively give recipients access to their messages.  The system uses
the SMTP network to allow senders to inform receivers when they have new
messages waiting for them (Figure~\ref{fig:chap4-syndicate-mail}).

\begin{figure}[h]
   \centering
   \includegraphics[width=0.9\textwidth,page=24]{figures/dissertation-figures}
   \caption{Design of end-to-end encrypted email with Syndicate SDS.  Alice can
   send email from both a personal device and a public terminal; the latter of
   which gets assigned a temporary session key that expires shortly after being
   created.  Bob's client detects new mail from Alice via the legacy SMTP
   network by receiving a signed list of URLs that point to Alice's chosen
   storage services.  If Alice emails non-users of this system, her UG employs a
   custom ``message gateway'' (MG) type to Push the message payload to them
   while enforcing her custom security policies (such as ``store this message in
   a private shared Dropbox folder that the recipients can access and email them
   the URL'').}
   \label{fig:chap4-syndicate-mail}
\end{figure}

\subsubsection{Setup}

Each user stores their preferred email address in the SSI system.
Alice sends a message to Bob by looking up Bob's account
information in the SSI system, and then obtaining his email address.  In order
to convince Alice that he is the ``right'' Bob (i.e. the Bob she is looking
for), he includes additional credentials in his SSI data, such as social proofs
or signed attestations from trusted third parties.  The Syndicate email system
is not concerned with implementing a particular authentication strategy, but instead gives users
the ability to prove that various pieces of user-submitted identifying state associated with
the email address are signed by the same key that owns the email address
in the SSI system.  For example, if Alice knows that Bob owns the website
\texttt{www.bob.com}, Bob could authenticate to Alice by hosting his SSI
username and a signature on
\texttt{www.bob.com} and list a pointer to \texttt{www.bob.com} in his user
account on the SSI system.

The system is designed to accommodate multiple devices owned by the user by
storing all emails in a single volume that spans the user's devices.  Each
device has its own key-pair in the volume certificate graph, which is used to create
gateways specific to that device.  The user has an ``admin'' email account (i.e. an
account that is tied to the Syndicate volume owner account that stores her
emails).  The admin account is controlled
from a trusted device and is used to add or revoke permission to communicate from
other devices.

When a user signs up for the system for the first time, she downloads and
installs a mailer daemon
that implements an SMTP and IMAP endpoint locally.  The user points their preferred email
client to the local mailer daemon to send and receive messages.  In addition,
the daemon implements an HTTP interface for serving the mail client encrypted
messages from the Syndicate volume.

The mailer daemon prompts the user to generate a device-specific
Syndicate user account and two gateways (a UG and an RG)
when it is installed.  The user does so by using her
admin account.  The installer wizard gives the user the option of pre-allocating
keys for her devices and their gateways, which can be fetched and installed on untrusted devices
on-the-fly without requiring her to use her admin account again.  Their
keypairs are encrypted with a password of the user's choice, and stored to the user's
volume.

\subsubsection{Signing In}

Each device the user sends mail from receives its own keypair.  Each
device-specific key is associated with an optional expiry timestamp and
revocation certificate, which are stored in the user's Syndicate volume for
safekeeping.

Signing in with a new device requires ensuring that the device-specific private key is
available.  For devices the users trust, this is achieved simply by (1)
installing the software, and (2) allowing the device to register its public key
with the user's account in the SSI system.  An untrusted device, such as a
public kiosk, would receive a key with an expiry date and revocation certificate.
When the user signs out of the device, she would ``activate'' the revocation
certificate by appending a signed timestamp to it and 
moving it to a canonical path in her volume.  Other users'
clients would discover and process it automatically when receiving a message,
thereby ensuring that the kiosk does not use the private key after the user is
done with it.  The key expiry timestamp
ensures that the key expires nonetheless if the user is unable to successfully
sign out (i.e. unable to post the revocation certificate).

The device-specific key state includes the device-specific user account and the
device-specific gateway keys that the mailer daemon will use to interact with
the volume.  Each devices' gateways only write to one directory of the volume,
and mark their files as read-only by other devices (which the MS enforces).
The mailer daemon develops a coherent view of the mailboxes by listing all of
the devices' directory states.

In addition to creating device-specific Syndicate keys, the software also
creates a generic read-only UG and read-only RG whose private keys are publicly readable
and exposed in the volume.  These gateways are meant to allow recipients to access
the volume's ciphertext, so the designated recipient can decrypt them.
They are configured in the certificate graph to only have read capabilities, and
to only serve on \texttt{localhost}.  This ensures that all of the user's other
gateways will ignore them, and that anyone can run them on their computers to
access the inbox data.

\subsubsection{Sending and Receiving Mail}

The mailer daemon implements a Syndicate UG and RG (e.g. as subprocesses).
The UG implements the SMTP
and HTTP endpoints, and the RG uploads messages to the user's preferred storage
service, such as their personal Dropbox folder or a S3 bucket.

When the UG receives an outgoing message, its \texttt{serialize()} driver method inspects the
message for the recipient, and automatically looks up the public key in the SSI
system to encrypt the message to the recipient before sending it to the RG.
\emph{This way, the sender is never involved with selecting the key for a
recipient user}.  The software additionally makes a copy of the sent message encrypted with the
sender's public key, and stores it into the device's ``sent'' mailbox.

The mailer daemon informs the recipient that they have a message waiting for
them by sending a small amount of discovery information to the recipient's
email address via SMTP.  This discovery information is signed by the sender,
to prove its authenticity to the recipient.  It identifies the path to the
message in the volume, as well as the hash of the ciphertext.

The recipient's mailer daemon polls the user's SMTP inbox for discovery
messages.  When it finds one, it fetches, authenticates, and decrypts the
associated message from the sender's volume,
and locally stores it so the user's mail client can read it as a normal
email.  It does so automatically as part of the \texttt{deserialize()} driver
method in the UG---this driver method only succeeds if the message could be
authenticated.  The discovery message's sender email address
is used to look up the user's device keys in the SSI system to perform the authentication.
\emph{This way, the receiver never needs to select the key for the
sender to authenticate the message}.

The sender must host the email contents for the recipient until either the
recipient downloads it.  Once the recipient daemon has fetched the cleartext, he
encrypts and backs up a copy via its RG for safe-keeping.
The sender can delete the messages she sent at any time, thereby granting her
the opportunity to ``un-send'' an email's message body if she can do so before
the recipient fetches it.  The sender can garbage-collect old messages once
she is sure the recipient has fetched them, or once the information is no longer
relevant.  For example, the sender could simply delete all messages she
sent over one month ago.

If the sender includes multiple recipients, or includes a new recipient part-way
through the email chain, their mailer daemon detects this and ensures
that the previous conversation is kept secret.  This is achieved by having the
local RG in the mailer daemon remember which email threads have which
recipients, and ensure that their respective messages are re-encrypted before transmission.
This conversation metadata is encrypted and stored on the user's volume, so it
is accessible from all devices' RGs.  This decreases the likelihood that a user
accidentally divulges cleartext in carbon copies on the email client---the message
would simply fail to send if the user did this.

\subsubsection{Legacy Compatibility}

As with PGP before it, the Syndicate-powered email system requires both sender
and recipient to use it in order to realize the full benefits.  Unlike PGP, the
developer can ensure that certain safety features are in place if only the
sender uses the software.  This is made possible by Syndicate's
aggregation driver programming model.

It is important to recognize that when it comes to email, the correct way to
send a message depends on the sender, the recipient, the content, and the context in which
it is sent.  For example, two friends exchanging vacation photos do not need the same
security guarantees as an anonymous informant communicating with a law
enforcement agent.

One of the major drawbacks of PGP is that cannot work if either the
sender or recipient do not use it.  This significantly limits the set of senders
and recipients.  Moreover, PGP-encrypted messages are easy to spot in SMTP
traffic, which makes it easy for network eavesdroppers to identify users who
have something to hide.

What is needed is for senders and receivers to be able to communicate even if
one of them does not use PGP-like encryption.  The approach taken here is to
make it easy for the sender to control how the message will be delivered, while
allowing messages to be discovered by the recipient over legacy SMTP.  The
sender is free to set up the delivery process to implement the security
guarantees on a case-by-case basis, subject to what she knows about the recipient and subject to
the contents of the message.  For example:

\begin{itemize}
\item The sender can encrypt the message with a password known to the recipient,
and send the message body in a common document format, like Microsoft Word or
PDF, that the recipient can open and decrypt with already-installed software.
This can provide the confidentiality of PGP.
\item The sender can replicate the message to a shared private storage provider
like a Dropbox folder or private \texttt{git} repository, and send the
recipient the URL over SMTP.  This process can be carried out via HTTPS.
While this does not provide the same
degree of end-to-end confidentiality and authenticity as PGP, it guarantees that as long as
the certificate authorities and shared storage are trusted, then only the sender, the recipient, and the
storage provider can view the message (but SMTP servers see nothing).
\item The sender can select which network to use to transmit the data, based on
the recipient.  For example, an enterprise user could require all messages sent
to the company SMTP server must be sent through the corporate
VPN.  The aggregation driver would refuse to send messages unless it detected
that the VPN was available.  This ensures that all email messages sent by employees are
visible only to the company and the recipient.
\end{itemize}

These examples do not provide the same guarantees of PGP, but they are
better than relying only on legacy SMTP for email.  While they can all be done
today in an ad-hoc manner without SDS today,
Syndicate lets users ensure that they are all executed
automatically and consistently.  Moreover, the way these features are
implemented allows them to be reused in multiple different contexts, giving senders the
ability to \emph{combine} different features to create a custom message
transmission process.

Addressing legacy compatibility is a practical application of Syndicate's custom
gateway types.  The deployment designed so that the RG's
Push driver stage (1) reassembles the Pushed chunks received from the UG
(embedded in the email client) back into the original email, (2)
scans the certificate graph for gateways with a type identifier specific to the
email client (the ``MG'' gateway in Figure~\ref{fig:chap4-syndicate-mail}),
and (3) forwards the reassembled email to them for further
processing.

When the MG receives the message, it inspects the message
headers and runs a user-specified program based on the recipient address.  The
user-specified program is responsible for actually transmitting the email.
For example, each of the above examples can be implemented with separate
programs that are invoked as subprocesses that take the message as input and
carry out the actual transmission.

The transmission programs themselves are part of the email-type gateway's driver.  The user
deploys them to her volume by updating the certificate graph.  Since the volume
spans all of her devices, each of her devices will have the most up-to-date
transmission programs available whenever the user sends a message.

\subsubsection{Search Indexing}

Since all messages are encrypted client-side, there is no option for server-side
message indexing.  Instead, the user's RGs incrementally build up a
word-to-email index as part of their Push stage logic, just as they do in the
serverless groupware example.  The index itself is
encrypted with the user's public keys, so it is visible only on the user's devices. 
In fact, the code to maintain the users' indexes can simply be re-used by the
RGs without affecting the design or implementation of the mail clients.

There are two reasons to offload search indexing to the RGs instead of allowing
applications to handle this.  First, this preserves the index across all devices. 
This is especially important for Web clients, which cannot easily store a large amount of state
locally on their own (HTML \texttt{localStorage} is limited to 5MB, for example).  Second, it
makes it easier to implement additional features like spam filtering, described below.

\subsubsection{Spam Filtering}

A key usability problem with encrypted email is that the servers cannot filter spam,
since they cannot read the messages.  This can be addressed in four ways within
the volume's aggregation driver.
\\
\noindent{\textbf{Shared Spam Database}}. First, the aggregation driver is programmed to have the RGs in a user's volume build
up a \emph{shared} set of classification data from user input.  When the user
moves data to the ``spam'' mailbox, the RG driver's Push stage generates and
a feature vector from the cleartext and stores it in a shared storage
provider.  This allows
users to share each others' spam feature information.

The shared storage itself is implemented as a separate, third party volume that enforces write-once read-many
access patterns, and tracks which users add which features.  That is, the RGs to the volume do not allow a record to be
written more than once, and do not allow records to be deleted (except by the
volume owner).  This ensures
that users do not accidentally clobber one another's writes, and a malicious
user (such as a spammer) cannot erase the feature vectors.  If it is later
discovered that a particular user's records were written with malicious intent,
they can be removed by the volume owner.

This arrangement is similar to existing third party spam detectors such as
Spamhaus~\cite{spamhaus}, where a third party aggregates spam information 
on behalf of many users.  The spam volume owner would aggregate the spam
information to train a spam classifier, and write the classifier parameters
to the volume.  A user's mailer daemon would connect to the volume in a read-only fashion
to read the classifier parameters, and use them to classify the user's inbound
messages as spam or not spam.  Because the volume is shared across many users
(and can be replicated by any user), the users are able to avoid spam-detection
service lock-in because they can (1) independently calculate the spam classifier
parameters, and (2) come up with their own, better classification system if the
spam volume owner does not do a good enough job.

Anyone can set up and run a collective spam filtering process.  Users are free
to unilaterally decide which ones to use.  Therefore, this approach does not
infringe on organizational autonomy.
\\
\noindent{\textbf{Sender Pays for Storage}}. The second anti-spam feature is that by design, the user pays for storing messages to recipients.  Since each
recipient has a different public key, the user must encrypt a message for each
recipient.  As a result, a spammer
must store a lot of state to spam many users at their own expense.  This
discourages, but does not completely remove, bulk spam.  This is similar to
the Internet 2000~\cite{internet-mail-2000} webmail proposal.
\\
\noindent{\textbf{SSI Proofs of Payment}}. The third measure is to take advantage of the fact that the SSI system is implemented on top of
a public blockchain.  This feature allows for some interesting anti-spam mechanisms.  A recipient can require the sender
to include a ``proof-of-payment'' on the message, generated by a transaction on the
underlying blockchain.  This would have the effect of both rate-limiting
spammers and making emailing users prohibitively expensive to do at scale.  It
would also allow senders to prioritize messages by paying higher fees. This
is a technique that was successfully employed by
Earn~\cite{earn-co}, for example, whereby a user will only see a
message if the sender has paid a minimum amount of money required by the
recipient.
\\
\noindent{\textbf{SSI Social Proofs}}. The fourth measure is to re-use a concept from Gaia-powered groupware to require
that a sender provide sufficient proofs in the SSI system that they are a
legitimate human being, and not a bot.  For example, a recipient can enforce a
default anti-spam policy whereby a sender must supply evidence in their SSI
account that they own at least five unique social media accounts, and that the
accounts undergo a minimum amount of activity.  This makes it hard to send
spam at scale because (1) the spammer would need to circumvent all of the social
media systems' anti-bot mitigations, and (2) if the spammer gets caught, they
have to register a new identity in the SSI system (necessitating a blockchain
transaction).  Since the blockchain itself grows at a fixed rate, and since
blockchain peers effectively bid on the ability to write new transactions, a
spammer could not easily register many identities without paying a high price
(i.e. the price gets higher the faster the spammer tries to register new
identities).  This allows the system to overcome the limits of prior proof-of-work
techniques~\cite{anti-spam-proof-of-work} which either had a fixed proof-of-work
threshold or a threshold that increased independently of the system's usage.

All four of these techniques would be implemented in part by the Acquire stage of the
mailer daemon's RG.  This ensures that all email clients automatically benefit
from these mechanisms without modification.

\subsection{Implementation}

The prototype system, SyndicateMail, is implemented in 4100 lines of Python and
1700 lines of Java.  It implements end-to-end encryption across multiple devices
and offers legacy compatibility with SMTP.

The system is being refactored to use the search indexing
logic from Gaia to implement search indexing in Syndicate.  The RG
driver runs the indexing logic as a subprocess in a \texttt{node.js} VM.  The
spam filtering is carried out simply by passing the text through an existing
spam-detecting system such as \texttt{spamd}~\cite{spamd} or \texttt{spam-assassin}
~\cite{spam-assassin}, and only forwarding the email text if it is not spam.

\subsection{Discussion}

In terms of the number of patches to write, it would be costly to implement this email system without
SDS.  Each email client would need to be patched to store its state to the
storage provider of the user's choice, whereas the use of Syndicate ensures that
storage services only need to be ported once.  By moving data signing,
encryption, decryption, and verification to the storage layer, 
and using Syndicate's SSI system to bootstrap key trust, Syndicate enables the
use of existing email clients with encrypted email without forcing
users to understand public-key cryptography.  By using gateways to represent the
capabilities of each device, the system is able to provide the convenience users expect from Webmail
without forcing them to manually copy private keys between devices.

Filtering spam and preventing accidental cleartext disclosure are problems
that require the system to inspect email contents on the user's behalf.  This
is achieved by having the user's RGs carry out this inspection locally,
instead of forcing the user to trust an external SMTP server to do so on their
behalf.  This is crucial to ensuring end-to-end message confidentiality, and
is required to be implemented at a layer \emph{beneath} email clients to ensure
that the user's choice of client does not alter the system's ability to ensure
message confidentiality and prevent spam delivery.  These problems
are both addressed by allowing the user to run application-specific aggregation driver
stages interposed between their personal devices and the rest of the network.

\section{CDN-accelerated Scientific Data Staging}

Scientific computing is increasingly conducted across multiple research groups.  Data is
generated and stored in the labs where a scientific instrument or dataset is
curated, and then shared across the world with collaborators.  Similarly,
collaborator labs publish their data analyses, which get downloaded by other labs
(and classrooms) for further consumption.

The third application presented here is to use 
Syndicate to implement a cross-site data processing framework that
allows scientists to take advantage of commodity cloud storage and CDNs to host
and deliver data to each lab.  For dataset curators, this reduces the task of
exposing a dataset to collaborators to running a Syndicate AG that can crawl the
dataset (with a dataset-specific driver) and serve chunks of it to downstream
UGs.  For dataset readers, this reduces the task of accessing a dataset to
fetching a dataset-specific Docker~\cite{docker} image that mounts the dataset
as a read/write filesystem backed by the dataset AG, an intermediate CDN, and
the user's personal cloud storage.

\subsection{Motivation}

The main motivation for considering an SDS approach to scientific data storage is
that due to the nature of the data they gather, each lab will have its own data curation
policies, its own unique data access patterns, and its own data-sharing policies.
There is not a one-size-fits-all approach for hosting scientific data, and labs will need to tailor
their storage systems to meet their specific needs (especially since their needs
change over time, depending on the nature of the data they produce).

This need to accommodate changing data storage and access policies is evident in
the evolution and wide success of state-of-the-art scientific storage systems
like iRODS~\cite{irods}, which offer
user-programmable policies (``rules'' and ``microservices'') that allow
individual scientists, project teams, and entire labs to programmatically
specify their curation policies and have them automatically enforced.
In fact, iRODS is considered to pioneer SDS concepts
(Chapter~\ref{chap:related-work}).

The scientific data-sharing framework uses commodity CDNs and cloud storage to help iRODS deployments
handle ``fan-out'' data distribution cases, where many labs across the wide
area want to read existing datasets and write back changes that will be
incorporated into the iRODS dataset.  CDNs would let individual iRODS
deployments scale up the number of reads they could service while preserving the
policies encoded in its rule sets and microservices.  Commodity cloud
storage would allow users to host the results of their computations and share
them with their lab mates and peers before generating and preserving a
``curated copy'' of the data back to iRODS.

\subsection{Challenges}

Augmenting existing systems like iRODS
with commodity infrastructure introduces challenges of
its own.  It is not enough to simply place a CDN in between iRODS and remote readers for
three reasons:

\begin{itemize}
\item \textbf{Protocol Incompatibility}.  CDNs are designed for Web content
acceleration, which means
using HTTP as the data delivery protocol.  However,
iRODS does not speak HTTP.  A protocol translation layer is required.
\item \textbf{Cache Thrashing}.  CDNs are designed for caching lots of
``small'' files---i.e. website assets like HTML or CSS that are not usually
gigabytes in size.  However, iRODS data can be extremely large, and clients may
only even want a small range of an iRODS file.  Serving iRODS data with a CDN
while getting good bandwidth will require file-level fragmentation and reassembly
on both the producer's and consumers' endpoints.
\item \textbf{Cache Incoherency}.  iRODS is a read/write datastore.
While some users are reading from a file, another user can be writing to it.
This can cause readers to cache corrupt data, which in turn gets served to
future readers by the CDN.  Avoiding this problem requires manual coordination
between readers, writers, and the cache operator.
\end{itemize}

In addition, sharing the results of local computations and generating a
dataset to write back to iRODS has its own challenges:

\begin{itemize}
\item \textbf{Replica discovery}.  Suppose a scientist reads some data from iRODS,
runs some local jobs on the data, and saves the job's results to the lab's shared Dropbox
folder.  How do the scientist's peers find the data, so they can run their own
analysis on it?  Today, they email the links to the peers or put the links on the lab
website.  However, this introduces a manual, tedious process for sharing data.  Can we
automate data discovery with Syndicate?
\item \textbf{Replica write-back}.  A scientist's collaborators do not always
have access to her lab's iRODS deployment.  How do her collaborators get their results incorporated
into her deployment?  More specifically, how do they discover a set of
authentication credentials to use to do so?  How does the data ingress server authenticate the
collaborator if they do not have an iRODS account?  Today, the solution is to find
and email an iRODS user with sufficient privileges and ask them to incorporate
the changes.  But can this be done automatically, without requiring users
in the loop?
\end{itemize}

As will be shown, these problems can be solved with the right configuration of Syndicate gateways.

\subsection{Role of SDS}

The need for software-defined storage in scientific computing is not new.  The
labs that gather and share scientific data must already do so according to
data-specific rules.  These include rules governing storage aspects like
national export controls, disclosure of proprietary or potentially dangerous information, and
even mundane concerns like ensuring the data appears in the correct format.

Prior to systems like iRODS, these rules had to be enforced either within the
scientific computing applications, or within a bespoke storage system.
Enforcing the same rules across many labs' applications poses a high cost of
coordination, since each lab's applications must be audited for compliance.
Enforcing a set of rules within a bespoke storage system requires constructing a
bespoke storage system for each rule set.  Allowing a storage system to have its
curation rules programmed at runtime without changing the application-facing
storage APIs is the ``sweet spot'' of SDS for scientific computing.

This Syndicate-powered scientific data-sharing framework extends an existing system (iRODS) with
Syndicate to allow existing workflows to take advantage of commodity infrastructure
(CDNs and cloud storage) without affecting the application-facing storage APIs.
Crucially, the data-sharing framework does so in a way that \emph{preserves} the data owner's existing iRODS
rules in a global setting, while allowing the owner to specify additional
rules within Syndicate to specifically control how data is disseminated once it
leaves iRODS.

\subsection{Design}

An iRODS system can store many different datasets, and each dataset can have its
own access control policies set by the owner.  These are enforced internally by
iRODS when other users attempt to access the data.

\begin{figure}[h]
   \centering
   \includegraphics[width=0.9\textwidth,page=25]{figures/dissertation-figures}
   \caption{Overview of CDN-accelerated scientific data.  The iRODS deployment
   is private, and accessible only via the AG and RG (which run in trusted
   networks).  Remote UGs leverage the MS and CDN to read cached but fresh data,
   regardless of the CDN's caching policies.  When UGs write data, they do so
   via the trusted RG which sends the changes to the proper datasets.  All the
   while, the AG keeps the MS metadata consistent with writes from non-Syndicate
   iRODS clients by subscribing to a (iRODS-specific) message queue.}
   \label{fig:chap4-syndicate-datasets}
\end{figure}

The strategy for distributing each dataset to remote readers is to allow an
iRODS user to ``export'' their dataset by way of a Syndicate AG, and later
``import'' changes to it by way of an RG.  Both the AG and RG run with the
permissions of the dataset owner (Figure~\ref{fig:chap4-syndicate-datasets}).

The AG crawls the owner's dataset using
an iRODS driver and exports individual file metadata to the Syndicate MS.
It runs within a demilitarized zone (DMZ) on the network, linking the iRODS data
to the outside world.  It acts as an origin
server to the CDN and uses its iRDOS driver to load and serve file data as blocks and
manifests to downstream readers.  A dataset owner can run many AGs, and can have
different AGs index different parts of the dataset.

The RG accepts inbound write requests from external users who want to
incorporate their changes to the dataset owner's files.  It also runs in the
DMZ, so it can receive inbound requests.  It uses the dataset owner's
credentials to access iRODS.

The dataset owner's AG and RG operate in \emph{separate} volumes---one for
distributing data to readers (backed by the AG), and one for accepting new data
from writers (backed by the RG).  A remote user would mount the AG-backed volume
for read-only access, and would mount the RG-backed volume for read/write
access.  The AG-backed volume is meant for sharing datasets with collaborators in
wide-area settings using commodity CDNs, while the RG-backed volume is meant for
granting privileged users the ability to save data back into the dataset.
While it is possible for a user to mount
both datasets on the same host, in practice they mount one or the other.

\subsubsection{Read Authorization}

Each AG acts as an origin server to one or more downstream CDNs.  It does not
have direct contact with UGs at the edge, which initiate access flows.  This is
acceptable for public datasets, where no read authentication is needed.

If read confidentiality is desired, the AG will encrypt its
manifests and blocks with the \texttt{serialize()} stage of its driver.  It uses each
UG's public key from the certificate graph to send it a shared secret.  The
encryption is deterministic, such that two requests for the same chunk will
resolve to the same ciphertext.  This allows multiple UGs to leverage the CDN
for read availability without the CDN being trusted with data
confidentiality---UGs reading the same chunk will fetch the same ciphertext, and
the CDN will only cache one copy of the chunk's ciphertext.

The two drawbacks of this configuration is that the CDN can still see access patterns
on the ciphertext, and anyone who can read from the CDN can fetch ciphertext.
This may allow unauthorized principals to infer information about the data being
accessed.  If scientists wish to avoid this outcome, then the solution is to use
a trusted CDN that will carry out authentication at the edge.

Regardless of the disposition of the CDN, the scientific applications are none
the wiser as to the authentication steps taken by the UG.  This is
because the UG's driver handles the interfacing with the CDN.  If the
UG needs a decryption key to read chunks (i.e. for read confidentiality), then
the volume owner can distribute them to the UGs by sharing it through the
certificate graph (encrypting the decryption key with each UG's public key).
The UG fetches and decrypts the key automatically, as part of its driver code.  If
the UG needs an access credential to the CDN (i.e. for metadata
confidentiality), then the volume owner can use the certificate graph to encrypt
and distribute it to remote UGs and write the CDN driver to submit the
credential to the CDN on read.

\subsubsection{Write Authorization}

Each RG acts as an aggregation point for data generated by external UGs.
Unlike the read path, the UGs have direct contact with the RG on the write path.
As such, the UG's Push stage can ensure data confidentiality across the wide
area simply by contacting the RG via a TLS channel using client-side certificates.  The UG and RG configurations in the
certificate graph would be structured to include TLS keys and certificates for each
gateway, so the RG could authenticate UGs and UGs could authenticate the RG.

Only the RG has write access to the iRODS deployment.  The volume owner installs its
iRODS credentials via the certificate graph as well, ensuring that they are
confidential and up-to-date by encrypting them with the RG's public key.

The RG has the ability to perform write authorizations on a per-file and even a
per-file-region basis.  This is because the UG informs the RG which file it
writes to (i.e. as part of the manifest and block information it Pushes), and it
informs the RG whenever it renames or truncates a file.  The RG has the ability
to NACK operations that do not conform to the volume owner's policies.  For
example, a UG may be denied a request to rename a file into a separate
\texttt{\$HOME} directory in order to prevent users from gaining control of
parts of the dataset.

\subsubsection{Preserving Existing Policies}

From iRODS's perspective, the AG and RG are the only readers and writers to a
particular dataset.  Moreover, their reads and writes reflect the global
sequences of reads and writes initiated by wide-area users.  As such, the
volume owner retains the power to \emph{globally} enforce her iRODS-specific rules on data
accesses---any gateway-initiated accesses must also conform to the
already-deployed iRODS rules.

The volume owner can also set \emph{per-user} rules within her
gateways' drivers.  Importantly, iRODS does not need to be aware of the
Syndicate users, nor aware of the per-user policies that the gateways enforce,
since they occur in a separate layer.

As a result, scientists do not need to do any extra work or carry out any extra
configuration steps to begin sharing their iRODS-hosted data with off-site users.
All of their iRODS-local access policies continue to apply, and the scientist
has the \emph{option} of creating more-detailed rules within Syndicate.  The
only step the scientists must take is starting up a publicly-accessible AG and
RG, which will read and write to their datasets on their behalf when remote
users request it.

\subsection{Implementation}

The prototype data-sharing framework employs a variant of the
Akamai~\cite{akamai} CDN deployed on
OpenCloud~\cite{opencloud}---a federated computing platform similar to PlanetLab.
The CDN is operated by the OpenCloud developers and is made available to all
participating sites.

Several AGs have been deployed on top of the iRODS deployment at the University of
Arizona, and registered as origin servers on the OpenCloud-hosted CDN.  In
addition, several Docker images~\cite{syndicate-website} and a dataset-mounting
tool~\cite{sdm} have been made 
available for the general public to try out the system.

When a user downloads and installs a read/write Docker image, she receives two
mounted volumes---the read-only volume containing the dataset, and a read/write
volume for writing the results of her experiment.  She and her collaborators
will see each other's results when they are written.  Several RGs are deployed
that will write back her and other users' results both to iRODS and to a temporary S3
bucket that gets cleared every 24 hours.

iRODS-compatible applications interface with iRODS through FUSE and through a
set of command-line utilities.  The framework comes with a FUSE filesystem to preserve
compatibility.

A Hadoop filesystem plugin has been implemented that allows Hadoop
computing jobs to pull data from iRODS via Syndicate and the underlying CDN.
The HDFS plugin gives the job scheduler insight as to where Syndicate UGs have
locally cached chunks, so it can schedule tasks on nodes that already have the
data present.

The iRODS-facing driver is ~1900 lines of Python, of which ~640 are specific to
iRODS, ~375 are specific to the AG, and ~375 are specific to the RG.  The
UG-specific CDN interfacing driver is ~100 lines of Python.  The Hadoop plugin
is ~2300 lines of Java.

\subsection{Discussion}

Again, the utility of using SDS to link existing scientific data stores to
commodity cloud infrastructure is that SDS allows users to ``slap
on'' extra storage and data distribution capacity with little effort.  The
marginal cost of adding support for new storage and CDNs in terms of lines of
code is small enough that it can be achieved in as little time as a couple
of hours, including testing.

The key benefit to iRODS users in particular is that the iRODS system needs no
special modification to be made compatible with the CDN.  This is because
SDS effectively ports the CDN and storage to iRODS, instead of the other way around.
In doing so, iRODS-compatible applications (and even applications that only use
iRODS indirectly, such as Hadoop jobs) can transparently benefit from its extra
availability without having to overcome these aforementioned challenges.

\section{Remarks}

In these sample applications, using an SDS system to host data addresses
several difficult problems.  In all three examples, 
\emph{SDS provides semantic independence from storage systems} (i.e.
portability).  By both wrapping individual services behind well-defined service driver models, and by
allowing the volume owner to inject aggregation driver logic in-between the
application endpoints and the underlying services, the SDS systems ensured that applications
had access to a persistent data store that behaved exactly they way they needed
\emph{as if the application was using a purpose-built storage system}.

In all cases, the desired application-specific storage semantics were realized
by implementing gateways to handle unrelated storage concerns, and composing them
together into handle access and mutate flows.
In the Gaia groupware-powered applications, users host their data on whatever storage providers they want
while deploying gateways to enforce arbitrary access controls and maintain globally-visible search
indexes.  In the email example, users host
their email in whatever storage providers they want and deploy gateways to
implement spam filtering, search indexes, message prioritization.  In the scientific
data-sharing example, users augment existing storage systems with the CDNs and cloud
storage of their choice while deploying gateways to preserve the original system's global access
control rules and end-to-end consistency semantics.  Being able to compose
gateways was crucial to preserving storage semantics across organizations, since
in all cases there were sensitive operations like access controls and encryption
that could only be allowed to execute on certain computers.

The separation between aggregation drivers and service drivers proved useful in
practice.  Having these two layers of indirection allows service
drivers to be reused across different applications, and even across different SDS systems.
In doing so, \emph{SDS reduced the marginal cost of adding support for new
services to writing a single driver}.
Neither the aggregation driver logic nor the application need to be
patched when a new service becomes available.  Instead, a volume owner only
needs to deploy an updated service driver to her relevant gateways.

A slew of engineering problems were solved by
designing the storage layers of these applications to treat users as the
de-facto data owners.
For groupware, this enabled gateways to authenticate and vet new users who would
share their groupware data.  For email, this enabled gateways to discover the
sender and recipient public keys automatically and preserve end-to-end
authenticity and confidentiality.  For scientific data sharing,
this enabled gateways to identify and preserve dataset access controls regardless of the infrastructure used
to ship the data to external consumers.