-
Notifications
You must be signed in to change notification settings - Fork 0
/
related.tex
476 lines (396 loc) · 26.3 KB
/
related.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
% TODO: other system-of-systems that solved end-to-end semantics on top of cloud
% services, or solved organizational autonomy, but not both.
% i.e. cloud services, end-to-end semantics, org autonomy: pick at most 2
\chapter{Related Work}
\label{chap:related-work}
The SDS approach described in this thesis
is synthesized from ideas from several different bodies of work. Individual
concepts in SDS are based on time-tested design principles and engineering
techniques that have seen widespread usage. The main contribution is
a new way to apply many existing principles in a coherent fashion
to address long-standing challenges in
the design of wide-area networked applications.
\section{Software-defined Storage in Industry}
% SDS
Over the course of the development of this thesis, the term ``software-defined
storage'' has emerged as a marketing term in the software
industry with a different meaning than the one put forth in this thesis.
In the industry, SDS refers to software that implements some form of
network-wide storage virtualization or storage emulation
in the context of a single organization (e.g. company or datacenter)~\cite{techcenter-sds-definition}.
Industry offerings that refer to themselves as ``software-defined storage''
focus on implementing datacenter storage hardware as software, thereby
decoupling datacenter tenants and operators from specific vendors. This is a
complementary to but fundamentally different problem from this work, which
focuses on preserving \emph{wide-area} applications' storage semantics in the face of
changes to underlying storage systems. This work addresses the problem of
preserving storage semantics \emph{across} multiple organizational boundaries, whereas
prior work focuses on applying storage provisioning policies \emph{within} a
single organization.
\subsection{Virtual Block Devices}
One category of industry offerings that refers to itself as ``software-defined
storage'' focuses on implementing programmable block devices.
Recent work on datacenter storage networks have focused on providing
abstractions similar to software-defined networking (SDN) to manage VM disk I/O queues. In
IOFlow~\cite{ioflow}, virtual block devices
are mapped into VMs as virtual hard drives, and the datacenter implements a
centralized storage control plane and distributed data plane to shape I/O traffic to and
from storage servers.
Some storage vendors have begun to refer to their existing iSCSI, NAS, and SAN offerings as
``software-defined storage''~\cite{computerweekly-storagebuzz}. In particular,
they tout the ability to run the storage network controllers in
software (whereas previously, they had been implemented on dedicated hardware).
Another work that refers to itself as software-defined storage
is software-defined flash~\cite{sdf-baidu}, where
datacenter applications interact with solid-state disks (SSDs) via a
software-defined flash controller. This work focuses on improving utilization
of the hardware by allowing the application to directly control aspects of the
hardware that are typically left to the device driver or firmware (i.e. I/O scheduling,
channel utilization, provisioning, etc.).
While there is some conceptual overlap with the wide-area SDS work in this
thesis insofar as applying policies on storage data flows,
the bulk of this prior work focuses on applying policies on shaping IOP traffic. There
is minimal focus on preserving application-level semantics. Since these systems are
always under the control of a single organization (i.e. the datacenter), there
is no need for them to worry about preserving organizational autonomy.
\subsection{Storage System Emulation}
Other industry offerings that refer to themselves as ``software-defined storage''
focus on emulating existing storage systems, instead of specific pieces of
hardware. This allows tenants to move from one datacenter to another without having to
rewrite the storage interfacing logic. For example, industry offerings exist to
provide compatibility with NFS, CIFS, SMB, and Amazon S3 (contemporary
offerings include those from
Veritas~\cite{veritas}, Scality~\cite{scality}, Acronis~\cite{acronis}).
Like this prior work, the work in this thesis implements storage system emulation
through service drivers. The difference between prior work and this thesis is
the use of aggregation drivers to allow \emph{multiple} cloud storage systems to
be combined while preserving end-to-end semantics.
\subsection{Storage Abstraction and Virtualization}
Cloud storage gateways~\cite{cloud-storage-gateway} are type of
network appliance that sometimes refer to themselves as ``software-defined storage.''
They allow an organization to apply certain data management
policies on organization-originated data bound for cloud storage. These
policies include transparent compression, de-duplication, encryption, access
logging and so on.
The wide-area SDS gateways described in this thesis act as ``virtual'' cloud
storage gateways after a fashion, in that they apply local data transformations in the form of
an aggregation driver stage. Unlike cloud storage gateways, SDS gateways exist
entirely in software and can be provisioned, duplicated, migrated,
reprogrammed, and arranged in a particular network topology on-the-fly.
Software compatibility libraries like Apache libcloud~\cite{libcloud} and various
storage-specific userspace
filesystems~\cite{s3fs}~\cite{dropbox}~\cite{google-drive-fs} try to
provide a uniform interface for accessing disparate cloud storage resources.
This is equivalent to what service drivers do in SDS. However, these systems only provide a uniform
\emph{syntactic} interface (i.e. a filesystem). The storage semantics are
different for each back-end system---even though two software compatibility libraries have the same
interfaces, they cannot be used interchangeably.
\section{Operating System Storage Principles}
SDS isolates application design from both the syntactic and semantic interfaces
of underlying storage systems, and facilitates code-reuse by allowing
its components to be recombined to create new functionality. This concept is
inspired by well-established operating system design principles.
\subsection{Operating System Storage Design}
The designs of virtual filesystem abstractions in various points of UNIX's evolution
~\cite{vnodes-sun-1986}~\cite{netbsd4.4-vfs-1995}~\cite{plan9-filesystem}~\cite{freebsd-design-book}
address this concern. Like SDS, they introduce two layers of indirection
---a set of filesystem drivers that overlay a set of block device drivers---to
isolate single-device semantics from cross-device semantics. This is analogous
to the concept of service drivers and aggregation drivers being logically
distinct abstraction layers. One layer provides a common ``narrow waist''
between the hardware and API, and the other layer implements storage semantics
for a host of applications ``on top'' of the narrow waist.
The presence of two layers of indirection can also be found in the storage
architectures of other multi-user non-UNIX
operating systems like VMS~\cite{vms-driver-model}, Microsoft
Windows~\cite{ms-windows-driver-model}, and IBM System z~\cite{ibm-vsam}.
This is an emergent design pattern in storage systems with multiple
back-ends, and this work in SDS echos this pattern.
One task performed by SDS systems is storage virtualization. Synthesizing
logical volumes from multiple devices~\cite{lvm} has been a mainstay in most UNIX-like
operating systems, and hierarchical storage management~\cite{hp-hsm} has been used
in production in mainframes and workstations for decades. This is a use-case that can also be
fulfilled by SDS with the right storage drivers.
\subsection{Composable Software Systems}
SDS allows developers to construct complex storage systems by composing
unrelated stages of different aggregation drivers into a single driver.
This design principle is similar to the UNIX design philosophy~\cite{unix-design-philosophy},
stackable filesystems~\cite{stackable-filesystems}, network function virtualization systems
like ONOS~\cite{onos},
and programmable network processors like the Click modular router~\cite{click-modular-router}.
SDS aggregation drivers are constructed by chaining together a sequence of
reusable stages to form a program that controls the system's
end-to-end I/O processing. This is analogous to how UNIX programs are built
by chaining together unrelated programs into pipelines, how stackable
filesystems can be chained together to implement complex storage semantics, how
how Click router components can be chained together to implement complex
packet-processing programs, and how network functions are chained together to
implement complex network-wide processing.
In all cases, the developer synthesizes new functionality by swapping existing
modules in and out, as opposed to writing bespoke code for each use-case.
SDS extends this idea to multi-user, multi-network settings, and shows how different
composable parts can be combined by developers
while operating in separate organizations.
\section{Secure Code Deployment}
A major operational facet of SDS systems is the ability for volume owners
to specify and upgrade drivers at runtime. This allows them to preserve the
storage semantics of their volumes in face of changes in the underlying
services. But in order to do so, SDS systems must offer a way to securely
deploy new driver code at run-time, without interrupting the running system.
\subsection{Secure Code Deployment}
The concerns addressed by prior work like Stork~\cite{stork},
Raven~\cite{raven}, and The Update Framework~\cite{TUF}
revolve around ensuring end-to-end software authenticity and
freshness, so that the remote hosts deploy the code the owner specifies without
having to trust any intermediate repositories.
SDS must address this concern as
well in the deployment of driver code and configuration. This work in this
thesis operate under the additional constraint that upgrades must be atomic with
respect to all ongoing application-level operations.
\subsection{Secure Remote Code Execution}
The aggregation driver only executes successfully if all of its stages
execute successfully, even though they run in separate organizations. This
introduces the problem of verifying that the remote processor executed the code
as prescribed. Prior work in secure remote execution focuses on new
computational techniques, like implementing
homomorphic encryption~\cite{homomorphic-encryption}, preserving auditable
execution traces~\cite{versum}, or relying on trusted hardware extensions in
the remote computers~\cite{intel-sgx}.
SDS gateways complement this prior work in multi-host, multi-network settings by
ensuring that each gateway has a coherent view of the other gateways invoking
its driver code. A data flow only executes if all affected gateways can first
verify that they each know what code each other gateway is running. It is
possible to construct SDS gateways that employ a secure computation techniques
in this related work, and in doing so, arrive at a data flow processing algorithm
whose execution can be audited end-to-end by concerned users. However, this is
the responsibility of the gateway implementation's driver runtime environment.
\section{Software-defined Networking}
Software-defined networking (SDN) addresses similar types of problems for
network policy as SDS addresses for data policy. SDS builds on several design
techniques pioneered in SDN systems.
\subsection{Control and Data Planes}
Early SDN systems like
4D~\cite{4D} and Ethane~\cite{ethane} introduced the concept of separating a
distributed data-processing plane from a logically-central control plane.
This allowed SDN operators to specify top-down, globally-scoped policies for controlling
network traffic---a key innovation over earlier work in active
networking~\cite{road-to-sdn}. SDS applies this concept by
separating data flow processing from configuration management, whereby the
volume owner operates the logically-central control plane for the
volume by manipulating a certificate graph.
Another key concept introduced by 4D is the notion of
dedicated subsystems for discovery of network elements and rule dissemination to
them. SDS's use of a self-sovereign identity system and certificate graph
addresses the same kinds of problems, but for users and gateways instead of network
elements.
SDN systems encourage open interfaces between their control and data planes.
For example, the OpenFlow specification~\cite{openflow} describes interfaces for
network elements to implement in order to participate in an SDN system.
This removes a barrier to entry by allowing control planes and data planes
to be implemented independently, leading to a proliferation of different network operating
systems~\cite{onos}~\cite{NOX}~\cite{stratum}~\cite{bigswitch}~\cite{road-to-sdn}.
SDS extends this concept for
storage by encouraging the development of many different type of gateways, which
can be tailored to individual applications while remaining compatible with an
existing SDS system. Gateways share a common interchange format (chunks), but
allow developers to independently implement their own APIs and custom behaviors.
This was leveraged implement email-specific replica
gateways on top of Syndicate, for example.
\subsection{Ease of Programmability}
More recent work in SDN programmability, such as Frenetic~\cite{frenetic} and
Pyretic~\cite{pyretic}, focuses on allowing developers to write a global flow
control program in a familiar language that compiles into individual flow rules.
This work takes a similar approach in the design of Syndicate's aggregation
drivers, whereby a single driver program is automatically distributed and
executed piecemeal across all of the volume owner's gateways.
\section{Peer-to-peer Storage}
SDS systems distribute data chunks in a peer-to-peer fashion, but use
a logically central metadata service to discover and route requests. In
addition, they employ a ``trust-to-trust'' user discovery
layer~\cite{trust-to-trust-principle} which SDS elements use to bootstrap trust
in one another. Prior works in distributed storage systems have faced similar
challenges to the ones that necessitated these SDS design elements, but have
addressed them in different ways.
\subsection{Content Discovery}
Content discovery is an important aspect of distributed storage design.
Prior work in scalable distributed storage
systems~\cite{berkeley-xFS}~\cite{farsite}~\cite{zebra}~\cite{spritefs}~\cite{glusterfs}~\cite{lustre}
has shown the utility of implementing separate metadata servers from data
servers. This helps system operators to better manage access control and
consistency by placing the logic to do so within one system component.
SDS systems employ a metadata service to achieve the same end, but
such that metadata servers are not part of the trusted computing base.
In addition, SDS elements may be programmed to
validate the metadata via application-specific criteria.
In wide-area systems that span multiple organizations, a key difficulty with
addressing content discovery is ensuring that the system can operate under
organizational churn. The solutions in prior work depend on who manages the content discovery
mechanism.
\subsection{Single-stakeholder Content Discovery}
Wide-area storage systems like Oceanstore~\cite{oceanstore}, Pond~\cite{pond},
and Bonafide~\cite{bonafide} implement content discovery by relying on a
federation of BFT nodes, which perform write admission control and write
serialization. While they are all able tolerate the failures of other storage
elements, the users of these systems do \emph{not} participate in content
discovery and instead trust the federation to not be faulty.
SDS systems like Syndicate offer a more flexible approach. While Syndicate
relies on a central metadata service for content discovery, the service is only
trusted to keep data available. Moreover, each application, through the use of an aggregation driver, can
program its volumes to validate system metadata in arbitrary ways. This allows
the application to seamlessly control how much trust it places in metadata
services outside of its control. For example, in the limit
the set of Syndicate gateways can maintain their own replicated log of metadata
writes through their aggregation driver,
and use the log to monitor the metadata service for faults.
\subsection{Multi-stakeholder Content Discovery}
Prior wide-area storage systems like Shark~\cite{shark}, CoralCDN~\cite{coral}, Vanish~\cite{vanish},
OpenDHT~\cite{opendht}, and BitTorrent~\cite{bittorrent} are designed with
\emph{multi-stakeholder} content discovery mechanisms. Anyone can stand up additional
nodes to help with content discovery.
A key improvement in multi-stakeholder content discovery systems offered by SDS
is the use of a blockchain as a shared source of truth between storage elements.
All of the aforementioned prior works rely on DHTs or DSHTs~\cite{dsht} in order to
scale the number of records, and work by distributing routing information
evenly across a number of hosts while tolerating node churn and supporting
fast queries.
The drawback with this approach is that they are vulnerable to Sybil
attacks~\cite{sybil-attack} and route-censorship
attacks~\cite{dht-route-censorship}. This can cause to a loss of routing state,
and can possibly cause invalid or malicious routing state to be used.
The two SDS systems in this work avoid these problems by ensuring each node has a 100\% replica of the
routing state (within the SSI system), and ensure that the state size grows at a fixed rate by relying
on a public blockchain as a rate-limiter. As a result, gateways in the
prototype SDS systems can discover one another and one another's content as long
as at least one SSI node is reachable.
\subsection{User Authentication}
Peer-to-peer systems are often multi-user systems. To support multiple users,
they need to perform some form of user discovery and authentication.
Most network filesystems systems that run within a single organization (like
NFS~\cite{nfs} and GFS~\cite{gfs}) use a trusted, centralized user directory,
which allows users and administrators to enumerate user accounts and assign them
easy-to-remember names without name collisions.
Federated filesystems like AFS~\cite{afs} and Farsite~\cite{farsite} use a
trusted, existing public-key infrastructure to discover users in a similar
fashion.
Other systems try to do without a centralized user directory, but at the expense
of removing human-friendly user identifiers. For example,
SFS~\cite{sfs} eschews user enumeration by addressing this problem with
self-certifying paths, where users are identified by public keys.
Others like WheelFS~\cite{wheelfs} punt on user discovery altogether, and
require each operator to curate the public keys of trusted users out-of-band.
By relying on a self-sovereign identity system, SDS systems allow users to
discover each other's public keys without a centralized user directory, and
without introducing human-unfriendly names. Certain PKI systems like
attribute-based encryption~\cite{ibe-shamir}~\cite{ibe-weil} try to enable this, but have the significant
drawback that each user must re-key if a single private key is compromised.
\section{User-defined Storage Semantics}
Different applications need to make different trade-offs in their storage. To
accommodate this, prior work has provided
control points to help them make these trade-offs. SDS systems take
this idea to its logical conclusion, where the application itself specifies a
portable, reusable driver that defines its end-to-end semantics.
In their simplest forms, systems that offer user-defined consistency do so by
allowing the user to choose from a handful of built-in semantics that all reads
and writes to the data will follow. This includes systems like
WheelFS~\cite{wheelfs}, PRACTI~\cite{practi}, and
Bayou~\cite{bayou}, where the user can tag files and sessions as needing to
adhere to a certain consistency models have a certain degree of
durability.
Some storage systems try to resolve write conflicts by deferring to user
decisions. These include version control systems like
subversion~\cite{subversion} and git~\cite{git}, and file storage systems like
Dropbox~\cite{dropbox}.
Other storage systems like Coda~\cite{coda}, COPS~\cite{cops}, and
Ori~\cite{ori} allow the user to supply a conflict-resolution algorithm for
handling conflicts. The algorithm is later used by the system in order to
resolve conflicts between replicas.
Systems that need high availability or high durability allow clients to choose
replica placement. These include programmable CDNs like
CloudFlare~\cite{cloudflare} and Akamai~\cite{akamai}, as well as
high-availability cloud storage like S3~\cite{s3}. Open-membership storage
layers that offer this include BitStore~\cite{bitstore} and Storj~\cite{storj}.
In the context of SDS systems, developers materialize systems that address
\emph{all} storage concerns within a single storage element---the aggregation
driver. This allows them maximum flexibility in addressing storage concerns.
SDS helps developers manage the associated complexity and development overhead
of doing so by providing an aggregation driver specification that facilitates
component reuse and incremental deployment.
\section{Applications}
This thesis described three sample SDS-powered applications, all of which have been
implemented in prior work but with significant constraints. Re-implementing
them with SDS helped to remove many of these constraints.
\subsection{Encrypted Email}
PGP~\cite{pgp} has long been the ``gold standard'' for encrypted communication
over email. It works by allowing users to send and receive encrypted messages
over SMTP. However, multiple usability
studies~\cite{why-johnny-cant-encrypt}~\cite{why-johnny-still-still-cant-encrypt}
have shown that users have a hard time interacting with cryptographic key pairs.
Email clients like Enigmail~\cite{enigmail} and Mailvelope~\cite{mailvelope}
attempt to alleviate some usability problems, but ultimately require users to
actively participate in key management and key discovery. All PGP-based email
requires both the sender and recipients to participate in order to realize
message confidentiality and authenticity.
The Syndicate-powered email application differs from prior work in that it removes the
need for humans to manage keys. In doing so, it provides a user experience
comparable with Webmail. The underlying SDS gateways automatically encrypt and
decrypt messages on endpoints, and provide multiple options for communicating
with legacy SMTP email users.
\subsection{Groupware}
Software that helps groups of users work on a shared task has been a significant
computer application since the late 1980s, with many early works focusing
on various ways to allow clients to interact with shared state on a server~\cite{readings-in-groupware}.
Many groupware applications, such as video-conferencing, chat rooms,
and document-sharing have subsequently been realized as Web applications.
Examples today include Microsoft Office 365~\cite{microsoft-apps}, Google
Docs~\cite{google-docs}, and Slack~\cite{slack}.
An architectural mainstay of most groupware systems is that they follow a client-server model.
The users run the client software on their computers (e.g.
as a Web page in a Web browser in contemporary systems) to interact with shared state hosted on one or
more servers. Clients are not assumed to be reliable, and do not host any
authoritative state. Instead, servers store the authoritative state of the system and
present clients with views of it.
What this means for groupware implementations is that most of the application's business logic runs on the
servers. This is necessary in order to address global data management
concerns, such as access controls, concurrency handling, and
replication. Since only the servers process operations on authoritative state, only
the servers are in a position to handle these concerns. In doing so, groupware
servers pose a single point of failure---if a groupware server fails, clients
cannot interact with their data.
The serverless groupware library built on Gaia
avoids this issue by separating the business logic from the storage logic.
Gaia still provides groupware applications with the client-server computing model,
but such that the role of the server is reduced to loading
and storing opaque blobs of data. Gaia instead
moves application business logic into an aggregation driver, which can be
deployed, upgraded, and migrated across a dynamic set of gateways at runtime.
This provides a degree of operational flexibility not seen in prior groupware
implementations---a volume owner can transparently migrate the groupware from
one storage provider to another (to tolerate changes in service providers),
and from one set of gateways to another (to tolerate changes in trust relationships and
changes in host availability).
\subsection{Scientific Data Set Staging}
Related work on sharing scientific data across multiple networks includes
work on dedicated sharing and transfer services like Globus~\cite{globus},
hosting data in highly-available datacenters like ABoVE~\cite{nasa-above} and
Cyverse~\cite{cyverse}, and relying on an array of network caches to distribute
data from one data origin to many downstream readers (like
CernVMFS~\cite{cernvmfs}).
Scientific data transfer services allow labs to explicitly share and copy
data from one site to another. Globus allows scientists to pipe data between
servers that the requester can access, and allows scientists to share data from
commodity cloud storage to external data consumers. Like Globus, Syndicate
encourages reusing commodity cloud storage services for hosting data, and
leverages existing identity management services to authenticate data and
transfer requests between organizations.
This thesis is not the first to propose using a network of commodity caches to help
distribute scientific data. CernVMFS~\cite{cernvmfs} allows scientists at CERN
to share data with the world. CernVMFS implements a catalog service that
functions like the Syndicate MS and AG in that it provides an index over the
available datasets, which can be fetched out of band and used to read the data
itself through the caches.
This work extends this concept by supporting reads and writes while preserving
cache coherency. Unlike CernVMFS, the Syndicate-powered data-sharing framework
allows the upstream datasets to be written
to at runtime, while there are ongoing reads. The AG and MS allow readers to
discover the latest data without having to rely on consistency hints from the
caches (such as Etags or Last-Modified HTTP headers).