\chapter{Applications}
\label{chap:applications}
This chapter presents three applications built with Syndicate and Gaia.
In all cases, the ability to control end-to-end semantics within SDS
(instead of the application) enables
developers to tackle difficult data management
problems in ways that preserve both backwards-compatibility with existing
applications and forward-compatibility with future storage features.
Applications do not need to be modified to leverage
new commodity services, and data flows and gateway placement let developers
consistently solve data management problems across multiple applications.
\section{Serverless Groupware}
% Groupware with Gaia
Groupware is a common category of Web application that allows users to
collaborate via data-sharing. Groupware applications include shared to-do lists, calendars,
documents, contact lists, and so on. Multiple users read and write to the same
storage medium in order to coordinate their activities.
The data storage story for groupware today requires each user to be able to see
a consistent view of her data, regardless of which of her devices read or write
it. Since groupware is often used in sensitive
settings such as corporations, users have an expectation of privacy---by
default, their state is only visible to their devices. Users must
\emph{explicitly} share data with other users (or the public), and if they do
so, their shared data is visible to all other users on all of their devices.
In conventional groupware software, this is achieved by running a shared server.
The users in the same user group have read and write access to the server's
state, and the server resolves conflicts between writes and enforces access
controls. In addition, the server takes advantage of its global view of the
users' state to build up derived state like edit histories and
backups. From a data policy perspective, all users trust one
organization composed of the server and all of the user groups'
devices.
In multi-organization settings, or in settings where users do not directly know
one another, implementing shared groupware is more challenging. Each user (or
subgroup of users) has different policies regarding how their data is to be shared.
For example, a user's personal calendar should not be shared with work
colleagues. What is needed is a groupware system where users can \emph{self-organize} into
user groups with which to share data, in a way where users can easily
authenticate one another and establish trust relationships with minimal
coordination. This is achieved with a Gaia groupware library.
The groupware library differs from existing groupware software in two key ways.
First, it lets each user host their data on whichever cloud services (or
servers) they choose, while preserving end-to-end storage semantics for
groupware applications. Second, it gives each user the ability to vet each
other user in the system by having users prove ownership of existing social
media accounts. This latter feature allows users to self-organize into their
own per-application organizations with minimal coordination. By posting
machine-checkable proofs-of-ownership on social media that are cryptographically
linked to accounts in Gaia's SSI system (henceforth referred to as ``social
proofs''), a user can easily vet other users when deciding to share groupware
data with them. For example, users can leverage social proofs to prove that
they work in the same company, or go to the same school, or have the same shared
interests.
\subsection{Motivation}
Groupware software falls into two categories: in-house groupware servers that
the users of an organization must maintain themselves, or outsourced groupware
services that run on third-party servers. There are undesirable
trade-offs for both types of groupware. In the first case, users incur an
ongoing operational cost for keeping the software up-to-date and keeping the
server running. The advantage, however, is that they unilaterally control all
aspects of the server's data storage---including how often it gets backed up,
who can view the data, what kinds of derived state it makes, what version(s) of
the software it runs, and so on.
The second type of groupware is increasingly popular. Companies like Microsoft
and Google each have suites of software-as-a-service offerings that take the
operational responsibilities out of the user's
hands~\cite{gapps}~\cite{microsoft-apps}. The advantage is that the
SaaS offerings have potentially higher uptime and are managed by experts, and
are available at a predictable cost to users no matter how easy or hard they are to
maintain. The downside, however, is that the SaaS provider has global
visibility into the users' data, regardless of the users' desired privacy
settings. If the SaaS provider is hacked, the users' groupware data can be exposed
to the public. If the SaaS provider goes out of business, the groupware data
can be lost forever. If the SaaS provider changes its API, then any custom
integrations with the platform break.
There does not exist a middle ground where users can share their data in a way
that is convenient for them (like what SaaS offers), but with the policy
controls they would get by running an in-house groupware server. The serverless
groupware library for Gaia fulfills this need.
\subsection{Role of SDS}
Gaia enables the best of both worlds. Users get all of the
operational convenience of SaaS with the privacy and data controls of having
their own servers. Importantly, Gaia allows users to select whichever storage
providers they want without affecting the design of the groupware software.
In addition, ancillary functionality like search indexing can be
implemented in Gaia gateways and reused in other applications by way of the
global relational database design pattern described in the previous chapter.
The users rely on Gaia's SSI system to bootstrap data confidentiality and
authenticity. The gateways in Gaia ensure that all data is signed and encrypted
when it leaves the device, such that only the user's designated recipients (if
any) can view it. In addition, the groupware software uses Gaia to ensure that
applications are isolated from one another at the volume level---an application
client can only access application-specific state.
A key operational concern of groupware systems is that they must only allow
users to view one another's data \emph{with the owner's permission}. Gaia's
gateways enable this by allowing users to implement data-specific checks when sharing
data. This is achieved by giving users the ability to create and vet one
another's social proofs. Importantly, the social proofs are verified
automatically by the software and presented to the user as part of the
permission-granting user experience.
\subsection{Design}
The groupware software is designed to run within the Web browser. The
application logic runs as a Web page, and loads and stores the user's
credentials and data via a co-located Gaia node. This decouples the
user experience and application functionality from the user's shared
storage concerns. For example, one user can store their data on Dropbox and
another user can store theirs on Google Drive, but the application can access
each user's data regardless via the Gaia node. A system overview is given
in Figure~\ref{fig:chap4-gaia-groupware}.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=23]{figures/dissertation-figures}
\caption{Design of serverless groupware with the Gaia SDS system. Alice
lists signed certificate graphs in her SSI user account data, as well as the
list of her personal devices' public keys and social proofs. While Alice can
write to her storage from her private Gaia nodes, she can make her data
available via a public Gaia node as long as her SSI account contains enough
social proofs that she is a valid application user. Bob
uses this public gateway to discover and read her shared data.}
\label{fig:chap4-gaia-groupware}
\end{figure}
\subsubsection{Setup}
A user receives a volume for each groupware application she uses. When she signs up for a specific
application, the groupware software inserts an application-specific set of keys
into the user's SSI account information, indexed under the application's name. To provide
confidentiality, the user has the option of encrypting this routing information
such that only her trusted peers can discover that she uses it. Her other
devices and other users' devices inspect her account in the SSI system to determine
which keys to use to authenticate the data she writes, as well as discover how
to access her storage (i.e. which Gaia nodes to contact, which storage providers
to contact, etc.).
\subsubsection{Sign-in}
The groupware software employs device-specific keypairs to allow the user to
sign in via multiple devices. When the user signs in for the first time, her
device creates a volume for her and registers \emph{all} of her devices as
belonging to the same volume owner. Then, when the user signs in from a different device, she can
still read and write data to her existing volume and administrate it.
The software ensures that her devices are aware of each other via a
``delegation record'' in her SSI account. The delegation record lists all of
the user's device IDs and their public keys. This way, when the user creates a
new volume, the software automatically grants all devices the volume owner
privileges. To the user, it appears that they simply began using the app from a
separate device, just as they would have had it been a conventional Web
groupware application.
If the user wants to add or remove a device, she must re-generate her delegation
record with the current set of device public keys.
To do this securely, the software requires a quorum of signatures from a trusted
subset of her devices (configurable by the user).
A delegation record will only be considered valid if it
is accompanied by a sufficient number of signatures from this trusted device
set. For example, a user might require a signature from two of three of her
devices in order to add a fourth device or remove the third, and in doing so
tolerate the loss of one of her three devices. This way, the user can control
which devices are allowed to write to her data while tolerating the loss or
compromise of a pre-configured set of them.
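The quorum rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the library's actual code: the \texttt{DelegationRecord} class, its serialization format, and the \texttt{verify} callback are hypothetical names introduced here, and a real deployment would supply a genuine signature-verification routine.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical verifier signature: (public_key, message, signature) -> bool.
# The actual signature scheme is not specified here; plug one in as needed.
Verifier = Callable[[bytes, bytes, bytes], bool]


@dataclass
class DelegationRecord:
    """Illustrative stand-in for a delegation record: device ID -> public key."""
    device_keys: Dict[str, bytes]

    def serialize(self) -> bytes:
        # Deterministic encoding (assumed format) so all signers sign the same bytes.
        return b";".join(
            dev.encode() + b"=" + key.hex().encode()
            for dev, key in sorted(self.device_keys.items())
        )


def quorum_met(
    record: DelegationRecord,
    signatures: Dict[str, bytes],
    trusted_keys: Dict[str, bytes],
    threshold: int,
    verify: Verifier,
) -> bool:
    """Return True if at least `threshold` trusted devices validly signed `record`.

    `signatures` maps device ID -> signature, so each device counts at most once.
    `trusted_keys` is the trusted device set (device ID -> public key).
    """
    message = record.serialize()
    valid = 0
    for device_id, signature in signatures.items():
        key = trusted_keys.get(device_id)
        # Signatures from devices outside the trusted set are ignored.
        if key is not None and verify(key, message, signature):
            valid += 1
    return valid >= threshold
```

With a threshold of two and three trusted devices, a new record listing a fourth device is accepted only when two of the original three sign it; a signature from the fourth device itself does not count toward the quorum.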
Both the quorum threshold and the public keys of the
trusted devices are listed in the user's SSI zone file. Since changing the zone
file requires a blockchain transaction in the SSI system, there will be a
widely-replicated auditable log of each user's device key rotations. This makes it
easy for users (and their collaborators) to check key lifetimes, and makes it
risky for attackers to attempt to change keys (since they cannot do so
silently).
When the user signs in, the groupware library creates a gateway for the device
she is using if one does not exist already. Her device will sign the
new certificate graph for the app's volume and make it available in her SSI account. The
software authenticates data from the user by (1) looking up the user's ID in the
SSI system, (2) extracting the trusted device public keys and quorum threshold
from the zone file, (3) validating the delegation record, and (4) validating the
certificate graph against the delegation record. The software caches
monotonically-increasing version numbers for the certificate graphs locally to
prevent stale certificate graphs from being reused.
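The version check in the last step can be illustrated with a small Python sketch. The class and method names are hypothetical, and the sketch assumes that re-presenting the current version is acceptable while any older version is treated as stale; the actual library may apply a stricter rule.

```python
from typing import Dict, Tuple


class CertGraphVersionCache:
    """Remembers the highest certificate-graph version seen per (user, application)."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str], int] = {}

    def accept(self, user_id: str, app_name: str, version: int) -> bool:
        """Return True and record `version` unless it is older than one already seen."""
        key = (user_id, app_name)
        if version < self._latest.get(key, -1):
            return False  # older than a graph we already accepted: reject as stale
        self._latest[key] = version
        return True
```

A caller would run this check only after the delegation record and the certificate graph's signatures have been validated, so that an attacker cannot advance the cached version with an unsigned graph.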
\subsubsection{Reading and Writing Data}
Since a user gives each application its own volume, a groupware application like
a shared calendar spans the set of users' devices. Gaia ensures that when the
application client is loaded, it only has visibility into the
application-specific volumes the users have created (i.e. so a malicious
or buggy application cannot read another application's state).
The groupware storage interface references data by its volume key and owner user. For
example, to read Bob's file \texttt{today.cal}, Alice's application client would call
\texttt{get(``today.cal'', ``bob.id'')}, where \texttt{bob.id} is Bob's username
in the underlying SSI system. All the while, Gaia
ensures that Alice's calendar application only discovers the routing information
to Bob's calendar volume.
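As a rough model of this interface, the Python sketch below binds a client to a single application and resolves reads by owner and key. Only the \texttt{get(key, owner)} call shape comes from the text; the \texttt{AppStorageClient} class and the in-memory dictionary that stands in for SSI lookups and storage providers are assumptions made for illustration. In the real system the isolation is enforced by Gaia rather than by the client object.

```python
from typing import Dict, Tuple

# (owner user ID, application name) -> {key: value}.
# Stands in for the SSI routing lookup plus the owner's storage providers.
VolumeTable = Dict[Tuple[str, str], Dict[str, bytes]]


class AppStorageClient:
    """Toy client bound to one application; it only resolves that application's volumes."""

    def __init__(self, app_name: str, volumes: VolumeTable) -> None:
        self._app = app_name
        self._volumes = volumes

    def get(self, key: str, owner: str) -> bytes:
        """Read `key` from `owner`'s volume for this client's application."""
        volume = self._volumes.get((owner, self._app))
        if volume is None:
            raise KeyError(f"{owner} has no volume for {self._app}")
        return volume[key]
```

A calendar client constructed this way can read \texttt{today.cal} from Bob's calendar volume, but it has no path to the contents of Bob's volumes for other applications.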
\hfill \break
\noindent{\textbf{Read Authorization}}
\hfill \break
When writing shared data, the user must ensure that it is readable by a given
set of other users. How does the writer identify these other users,
and how can the software identify users as belonging to particular
organizations? The groupware library addresses these problems by both allowing
the writer to specify other individual readers, and by allowing the writer
to specify which social proofs a reader must have (as well as a way to vet
them).
The user is free to choose which proofs are required for each
application. For example, a cryptocurrency
investment application could require a user to produce signed KYC
(know-your-customer) attestations
from the government and the user's bank that prove that the user is an
accredited investor. This proof would be signed and stored in a social media
platform that the groupware library can crawl (such as
AngelList~\cite{angellist}).
Once a Gaia gateway knows which social media proofs are required to read
a key value, it will only accept read requests from users who present
the requisite proofs. To facilitate this check, users insert URLs to the
proofs within their SSI account linked to their names in the SSI system (which
the Gaia gateway looks up on-the-fly).
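The gateway-side check described above might look like the following Python sketch. It is illustrative only: the function name, the representation of proofs as a mapping from proof kind to URL, and the \texttt{check\_proof} callback (which would fetch the URL and verify that the proof is cryptographically linked to the reader's SSI account) are all assumptions introduced here.

```python
from typing import Callable, Dict, Set

# Hypothetical callback: (reader ID, proof kind, proof URL) -> bool.
# A real check would fetch the URL and verify the proof against the reader's SSI keys.
ProofChecker = Callable[[str, str, str], bool]


def may_read(
    reader_id: str,
    required_kinds: Set[str],
    listed_proofs: Dict[str, str],
    check_proof: ProofChecker,
) -> bool:
    """Return True only if the reader has a valid proof for every required kind.

    `listed_proofs` maps proof kind -> URL, as listed in the reader's SSI account.
    """
    for kind in required_kinds:
        url = listed_proofs.get(kind)
        if url is None or not check_proof(reader_id, kind, url):
            return False
    return True
```

A writer who requires no proofs at all admits every reader under this rule, so per-reader access lists would have to be enforced separately.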
\hfill \break
\noindent{\textbf{Searching}}
\hfill \break
Public groupware data is readily indexed by anyone who wishes to stand up a
Gaia database instance to crawl the set of application-specific
volumes. In addition, private groupware data can still be indexed---either by a
trusted, private Gaia database, or by downstream user groups.
To implement private search in a user group,
the groupware software ensures that the local device's Push
stage indexes the contents of the file before encrypting and replicating it.
The Push stage encrypts the index data with the viewers' public keys, so
the viewers will be able to search for the file by keyword.
The index itself is application-specific, but can do things such as
associate search terms to file names and word counts.
The index data is structured as a per-user prefix tree, so
that a search query only needs to fetch a narrow subset of the index to find
files with the search term.
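To make the prefix-tree structure concrete, here is a minimal in-memory Python sketch that maps search terms to file names. The class names are hypothetical, and encryption and replication are omitted; in a deployed index each tree node would presumably be stored as a separately fetchable encrypted object, which is what lets a query touch only the nodes along one path.

```python
from typing import Dict, Set


class IndexNode:
    """One node of the prefix tree: child links plus files whose term ends here."""

    def __init__(self) -> None:
        self.children: Dict[str, "IndexNode"] = {}
        self.files: Set[str] = set()


class PrefixIndex:
    """Per-user search index mapping exact terms to the files that contain them."""

    def __init__(self) -> None:
        self.root = IndexNode()

    def add(self, term: str, filename: str) -> None:
        """Record that `filename` contains `term` (case-insensitive)."""
        node = self.root
        for ch in term.lower():
            node = node.children.setdefault(ch, IndexNode())
        node.files.add(filename)

    def search(self, term: str) -> Set[str]:
        """Return the files indexed under exactly `term`, walking one path only."""
        node = self.root
        for ch in term.lower():
            child = node.children.get(ch)
            if child is None:
                return set()
            node = child
        return set(node.files)
```

A lookup visits at most one node per character of the search term, regardless of how many other terms the index holds.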
A global untrusted relational database can accelerate delivery of encrypted
index files to downstream readers. Trusted readers asynchronously fetch,
decrypt, and incrementally reconstruct the writer's index locally to service
search queries. Depending on the sizes of the index and the number of users,
the application may take different strategies for fetching the encrypted
index---for example, a large user group may employ a private trusted instance of
a Gaia relational database that can eagerly build up a search index, whereas a
small user group may simply fetch and reconstruct each other's indexes as
needed.
\subsection{Implementation}
The groupware library implementation is the work of multiple contributors.
It is implemented in two parts: a JavaScript library that facilitates user
sign-ins and application-specific volume creation, discovery, reads, and writes,
and a UI that allows users to manage their social proofs.
It was developed in collaboration with Blockstack Public Benefit Corporation~\cite{blockstack-pbc}.
Several applications have been independently built by Blockstack community members
with the groupware library. Examples include:
\begin{itemize}
\item \textbf{Blockstack To-Dos}: This is a private to-do list application
that uses single-reader Gaia volumes to store private user to-do lists.
\item \textbf{Graphite}: This is a Google Docs work-alike~\cite{graphite-docs}. Users store and share documents and
spreadsheets via multi-reader Gaia volumes. The data is encrypted by
default, so that only the designated readers can access it. It makes use
of a Gaia database to facilitate secure document discovery---the database
discovers encrypted pointers to the encrypted document, so that only the
intended recipient can access the data. It also offers end-to-end
encrypted messaging, where messages are replicated to Gaia volumes for
long-term storage.
\item \textbf{Blockstagram}: This is an Instagram work-alike that allows
users to securely share photos via multi-reader Gaia
volumes~\cite{blockstagram}. Photos are
encrypted with the recipients' public keys before being replicated, thereby
providing end-to-end confidentiality. It was developed by a team of eight
Web application developers with no prior experience with Gaia (or
Blockstack, Gaia's SSI system) in less than 36 hours at a hackathon in
Berlin~\cite{patrick-tweet-blockstagram}. % https://twitter.com/PatrickWStanley/status/970307376690626561
\item \textbf{Stealthy.im}: This is an end-to-end encrypted chat application,
where users can securely send text and pictures
in real time~\cite{stealthy.im}. It uses
multi-reader Gaia volumes to store chat data, and uses a Gaia database to
discover and invite users to chat. A similar Gaia-powered application is
\textbf{Hermes}~\cite{hi-hermes}.
\item \textbf{Coins}: This is a private cryptocurrency portfolio application
that uses single-reader Gaia volumes to securely and confidentially store
the user's cryptocurrency holdings~\cite{coins}. It allows the user to track the worth
of their holdings without exposing them to anyone outside of the user's
computer.
\item \textbf{Publik}: This is a microblogging application that uses
multi-reader Gaia volumes to share blog posts~\cite{publik}. A Gaia
database for indexing hashtags and user posts is under development.
\item \textbf{Bellweathr}: This is a business analytics program that uses
machine learning in the user's Web browser to help a business owner
identify patterns in customer purchases~\cite{bellweathr}. Business
owners use Gaia to load and store encrypted copies of their customer data
and trained models, thereby ensuring that it will remain private.
Equivalent applications today require business owners to expose their customer
data to third parties, which puts both them and their customers at risk
from hackers and security mishaps.
\end{itemize}
All of these applications use Gaia and its SSI system to load, store, and share
user data. The SSI system implementation (the Blockstack Naming
Service~\cite{bns}) removes the need for per-app password databases and per-app
identity services, and Gaia removes the need for per-app data silos. Users can
share data from one application to
another~\cite{blockstack-technical-faq-share-data} without the application's
permission or cooperation.
The applications Graphite, Blockstagram, Stealthy.im, and Hermes all rely on a
global database instance to discover other application users. They are not
coupled to a specific instance; anyone can deploy a new global database if the
default instance misbehaves or is not trusted.
\subsection{Discussion}
The usefulness of SDS is apparent in its ability to implement its users'
data-hosting policies independently of the applications. Each user can keep their groupware data on
the storage providers of their choice, and in doing so, control their
availability, durability, and access control independently of one another and independently of
the applications. For example, a user's Gaia node can programmatically delete
old Stealthy.im messages without Stealthy.im's permission. As another example,
a user's Gaia node can limit access to its owner's Graphite documents by denying
reads from hosts outside its local area network.
At the same time, application developers do not need to care
about hosting user data, and do not need to worry about coupling their data to
specific storage systems. None of the third-party applications above rely
on application servers.
As an optimization, their respective developers deploy
Gaia global databases to help users discover one another. For example,
Stealthy.im implements an invite mechanism using a Gaia global database, and
Graphite uses a Gaia global database to help users discover shared files.
However, the developer is not required to deploy and maintain a global database.
Gaia global databases only host soft-state in the application, and any user can
instantiate their own global database and derive the same
database state. This means that as long as at least one user is interested in
preserving Stealthy.im's invite system or Graphite's
document discovery system, they can do so without the developer's help.
The expressive power given to developers by the aggregation driver model is
apparent in the ability to control read and write access based on whether or not
the requesting user has made particular social proofs. The social proof check
code only needed to be written once, and it now works across all groupware
applications and all cloud services. The expressive power is also apparent in
the ability to automatically generate private search indexes in response to reads and
writes.
The main difficulty with giving users direct control over their groupware data
today is that it forces them to run a shared groupware server (or collectively
trust someone to do so on their behalf). By instead
implementing what used to be server-side functionality as aggregation driver
stages, the library removed the need for a shared server while preserving each
user's control over their data.
\section{End-to-End Encrypted Email}
The ability of SDS systems to instantiate application-specific data flows gives
users the power to enforce data transmission and storage concerns in
\emph{existing} protocols as well. This is demonstrated by using Syndicate to construct
end-to-end encrypted email that addresses long-standing
usability concerns that impede the widespread use of PGP~\cite{pgp}.
\subsection{Motivation}
Encrypted email is not a new concept. However, it has proven notoriously difficult to
deploy~\cite{why-johnny-cant-encrypt}
~\cite{why-johnny-still-still-cant-encrypt} due to the need for users to manage
private keys. In addition, deploying end-to-end encrypted email over legacy
SMTP servers and clients leaves users vulnerable to two security flaws: users
can only achieve end-to-end encryption if they all share keys, and users can
accidentally leak other users' cleartext when including new users in an email
thread.
\subsubsection{Using Private Keys}
Even if users had a good understanding of public key cryptography, they must still contend
with key distribution and key revocation. Key distribution is not addressed by
the encrypted email systems studied. However, the existing methods, namely key escrows,
certificate authorities (e.g. S/MIME~\cite{smime}, DANE~\cite{dane},
X.509~\cite{x509}), and webs-of-trust, are difficult to use securely and easy to
use incorrectly.
Key escrows and certificate authorities are ``centralized''
entities that often live outside of a user's organization, which makes it
difficult for users to reason about their trustworthiness. Only organizations
whose data policies admit a trusted third party can make use of these services.
Trusting a third party for such a task carries the risk of compromise: if a
widely-used certificate authority is compromised, it can lead to widespread
data exposure. Users may not discover the compromise until after harm has been done to
them, such as identity theft.
Webs of trust do a better job than centralized key servers at preserving
organizational autonomy because they allow each organization to unilaterally
decide which other organizations to trust. However, there is a high
coordination cost in maintaining them. This
is because trust is \emph{not} transitive by nature---if Alice trusts Bob and Bob
trusts Charlie, it does not follow that Alice trusts Charlie. Users in each
organization need to be wary of the degree to which to trust their peers, and wary of the trust
judgments their peers will make. Moreover, they must curate their webs of
trust to account for changes in the organization. For example, if Bob is fired
from his job, then all of Bob's coworkers must update their webs of trust to stop trusting his
email signing key.
Key revocation adds another layer of complexity. Key revocation certificates
and signed key expiration dates do not go far enough in making encrypted email
usable. If a user loses both their private key and their key revocation
certificate, then they have to get other users to re-establish trust in them
from scratch. If the user's private key is compromised, then the attacker can
send arbitrary emails before the user can transmit their key revocation
certificate. If the user loses their revocation certificate, or if the attacker
can stop the certificate from reaching the victims, then the user cannot stop an
attacker with their compromised private key.
\subsubsection{Contacting other Users}
Even if users could reliably distribute and revoke public keys, conventional
email clients still allow users to communicate with others in insecure ways.
Users can bring harm to themselves by accidentally sending email in the clear
when they meant to encrypt it. Also, users can bring harm to others
by accidentally divulging their communications by carbon copying
their cleartext in an email to a user who does not use encryption.
Neither existing SMTP clients (including Web clients) nor
SMTP servers address these problems. SMTP clients do not help users with key
distribution or revocation, and they do not help the user discover whether or
not they have the right key. Web SMTP clients are even less secure, because the
Web client offloads transmission to a remote server (which now must be trusted
by the user). If the user wants to use another device to send an email, such as
a public terminal, they have to divulge a private key to the device.
SMTP is already ill-suited for encrypted communications because at a minimum the email's
sender and recipient must be readable by all SMTP servers between the sender and
recipients. Also, due to its store-and-forward architecture, any messages
accidentally sent in the clear will be stored by the servers for an
indefinite amount of time. Users do not get to choose which servers store and forward
messages, and users cannot ``unsend'' messages if they discover that they sent
them to the wrong recipient.
\subsection{Role of SDS}
This thesis presents a backwards-compatible mail system built on top of
Syndicate. Unlike conventional email, the Syndicate email system
automatically encrypts data end-to-end and ensures that users
discover each other's \emph{current} public keys by way of its SSI
system. Users can do the following with this system:
\begin{itemize}
\item \textbf{Automate key management}. Users do not need to interact with keys
at all. Users do not need to trust external key escrows or certificate
authorities, and they do not need to participate in webs of trust. Instead,
users rely on Syndicate's blockchain-powered SSI system to discover each other's
current public keys.
\item \textbf{Control where emails are hosted and who can request them}.
A user's message contents will
not be relayed through the SMTP network, but will instead be hosted in one or
more storage hosts of the user's choosing. Recipients will download and
decrypt the message once they have discovered where it is hosted and have
obtained sufficient permission.
\item \textbf{Support sending to legacy users}. The Syndicate email system does \emph{not}
require both sender and recipient to use the same client in order to achieve
better security than legacy email. If the recipient does not use this new system, the sender has
the ability to contact the receiver while
preserving sender-chosen security properties. For example,
the sender can share the message body via a trusted private shared cloud storage folder
that only the sender and receiver can access, and send the URL to the message
body via SMTP. Apart from the sender and the storage provider, only the recipient will be able to access the data.
\item \textbf{Safely use untrusted devices}. This secure email system uses Syndicate's SSI system to
allow users to derive short-lived throw-away keys for signing and encrypting
messages on untrusted devices, like public terminals. The keys are
automatically distributed and revoked.
\end{itemize}
\subsection{Design}
The Syndicate email system follows a similar design to the Internet Mail
2000~\cite{internet-mail-2000} proposal. Users store their
encrypted emails in a Syndicate volume, which they
use to selectively give recipients access to their messages. The system uses
the SMTP network to allow senders to inform receivers when they have new
messages waiting for them (Figure~\ref{fig:chap4-syndicate-mail}).
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=24]{figures/dissertation-figures}
\caption{Design of end-to-end encrypted email with Syndicate SDS. Alice can
send email from both a personal device and a public terminal; the latter
is assigned a temporary session key that expires shortly after it is
created. Bob's client detects new mail from Alice via the legacy SMTP
network by receiving a signed list of URLs that point to Alice's chosen
storage services. If Alice emails non-users of this system, her UG employs a
custom ``message gateway'' (MG) type to Push the message payload to them
while enforcing her custom security policies (such as ``store this message in
a private shared Dropbox folder that the recipients can access and email them
the URL'').}
\label{fig:chap4-syndicate-mail}
\end{figure}
\subsubsection{Setup}
Each user stores their preferred email address in the SSI system.
Alice sends a message to Bob by looking up Bob's account
information in the SSI system, and then obtaining his email address. In order
to convince Alice that he is the ``right'' Bob (i.e. the Bob she is looking
for), he includes additional credentials in his SSI data, such as social proofs
or signed attestations from trusted third parties. The Syndicate email system
is not concerned with implementing a particular authentication strategy, but instead gives users
the ability to prove that various pieces of user-submitted identifying state associated with
the email address are signed by the same key that owns the email address
in the SSI system. For example, if Alice knows that Bob owns the website
\texttt{www.bob.com}, Bob could authenticate to Alice by hosting his SSI
username and a signature on
\texttt{www.bob.com} and listing a pointer to \texttt{www.bob.com} in his user
account on the SSI system.
The system is designed to accommodate multiple devices owned by the user by
storing all emails in a single volume that spans the user's devices. Each
device has its own key-pair in the volume certificate graph, which is used to create
gateways specific to that device. The user has an ``admin'' email account (i.e. an
account that is tied to the Syndicate volume owner account that stores her
emails). The admin account is controlled
from a trusted device and is used to add or revoke permission to communicate from
other devices.
When a user signs up for the system for the first time, she downloads and
installs a mailer daemon
that implements an SMTP and IMAP endpoint locally. The user points their preferred email
client to the local mailer daemon to send and receive messages. In addition,
the daemon implements an HTTP interface for serving the mail client encrypted
messages from the Syndicate volume.
The mailer daemon prompts the user to generate a device-specific
Syndicate user account and two gateways (a UG and an RG)
when it is installed. The user does so by using her
admin account. The installer wizard gives the user the option of pre-allocating
keys for her devices and their gateways, which can be fetched and installed on untrusted devices
on-the-fly without requiring her to use her admin account again. Their
keypairs are encrypted with a password of the user's choice, and stored to the user's
volume.
\subsubsection{Signing In}
Each device the user sends mail from receives its own keypair. Each
device-specific key is associated with an optional expiry timestamp and
revocation certificate, which are stored in the user's Syndicate volume for
safekeeping.
Signing in with a new device requires ensuring that the device-specific private key is
available. For devices the user trusts, this is achieved simply by (1)
installing the software, and (2) allowing the device to register its public key
with the user's account in the SSI system. An untrusted device, such as a
public kiosk, would receive a key with an expiry date and revocation certificate.
When the user signs out of the device, she would ``activate'' the revocation
certificate by appending a signed timestamp to it and
moving it to a canonical path in her volume. Other users'
clients would discover and process it automatically when receiving a message,
thereby ensuring that the kiosk does not use the private key after the user is
done with it. The key expiry timestamp
ensures that the key expires nonetheless if the user is unable to successfully
sign out (i.e. unable to post the revocation certificate).
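
The interaction between the expiry timestamp and the revocation certificate can be sketched as follows. This is a minimal illustration rather than the SyndicateMail implementation: the \texttt{DeviceKey} record, its field names, and \texttt{is\_key\_usable} are hypothetical, and the sketch assumes the receiving client already has a trustworthy timestamp for the message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceKey:
    """Hypothetical record for a device-specific key (field names are assumptions)."""
    key_id: str
    expires_at: Optional[float] = None   # Unix time; None means no expiry
    revoked_at: Optional[float] = None   # set once the revocation certificate is activated


def is_key_usable(key: DeviceKey, message_time: float) -> bool:
    """Return True if a message created at message_time may be accepted under key."""
    if key.expires_at is not None and message_time >= key.expires_at:
        return False  # expired, even if the user never managed to sign out
    if key.revoked_at is not None and message_time >= key.revoked_at:
        return False  # the user signed out and activated the revocation certificate
    return True
```

A kiosk key would carry both fields, whereas a trusted device's key could leave both unset. Either condition alone is enough to reject a message, which is what lets the expiry act as a fallback when sign-out fails.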
The device-specific key state includes the device-specific user account and the
device-specific gateway keys that the mailer daemon will use to interact with
the volume. Each device's gateways only write to one directory of the volume,
and mark their files as read-only by other devices (which the MS enforces).
The mailer daemon develops a coherent view of the mailboxes by listing all of
the devices' directory states.
In addition to creating device-specific Syndicate keys, the software also
creates a generic read-only UG and read-only RG whose private keys are publicly readable
and exposed in the volume. These gateways are meant to allow recipients to access
the volume's ciphertext, so the designated recipient can decrypt it.
They are configured in the certificate graph to only have read capabilities, and
to only serve on \texttt{localhost}. This ensures that all of the user's other
gateways will ignore them, and that anyone can run them on their computers to
access the inbox data.
\subsubsection{Sending and Receiving Mail}
The mailer daemon implements a Syndicate UG and RG (e.g. as subprocesses).
The UG implements the SMTP
and HTTP endpoints, and the RG uploads messages to the user's preferred storage
service, such as their personal Dropbox folder or an S3 bucket.
When the UG receives an outgoing message, its \texttt{serialize()} driver method inspects the
message for the recipient, and automatically looks up the public key in the SSI
system to encrypt the message to the recipient before sending it to the RG.
\emph{This way, the sender is never involved with selecting the key for a
recipient user}. The software additionally makes a copy of the sent message encrypted with the
sender's public key, and stores it into the device's ``sent'' mailbox.
The mailer daemon informs the recipient that they have a message waiting for
them by sending a small amount of discovery information to the recipient's
email address via SMTP. This discovery information is signed by the sender,
to prove its authenticity to the recipient. It identifies the path to the
message in the volume, as well as the hash of the ciphertext.
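
One possible shape for the discovery information is sketched below. The JSON layout, the field names, and the functions \texttt{make\_discovery\_record} and \texttt{demo\_sign} are assumptions made for illustration. In particular, \texttt{demo\_sign} uses an HMAC only as a stand-in so that the example runs with the standard library; the system described here would sign with the sender's device-specific private key.

```python
import hashlib
import hmac
import json
from typing import Callable


def make_discovery_record(path: str, ciphertext: bytes,
                          sign: Callable[[bytes], bytes]) -> dict:
    """Build the small record that is mailed to the recipient over SMTP."""
    body = {
        "path": path,  # where the ciphertext lives in the sender's volume
        "ciphertext_sha256": hashlib.sha256(ciphertext).hexdigest(),
    }
    # Canonical encoding so that signer and verifier process identical bytes.
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return {"body": body, "signature": sign(payload).hex()}


def demo_sign(payload: bytes) -> bytes:
    # Stand-in only (assumption): a real deployment would use a public-key
    # signature made with the sender's device key, not a shared-secret HMAC.
    return hmac.new(b"demo-key", payload, hashlib.sha256).digest()
```

Because the record carries only a path and a hash, the SMTP network learns that a message exists but nothing about its contents.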
The recipient's mailer daemon polls the user's SMTP inbox for discovery
messages. When it finds one, it fetches, authenticates, and decrypts the
associated message from the sender's volume,
and locally stores it so the user's mail client can read it as a normal
email. It does so automatically as part of the \texttt{deserialize()} driver
method in the UG---this driver method only succeeds if the message could be
authenticated. The discovery message's sender email address
is used to look up the user's device keys in the SSI system to perform the authentication.
\emph{This way, the receiver never needs to select the key for the
sender to authenticate the message}.
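
The authentication step on the receiving side might look like the following sketch. It assumes a discovery record with a \texttt{body} (holding \texttt{path} and \texttt{ciphertext\_sha256}) and a hex-encoded \texttt{signature}; this layout and the name \texttt{check\_fetched\_message} are hypothetical. The \texttt{verify} callable stands for a signature check against the sender's device keys from the SSI system.

```python
import hashlib
import hmac
import json
from typing import Callable


def check_fetched_message(record: dict, ciphertext: bytes,
                          verify: Callable[[bytes, bytes], bool]) -> bool:
    """Authenticate a fetched ciphertext against its discovery record."""
    body = record["body"]
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    if not verify(payload, bytes.fromhex(record["signature"])):
        return False  # the record was not signed by a key listed for the sender
    digest = hashlib.sha256(ciphertext).hexdigest()
    # Constant-time comparison of the observed hash with the advertised one.
    return hmac.compare_digest(digest, body["ciphertext_sha256"])
```

Only when this check succeeds would the daemon go on to decrypt the message and hand it to the mail client.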
The sender must host the email contents for the recipient until the
recipient downloads them. Once the recipient's daemon has fetched and decrypted the message, it
encrypts and backs up a copy via its RG for safe-keeping.
The sender can delete the messages she sent at any time, thereby granting her
the opportunity to ``un-send'' an email's message body if she can do so before
the recipient fetches it. The sender can garbage-collect old messages once
she is sure the recipient has fetched them, or once the information is no longer
relevant. For example, the sender could simply delete all messages she
sent over one month ago.
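
As an illustration of the age-based policy, the sketch below selects sent messages that are older than a cutoff. The function name, the mapping from volume paths to send times, and the approximation of one month as 30 days are all assumptions; deleting the selected paths from the volume is left to the caller.

```python
import time
from typing import Dict, List, Optional

ONE_MONTH_SECONDS = 30 * 24 * 60 * 60  # "one month" approximated as 30 days


def expired_messages(sent_at: Dict[str, float],
                     max_age: float = ONE_MONTH_SECONDS,
                     now: Optional[float] = None) -> List[str]:
    """Return paths of sent messages older than max_age, oldest first."""
    now = time.time() if now is None else now
    old = [path for path, ts in sent_at.items() if now - ts > max_age]
    return sorted(old, key=lambda path: sent_at[path])
```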
If the sender includes multiple recipients, or includes a new recipient part-way
through the email chain, the sender's mailer daemon detects this and ensures
that the previous conversation is kept secret. This is achieved by having the
local RG in the mailer daemon remember which email threads have which
recipients, and ensure that their respective messages are re-encrypted before transmission.
This conversation metadata is encrypted and stored on the user's volume, so it
is accessible from all devices' RGs. This decreases the likelihood that a user
accidentally divulges cleartext in carbon copies on the email client---the message
would simply fail to send if the user did this.
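
The bookkeeping behind this check can be sketched as below. The class \texttt{ThreadRecipients} and its methods are hypothetical names, and the state is kept in memory here, whereas the text above stores it encrypted on the user's volume.

```python
from typing import Dict, Iterable, Set


class ThreadRecipients:
    """Remembers which addresses have already seen each email thread."""

    def __init__(self) -> None:
        self._seen: Dict[str, Set[str]] = {}

    def new_recipients(self, thread_id: str, recipients: Iterable[str]) -> Set[str]:
        """Return recipients who were not party to the earlier conversation."""
        current = {r.strip().lower() for r in recipients}
        known = self._seen.get(thread_id)
        if known is None:
            return set()  # first message of the thread: nothing earlier to protect
        return current - known

    def record_send(self, thread_id: str, recipients: Iterable[str]) -> None:
        """Call after a message has been encrypted for and sent to recipients."""
        self._seen.setdefault(thread_id, set()).update(
            r.strip().lower() for r in recipients)
```

A non-empty result from \texttt{new\_recipients} is the signal to re-encrypt the quoted conversation for the newcomers, or to refuse to send.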
\subsubsection{Legacy Compatibility}
As with PGP before it, the Syndicate-powered email system requires both sender
and recipient to use it in order to realize the full benefits. Unlike PGP, the
developer can ensure that certain safety features are in place if only the
sender uses the software. This is made possible by Syndicate's
aggregation driver programming model.
It is important to recognize that when it comes to email, the correct way to
send a message depends on the sender, the recipient, the content, and the context in which
it is sent. For example, two friends exchanging vacation photos do not need the same
security guarantees as an anonymous informant communicating with a law
enforcement agent.
One of the major drawbacks of PGP is that it cannot work if either the
sender or recipient does not use it. This significantly limits the set of senders
and recipients. Moreover, PGP-encrypted messages are easy to spot in SMTP
traffic, which makes it easy for network eavesdroppers to identify users who
have something to hide.
What is needed is for senders and receivers to be able to communicate even if
one of them does not use PGP-like encryption. The approach taken here is to
make it easy for the sender to control how the message will be delivered, while
allowing messages to be discovered by the recipient over legacy SMTP. The
sender is free to set up the delivery process to implement the security
guarantees on a case-by-case basis, subject to what she knows about the recipient and subject to
the contents of the message. For example:
\begin{itemize}
\item The sender can encrypt the message with a password known to the recipient,
and send the message body in a common document format, like Microsoft Word or
PDF, that the recipient can open and decrypt with already-installed software.
This can provide the confidentiality of PGP.
\item The sender can replicate the message to a shared private storage provider
like a Dropbox folder or private \texttt{git} repository, and send the
recipient the URL over SMTP. This process can be carried out via HTTPS.
While this does not provide the same
degree of end-to-end confidentiality and authenticity as PGP, it guarantees that as long as
the certificate authorities and shared storage are trusted, then only the sender, the recipient, and the
storage provider can view the message (but SMTP servers see nothing).
\item The sender can select which network to use to transmit the data, based on
the recipient. For example, an enterprise user could require that all messages sent
to the company SMTP server be sent through the corporate
VPN. The aggregation driver would refuse to send messages unless it detected
that the VPN was available. This ensures that all email messages sent by employees are
visible only to the company and the recipient.
\end{itemize}
These examples do not provide the same guarantees as PGP, but they are
better than relying only on legacy SMTP for email. While they can all be done
today in an ad-hoc manner without SDS,
Syndicate lets users ensure that they are all executed
automatically and consistently. Moreover, the way these features are
implemented allows them to be reused in multiple different contexts, giving senders the
ability to \emph{combine} different features to create a custom message
transmission process.
Addressing legacy compatibility is a practical application of Syndicate's custom
gateway types. The deployment is designed so that the RG's
Push driver stage (1) reassembles the Pushed chunks received from the UG
(embedded in the email client) back into the original email, (2)
scans the certificate graph for gateways with a type identifier specific to the
email client (the ``MG'' gateway in Figure~\ref{fig:chap4-syndicate-mail}),
and (3) forwards the reassembled email to them for further
processing.
When the MG receives the message, it inspects the message
headers and runs a user-specified program based on the recipient address. The
user-specified program is responsible for actually transmitting the email.
For example, each of the above examples can be implemented with separate
programs that are invoked as subprocesses that take the message as input and
carry out the actual transmission.
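
The recipient-based dispatch can be sketched as follows. The rule format (shell-style address patterns), the function \texttt{pick\_handlers}, and the use of in-process callables are assumptions made to keep the example short; the text above describes the transmission programs as subprocesses.

```python
import fnmatch
from email.message import EmailMessage
from email.utils import getaddresses
from typing import Callable, List, Tuple

Handler = Callable[[EmailMessage], str]


def pick_handlers(msg: EmailMessage,
                  rules: List[Tuple[str, Handler]],
                  default: Handler) -> List[Tuple[str, Handler]]:
    """Pair every To/Cc address with the first rule whose pattern matches it."""
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    chosen = []
    for _name, addr in getaddresses(fields):
        addr = addr.lower()
        # Rules are tried in order; unmatched addresses fall through to default.
        handler = next(
            (h for pat, h in rules if fnmatch.fnmatchcase(addr, pat)), default)
        chosen.append((addr, handler))
    return chosen
```

With rules such as \texttt{*@corp.example} mapped to a VPN-only sender and a fallback that shares the body through a private folder, one message can be delivered differently to each of its recipients.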
The transmission programs themselves are part of the email-type gateway's driver. The user
deploys them to her volume by updating the certificate graph. Since the volume
spans all of her devices, each of her devices will have the most up-to-date
transmission programs available whenever the user sends a message.
\subsubsection{Search Indexing}
Since all messages are encrypted client-side, there is no option for server-side
message indexing. Instead, the user's RGs incrementally build up a
word-to-email index as part of their Push stage logic, just as they do in the
serverless groupware example. The index itself is
encrypted with the user's public keys, so it is visible only on the user's devices.
In fact, the code to maintain the users' indexes can simply be re-used by the
RGs without affecting the design or implementation of the mail clients.
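
A word-to-email index of this kind can be sketched as a small inverted index. The class \texttt{WordIndex}, its tokenization rule, and the in-memory storage are assumptions for illustration; the index described above would additionally be encrypted and written to the user's volume.

```python
import re
from collections import defaultdict
from typing import Dict, Set

_WORD = re.compile(r"[a-z0-9]+")


class WordIndex:
    """Incremental word-to-email index (kept in memory here for illustration)."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, message_id: str, cleartext: str) -> None:
        """Index one message; intended to be called once per Pushed message."""
        for word in _WORD.findall(cleartext.lower()):
            self._postings[word].add(message_id)

    def search(self, query: str) -> Set[str]:
        """Return ids of messages containing every word of the query."""
        words = _WORD.findall(query.lower())
        if not words:
            return set()
        hits = [self._postings.get(w, set()) for w in words]
        return set.intersection(*hits)
```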
There are two reasons to offload search indexing to the RGs instead of allowing
applications to handle this. First, this preserves the index across all devices.
This is especially important for Web clients, which cannot easily store a large amount of state
locally on their own (HTML \texttt{localStorage} is typically limited to roughly 5~MB per origin, for example). Second, it
makes it easier to implement additional features like spam filtering, described below.
\subsubsection{Spam Filtering}
A key usability problem with encrypted email is that the servers cannot filter spam,
since they cannot read the messages. This can be addressed in four ways within
the volume's aggregation driver.
\\
\noindent{\textbf{Shared Spam Database}}. First, the aggregation driver is programmed to have the RGs in a user's volume build
up a \emph{shared} set of classification data from user input. When the user
moves data to the ``spam'' mailbox, the RG driver's Push stage generates
a feature vector from the cleartext and stores it in a shared storage
provider. This allows
users to share each other's spam feature information.
The shared storage itself is implemented as a separate, third party volume that enforces write-once read-many
access patterns, and tracks which users add which features. That is, the volume's RGs do not allow a record to be
written more than once, and do not allow records to be deleted (except by the
volume owner). This ensures
that users do not accidentally clobber one another's writes, and a malicious
user (such as a spammer) cannot erase the feature vectors. If it is later
discovered that a particular user's records were written with malicious intent,
they can be removed by the volume owner.
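
The write-once read-many rules can be illustrated with the in-memory stand-in below. The class \texttt{WriteOnceStore} and its method names are hypothetical; in the design above these checks would be enforced by the RGs of the shared spam volume.

```python
from typing import Dict, List, Tuple


class WriteOnceStore:
    """In-memory stand-in for the shared spam volume's access rules."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self._records: Dict[str, Tuple[str, List[float]]] = {}

    def put(self, user: str, record_id: str, features: List[float]) -> None:
        if record_id in self._records:
            raise PermissionError("records are write-once")
        self._records[record_id] = (user, list(features))  # remember who wrote it

    def get(self, record_id: str) -> List[float]:
        return list(self._records[record_id][1])

    def delete(self, caller: str, record_id: str) -> None:
        if caller != self.owner:
            raise PermissionError("only the volume owner may delete")
        del self._records[record_id]

    def purge_user(self, caller: str, user: str) -> int:
        """Owner-only removal of every record written by user; returns the count."""
        if caller != self.owner:
            raise PermissionError("only the volume owner may delete")
        doomed = [rid for rid, (who, _) in self._records.items() if who == user]
        for rid in doomed:
            del self._records[rid]
        return len(doomed)
```

Tracking the author of each record is what makes the later clean-up possible: the owner can remove everything a malicious user contributed in one step.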
This arrangement is similar to existing third party spam detectors such as
Spamhaus~\cite{spamhaus}, where a third party aggregates spam information
on behalf of many users. The spam volume owner would aggregate the spam
information to train a spam classifier, and write the classifier parameters
to the volume. A user's mailer daemon would connect to the volume in a read-only fashion
to read the classifier parameters, and use them to classify the user's inbound
messages as spam or not spam. Because the volume is shared across many users
(and can be replicated by any user), the users are able to avoid spam-detection
service lock-in because they can (1) independently calculate the spam classifier
parameters, and (2) come up with their own, better classification system if the
spam volume owner does not do a good enough job.
Anyone can set up and run a collective spam filtering process. Users are free
to unilaterally decide which ones to use. Therefore, this approach does not
infringe on organizational autonomy.
\\
\noindent{\textbf{Sender Pays for Storage}}. The second anti-spam feature is that by design, the sender pays for storing messages to recipients. Since each
recipient has a different public key, the sender must encrypt a message for each
recipient. As a result, a spammer
must store a lot of state at their own expense to spam many users. This
discourages, but does not completely remove, bulk spam. This is similar to
the Internet Mail 2000~\cite{internet-mail-2000} proposal.
\\
\noindent{\textbf{SSI Proofs of Payment}}. The third measure is to take advantage of the fact that the SSI system is implemented on top of
a public blockchain. This feature allows for some interesting anti-spam mechanisms. A recipient can require the sender
to include a ``proof-of-payment'' on the message, generated by a transaction on the
underlying blockchain. This would have the effect of both rate-limiting
spammers and making emailing users prohibitively expensive to do at scale. It
would also allow senders to prioritize messages by paying higher fees. This
is a technique that was successfully employed by
Earn~\cite{earn-co}, for example, whereby a user will only see a
message if the sender has paid a minimum amount of money required by the
recipient.
\\
\noindent{\textbf{SSI Social Proofs}}. The fourth measure is to re-use a concept from Gaia-powered groupware to require
that a sender provide sufficient proofs in the SSI system that they are a
legitimate human being, and not a bot. For example, a recipient can enforce a
default anti-spam policy whereby a sender must supply evidence in their SSI
account that they own at least five unique social media accounts, and that the
accounts undergo a minimum amount of activity. This makes it hard to send
spam at scale because (1) the spammer would need to circumvent all of the social
media systems' anti-bot mitigations, and (2) if the spammer gets caught, they
have to register a new identity in the SSI system (necessitating a blockchain
transaction). Since the blockchain itself grows at a fixed rate, and since
blockchain peers effectively bid on the ability to write new transactions, a
spammer could not easily register many identities without paying a high price
(i.e. the price gets higher the faster the spammer tries to register new
identities). This allows the system to overcome the limits of prior proof-of-work
techniques~\cite{anti-spam-proof-of-work} which either had a fixed proof-of-work
threshold or a threshold that increased independently of the system's usage.
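
A recipient-side policy of this kind might be checked as in the sketch below. The \texttt{SocialProof} record, the activity measure, and the default of ten recent posts are assumptions; only the five-account threshold comes from the example above. Verifying that each proof is genuine is out of scope here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SocialProof:
    service: str          # name of the social media service
    account: str          # account name on that service
    recent_posts: int     # hypothetical activity measure


def passes_social_policy(proofs: Iterable[SocialProof],
                         min_accounts: int = 5,
                         min_posts: int = 10) -> bool:
    """True if enough distinct, sufficiently active accounts are proven."""
    active = {(p.service.lower(), p.account.lower())
              for p in proofs if p.recent_posts >= min_posts}
    return len(active) >= min_accounts
```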
All four of these techniques would be implemented in part by the Acquire stage of the
mailer daemon's RG. This ensures that all email clients automatically benefit
from these mechanisms without modification.
\subsection{Implementation}
The prototype system, SyndicateMail, is implemented in 4100 lines of Python and
1700 lines of Java. It implements end-to-end encryption across multiple devices
and offers legacy compatibility with SMTP.
The system is being refactored to use the search indexing
logic from Gaia to implement search indexing in Syndicate. The RG
driver runs the indexing logic as a subprocess in a \texttt{node.js} VM. The
spam filtering is carried out simply by passing the text through an existing
spam-detecting system such as \texttt{spamd}~\cite{spamd} or \texttt{spam-assassin}~\cite{spam-assassin},
and only forwarding the email text if it is not spam.
\subsection{Discussion}
In terms of the number of patches to write, it would be costly to implement this email system without
SDS. Each email client would need to be patched to store its state to the
storage provider of the user's choice, whereas the use of Syndicate ensures that
storage services only need to be ported once. By moving data signing,
encryption, decryption, and verification to the storage layer,
and using Syndicate's SSI system to bootstrap key trust, Syndicate enables the
use of existing email clients with encrypted email without forcing
users to understand public-key cryptography. By using gateways to represent the
capabilities of each device, the system is able to provide the convenience users expect from Webmail
without forcing them to manually copy private keys between devices.
Filtering spam and preventing accidental cleartext disclosure are problems
that require the system to inspect email contents on the user's behalf. This
is achieved by having the user's RGs carry out this inspection locally,
instead of forcing the user to trust an external SMTP server to do so on their
behalf. This is crucial to ensuring end-to-end message confidentiality, and
is required to be implemented at a layer \emph{beneath} email clients to ensure
that the user's choice of client does not alter the system's ability to ensure
message confidentiality and prevent spam delivery. These problems
are both addressed by allowing the user to run application-specific aggregation driver
stages interposed between their personal devices and the rest of the network.
\section{CDN-accelerated Scientific Data Staging}
Scientific computing is increasingly conducted across multiple research groups. Data is
generated and stored in the labs where a scientific instrument or dataset is
curated, and then shared across the world with collaborators. Similarly,
collaborator labs publish their data analyses, which get downloaded by other labs
(and classrooms) for further consumption.
The third application presented here is to use
Syndicate to implement a cross-site data processing framework that
allows scientists to take advantage of commodity cloud storage and CDNs to host
and deliver data to each lab. For dataset curators, this reduces the task of
exposing a dataset to collaborators to running a Syndicate AG that can crawl the
dataset (with a dataset-specific driver) and serve chunks of it to downstream
UGs. For dataset readers, this reduces the task of accessing a dataset to
fetching a dataset-specific Docker~\cite{docker} image that mounts the dataset
as a read/write filesystem backed by the dataset AG, an intermediate CDN, and
the user's personal cloud storage.
\subsection{Motivation}
The main motivation for considering an SDS approach to scientific data storage is
that due to the nature of the data they gather, each lab will have its own data curation
policies, its own unique data access patterns, and its own data-sharing policies.
There is not a one-size-fits-all approach for hosting scientific data, and labs will need to tailor
their storage systems to meet their specific needs (especially since their needs
change over time, depending on the nature of the data they produce).
This need to accommodate changing data storage and access policies is evident in
the evolution and wide success of state-of-the-art scientific storage systems
like iRODS~\cite{irods}, which offer
user-programmable policies (``rules'' and ``microservices'') that allow
individual scientists, project teams, and entire labs to programmatically
specify their curation policies and have them automatically enforced.
In fact, iRODS is considered to have pioneered SDS concepts
(Chapter~\ref{chap:related-work}).
The scientific data-sharing framework uses commodity CDNs and cloud storage to help iRODS deployments
handle ``fan-out'' data distribution cases, where many labs across the wide
area want to read existing datasets and write back changes that will be
incorporated into the iRODS dataset. CDNs would let individual iRODS
deployments scale up the number of reads they could service while preserving the
policies encoded in their rule sets and microservices. Commodity cloud
storage would allow users to host the results of their computations and share
them with their lab mates and peers before generating and preserving a
``curated copy'' of the data back to iRODS.
\subsection{Challenges}
Augmenting existing systems like iRODS
with commodity infrastructure introduces challenges of
its own. It is not enough to simply place a CDN in between iRODS and remote readers for
three reasons:
\begin{itemize}
\item \textbf{Protocol Incompatibility}. CDNs are designed for Web content
acceleration, which means
using HTTP as the data delivery protocol. However,
iRODS does not speak HTTP. A protocol translation layer is required.
\item \textbf{Cache Thrashing}. CDNs are designed for caching lots of
``small'' files---i.e. website assets like HTML or CSS that are not usually
gigabytes in size. However, iRODS data can be extremely large, and clients may
want only a small range of an iRODS file. Serving iRODS data with a CDN
while getting good bandwidth will require file-level fragmentation and reassembly
on both the producer's and consumers' endpoints.
\item \textbf{Cache Incoherency}. iRODS is a read/write datastore.
While some users are reading from a file, another user can be writing to it.
This can cause readers to cache corrupt data, which in turn gets served to
future readers by the CDN. Avoiding this problem requires manual coordination
between readers, writers, and the cache operator.
\end{itemize}
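
The fragmentation step named under Cache Thrashing can be illustrated by mapping a requested byte range onto fixed-size blocks, so that a reader fetches only the blocks it needs and the CDN caches each block independently. The function name and the fixed block size are assumptions of this sketch, not a description of Syndicate's chunking.

```python
from typing import List, Tuple


def blocks_for_range(offset: int, length: int,
                     block_size: int) -> List[Tuple[int, int, int]]:
    """Map a byte range onto fixed-size blocks.

    Returns (block_index, start_within_block, end_within_block) triples,
    where the end is exclusive.
    """
    if offset < 0 or length < 0 or block_size <= 0:
        raise ValueError("offset and length must be >= 0 and block_size > 0")
    pieces = []
    end = offset + length
    pos = offset
    while pos < end:
        index = pos // block_size
        start = pos - index * block_size
        stop = min(block_size, end - index * block_size)
        pieces.append((index, start, stop))
        pos = (index + 1) * block_size  # jump to the start of the next block
    return pieces
```

For instance, a 50-byte read at offset 100 with 64-byte blocks touches the tail of block 1 and the head of block 2, so only two blocks need to be requested.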
In addition, sharing the results of local computations and generating a
dataset to write back to iRODS has its own challenges:
\begin{itemize}
\item \textbf{Replica discovery}. Suppose a scientist reads some data from iRODS,
runs some local jobs on the data, and saves the job's results to the lab's shared Dropbox
folder. How do the scientist's peers find the data, so they can run their own
analysis on it? Today, the scientist emails the links to her peers or puts them on the lab
website. However, this is a manual, tedious process for sharing data. Can we
automate data discovery with Syndicate?
\item \textbf{Replica write-back}. A scientist's collaborators do not always
have access to her lab's iRODS deployment. How do her collaborators get their results incorporated
into her deployment? More specifically, how do they discover a set of
authentication credentials to use to do so? How does the data ingress server authenticate the
collaborator if they do not have an iRODS account? Today, the solution is to find
and email an iRODS user with sufficient privileges and ask them to incorporate
the changes. But can this be done automatically, without requiring users
in the loop?
\end{itemize}
As will be shown, these problems can be solved with the right configuration of Syndicate gateways.
\subsection{Role of SDS}
The need for software-defined storage in scientific computing is not new. The
labs that gather and share scientific data must already do so according to
data-specific rules. These include rules governing storage aspects like
national export controls, disclosure of proprietary or potentially dangerous information, and
even mundane concerns like ensuring the data appears in the correct format.
Prior to systems like iRODS, these rules had to be enforced either within the
scientific computing applications, or within a bespoke storage system.
Enforcing the same rules across many labs' applications poses a high cost of
coordination, since each lab's applications must be audited for compliance.
Enforcing a set of rules within a bespoke storage system requires constructing a
bespoke storage system for each rule set. Allowing a storage system to have its
curation rules programmed at runtime without changing the application-facing
storage APIs is the ``sweet spot'' of SDS for scientific computing.
This Syndicate-powered scientific data-sharing framework extends an existing system (iRODS) with
Syndicate to allow existing workflows to take advantage of commodity infrastructure
(CDNs and cloud storage) without affecting the application-facing storage APIs.
Crucially, the data-sharing framework does so in a way that \emph{preserves} the data owner's existing iRODS
rules in a global setting, while allowing the owner to specify additional
rules within Syndicate to specifically control how data is disseminated once it
leaves iRODS.
\subsection{Design}
An iRODS system can store many different datasets, and each dataset can have its
own access control policies set by the owner. These are enforced internally by
iRODS when other users attempt to access the data.
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth,page=25]{figures/dissertation-figures}
\caption{Overview of CDN-accelerated scientific data. The iRODS deployment
is private, and accessible only via the AG and RG (which run in trusted
networks). Remote UGs leverage the MS and CDN to read cached but fresh data,
regardless of the CDN's caching policies. When UGs write data, they do so
via the trusted RG which sends the changes to the proper datasets. All the
while, the AG keeps the MS metadata consistent with writes from non-Syndicate
iRODS clients by subscribing to an (iRODS-specific) message queue.}
\label{fig:chap4-syndicate-datasets}