\chapter{Evaluation}
\label{chap:evaluation}
This chapter presents performance numbers for Gaia and Syndicate.
Ultimately, the end-to-end performance of an SDS system depends on the decisions
made in the application-specific aggregation and service driver implementations.
However, each SDS system will impose measurable, predictable overheads on
reads and writes. It is important for developers to know where these overheads
come from and how big they are in order to make good driver and application design decisions.
In light of the measured overheads, this chapter gives developers
recommendations on how to implement their aggregation drivers to minimize the
impact for specific workloads and end-to-end semantics.
\section{Overview}
SDS systems offer developers a trade-off. On the one hand, the cost to
developers and users is some additional performance overhead on the read and write paths to cloud services.
This is because the data must be converted
into access and mutate flows consisting of manifests and chunks, which are
relayed through one or more gateways en route to the cloud services.
The SDS metadata service may also need to be contacted in order to
complete the operation.
On the other hand, SDS systems offer significant gains over the status quo.
First, SDS systems greatly reduce the cost of building and maintaining
system-of-systems applications built on cloud services. By handling storage
semantics and inter-organizational trust relationships
independent of applications, SDS systems allow developers to spend more time on
their application business logic and less time on addressing the users'
storage and trust concerns. Second, SDS applications
give organizations unilateral control over how
their data is hosted, which frees developers from having to be
responsible for conforming to their hosting policies.
Third, SDS decouples application data from the application
code, preventing users from being locked into relying on a specific
application. These gains are realized in each
sample application presented in Chapter~\ref{chap:applications}.
Despite these overheads, using an SDS system does \emph{not} mean that application workloads take more time or
space than they would had the application avoided SDS. In fact,
the workload on the SDS system can be \emph{faster} and \emph{require less
space}, depending on the behavior of the service and aggregation drivers.
For example, a service driver can cache blocks on behalf of a slow service like Amazon Glacier~\cite{amazon-glacier},
allowing a read-heavy workload to execute faster with the SDS system
than it would have if the workload had to contact the service directly.
As another example, an aggregation driver can de-duplicate
and compress chunks before sending them to service drivers, speeding up data
transmission and reducing the amount of storage space needed when compared to
directly writing to storage services.
This chapter describes and measures the time and space overheads in both
Syndicate and Gaia's \emph{default} access and mutate flow behaviors, and
provides recommendations on how to design an aggregation driver to minimize
their impact on common workloads. The measurement focus is on the
\emph{efficiency} of reads and writes---that is, what fraction of time and space
is used for loading and storing the application's data.
Some aggregation driver design recommendations distilled from
these measurements have already been used in real-world applications built
on these SDS systems.
The efficiency measurement gives developers a way to
measure the impact of their service and aggregation driver implementations on
the end-to-end overhead perceived by the application.
A higher efficiency implies higher application-perceived goodput. Performant
driver implementations maximize efficiency using tactics such as caching
MS-obtained data, or running access and mutate flow steps in parallel when
possible.
To maximize efficiency, developers first need to understand the SDS system's
overheads; only then can they make good engineering decisions in their
aggregation drivers that minimize the impact on reads and writes.
\section{Access Flows}
An access flow runs in two steps: the Discover step translates the
name of a record into a manifest ID, and an Acquire step uses the manifest
ID to fetch the block(s) that contain the requested data. Both steps may be
implemented in the aggregation driver, but the SDS system supplies a default
implementation if the aggregation driver does not.
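The control flow described above can be sketched as follows. This is an
illustrative sketch only, not the actual Gaia or Syndicate API; the names
\texttt{run\_access\_flow}, \texttt{discover}, and \texttt{acquire} are
hypothetical stand-ins for the two steps and their default implementations.

```python
def run_access_flow(record_name, driver, defaults):
    """Run an access flow: Discover a manifest ID, then Acquire its blocks.

    Each step prefers the aggregation driver's implementation, falling
    back to the SDS system's default when the driver does not supply one.
    """
    # Discover: translate the record name into a manifest ID.
    discover = getattr(driver, "discover", None) or defaults.discover
    manifest_id = discover(record_name)

    # Acquire: use the manifest ID to fetch the block(s) with the data.
    acquire = getattr(driver, "acquire", None) or defaults.acquire
    return acquire(manifest_id)
```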
This section presents the time and space overheads of the systems' default
access flow behavior, so developers can better understand how to design
aggregation drivers and applications to minimize their effect on their
workflows.
\subsection{Overheads in Gaia}
In both Gaia and Syndicate, the default behavior of the Discover step is to query the
metadata service. This adds measurable time overhead to the access flow's
execution, comprised of a network round-trip plus the time required by the MS
implementation to look up the current manifest ID for the record.
Gaia is designed for applications where each user is the owner of a volume that
contains all of their application-specific data.
When using a Gaia-powered application, the volume owner usually
has only one device online---the one she is
currently using. To take advantage of this, Gaia provisions gateways to run
both the Discover and Acquire step on the volume owner's personal device by
default. The node itself participates in the peer-to-peer metadata service,
and in doing so maintains a local copy
of all of the users' volume certificate graphs and pointers to Gaia metadata.
In addition, the node maintains a set of service drivers that its gateways can use to load and store
chunks on behalf of the organization that runs it.
The default Discover step in Gaia is to try to fetch the volume record from each
storage system it has a driver for. This includes
looking up the volume owner's public key and
certificate graph in the node's local replica of
the system's set of zone files, since this information is required to
authenticate the record. The volume record itself is stored in the volume
owner's chosen cloud storage services, meaning that the gateway running the
Discover step only has to ask its
node's co-located storage driver to fetch it and decode it. In the worst case,
the Discover step has to try each storage service before succeeding.
The volume record contains the volume's manifest ID (recall that a Gaia volume
only has a single manifest). This is passed to the Acquire step, which by
default will fetch each of the volume owner's key space shards in parallel and
reassemble them into the list of keys available in the volume. Once the full
set of keys are assembled, the Acquire step can finally fetch the requested
value (as a single chunk).
To fetch the value's chunk, the default Acquire step looks at the volume record
to get a list of upstream Gaia gateways or storage providers that the volume
owner has listed as possibly storing a copy. The Acquire step iterates through
them in order, using one of the node's service drivers to contact each service.
Functionally speaking, the act of asking an upstream Gaia gateway for the chunk is
equivalent to asking a storage provider---to the requesting
gateway, there is no difference in mechanism (i.e. the upstream gateway is
treated like a storage provider).
To summarize, the sources of overhead in Gaia's default access flow stages are:
\begin{itemize}
\item \textbf{Fetching the volume record in the Discover step}. This adds both time overhead in
the form of a round-trip to a storage service, and a constant-space overhead from having
to store the volume record (which contains the manifest IDs).
\item \textbf{Fetching the manifest in the Acquire step}. This
is the work required to find the globally-consistent view of the manifest.
Each device that can write to a Gaia volume stores its own manifest, so a reader
will need to fetch all of them and merge them. This adds both a time
overhead in the form of a round-trip to a storage service (one per device
manifest), and a $O(kn)$ space overhead for $n$ keys and $k$ devices. The time overhead of
interest is the combined time
required to fetch all device manifests in parallel, and the time required to authenticate
each device's signature. The space
overhead is dominated by $n$, since $k$ is the number of devices the volume
owner can use to write (which in practice is small---on the order of 10 or fewer).
Each key requires a constant amount of space.
\item \textbf{Decoding and authenticating the chunk fetched from the Acquire
step}. This adds a time overhead that is $O(n)$ in the number of bytes, since the chunk must
be hashed.
\end{itemize}
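The manifest-merging step above can be sketched as follows. This is an
illustrative sketch rather than Gaia's actual code, and it assumes a
last-writer-wins merge keyed on a per-key version number; Gaia's real merge
policy may differ. It makes the $O(kn)$ time and $O(n)$ merged-space costs
concrete.

```python
def merge_device_manifests(device_manifests):
    """Merge per-device manifests into one view of the volume's keys.

    Each of the k devices publishes its own manifest mapping keys to
    (version, chunk_id) pairs; the reader fetches all of them and keeps
    the highest version seen per key.  O(kn) time for k devices and n
    keys; the merged result is O(n) space.
    """
    merged = {}
    for manifest in device_manifests:                     # k manifests
        for key, (version, chunk_id) in manifest.items(): # up to n keys
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, chunk_id)
    return merged
```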
\subsubsection{Measured Overheads}
Gaia's read performance was evaluated for records whose sizes vary between 64K
and 640K, in intervals of 64K. These sizes
were chosen to emulate the sizes of files being stored. For example, a typical
image would be around 640K (taken from Blockstagram~\cite{blockstagram}),
a typical text document would be around 320K (taken from Graphite
Docs~\cite{graphite-docs}), and the set of a user's
microblog posts would be around 64K (taken from Publik~\cite{publik}).
The read tests were done on a Gaia deployment that matches the \emph{cold start}
default deployment given to each user.
In this deployment, the user runs a local Gaia node that Discovers and Acquires data
from an upstream Gaia node. The upstream Gaia node loads and stores data from
Microsoft Azure Blob Storage~\cite{azure-blob-storage} in response to downstream
Gaia requests. Anyone can read from it, but it only accepts writes from users
who have stored profiles that point to at least two social media accounts (i.e.
in order to prevent bots from abusing the system). In this test, no caching of
any kind was performed by the local Gaia node.
This test evaluates a production deployment of Gaia that serves over
15,000 users. It is used by the majority of users of the Gaia-powered
applications described in Chapter~\ref{chap:applications}.
An instrumented version of a Gaia node was used to evaluate 150 reads on these
file sizes on the live Gaia network. A single record was written to an empty
volume and read back 150 times from a single device.
The overall read performance is shown in
Figure~\ref{fig:gaia-read-total}.
\begin{figure}[htp!]
\centering
\includegraphics[width=\textwidth]{figures/results/gaia-reads.all/Total}
\caption{Box-and-whiskers plot of the overall read performance.}
\label{fig:gaia-read-total}
\end{figure}
Unsurprisingly, the time taken to fetch data increases linearly with the amount
of data being read. Reading a ``small'' file of 64K took a median of 0.836
seconds on the cold path, and an average of 1.56 seconds with a standard
deviation of 1.79 seconds.
Each read operation is composed of a Discover and Acquire stage. The
performance of both of these stages is shown in
Figure~\ref{fig:gaia-read-stages}.
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/gaia-reads.all/Discover}
\caption{Gaia Discover performance.}
\label{fig:gaia-read-discover}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/gaia-reads.all/Acquire}
\caption{Gaia Acquire performance.}
\label{fig:gaia-read-acquire}
\end{subfigure}
\caption{Box-and-whiskers plots of access flow stage performances in Gaia,
for file sizes between 64K and 640K in increments of 64K.}
\label{fig:gaia-read-stages}
\end{figure}
As expected, the Discover stage time overhead is constant relative to the file
size---all that is happening in this stage is the constant-sized
volume record is being loaded.
Also, as expected, the Acquire stage time increases with file size, since
it includes the times taken to fetch the (single) manifest and fetch the (single) block.
The Acquire stage time overheads are further broken down in
Figure~\ref{fig:gaia-acquire-breakdown}.
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/gaia-reads.all/GetManifest}
\caption{Times taken to fetch the manifest in the Acquire step, for given
key/value block sizes.}
\label{fig:gaia-read-getmanifest}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/gaia-reads.all/GetBlocks}
\caption{Times taken to fetch the key/value block in the Acquire step, for
given sizes.}
\label{fig:gaia-read-getblocks}
\end{subfigure}
\caption{Box-and-whiskers plots of the Acquire stage performance.}
\label{fig:gaia-acquire-breakdown}
\end{figure}
In this test, the manifest size was constant, and took between 0.522 and 0.576
seconds to download in the median case. The times taken to download the record
increased with the record size, taking between 0.223 seconds (for 64K) and 0.448
seconds (for 640K) in the median cases.
The efficiency for each read was calculated for record sizes tested. Two
efficiency results were calculated. The ``cold efficiency'' is calculated as
the time taken to fetch the record data
divided by the total running time of the access flow. This includes fetching
the volume record from the underlying cloud storage provider, despite the fact
that in practice this record can be safely cached. The ``warm efficiency'' is
similar to the ``cold efficiency'', except the total running time of the access
flow \emph{excludes} the time taken to fetch the volume record. This
calculation reflects the act of caching the volume record.
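The two metrics can be written down directly. The sketch below uses
illustrative parameter names (the timing quantities are as defined above):
cold efficiency divides the data-fetch time by the full access flow time,
while warm efficiency subtracts the volume-record fetch from the denominator
to model a cached volume record.

```python
def efficiencies(t_record_data, t_volume_record, t_total):
    """Compute the cold and warm read efficiencies.

    t_record_data   -- time spent fetching the record's actual data
    t_volume_record -- time spent fetching the volume record
    t_total         -- total running time of the access flow
    """
    cold = t_record_data / t_total
    # Warm efficiency excludes the volume-record fetch, modeling a cache.
    warm = t_record_data / (t_total - t_volume_record)
    return cold, warm
```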
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/gaia-reads.all/Efficiency}
\caption{The cold efficiency, which includes the time taken to fetch the
volume record over the network.}
\label{fig:gaia-read-efficiency-cold}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/gaia-reads.all/Efficiency_cache}
\caption{The warm efficiency, which assumes the volume record is cached
and excludes it from the access flow's running time.}
\label{fig:gaia-read-efficiency-warm}
\end{subfigure}
\caption{Box-and-whiskers plots of Gaia's read efficiencies.}
\label{fig:gaia-read-efficiencies}
\end{figure}
Both efficiencies are shown in Figure~\ref{fig:gaia-read-efficiencies}. Naturally,
the efficiencies will tend towards 1.0 for larger and larger record sizes.
However, the act of caching the volume record can significantly improve the read
efficiency for small records.
\subsubsection{Recommendations for Developers}
Relative to fetching data directly from a storage provider, Gaia's most
noticeable time overheads come from fetching and decoding all of the metadata
and volume record information required to request the data.
Fortunately, these costs can be
mitigated by metadata-reading and metadata-writing strategies in
their applications' aggregation drivers.
\hfill \break
\noindent{\textbf{Metadata Caching}}
\hfill \break
In applications that assume at most one writer per login session,
the Gaia gateways can cache data for the duration of the session.
For example, the Gaia-powered Coins application~\cite{coins} assumes that the
user only accesses the data from the same trusted device. This application's
aggregation driver can
spare users the cost of reloading metadata on each read by having its Acquire
and Publish steps maintain a coherent local copy.
Some applications only need delta-consistency~\cite{delta-consistency}. In such
applications, users are already expected to have to wait a few seconds for newly-written data to
appear. In this case, the Acquire step can be implemented to cache metadata for
a particular key for an application-configurable amount of time. For example, a
social media application could cache the metadata for a user's avatar for a long
time, since it is not expected that the user will change it frequently. This
would save other readers the cost of having to fetch the same metadata for it
over and over.
The exact rules for how long to cache a given key/value pair's metadata are
application-specific, and affect the end-to-end storage semantics of the volume.
As such, metadata caching is not part of Gaia's default
behavior. Instead, Gaia defers to the aggregation driver to make these
decisions.
In practice, the volume record is cached for the duration of the session. This
was omitted from the read test since the test was meant to
measure overheads from cold starts (in order to identify overhead points).
Simply caching this record for the duration
of a login removes about 0.16 seconds from the read time.
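A minimal TTL cache of the kind an Acquire step could maintain for
delta-consistent applications is sketched below. The class and TTL values
are illustrative assumptions, not part of Gaia; the point is that expired
entries simply miss, sending the reader down the slow path.

```python
import time

class MetadataCache:
    """A minimal per-session TTL cache for key/value metadata.

    An Acquire step can consult this before fetching metadata over the
    network; a miss (or expired entry) falls back to the slow path.
    """
    def __init__(self):
        self._entries = {}   # key -> (expires_at, metadata)

    def put(self, key, metadata, ttl_seconds):
        # e.g. a long TTL for an avatar, a short one for a message log.
        self._entries[key] = (time.monotonic() + ttl_seconds, metadata)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[0] < time.monotonic():
            return None      # miss or expired: caller uses the slow path
        return entry[1]
```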
\hfill \break
\noindent{\textbf{Metadata Streaming}}
\hfill \break
Some applications like online shared document editors need to access metadata quickly
from a small set of peers. In these cases, the Acquire and Publish steps of the
aggregation driver can augment the default read path by eagerly replicating
their metadata via a shared broadcast channel (such as a shared Web socket), in
addition to replicating it to cloud storage.
This way, the Acquire step would listen for Publish events from other peers, and
eagerly update cached metadata before an application read occurs. If the
broadcast channel is down or starts dropping messages, the Acquire step would fall back
to fetching metadata via the slow path.
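The streaming behavior described above can be sketched as follows. The class
and method names are hypothetical; the sketch only shows the shape of the
technique: Publish events from peers eagerly update a local metadata cache,
and reads fall back to the slow path when the channel is unhealthy or the
key is missing.

```python
class StreamingAcquire:
    """Acquire step that layers a broadcast channel over the slow path."""

    def __init__(self, cache, slow_fetch):
        self.cache = cache               # key -> metadata
        self.slow_fetch = slow_fetch     # e.g. fetch via cloud storage
        self.channel_healthy = True

    def on_publish_event(self, key, metadata):
        # Eagerly apply a peer's Publish event before any read occurs.
        self.cache[key] = metadata

    def get_metadata(self, key):
        if self.channel_healthy and key in self.cache:
            return self.cache[key]
        # Channel down, dropping messages, or key unseen: slow path.
        return self.slow_fetch(key)
```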
Discovering the broadcast channel can be achieved by placing hints in the
volume record in the Gaia MS, such as a set of Websocket URLs. On loading, the
application would allow the user to connect to other peers by querying the
peers' volume records to find their broadcast channels of choice.
Not all applications need this feature, and even for applications that do need
it, the developer would need to select a broadcast channel that is capable of
handling the application's workload. As such, Gaia defers
metadata streaming to the aggregation driver implementation.
At least two Gaia-powered applications---Stealthy~\cite{stealthy.im} and
Hermes~\cite{hi-hermes}---are known to use this technique. Both are
encrypted chat applications, and use a streaming mechanism to accelerate message delivery instead of
forcing clients to continuously poll each others' Gaia volumes. They
both replicate message logs to Gaia so recipients can read messages
sent to them while they were away.
\subsection{Overheads in Syndicate}
In Syndicate, the Discover and Acquire steps always run on user gateways.
In the default case, the same user gateway runs both
an access flow's Discover and Acquire steps.
The default behavior of Syndicate's Discover step is to contact the
cloud-hosted MS for the latest manifest ID. Unlike Gaia, Syndicate's
metadata records are arranged into a filesystem-like file hierarchy, and
gateways do not need to obtain a full record of the volume metadata in order to access
blocks.
The time overhead of fetching a given metadata record in Syndicate
depends on how deep in the metadata
hierarchy it is, since at a minimum the user gateway must verify that
its cached copies of the path's directory logs are up-to-date.
If the directory's log is not cached or has been modified
since the last request, then the overhead will also include fetching and
synchronizing each directory log in the path.
The space overhead for processing the metadata path
is simply the sum of the metadata record fetched, plus the
sum of the sizes of the directory logs.
The default behavior of the Acquire step in Syndicate is to first fetch the
manifest from an upstream gateway, and then fetch the relevant blocks from one
or more upstream gateways. It uses the metadata record from the Discover step
to determine whether or not it needs to contact an acquisition gateway or the
volume's replica gateways.
When reading a record whose data is hosted in a curated dataset, the user
gateway's default Acquire step always contacts the acquisition gateway that
acts as its coordinator. It determines which gateway this is using the metadata
record acquired in the Discover step. The acquisition gateway is not given any
specialized aggregation driver logic by default. It responds to the user
gateway by using its service
driver to fetch and serve the requested block or manifest from the underlying
data set.
When reading a record whose data is stored in a cloud storage provider, the user
gateway may choose from a set of replica gateways that may be able to access it.
The default behavior of the Acquire step in this case is to ask each replica
gateway in order of increasing gateway ID. Once the user gateway successfully
fetches the manifest, it fetches at most six blocks in parallel from the replica
gateways. Similar to how it fetches the manifest, the user gateway will try
contacting each replica gateway in sequence by replica ID for the block (but
will process at most six blocks concurrently).
The choice of six parallel connections is inspired by the same implementation choice made in Web
browsers~\cite{browserscope-browsertest}. % http://www.browserscope.org/?category=network
The replica gateways are not given any aggregation driver logic for the Acquire
step by default; they simply use their service drivers to fetch and serve chunks
from their services upon request.
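The default fetch strategy described above can be sketched as follows. This
is an illustrative sketch, not Syndicate's actual implementation; the
per-gateway \texttt{fetch} callable is a hypothetical stand-in for contacting
a replica gateway over HTTP. Each block tries the replica gateways in
increasing gateway-ID order, with at most six blocks in flight at once.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(block_id, replica_gateways):
    """Try each replica gateway in increasing gateway-ID order."""
    for rg in sorted(replica_gateways, key=lambda g: g["id"]):
        data = rg["fetch"](block_id)   # hypothetical per-gateway fetch
        if data is not None:
            return data
    raise IOError("no replica gateway has block %s" % block_id)

def fetch_blocks(block_ids, replica_gateways):
    """Fetch blocks with at most six in flight, as Web browsers do."""
    with ThreadPoolExecutor(max_workers=6) as pool:
        futures = [pool.submit(fetch_block, b, replica_gateways)
                   for b in block_ids]
        return [f.result() for f in futures]
```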
To summarize, the overheads in Syndicate's default access flow stages are:
\begin{itemize}
\item \textbf{Synchronizing the metadata path's directory logs}.
In the best case, all of the path entries will be cached and up-to-date. In
this case, the time overhead is $O(n)$ for $n$ path entries, and $O(dn)$ space
overhead for $d$ entries per directory log. In the worst case,
all path entries will not be cached. In this case, both the time and space
overheads are $O(dn)$.
\item \textbf{Fetching blocks and manifests through a replica or acquisition gateway}. When
servicing an application read, the user gateway does not fetch the data directly
from the cloud service or dataset, but instead contacts an upstream replica
gateway or acquisition gateway to
fetch it on its behalf. However, this overhead is only
incurred when the user gateway cannot load the requested block or manifest from
an upstream CDN.
\item \textbf{Fetching, authenticating, and decoding the manifest}.
This adds $O(m)$ time and space overhead in the best case (i.e. the first
replica gateway contact has the manifest),
where $m$ is the number of blocks in the record. In the worst case, the time overhead is
$O(m+r)$ for $r$ replica gateways, since in the worst case the last replica
gateway to be contacted out of the set of replica gateways in the volume
has the manifest.
\item \textbf{Searching for the correct replica gateway}. Fetching a manifest
or block that is available only from replica gateways incurs at worst a $O(g)$
overhead, for $g$ replica gateways. This is due to the simple but inefficient
strategy of contacting replica gateways in order by gateway ID. The $g$
parameter is not expected to change very often relative to the occurrence of reads and
writes, since the user is not expected to frequently add and remove gateways.
\item \textbf{Fetching and decoding the data as a set of blocks}. A record that exceeds the
volume block size will be fetched piecemeal over HTTP. This adds $O(n/b)$ time
and space overhead, where $n$ is the number of bytes and $b$ is the block size.
The overheads come from processing and discarding the HTTP headers. In
addition, each block will be hashed in order to authenticate it,
yielding a $O(n/b)$ time overhead.
\end{itemize}
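The per-block authentication cost in the last item can be made concrete with
a short sketch. The hash function here is an assumption for illustration
(SHA-256 is used below; the text does not specify Syndicate's choice): each
of the $n/b$ fetched blocks is hashed and checked against the hash recorded
in the manifest.

```python
import hashlib

def authenticate_blocks(blocks, manifest_hashes):
    """Hash each fetched block and compare against the manifest.

    O(n/b) hash computations for n bytes split into b-sized blocks;
    raises if any block does not match its recorded hash.
    """
    for i, block in enumerate(blocks):
        digest = hashlib.sha256(block).hexdigest()
        if digest != manifest_hashes[i]:
            raise ValueError("block %d failed authentication" % i)
    return True
```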
\subsubsection{Measured Overheads}
Syndicate reads and writes data as fixed-sized blocks. Obviously, a larger
block size and a larger file size would improve the efficiency of Syndicate
reads and writes, since a greater fraction of the total operation time would be
spent uploading or downloading data. Therefore, read performance was measured
on a small block size of only 1k, and on a moderate block size of 10k, in order
to emphasize read overheads in the measurements.
File sizes were chosen between 10x and 100x the block size, in intervals of 10
blocks (meaning 10 different file sizes were tested per block size). Each file
of each size was downloaded 100 times to measure overheads.
The read test was carried out on a volume with one UG and one RG.
The RG and MS ran on a Microsoft
Azure VM and stored metadata and chunks to the VM's local disk.
The Discover step in these tests synchronized one directory (the root) and fetched
one metadata record. The Acquire step in these tests fetched a manifest representing between 10 and
100 blocks. The size of the block's metadata in the manifest is
constant---manifest sizes and download times are a function of the number of
blocks only. However, the size of the record metadata on the MS increases with
the number of blocks, since the MS stores a garbage-collection log for all
blocks in the record. Each garbage-collection entry is only 8 bytes.
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-writes.1024.all/Total}
\caption{Syndicate access flow performance with 1K blocks}
\label{fig:syndicate-read-total-1k}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-writes.10240.all/Total}
\caption{Syndicate access flow performance with 10K blocks}
\label{fig:syndicate-read-total-10k}
\end{subfigure}
\caption{Box-and-whiskers plots of end-to-end access flow performances in
Syndicate, for 1K and 10K block sizes.}
\label{fig:syndicate-read-total}
\end{figure}
The total read performances for 1K and 10K block sizes are shown in
Figure~\ref{fig:syndicate-read-total}. These represent the times taken by all
of the Discover and Acquire logic, including the aforementioned overheads.
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.1024.all/Discover}
\caption{Syndicate Discover performance with 1K blocks}
\label{fig:syndicate-read-discover-1k}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.1024.all/Acquire}
\caption{Syndicate Acquire performance with 1K blocks}
\label{fig:syndicate-read-acquire-1k}
\end{subfigure}
\caption{Box-and-whiskers plots of access flow stage performances in
Syndicate, for file sizes between 10K and 100K and a block size of 1K.}
\label{fig:syndicate-read-stages-1K}
\end{figure}
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.10240.all/Discover}
\caption{Syndicate Discover performance with 10K blocks}
\label{fig:syndicate-read-discover-10k}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.10240.all/Acquire}
\caption{Syndicate Acquire performance with 10K blocks}
\label{fig:syndicate-read-acquire-10k}
\end{subfigure}
\caption{Box-and-whiskers plots of access flow stage performances in
Syndicate, for file sizes between 100K and 1000K and a block size of 10K.}
\label{fig:syndicate-read-stages-10K}
\end{figure}
Figures~\ref{fig:syndicate-read-stages-1K} and \ref{fig:syndicate-read-stages-10K}
show the Discover and Acquire stage performances of Syndicate access flows for 1K and
10K blocks, respectively.
In both cases, the times taken by the Discover step increase slightly for
records with 80, 90, and 100 blocks. This is because the MS
incurs extra disk overhead from loading a metadata record with a larger
garbage-collection log. This could be optimized away.
Unsurprisingly, the Acquire stages increase in time as a linear function of the
number of blocks fetched. The spreads of the distributions increase
with the number of blocks because more blocks introduce more noise into the
measurement. The time taken to fetch records increases faster for 10K blocks
than for 1K blocks.
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.1024.all/GetManifest}
\label{fig:syndicate-getmanifest-1k}
\caption{Syndicate's performance of fetching manifests with 1K blocks}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.1024.all/GetBlocks}
\label{fig:syndicate-getblocks-1k}
\caption{Syndicate's performance of fetching records of various sizes with
1K blocks.}
\end{subfigure}
\caption{Box-and-whiskers plots of Syndicate's Acquire stage performance with 1K
blocks.}
\label{fig:syndicate-acquire-breakdown-1K}
\end{figure}
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.10240.all/GetManifest}
\label{fig:syndicate-getmanifest-10k}
\caption{Syndicate's performance of fetching manifests with 10K blocks}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.10240.all/GetBlocks}
\caption{Syndicate's performance of fetching the blocks for records of various sizes with
10K blocks.}
\label{fig:syndicate-getblocks-10k}
\end{subfigure}
\caption{Box-and-whiskers plots of Syndicate's Acquire stage performance with
10K blocks.}
\label{fig:syndicate-acquire-breakdown-10K}
\end{figure}
The Acquire stage's tasks of fetching the manifest and fetching
the blocks are further broken down in
Figures~\ref{fig:syndicate-acquire-breakdown-1K} and
\ref{fig:syndicate-acquire-breakdown-10K}.
The times taken to fetch the manifests are about the same across the record
and block sizes measured. While
the size of a manifest grows linearly with the number of blocks, it does so
with a small constant factor (48 bytes per block). The times taken to
fetch the blocks are the reason why the Acquire stage's time increases linearly
with the record size.
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.1024.all/Efficiency}
\label{fig:syndicate-cold-efficiency-1k}
\caption{The cold efficiency for 1K blocks, which includes the time taken to fetch the
metadata record from the MS.}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.1024.all/Efficiency_cache}
\label{fig:syndicate-warm-efficiency-1k}
\caption{The warm efficiency, which assumes the metadata record is cached
and excludes it from the access flow's running time}
\end{subfigure}
\caption{Box-and-whiskers plots of Syndicate's read efficiencies for 1K blocks.}
\label{fig:syndicate-read-efficiencies-1k}
\end{figure}
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.10240.all/Efficiency}
\caption{The cold efficiency for 10K blocks, which includes the time taken to fetch the
metadata record from the MS.}
\label{fig:syndicate-cold-efficiency-10k}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/syndicate-reads.10240.all/Efficiency_cache}
\label{fig:syndicate-warm-efficiency-10k}
\caption{The warm efficiency, which assumes the metadata record is cached
and excludes it from the access flow's running time}
\end{subfigure}
\caption{Box-and-whiskers plots of Syndicate's read efficiencies for 10K blocks.}
\label{fig:syndicate-read-efficiencies-10k}
\end{figure}
Figures~\ref{fig:syndicate-read-efficiencies-1k} and
\ref{fig:syndicate-read-efficiencies-10k} plot the cold and warm efficiencies of
Syndicate, for 1K blocks and 10K blocks respectively. In both cases, the
efficiencies approach 1.0 as the record size increases. The warm efficiency
excludes the time taken to fetch the metadata record from the MS (i.e. by
caching it locally). Caching metadata makes a noticeable difference in the
system's efficiency for small files---it can be up to 33\% higher.
\subsubsection{Recommendations for Developers}
Syndicate gives developers several options to manage read performance in the
face of these overheads. A few of these recommendations have been put into
practice in production settings.
\hfill \break
\noindent{\textbf{Long Metadata TTL with Explicit Invalidation}}
\hfill \break
Volumes of scientific data often have few writers. In cases where a volume is
backed by a dataset, the only writer would be the acquisition gateway that
crawls the dataset. In cases where a volume acts as a data dump or a scratch space,
writes happen only when a workload finishes, and occur on the same set of
metadata paths (e.g. each user or each workflow would write its dump to its own
directory).
Developers can take advantage of these special cases to save round-trips to the MS.
The Publish steps of writer gateways can be programmed to
broadcast a metadata invalidation hint to all read-capable gateways in the
volume. The MS would only be contacted as a fallback.
This strategy is used in the scientific data-sharing application today in order
to improve the efficiency of reading small files.
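The discipline above---long TTLs, explicit invalidation hints, and the MS as a fallback---can be sketched as follows. This is a minimal illustration; the class and helper names are hypothetical and not part of Syndicate's actual gateway API.

```python
METADATA_TTL = 3600  # a long TTL (seconds) is safe because writes are rare

class MetadataCache:
    """Sketch of a reader-side metadata cache with explicit invalidation."""

    def __init__(self):
        self._cache = {}  # path -> (record, expiry time)

    def get(self, path, now, fetch_from_ms):
        entry = self._cache.get(path)
        if entry and entry[1] > now:
            return entry[0]            # fresh entry: no MS round-trip
        record = fetch_from_ms(path)   # fallback: contact the MS
        self._cache[path] = (record, now + METADATA_TTL)
        return record

    def invalidate(self, path):
        # Called when a writer's Publish step broadcasts an invalidation hint.
        self._cache.pop(path, None)
```

A reader consults the cache on every Discover step; only a broadcast invalidation (or TTL expiry) forces a round-trip back to the MS.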
\hfill \break
\noindent{\textbf{Use a CDN}}
\hfill \break
Syndicate was designed to be used with a CDN. Developers wishing to get the
best read performance would implement their Acquire step to contact one or more
CDNs that can pull down chunks from upstream replica gateways. This is highly
beneficial for read-heavy workloads, where most of the chunks can be cached
close to readers. This reduces the number of network round-trips and reduces
the amount of transit traffic out of cloud storage providers, all without
violating end-to-end storage semantics and organizational autonomy.
The performance boost developers can expect to see depends on the CDN leveraged
and the size of the working set.
However, a benefit of breaking data into chunks is that
developers can expect the CDN to accelerate reads even if only
part of the data is cached.
This strategy is used in the scientific data-sharing application today. The
CDN---an instance of the Akamai~\cite{akamai} CDN---is deployed on OpenCloud~\cite{opencloud}.
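A CDN-aware Acquire step can be sketched as below. The URLs and the \texttt{fetch} helper are illustrative only; the one structural assumption, consistent with the design described above, is that chunk names embed a version, so cached copies are immutable and a miss falls back to the upstream replica gateway.

```python
def acquire_chunk(chunk_id, version, fetch):
    """Try the CDN first; fall back to the upstream replica gateway.
    `fetch` returns the chunk bytes, or None on a miss/error."""
    cdn_url = f"https://cdn.example.org/{chunk_id}.{version}"       # hypothetical
    origin_url = f"https://rg.example.org/chunks/{chunk_id}.{version}"
    data = fetch(cdn_url)
    if data is None:           # CDN miss: go to the origin
        data = fetch(origin_url)
    return data
```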
\hfill \break
\noindent{\textbf{Gateway-local Block Cache}}
\hfill \break
Since Syndicate handles end-to-end semantics at a level above block transport,
each gateway can implement a write-coherent block cache in its Acquire stage.
This effectively adds multiple tiers to a commodity CDN---the first tier would be at the user
gateways, the CDN would be the shared middle tier, and the replica and acquisition
gateways would be the top tier. Syndicate gateways offer this feature as a
built-in option, but using it requires the developer to set the cache size first
(which is workload-specific).
This strategy is deployed in the scientific data-sharing application.
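A gateway-local block cache of this kind can be sketched as follows. Keying entries by (block ID, version) is what makes the cache write-coherent: a write bumps the block version, so stale data can never be served. The developer-supplied cache size is the workload-specific parameter mentioned above.

```python
from collections import OrderedDict

class BlockCache:
    """Sketch of a write-coherent, bounded LRU block cache for an
    Acquire stage. Entries are keyed by (block ID, version)."""

    def __init__(self, max_blocks):
        self.max_blocks = max_blocks   # developer-set, workload-specific
        self._lru = OrderedDict()

    def get(self, block_id, version):
        key = (block_id, version)
        if key in self._lru:
            self._lru.move_to_end(key)  # mark as recently used
            return self._lru[key]
        return None                     # miss: caller fetches from CDN/RG

    def put(self, block_id, version, data):
        key = (block_id, version)
        self._lru[key] = data
        self._lru.move_to_end(key)
        while len(self._lru) > self.max_blocks:
            self._lru.popitem(last=False)  # evict least-recently used
```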
\hfill \break
\noindent{\textbf{Chunk Advertisement}}
\hfill \break
If the developer implements a gateway-local block cache in the Acquire step, a
complementary feature would be allowing gateways to advertise to one another
which chunks they have cached. If the Acquire step detects that a nearby peer
has a cached chunk, then it could fetch the chunk from the nearby peer instead
of from an upstream cache. This is useful in cluster computing, where
host-to-host bandwidth is high but bandwidth in and out of the cluster is
comparatively low. It may be cheaper to fetch a chunk from a cluster peer than
an upstream CDN node.
This strategy is also useful for MapReduce~\cite{mapreduce}-style
workflows, where the job scheduler can query gateways to determine
which chunks are already cached so it can schedule jobs on hosts that already
have the requisite data. This is a feature implemented in Syndicate's HDFS driver, for
example.
This is not part of the default behavior because it makes assumptions about the
network bandwidth between gateways and assumptions about the threat model the
deployment faces. A wide-area Syndicate volume would not want this feature,
because it would disclose to the Internet information about which gateways could
access which data, and thus give an attacker insight into which hosts
to compromise in order to exfiltrate it.
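The peer-lookup logic an Acquire step would add is small; a sketch follows, with illustrative data structures (advertisements here are just per-peer sets of chunk IDs).

```python
def locate_chunk(chunk_id, peer_adverts, upstream):
    """Return the cheapest source for a chunk: a cluster peer that has
    advertised it, else the upstream CDN or replica gateway.

    peer_adverts: dict mapping peer name -> set of advertised chunk IDs.
    """
    for peer, cached_chunks in peer_adverts.items():
        if chunk_id in cached_chunks:
            return peer        # intra-cluster fetch: high bandwidth, cheap
    return upstream            # fall back to the upstream source
```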
\hfill \break
\noindent{\textbf{Chunk Compression}}
\hfill \break
Syndicate gives developers the ability to control the wire-format of each chunk.
If the entropy of the data is low, then developers stand to gain by having their
gateways' \texttt{serialize()} and \texttt{deserialize()} driver methods
compress and decompress chunks. However, if the data has high entropy, then
this strategy would be useless. Syndicate does not do this by default, but
instead defers to developers to make the right decision based on their data.
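For example, a driver's \texttt{serialize()} and \texttt{deserialize()} methods could compress with zlib; the method names follow the driver interface described above, while the codec choice is the developer's (zlib is just one reasonable option for low-entropy data).

```python
import zlib

def serialize(chunk: bytes) -> bytes:
    """Compress a chunk before it goes on the wire."""
    return zlib.compress(chunk, 6)

def deserialize(wire: bytes) -> bytes:
    """Decompress a chunk received from the wire."""
    return zlib.decompress(wire)
```

For high-entropy data (e.g. already-encrypted or already-compressed chunks), the compressed form can be as large as the original, which is why this is left to the developer.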
\hfill \break
\noindent{\textbf{Read-ahead}}
\hfill \break
Many scientific workflows read sequentially. If this is the application's
behavior, then the developer can program the Acquire step to pre-fetch blocks
asynchronously. This is useful if the application is reading
variable-sized ranges of a file that straddle block boundaries---the last block
fetched in one read will be the first block fetched in the next read, so keeping
it local would save a round-trip.
Syndicate does not perform read-ahead by
default because it cannot assume that data reads are sequential. In a
random-read workload, read-ahead would be more wasteful than the default
behavior. However, if the developers know that their application has a
read-sequential access pattern, then they can add this behavior to the Acquire
stage.
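A minimal read-ahead Acquire stage might look like the following sketch: while block $i$ is returned to the application, block $i+1$ is fetched in the background. The class and its single-block lookahead are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

class ReadAhead:
    """Sketch of sequential read-ahead for an Acquire stage."""

    def __init__(self, fetch_block):
        self._fetch = fetch_block            # developer-supplied block fetcher
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = {}                   # block index -> Future

    def read(self, i):
        future = self._pending.pop(i, None)
        # Use the prefetched block if we have one; otherwise fetch now.
        data = future.result() if future else self._fetch(i)
        # Anticipate a sequential read: prefetch the next block asynchronously.
        self._pending[i + 1] = self._pool.submit(self._fetch, i + 1)
        return data
```

Under a random-read workload every prefetch is wasted, which is why this is opt-in rather than the default.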
\hfill \break
\hfill \break
\noindent{\textbf{Favor Shallow Metadata Hierarchies}}
\hfill \break
Developers can reduce the amount of time spent querying metadata by organizing
their data into shallow directory hierarchies. This would cut down on the
number of round-trips to the MS to resolve a path. In addition, developers can
ensure that their directories do not get too big in order to minimize the
cold-cache start-up time for a user gateway to synchronize its metadata logs.
This strategy is used in the scientific data-sharing application.
\section{Mutate Flows}
A mutate flow has three steps: a Build step which constructs a new manifest
for a record that incorporates the modified blocks, a Push step which replicates
the new manifest and new blocks, and a Publish step which makes the mutation
visible to subsequent access flows. A SDS system supplies default
implementations of these steps, but they may be overwritten by the aggregation
driver. This section presents the time and space overheads the default steps in
Gaia and Syndicate impose on top of application writes, and presents a
discussion on how developers can minimize them.
\subsection{Overheads in Gaia}
To handle writes, Gaia's default strategy is to process a
mutate flow entirely on the volume owner's device. When the volume
owner signs into the application, the device's Gaia node instantiates gateways with the Build,
Push, and Publish stages in order to service application writes for this session.
The Build stage takes the new key/value pair the application is trying to write,
and assembles a new key space shard to replicate. The Push stage takes the
key's value and replicates it to the volume's cloud storage
services. This may include Pushing them to an upstream Gaia node, which carries
out further processing (but to the node doing the Push, the upstream Gaia node
looks and behaves like another cloud storage service). The Publish stage takes
the new key space shard and replicates it alongside the Pushed key value.
The default deployment of Gaia implements a couple of optimizations.
First, the Gaia node optimizes the execution of a mutate flow as a sequence
of subroutine calls. There is minimal overhead in passing control from the
Build stage to the Push stage, and from the Push stage to the Publish stage.
Second, the Push and Publish stages execute in parallel by default. This is
because there are often no logical dependencies between them that require them
to run sequentially.
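The sequencing just described can be sketched as follows; the stage callables are placeholders for the actual Gaia node stages.

```python
from concurrent.futures import ThreadPoolExecutor

def run_mutate_flow(build, push, publish):
    """Sketch of Gaia's default mutate flow: Build runs first, then Push
    and Publish run in parallel, since neither depends on the other."""
    shard, value = build()                       # new key shard + new value
    with ThreadPoolExecutor(max_workers=2) as pool:
        push_f = pool.submit(push, value)        # replicate the value
        publish_f = pool.submit(publish, shard)  # replicate the key shard
        push_f.result()                          # wait for both; surface errors
        publish_f.result()
```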
The end-to-end default write overheads include:
\begin{itemize}
\item \textbf{The time and space overheads of generating the new metadata}. In
the Build step, the Gaia node will need to hash the key/value pair
chunk and append it to the manifest. This adds a $O(n)$ time overhead, where
$n$ is the size of the value. In addition, the Gaia node will need to ensure
that it has a fresh copy of the key shard before it can build a new key
shard (i.e. before the mutate flow executes, another mutate flow may have
executed from another one of the user's devices).
\item \textbf{The time and space overheads in storing the new key shard}. Each new
key added takes $O(1)$ additional space to the volume's manifest, and $O(1)$
additional space to the volume's metadata.
Storing the key shard adds a $O(n)$ time and space overhead for $n$ records in the volume
(since in the worst case, a key shard can have as many records as there are
keys in the volume). These costs are incurred in the Publish step, where the
volume's manifest is replicated.
\end{itemize}
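The first overhead---hashing the new value and recording it in a fresh copy of the key shard---can be illustrated with a short sketch (the function and its flat-dictionary shard representation are hypothetical simplifications):

```python
import hashlib

def build_shard(shard, key, value: bytes):
    """Sketch of the Build step's metadata work: hash the new value,
    which is O(n) in the value size, and record it in a fresh copy of
    the key shard, which is O(1) extra metadata per key."""
    digest = hashlib.sha256(value).hexdigest()
    new_shard = dict(shard)     # assumes the fetched shard copy is fresh
    new_shard[key] = digest
    return new_shard
```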
\subsubsection{Measured Overheads}
Write overheads in Gaia were measured on the live Gaia network. Just as with
the read experiment, this test was conducted on a representative Gaia deployment
whereby the user runs a local Gaia node that will Push and Publish new data to
an upstream Gaia node, which in turn Pushes the data (as chunks) to a bucket in
Microsoft Azure.
The test wrote key/value pairs with sizes ranging between 64K and 640K, in
intervals of 64K, to simulate writing
data from real-world Gaia applications. The test wrote the files 150 times
using an instrumented Gaia node to measure overheads. Each run was from a cold
start---there was no caching performed between requests.
\begin{figure}[htp!]
\centering
\includegraphics[width=\textwidth]{figures/results/gaia-writes.all/Total}
\caption{Box-and-whiskers plot of the overall write performance.}
\label{fig:gaia-write-total}
\end{figure}
The overall write performance for Gaia is shown in
Figure~\ref{fig:gaia-write-total}. While the measurement is noisy, the write
times increase linearly with the file size. The noise stems from
the fact that the upstream Gaia node is shared with many Gaia users.
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/gaia-writes.all/Build}
\label{fig:gaia-write-build}
\caption{Gaia Build performance. Note that this includes the cost of
fetching the existing manifest before constructing a new one.}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/gaia-writes.all/PushPublish}
\label{fig:gaia-write-pushpublish}
\caption{Gaia Push/Publish performance (both stages execute in parallel).}
\end{subfigure}
\caption{Box-and-whiskers plots of mutate flow stage performances in Gaia,
for file sizes between 64K and 640K in increments of 64K.}
\label{fig:gaia-write-stages}
\end{figure}
Figure~\ref{fig:gaia-write-stages} shows the Build, Push, and Publish stage
performances in Gaia. In this test, the Build stage includes
the time taken to fetch a copy of the device manifest to update. If the device
manifest is cached, then the Build stage is extremely fast---effectively the amount of
time taken to hash the data and insert it into a hash table and serialize the
hash table to a string for upload.
The Push and Publish stages run in parallel in Gaia by default, so their
measurements are combined.
\begin{figure}[htp!]
\centering
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/gaia-writes.all/Efficiency}
\caption{The cold efficiency, which includes the time taken to fetch the
volume record over the network.}
\label{fig:gaia-write-cold-efficiency}
\end{subfigure}
\begin{subfigure}[b]{.8\textwidth}
\includegraphics[width=\textwidth]{figures/results/gaia-writes.all/Efficiency_cache}
\caption{The warm efficiency, which assumes the volume record is cached
and excludes it from the mutate flow's running time}
\label{fig:gaia-write-warm-efficiency}
\end{subfigure}
\caption{Box-and-whiskers plots of Gaia's write efficiencies.}
\label{fig:gaia-write-efficiencies}
\end{figure}
This test calculated the efficiency of Gaia's write path in two ways---a ``cold
write efficiency'' which includes the cost of fetching the manifest in the Build
stage, and a ``warm write efficiency'' which excludes this step. Including both
efficiency measures is valuable to developers because oftentimes the manifest
can be safely cached across writes. Figure~\ref{fig:gaia-write-efficiencies}
reports both cold and warm efficiencies. The efficiency of the write path
improves somewhat when the manifest can be cached across writes.
\subsubsection{Recommendations to Developers}
Developers have a few strategies available to alter the performance of writes in Gaia.
The specific strategies taken ultimately depend on the workload and data being
stored.
\hfill \break
\noindent{\textbf{Incremental Key Space Shard Writes}}
\hfill \break
Some applications may have a large key space, but only need to carry out a
single key/value write at a time.
The aggregation driver has an opportunity to reduce the amount of time and space that
need to be consumed to carry the write out by only writing the new key metadata.
If the application only wrote one value, then only one key in the manifest would
be altered. The Build stage could be optimized to inspect the Gaia node's
key space shard in-between writes, and pass along only the delta between writes
to the Push stage.
The Push stage would accumulate deltas from the Build stage, and combine them
into a single key shard in the backend storage service. Then, a subsequent
Acquire step would continue to fetch the key space shard as expected.
The reason this is not the default behavior is because patching a record
efficiently requires the cloud service to support a ``range write''
API call, whereby the client specifies a byte offset and length as to where to
write the given data. Most popular cloud storage providers do not support
this---they only allow clients to write whole records. For these services,
a Push stage could not efficiently write key shard deltas, since it would need
to load the entire key shard, patch it, and store the entire updated key shard
on each write. As such, this behavior would be added by developers in the
special case where they were using a suitable cloud storage provider.
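The delta-accumulating Push stage described above can be sketched as follows. The ``range write'' backend call is the assumption here: it stands in for a provider API that accepts a byte offset and data, which most popular providers do not offer.

```python
class DeltaPush:
    """Sketch of an optimized Push stage that accumulates key-shard
    deltas from the Build stage and flushes them via a (hypothetical)
    range-write backend call."""

    def __init__(self, range_write):
        self._range_write = range_write  # assumed backend API: (offset, data)
        self._deltas = []

    def push_delta(self, offset, data):
        # Build hands us only the bytes that changed in the key shard.
        self._deltas.append((offset, data))

    def flush(self):
        for offset, data in self._deltas:
            self._range_write(offset, data)  # patch only the changed bytes
        self._deltas.clear()
```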
\hfill \break
\noindent{\textbf{Write Batching}}
\hfill \break
Applications may not need all of their writes to be Published immediately.
Instead, a Publish can reflect many writes at once. This would be allowed if
the application's storage semantics do not require all peers to see each others'
most-recent state. This can lead to better overall performance for applications
that frequently overwrite the same key/value pairs---overwritten key/value pairs
would not need to be replicated.
Applications that have semantics compatible with write-batching can not only
realize better performance than the default behavior, but also take advantage of
client-side libraries that offer more expressive storage interfaces.
Examples include Compass~\cite{blockstack-compass}, which provides a
MongoDB-like interface, and \texttt{sql.js}~\cite{sql.js}, which provides a
client-side SQLite implementation. Both of these libraries are easily used with
Gaia, provided that the application's storage semantics allow write-batching
(i.e. a Publish would take place in response to the application committing a
transaction in one of these APIs). In fact, Compass was designed specifically
for Gaia by a third party contributor.
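The batching behavior can be sketched as below: overwrites of the same key coalesce locally, and a single Publish (e.g. triggered by a transaction commit in one of the client-side libraries above) makes the whole batch visible. The class and its callbacks are illustrative.

```python
class BatchedWriter:
    """Sketch of write batching: overwritten key/value pairs are never
    replicated, and one Publish reflects many writes at once."""

    def __init__(self, push, publish):
        self._push = push          # replicates one key/value pair
        self._publish = publish    # makes a set of keys visible
        self._batch = {}           # key -> latest value only

    def write(self, key, value):
        self._batch[key] = value   # an overwrite simply replaces the old value

    def commit(self):              # e.g. on an application-level transaction
        for key, value in self._batch.items():
            self._push(key, value)
        self._publish(sorted(self._batch))  # one Publish for the whole batch
        self._batch.clear()
```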
\subsection{Overheads in Syndicate}
Syndicate's default write strategy is to make data as durable as possible.
This is realized by the default behaviors of replicating all
manifests and blocks to all replica gateways in the volume in the
Push stage, and synchronously uploading the record's metadata to the Syndicate
MS in the Publish stage.
User gateways invoke the Build, Push, and Publish stages on write.
Since Syndicate is designed to process scientific workloads, it expects
multiple write-capable user gateways to be online at once. However, it assumes
that user gateways usually (but not always) write to the same files that they
coordinate. This is reasonable in practice, since scientific computing loads
are usually designed to run on many parallel computers which share as little
data with one another as possible.
In light of this, the default common-case behaviors of the Build, Push, and
Publish stages in Syndicate are to assemble a new manifest locally (Build),
replicate the manifest and blocks to all replica gateways (Push), and
synchronously upload the new metadata to the MS (Publish). Push and Publish are
run automatically when a record is \texttt{close()}'ed, if the application does
not do so explicitly via a call to \texttt{fsync()}. These are the default
behaviors when the user gateway carrying out the write is also the coordinator.
If the writer gateway is not the coordinator, then it enlists the
coordinator's help to carry out the write. The writer's Push stage will first
replicate the new blocks, and then synchronously request that the coordinator
both Push a new manifest with the requested changes
and Publish new metadata that reflects it.
By default, replica gateways do not have any specialized aggregation driver
logic on the write path.
They simply accept chunks from user gateways, and replicate them with
their service drivers.
To summarize, the write overheads in Syndicate are as follows:
\begin{itemize}
\item \textbf{The time and space overheads of building a new manifest}. By
default, a UG will incur a network round-trip when it Builds a new manifest for a record that
it does not coordinate. In addition, the UG will incur a round-trip to
the MS to ensure that the manifest it is modifying is fresh when executing
the default Build implementation.
\item \textbf{The time and space overheads of storing new metadata}. The
record's coordinator will incur a network round-trip to the MS when Publishing
new data, and storing the new metadata incurs $O(1)$ extra space. In the case
where the writer is not the coordinator, two network round-trips are incurred:
one to the MS and back, and one to the coordinator and back.
\item \textbf{The time and space overheads of storing a new manifest}. The
record's coordinator will incur a network round-trip to each replica gateway to
store the new manifest, and a network round-trip from each replica gateway to
its underlying storage services. This yields $O(g)$ round-trips, where $g$ is the number of