forked from oasis-tcs/virtio-spec
-
Notifications
You must be signed in to change notification settings - Fork 0
/
virtio-vsock.tex
336 lines (252 loc) · 13.9 KB
/
virtio-vsock.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
\section{Socket Device}\label{sec:Device Types / Socket Device}
The virtio socket device is a zero-configuration socket communications device.
It facilitates data transfer between the guest and device without using the
Ethernet or IP protocols.
\subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID}
19
\subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
\begin{description}
\item[0] rx
\item[1] tx
\item[2] event
\end{description}
\subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
\begin{description}
\item[VIRTIO_VSOCK_F_STREAM (0)] stream socket type is supported.
\item[VIRTIO_VSOCK_F_SEQPACKET (1)] seqpacket socket type is supported.
\item[VIRTIO_VSOCK_F_NO_IMPLIED_STREAM (2)] stream socket type is not implied.
\end{description}
\drivernormative{\subsubsection}{Feature bits}{Device Types / Socket Device / Feature bits}
The driver SHOULD accept the VIRTIO_VSOCK_F_NO_IMPLIED_STREAM feature if
offered by the device.
If no feature bit has been negotiated, the driver SHOULD act as if
VIRTIO_VSOCK_F_STREAM has been negotiated.
If VIRTIO_VSOCK_F_SEQPACKET has been negotiated, but not
VIRTIO_VSOCK_F_NO_IMPLIED_STREAM, the driver MAY act as if
VIRTIO_VSOCK_F_STREAM has also been negotiated.
\devicenormative{\subsubsection}{Feature bits}{Device Types / Socket Device / Feature bits}
The device SHOULD offer the VIRTIO_VSOCK_F_NO_IMPLIED_STREAM feature.
If no feature bit has been negotiated, the device SHOULD act as if
VIRTIO_VSOCK_F_STREAM has been negotiated.
If VIRTIO_VSOCK_F_SEQPACKET has been negotiated, but not
VIRTIO_VSOCK_F_NO_IMPLIED_STREAM, the device MAY act as if
VIRTIO_VSOCK_F_STREAM has also been negotiated.
\subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
Socket device configuration uses the following layout structure:
\begin{lstlisting}
struct virtio_vsock_config {
le64 guest_cid;
};
\end{lstlisting}
The \field{guest_cid} field contains the guest's context ID, which uniquely
identifies the device for its lifetime. The upper 32 bits of the CID are
reserved and zeroed.
The following CIDs are reserved and cannot be used as the guest's context ID:
\begin{tabular}{|l|l|}
\hline
CID & Notes \\
\hline \hline
0 & Reserved \\
\hline
1 & Reserved \\
\hline
2 & Well-known CID for the host \\
\hline
0xffffffff & Reserved \\
\hline
0xffffffffffffffff & Reserved \\
\hline
\end{tabular}
\subsection{Device Initialization}\label{sec:Device Types / Socket Device / Device Initialization}
\begin{enumerate}
\item The guest's cid is read from \field{guest_cid}.
\item Buffers are added to the event virtqueue to receive events from the device.
\item Buffers are added to the rx virtqueue to start receiving packets.
\end{enumerate}
\subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Operation}
Packets transmitted or received contain a header before the payload:
\begin{lstlisting}
struct virtio_vsock_hdr {
le64 src_cid;
le64 dst_cid;
le32 src_port;
le32 dst_port;
le32 len;
le16 type;
le16 op;
le32 flags;
le32 buf_alloc;
le32 fwd_cnt;
};
\end{lstlisting}
The upper 32 bits of src_cid and dst_cid are reserved and zeroed.
Most packets simply transfer data but control packets are also used for
connection and buffer space management. \field{op} is one of the following
operation constants:
\begin{lstlisting}
#define VIRTIO_VSOCK_OP_INVALID 0
/* Connect operations */
#define VIRTIO_VSOCK_OP_REQUEST 1
#define VIRTIO_VSOCK_OP_RESPONSE 2
#define VIRTIO_VSOCK_OP_RST 3
#define VIRTIO_VSOCK_OP_SHUTDOWN 4
/* To send payload */
#define VIRTIO_VSOCK_OP_RW 5
/* Tell the peer our credit info */
#define VIRTIO_VSOCK_OP_CREDIT_UPDATE 6
/* Request the peer to send the credit info to us */
#define VIRTIO_VSOCK_OP_CREDIT_REQUEST 7
\end{lstlisting}
\field{len} is the size of the payload, in bytes. However, the driver may
provide buffer(s) for the payload that have a total size longer than
\field{len}, in which case only the first \field{len} bytes will be used for
the actual data.
\subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
The tx virtqueue carries packets initiated by applications and replies to
received packets. The rx virtqueue carries packets initiated by the device and
replies to previously transmitted packets.
If both rx and tx virtqueues are filled by the driver and device at the same
time then it appears that a deadlock is reached. The driver has no free tx
descriptors to send replies. The device has no free rx descriptors to send
replies either. Therefore neither device nor driver can process virtqueues
since that may involve sending new replies.
This is solved using additional resources outside the virtqueue to hold
packets. With additional resources, it becomes possible to process incoming
packets even when outgoing packets cannot be sent.
Eventually even the additional resources will be exhausted and further
processing is not possible until the other side processes the virtqueue that
it has neglected. This stop to processing prevents one side from causing
unbounded resource consumption in the other side.
\drivernormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
The rx virtqueue MUST be processed even when the tx virtqueue is full so long as there are additional resources available to hold packets outside the tx virtqueue.
\devicenormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
The tx virtqueue MUST be processed even when the rx virtqueue is full so long as there are additional resources available to hold packets outside the rx virtqueue.
\subsubsection{Addressing}\label{sec:Device Types / Socket Device / Device Operation / Addressing}
Flows are identified by a (source, destination) address tuple. An address
consists of a (cid, port number) tuple. The header fields used for this are
\field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}.
Currently stream and seqpacket sockets are supported. \field{type} is 1 (VIRTIO_VSOCK_TYPE_STREAM)
for stream socket types, and 2 (VIRTIO_VSOCK_TYPE_SEQPACKET) for seqpacket socket types.
\begin{lstlisting}
#define VIRTIO_VSOCK_TYPE_STREAM 1
#define VIRTIO_VSOCK_TYPE_SEQPACKET 2
\end{lstlisting}
Stream sockets provide in-order, guaranteed, connection-oriented delivery
without message boundaries. Seqpacket sockets provide in-order, guaranteed,
connection-oriented delivery with message and record boundaries.
\subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / Device Operation / Buffer Space Management}
\field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of
stream sockets. The guest and the device publish how much buffer space is
available per socket. Only payload bytes are counted and header bytes are not
included. This facilitates flow control so data is never dropped.
\field{buf_alloc} is the total receive buffer space, in bytes, for this socket.
This includes both free and in-use buffers. \field{fwd_cnt} is the free-running
bytes received counter. The sender calculates the amount of free receive buffer
space as follows:
\begin{lstlisting}
/* tx_cnt is the sender's free-running bytes transmitted counter */
u32 peer_free = peer_buf_alloc - (tx_cnt - peer_fwd_cnt);
\end{lstlisting}
If there is insufficient buffer space, the sender waits until virtqueue buffers
are returned and checks \field{buf_alloc} and \field{fwd_cnt} again. Sending
the VIRTIO_VSOCK_OP_CREDIT_REQUEST packet queries how much buffer space is
available. The reply to this query is a VIRTIO_VSOCK_OP_CREDIT_UPDATE packet.
It is also valid to send a VIRTIO_VSOCK_OP_CREDIT_UPDATE packet without
previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
communicating updates any time a change in buffer space occurs.
\drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
sufficient free buffer space for the payload.
All packets associated with a stream flow MUST contain valid information in
\field{buf_alloc} and \field{fwd_cnt} fields.
\devicenormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
sufficient free buffer space for the payload.
All packets associated with a stream flow MUST contain valid information in
\field{buf_alloc} and \field{fwd_cnt} fields.
\subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / Device Operation / Receive and Transmit}
The driver queues outgoing packets on the tx virtqueue and incoming packet
receive buffers on the rx virtqueue. Packets are of the following form:
\begin{lstlisting}
struct virtio_vsock_packet {
struct virtio_vsock_hdr hdr;
u8 data[];
};
\end{lstlisting}
Virtqueue buffers for outgoing packets are read-only. Virtqueue buffers for
incoming packets are write-only.
\drivernormative{\paragraph}{Device Operation: Receive and Transmit}{Device Types / Socket Device / Device Operation / Receive and Transmit}
The \field{guest_cid} configuration field MUST be used as the source CID when
sending outgoing packets.
A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
unknown \field{type} value.
\devicenormative{\paragraph}{Device Operation: Receive and Transmit}{Device Types / Socket Device / Device Operation / Receive and Transmit}
The \field{guest_cid} configuration field MUST NOT contain a reserved CID as listed in \ref{sec:Device Types / Socket Device / Device configuration layout}.
A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
unknown \field{type} value.
\subsubsection{Stream Sockets}\label{sec:Device Types / Socket Device / Device Operation / Stream Sockets}
Connections are established by sending a VIRTIO_VSOCK_OP_REQUEST packet. If a
listening socket exists on the destination a VIRTIO_VSOCK_OP_RESPONSE reply is
sent and the connection is established. A VIRTIO_VSOCK_OP_RST reply is sent if
a listening socket does not exist on the destination or the destination has
insufficient resources to establish the connection.
When a connected socket receives VIRTIO_VSOCK_OP_SHUTDOWN the header
\field{flags} field bit VIRTIO_VSOCK_SHUTDOWN_F_RECEIVE (bit 0) set indicates
that the peer will not receive any more data and bit VIRTIO_VSOCK_SHUTDOWN_F_SEND
(bit 1) set indicates that the peer will not send any more data. These hints are
permanent once sent and successive packets with bits clear do not reset them.
\begin{lstlisting}
#define VIRTIO_VSOCK_SHUTDOWN_F_RECEIVE 0
#define VIRTIO_VSOCK_SHUTDOWN_F_SEND 1
\end{lstlisting}
The VIRTIO_VSOCK_OP_RST packet aborts the connection process or forcibly
disconnects a connected socket.
Clean disconnect is achieved by one or more VIRTIO_VSOCK_OP_SHUTDOWN packets
that indicate no more data will be sent and received, followed by a
VIRTIO_VSOCK_OP_RST response from the peer. If no VIRTIO_VSOCK_OP_RST response
is received within an implementation-specific amount of time, a
VIRTIO_VSOCK_OP_RST packet is sent to forcibly disconnect the socket.
The clean disconnect process ensures that neither peer reuses the (source,
destination) address tuple for a new connection while the other peer is still
processing the old connection.
\subsubsection{Seqpacket Sockets}\label{sec:Device Types / Socket Device / Device Operation / Seqpacket Sockets}
\paragraph{Message and record boundaries}\label{sec:Device Types / Socket Device / Device Operation / Seqpacket Sockets / Boundaries}
Two types of boundaries are supported: message and record boundaries.
A message contains data sent in a single operation. A single message can be
split into multiple RW packets.
To provide message boundaries, last RW packet of each message has
VIRTIO_VSOCK_SEQ_EOM bit (bit 0) set in the \field{flags} of packet's header.
Record is any number of subsequent messages, where last message is sent with POSIX
MSG_EOR flag set. Record boundary means that receiver gets MSG_EOR flag set
in the corresponding message where sender set it.
To provide record boundaries, last RW packet of each record has VIRTIO_VSOCK_SEQ_EOR
bit (bit 1) set in the \field{flags} of packet's header.
\begin{lstlisting}
#define VIRTIO_VSOCK_SEQ_EOM (1 << 0)
#define VIRTIO_VSOCK_SEQ_EOR (1 << 1)
\end{lstlisting}
\subsubsection{Device Events}\label{sec:Device Types / Socket Device / Device Operation / Device Events}
Certain events are communicated by the device to the driver using the event
virtqueue.
The event buffer is as follows:
\begin{lstlisting}
#define VIRTIO_VSOCK_EVENT_TRANSPORT_RESET 0
struct virtio_vsock_event {
le32 id;
};
\end{lstlisting}
The VIRTIO_VSOCK_EVENT_TRANSPORT_RESET event indicates that communication has
been interrupted. This usually occurs if the guest has been physically
migrated. The driver shuts down established connections and the
\field{guest_cid} configuration field is fetched again. Existing listen
sockets remain but their CID is updated to reflect the current
\field{guest_cid}.
\drivernormative{\paragraph}{Device Operation: Device Events}{Device Types / Socket Device / Device Operation / Device Events}
Event virtqueue buffers SHOULD be replenished quickly so that no events are
missed.
The \field{guest_cid} configuration field MUST be fetched to determine the
current CID when a VIRTIO_VSOCK_EVENT_TRANSPORT_RESET event is received.
Existing connections MUST be shut down when a
VIRTIO_VSOCK_EVENT_TRANSPORT_RESET event is received.
Listen connections MUST remain operational with the current CID when a
VIRTIO_VSOCK_EVENT_TRANSPORT_RESET event is received.