vmmirror.txt
1. Design
The goal of vma mirroring is to allow for creating special file mappings
where a given set of physical pages (those backing the file) is visible
at two different linear addresses in a given task. Furthermore the mirroring
logic ensures that the two mappings in linear address space will see the
same physical pages even after they go through a swap-out/swap-in cycle
or copy-on-write.
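The effect can be observed from userland with ordinary shared mappings:
mapping the same file twice yields one set of physical pages visible at two
different linear addresses. Note that this is only an illustration of the
effect — PaX mirrors private mappings at the page table level rather than
through shared file mappings — a minimal Linux sketch:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Illustration only: PaX mirrors private mappings at the page table
 * level, but the visible effect -- one set of physical pages, two
 * linear addresses -- can be reproduced with two shared mappings of
 * the same file.  Returns 1 if a write through the first view is
 * visible through the second, -1 on setup failure. */
int mirrors_share(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();
    if (!f || ftruncate(fileno(f), (off_t)page) != 0)
        return -1;

    char *a = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fileno(f), 0);
    char *b = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fileno(f), 0);
    if (a == MAP_FAILED || b == MAP_FAILED || a == b)
        return -1;

    strcpy(a, "mirrored");               /* write through the first view  */
    return strcmp(b, "mirrored") == 0;   /* read through the second view  */
}
```

The two addresses returned by mmap() differ, yet both views show the same
bytes, which is exactly the invariant vma mirroring maintains for its
private mappings.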
While vma mirroring is a generic idea, PaX uses it for very specific
purposes and therefore the implementation is a bit less generic than it
could be (but it results in simpler and less intrusive changes).
The first use is for mirroring executable regions into the code segment
under SEGMEXEC. In this case the 3 GB userland linear address space is
divided into two halves of 1.5 GB each and the code/data segment descriptors
are modified to cover only one or the other. To be able to execute code
under this setup we have to ensure that executable mappings are visible
in the code segment region (1.5-3 GB range in linear address space). Since
such executable mappings may contain data as well (constant strings,
function pointer tables, etc), we have to have a mirror of these mappings
at the same logical addresses in the data segment as well (0-1.5 GB range
in linear address space). The nice property of this setup is that a pair
of mirrored regions will have a constant difference between their start/end
addresses: 1.5 GB (or SEGMEXEC_TASK_SIZE as it is often referenced in the
code).
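In other words, the mirror of an address is obtained by adding a fixed
constant; a trivial sketch (0x60000000 being the numeric value of
SEGMEXEC_TASK_SIZE, half of the 3 GB i386 userland address space):

```c
#include <stdint.h>

/* Half of the 3 GB i386 userland address space: the fixed distance
 * between the two members of a SEGMEXEC mirror pair. */
#define SEGMEXEC_TASK_SIZE 0x60000000UL

static uintptr_t segmexec_mirror(uintptr_t addr)
{
    return addr + SEGMEXEC_TASK_SIZE;
}
```

For example, segmexec_mirror(0x08048000) yields 0x68048000, matching the
/tmp/cat mirror pair shown in the examples section below.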
The second use of vma mirroring is to implement the mirror of the main
executable at a randomized address under RANDEXEC. Here again we will have
a constant (task specific) difference between the mirrored regions and can
simplify the implementation the same way as under SEGMEXEC.
There is also an implicit third situation, when both SEGMEXEC and RANDEXEC
are active for a task. At first glance this may appear very complex since
the executable region of the main executable would have to be mirrored
in three places instead of one: randomized mappings in both the data and
the code segment (at the same logical addresses) plus a mirror into the
code segment at the original logical address. Luckily, we can save on two
of them: first, the randomized mapping in the data segment is not needed
because we do not expect code to reference data in its executable segment
in a position-independent manner (which is what would be required for code
to learn its own location); second, we do not need the original mapping
mirrored into the code segment because we explicitly do not want it to be
executable (so that code references to this region raise a page fault,
which the RANDEXEC logic can then react to).
2. Implementation
vma mirroring requires two basic changes to the VM in Linux. First, we have
to provide an interface for setting up the mirrors; second, we have to keep
the mirrored regions' linear/physical mappings synchronized.
Linux maintains a per-task database of what is present in the given task's
address space. This database is a set of structures called vm_area_struct
(defined in include/linux/mm.h), each of which describes a single mapping.
The database for a task can be viewed in /proc/<pid>/maps. For our purposes
the relevance of the vma database is that it directly guides the page fault
resolution logic which in turn is responsible for setting up the linear to
physical address translation on a per page basis.
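Since each line of /proc/self/maps corresponds to one vm_area_struct, the
size of the current task's vma database can be inspected from userland; a
small Linux-only sketch:

```c
#include <stdio.h>

/* Each line of /proc/self/maps corresponds to one vm_area_struct in
 * the current task's vma database.  Returns the number of mappings,
 * or -1 if the file cannot be opened. */
static int count_mappings(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;
    int n = 0, c;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            n++;
    fclose(f);
    return n;
}
```

Even a minimal process has several mappings (the executable's segments, the
heap, the stack and so on), so this count is always positive.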
To understand how all this works, consider task creation and its first
moment of life in userland. For ELF executables the load_elf_binary()
function in fs/binfmt_elf.c is responsible for populating the task's
address space with a few basic mappings such as the stack, the dynamic
linker (if the application in question is dynamically linked) and the main
executable itself. The file mappings are established by using the kernel's
internal do_mmap() interface through a simple wrapper called elf_map().
Note that at this point only the stack region has physical pages assigned
(since that is where arguments, the environment, etc go and therefore must
be present in physical memory at this early stage), the file mappings are
not yet backed by physical memory pages.
When the task begins its life in userland, the very first instruction fetch
in ld.so or the main executable will raise a page fault since the Linux VM
system does not establish a valid physical mapping until it is actually
needed (i.e., it is demand based). The first thing the architecture specific
page fault handler (for i386 it is do_page_fault() in arch/i386/mm/fault.c)
does is to find the vma structure that describes the region in which the
page fault happened then call the architecture independent handler
(handle_mm_fault() in mm/memory.c) which based on the fault and the vma
type will call the appropriate function to establish a physical page
containing the expected data (in our case it would be read from the file
backing the mapping, that is, somewhere from the .text section in an ELF
file).
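Demand paging can be observed from userland with mincore(), which reports
whether a page is resident in physical memory: a fresh anonymous mapping
has no physical page behind it until the first access faults one in. A
Linux-only sketch:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page at addr is resident in physical memory,
 * 0 if not, -1 on error. */
static int page_resident(void *addr)
{
    unsigned char vec;
    if (mincore(addr, (size_t)sysconf(_SC_PAGESIZE), &vec) != 0)
        return -1;
    return vec & 1;
}

/* Demand paging in action: returns 1 if the page was absent before
 * the first access and present after it. */
static int demand_paging_observed(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    int before = page_resident(p);  /* 0: no physical page yet        */
    p[0] = 1;                       /* first access -> page fault     */
    int after = page_resident(p);   /* 1: fault handler installed one */
    munmap(p, page);
    return before == 0 && after == 1;
}
```

The write to p[0] is what drives the kernel through do_page_fault() and
handle_mm_fault() as described above.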
The interface for setting up a vma mirror is a simple extension to the
already existing memory mapping interface. This interface is accessible
from userland as the mmap() library call. Since the vma mirroring facility
is meant to be used by specific PaX features only, userland initiated vma
mirroring requests are not allowed (that is why PaX returns an error from
do_mmap2() in arch/i386/kernel/sys_i386.c). Care must be taken, however,
when handling mmap() requests from tasks running under SEGMEXEC: such tasks
can create executable file mappings, and these must be mirrored just like
the initial file mappings established by the kernel itself as discussed
above. Since all mmap() requests go through do_mmap() (an inline
function defined in include/linux/mm.h) this is where PaX requests the extra
mirrored mappings for SEGMEXEC executables. Since do_mmap2() originally
bypassed do_mmap() by calling do_mmap_pgoff() directly, we modified it
to use do_mmap() instead. This way we can ensure that the SEGMEXEC logic
gets to see both userland and kernel originated file mapping requests.
vma mirror requests use special arguments for calling do_mmap_pgoff() in
the end:

  'file'  must be NULL because the mirror will reference the same file
          as the vma to be mirrored,
  'addr'  has its normal meaning of specifying a hint for searching a
          suitable hole in the address space where the mapping can go,
  'len'   must be 0 because it will be derived from the vma that is about
          to get mirrored,
  'prot'  has its normal meaning,
  'flags' has its normal meaning except that it must also specify the new
          MAP_MIRROR flag and it must request a private mapping,
  'pgoff' specifies the linear start address of the vma to be mirrored
          (note that here it is measured in bytes, not PAGE_SIZE units).
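These constraints can be summarized in a small sketch. This is not PaX
source: the struct, the validator and the numeric value of MAP_MIRROR are
invented for illustration; only the listed rules come from the text.

```c
#include <stddef.h>

/* Hypothetical sketch, not PaX source.  MAP_MIRROR's value is made up
 * for illustration; only the constraints come from the argument list
 * above. */
#define MAP_PRIVATE 0x02
#define MAP_MIRROR  0x0400   /* illustrative value only */

struct mirror_req {
    void *file;          /* must be NULL: mirror reuses the mirrored file */
    unsigned long addr;  /* ordinary placement hint                       */
    unsigned long len;   /* must be 0: taken from the mirrored vma        */
    unsigned long prot;  /* ordinary meaning                              */
    unsigned long flags; /* must include MAP_MIRROR and MAP_PRIVATE       */
    unsigned long pgoff; /* linear START ADDRESS of the vma to mirror,
                            measured in bytes, not PAGE_SIZE units        */
};

static int mirror_req_valid(const struct mirror_req *r)
{
    return r->file == NULL && r->len == 0 &&
           (r->flags & MAP_MIRROR) && (r->flags & MAP_PRIVATE);
}
```

A request with a non-NULL file, a non-zero length, or without the two
required flags would be rejected by the sanity checks described next.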
The vma to be mirrored must exist at the specified start address ('pgoff')
and must not be mirrored or be a mirror itself already. Furthermore PaX will
not allow a writable mirror for a read-only vma. Note that these are only
sanity checks to detect early if there is a bug in the rest of the vma
mirroring logic (denied mirror requests will result in a non-functioning
task and are therefore easy for an end user to notice).
The second basic change needed for implementing vma mirroring is in the MMU
state management logic (which governs the linear/physical translation). Our
goal is simple: whenever the state of a mirrored page changes we will have
to propagate the change into the state of the mirror page as well (and do
all this atomically, that is, other state-changing code must be locked out
until we have finished). Such state changes occur in the following
operations: page
fault servicing, munmap, mremap, mprotect, mlock and vma merging.
Servicing a fault means that the kernel finds out why a page fault occurred
and, when it is valid (it occurred in a region described by a vma with the
proper access rights), it will allocate storage in physical memory and set
up a valid linear/physical translation in the MMU (on i386 this means
setting up a present pte).
While the page fault classification (valid/invalid) is done in architecture
specific code, the actual servicing no longer needs to care about such
details and is architecture independent: handle_mm_fault(), therefore this
is what we have to modify. Also note that handle_mm_fault() is used by
other code as well that we would otherwise have to modify explicitly
(get_user_page() for example, which is used by ptrace() among others).
The strategy for servicing a page fault in a mirrored vma is the following:
first we do some sanity checks on the mirror's vma (again in order to detect
potential bugs in the implementation early) then allocate the necessary MMU
resources (various levels of paging structures) so that by the time we get
to propagate the MMU state information, we will not have to worry about
resource allocation failures. After successful resource allocation we let
the original fault handling logic carry out its work (swap-in a page, do
copy-on-write, etc) and intervene once it has established the new MMU
state for the mirrored vma: we then call the core of the vma mirroring
logic, pax_mirror_fault() in mm/memory.c.
To simplify the logic of the mirroring code, we established a simple naming
convention for variables related to one or the other vma: the vma for which
handle_mm_fault() was called is said to be mirrored and the corresponding
variables have no suffixes, whereas the other vma is called the mirror and
its variables are suffixed by _m. For example, vma_m is the vm_area_struct
pointer of the mirror vma.
pax_mirror_fault() first determines if it has anything to do in the first
place and if so, it looks up the mirror vma and associated information, such
as the mirror of the fault address and the related MMU structures (we are
interested in the page table entry as it contains the physical page number
that will have to be synchronized between the mirrors). Once the mirror's
pte is known, we have to see if it currently specifies a valid mapping and
if so, we have to invalidate it (and while handling the different cases, we
also take care of the resident set size: we have to increment it if the
mapping was not valid since it will be after mirroring). Invalidating the
current mapping in the mirror is derived from kernel code doing the same:
the munmap() and swap-out operations. The next and final step is to actually
propagate the new linear/physical mapping into the mirror: we look up the
physical page in the mirrored pte (and increment its use count since we are
going to create another reference to it in the mirror's pte) then construct
the mirror's pte from it and the appropriate access rights (the writability
state must be copied verbatim from the mirrored pte otherwise we would ruin
the copy-on-write logic).
The atomicity of all the above actions is ensured by holding the appropriate
page_table_lock on entry and never releasing it inside. This way the higher
level callers (who establish the mirrored pte) ensure that the mirror's pte
is established at the same time.
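The propagation step can be modelled with a toy example. This is an
illustrative userland model, not kernel code: the pte layout, the flag
values and the page_count array are all invented for the sketch; only the
logic (invalidate the old mapping, take a new page reference, copy the
frame number and writability verbatim) follows the description above.

```c
/* Toy model (not kernel code) of the propagation step in
 * pax_mirror_fault(). */

#define PTE_PRESENT 0x1
#define PTE_WRITE   0x2

struct toy_pte { unsigned long pfn; unsigned flags; };

static int page_count[16];     /* per-frame use counts */

static void mirror_propagate(const struct toy_pte *pte,
                             struct toy_pte *pte_m)
{
    if (pte_m->flags & PTE_PRESENT)  /* invalidate the old mapping...    */
        page_count[pte_m->pfn]--;    /* ...dropping its page reference   */
    page_count[pte->pfn]++;          /* new reference to the shared page */
    pte_m->pfn = pte->pfn;
    /* copy writability verbatim so copy-on-write still works */
    pte_m->flags = PTE_PRESENT | (pte->flags & PTE_WRITE);
}
```

After propagation both ptes name the same physical frame, the frame's use
count reflects the extra reference, and a read-only mirrored pte yields a
read-only mirror pte.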
The last set of vma mirroring related changes ensures that userland can
modify/destroy mirrored regions only along with the corresponding mirrors
(creation was described in the do_mmap() changes).
The most complex change is in the munmap() logic which is responsible for
destroying all kinds of mappings. To understand the changes let's first look
at how it works in the standard kernel. The core function is do_munmap() in
mm/mmap.c, which begins by doing some checks on the area to be removed then
proceeds with moving all the vma structures that fall in there (fully or in
part) from the mm's vma list to a special one. In the next phase this list is
processed for clearing all linear/physical mappings in the MMU for each vma.
The final step is to free up page tables that may have become empty and are
no longer needed.
Mirror handling requires two changes in the above logic. First, we have to
detect whether any vma to be removed is mirrored and if so, move its mirror
vma to another special list (during the same atomic operation as in the
original code). Second, we have to clear the corresponding MMU mappings
for this list of vma structures as well.
While setting up the second special list is straightforward, the second step
is not, as the original kernel code is rather badly organized and does not
lend itself to easy reuse. In order to avoid unnecessary code duplication we
opted for rearranging the original code a bit through simple program
transformations: the MMU cleanup logic has been split into unmap_vma_list()
and unmap_vma(). This way processing the second special list can be done in
unmap_vma_mirror_list() which makes use of unmap_vma().
There is one last trick worth noting: map_count handling. This counter has to
be decremented for each vma which gets unmapped. The original do_munmap()
logic delegates this task to unmap_vma(). The problem with this is that by
that point all the vma structures have already been removed from the main mm
vma list, yet the counter is decremented one by one for each vma. This in
turn will trigger a kernel BUG message because of the inconsistency between
map_count and the actual number of vma structures on the mm vma list. Since
this is a kernel BUG (or 'feature'; after all, at the end of do_munmap()
everything will be in sync again), we decided to modify the original kernel
logic so that map_count gets decremented during the special list preparation
phase.
The next change is in do_mremap() in mm/mremap.c where we ensure that
mirrored regions simply cannot be remapped (they can shrink however as it
simply means a call to do_munmap() which handles mirrors fine).
The remaining two userland interfaces are handled the same way because both
mprotect() and mlock() have the same internal logic: enumerate all mappings
in the given range and then act on each one of them individually. The
functions we modify are mprotect_fixup() and mlock_fixup(), respectively.
First the original functions are moved to __mprotect_fixup() and
__mlock_fixup() then they are called for each vma in a mirror when one is
encountered.
The vma merging mechanism is governed by the inline can_vma_merge() function
in include/linux/mm.h. PaX modifies this function to prevent anonymous mirror
mappings from getting inadvertently merged with others (file mappings are
never merged).
3. Examples
To help better understand vma mirroring, we present a few address space
layouts and explain what happens in each. In each case we used a copy of
/bin/cat in /tmp to execute "/tmp/cat /proc/self/maps". Note that for
the sake of simplicity we disabled RANDMMAP; this of course should not
be done on production systems. The [x] marks are not part of the original
output; we use them to refer to the various lines in the explanation.
Active PaX features: SEGMEXEC and MPROTECT
[1] 08048000-0804a000 R-Xp 00000000 00:0b 1109 /tmp/cat
[2] 0804a000-0804b000 RW-p 00002000 00:0b 1109 /tmp/cat
[3] 0804b000-0804d000 RW-p 00000000 00:00 0
[4] 20000000-20015000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
[5] 20015000-20016000 RW-p 00014000 03:07 110818 /lib/ld-2.2.5.so
[6] 2001e000-20143000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
[7] 20143000-20149000 RW-p 00125000 03:07 106687 /lib/libc-2.2.5.so
[8] 20149000-2014d000 RW-p 00000000 00:00 0
[9] 5fffe000-60000000 RW-p fffff000 00:00 0
[10] 68048000-6804a000 R-Xp 00000000 00:0b 1109 /tmp/cat
[11] 80000000-80015000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
[12] 8001e000-80143000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
Since cat is a dynamically linked executable, its address space will have
several file mappings besides the main executable. Let's see what each line
represents.
[1] is the first PT_LOAD segment of the /tmp/cat ELF file, it is mapped
with R-X rights, that is, it contains the executable code plus all
read-only initialized data as well. It is also mirrored by [10] because
it is executable.
[2] is the second PT_LOAD segment of the /tmp/cat ELF file, it is mapped
with RW- rights, that is, it contains writable data (all initialized and
the beginning of the uninitialized data, in our case all of it as they
fit into a single page).
[3] is the brk() managed heap (its size changes at runtime as cat calls
malloc()/free()/etc). Note that if cat had more uninitialized data than
what would fit into the gap left on the last page of mapping [2] then
the rest would be mapped here from the beginning of [3] and the brk()
managed heap would then follow.
[4] and [5] are the PT_LOAD segments of the dynamic linker, whereas
[6] and [7] are those of the C library. [4] and [6] are also mirrored by
[11] and [12] respectively as they are executable.
[8] is an anonymous mapping corresponding to uninitialized data in the C
library (if we take a look at the ELF program headers of libc, we will
see that the memory size of the second PT_LOAD segment is 4 pages more
than its file size).
[9] is another anonymous mapping containing the stack. We can observe that
it is at the end of the userland address space (which under SEGMEXEC
is at TASK_SIZE/2) and grows downwards.
[10], [11] and [12] are the mirrors of the executable file mappings [1],
[4] and [6] respectively (notice that each pair has exactly TASK_SIZE/2
"distance"). They are all above the TASK_SIZE/2 limit as well which
means that they are part of the code segment and hence executable.
Active PaX features: SEGMEXEC and RANDEXEC and MPROTECT
[1] 08048000-0804a000 R-Xp 00000000 00:0b 1109 /tmp/cat
[2] 0804a000-0804b000 RW-p 00002000 00:0b 1109 /tmp/cat
0804b000-0804d000 RW-p 00000000 00:00 0
[3] 20000000-20002000 ++-p 00000000 00:00 0
[4] 20002000-20003000 RW-p 00002000 00:0b 1109 /tmp/cat
20003000-20018000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
20018000-20019000 RW-p 00014000 03:07 110818 /lib/ld-2.2.5.so
20021000-20146000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
20146000-2014c000 RW-p 00125000 03:07 106687 /lib/libc-2.2.5.so
2014c000-20150000 RW-p 00000000 00:00 0
[5] 5fffe000-60000000 RW-p 00000000 00:00 0
[6] 80000000-80002000 R-Xp 00000000 00:0b 1109 /tmp/cat
80003000-80018000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
80021000-80146000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
Enabling RANDEXEC changes the layout slightly. In particular, beyond the
previously shown mappings we can see [3] which represents the dummy
anonymous mapping corresponding to the first (executable) PT_LOAD segment
of cat and [4] which is the mirror of the second PT_LOAD segment of cat
(notice that [2] and [4] have the same page offset values). Also observe
that [1] is mirrored above the TASK_SIZE/2 limit by [6] at a distance
different from TASK_SIZE/2, hence despite having R-X rights it is not
actually executable: the logical addresses of [1] are invalid in the code
segment; instead it is region [3] whose logical addresses are valid there
(in region [6]).
The careful reader has probably noticed a small difference between this
and the previous situation: the stack area in the first case is 3 pages
long whereas [5] here has only 2 pages. The reason for this discrepancy
has to do with RANDUSTACK: the first part of the stack randomization
cannot be disabled and in our case it happened to cause a big enough shift
that made the kernel allocate an extra page for the initial stack.
Active PaX features: PAGEEXEC and RANDEXEC and MPROTECT
[1] 08048000-0804a000 R--p 00000000 00:0b 1109 /tmp/cat
[2] 0804a000-0804b000 RW-p 00002000 00:0b 1109 /tmp/cat
0804b000-0804d000 RW-p 00000000 00:00 0
[3] 40000000-40002000 R-Xp 00000000 00:0b 1109 /tmp/cat
[4] 40002000-40003000 RW-p 00002000 00:0b 1109 /tmp/cat
40003000-40018000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
40018000-40019000 RW-p 00014000 03:07 110818 /lib/ld-2.2.5.so
40021000-40146000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
40146000-4014c000 RW-p 00125000 03:07 106687 /lib/libc-2.2.5.so
4014c000-40150000 RW-p 00000000 00:00 0
bfffe000-c0000000 RW-p fffff000 00:00 0
The last case where vma mirroring takes place has the simplest layout of
all as only the main executable is mirrored: [3] mirrors [1] and [4]
mirrors [2]. Notice that [1] no longer has R-X rights but R-- as under
PAGEEXEC it is the mapping rights that decide what is executable, not the
mapping's position.