Commit 71e3aac

aagit authored and torvalds committed
thp: transparent hugepage core
Lately I've been working to make KVM use hugepages transparently without the usual restrictions of hugetlbfs. Some of the restrictions I'd like to see removed:

1) hugepages have to be swappable or the guest physical memory remains locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated instead and mixed in the same vma without any failure and without userland noticing

3) if some task quits and more hugepages become available in the buddy, guest physical memory backed by regular pages should be relocated on hugepages automatically in regions under madvise(MADV_HUGEPAGE) (ideally event driven by waking up the kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes not null)

4) avoidance of reservation and maximization of use of hugepages whenever possible. Reservation (needed to avoid runtime fatal failures) may be ok for 1 machine with 1 database with 1 database cache with 1 database cache size known at boot time. It's definitely not feasible with a virtualization hypervisor usage like RHEV-H that runs an unknown number of virtual machines with an unknown size of each virtual machine with an unknown amount of pagecache that could be potentially useful in the host for guest not using O_DIRECT (aka cache=off).

hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest use this patch (though the guest will limit the additional speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario.
So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime).

The first (and more tedious) part of this work requires allowing the VM to handle anonymous hugepages mixed with regular pages transparently on regular anonymous vmas. This is what this patch tries to achieve in the least intrusive possible way. We want hugepages and hugetlb to be used in a way so that all applications can benefit without changes (as usual we leverage the KVM virtualization design: by improving the Linux VM at large, KVM gets the performance boost too).

The most important design choice is: always fall back to 4k allocation if the hugepage allocation fails! This is the _very_ opposite of some large pagecache patches that failed with -EIO back then if a 64k (or similar) allocation failed...

Second important decision (to reduce the impact of the feature on the existing pagetable handling code) is that at any time we can split a hugepage into 512 regular pages and it has to be done with an operation that can't fail. This way the reliability of the swapping isn't decreased (no need to allocate memory when we are short on memory to swap) and it's trivial to plug a split_huge_page* one-liner where needed without polluting the VM. Over time we can teach mprotect, mremap and friends to handle pmd_trans_huge natively without calling split_huge_page*. The fact it can't fail isn't just for swap: if split_huge_page would return -ENOMEM (instead of the current void) we'd need to roll back the mprotect from the middle of it (ideally including undoing the split_vma) which would be a big change and in the very wrong direction (it'd likely be simpler not to call split_huge_page at all and to teach mprotect and friends to handle hugepages instead of rolling them back from the middle). In short the very value of split_huge_page is that it can't fail.
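
The "512 regular pages" figure follows from the shift arithmetic that huge_mm.h encodes as HPAGE_PMD_ORDER and HPAGE_PMD_NR. A minimal user-space sketch of that arithmetic, assuming the x86-64 values of a 4 KiB base page and a 2 MiB pmd-mapped hugepage (the two shift values and every SKETCH_ name are assumptions for illustration, not part of the patch):

```c
#include <assert.h>

/* Assumed x86-64 values; the commit message does not state them. */
#define SKETCH_PAGE_SHIFT      12	/* 4 KiB base page */
#define SKETCH_HPAGE_PMD_SHIFT 21	/* 2 MiB pmd-mapped hugepage */

/* Same shape as HPAGE_PMD_ORDER and HPAGE_PMD_NR in include/linux/huge_mm.h. */
#define SKETCH_HPAGE_PMD_ORDER (SKETCH_HPAGE_PMD_SHIFT - SKETCH_PAGE_SHIFT)
#define SKETCH_HPAGE_PMD_NR    (1UL << SKETCH_HPAGE_PMD_ORDER)

/* Number of regular pages one hugepage is split into. */
static unsigned long pages_per_hugepage(void)
{
	return SKETCH_HPAGE_PMD_NR;
}

/* Bytes covered by one hugepage. */
static unsigned long hugepage_bytes(void)
{
	return 1UL << SKETCH_HPAGE_PMD_SHIFT;
}

/* Sanity check: the split pages cover exactly the hugepage. */
static void check_split_covers_hugepage(void)
{
	assert(pages_per_hugepage() * (1UL << SKETCH_PAGE_SHIFT) ==
	       hugepage_bytes());
}
```

With these assumed shifts the buddy order is 9, which is also the order=HPAGE_PMD_SHIFT-PAGE_SHIFT free list mentioned in point 3) above.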
The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and incremental and it'll just be a "harmless" addition later if this initial part is agreed upon. It also should be noted that locking-wise replacing regular pages with hugepages is going to be very easy if compared to what I'm doing below in split_huge_page, as it will only happen when page_count(page) matches page_mapcount(page) if we can take the PG_lock and mmap_sem in write mode. collapse_huge_page will be a "best effort" that (unlike split_huge_page) can fail at the minimal sign of trouble and we can try again later. collapse_huge_page will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will work similarly to madvise(MADV_MERGEABLE).

The default I like is that transparent hugepages are used at page fault time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set to three values "always", "madvise", "never" which mean respectively that hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions, or never used. /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage allocation should defrag memory aggressively "always", only inside "madvise" regions, or "never".

The pmd_trans_splitting/pmd_trans_huge locking is very solid. The put_page (from get_user_page users that can't use mmu notifier like O_DIRECT) that runs against a __split_huge_page_refcount instead was a pain to serialize in a way that would result always in a coherent page count for both tail and head. I think my locking solution with a compound_lock taken only after the page_first is valid and is still a PageHead should be safe but it surely needs review from SMP race point of view. In short there is no current existing way to serialize the O_DIRECT final put_page against split_huge_page_refcount so I had to invent a new one (O_DIRECT loses knowledge on the mapping status by the time gup_fast returns so...).
And I didn't want to impact all gup/gup_fast users for now, maybe if we change the gup interface substantially we can avoid this locking, I admit I didn't think too much about it because changing the gup unpinning interface would be invasive. If we ignored O_DIRECT we could stick to the existing compound refcounting code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu notifier user) would call it without FOLL_GET (and if FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the current task mmu notifier list yet). But O_DIRECT is fundamental for decent performance of virtualized I/O on fast storage so we can't avoid it to solve the race of put_page against split_huge_page_refcount to achieve a complete hugepage feature for KVM.

Swap and oom work fine (well just like with regular pages ;). MMU notifier is handled transparently too, with the exception of the young bit on the pmd, that didn't have a range check but I think KVM will be fine because the whole point of hugepages is that EPT/NPT will also use a huge pmd when they notice gup returns pages with PageCompound set, so they won't care about a range and there's just the pmd young bit to check in that case.

NOTE: in some cases if the L2 cache is small, this may slow down and waste memory during COWs because 4M of memory are accessed in a single fault instead of 8k (the payoff is that after COW the program can run faster). So we might want to switch the copy_huge_page (and clear_huge_page too) to non-temporal stores.
I also extensively researched ways to avoid this cache thrashing with a full prefault logic that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that fully implemented prefault) but I concluded they're not worth it and they add huge additional complexity and they remove all tlb benefits until the full hugepage has been faulted in, to save a little bit of memory and some cache during app startup, but they still don't improve substantially the cache thrashing during startup if the prefault happens in >4k chunks. One reason is that those 4k pte entries copied are still mapped on a perfectly cache-colored hugepage, so the thrashing is the worst one can generate in those copies (cow of 4k page copies aren't so well colored so they thrash less, but again this results in software running faster after the page fault). Those prefault patches allowed things like a pte where post-cow pages were local 4k regular anon pages and the not-yet-cowed pte entries were pointing in the middle of some hugepage mapped read-only. If it doesn't pay off substantially with today's hardware it will pay off even less in the future with larger l2 caches, and the prefault logic would bloat the VM a lot.

If one is embedded, transparent_hugepage can be disabled during boot with sysfs or with the boot commandline parameter transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages inside madvise regions) that will ensure not a single hugepage is allocated at boot time. It is simple enough to just disable transparent hugepage globally and let transparent hugepages be allocated selectively by applications in the MADV_HUGEPAGE region (both at page fault time, and if enabled with the collapse_huge_page too through the kernel daemon).

This patch supports only hugepages mapped in the pmd, archs that have smaller hugepages will not fit in this patch alone.
Also some archs like power have certain tlb limits that prevent mixing different page sizes in the same regions so they will not fit in this framework that requires "graceful fallback" to basic PAGE_SIZE in case of physical memory fragmentation. hugetlbfs remains a perfect fit for those because its software limits happen to match the hardware limits. hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped to be found not fragmented after a certain system uptime and that would be very expensive to defragment with relocation, so requiring reservation. hugetlbfs is the "reservation way", the point of transparent hugepages is not to have any reservation at all and maximizing the use of cache and hugepages at all times automatically.

Some performance results:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============

Signed-off-by: Andrea Arcangeli <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 5c3240d commit 71e3aac
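
The commit message describes a "madvise" mode in which hugepages are only used inside madvise(MADV_HUGEPAGE) regions, and notes that the madvise part lands separately from this patch. A hedged user-space sketch of how an application might opt a region in, assuming a kernel and libc where MADV_HUGEPAGE is available and a 2 MiB hugepage size (alloc_hugepage_hinted() and SKETCH_HPAGE_SIZE are hypothetical names, and the fallback value 14 is the asm-generic Linux definition, used only if the libc header lacks it):

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Fallback for older libc headers (assumption: asm-generic Linux value). */
#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif

#define SKETCH_HPAGE_SIZE (2UL * 1024 * 1024)	/* assumed x86-64 hugepage size */

/*
 * Allocate len bytes aligned to the assumed hugepage size and hint the
 * kernel with madvise(MADV_HUGEPAGE).  The madvise() return value is
 * stored in *madvise_rc: 0 on success, -1 if the running kernel rejects
 * the hint.  The memory is usable either way, which matches the
 * "graceful fallback" idea: a refused hint only means regular pages.
 */
static void *alloc_hugepage_hinted(size_t len, int *madvise_rc)
{
	void *p = NULL;

	assert(len > 0 && len % SKETCH_HPAGE_SIZE == 0);
	if (posix_memalign(&p, SKETCH_HPAGE_SIZE, len) != 0)
		return NULL;
	*madvise_rc = madvise(p, len, MADV_HUGEPAGE);
	return p;
}
```

A caller would touch the returned memory afterwards (for example with memset, as the largepages3 benchmark does) so that the page faults happen inside the hinted region, and release it with free().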

14 files changed, +1220 -35 lines changed

arch/x86/include/asm/pgtable_64.h

+5
@@ -286,6 +286,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_RW);
 }
 
+static inline pmd_t pmd_mknotpresent(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */

include/linux/gfp.h

+3
@@ -109,6 +109,9 @@ struct vm_area_struct;
 			 __GFP_HARDWALL | __GFP_HIGHMEM | \
 			 __GFP_MOVABLE)
 #define GFP_IOFS (__GFP_IO | __GFP_FS)
+#define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
+			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
+			 __GFP_NO_KSWAPD)
 
 #ifdef CONFIG_NUMA
 #define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

include/linux/huge_mm.h

+118
@@ -0,0 +1,118 @@
+#ifndef _LINUX_HUGE_MM_H
+#define _LINUX_HUGE_MM_H
+
+extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
+				      struct vm_area_struct *vma,
+				      unsigned long address, pmd_t *pmd,
+				      unsigned int flags);
+extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+			 struct vm_area_struct *vma);
+extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       pmd_t orig_pmd);
+extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
+extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+					  unsigned long addr,
+					  pmd_t *pmd,
+					  unsigned int flags);
+extern int zap_huge_pmd(struct mmu_gather *tlb,
+			struct vm_area_struct *vma,
+			pmd_t *pmd);
+
+enum transparent_hugepage_flag {
+	TRANSPARENT_HUGEPAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+#ifdef CONFIG_DEBUG_VM
+	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
+#endif
+};
+
+enum page_check_address_pmd_flag {
+	PAGE_CHECK_ADDRESS_PMD_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
+};
+extern pmd_t *page_check_address_pmd(struct page *page,
+				     struct mm_struct *mm,
+				     unsigned long address,
+				     enum page_check_address_pmd_flag flag);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define HPAGE_PMD_SHIFT HPAGE_SHIFT
+#define HPAGE_PMD_MASK HPAGE_MASK
+#define HPAGE_PMD_SIZE HPAGE_SIZE
+
+#define transparent_hugepage_enabled(__vma) \
+	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
+	 (transparent_hugepage_flags & \
+	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#define transparent_hugepage_defrag(__vma) \
+	((transparent_hugepage_flags & \
+	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
+	 (transparent_hugepage_flags & \
+	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#ifdef CONFIG_DEBUG_VM
+#define transparent_hugepage_debug_cow() \
+	(transparent_hugepage_flags & \
+	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
+#else /* CONFIG_DEBUG_VM */
+#define transparent_hugepage_debug_cow() 0
+#endif /* CONFIG_DEBUG_VM */
+
+extern unsigned long transparent_hugepage_flags;
+extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			  pmd_t *dst_pmd, pmd_t *src_pmd,
+			  struct vm_area_struct *vma,
+			  unsigned long addr, unsigned long end);
+extern int handle_pte_fault(struct mm_struct *mm,
+			    struct vm_area_struct *vma, unsigned long address,
+			    pte_t *pte, pmd_t *pmd, unsigned int flags);
+extern int split_huge_page(struct page *page);
+extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
+#define split_huge_page_pmd(__mm, __pmd) \
+	do { \
+		pmd_t *____pmd = (__pmd); \
+		if (unlikely(pmd_trans_huge(*____pmd))) \
+			__split_huge_page_pmd(__mm, ____pmd); \
+	} while (0)
+#define wait_split_huge_page(__anon_vma, __pmd) \
+	do { \
+		pmd_t *____pmd = (__pmd); \
+		spin_unlock_wait(&(__anon_vma)->root->lock); \
+		/* \
+		 * spin_unlock_wait() is just a loop in C and so the \
+		 * CPU can reorder anything around it. \
+		 */ \
+		smp_mb(); \
+		BUG_ON(pmd_trans_splitting(*____pmd) || \
+		       pmd_trans_huge(*____pmd)); \
+	} while (0)
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+#if HPAGE_PMD_ORDER > MAX_ORDER
+#error "hugepages can't be allocated by the buddy allocator"
+#endif
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
+#define HPAGE_PMD_MASK ({ BUG(); 0; })
+#define HPAGE_PMD_SIZE ({ BUG(); 0; })
+
+#define transparent_hugepage_enabled(__vma) 0
+
+#define transparent_hugepage_flags 0UL
+static inline int split_huge_page(struct page *page)
+{
+	return 0;
+}
+#define split_huge_page_pmd(__mm, __pmd) \
+	do { } while (0)
+#define wait_split_huge_page(__anon_vma, __pmd) \
+	do { } while (0)
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_HUGE_MM_H */
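
The transparent_hugepage_enabled() macro above combines a global flag word with the per-vma VM_HUGEPAGE bit. A small user-space model of that decision may make the three sysfs modes easier to follow. The bit positions are copied from enum transparent_hugepage_flag and the VM_HUGEPAGE value from the include/linux/mm.h hunk; the mapping of "always" and "madvise" onto the first two bits is an assumption, since the sysfs code lives in mm/huge_memory.c and is not shown in this excerpt. All sketch_ and SKETCH_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as in enum transparent_hugepage_flag. */
enum sketch_thp_flag {
	SKETCH_THP_FLAG,		/* assumed to back "always" */
	SKETCH_THP_REQ_MADV_FLAG,	/* assumed to back "madvise" */
};

/* Same value as VM_HUGEPAGE in the include/linux/mm.h hunk. */
#define SKETCH_VM_HUGEPAGE 0x100000000ULL

/*
 * Model of transparent_hugepage_enabled(vma): true if the global flag
 * is set, or if the madvise-only flag is set and the vma was marked
 * with VM_HUGEPAGE.  With neither flag set ("never") it is always false.
 */
static int sketch_thp_enabled(uint64_t thp_flags, uint64_t vm_flags)
{
	if (thp_flags & (1ULL << SKETCH_THP_FLAG))
		return 1;
	return (thp_flags & (1ULL << SKETCH_THP_REQ_MADV_FLAG)) &&
	       (vm_flags & SKETCH_VM_HUGEPAGE);
}
```

transparent_hugepage_defrag() has the same shape with the two DEFRAG flags substituted.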

include/linux/mm.h

+4
@@ -111,6 +111,9 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
+#if BITS_PER_LONG > 32
+#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */
+#endif
 
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
@@ -243,6 +246,7 @@ struct inode;
  * files which need it (119 of them)
  */
 #include <linux/page-flags.h>
+#include <linux/huge_mm.h>
 
 /*
  * Methods to modify the page usage count.

include/linux/mm_inline.h

+9-2
@@ -20,13 +20,20 @@ static inline int page_is_file_cache(struct page *page)
 }
 
 static inline void
-add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
+__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
+		       struct list_head *head)
 {
-	list_add(&page->lru, &zone->lru[l].list);
+	list_add(&page->lru, head);
 	__inc_zone_state(zone, NR_LRU_BASE + l);
 	mem_cgroup_add_lru_list(page, l);
 }
 
+static inline void
+add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
+{
+	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
+}
+
 static inline void
 del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
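
The hunk above splits add_page_to_lru_list() into an inner helper that takes an explicit insertion point and a thin wrapper that keeps the old behaviour. A self-contained sketch of that refactoring pattern on a toy circular list follows. Every sketch_ name is hypothetical, and the idea that a caller such as lru_add_page_tail() (declared in the include/linux/swap.h hunk) uses the inner helper to place a page at a specific position is an assumption, since that code is not part of this excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct sketch_list {
	struct sketch_list *prev, *next;
};

static void sketch_list_init(struct sketch_list *h)
{
	h->prev = h->next = h;
}

/* Insert entry right after head, like the kernel's list_add(). */
static void sketch_list_add(struct sketch_list *entry, struct sketch_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Toy LRU: a list plus the counter that the accounting updates. */
struct sketch_lru {
	struct sketch_list list;
	unsigned long nr;
};

/* Inner helper: the caller picks the insertion point, accounting is shared. */
static void sketch_add_at(struct sketch_lru *lru, struct sketch_list *entry,
			  struct sketch_list *head)
{
	sketch_list_add(entry, head);
	lru->nr++;
}

/* Wrapper: previous behaviour, always insert at the front of the list. */
static void sketch_add(struct sketch_lru *lru, struct sketch_list *entry)
{
	sketch_add_at(lru, entry, &lru->list);
}
```

Existing callers keep calling the wrapper unchanged, while a new caller can pass any element already on the list as head to insert directly after it.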

include/linux/page-flags.h

+21
@@ -410,11 +410,32 @@ static inline void ClearPageCompound(struct page *page)
 #endif /* !PAGEFLAGS_EXTENDED */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * PageHuge() only returns true for hugetlbfs pages, but not for
+ * normal or transparent huge pages.
+ *
+ * PageTransHuge() returns true for both transparent huge and
+ * hugetlbfs pages, but not normal pages. PageTransHuge() can only be
+ * called only in the core VM paths where hugetlbfs pages can't exist.
+ */
+static inline int PageTransHuge(struct page *page)
+{
+	VM_BUG_ON(PageTail(page));
+	return PageHead(page);
+}
+
 static inline int PageTransCompound(struct page *page)
 {
 	return PageCompound(page);
 }
+
 #else
+
+static inline int PageTransHuge(struct page *page)
+{
+	return 0;
+}
+
 static inline int PageTransCompound(struct page *page)
 {
 	return 0;

include/linux/rmap.h

+2
@@ -198,6 +198,8 @@ enum ttu_flags {
 };
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 
+bool is_vma_temporary_stack(struct vm_area_struct *vma);
+
 int try_to_unmap(struct page *, enum ttu_flags flags);
 int try_to_unmap_one(struct page *, struct vm_area_struct *,
 			unsigned long address, enum ttu_flags flags);

include/linux/swap.h

+2
@@ -208,6 +208,8 @@ extern unsigned int nr_free_pagecache_pages(void);
 /* linux/mm/swap.c */
 extern void __lru_cache_add(struct page *, enum lru_list lru);
 extern void lru_cache_add_lru(struct page *, enum lru_list lru);
+extern void lru_add_page_tail(struct zone* zone,
+			      struct page *page, struct page *page_tail);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);

mm/Makefile

+1
@@ -37,6 +37,7 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
