Sync up with Linus #103

struct rtc embeds both struct dev and struct cdev. Unfortunately character device structure may outlive the parent rtc structure unless we set it up as parent of character device so that it will stay pinned until character device is freed. Signed-off-by: Dmitry Torokhov <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Since dab472e ("i2c / ACPI: Use 0 to indicate that device does not have interrupt assigned"), 0 is not a valid i2c client irq anymore, so change all driver's checks accordingly. The same issue occurs when the device is instantiated via device tree with no IRQ, or from the i2c sysfs interface, even before the patch above. Signed-off-by: Octavian Purdila <[email protected]> Reviewed-by: Mika Westerberg <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

…() for rtc_update_irq() Since commit e6229be ("rtc: make rtc_update_irq callable with irqs enabled") rtc_update_irq() is callable with irqs enabled. Signed-off-by: Henri Roosen <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Users of rtc_does_wakealarm() return value treat it as boolean so let's change the signature accordingly. Signed-off-by: Dmitry Torokhov <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Instead of using older style DEVICE_ATTR for wakealarm attribute let's switch to using DEVICE_ATTR_RW that ensures consistent across the kernel permissions on the attribute. Signed-off-by: Dmitry Torokhov <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Instead of creating wakealarm attribute manually, after the device has been registered, let's rely on facilities provided by the attribute groups to control which attributes are visible and which are not. This allows to create all needed attributes at once, at the same time that we register RTC class device. Signed-off-by: Dmitry Torokhov <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Implement the suspend/resume function in order to control rtc's irq_wake flag and handle as wakeup source. Signed-off-by: Henry Chen <[email protected]> Acked-by: Eddie Huang <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Add DA9062 RTC support into the existing DA9063 RTC driver component by using generic access tables for common register and bit mask definitions. The following change will add generic register and bit mask support to the DA9063 RTC. The changes are slightly complicated by requiring support for three register sets: DA9063-AD, DA9063-BB and DA9062-AA. The following alterations have been made to the DA9063 RTC: - Addition of a da9063_compatible_rtc_regmap structure to hold all generic registers and bitmasks for this type of RTC component. - A re-write of struct da9063 to use pointers for regmap and compatible registers/masks definitions - Addition of a of_device_id table for DA9063 and DA9062 defaults - Refactoring functions to use struct da9063_compatible_rtc accesses to generic registers/masks instead of using defines from registers.h - Re-work of da9063_rtc_probe() to use of_match_node() and dev_get_regmap() to provide initialisation of generic registers and masks and access to regmap Signed-off-by: Steve Twiss <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Steps to reproduce the problem: 1) Enable RTC wake-up option in BIOS Setup 2) Issue one of these commands in the OS: "poweroff" or "shutdown -h now" 3) System will shut down and then reboot automatically Root-cause of the issue: 1) During the shutdown process, the hwclock utility is used to save the system clock to hardware clock (RTC). 2) The hwclock utility invokes ioctl() with RTC_UIE_ON. The kernel configures the RTC alarm for the periodic interrupt (every 1 second). 3) The hwclock uitlity closes the /dev/rtc0 device, and the kernel disables the RTC alarm irq (AIE bit of Register B) via ioctl() with RTC_UIE_OFF. But, the configured alarm time is the current_time + 1. 4) After the next 1 second is elapsed, the AF (alarm interrupt flag) of Register C is set. 5) The S5 handler in BIOS is invoked to configure alarm registers (enable AIE bit and configure alarm date/time). But, BIOS does not clear the previous interrupt status during alarm configuration. Therefore, "AF=AIE=1" causes the rtc device to trigger an interrupt. 6) So, the machine reboots automatically right after shutdown. This patch cancels the alarm timer if the following condictions are met (suggested by Alexandre): 1) The configured alarm time is equal to current_time + 1 seconds. 2) The AIE timer is not in use. The member 'alarm_expires' is introduced in struct cmos_rtc because of the following reasons: 1) The configured alarm time can be retrieved from cmos_read_alarm(), but we need to take the 'wrapped timestamp' and 'time rollover' into consideration. The function __rtc_read_alarm() eliminates the concerns. To avoid the duplicated code in the lower level RTC driver, invoking __rtc_read_alarm from the lower level RTC driver is not encouraged. Moreover, the compilation error 'the undefined __rtc_read_alarm" is observed if the lower level RTC driver is compiled as a kernel module. 2) The uie_rtctimer.node.expires and aie_timer.node.expires can be retrieved for the configured alarm time. But, the problem is that either of them might configure the CMOS alarm time. We cannot make sure UIE timer or AIE tiemr configured the CMOS alarm time before. (uie_rtctimer or aie_timer is enabled and then is disabled). 3) The patch introduces the member 'alarm_expires' to keep the newly configured alarm time, so the above-mentioned concerns can be eliminated. The issue goes away after 20-time shutdown tests. Signed-off-by: Adrian Huang <[email protected]> Tested-by: Egbert Eich <[email protected]> Tested-by: Diego Ercolani <[email protected]> Cc: Borislav Petkov <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Commit d5a1c7e ("rtc-cmos: Add an alarm disable quirk") that added a special quirk is not needed because [PATCH 1/2] of this patchset makes the kernel more robust: rtc-cmos: Cancel alarm timer if alarm time is equal to now+1 seconds Signed-off-by: Adrian Huang <[email protected]> Tested-by: Egbert Eich <[email protected]> Tested-by: Diego Ercolani <[email protected]> Cc: Borislav Petkov <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The I2C core always reports the MODALIAS uevent as "i2c:<client name" regardless if the driver was matched using the I2C id_table or the of_match_table. So technically there's no need for a driver to export the OF table since currently it's not used. In fact, the I2C device ID table is mandatory for I2C drivers since a i2c_device_id is passed to the driver's probe function even if the I2C core used the OF table to match the driver. And since the I2C core uses different tables, OF-only drivers needs to have duplicated data that has to be kept in sync and also the dev node compatible manufacturer prefix is stripped when reporting the MODALIAS. To avoid the above, the I2C core behavior may be changed in the future to not require an I2C device table for OF-only drivers and report the OF module alias. So, it's better to also export the OF table to prevent breaking module autoloading if that happens. Signed-off-by: Javier Martinez Canillas <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The email address missed character ">", so add it. Signed-off-by: Leo Yan <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

In case of a probe error, it is possible to abort after issuing clk_prepare_enable(). Ensure the clock is disabled and unprepared in that case. Signed-off-by: Alexandre Belloni <[email protected]> Acked-by: Boris Brezillon <[email protected]> Acked-by: Nicolas Ferre <[email protected]>

rtc->sclk necessarily points to a valid clocks at this point. Else the probe would have aborted. Signed-off-by: Alexandre Belloni <[email protected]> Acked-by: Boris Brezillon <[email protected]> Acked-by: Nicolas Ferre <[email protected]>

Sort included headers alphabetically. Signed-off-by: Alexandre Belloni <[email protected]> Acked-by: Boris Brezillon <[email protected]> Acked-by: Nicolas Ferre <[email protected]>

See help for clk_get_rate(): "obtain the current clock rate (in Hz) for a clock source. This is only valid once the clock source has been enabled." It currently returns the correct value but that may not stay that way. Signed-off-by: Alexandre Belloni <[email protected]> Acked-by: Boris Brezillon <[email protected]> Acked-by: Nicolas Ferre <[email protected]>

Sort included headers alphabetically. Signed-off-by: Alexandre Belloni <[email protected]> Acked-by: Boris Brezillon <[email protected]> Acked-by: Nicolas Ferre <[email protected]>

IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag and there is no need to do that again from its callers. Drop it. gemini driver was using likely() for a failure case while the rtc driver is getting registered. That looks wrong and it should really be unlikely. But because we are killing all the unlikely() flags, lets kill that too. Signed-off-by: Viresh Kumar <[email protected]> Acked-by: Hans Ulli Kroll <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

drivers/rtc/rtc-gemini.c:151:1-3: WARNING: PTR_ERR_OR_ZERO can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: scripts/coccinelle/api/ptr_ret.cocci Signed-off-by: Fengguang Wu <[email protected]> Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Commit dca1a4b ("clk: at91: keep slow clk enabled to prevent system hang") added a workaround for the slow clock as it is not properly handled by its users. Get and use the slow clock as it is necessary for the at91rm9200 rtc. Acked-by: Boris Brezillon <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

If century field is supported by the RTC CMOS device, then we should use it and then do not consider years greater that 169 as an error. For information, the year field of the rtc_time structure contains the value to add to 1970 to obtain the current year. This was a hack to be able to support years for 1970 to 2069. This patch remains compatible with this implementation. Signed-off-by: Sylvain Chouleur <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The change removes redundant sysfs binary file boundary checks, since this task is already done on caller side in fs/sysfs/file.c Signed-off-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The change removes redundant sysfs binary file boundary checks, since this task is already done on caller size in fs/sysfs/file.c Signed-off-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The change removes redundant sysfs binary file boundary checks, since this task is already done on caller side in fs/sysfs/file.c Signed-off-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The change removes redundant sysfs binary file boundary checks, since this task is already done on caller side in fs/sysfs/file.c The change enables burst mode of access to SRAM for any read()/write() operations, it is worth to mention that this may influence on userspace, for instance prior to the change read(fd, buf, 1); read(fd, buf + 1, 1); and read(fd, buf, 2); sequences of syscalls over DS1511's sysfs "nvram" fd led to different DS1511 state changes and/or buf content, if some userspace applications are written specifically for DS1511 and exploit this strange "feature", they may be impacted. Also the change corrects NVRAM size accessible to userspace from 255 bytes to 256 bytes. Signed-off-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The change removes redundant sysfs binary file boundary checks, since this task is already done on caller side in fs/sysfs/file.c Signed-off-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The change removes redundant sysfs binary file boundary checks, since this task is already done on caller side in fs/sysfs/file.c Spinlock acquisition/release is moved out of the loop body to get atomic states of NVRAM reading and writing operations. Signed-off-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The change removes redundant sysfs binary file boundary checks, since this task is already done on caller side in fs/sysfs/file.c Signed-off-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Remove the useless includes and order the remaining ones alphabetically. Signed-off-by: Alexandre Belloni <[email protected]>

The driver currently emulates the concept of threaded IRQ using a workqueue, switch to threaded IRQ instead. Signed-off-by: Alexandre Belloni <[email protected]>

Use devm_request_threaded_irq() so it is not necessary to call free_irq() explicitly. Signed-off-by: Alexandre Belloni <[email protected]>

It is useless to print a message when probe fails as the user is already aware that it failed. Signed-off-by: Alexandre Belloni <[email protected]>

Use BIT() instead of hand coding. Signed-off-by: Alexandre Belloni <[email protected]>

The hardware is only capable of handling dates between 2000 and 2099, enforce that. Signed-off-by: Alexandre Belloni <[email protected]>

The datasheet specifies that transfer mode must be 0 for write and either 0x4 (simplified read) or 0 (standard read). 0x8 is not specified, use standard mode. Signed-off-by: Alexandre Belloni <[email protected]>

Stop setting the time to epoch when it is invalid. The proper way to handle that is to return an error when it is invalid instead of returning an incorrect value. Signed-off-by: Alexandre Belloni <[email protected]>

Remove useless error messages, at that point, the user already knows something went wrong but will not be able to do anything about it anyway. It is also highly unlikely that some registers are readable/writable but not some other ones. Also, transform rx8025_read_reg to be more resemblant to i2c_smbus_read_byte_data() Signed-off-by: Alexandre Belloni <[email protected]>

Instead of bailing out, disable alarms and continue when devm_request_threaded_irq() fails. This allows to still provide some functionality. Signed-off-by: Alexandre Belloni <[email protected]>

rx8025_init_client is modifying ctrl[0] and writing it to RX8025_REG_CTRL2 but ctrl[0] is actually RX8025_REG_CTRL1. Signed-off-by: Alexandre Belloni <[email protected]>

Wait for the user to set the time to reset the validity bits. Until then, the time may be invalid. Signed-off-by: Alexandre Belloni <[email protected]>

irq_freq is already initialized to 1 in rtc_device_register() Signed-off-by: Alexandre Belloni <[email protected]>

RX8025_BIT_CTRL2_CTFG was set to 0 only when it was already 0. Signed-off-by: Alexandre Belloni <[email protected]>

Check time validity when reading time as this is when we need to know. Signed-off-by: Alexandre Belloni <[email protected]>

According to the Armada38x functional errata FE-3124064, writing to the RTC TIME register may fail. As a workaround, after writing to RTC TIME register, issue a dummy write of 0x0 twice to the RTC Status register. This is the updated implementation of the Errata that eliminates the need of the long 100ms delay during the RTC set time procedure. [[email protected]]: removed the mutex and use the spinlock again Signed-off-by: Nadav Haklai <[email protected]> Reviewed-by: Neta Zur Hershkovits <[email protected]> Signed-off-by: Gregory CLEMENT <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

This driver is using device tree but is not including of.h Signed-off-by: Alexandre Belloni <[email protected]>

Definitions from linux/platform_data/atmel.h are not used, remove the include. Signed-off-by: Alexandre Belloni <[email protected]>

The clock enable/disable codes for alarm have been removed from commit 24e1455 ("drivers/rtc/rtc-s3c.c: delete duplicate clock control") and the clocks are disabled even if alarm is set, so alarm interrupt can't happen. The s3c_rtc_setaie function can be called several times with 'enabled' argument having same value, so it needs to check whether clocks are enabled or not. Signed-off-by: Joonyoung Shim <[email protected]> Cc: <[email protected]> # v4.1 Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

It's missed to call clk_unprepare() about info->rtc_src_clk in s3c_rtc_remove and to call clk_disable_unprepare about info->rtc_clk in error routine of s3c_rtc_probe. Signed-off-by: Joonyoung Shim <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

It's unnecessary the code that assigns info->rtc_clk to NULL in s3c_rtc_remove. Signed-off-by: Joonyoung Shim <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

If ds3232 work on some platform that is not implementing irq_set_wake, ds3232 will get a WARNING trace in resume. So fix ds3232->suspended state to false when irq_set_irq_wake return error. WARNING: CPU: 0 PID: 729 at kernel/irq/manage.c:604 irq_set_irq_wake+0x4b/0x8c() Unbalanced IRQ 201 wake disable Modules linked in: CPU: 0 PID: 729 Comm: sh Not tainted 3.12.19-rt30+ #25 [<800107d9>] (unwind_backtrace+0x1/0x88) from [<8000e4ef>] (show_stack+0xb/0xc) [<8000e4ef>] (show_stack+0xb/0xc) from [<802b5fa9>] (dump_stack+0x4d/0x60) [<802b5fa9>] (dump_stack+0x4d/0x60) from [<800186dd>] (warn_slowpath_common+0x45/0x64) [<800186dd>] (warn_slowpath_common+0x45/0x64) from [<80018717>] (warn_slowpath_fmt+0x1b/0x24) [<80018717>] (warn_slowpath_fmt+0x1b/0x24) from [<8003a8d3>] (irq_set_irq_wake+0x4b/0x8c) [<8003a8d3>] (irq_set_irq_wake+0x4b/0x8c) from [<80204fcb>] (ds3232_resume+0x2d/0x36) [<80204fcb>] (ds3232_resume+0x2d/0x36) from [<801954c7>] (dpm_run_callback.isra.13+0xb/0x28) [<801954c7>] (dpm_run_callback.isra.13+0xb/0x28) from [<80195b1b>] (device_resume+0x7b/0xa2) [<80195b1b>] (device_resume+0x7b/0xa2) from [<80195f0f>] (dpm_resume+0xbb/0x19c) [<80195f0f>] (dpm_resume+0xbb/0x19c) from [<801960d9>] (dpm_resume_end+0x9/0x12) [<801960d9>] (dpm_resume_end+0x9/0x12) from [<80037e1d>] (suspend_devices_and_enter+0x17d/0x1d0) [<80037e1d>] (suspend_devices_and_enter+0x17d/0x1d0) from [<80037ee1>] (pm_suspend+0x71/0x128) [<80037ee1>] (pm_suspend+0x71/0x128) from [<80037449>] (state_store+0x6d/0x80) [<80037449>] (state_store+0x6d/0x80) from [<800af4d5>] (sysfs_write_file+0x9f/0xde) [<800af4d5>] (sysfs_write_file+0x9f/0xde) from [<8007a437>] (vfs_write+0x7b/0x104) [<8007a437>] (vfs_write+0x7b/0x104) from [<8007a7f7>] (SyS_write+0x27/0x48) [<8007a7f7>] (SyS_write+0x27/0x48) from [<8000c121>] (ret_fast_syscall+0x1/0x44) Signed-off-by: Wang Dongsheng <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Factor out the RTC initialization from the platform device specific parts in order to share the RTC device ops with other drivers. Specifically, it will be shared with rtc-pxa driver. Signed-off-by: Rob Herring <[email protected]> Cc: Robert Jarzmik <[email protected]> Cc: Russell King <[email protected]> Cc: Alessandro Zummo <[email protected]> Cc: Alexandre Belloni <[email protected]> Cc: [email protected] Signed-off-by: Alexandre Belloni <[email protected]>

Currently, the rtc-sa1100 and rtc-pxa drivers co-exist as rtc-pxa has a superset of functionality. Having 2 drivers sharing the same memory resource is not allowed by the driver model if resources are properly declared. This problem was avoided by not adding memory resources to the SA1100 RTC driver, but that prevents clean-up of the SA1100 driver. This commit converts the PXA RTC to use the exported SA1100 RTC functions. Now the sa1100-rtc and pxa-rtc devices are mutually exclusive, so we must remove the sa1100-rtc from pxa27x and pxa3xx. Signed-off-by: Rob Herring <[email protected]> Cc: Daniel Mack <[email protected]> Cc: Haojian Zhuang <[email protected]> Cc: Robert Jarzmik <[email protected]> Cc: Russell King <[email protected]> Cc: Alessandro Zummo <[email protected]> Cc: Alexandre Belloni <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Alexandre Belloni <[email protected]>

The drivers for the SA1100 and PXA RTCs are now mutually exclusive, so add the memory resource for the sa1100-rtc device. Since the memory resource is already present in the pxa_rtc_resources, that makes sa1100_rtc_resources and pxa_rtc_resources equivalent, so use pxa_rtc_resources for both devices and remove the duplicate sa1100_rtc_resources. Signed-off-by: Rob Herring <[email protected]> Cc: Daniel Mack <[email protected]> Cc: Haojian Zhuang <[email protected]> Acked-by: Robert Jarzmik <[email protected]> Cc: Russell King <[email protected]> Cc: [email protected] Signed-off-by: Alexandre Belloni <[email protected]>

SA1100 and PXA differ only in register offsets which are currently hardcoded in a machine specific header. Some arm64 platforms (PXA1928) have this RTC block as well (and not the PXA270 variant). Convert the driver to use ioremap and set the register offsets dynamically. Since we are touching all the register accesses, convert them all to readl_relaxed/writel_relaxed. Signed-off-by: Rob Herring <[email protected]> Acked-by: Robert Jarzmik <[email protected]> Cc: Alessandro Zummo <[email protected]> Cc: Alexandre Belloni <[email protected]> Cc: [email protected] Signed-off-by: Alexandre Belloni <[email protected]>

Now that register definitions have been moved to the driver, we can remove them from machine specific code. Signed-off-by: Rob Herring <[email protected]> Cc: Russell King <[email protected]> Cc: [email protected] Signed-off-by: Alexandre Belloni <[email protected]>

Now that register definitions have been moved to the driver, regs-rtc.h is no longer used and can be removed. Signed-off-by: Rob Herring <[email protected]> Cc: Eric Miao <[email protected]> Cc: Haojian Zhuang <[email protected]> Cc: Russell King <[email protected]> Cc: [email protected] Signed-off-by: Alexandre Belloni <[email protected]>

With the SA1100 and PXA RTC drivers be mutually exclusive and no longer sharing hardware, PXA27x/PXA3xx platforms must use the PXA RTC driver as the SA1100 platform device is no longer registered. This change should be almost transparent to userspace. Former users of pxa-rtc should be aware that 2 RTCs will be available on their kernels, rtc0 being sa1100-rtc and rtc1 being pxa-rtc. Any userspace relying on the fact that rtc0 was pxa-rtc should be fixed. As a consequence: - the first reboot after the switch will have the wrong time, - on dual boot platform where the other OS programs some logic into the sa1100 rtc IP, a lack of fix in userspace, ie. a kernel changing sa1100-rtc thinking it is pxa-rtc could have dire consequence, such as wiping the other OS data partition. (Thanks to Robert Jarmik for help on the above commit text.) Signed-off-by: Rob Herring <[email protected]> Acked-by: Robert Jarzmik <[email protected]> Cc: Daniel Mack <[email protected]> Cc: Haojian Zhuang <[email protected]> Cc: Sergey Lapin <[email protected]> Cc: Russell King <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Philipp Zabel <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The RTC month value is 1-indexed, but the kernel assumes it is 0-indexed. This may result in the RTC not rolling over correctly. Signed-off-by: Bibek Basu <[email protected]> Signed-off-by: Felix Janda <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

adds file for description on device node bindings for RTC found on Xilinx Zynq Ultrascale+ MPSoC. Signed-off-by: Suneel Garapati <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Add support for RTC controller found on Xilinx Zynq Ultrascale+ MPSoC platform. Signed-off-by: Suneel Garapati <[email protected]> Acked-by: Moritz Fischer <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

According to datasheet, the S2MPS13X and S2MPS14X should update write buffer via setting WUDR bit to high after ctrl register is written. If not, ALARM interrupt of rtc-s5m doesn't happen first time when i use tools/testing/selftests/timers/rtctest.c test program and hour format is used to 12 hour mode in Odroid-XU3 board. One more issue is the RTC doesn't keep time on Odroid-XU3 board when i turn on board after power off even if RTC battery is connected. It can be solved as setting WUDR & RUDR bits to high at the same time after RTC_CTRL register is written. It's same with condition of only writing ALARM registers, so this is for only S2MPS14 and we should set WUDR & A_UDR bits to high on S2MPS13. I can't find any reasonable description about this like fix from datasheet, but can find similar codes from rtc driver source of hardkernel kernel and vendor kernel. Signed-off-by: Joonyoung Shim <[email protected]> Cc: <[email protected]> # v3.16 Reviewed-by: Krzysztof Kozlowski <[email protected]> Tested-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

rtc can either be supplied from internal 32k clock or external crystal generated 32k clock. Internal clock is SOC specific and the external clock is board dependent. Adding the corresponding nodes. Signed-off-by: Keerthy <[email protected]> Acked-by: Tony Lindgren <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

The rtc can be clocked by an internal 32K clock. Adding the support to enable the same. Signed-off-by: Keerthy <[email protected]> Acked-by: Tony Lindgren <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Configure the clock source to external clock if available. External clock is preferred as it can be ticking during suspend. Signed-off-by: Keerthy <[email protected]> Acked-by: Tony Lindgren <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

These platform drivers have a platform device ID table but the module alias information is not created so module autoloading will not work. Signed-off-by: Javier Martinez Canillas <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

These platform drivers have a OF device ID table but the OF module alias information is not created so module autoloading won't work. Signed-off-by: Javier Martinez Canillas <[email protected]> Acked-by: Andrew Lunn <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Remove unused variable 'res' and fix the following build warning: drivers/rtc/rtc-ds1374.c:667:6: warning: unused variable 'res' [-Wunused-variable] Reported-by: Olof's autobuilder <[email protected]> Signed-off-by: Fabio Estevam <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Add a sentinel to ab85xx_rtc_ids[] in order to fix the following error: drivers/rtc/rtc-ab8500: struct platform_device_id is 24 bytes. The last of 2 is: 0x61 0x62 0x38 0x35 0x34 0x30 0x2d 0x72 0x74 0x63 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x8c FATAL: drivers/rtc/rtc-ab8500: struct platform_device_id is not terminated with a NULL entry! Reported-by: Andrey Ryabinin <[email protected]> Reported-by: Olof's autobuilder <[email protected]> Signed-off-by: Fabio Estevam <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Fix RTC write bit as per application manual Cc: [email protected] # 4.1+ Signed-off-by: Mitja Spes <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]>

Commit 3cc2dac ("drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC") introduces calls to ioremap_wc and ioremap_uc. This causes build failures with parisc:allmodconfig. Map the missing functions to ioremap_nocache. Fixes: 3cc2dac ("drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC") Cc: Luis R. Rodriguez <[email protected]> Cc: Paul Gortmaker <[email protected]> Signed-off-by: Guenter Roeck <[email protected]> Signed-off-by: Helge Deller <[email protected]>

Commit 3a9ad0b ("PCI: Add pci_bus_addr_t") unconditionally introduced usage of 64-bit PCI bus addresses on all 64-bit platforms which broke PA-RISC. It turned out that due to enabling the 64-bit addresses, the PCI logic decided to use the GMMIO instead of the LMMIO region. This commit simply disables registering the GMMIO and thus we fall back to use the LMMIO region as before. Reverts commit 45ea2a5 ("PCI: Don't use 64-bit bus addresses on PA-RISC") To: [email protected] Cc: [email protected] Cc: Bjorn Helgaas <[email protected]> Cc: Meelis Roos <[email protected]> Cc: [email protected] # v3.19+ Signed-off-by: Helge Deller <[email protected]>

Craig Estey noticed that we didn't checked for in_atomic() in our page fault handler like other architectures. This commit adds this check by using faulthandler_disabled() which includes a check for pagefault_disabled() and in_atomic(). Reported-by: Craig Estey <[email protected]> Signed-off-by: Helge Deller <[email protected]>

When detecting a serial port on newer PA-RISC machines (with iosapic) we have a long way to go to find the right IRQ line, registering it, then registering the serial port and the irq handler for the serial port. During this phase spurious interrupts for the serial port may happen which then crashes the kernel because the action handler might not have been set up yet. So, basically it's a race condition between the serial port hardware and the CPU which sets up the necessary fields in the irq sructs. The main reason for this race is, that we unmask the serial port irqs too early without having set up everything properly before (which isn't easily possible because we need the IRQ number to register the serial ports). This patch is a work-around for this problem. It adds checks to the CPU irq handler to verify if the IRQ action field has been initialized already. If not, we just skip this interrupt (which isn't critical for a serial port at bootup). The real fix would probably involve rewriting all PA-RISC specific IRQ code (for CPU, IOSAPIC, GSC and EISA) to use IRQ domains with proper parenting of the irq chips and proper irq enabling along this line. This bug has been in the PA-RISC port since the beginning, but the crashes happened very rarely with currently used hardware. But on the latest machine which I bought (a C8000 workstation), which uses the fastest CPUs (4 x PA8900, 1GHz) and which has the largest possible L1 cache size (64MB each), the kernel crashed at every boot because of this race. So, without this patch the machine would currently be unuseable. For the record, here is the flow logic: 1. serial_init_chip() in 8250_gsc.c calls iosapic_serial_irq(). 2. iosapic_serial_irq() calls txn_alloc_irq() to find the irq. 3. iosapic_serial_irq() calls cpu_claim_irq() to register the CPU irq 4. cpu_claim_irq() unmasks the CPU irq (which it shouldn't!) 5. serial_init_chip() then registers the 8250 port. Problems: - In step 4 the CPU irq shouldn't have been registered yet, but after step 5 - If serial irq happens between 4 and 5 have finished, the kernel will crash Signed-off-by: Helge Deller <[email protected]>

The attached change fixes the condition used in the "sub" instruction. A double word comparison is needed. This fixes the 64-bit LWS CAS operation on 64-bit kernels. I can now enable 64-bit atomic support in GCC. Cc: <[email protected]> Signed-off-by: John David Anglin <dave.anglin> Signed-off-by: Helge Deller <[email protected]>

Commit b1c9f16 ("xen: split counting of extra memory pages...") introduced an error when dom0 was started with limited memory. The problem arises in case dom0 is started with initial memory and maximum memory being the same and exactly a multiple of 1 GB. The kernel must be configured without CONFIG_XEN_BALLOON_MEMORY_HOTPLUG for the problem to happen. In this case it will crash very early during boot due to the virtual mapped p2m list not being large enough to be able to remap any memory: (XEN) Freed 304kB init memory. mapping kernel into physical memory about to get started... (XEN) traps.c:459:d0v0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000] (XEN) domain_crash_sync called from entry.S: fault at ffff82d080229a93 create_bounce_frame+0x12b/0x13a (XEN) Domain 0 (vcpu#0) crashed on cpu#0: (XEN) ----[ Xen-4.5.2-pre x86_64 debug=n Not tainted ]---- (XEN) CPU: 0 (XEN) RIP: e033:[<ffffffff81d120cb>] (XEN) RFLAGS: 0000000000000206 EM: 1 CONTEXT: pv guest (d0v0) (XEN) rax: ffffffff81db2000 rbx: 000000004d000000 rcx: 0000000000000000 (XEN) rdx: 000000004d000000 rsi: 0000000000063000 rdi: 000000004d063000 (XEN) rbp: ffffffff81c03d78 rsp: ffffffff81c03d28 r8: 0000000000023000 (XEN) r9: 00000001040ff000 r10: 0000000000007ff0 r11: 0000000000000000 (XEN) r12: 0000000000063000 r13: 000000000004d000 r14: 0000000000000063 (XEN) r15: 0000000000000063 cr0: 0000000080050033 cr4: 00000000000006f0 (XEN) cr3: 0000000105c0f000 cr2: ffffc90000268000 (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033 (XEN) Guest stack trace from rsp=ffffffff81c03d28: (XEN) 0000000000000000 0000000000000000 ffffffff81d120cb 000000010000e030 (XEN) 0000000000010006 ffffffff81c03d68 000000000000e02b ffffffffffffffff (XEN) 0000000000000063 000000000004d063 ffffffff81c03de8 ffffffff81d130a7 (XEN) ffffffff81c03de8 000000000004d000 00000001040ff000 0000000000105db1 (XEN) 00000001040ff001 000000000004d062 ffff8800092d6ff8 0000000002027000 (XEN) ffff8800094d8340 ffff8800092d6ff8 00003ffffffff000 ffff8800092d7ff8 (XEN) ffffffff81c03e48 ffffffff81d13c43 ffff8800094d8000 ffff8800094d9000 (XEN) 0000000000000000 ffff8800092d6000 00000000092d6000 000000004cfbf000 (XEN) 00000000092d6000 00000000052d5442 0000000000000000 0000000000000000 (XEN) ffffffff81c03ed8 ffffffff81d185c1 0000000000000000 0000000000000000 (XEN) ffffffff81c03e78 ffffffff810f8ca4 ffffffff81c03ed8 ffffffff8171a15d (XEN) 0000000000000010 ffffffff81c03ee8 0000000000000000 0000000000000000 (XEN) ffffffff81f0e402 ffffffffffffffff ffffffff81dae900 0000000000000000 (XEN) 0000000000000000 0000000000000000 ffffffff81c03f28 ffffffff81d0cf0f (XEN) 0000000000000000 0000000000000000 0000000000000000 ffffffff81db82e0 (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000 (XEN) ffffffff81c03f38 ffffffff81d0c603 ffffffff81c03ff8 ffffffff81d11c86 (XEN) 0300000100000032 0000000000000005 0000000000000020 0000000000000000 (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000 (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000 (XEN) Domain 0 crashed: rebooting machine in 5 seconds. This can be avoided by allocating aneough space for the p2m to cover the maximum memory of dom0 plus the identity mapped holes required for PCI space, BIOS etc. Reported-by: Boris Ostrovsky <[email protected]> Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: David Vrabel <[email protected]>

Commit b1c9f16 ("xen: split counting of extra memory pages...") introduced an error when dom0 was started with limited memory occurring only on some hardware. The problem arises in case dom0 is started with initial memory and maximum memory being the same. The kernel must be configured without CONFIG_XEN_BALLOON_MEMORY_HOTPLUG for the problem to happen. If all of this is true and the E820 map of the machine is sparse (some areas are not covered) then the machine might crash early in the boot process. An example E820 map triggering the problem looks like this: [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable [ 0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000cf7fafff] usable [ 0.000000] BIOS-e820: [mem 0x00000000cf7fb000-0x00000000cf95ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000cf960000-0x00000000cfb62fff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x00000000cfb63000-0x00000000cfd14fff] usable [ 0.000000] BIOS-e820: [mem 0x00000000cfd15000-0x00000000cfd61fff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x00000000cfd62000-0x00000000cfd6cfff] ACPI data [ 0.000000] BIOS-e820: [mem 0x00000000cfd6d000-0x00000000cfd6ffff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x00000000cfd70000-0x00000000cfd70fff] usable [ 0.000000] BIOS-e820: [mem 0x00000000cfd71000-0x00000000cfea8fff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000cfea9000-0x00000000cfeb9fff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x00000000cfeba000-0x00000000cfecafff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000cfecb000-0x00000000cfecbfff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x00000000cfecc000-0x00000000cfedbfff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000cfedc000-0x00000000cfedcfff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x00000000cfedd000-0x00000000cfeddfff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000cfede000-0x00000000cfee3fff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x00000000cfee4000-0x00000000cfef6fff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000cfef7000-0x00000000cfefffff] usable [ 0.000000] BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000fec10000-0x00000000fec10fff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000fed40000-0x00000000fed44fff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000fed61000-0x00000000fed70fff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed8ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000100001000-0x000000020effffff] usable In this case the area a0000-dffff isn't present in the map. This will confuse the memory setup of the domain when remapping the memory from such holes to populated areas. To avoid the problem the accounting of to be remapped memory has to count such holes in the E820 map as well. Reported-by: Boris Ostrovsky <[email protected]> Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: David Vrabel <[email protected]>

When a pv-domain (including dom0) is started it tries to size it's p2m list according to the maximum possible memory amount it ever can achieve. Limit the initial maximum memory size to the architectural limit of the hardware in order to avoid overflows during remapping of memory. This problem will occur when dom0 is started with an initial memory size being a multiple of 1GB, but without specifying it's maximum memory size. The kernel must be configured without CONFIG_XEN_BALLOON_MEMORY_HOTPLUG for the problem to happen. Reported-by: Roger Pau Monné <[email protected]> Tested-by: Roger Pau Monné <[email protected]> Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: David Vrabel <[email protected]>

Instead of using physical addresses for accounting of extra memory areas available for ballooning switch to pfns as this is much less error prone regarding partial pages. Reported-by: Roger Pau Monné <[email protected]> Tested-by: Roger Pau Monné <[email protected]> Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: David Vrabel <[email protected]>

No need to use CONFIG_SMP around update_cr16_clocksource(). It checks for num_online_cpus() beeing greater than 1, which is always 1 in UP builds. Signed-off-by: Helge Deller <[email protected]>

Signed-off-by: Helge Deller <[email protected]>

It changed as result of other syscalls, and while the system call list itself was correctly updated, the selftest program was not. Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Andy Lutomirski <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Pull crypto fix from Herbert Xu: "This fixes a memory corruption bug in ghash-clmulni-intel due to insufficient memory allocation" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: ghash-clmulni: specify context size for ghash async algorithm

…/scm/linux/kernel/git/tyhicks/ecryptfs Pull ecryptfs fixes from Tyler Hicks: "Invalidate stale eCryptfs dcache entries caused by unlinked lower inodes" * tag 'ecryptfs-4.3-rc1-stale-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs: eCryptfs: Delete a check before the function call "key_put" eCryptfs: Invalidate dcache entries when lower i_nlink is zero

…git/gerg/m68knommu Pull m68k/colfire fixes from Greg Ungerer: "Only a couple of patches this time. One migrating the clock driver code to the new set-state interface. The other cleaning up to use the PFN_DOWN macro" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu: m68k/coldfire: use PFN_DOWN macro m68k/coldfire/pit: Migrate to new 'set-state' interface

…ux/kernel/git/tip/tip Pull more irq updates from Thomas Gleixner: "The second part of irq related updates: - Provide EOImode for GIC[V3] irq chips, which is a prerequisite for direct interrupt handling in [KVM] guests" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/GIC: Fix EOImode setting for non-DT/ACPI systems irqchip/GIC: Don't deactivate interrupts forwarded to a guest irqchip/GIC: Convert to EOImode == 1 irqchip/GICv3: Don't deactivate interrupts forwarded to a guest irqchip/GICv3: Convert to EOImode == 1

…ux/kernel/git/xen/tip Pull xen updates from David Vrabel: "Xen features and fixes for 4.3: - Convert xen-blkfront to the multiqueue API - [arm] Support binding event channels to different VCPUs. - [x86] Support > 512 GiB in a PV guests (off by default as such a guest cannot be migrated with the current toolstack). - [x86] PMU support for PV dom0 (limited support for using perf with Xen and other guests)" * tag 'for-linus-4.3-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (33 commits) xen: switch extra memory accounting to use pfns xen: limit memory to architectural maximum xen: avoid another early crash of memory limited dom0 xen: avoid early crash of memory limited dom0 arm/xen: Remove helpers which are PV specific xen/x86: Don't try to set PCE bit in CR4 xen/PMU: PMU emulation code xen/PMU: Intercept PMU-related MSR and APIC accesses xen/PMU: Describe vendor-specific PMU registers xen/PMU: Initialization code for Xen PMU xen/PMU: Sysfs interface for setting Xen PMU mode xen: xensyms support xen: remove no longer needed p2m.h xen: allow more than 512 GB of RAM for 64 bit pv-domains xen: move p2m list if conflicting with e820 map xen: add explicit memblock_reserve() calls for special pages mm: provide early_memremap_ro to establish read-only mapping xen: check for initrd conflicting with e820 map xen: check pre-allocated page tables for conflict with memory map xen: check for kernel memory conflicting with memory layout ...

Pull NMI backtrace update from Russell King: "These changes convert the x86 NMI handling to be a library implementation which other architectures can make use of. Thomas Gleixner has reviewed and tested these changes, and wishes me to send these rather than taking them through the tip tree. The final patch in the set adds an initial implementation using this infrastructure to ARM, even though it doesn't send the IPI at "NMI" level. Patches are in progress to add the ARM equivalent of NMI, but we still need the IRQ-level fallback for systems where the "NMI" isn't available due to secure firmware denying access to it" * 'nmi' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: ARM: add basic support for on-demand backtrace of other CPUs nmi: x86: convert to generic nmi handler nmi: create generic NMI backtrace implementation

…jmorris/linux-security Pull security subsystem updates from James Morris: "Highlights: - PKCS#7 support added to support signed kexec, also utilized for module signing. See comments in 3f1e1be. ** NOTE: this requires linking against the OpenSSL library, which must be installed, e.g. the openssl-devel on Fedora ** - Smack - add IPv6 host labeling; ignore labels on kernel threads - support smack labeling mounts which use binary mount data - SELinux: - add ioctl whitelisting (see http://kernsec.org/files/lss2015/vanderstoep.pdf) - fix mprotect PROT_EXEC regression caused by mm change - Seccomp: - add ptrace options for suspend/resume" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits) PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them Documentation/Changes: Now need OpenSSL devel packages for module signing scripts: add extract-cert and sign-file to .gitignore modsign: Handle signing key in source tree modsign: Use if_changed rule for extracting cert from module signing key Move certificate handling to its own directory sign-file: Fix warning about BIO_reset() return value PKCS#7: Add MODULE_LICENSE() to test module Smack - Fix build error with bringup unconfigured sign-file: Document dependency on OpenSSL devel libraries PKCS#7: Appropriately restrict authenticated attributes and content type KEYS: Add a name for PKEY_ID_PKCS7 PKCS#7: Improve and export the X.509 ASN.1 time object decoder modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS extract-cert: Cope with multiple X.509 certificates in a single file sign-file: Generate CMS message as signature instead of PKCS#7 PKCS#7: Support CMS messages also [RFC5652] X.509: Change recorded SKID & AKID to not include Subject or Issuer PKCS#7: Check content type and versions MAINTAINERS: The keyrings mailing list has moved ...

Pull audit update from Paul Moore: "This is one of the larger audit patchsets in recent history, consisting of eight patches and almost 400 lines of changes. The bulk of the patchset is the new "audit by executable" functionality which allows admins to set an audit watch based on the executable on disk. Prior to this, admins could only track an application by PID, which has some obvious limitations. Beyond the new functionality we also have some refcnt fixes and a few minor cleanups" * 'upstream' of git://git.infradead.org/users/pcmoore/audit: fixup: audit: implement audit by executable audit: implement audit by executable audit: clean simple fsnotify implementation audit: use macros for unset inode and device values audit: make audit_del_rule() more robust audit: fix uninitialized variable in audit_add_rule() audit: eliminate unnecessary extra layer of watch parent references audit: eliminate unnecessary extra layer of watch references

…it/rostedt/linux-trace Pull tracing update from Steven Rostedt: "Mostly this is just clean ups and micro optimizations. The changes with more meat are: - Allowing the trace event filters to filter on CPU number and process ids - Two new markers for trace output latency were added (10 and 100 msec latencies) - Have tracing_thresh filter function profiling time I also worked on modifying the ring buffer code for some future work, and moved the adding of the timestamp around. One of my changes caused a regression, and since other changes were built on top of it and already tested, I had to operate a revert of that change. Instead of rebasing, this change set has the code that caused a regression as well as the code to revert that change without touching the other changes that were made on top of it" * tag 'trace-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated" tracing: Don't make assumptions about length of string on task rename tracing: Allow triggers to filter for CPU ids and process names ftrace: Format MCOUNT_ADDR address as type unsigned long tracing: Introduce two additional marks for delay ftrace: Fix function_graph duration spacing with 7-digits ftrace: add tracing_thresh to function profile tracing: Clean up stack tracing and fix fentry updates ring-buffer: Reorganize function locations ring-buffer: Make sure event has enough room for extend and padding ring-buffer: Get timestamp after event is allocated ring-buffer: Move the adding of the extended timestamp out of line ring-buffer: Add event descriptor to simplify passing data ftrace: correct the counter increment for trace_buffer data tracing: Fix for non-continuous cpu ids tracing: Prefer kcalloc over kzalloc with multiply

…t/mmarek/kbuild Pull core kbuild updates from Michal Marek: - modpost portability fix - linker script fix - genksyms segfault fix - fixdep cleanup - fix for clang detection * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: kbuild: Fix clang detection kbuild: fixdep: drop meaningless hash table initialization kbuild: fixdep: optimize code slightly genksyms: Regenerate parser genksyms: Duplicate function pointer type definitions segfault kbuild: Fix .text.unlikely placement Avoid conflict with host definitions when cross-compiling

…it/mmarek/kbuild Pull kconfig updates from Michal Marek: - kconfig warns about junk characters in Kconfig files - merge_config.sh error handling - small cleanup * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: merge_config.sh: exit on missing input files kconfig: Regenerate shipped zconf.{hash,lex}.c files kconfig: warn of unhandled characters in Kconfig commands kconfig: Delete unnecessary checks before the function call "sym_calc_value"

…mmarek/kbuild Pull misc kbuild updates from Michal Marek: - deb-pkg: + module signing fix + dtb files are added to the package + do not require `hostname -f` to work during build + make deb-pkg generates a source package, bindeb-pkg has been added to only generate the binary package - rpm-pkg packages /lib/modules as well - new coccinelle patch and updates to existing ones - new stackusage & stackdelta script to collect and compare stack usage info (using gcc's -fstack-usage) - make tags understands trace_*_rcuidle() macros - .gitignore updates, misc cleanups * 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (27 commits) deb-pkg: add source package package/Makefile: move source tar creation to a function scripts: add stackdelta script kbuild: remove *.su files generated by -fstack-usage .gitignore: add *.su pattern scripts: add stackusage script kbuild: avoid listing /lib/modules in kernel spec file fallback to hostname in scripts/package/builddeb coccinelle: api: extend spatch for dropping unnecessary owner deb-pkg: simplify directory creation scripts/tags.sh: Include trace_*_rcuidle() in tags scripts/package/Makefile: rpmbuild is needed for rpm targets Kbuild: Add ID files to .gitignore gitignore: Add MIPS vmlinux.32 to the list coccinelle: simple_return: Add a blank line coccinelle: irqf_oneshot.cocci: Improve the generated commit log coccinelle: api: add vma_pages.cocci scripts/coccinelle/misc/irqf_oneshot.cocci: Fix grammar scripts/coccinelle/misc/semicolon.cocci: Use imperative mood coccinelle: simple_open: Use imperative mood ...

…ernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "This update has successfully completed a 0day-kbuild run and has appeared in a linux-next release. The changes outside of the typical drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and the introduction of ZONE_DEVICE + devm_memremap_pages(). Summary: - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic mechanism for adding device-driver-discovered memory regions to the kernel's direct map. This facility is used by the pmem driver to enable pfn_to_page() operations on the page frames returned by DAX ('direct_access' in 'struct block_device_operations'). For now, the 'memmap' allocation for these "device" pages comes from "System RAM". Support for allocating the memmap from device memory will arrive in a later kernel. - Introduce memremap() to replace usages of ioremap_cache() and ioremap_wt(). memremap() drops the __iomem annotation for these mappings to memory that do not have i/o side effects. The replacement of ioremap_cache() with memremap() is limited to the pmem driver to ease merging the api change in v4.3. Completion of the conversion is targeted for v4.4. - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem driver, update the VFS DAX implementation and PMEM api to provide persistence guarantees for kernel operations on a DAX mapping. - Convert the ACPI NFIT 'BLK' driver to map the block apertures as cacheable to improve performance. - Miscellaneous updates and fixes to libnvdimm including support for issuing "address range scrub" commands, clarifying the optimal 'sector size' of pmem devices, a clarification of the usage of the ACPI '_STA' (status) property for DIMM devices, and other minor fixes" * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits) libnvdimm, pmem: direct map legacy pmem by default libnvdimm, pmem: 'struct page' for pmem libnvdimm, pfn: 'struct page' provider infrastructure x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB add devm_memremap_pages mm: ZONE_DEVICE for "device memory" mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h dax: drop size parameter to ->direct_access() nd_blk: change aperture mapping from WC to WB nvdimm: change to use generic kvfree() pmem, dax: have direct_access use __pmem annotation dax: update I/O path to do proper PMEM flushing pmem: add copy_from_iter_pmem() and clear_pmem() pmem, x86: clean up conditional pmem includes pmem: remove layer when calling arch_has_wmb_pmem() pmem, x86: move x86 PMEM API to new pmem.h header libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option pmem: switch to devm_ allocations devres: add devm_memremap libnvdimm, btt: write and validate parent_uuid ...

When seq_show_option (commit a068acf: "fs: create and use seq_show_option for escaping") was merged, it did not correctly collide with cgroup's addition of legacy_name (commit 3e1d2ee: "cgroup: introduce cgroup_subsys->legacy_name") changes. This fixes the reported name. Signed-off-by: Kees Cook <[email protected]> Acked-by: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

On 32-bit: userfaultfd.c: In function 'locking_thread': userfaultfd.c:152: warning: left shift count >= width of type userfaultfd.c: In function 'uffd_poll_thread': userfaultfd.c:295: warning: cast to pointer from integer of different size userfaultfd.c: In function 'uffd_read_thread': userfaultfd.c:332: warning: cast to pointer from integer of different size Fix the shift warning by splitting the shift in two parts, and the integer/pointer warnigns by adding intermediate casts to "unsigned long". Signed-off-by: Geert Uytterhoeven <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

special_mapping_fault() is absolutely broken. It seems it was always wrong, but this didn't matter until vdso/vvar started to use more than one page. And after this change vma_is_anonymous() becomes really trivial, it simply checks vm_ops == NULL. However, I do think the helper makes sense. There are a lot of ->vm_ops != NULL checks, the helper makes the caller's code more understandable (self-documented) and this is more grep-friendly. This patch (of 3): Preparation. Add the new simple helper, vma_is_anonymous(vma), and change handle_pte_fault() to use it. It will have more users. The name is not accurate, say a hpet_mmap()'ed vma is not anonymous. Perhaps it should be named vma_has_fault() instead. But it matches the logic in mmap.c/memory.c (see next changes). "True" just means that a page fault will use do_anonymous_page(). Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Test-case: #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <string.h> #include <sys/mman.h> #include <assert.h> void *find_vdso_vaddr(void) { FILE *perl; char buf[32] = {}; perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;" "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r"); fread(buf, sizeof(buf), 1, perl); fclose(perl); return (void *)atol(buf); } #define PAGE_SIZE 4096 int main(void) { void *vdso = find_vdso_vaddr(); assert(vdso); // of course they should differ, and they do so far printf("vdso pages differ: %d\n", !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE)); // split into 2 vma's assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0); // force another fault on the next check assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0); // now they no longer differ, the 2nd vm_pgoff is wrong printf("vdso pages differ: %d\n", !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE)); return 0; } Output: vdso pages differ: 1 vdso pages differ: 0 This is because split_vma() correctly updates ->vm_pgoff, but the logic in insert_vm_struct() and special_mapping_fault() is absolutely broken, so the fault at vdso + PAGE_SIZE return the 1st page. The same happens if you simply unmap the 1st page. special_mapping_fault() does: pgoff = vmf->pgoff - vma->vm_pgoff; and this is _only_ correct if vma->vm_start mmaps the first page from ->vm_private_data array. vdso or any other user of install_special_mapping() is not anonymous, it has the "backing storage" even if it is just the array of pages. So we actually need to make vm_pgoff work as an offset in this array. Note: this also allows to fix another problem: currently gdb can't access "[vvar]" memory because in this case special_mapping_fault() doesn't work. Now that we can use ->vm_pgoff we can implement ->access() and fix this. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Test-case: #define _GNU_SOURCE #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <string.h> #include <sys/mman.h> #include <assert.h> void *find_vdso_vaddr(void) { FILE *perl; char buf[32] = {}; perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;" "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r"); fread(buf, sizeof(buf), 1, perl); fclose(perl); return (void *)atol(buf); } #define PAGE_SIZE 4096 void *get_unmapped_area(void) { void *p = mmap(0, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1,0); assert(p != MAP_FAILED); munmap(p, PAGE_SIZE); return p; } char save[2][PAGE_SIZE]; int main(void) { void *vdso = find_vdso_vaddr(); void *page[2]; assert(vdso); memcpy(save, vdso, sizeof (save)); // force another fault on the next check assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0); page[0] = mremap(vdso, PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE, get_unmapped_area()); page[1] = mremap(vdso + PAGE_SIZE, PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE, get_unmapped_area()); assert(page[0] != MAP_FAILED && page[1] != MAP_FAILED); printf("match: %d %d\n", !memcmp(save[0], page[0], PAGE_SIZE), !memcmp(save[1], page[1], PAGE_SIZE)); return 0; } fails without this patch. Before the previous commit it gets the wrong page, now it segfaults (which is imho better). This is because copy_vma() wrongly assumes that if vma->vm_file == NULL is irrelevant until the first fault which will use do_anonymous_page(). This is obviously wrong for the special mapping. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This series of patches adds support for using PMD page table entries to map DAX files. We expect NV-DIMMs to start showing up that are many gigabytes in size and the memory consumption of 4kB PTEs will be astronomical. The patch series leverages much of the Transparant Huge Pages infrastructure, going so far as to borrow one of Kirill's patches from his THP page cache series. This patch (of 10): Since we're going to have huge pages in page cache, we need to call adjust file-backed VMA, which potentially can contain huge pages. For now we call it for all VMAs. Probably later we will need to introduce a flag to indicate that the VMA has huge pages. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is defined in linux/mm.h. Given that we don't want to include <linux/mm.h> in <linux/fs.h>, the easiest solution is to move the DAX-related functions to a new header, <linux/dax.h>. We could also have moved VM_FAULT_* definitions to a new header, or a different header that isn't quite such a boil-the-ocean header as <linux/mm.h>, but this felt like the best option. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Undo the change which "userfaultfd: call handle_userfault() for userfaultfd_missing() faults" made to set_huge_zero_page(). DAX will need that return value. Cc: Andrea Arcangeli <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Add a vma_is_dax() helper macro to test whether the VMA is DAX, and use it in zap_huge_pmd() and __split_huge_page_pmd(). [[email protected]: fix build] Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Allow non-anonymous VMAs to provide huge pages in response to a page fault. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

To use the huge zero page in DAX, we need these functions exported. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Similar to vm_insert_pfn(), but for PMDs rather than PTEs. The 'vmf_' prefix instead of 'vm_' prefix is intended to indicate that it returns a VMF_ value rather than an errno (which would only have to be converted into a VMF_ value anyway). Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This is the support code for DAX-enabled filesystems to allow them to provide huge pages in response to faults. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Use DAX to provide support for huge pages. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Fix typo s/CONFIG_TRANSPARENT_HUGEPAGES/CONFIG_TRANSPARENT_HUGEPAGE/ in #endif comment introduced by commit 2b26a92 ("dax: add huge page fault support"). Signed-off-by: Valentin Rothberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

DAX relies on the get_block function either zeroing newly allocated blocks before they're findable by subsequent calls to get_block, or marking newly allocated blocks as unwritten. ext4_get_block() cannot create unwritten extents, but ext4_get_block_write() can. Signed-off-by: Matthew Wilcox <[email protected]> Reported-by: Andy Rudoff <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

It would make more sense to have all the return values from vmf_insert_pfn_pmd() encoded in one place instead of having to follow the convention into insert_pfn(). Suggested by Jeff Moyer. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Jeff Moyer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Jan Kara pointed out I should be more explicit here about the perils of racing against truncate. The comment is mostly the same as for the PTE case. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

DAX wants different semantics from any currently-existing ext4 get_block callback. Unlike ext4_get_block_write(), it needs to honour the 'create' flag, and unlike ext4_get_block(), it needs to be able to return unwritten extents. So introduce a new ext4_get_block_dax() which has those semantics. We could also change ext4_get_block_write() to honour the 'create' flag, but that might have consequences on other users that I do not currently understand. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Jan Kara pointed out that in the case where we are writing to a hole, we can end up with a lock inversion between the page lock and the journal lock. We can avoid this by starting the transaction in ext4 before calling into DAX. The journal lock nests inside the superblock pagefault lock, so we have to duplicate that code from dax_fault, like XFS does. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Jan Kara <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

If two threads write-fault on the same hole at the same time, the winner of the race will return to userspace and complete their store, only to have the loser overwrite their store with zeroes. Fix this for now by taking the i_mmap_sem for write instead of read, and do so outside the call to get_block(). Now the loser of the race will see the block has already been zeroed, and will not zero it again. This severely limits our scalability. I have ideas for improving it, but those can wait for a later patch. Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The DAX code neglected to put the refcount on the huge zero page. Also we must notify on splits. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The original DAX code assumed that pgtable_t was a pointer, which isn't true on all architectures. Restructure the code to not rely on that assumption. [[email protected]: further fixes integrated into this patch] Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This is another place where DAX assumed that pgtable_t was a pointer. Open code the important parts of set_huge_zero_page() in DAX and make set_huge_zero_page() static again. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

If the first access to a huge page was a store, there would be no existing zero pmd in this process's page tables. There could be a zero pmd in another process's page tables, if it had done a load. We can detect this case by noticing that the buffer_head returned from the filesystem is New, and ensure that other processes mapping this huge page have their page tables flushed. Signed-off-by: Matthew Wilcox <[email protected]> Reported-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

I was basically open-coding it (thanks to copying code from do_fault() which probably also needs to be fixed). Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap. __dax_pmd_fault() uses unmap_mapping_range() shoot out zero page from all mappings. We need to drop i_mmap_lock there to avoid lock deadlock. Re-aquiring the lock should be fine since we check i_size after the point. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

__dax_fault() takes i_mmap_lock for write. Let's pair it with write unlock on do_cow_fault() side. Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

nr_node_ids records the highest possible node id, which is calculated by scanning the bitmap node_states[N_POSSIBLE]. Current implementation scan the bitmap from the beginning, which will scan the whole bitmap. This patch reverses the order by scanning from the end with find_last_bit(). Signed-off-by: Wei Yang <[email protected]> Cc: Tejun Heo <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Commit febd594 ("mm/memory hotplug: init the zone's size when calculating node totalpages") refines the function free_area_init_core(). After doing so, these two parameters are not used anymore. This patch removes these two parameters. Signed-off-by: Wei Yang <[email protected]> Cc: Gu Zheng <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Each memblock_region has flags to indicates the type of this range. For the overlap case, memblock_add_range() inserts the lower part and leave the upper part as indicated in the overlapped region. If the flags of the new range differs from the overlapped region, the information recorded is not correct. This patch adds a WARN_ON when the flags of the new range differs from the overlapped region. Signed-off-by: Wei Yang <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

It has no callers. Signed-off-by: Vineet Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This patchset makes pagemap useable again in the safe way (after row hammer bug it was made CAP_SYS_ADMIN-only). This patchset restores access for non-privileged users but hides PFNs from them. Also it adds bit 'map-exclusive' which is set if page is mapped only here: it helps in estimation of working set without exposing pfns and allows to distinguish CoWed and non-CoWed private anonymous pages. Second patch removes page-shift bits and completes migration to the new pagemap format: flags soft-dirty and mmap-exclusive are available only in the new format. This patch (of 5): This patch moves permission checks from pagemap_read() into pagemap_open(). Pointer to mm is saved in file->private_data. This reference pins only mm_struct itself. /proc/*/mem, maps, smaps already work in the same way. See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com Signed-off-by: Konstantin Khlebnikov <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Mark Williamson <[email protected]> Tested-by: Mark Williamson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This patch removes page-shift bits (scheduled to remove since 3.11) and completes migration to the new bit layout. Also it cleans messy macro. Signed-off-by: Konstantin Khlebnikov <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Mark Williamson <[email protected]> Tested-by: Mark Williamson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This patch moves pmd dissection out of reporting loop: huge pages are reported as bunch of normal pages with contiguous PFNs. Add missing "FILE" bit in hugetlb vmas. Signed-off-by: Konstantin Khlebnikov <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Mark Williamson <[email protected]> Tested-by: Mark Williamson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This patch makes pagemap readable for normal users and hides physical addresses from them. For some use-cases PFN isn't required at all. See http://lkml.kernel.org/r/[email protected] Fixes: ab676b7 ("pagemap: do not leak physical addresses to non-privileged userspace") Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Naoya Horiguchi <[email protected]> Reviewed-by: Mark Williamson <[email protected]> Tested-by: Mark Williamson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This patch sets bit 56 in pagemap if this page is mapped only once. It allows to detect exclusively used pages without exposing PFN: present file exclusive state 0 0 0 non-present 1 1 0 file page mapped somewhere else 1 1 1 file page mapped only here 1 0 0 anon non-CoWed page (shared with parent/child) 1 0 1 anon CoWed page (or never forked) CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context. MMap-exclusive bit doesn't reflect potential page-sharing via swapcache: page could be mapped once but has several swap-ptes which point to it. Application could detect that by swap bit in pagemap entry and touch that pte via /proc/pid/mem to get real information. See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com Requested by Mark Williamson. [[email protected]: fix spello] Signed-off-by: Konstantin Khlebnikov <[email protected]> Reviewed-by: Mark Williamson <[email protected]> Tested-by: Mark Williamson <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Notes about recent changes. [[email protected]: various tweaks] Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Mark Williamson <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Since simple_strtoul is obsolete and memtest_pattern is type of int, use kstrtouint instead. Signed-off-by: Vladimir Murzin <[email protected]> Cc: Leon Romanovsky <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

- prefer pr_info(... to printk(KERN_INFO ... - use %pa for phys_addr_t - use cpu_to_be64 while printing pattern in reserve_bad_mem() Signed-off-by: Vladimir Murzin <[email protected]> Cc: Leon Romanovsky <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

memtest does not require these headers to be included. Signed-off-by: Vladimir Murzin <[email protected]> Cc: Leon Romanovsky <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

We want to know per-process workingset size for smart memory management on userland and we use swap(ex, zram) heavily to maximize memory efficiency so workingset includes swap as well as RSS. On such system, if there are lots of shared anonymous pages, it's really hard to figure out exactly how many each process consumes memory(ie, rss + wap) if the system has lots of shared anonymous memory(e.g, android). This patch introduces SwapPss field on /proc/<pid>/smaps so we can get more exact workingset size per process. Bongkyu tested it. Result is below. 1. 50M used swap SwapTotal: 461976 kB SwapFree: 411192 kB $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}'; 48236 $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}'; 141184 2. 240M used swap SwapTotal: 461976 kB SwapFree: 216808 kB $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}'; 230315 $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}'; 1387744 [[email protected]: simplify kunmap_atomic() call] Signed-off-by: Minchan Kim <[email protected]> Reported-by: Bongkyu Kim <[email protected]> Tested-by: Bongkyu Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Jerome Marchand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

=== Short summary ==== iov_iter_fault_in_readable() works around a really rare case and we can avoid the deadlock it addresses in another way: disable page faults and work around copy failures by faulting after the copy in a slow path instead of before in a hot one. I have a little microbenchmark that does repeated, small writes to tmpfs. This patch speeds that micro up by 6.2%. === Long version === When doing a sys_write() we have a source buffer in userspace and then a target file page. If both of those are the same physical page, there is a potential deadlock that we avoid. It would happen something like this: 1. We start the write to the file 2. Allocate page cache page and set it !Uptodate 3. Touch the userspace buffer to copy in the user data 4. Page fault (since source of the write not yet mapped) 5. Page fault code tries to lock the page and deadlocks (more details on this below) To avoid this, we prefault the page to guarantee that this fault does not occur. But, this prefault comes at a cost. It is one of the most expensive things that we do in a hot write() path (especially if we compare it to the read path). It is working around a pretty rare case. To fix this, it's pretty simple. We move the "prefault" code to run after we attempt the copy. We explicitly disable page faults _during_ the copy, detect the copy failure, then execute the "prefault" ouside of where the page lock needs to be held. iov_iter_copy_from_user_atomic() actually already has an implicit pagefault_disable() inside of it (at least on x86), but we add an explicit one. I don't think we can depend on every kmap_atomic() implementation to pagefault_disable() for eternity. =================================================== The stack trace when this happens looks like this: wait_on_page_bit_killable+0xc0/0xd0 __lock_page_or_retry+0x84/0xa0 filemap_fault+0x1ed/0x3d0 __do_fault+0x41/0xc0 handle_mm_fault+0x9bb/0x1210 __do_page_fault+0x17f/0x3d0 do_page_fault+0xc/0x10 page_fault+0x22/0x30 generic_perform_write+0xca/0x1a0 __generic_file_write_iter+0x190/0x1f0 ext4_file_write_iter+0xe9/0x460 __vfs_write+0xaa/0xe0 vfs_write+0xa6/0x1a0 SyS_write+0x46/0xa0 entry_SYSCALL_64_fastpath+0x12/0x6a 0xffffffffffffffff (Note, this does *NOT* happen in practice today because the kmap_atomic() does a pagefault_disable(). The trace above was obtained by taking out the pagefault_disable().) You can trigger the deadlock with this little code snippet: fd = open("foo", O_RDWR); fdmap = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0); write(fd, &fdmap[0], 1); Signed-off-by: Dave Hansen <[email protected]> Cc: Al Viro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Tejun Heo <[email protected]> Cc: NeilBrown <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Paul Cassella <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Explicitly state that __GFP_NORETRY will attempt direct reclaim and memory compaction before returning NULL and that the oom killer is not called in the current implementation of the page allocator. [[email protected]: s/has/have/] Signed-off-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This makes set_recommended_min_free_kbytes() have a return type of void as it cannot fail. Signed-off-by: Nicholas Krause <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

There are essential elements to an oom context that are passed around to multiple functions. Organize these elements into a new struct, struct oom_control, that specifies the context for an oom condition. This patch introduces no functional change. Signed-off-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The force_kill member of struct oom_control isn't needed if an order of -1 is used instead. This is the same as order == -1 in struct compact_control which requires full memory compaction. This patch introduces no functional change. Signed-off-by: David Rientjes <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Sysrq+f is used to kill a process either for debug or when the VM is otherwise unresponsive. It is not intended to trigger a panic when no process may be killed. Avoid panicking the system for sysrq+f when no processes are killed. Signed-off-by: David Rientjes <[email protected]> Suggested-by: Michal Hocko <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Describe the purpose of struct oom_control and what each member does. Also make gfp_mask and order const since they are never manipulated or passed to functions that discard the qualifier. Signed-off-by: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The "killed" variable in out_of_memory() can be removed since the call to oom_kill_process() where we should block to allow the process time to exit is obvious. Signed-off-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

kmem_cache_destroy() does not tolerate a NULL kmem_cache pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all kmem_cache_destroy() callers (200+ as of 4.1) to do a NULL check if (cache) kmem_cache_destroy(cache); Or, otherwise, be invalid kmem_cache_destroy() users. Tweak kmem_cache_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583 Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Julia Lawall <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

mempool_destroy() does not tolerate a NULL mempool_t pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all mempool_destroy() callers to do a NULL check if (pool) mempool_destroy(pool); Or, otherwise, be invalid mempool_destroy() users. Tweak mempool_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583 Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Julia Lawall <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

dma_pool_destroy() does not tolerate a NULL dma_pool pointer argument and performs a NULL-pointer dereference. This requires additional attention and effort from developers/reviewers and forces all dma_pool_destroy() callers to do a NULL check if (pool) dma_pool_destroy(pool); Or, otherwise, be invalid dma_pool_destroy() users. Tweak dma_pool_destroy() and NULL-check the pointer there. Proposed by Andrew Morton. Link: https://lkml.org/lkml/2015/6/8/583 Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Julia Lawall <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

"memcg: export struct mem_cgroup" will add includes into linux/memcontrol.h which lead to further header dependency issues as reported by Guenter Roeck: In file included from include/linux/highmem.h:7:0, from include/linux/bio.h:23, from include/linux/writeback.h:192, from include/linux/memcontrol.h:30, from include/linux/swap.h:8, from ./arch/sparc/include/asm/pgtable_32.h:17, from ./arch/sparc/include/asm/pgtable.h:6, from arch/sparc/kernel/traps_32.c:23: include/linux/mm.h: In function 'is_vmalloc_addr': include/linux/mm.h:371:17: error: 'VMALLOC_START' undeclared (first use in this function) include/linux/mm.h:371:17: note: each undeclared identifier is reported only once for each function it appears in include/linux/mm.h:371:41: error: 'VMALLOC_END' undeclared (first use in this function) include/linux/mm.h: In function 'maybe_mkwrite': include/linux/mm.h:556:3: error: implicit declaration of function 'pte_mkwrite' The issue is that pgtable_32.h depends on swap.h to get swap_entry_t but that goes all the way down to linux/mm.h which wants to have VMALLOC_* which is defined later in pgtable_32.h, though. swap_entry_t is defined in include/mm_types.h so it should be sufficient to include this header without more dependencies. Signed-off-by: Michal Hocko <[email protected]> Reported-by: Guenter Roeck <[email protected]> Tested-by: Guenter Roeck <[email protected]> Cc: David Miller <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

mem_cgroup structure is defined in mm/memcontrol.c currently which means that the code outside of this file has to use external API even for trivial access stuff. This patch exports mm_struct with its dependencies and makes some of the exported functions inlines. This even helps to reduce the code size a bit (make defconfig + CONFIG_MEMCG=y) text data bss dec hex filename 12355346 1823792 1089536 15268674 e8fb42 vmlinux.before 12354970 1823792 1089536 15268298 e8f9ca vmlinux.after This is not much (370B) but better than nothing. We also save a function call in some hot paths like callers of mem_cgroup_count_vm_event which is used for accounting. The patch doesn't introduce any functional changes. [[email protected]: inline memcg_kmem_is_active] [[email protected]: do not expose type outside of CONFIG_MEMCG] [[email protected]: memcontrol.h needs eventfd.h for eventfd_ctx] [[email protected]: export mem_cgroup_from_task() to modules] Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The only user is cgwb_bdi_init and that one depends on CONFIG_CGROUP_WRITEBACK which in turn depends on CONFIG_MEMCG so it doesn't make much sense to definte an empty stub for !CONFIG_MEMCG. Moreover ERR_PTR(-EINVAL) is ugly and would lead to runtime crashes if used in unguarded code paths. Better fail during compilation. Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Most of the exported functions in this header are not marked extern so change the rest to follow the same style. Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Restructure it to lower nesting level and help the planned threadgroup leader iteration changes. This is pure reorganization. Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

sk_prot->proto_cgroup is allowed to return NULL but sock_update_memcg doesn't check for NULL. The function relies on the mem_cgroup_is_root check because we shouldn't get NULL otherwise because mem_cgroup_from_task will always return !NULL. All other callers are checking for NULL and we can safely replace mem_cgroup_is_root() check by cg_proto != NULL which will be more straightforward (proto_cgroup returns NULL for the root memcg already). Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The only user is sock_update_memcg which is living in memcontrol.c so it doesn't make much sense to pollute sock.h by this inline helper. Move it to memcontrol.c and open code it into its only caller. Signed-off-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

CMA reserved memory is not part of total reserved memory. Currently when we print the total reserve memory it considers cma as part of reserve memory and do minus of totalcma_pages from reserved, which is wrong. In cases where total reserved is less than cma reserved we will get negative values & while printing we print as unsigned and we will get a very large value. Below is the show mem output on X86 ubuntu based system where CMA reserved is 100MB (25600 pages) & total reserved is ~40MB(10316 pages). And reserve memory shows a large value because of this bug. Before: [ 127.066430] 898908 pages RAM [ 127.066432] 671682 pages HighMem/MovableOnly [ 127.066434] 4294952012 pages reserved [ 127.066436] 25600 pages cma reserved After: [ 44.663129] 898908 pages RAM [ 44.663130] 671682 pages HighMem/MovableOnly [ 44.663130] 10316 pages reserved [ 44.663131] 25600 pages cma reserved Signed-off-by: Vishnu Pratap Singh <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Laurent Pinchart <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Danesh Petigara <[email protected]> Cc: Laura Abbott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The __test_page_isolated_in_pageblock() is used to verify whether all pages in pageblock were either successfully isolated, or are hwpoisoned. Two of the possible state of pages, that are tested, are however bogus and misleading. Both tests rely on get_freepage_migratetype(page), which however has no guarantees about pages on freelists. Specifically, it doesn't guarantee that the migratetype returned by the function actually matches the migratetype of the freelist that the page is on. Such guarantee is not its purpose and would have negative impact on allocator performance. The first test checks whether the freepage_migratetype equals MIGRATE_ISOLATE, supposedly to catch races between page isolation and allocator activity. These races should be fixed nowadays with 51bb1a4 ("mm/page_alloc: add freepage on isolate pageblock to correct buddy list") and related patches. As explained above, the check wouldn't be able to catch them reliably anyway. For the same reason false positives can happen, although they are harmless, as the move_freepages() call would just move the page to the same freelist it's already on. So removing the test is not a bug fix, just cleanup. After this patch, we assume that all PageBuddy pages are on the correct freelist and that the races were really fixed. A truly reliable verification in the form of e.g. VM_BUG_ON() would be complicated and is arguably not needed. The second test (page_count(page) == 0 && get_freepage_migratetype(page) == MIGRATE_ISOLATE) is probably supposed (the code comes from a big memory isolation patch from 2007) to catch pages on MIGRATE_ISOLATE pcplists. However, pcplists don't contain MIGRATE_ISOLATE freepages nowadays, those are freed directly to free lists, so the check is obsolete. Remove it as well. Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Joonsoo Kim <[email protected]> Cc: Minchan Kim <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Laura Abbott <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Seungho Park <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The pair of get/set_freepage_migratetype() functions are used to cache pageblock migratetype for a page put on a pcplist, so that it does not have to be retrieved again when the page is put on a free list (e.g. when pcplists become full). Historically it was also assumed that the value is accurate for pages on freelists (as the functions' names unfortunately suggest), but that cannot be guaranteed without affecting various allocator fast paths. It is in fact not needed and all such uses have been removed. The last remaining (but pointless) usage related to pages of freelists is in move_freepages(), which this patch removes. To prevent further confusion, rename the functions to get/set_pcppage_migratetype() and expand their description. Since all the users are now in mm/page_alloc.c, move the functions there from the shared header. Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Joonsoo Kim <[email protected]> Cc: Minchan Kim <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Laura Abbott <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Seungho Park <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

hugetlbfs is used today by applications that want a high degree of control over huge page usage. Often, large hugetlbfs files are used to map a large number huge pages into the application processes. The applications know when page ranges within these large files will no longer be used, and ideally would like to release them back to the subpool or global pools for other uses. The fallocate() system call provides an interface for preallocation and hole punching within files. This patch set adds fallocate functionality to hugetlbfs. fallocate hole punch will want to remove a specific range of pages. When pages are removed, their associated entries in the region/reserve map will also be removed. This will break an assumption in the region_chg/region_add calling sequence. If a new region descriptor must be allocated, it is done as part of the region_chg processing. In this way, region_add can not fail because it does not need to attempt an allocation. To prepare for fallocate hole punch, create a "cache" of descriptors that can be used by region_add if necessary. region_chg will ensure there are sufficient entries in the cache. It will be necessary to track the number of in progress add operations to know a sufficient number of descriptors reside in the cache. A new routine region_abort is added to adjust this in progress count when add operations are aborted. vma_abort_reservation is also added for callers creating reservations with vma_needs_reservation/vma_commit_reservation. [[email protected]: fix typo in comment, use more cols] Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

fallocate hole punch will want to remove a specific range of pages. The existing region_truncate() routine deletes all region/reserve map entries after a specified offset. region_del() will provide this same functionality if the end of region is specified as LONG_MAX. Hence, region_del() can replace region_truncate(). Unlike region_truncate(), region_del() can return an error in the rare case where it can not allocate memory for a region descriptor. This ONLY happens in the case where an existing region must be split. Current callers passing LONG_MAX as end of range will never experience this error and do not need to deal with error handling. Future callers of region_del() (such as fallocate hole punch) will need to handle this error. Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

hugetlb page faults are currently synchronized by the table of mutexes (htlb_fault_mutex_table). fallocate code will need to synchronize with the page fault code when it allocates or deletes pages. Expose interfaces so that fallocate operations can be synchronized with page faults. Minor name changes to be more consistent with other global hugetlb symbols. Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

fallocate hole punch will want to unmap a specific range of pages. Modify the existing hugetlb_vmtruncate_list() routine to take a start/end range. If end is 0, this indicates all pages after start should be unmapped. This is the same as the existing truncate functionality. Modify existing callers to add 0 as end of range. Since the routine will be used in hole punch as well as truncate operations, it is more appropriately renamed to hugetlb_vmdelete_list(). Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Modify truncate_hugepages() to take a range of pages (start, end) instead of simply start. If an end value of LLONG_MAX is passed, the current "truncate" functionality is maintained. Existing callers are modified to pass LLONG_MAX as end of range. By keying off end == LLONG_MAX, the routine behaves differently for truncate and hole punch. Page removal is now synchronized with page allocation via faults by using the fault mutex table. The hole punch case can experience the rare region_del error and must handle accordingly. Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in the case where region_del returns an error. Since the routine handles more than just the truncate case, it is renamed to remove_inode_hugepages(). To be consistent, the routine truncate_huge_page() is renamed remove_huge_page(). Downstream of remove_inode_hugepages(), the routine hugetlb_unreserve_pages() is also modified to take a range of pages. hugetlb_unreserve_pages is modified to detect an error from region_del and pass it back to the caller. Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

In vma_has_reserves(), the current assumption is that reserves are always present for shared mappings. However, this will not be the case with fallocate hole punch. When punching a hole, the present page will be deleted as well as the region/reserve map entry (and hence any reservation). vma_has_reserves is passed "chg" which indicates whether or not a region/reserve map is present. Use this to determine if reserves are actually present or were removed via hole punch. Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Areas hole punched by fallocate will not have entries in the region/reserve map. However, shared mappings with min_size subpool reservations may still have reserved pages. alloc_huge_page needs to handle this special case and do the proper accounting. Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Currently, there is only a single place where hugetlbfs pages are added to the page cache. The new fallocate code be adding a second one, so break the functionality out into its own helper. Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This is based on the shmem version, but it has diverged quite a bit. We have no swap to worry about, nor the new file sealing. Add synchronication via the fault mutex table to coordinate page faults, fallocate allocation and fallocate hole punch. What this allows us to do is move physical memory in and out of a hugetlbfs file without having it mapped. This also gives us the ability to support MADV_REMOVE since it is currently implemented using fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of a hugetlbfs file, which wasn't possible before. hugetlbfs fallocate only operates on whole huge pages. Based on code by Dave Hansen. Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Now that we have hole punching support for hugetlbfs, we can also support the MADV_REMOVE interface to it. Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

memblock_overlaps_region() checks if the given memblock region intersects a region in memblock. If so, it returns the index of the intersected region. But its only caller is memblock_is_region_reserved(), and it returns 0 if false, non-zero if true. Both of these should return bool. Signed-off-by: Tang Chen <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Will Deacon <[email protected]> Cc: Vladimir Murzin <[email protected]> Cc: Fabian Frederick <[email protected]> Cc: Alexander Kuleshov <[email protected]> Cc: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

When parsing SRAT, all memory ranges are added into numa_meminfo. In numa_init(), before entering numa_cleanup_meminfo(), all possible memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes all ranges over max_pfn or empty. But, this only works if the nodes are continuous. Let's have a look at the following example: We have an SRAT like this: SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff] SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff] SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff] SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug On boot, only node 0,1,2,3 exist. And the numa_meminfo will look like this: numa_meminfo.nr_blks = 9 1. on node 0: [0, 60000000] 2. on node 0: [100000000, 20000000000] 3. on node 1: [20000000000, 40000000000] 4. on node 4: [40000000000, 60000000000] 5. on node 5: [60000000000, 80000000000] 6. on node 2: [80000000000, a0000000000] 7. on node 3: [a0000000000, a0800000000] 8. on node 6: [c0000000000, a0800000000] 9. on node 7: [e0000000000, a0800000000] And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because the end address is over max_pfn, which is a0800000000. But 4 and 5 are not removed because their end addresses are less then max_pfn. But in fact, node 4 and 5 don't exist. In a word, numa_cleanup_meminfo() is not able to handle holes between nodes. Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(), node 4 and 5 will be mistakenly set to online. If you run lscpu, it will show: NUMA node0 CPU(s): 0-14,128-142 NUMA node1 CPU(s): 15-29,143-157 NUMA node2 CPU(s): NUMA node3 CPU(s): NUMA node4 CPU(s): 62-76,190-204 NUMA node5 CPU(s): 78-92,206-220 In this patch, we use memblock_overlaps_region() to check if ranges in numa_meminfo overlap with ranges in memory_block. Since memory_block contains all available memory at boot time, if they overlap, it means the ranges exist. If not, then remove them from numa_meminfo. After this patch, lscpu will show: NUMA node0 CPU(s): 0-14,128-142 NUMA node1 CPU(s): 15-29,143-157 NUMA node4 CPU(s): 62-76,190-204 NUMA node5 CPU(s): 78-92,206-220 Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Yasuaki Ishimatsu <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Will Deacon <[email protected]> Cc: Vladimir Murzin <[email protected]> Cc: Fabian Frederick <[email protected]> Cc: Alexander Kuleshov <[email protected]> Cc: Baoquan He <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

…ue_pages_range() This check was introduced as part of 6f4576e ("mempolicy: apply page table walker on queue_pages_range()") which got duplicated by 48684a6 ("mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)") by reintroducing it earlier on queue_page_test_walk() Signed-off-by: Aristeu Rozanski <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Acked-by: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Nowaday, set/unset_migratetype_isolate() is defined and used only in mm/page_isolation, so let's limit the scope within the file. Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Bootmem isn't popular any more, but some architectures still use it, and freeing to bootmem after calling free_all_bootmem_core() can end up scribbling over random memory. Instead, make sure the kernel generates a warning in this case by ensuring the node_bootmem_map field is non-NULL when are freeing or marking bootmem. An instance of this bug was just fixed in the tile architecture ("tile: use free_bootmem_late() for initrd") and catching this case more widely seems like a good thing. Signed-off-by: Chris Metcalf <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Paul McQuade <[email protected]> Cc: Tang Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

__nocast does no good for vm_flags_t. It only produces useless sparse warnings. Let's drop it. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Oleg Nesterov <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

If transparent huge pages are enabled, we can isolate many more pages than we actually need to scan, because we count both single and huge pages equally in isolate_lru_pages(). Since commit 5bc7b8a ("mm: thp: add split tail pages to shrink page list in page reclaim"), we scan all the tail pages immediately after a huge page split (see shrink_page_list()). As a result, we can reclaim up to SWAP_CLUSTER_MAX * HPAGE_PMD_NR (512 MB) in one run! This is easy to catch on memcg reclaim with zswap enabled. The latter makes swapout instant so that if we happen to scan an unreferenced huge page we will evict both its head and tail pages immediately, which is likely to result in excessive reclaim. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

reclaim_clean_pages_from_list() assumes that shrink_page_list() returns number of pages removed from the candidate list. But shrink_page_list() puts back mlocked pages without passing it to caller and without counting as nr_reclaimed. This increases nr_isolated. To fix this, this patch changes shrink_page_list() to pass unevictable pages back to caller. Caller will take care those pages. Minchan said: It fixes two issues. 1. With unevictable page, cma_alloc will be successful. Exactly speaking, cma_alloc of current kernel will fail due to unevictable pages. 2. fix leaking of NR_ISOLATED counter of vmstat With it, too_many_isolated works. Otherwise, it could make hang until the process get SIGKILL. Signed-off-by: Jaewon Kim <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Currently a call to dma_pool_alloc() with a ___GFP_ZERO flag returns a non-zeroed memory region. This patchset adds support for the __GFP_ZERO flag to dma_pool_alloc(), adds 2 wrapper functions for allocing zeroed memory from a pool, and provides a coccinelle script for finding & replacing instances of dma_pool_alloc() followed by memset(0) with a single dma_pool_zalloc() call. There was some concern that this always calls memset() to zero, instead of passing __GFP_ZERO into the page allocator. [https://lkml.org/lkml/2015/7/15/881] I ran a test on my system to get an idea of how often dma_pool_alloc() calls into pool_alloc_page(). After Boot: [ 30.119863] alloc_calls:541, page_allocs:7 After an hour: [ 3600.951031] alloc_calls:9566, page_allocs:12 After copying 1GB file onto a USB drive: [ 4260.657148] alloc_calls:17225, page_allocs:12 It doesn't look like dma_pool_alloc() calls down to the page allocator very often (at least on my system). This patch (of 4): Currently the __GFP_ZERO flag is ignored by dma_pool_alloc(). Make dma_pool_alloc() zero the memory if this flag is set. Signed-off-by: Sean O. Stalley <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Vinod Koul <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Gilles Muller <[email protected]> Cc: Nicolas Palix <[email protected]> Cc: Michal Marek <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Add a wrapper function for dma_pool_alloc() to get zeroed memory. Signed-off-by: Sean O. Stalley <[email protected]> Cc: Vinod Koul <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Gilles Muller <[email protected]> Cc: Nicolas Palix <[email protected]> Cc: Michal Marek <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Add a wrapper function for pci_pool_alloc() to get zeroed memory. Signed-off-by: Sean O. Stalley <[email protected]> Cc: Vinod Koul <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Gilles Muller <[email protected]> Cc: Nicolas Palix <[email protected]> Cc: Michal Marek <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

add [pci|dma]_pool_zalloc coccinelle check. replaces instances of [pci|dma]_pool_alloc() followed by memset(0) with [pci|dma]_pool_zalloc(). Signed-off-by: Sean O. Stalley <[email protected]> Acked-by: Julia Lawall <[email protected]> Cc: Vinod Koul <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Gilles Muller <[email protected]> Cc: Nicolas Palix <[email protected]> Cc: Michal Marek <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Assorted compaction cleanups and optimizations. The interesting patches are 4 and 5. In 4, skipping of compound pages in single iteration is improved for migration scanner, so it works also for !PageLRU compound pages such as hugetlbfs, slab etc. Patch 5 introduces this kind of skipping in the free scanner. The trick is that we can read compound_order() without any protection, if we are careful to filter out values larger than MAX_ORDER. The only danger is that we skip too much. The same trick was already used for reading the freepage order in the migrate scanner. To demonstrate improvements of Patches 4 and 5 I've run stress-highalloc from mmtests, set to simulate THP allocations (including __GFP_COMP) on a 4GB system where 1GB was occupied by hugetlbfs pages. I'll include just the relevant stats: Patch 3 Patch 4 Patch 5 Compaction stalls 7523 7529 7515 Compaction success 323 304 322 Compaction failures 7200 7224 7192 Page migrate success 247778 264395 240737 Page migrate failure 15358 33184 21621 Compaction pages isolated 906928 980192 909983 Compaction migrate scanned 2005277 1692805 1498800 Compaction free scanned 13255284 11539986 9011276 Compaction cost 288 305 277 With 5 iterations per patch, the results are still noisy, but we can see that Patch 4 does reduce migrate_scanned by 15% thanks to skipping the hugetlbfs pages at once. Interestingly, free_scanned is also reduced and I have no idea why. Patch 5 further reduces free_scanned as expected, by 15%. Other stats are unaffected modulo noise. [1] https://lkml.org/lkml/2015/1/19/158 This patch (of 5): Compaction should finish when the migration and free scanner meet, i.e. they reach the same pageblock. Currently however, the test in compact_finished() simply just compares the exact pfns, which may yield a false negative when the free scanner position is in the middle of a pageblock and the migration scanner reaches the begining of the same pageblock. This hasn't been a problem until commit e14c720 ("mm, compaction: remember position within pageblock in free pages scanner") allowed the free scanner position to be in the middle of a pageblock between invocations. The hot-fix 1d5bfe1 ("mm, compaction: prevent infinite loop in compact_zone") prevented the issue by adding a special check in the migration scanner to satisfy the current detection of scanners meeting. However, the proper fix is to make the detection more robust. This patch introduces the compact_scanners_met() function that returns true when the free scanner position is in the same or lower pageblock than the migration scanner. The special case in isolate_migratepages() introduced by 1d5bfe1 is removed. Suggested-by: Joonsoo Kim <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Joonsoo Kim <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Christoph Lameter <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Handling the position where compaction free scanner should restart (stored in cc->free_pfn) got more complex with commit e14c720 ("mm, compaction: remember position within pageblock in free pages scanner"). Currently the position is updated in each loop iteration of isolate_freepages(), although it should be enough to update it only when breaking from the loop. There's also an extra check outside the loop updates the position in case we have met the migration scanner. This can be simplified if we move the test for having isolated enough from the for-loop header next to the test for contention, and determining the restart position only in these cases. We can reuse the isolate_start_pfn variable for this instead of setting cc->free_pfn directly. Outside the loop, we can simply set cc->free_pfn to current value of isolate_start_pfn without any extra check. Also add a VM_BUG_ON to catch possible mistake in the future, in case we later add a new condition that terminates isolate_freepages_block() prematurely without also considering the condition in isolate_freepages(). Signed-off-by: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Joonsoo Kim <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Rik van Riel <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Reseting the cached compaction scanner positions is now open-coded in __reset_isolation_suitable() and compact_finished(). Encapsulate the functionality in a new function reset_cached_positions(). Signed-off-by: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Joonsoo Kim <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Rik van Riel <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

…anner The compaction migrate scanner tries to skip THP pages by their order, to reduce number of iterations for pages it cannot isolate. The check is only done if PageLRU() is true, which means it applies to THP pages, but not e.g. hugetlbfs pages or any other non-LRU compound pages, which we have to iterate by base pages. This limitation comes from the assumption that it's only safe to read compound_order() when we have the zone's lru_lock and THP cannot be split under us. But the only danger (after filtering out order values that are not below MAX_ORDER, to prevent overflows) is that we skip too much or too little after reading a bogus compound_order() due to a rare race. This is the same reasoning as patch 99c0fd5 ("mm, compaction: skip buddy pages by their order in the migrate scanner") introduced for unsafely reading PageBuddy() order. After this patch, all pages are tested for PageCompound() and we skip them by compound_order(). The test is done after the test for balloon_page_movable() as we don't want to assume if balloon pages (or other pages with own isolation and migration implementation if a generic API gets implemented) are compound or not. When tested with stress-highalloc from mmtests on 4GB system with 1GB hugetlbfs pages, the vmstat compact_migrate_scanned count decreased by 15%. [[email protected]: change PageTransHuge checks to PageCompound for different series was squashed here] Signed-off-by: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Joonsoo Kim <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Rik van Riel <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The compaction free scanner is looking for PageBuddy() pages and skipping all others. For large compound pages such as THP or hugetlbfs, we can save a lot of iterations if we skip them at once using their compound_order(). This is generally unsafe and we can read a bogus value of order due to a race, but if we are careful, the only danger is skipping too much. When tested with stress-highalloc from mmtests on 4GB system with 1GB hugetlbfs pages, the vmstat compact_free_scanned count decreased by at least 15%. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Acked-by: Joonsoo Kim <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Rik van Riel <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This manually reverts 7e50533 ("selftests: add hugetlbfstest"). The hugetlbfstest test depends on hugetlb pages being counted in a task's rss. This functionality is not in the kernel, so the test will always fail. Remove test to avoid confusion. Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Joern Engel <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The hugetlb selftests provide minimal coverage. Have run script point people at libhugetlbfs for better regression testing. Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Joern Engel <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The URL for libhugetlbfs has changed. Also, put a stronger emphasis on using libgugetlbfs for hugetlb regression testing. Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Joern Engel <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David Rientjes <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

When booting an arm64 kernel w/initrd using UEFI/grub, use of mem= will likely cut off part or all of the initrd. This leaves it outside the kernel linear map which leads to failure when unpacking. The x86 code has a similar need to relocate an initrd outside of mapped memory in some cases. The current x86 code uses early_memremap() to copy the original initrd from unmapped to mapped RAM. This patchset creates a generic copy_from_early_mem() utility based on that x86 code and has arm64 and x86 share it in their respective initrd relocation code. This patch (of 3): In some early boot circumstances, it may be necessary to copy from RAM outside the kernel linear mapping to mapped RAM. The need to relocate an initrd is one example in the x86 code. This patch creates a helper function based on current x86 code. Signed-off-by: Mark Salter <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Russell King <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The use of mem= could leave part or all of the initrd outside of the kernel linear map. This will lead to an error when unpacking the initrd and a probable failure to boot. This patch catches that situation and relocates the initrd to be fully within the linear map. Signed-off-by: Mark Salter <[email protected]> Acked-by: Will Deacon <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Russell King <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The early_ioremap library now has a generic copy_from_early_mem() function. Use the generic copy function for x86 relocate_initrd(). [[email protected]: remove MAX_MAP_CHUNK define, per Yinghai Lu] Signed-off-by: Mark Salter <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Russell King <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

THP pages will get a refcount in madvise_hwpoison() w/ MF_COUNT_INCREASED flag, however, the refcount is still held when fail to split THP pages. Fix it by reducing the refcount of THP pages when fail to split THP. Signed-off-by: Wanpeng Li <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

There is a race between madvise_hwpoison path and memory_failure: CPU0 CPU1 madvise_hwpoison get_user_pages_fast PageHWPoison check (false) memory_failure TestSetPageHWPoison soft_offline_page PageHWPoison check (true) return -EBUSY (without put_page) Signed-off-by: Wanpeng Li <[email protected]> Suggested-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

…rror handling Introduce put_hwpoison_page to put refcount for memory error handling. Signed-off-by: Wanpeng Li <[email protected]> Suggested-by: Naoya Horiguchi <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Hwpoison injection takes a refcount of target page and another refcount of head page of THP if the target page is the tail page of a THP. However, current code doesn't release the refcount of head page if the THP is not supported to be injected wrt hwpoison filter. Fix it by reducing the refcount of head page if the target page is the tail page of a THP and it is not supported to be injected. Signed-off-by: Wanpeng Li <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

…_hwpoison_page Replace most instances of put_page() in memory error handling with put_hwpoison_page(). Signed-off-by: Wanpeng Li <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

num_poisoned_pages counter will be changed outside mm/memory-failure.c by a subsequent patch, so this patch prepares wrappers to manipulate it. Signed-off-by: Naoya Horiguchi <[email protected]> Tested-by: Wanpeng Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Wanpeng Li reported a race between soft_offline_page() and unpoison_memory(), which causes the following kernel panic: BUG: Bad page state in process bash pfn:97000 page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00 flags: 0x1fffff80080048(uptodate|active|swapbacked) page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: flags: 0x40(active) Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45 Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015 Call Trace: dump_stack+0x48/0x5c bad_page+0xe6/0x140 free_pages_prepare+0x2f9/0x320 ? uncharge_list+0xdd/0x100 free_hot_cold_page+0x40/0x170 __put_single_page+0x20/0x30 put_page+0x25/0x40 unmap_and_move+0x1a6/0x1f0 migrate_pages+0x100/0x1d0 ? kill_procs+0x100/0x100 ? unlock_page+0x6f/0x90 __soft_offline_page+0x127/0x2a0 soft_offline_page+0xa6/0x200 This race is explained like below: CPU0 CPU1 soft_offline_page __soft_offline_page TestSetPageHWPoison unpoison_memory PageHWPoison check (true) TestClearPageHWPoison put_page -> release refcount held by get_hwpoison_page in unpoison_memory put_page -> release refcount held by isolate_lru_page in __soft_offline_page migrate_pages The second put_page() releases refcount held by isolate_lru_page() which will lead to unmap_and_move() releases the last refcount of page and w/ mapcount still 1 since try_to_unmap() is not called if there is only one user map the page. Anyway, the page refcount and mapcount will still mess if the page is mapped by multiple users. This race was introduced by commit 4491f71 ("mm/memory-failure: set PageHWPoison before migrate_pages()"), which focuses on preventing the reuse of successfully migrated page. Before this commit we prevent the reuse by changing the migratetype to MIGRATE_ISOLATE during soft offlining, which has the following problems, so simply reverting the commit is not a best option: 1) it doesn't eliminate the reuse completely, because set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the target page if the pageblock of the page contains one or more unmovable pages (i.e. has_unmovable_pages() returns true). 2) the original code changes migratetype to MIGRATE_ISOLATE forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline, regardless of the original migratetype state, which could impact other subsystems like memory hotplug or compaction. This patch moves PageSetHWPoison just after put_page() in unmap_and_move(), which closes up the reported race window and minimizes another race window b/w SetPageHWPoison and reallocation (which causes the reuse of soft-offlined page.) The latter race window still exists but it's acceptable, because it's rare and effectively the same as ordinary "containment failure" case even if it happens, so keep the window open is acceptable. Fixes: 4491f71 ("mm/memory-failure: set PageHWPoison before migrate_pages()") Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Naoya Horiguchi <[email protected]> Reported-by: Wanpeng Li <[email protected]> Tested-by: Wanpeng Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

memory_failure() can be called at any page at any time, which means that we can't eliminate the possibility of containment failure. In such case the best option is to leak the page intentionally (and never touch it later.) We have an unpoison function for testing, and it cannot handle such containment-failed pages, which results in kernel panic (visible with various calltraces.) So this patch suggests that we limit the unpoisonable pages to properly contained pages and ignore any other ones. Testers are recommended to keep in mind that there're un-unpoisonable pages when writing test programs. Signed-off-by: Naoya Horiguchi <[email protected]> Tested-by: Wanpeng Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Since commit e3239ff ("memblock: Rename memblock_region to memblock_type and memblock_property to memblock_region"), all local variables of the membock_type type were renamed to 'type'. This commit renames all remaining local variables with the memblock_type type to the same view. Signed-off-by: Alexander Kuleshov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Shmem uses shmem_recalc_inode to update i_blocks when it allocates page, undoes range or swaps. But mm can drop clean page without notifying shmem. This makes fstat sometimes return out-of-date block size. The problem can be partially solved when we add inode_operations->getattr which calls shmem_recalc_inode to update i_blocks for fstat. shmem_recalc_inode also updates counter used by statfs and vm_committed_as. For them the situation is not changed. They still suffer from the discrepancy after dropping clean page and before the function is called by aforementioned triggers. Signed-off-by: Yu Zhao <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

__split_vma() doesn't need out_err label, neither need initializing err. copy_vma() can return NULL directly when kmem_cache_alloc() fails. Signed-off-by: Chen Gang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

…eded In log_early function, crt_early_log should also count once when 'crt_early_log >= ARRAY_SIZE(early_log)'. Otherwise the reported count from kmemleak_init is one less than 'actual number'. Then, in kmemleak_init, if early_log buffer size equal actual number, kmemleak will init sucessful, so change warning condition to 'crt_early_log > ARRAY_SIZE(early_log)'. Signed-off-by: Wang Kai <[email protected]> Acked-by: Catalin Marinas <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

If the list_head is empty then we'll have called list_lru_from_kmem for nothing. Move that call inside of the list_empty if block. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This is merely a politeness: I've not found that shrink_page_list() leads to deadlock with the page it holds locked across wait_on_page_writeback(); but nevertheless, why hold others off by keeping the page locked there? And while we're at it: remove the mistaken "not " from the commentary on this Case 3 (and a distracting blank line from Case 2, if I may). Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

alloc_pages_exact_node() was introduced in commit 6484eb3 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Robin Holt <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Michael Ellerman <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Cliff Whickman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Perform the same debug checks in alloc_pages_node() as are done in __alloc_pages_node(), by making the former function a wrapper of the latter one. In addition to better diagnostics in DEBUG_VM builds for situations which have been already fatal (e.g. out-of-bounds node id), there are two visible changes for potential existing buggy callers of alloc_pages_node(): - calling alloc_pages_node() with any negative nid (e.g. due to arithmetic overflow) was treated as passing NUMA_NO_NODE and fallback to local node was applied. This will now be fatal. - calling alloc_pages_node() with an offline node will now be checked for DEBUG_VM builds. Since it's not fatal if the node has been previously online, and this patch may expose some existing buggy callers, change the VM_BUG_ON in __alloc_pages_node() to VM_WARN_ON. Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

alloc_pages_node() might fail when called with NUMA_NO_NODE and __GFP_THISNODE on a CPU belonging to a memoryless node. To make the local-node fallback more robust and prevent such situations, use numa_mem_id(), which was introduced for similar scenarios in the slab context. Suggested-by: Christoph Lameter <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

We cache isolate_start_pfn before entering isolate_migratepages(). If pageblock is skipped in isolate_migratepages() due to whatever reason, cc->migrate_pfn can be far from isolate_start_pfn hence we flush pages that were freed. For example, the following scenario can be possible: - assume order-9 compaction, pageblock order is 9 - start_isolate_pfn is 0x200 - isolate_migratepages() - skip a number of pageblocks - start to isolate from pfn 0x600 - cc->migrate_pfn = 0x620 - return - last_migrated_pfn is set to 0x200 - check flushing condition - current_block_start is set to 0x600 - last_migrated_pfn < current_block_start then do useless flush This wrong flush would not help the performance and success rate so this patch tries to fix it. One simple way to know the exact position where we start to isolate migratable pages is that we cache it in isolate_migratepages() before entering actual isolation. This patch implements that and fixes the problem. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: David Rientjes <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

s/succees/success/ Signed-off-by: Alexander Kuleshov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Commit 1dfb059 ("thp: reduce khugepaged freezing latency") fixed khugepaged to do not block a system suspend. But the result is that it could not get interrupted before the given timeout because the condition for the wait event is "false". This patch puts back the original approach but it uses freezable_schedule_timeout_interruptible() instead of schedule_timeout_interruptible(). It does the right thing. I am pretty sure that the freezable variant was not used in the original fix only because it was not available at that time. The regression has been there for ages. It was not critical. It just did the allocation throttling a little bit more aggressively. I found this problem when converting the kthread to kthread worker API and trying to understand the code. This bug is thought to have minimal userspace-visible impact. Somebody could set a high alloc_sleep value by mistake, and then try to fix it back, but khugepaged would keep sleeping until the high value expires. Signed-off-by: Petr Mladek <[email protected]> Cc: Andrea Arcangeli <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: David Rientjes <[email protected]> Cc: Ebru Akagunduz <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jiri Kosina <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

There's no point in initializing vma->vm_pgoff if the insertion attempt will be failing anyway. Run the checks before performing the initialization. Signed-off-by: Chen Gang <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The comment says that the per-cpu batchsize and zone watermarks are determined by present_pages which is definitely wrong, they are both calculated from managed_pages. Fix it. Signed-off-by: Yaowei Bai <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

…mem_reserve_ratio in comments We use sysctl_lowmem_reserve_ratio rather than sysctl_lower_zone_reserve_ratio to determine how aggressive the kernel is in defending lowmem from the possibility of being captured into pinned user memory. To avoid misleading, correct it in some comments. Signed-off-by: Yaowei Bai <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

…_spanned_pages_in_node() When hot adding a node from add_memory(), we will add memblock first, so the node is not empty. But when called from cpu_up(), the node should be empty. Signed-off-by: Xishi Qiu <[email protected]> Cc: Tang Chen <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Taku Izumi <[email protected]>\ Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

For a memoryless node, the output of get_pfn_range_for_nid are all zero. It will display mem from 0 to -1. Signed-off-by: Zhen Lei <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Signed-off-by: Alexander Kuleshov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This patchset tweaks compaction and makes it possible to trigger pool compaction automatically when system is getting low on memory. zsmalloc in some cases can suffer from a notable fragmentation and compaction can release some considerable amount of memory. The problem here is that currently we fully rely on user space to perform compaction when needed. However, performing zsmalloc compaction is not always an obvious thing to do. For example, suppose we have a `idle' fragmented (compaction was never performed) zram device and system is getting low on memory due to some 3rd party user processes (gcc LTO, or firefox, etc.). It's quite unlikely that user space will issue zpool compaction in this case. Besides, user space cannot tell for sure how badly pool is fragmented; however, this info is known to zsmalloc and, hence, to a shrinker. This patch (of 7): __zs_compact() does not use `nr_to_migrate', drop it. Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Always account per-class `zs_size_stat' stats. This data will help us make better decisions during compaction. We are especially interested in OBJ_ALLOCATED and OBJ_USED, which can tell us if class compaction will result in any memory gain. For instance, we know the number of allocated objects in the class, the number of objects being used (so we also know how many objects are not used) and the number of objects per-page. So we can ensure if we have enough unused objects to form at least one ZS_EMPTY zspage during compaction. We calculate this value on per-class basis so we can calculate a total number of zspages that can be released. Which is exactly what a shrinker wants to know. Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

This function checks if class compaction will free any pages. Rephrasing -- do we have enough unused objects to form at least one ZS_EMPTY page and free it. It aborts compaction if class compaction will not result in any (further) savings. EXAMPLE (this debug output is not part of this patch set): - class size - number of allocated objects - number of used objects - max objects per zspage - pages per zspage - estimated number of pages that will be freed [..] class-512 objs:544 inuse:540 maxobj-per-zspage:8 pages-per-zspage:1 zspages-to-free:0 ... class-512 compaction is useless. break class-496 objs:660 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:2 class-496 objs:627 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:1 class-496 objs:594 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:0 ... class-496 compaction is useless. break class-448 objs:657 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:4 class-448 objs:648 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:3 class-448 objs:639 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:2 class-448 objs:630 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:1 class-448 objs:621 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:0 ... class-448 compaction is useless. break class-432 objs:728 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:1 class-432 objs:700 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:0 ... class-432 compaction is useless. break class-416 objs:819 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:2 class-416 objs:780 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:1 class-416 objs:741 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:0 ... class-416 compaction is useless. break class-400 objs:690 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:1 class-400 objs:680 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:0 ... class-400 compaction is useless. break class-384 objs:736 inuse:709 maxobj-per-zspage:32 pages-per-zspage:3 zspages-to-free:0 ... class-384 compaction is useless. break [..] Every "compaction is useless" indicates that we saved CPU cycles. class-512 has 544 object allocated 540 objects used 8 objects per-page Even if we have a ALMOST_EMPTY zspage, we still don't have enough room to migrate all of its objects and free this zspage; so compaction will not make a lot of sense, it's better to just leave it as is. Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Change zs_object_copy() argument order to be (DST, SRC) rather than (SRC, DST). copy/move functions usually have (to, from) arguments order. Rename alloc_target_page() to isolate_target_page(). This function doesn't allocate anything, it isolates target page, pretty much like isolate_source_page(). Tweak __zs_compact() comment. Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

`zs_compact_control' accounts the number of migrated objects but it has a limited lifespan -- we lose it as soon as zs_compaction() returns back to zram. It worked fine, because (a) zram had it's own counter of migrated objects and (b) only zram could trigger compaction. However, this does not work for automatic pool compaction (not issued by zram). To account objects migrated during auto-compaction (issued by the shrinker) we need to store this number in zs_pool. Define a new `struct zs_pool_stats' structure to keep zs_pool's stats there. It provides only `num_migrated', as of this writing, but it surely can be extended. A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to caller. Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats. Signed-off-by: Sergey Senozhatsky <[email protected]> Suggested-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Compaction returns back to zram the number of migrated objects, which is quite uninformative -- we have objects of different sizes so user space cannot obtain any valuable data from that number. Change compaction to operate in terms of pages and return back to compaction issuer the number of pages that were freed during compaction. So from now on we will export more meaningful value in zram<id>/mm_stat -- the number of freed (compacted) pages. This requires: (a) a rename of `num_migrated' to 'pages_compacted' (b) a internal API change -- return first_page's fullness_group from putback_zspage(), so we know when putback_zspage() did free_zspage(). It helps us to account compaction stats correctly. Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Perform automatic pool compaction by a shrinker when system is getting tight on memory. User-space has a very little knowledge regarding zsmalloc fragmentation and basically has no mechanism to tell whether compaction will result in any memory gain. Another issue is that user space is not always aware of the fact that system is getting tight on memory. Which leads to very uncomfortable scenarios when user space may start issuing compaction 'randomly' or from crontab (for example). Fragmentation is not always necessarily bad, allocated and unused objects, after all, may be filled with the data later, w/o the need of allocating a new zspage. On the other hand, we obviously don't want to waste memory when the system needs it. Compaction now has a relatively quick pool scan so we are able to estimate the number of pages that will be freed easily, which makes it possible to call this function from a shrinker->count_objects() callback. We also abort compaction as soon as we detect that we can't free any pages any more, preventing wasteful objects migrations. Signed-off-by: Sergey Senozhatsky <[email protected]> Suggested-by: Minchan Kim <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

We want to see more ZS_FULL pages and less ZS_ALMOST_{FULL, EMPTY} pages. Put a page with higher ->inuse count first within its ->fullness_list, which will give us better chances to fill up this page with new objects (find_get_zspage() return ->fullness_list head for new object allocation), so some zspages will become ZS_ALMOST_FULL/ZS_FULL quicker. It performs a trivial and cheap ->inuse compare which does not slow down zsmalloc and in the worst case keeps the list pages in no particular order. A more expensive solution could sort fullness_list by ->inuse count. [[email protected]: code adjustments] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

There is no reason to prevent select ZS_ALMOST_FULL as migration source if we cannot find source from ZS_ALMOST_EMPTY. With this patch, zs_can_compact will return more exact result. Signed-off-by: Minchan Kim <[email protected]> Acked-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

There is no need to recalcurate pages_per_zspage in runtime. Just use class->pages_per_zspage to avoid unnecessary runtime overhead. Signed-off-by: Minchan Kim <[email protected]> Acked-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

We can avoid taking class ->lock around zs_can_compact() in zs_shrinker_count(), because the number that we return back is outdated in general case, by design. We have different sources that are able to change class's state right after we return from zs_can_compact() -- ongoing I/O operations, manually triggered compaction, or two of them happening simultaneously. We re-do this calculations during compaction on a per class basis anyway. zs_unregister_shrinker() will not return until we have an active shrinker, so classes won't unexpectedly disappear while zs_shrinker_count() iterates them. Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

We can pass a NULL cache pointer to kmem_cache_destroy(), because it NULL-checks its argument now. Remove redundant test from destroy_handle_cache(). Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Make zram syslog error reporting more consistent. We have random error levels in some places. For example, critical errors like "Error allocating memory for compressed page" and "Unable to allocate temp memory" are reported as KERN_INFO messages. a) Reassign error levels Error messages that directly affect zram functionality -- pr_err(): Error allocating zram address table Error creating memory pool Decompression failed! err=%d, page=%u Unable to allocate temp memory Compression failed! err=%d Error allocating memory for compressed page: %u, size=%zu Cannot initialise %s compressing backend Error allocating disk queue for device %d Error allocating disk structure for device %d Error creating sysfs group for device %d Unable to register zram-control class Unable to get major number Messages that do not affect functionality, but user must be warned (because sysfs attrs will be removed in this particular case) -- pr_warn(): %d (%s) Attribute %s (and others) will be removed. %s Messages that do not affect functionality and mostly are informative -- pr_info(): Cannot change max compression streams Can't change algorithm for initialized device Cannot change disksize for initialized device Added device: %s Removed device: %s b) Update sysfs_create_group() error message First, it lacks a trailing new line; add it. Second, every error message in zram_add() has a "for device %d" part, which makes errors more informative. Add missing part to "Error creating sysfs group" message. Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

zswap_get_swap_cache_page and read_swap_cache_async have pretty much the same code with only significant difference in return value and usage of swap_readpage. I a helper __read_swap_cache_async() with the common code. Behavior change: now zswap_get_swap_cache_page will use radix_tree_maybe_preload instead radix_tree_preload. Looks like, this wasn't changed only by the reason of code duplication. Signed-off-by: Dmitry Safonov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Herrmann <[email protected]> Cc: Seth Jennings <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The structure zpool_ops is not modified so make the pointer to it a pointer to const. Signed-off-by: Krzysztof Kozlowski <[email protected]> Acked-by: Dan Streetman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

The structure zbud_ops is not modified so make the pointer to it a pointer to const. Signed-off-by: Krzysztof Kozlowski <[email protected]> Acked-by: Dan Streetman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Remove zpool_init() and zpool_exit(); they do nothing other than print "loaded" and "unloaded". Signed-off-by: Dan Streetman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

…/abelloni/linux Pull RTC updates from Alexandre Belloni: "Core: - use is_visible() to control sysfs attributes - switch wakealarm attribute to DEVICE_ATTR_RW - make rtc_does_wakealarm() return boolean - properly manage lifetime of dev and cdev in rtc device - remove unnecessary device_get() in rtc_device_unregister - fix double free in rtc_register_device() error path New drivers: - NXP LPC24xx - Xilinx Zynq MP - Dialog DA9062 Subsystem wide cleanups: - fix drivers that consider 0 as a valid IRQ in client->irq - Drop (un)likely before IS_ERR(_OR_NULL) - drop the remaining owner assignment for i2c_driver and platform_driver - module autoload fixes Drivers: - 88pm80x: add device tree support - abx80x: fix RTC write bit - ab8500: Add a sentinel to ab85xx_rtc_ids[] - armada38x: Align RTC set time procedure with the official errata - as3722: correct month value - at91sam9: cleanups - at91rm9200: get and use slow clock and cleanups - bq32k: remove redundant check - cmos: century support, proper fix for the spurious wakeup - ds1307: cleanups and wakeup irq support - ds1374: Remove unused variable - ds1685: Use module_platform_driver - ds3232: fix WARNING trace in resume function - gemini: fix ptr_ret.cocci warnings - mt6397: implement suspend/resume - omap: support internal and external clock enabling - opal: Enable alarms only when opal supports tpo - pcf2127: use OFS flag to detect unreliable date and warn the user - pl031: fix typo for author email - rx8025: huge cleanup and fixes - sa1100/pxa: share common code - s5m: fix to update ctrl register - s3c: fix clocks and wakeup, cleanup - sirfsoc: use regmap - nvram_read()/nvram_write() functions for cmos, ds1305, ds1307, ds1343, ds1511, ds1553, ds1742, m48t59, rp5c01, stk17ta8, tx4939 - use rtc_valid_tm() error code when reading date/time instead of 0 for isl12022, pcf2123, pcf2127" * tag 'rtc-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (90 commits) rtc: abx80x: fix RTC write bit rtc: ab8500: Add a sentinel to ab85xx_rtc_ids[] rtc: ds1374: Remove unused variable rtc: Fix module autoload for OF platform drivers rtc: Fix module autoload for rtc-{ab8500,max8997,s5m} drivers rtc: omap: Add external clock enabling support rtc: omap: Add internal clock enabling support ARM: dts: AM437x: Add the internal and external clock nodes for rtc rtc: s5m: fix to update ctrl register rtc: add xilinx zynqmp rtc driver devicetree: bindings: rtc: add bindings for xilinx zynqmp rtc rtc: as3722: correct month value ARM: config: Switch PXA27x platforms to use PXA RTC driver ARM: mmp: remove unused RTC register definitions ARM: sa1100: remove unused RTC register definitions rtc: sa1100/pxa: convert to run-time register mapping ARM: pxa: add memory resource to SA1100 RTC device rtc: pxa: convert to use shared sa1100 functions rtc: sa1100: prepare to share sa1100_rtc_ops rtc: ds3232: fix WARNING trace in resume function ...

…el/git/wsa/linux Pull i2c updates from Wolfram Sang: "Features: - new drivers: Renesas EMEV2, register based MUX, NXP LPC2xxx - core: scans DT and assigns wakeup interrupts. no driver changes needed. - core: some refcouting issues fixed and better API for that - core: new helper function for best effort block read emulation - slave framework: proper DT bindings and userspace instantiation - some bigger work for xiic, pxa, omap drivers .. and quite a number of smaller driver fixes, cleanups, improvements" * 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (65 commits) i2c: mux: reg Change ioread endianness for readback i2c: mux: reg: fix compilation warnings i2c: mux: reg: simplify register size checking i2c: muxes: fix leaked i2c adapter device node references i2c: allow specifying separate wakeup interrupt in device tree of/irq: export of_get_irq_byname() i2c: xgene-slimpro: dma_mapping_error() doesn't return an error code i2c: Replace I2C_CROS_EC_TUNNEL dependency eeprom: at24: use i2c_smbus_read_i2c_block_data_or_emulated i2c: core: Add support for best effort block read emulation i2c: lpc2k: add driver i2c: mux: Add register-based mux i2c-mux-reg i2c: dt: describe generic bindings i2c: slave: print warning if slave flag not set i2c: support 10 bit and slave addresses in sysfs 'new_device' i2c: take address space into account when checking for used addresses i2c: apply DT flags when probing i2c: make address check indpendent from client struct i2c: rename address check functions i2c: apply address offset for slaves, too ...

…ers/dvhart/linux-platform-drivers-x86 Pull x86 platform driver updates from Darren Hart: "Significant work on toshiba_acpi, including new hardware support, refactoring, and cleanups. Extend device support for asus, ideapad, and acer systems. New surface pro 3 buttons driver. Misc minor cleanups for thinkpad and hp-wireless. acer-wmi: - No rfkill on HP Omen 15 wifi thinkpad_acpi: - Remove side effects from vdbg_printk -> no_printk macro surface pro 3: - Add support driver for Surface Pro 3 buttons hp-wireless: - remove unneeded goto/label in hpwl_init ideapad-laptop: - add alternative representation for Yoga 2 to DMI table - Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list asus-laptop: - Add key found on Asus F3M MAINTAINERS: - Remove Toshiba Linux mailing list address toshiba_acpi: - Bump driver version to 0.23 - Remove unnecessary checks and returns in HCI/SCI functions - Refactor *{get, set} functions return value - Remove "*not supported" feature prints - Change *available functions return type - Add set_fan_status function - Change some variables to avoid warnings from ninja-check - Reorder toshiba_acpi_alt_keymap entries - Remove unused wireless defines - Transflective backlight updates - Avoid registering input device on WMI event laptops - Add /dev/toshiba_acpi device - Adapt /proc/acpi/toshiba/keys to TOS1900 devices" * tag 'platform-drivers-x86-v4.3-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: (21 commits) acer-wmi: No rfkill on HP Omen 15 wifi thinkpad_acpi: Remove side effects from vdbg_printk -> no_printk macro surface pro 3: Add support driver for Surface Pro 3 buttons hp-wireless: remove unneeded goto/label in hpwl_init ideapad-laptop: add alternative representation for Yoga 2 to DMI table asus-laptop: Add key found on Asus F3M MAINTAINERS: Remove Toshiba Linux mailing list address ideapad-laptop: Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list toshiba_acpi: Bump driver version to 0.23 toshiba_acpi: Remove unnecessary checks and returns in HCI/SCI functions toshiba_acpi: Refactor *{get, set} functions return value toshiba_acpi: Remove "*not supported" feature prints toshiba_acpi: Change *available functions return type toshiba_acpi: Add set_fan_status function toshiba_acpi: Change some variables to avoid warnings from ninja-check toshiba_acpi: Reorder toshiba_acpi_alt_keymap entries toshiba_acpi: Remove unused wireless defines toshiba_acpi: Transflective backlight updates toshiba_acpi: Avoid registering input device on WMI event laptops toshiba_acpi: Add /dev/toshiba_acpi device ...

Pull MMC updates from Ulf Hansson: "MMC core: - Fix a race condition in the request handling - Skip trim commands for some buggy kingston eMMCs - An optimization and a correction for erase groups - Set CMD23 quirk for some Sandisk cards MMC host: - sdhci: Give GPIO CD higher precedence and don't poll when it's used - sdhci: Fix DMA memory leakage - sdhci: Some updates for clock management - sdhci-of-at91: introduce driver for the Atmel SDMMC - sdhci-of-arasan: Add support for sdhci-5.1 - sdhci-esdhc-imx: Add support for imx7d which also supports HS400 - sdhci: A collection of fixes and improvements for various sdhci hosts - omap_hsmmc: Modernization of the regulator code - dw_mmc: A couple of fixes for DMA and PIO mode - usdhi6rol0: A few fixes and support probe deferral for regulators - pxamci: Convert to use dmaengine - sh_mmcif: Fix the suspend process in a short term solution - tmio: Adjust timeout for commands - sunxi: Fix timeout while gating/ungating clock" * tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc: (67 commits) mmc: android-goldfish: remove incorrect __iomem annotation mmc: core: fix race condition in mmc_wait_data_done mmc: host: omap_hsmmc: remove CONFIG_REGULATOR check mmc: host: omap_hsmmc: use ios->vdd for setting vmmc voltage mmc: host: omap_hsmmc: use regulator_is_enabled to find pbias status mmc: host: omap_hsmmc: enable/disable vmmc_aux regulator based on previous state mmc: host: omap_hsmmc: don't use ->set_power to set initial regulator state mmc: host: omap_hsmmc: avoid pbias regulator enable on power off mmc: host: omap_hsmmc: add separate function to set pbias mmc: host: omap_hsmmc: add separate functions for enable/disable supply mmc: host: omap_hsmmc: return error if any of the regulator APIs fail mmc: host: omap_hsmmc: remove unnecessary pbias set_voltage mmc: host: omap_hsmmc: use mmc_host's vmmc and vqmmc mmc: host: omap_hsmmc: use the ocrmask provided by the vmmc regulator mmc: host: omap_hsmmc: cleanup omap_hsmmc_reg_get() mmc: host: omap_hsmmc: return on fatal errors from omap_hsmmc_reg_get mmc: host: omap_hsmmc: use devm_regulator_get_optional() for vmmc mmc: sdhci-of-at91: fix platform_no_drv_owner.cocci warnings mmc: sh_mmcif: Fix suspend process mmc: usdhi6rol0: fix error return code ...

…t/tomba/linux Pull fbdev updates from Tomi Valkeinen: "Minor fixes and cleanups" * tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: video: fbdev: atmel_lcdfb: remove useless include video: fbdev: pxa168fb: Use devm_clk_get fbdev: ssd1307fb: fix error return code fbdev: fix snprintf() limit in show_bl_curve() video: fbdev: s3c-fb: Constify platform_device_id video: fbdev: atmel: fix warning for const return value video: fbdev: Drop owner assignment from platform_driver video: fbdev: Drop owner assignment from i2c_driver fbdev: remove unnecessary memset in vfb framebuffer: disable vgacon on microblaze arch fbdev: udlfb: remove unneeded initialization in few places fbdev: Allow compile test of GPIO consumers if !GPIOLIB fbdev: fix cea_modes array size

…git/broonie/regmap Pull regmap updates from Mark Brown: "This has been a busy release for regmap. By far the biggest set of changes here are those from Markus Pargmann which implement support for block transfers in smbus devices. This required quite a bit of refactoring but leaves us better able to handle odd restrictions that controllers may have and with better performance on smbus. Other new features include: - Fix interactions with lockdep for nested regmaps (eg, when a device using regmap is connected to a bus where the bus controller has a separate regmap). Lockdep's default class identification is too crude to work without help. - Support for must write bitfield operations, useful for operations which require writing a bit to trigger them from Kuniori Morimoto. - Support for delaying during register patch application from Nariman Poushin. - Support for overriding cache state via the debugfs implementation from Richard Fitzgerald" * tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (25 commits) regmap: fix a NULL pointer dereference in __regmap_init regmap: Support bulk reads for devices without raw formatting regmap-i2c: Add smbus i2c block support regmap: Add raw_write/read checks for max_raw_write/read sizes regmap: regmap max_raw_read/write getter functions regmap: Introduce max_raw_read/write for regmap_bulk_read/write regmap: Add missing comments about struct regmap_bus regmap: No multi_write support if bus->write does not exist regmap: Split use_single_rw internally into use_single_read/write regmap: Fix regmap_bulk_write for bus writes regmap: regmap_raw_read return error on !bus->read regulator: core: Print at debug level on debugfs creation failure regmap: Fix regmap_can_raw_write check regmap: fix typos in regmap.c regmap: Fix integertypes for register address and value regmap: Move documentation to regmap.h regmap: Use different lockdep class for each regmap init call thermal: sti: Add parentheses around bridge->ops->regmap_init call mfd: vexpress: Add parentheses around bridge->ops->regmap_init call regmap: debugfs: Fix misuse of IS_ENABLED ...

…kernel/git/joro/iommu Pull iommu updates for from Joerg Roedel: "This time the IOMMU updates are mostly cleanups or fixes. No big new features or drivers this time. In particular the changes include: - Bigger cleanup of the Domain<->IOMMU data structures and the code that manages them in the Intel VT-d driver. This makes the code easier to understand and maintain, and also easier to keep the data structures in sync. It is also a preparation step to make use of default domains from the IOMMU core in the Intel VT-d driver. - Fixes for a couple of DMA-API misuses in ARM IOMMU drivers, namely in the ARM and Tegra SMMU drivers. - Fix for a potential buffer overflow in the OMAP iommu driver's debug code - A couple of smaller fixes and cleanups in various drivers - One small new feature: Report domain-id usage in the Intel VT-d driver to easier detect bugs where these are leaked" * tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (83 commits) iommu/vt-d: Really use upper context table when necessary x86/vt-d: Fix documentation of DRHD iommu/fsl: Really fix init section(s) content iommu/io-pgtable-arm: Unmap and free table when overwriting with block iommu/io-pgtable-arm: Move init-fn declarations to io-pgtable.h iommu/msm: Use BUG_ON instead of if () BUG() iommu/vt-d: Access iomem correctly iommu/vt-d: Make two functions static iommu/vt-d: Use BUG_ON instead of if () BUG() iommu/vt-d: Return false instead of 0 in irq_remapping_cap() iommu/amd: Use BUG_ON instead of if () BUG() iommu/amd: Make a symbol static iommu/amd: Simplify allocation in irq_remapping_alloc() iommu/tegra-smmu: Parameterize number of TLB lines iommu/tegra-smmu: Factor out tegra_smmu_set_pde() iommu/tegra-smmu: Extract tegra_smmu_pte_get_use() iommu/tegra-smmu: Use __GFP_ZERO to allocate zeroed pages iommu/tegra-smmu: Remove PageReserved manipulation iommu/tegra-smmu: Convert to use DMA API iommu/tegra-smmu: smmu_flush_ptc() wants device addresses ...

…inux/kernel/git/shuah/linux-kselftest Pull kselftest update from Shuah Khan: "This update adds new zram test and fixes to problems found during testing this new zram test. In addition, there are a few bug fixes and ksefltest improvement patches from Linaro developers. I will send another update later on this week to fix kselftest breakage due to commit 2bf9e0a ("locking/static_keys: Provide a selftest") after the fix soaks in next for a couple of days" * tag 'linux-kselftest-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests/zram: Makefile fix selftests/zram: must be run as root selftests: breakpoints: fix installing error on the architecture except x86 selftests: check before install selftests/zram: Adding zram tests

…nel/git/deller/parisc-linux Pull parisc updates from Helge Deller: "The most important changes in this patchset are: - re-enable 64bit PCI bus addresses which were temporarily disabled for PA-RISC in kernel 4.2 - fix the 64bit CAS operation in the LWS path which now enables us to enable the 64bit gcc atomic builtins even on 32bit userspace with 64bit kernel - fix a long-standing bug which sometimes crashed kernel at bootup while serial interrupt wasn't registered yet" * 'parisc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Use platform_device_register_simple("rtc-generic") parisc: Drop CONFIG_SMP around update_cr16_clocksource() parisc: Use double word condition in 64bit CAS operation parisc: Filter out spurious interrupts in PA-RISC irq handler parisc: Additionally check for in_atomic() in page fault handler PCI,parisc: Enable 64-bit bus addresses on PA-RISC parisc: Define ioremap_uc and ioremap_wc

Merge second patch-bomb from Andrew Morton: "Almost all of the rest of MM. There was an unusually large amount of MM material this time" * emailed patches from Andrew Morton <[email protected]>: (141 commits) zpool: remove no-op module init/exit mm: zbud: constify the zbud_ops mm: zpool: constify the zpool_ops mm: swap: zswap: maybe_preload & refactoring zram: unify error reporting zsmalloc: remove null check from destroy_handle_cache() zsmalloc: do not take class lock in zs_shrinker_count() zsmalloc: use class->pages_per_zspage zsmalloc: consider ZS_ALMOST_FULL as migrate source zsmalloc: partial page ordering within a fullness_list zsmalloc: use shrinker to trigger auto-compaction zsmalloc: account the number of compacted pages zsmalloc/zram: introduce zs_pool_stats api zsmalloc: cosmetic compaction code adjustments zsmalloc: introduce zs_can_compact() function zsmalloc: always keep per-class stats zsmalloc: drop unused variable `nr_to_migrate' mm/memblock.c: fix comment in __next_mem_range() mm/page_alloc.c: fix type information of memoryless node memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node() ...

Pull IPMI updates from Corey Minyard: "Most of these have been sitting in linux-next for more than a release, particularly commit 0fbcf4a ("ipmi: Convert the IPMI SI ACPI handling to a platform device") which is probably the most complex patch. That is also the one that changes drivers/acpi/acpi_pnp.c. The change in that file is only removing IPMI from a "special platform devices" list, since I convert it to the standard PNP interface. I posted this one to the ACPI list twice and got no response, and it seems to work well in my testing, so I'm hoping it's good. Hidehiro Kawai posted a set of changes that improves the panic time handling in the IPMI driver. The rest of the changes are minor bug fixes or cleanups and some documentation" * tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi: ipmi:ssif: Add a module parm to specify that SMBus alerts don't work ipmi: add of_device_id in MODULE_DEVICE_TABLE ipmi: Compensate for BMCs that wont set the irq enable bit ipmi: Don't call receive handler in the panic context ipmi: Avoid touching possible corrupted lists in the panic context ipmi: Don't flush messages in sender() in run-to-completion mode ipmi: Factor out message flushing procedure ipmi: Remove unneeded set_run_to_completion call ipmi: Make some data const that was only read ipmi: constify SSIF ACPI device ids ipmi: Delete an unnecessary check before the function call "cleanup_one_si" char:ipmi - Change 1 to true for bool type variables during initialization. impi:Remove unneeded setting of module owner to THIS_MODULE in the platform structure, powernv_ipmi_driver ipmi: Add a comment in how messages are delivered from the lower layer ipmi/powernv: Fix potential invalid pointer dereference ipmi: Convert the IPMI SI ACPI handling to a platform device ipmi: Add device tree bindings information

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync up with Linus #103

Sync up with Linus #103

Commits on Sep 5, 2015

Commits on Sep 8, 2015

Commits on Sep 9, 2015