Security consideration: Multi-threading helps cache-based side channel attacks #1
Comments
Here's a CodePen that provides a basic high-resolution timer polyfill based on SharedArrayBuffers. Its resolution is on the order of 0.3ns. Tested on Firefox Nightly (built from https://hg.mozilla.org/mozilla-central/rev/72835344333f ).
@mseaborn, this is the best place to discuss the spec. Thank you for highlighting this issue, it needs to be brought into the spec somehow. @Yossioren, noted :)
At the ECMA TC39 meeting in Portland in September 2015, the committee decided that resolving the present issue (in the sense of working out its consequences and figuring out mitigations, if the issue is deemed to be a real concern) is a blocker for acceptance to Stage 2.
Thanks for letting me know, Lars. Am I allowed to discuss this publicly?

Sent from my potato.
@Yossioren, it should be OK to discuss this publicly.
"For example, key press handling code produces a detectable signature of accesses to L3 cache sets. An attacker can use timings to deduce when the user is pressing keys."

This sounds like a very concrete claim, and the paper was written without depending on the SAB proposal, so the claimed vulnerability apparently already existed independent of SAB. Is there a JavaScript demo out somewhere that would demonstrate this attack (or any other kind of attack based on the paper)?

The case here seems to be that SAB makes the claimed, already existing vulnerability statistically worse, since better timer precision might improve the signal-to-noise ratio of such a statistical attack. But vulnerabilities are essentially binary: they either exist or they don't. Since the discussion is not about SAB opening up a new vulnerability, but about it possibly improving the statistical success rate of an existing claimed attack, the discussion should entail how many orders of magnitude we are talking about here.

Chrome reduced their high-resolution timer resolution to 5 microseconds (see http://src.chromium.org/viewvc/blink/trunk/Source/core/timing/PerformanceBase.cpp?r1=198348&r2=198347&pathrev=198348 ). Why was 5 microseconds chosen? How much did that help against the success rate of a claimed attack?
Hi Juj,

There's a proof-of-concept for the attack that works pretty well on Firefox […] As you said, the attack isn't due to something that's wrong with SAB. […] The latest versions of the big browsers don't have any timer that goes […] Coming back to the top of my comments - the attack isn't due to something […]

Best regards,
When the timer's precision is limited to 5us resolution in the manner shown above, couldn't the attacker recover the higher precision by observing the precise "edge" of the clock (when it jumps forward 5us)?
Copying over a comment from Waldemar Horwat on es-discuss. The paper he references has not been mentioned earlier on this thread.
lukewagner wrote:
I suppose you mean something like the following? We can time operations by counting the number of loop operations between jumps in the clock's value ("ticks").
Yossef, has this approach been considered in the literature on cache attacks? It seems semi-obvious, but it didn't seem to come up when the lower-clock-resolution mitigations were added to browsers. A mitigation for this is presumably to add some jitter to the clock. How effective would that be?
@mseaborn Exactly. Adding jitter seems potentially quite frustrating to users.
Hi guys,

That's a nice idea Mark, which I haven't seen discussed. It lets you set […]

As for countermeasures, as long as the clock stays monotonically increasing […]
For the record (courtesy Brendan), an older timing attack via SVG and requestAnimationFrame:
OK Mark, I implemented your idea in JS, here's the codepen (just the output: http://s.codepen.io/yossioren/debug/XmpZMZ? ).

The output of this page is a histogram -- the X axis is ticks, and the Y axis is how often iters_per_tick is equal to this specific value. On my test machine you can fit only around 10 busy-wait ticks into one 5us system tick, and the jitter is terrible.
This type of infoleak has existed for a while in Chrome using PNaCl's atomics, is available in all browsers through Flash's shareable byte array, and is possible in JavaScript on x86 using denormals. These are just a few examples: it seems like very high precision timing information leaks are an inevitability, and SABs add a redundant attack surface. It would be worth exploring the following in separate issues:
One of the concerns about allowing high-res timers via shared memory is that browsers wouldn't be able to disable this functionality if the resulting info leaks turn out to be more serious than we'd originally thought. Once web apps start using shared memory, disabling it would break the web. Browsers can easily change the timing resolution of performance.now() […]

Here's a possible answer to that concern: A simple mitigation is to pin all the relevant threads to the same CPU (i.e. all threads that can concurrently access a SharedArrayBuffer). This should prevent the SAB from being used to construct a high-res timer, because there would be no fine-grained interleaving of the threads' instructions. This has the benefit that it's very simple to implement, e.g. using […]

Obviously this takes away the performance benefit of using multiple cores/hyperthreads. But it does not have a performance impact beyond that: it does not slow down individual threads. It does not require interposing on memory accesses or inserting any delays/synchronisations (unlike Dthreads, for example). It does not require doing a CPS-style transform (as Emscripten's Emterpreter-Async functionality does today), which would increase power usage.

Browser vendors or users who are particularly privacy conscious might choose to enable this pinning scheme. Otherwise, this pinning scheme gives us an "escape hatch" -- browsers could enable it if timing side channel attacks start to become more widely used.
@mseaborn, thanks for the suggestion. While I think removing the performance benefit of multiple cores is going to sink the feature - without that benefit, the feature probably does not pay for itself in practice - it is helpful to remember that we have this escape hatch if the timing attack becomes truly problematic and shared memory is the only remaining attack vector. @jfbastien, I'll open a separate issue for the API surface, it's a big topic in itself.
To @jfbastien's point about this not being a new attack surface: there is actually a way to share an ArrayBuffer between two workers already, and that's through WebAudio. A website can open an AudioWorker which gets an ArrayBuffer sent to it. The ArrayBuffer is not actually transferred back and forth; instead, both JS contexts run at the same time with access to it. I'm not sure if it's possible to get the ArrayBuffer to be visible to more than two threads. The only communication mechanisms available in this context are postMessage and reads/writes to the ArrayBuffer; no atomic or futex instructions are available. AudioWorkers are new, but they just replace ScriptProcessorNode, which I believe has the same issue. If two threads would be enough to construct this high-precision timer, then SharedArrayBuffer doesn't worsen the attack surface vs what's already there in the web platform.
Thanks for pointing out this API @littledan. I'll take a look at it and see […]

http://codepen.io/yossioren/pen/LVBNMR
A couple of links pertaining to related discussions in the Tor community:
@Yossioren, in reference to your earlier experiment with the idea from @lukewagner and @mseaborn, you write: "The output of this page is a histogram -- the X axis is ticks, and the Y axis is how often iters_per_tick is equal to this specific value. On my test machine you can fit only around 10 busy-wait ticks into one 5us system tick, and the jitter is terrible."

I think it is possible to do better here. Specifically, I think it should be possible to use a loop that does not have to call performance.now(), thus yielding a loop count with better precision. The way this would work is that we would first search for iteration counts that trigger changes to the value read from performance.now() after performing a known-fast or known-slow operation. To use this, we would then perform the operation to be measured, iterate without reading the clock a set number of times, and then read performance.now() to determine if it has changed. We can then conclude on that basis whether the operation was the fast operation (the reading should not have changed) or the slow operation (it should have changed).

I have a proof of concept of this in the repository https://github.com/lars-t-hansen/ticktock. The program in fib.html demonstrates the likely running time of doubly-recursive fibonacci(10) on the system under test; I see times of 500ns on an i7 MacBook Pro ("late 2013") and 1000ns on an older AMD FX4100 system in current versions of Firefox Developer Edition, which appears to use a 5us resolution for performance.now(). The program in granularity.html implements the algorithm above and is able to distinguish between fib(1) and fib(10) with what appears to me to be high reliability. The code is a little elaborate because it attempts to warm up the JIT properly and to avoid environmental effects such as loop warmup. But I do think it demonstrates at least the plausibility of the approach.

On my systems there are a few wrong guesses, usually less than 5%; tweaking the cutoff has helped here, but the tweaking carries over to the other system. If, as you wrote somewhere (probably in your paper), you can amplify the LLC miss cost up to about 1us, then this type of clock may be used to implement the attack in your paper.
This is the Tor project thread that tracks the same issue as the present thread: |
Two new papers on the topic:
ARMageddon: Last-Level Cache Attacks on Mobile Devices
@littledan, re AudioWorker, I talked to an engineer here who worked on that and he told me that that hole has been closed because it could be used to crash the browser. The scenario he outlined was this: you have your main thread, your AudioWorker, and another Web Worker. You share the buffer between the main thread and the AudioWorker. Then you neuter the buffer by transferring it to the Web Worker, leaving the AudioWorker pointing to garbage. My understanding is that the AudioWorker spec has been updated; looking at http://www.w3.org/TR/webaudio/ and searching for "acquire the content operation" [sic] seems to back this up. (Even without that fix I was told that the shared buffer could not be used for this attack, given the limited nature of the computation performed in the AudioWorker.)
Lars, that's fascinating. Do you think your colleague would be able to […]

Kol tuv,
I was also under the impression that AudioWorker use cases were why futex blocking on the main thread was deemed necessary. To make audio workers (more) useful, critical sections may be required for access shared between the main thread and the worker, e.g. for something like streaming PCM audio.
A writeup summarizing both what's in the discussion above and what's happened in discussions elsewhere (subject to updating but fairly stable):

Edit 2016/3/1: corrected the link.
Just recording a finding. Earlier @mseaborn made this suggestion: "A simple mitigation is to pin all the relevant threads to the same CPU. (i.e. All threads that can concurrently access a Shared Array Buffer.) This should prevent the SAB from being used to construct a high-res timer, because there would be no fine-grained interleaving of the threads' instructions."

This seems like a fine mitigation to me, but it's hampered (as far as the standardization work is concerned) by there not being a reliable thread pinning API on all major platforms. Notably, Mac OS X has only an advisory API and does not provide pthread_setaffinity_np(). Presumably other platforms could be affected too (unknown).
This is considered largely resolved (January 2016 TC39 meeting). Waldemar is still concerned - considers this a blocker - but much of the rest of the committee seems convinced by the argument that (a) this is not a new capability on the web (Flash, Java, PNaCl, native extensions), (b) wasm will open the problem regardless, and (c) this type of info-leak should be closed by addressing the info-leak, not closing down the timer. Additionally, Google's security team does not believe the bug is exploitable in any significant way (I'm paraphrasing a statement read at the meeting, please treat as such).
Argh, did not mean to close this, but to move to Stage 3. |
Lars, this doesn't seem to go anywhere
@ekr, I will fix the link. Thanks.
@lars-t-hansen navigator.threadSecurity = true/false

Edit: This would be advisory and platform dependent. It would only take one context with the value true to enable the request, with the "false" valued ones being ignored.
In principle I think so, if the platform cooperates.
No, see below.
The API is not addressing the problem. The problem is that any process on the system at all, including the browser itself, is potentially vulnerable to cache sniffing from a loaded web page. The attacker would use two workers, one to carry out the attack and one to provide a clock signal. The attacker has no incentive to disable true parallelism, it needs it. The victim might not be a web page at all, it could be the SSL implementation in the browser or it could be outside the browser, and it can't just disable parallelism for the web pages either.
Wouldn't that be an OS API issue then, and not the responsibility of the VM? As long as the VM does its "part" on security I don't see an issue. If the OS vendors lack the ability to mitigate this then relevant bugs should be filed in their bug reporting systems. I'm against capping parallelism.

When I say "part" I mean having the advisory API be a NOP unless the OS provides actual mitigations.
@taisel, The argument is that if you have an installed hardware base (measured in the hundreds of millions of units) that is vulnerable to cache sniffing attacks, and you are the provider of a software platform that allows essentially arbitrary code to be run without user intervention and hence an attack to be run everywhere (through an ad, say), then you can't just wash your hands by saying it's the hardware's fault (or the operating system's, though in this case I don't think it is).
@lukewagner dug up a new paper on a possible mitigation based on Intel's cache allocation technology (CAT): https://ssrg.nicta.com.au/publications/nictaabstracts/8984.pdf:

"This paper shows how such LLC side channel attacks can be defeated using a performance optimization feature recently introduced in commodity processors. Since most cloud servers use Intel processors, we show how the Intel Cache Allocation Technology (CAT) can be used to provide a system-level protection mechanism to defend from side channel attacks on the shared LLC. CAT is a way-based hardware cache-partitioning mechanism for enforcing quality-of-service with respect to LLC occupancy. However, it cannot be directly used to defeat cache side channel attacks due to the very limited number of partitions it provides. We present CATalyst, a pseudo-locking mechanism which uses CAT to partition the LLC into a hybrid hardware-software managed cache. We implement a proof-of-concept system using Xen and Linux running on a server with Intel processors, and show that LLC side channel attacks can be defeated. Furthermore, CATalyst only causes very small […]"

Not a panacea: CAT is Xeon-only and this needs OS support.
Since this is really a fact of life and browsers have started to ship this (https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/dnzvgTswfbc/AFIUge2oDQAJ), I will close the bug.
@lars-t-hansen Interesting note: Linus Torvalds agrees with you. IIRC, Intel recently tried to roll in some kind of mitigation as a feature - https://www.theregister.co.uk/2018/01/22/intel_spectre_fix_linux/

Seems this exact conversation is happening years later. I have no words now...
AMD Zen 2!
Shared memory allows the construction of a high-resolution timer (nanosecond resolution), which enables cache-based side channel attacks, such as those described in the paper "The Spy in the Sandbox -- Practical Cache Attacks in Javascript" (http://arxiv.org/abs/1502.07373).
For example, key press handling code produces a detectable signature of accesses to L3 cache sets. An attacker can use timings to deduce when the user is pressing keys.
Should the SharedArrayBuffer spec discuss this as a security consideration? This might help us come up with mitigations.
For comparison, browsers have tried to mitigate cache side channels by reducing the resolution of performance.now(). (e.g. See https://crbug.com/506723 for Chrome.) With multi-threading, though, it's easy to build your own high-res timer by creating a thread that increments a memory location in a tight loop.
I'm not terribly optimistic about mitigating this. Finding a solution is a research problem and may well be infeasible. I doubt this should block Shared Array Buffers, but if we're adding multi-threading to the Web platform we should at least be aware of what we're letting ourselves in for.
See also the Chromium issue tracking this: https://crbug.com/508166
[Aside: is there a preferred forum for discussing the SharedArrayBuffer spec? Brad Nelson and Ben Smith -- who are working on Chromium's support for SharedArrayBuffers -- told me there isn't one yet and recommended I file an issue in this issue tracker.]