-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Trial PR] Add ThreadSanitizer support #22
Conversation
This is a trial PR in preparation for a PR upstream @jmid @kayceesrk @stedolan @anmolsahoo25 |
Thanks for the PR. I'll ask a few initial questions
support |
Perhaps @sadiqj may also want to take a look as he has a good sense of runtime and instrumentation. |
There should be no added cost without
Yes, from OCaml code, we instrument only the accesses to mutable locations. On the other hand, the C runtime is also instrumented, and C instrumentation has no mutability information, so the instrumentation there is likely to be more costly.
The memory usage is better than C/C++/Go: ours seems to be 2x-7x, whereas https://www.youtube.com/watch?v=5erqWdlhQLA reports 5x-10x. But the slowdown can indeed be seen as disappointing. I have three hypotheses:
Yes: the changes to the test suite only totalize 1583 insertions and 35 deletions on 45 files. So they constitute the bulk of the diff. The rest is split evenly between Finally, in the runtime, a portion of the diff (~100 lines) is due to this comment repeated several times: +CAMLno_tsan /* FIXME Race reports from this function clog the tests of user
+ programs, so we disable instrumentation here. However, Further
+ investigation would be needed about the cause of these race
+ reports.
+ */
static int try_update_object_header(value v, volatile value *p, value result,
mlsize_t infix_offset) { (As a side note, we had to silence many TSan reports internal to the runtime, which we intend to report separately.) |
regarding this, a promising track of improvement is to un-instrument all C functions that do not perform any synchronization operations critical for TSan. |
Investigating the CI failures now. |
Thanks. It would be useful to incorporate some of the answers to my questions into the PR message. |
e034bdb
to
f27628d
Compare
Thank you for taking the time to look at this @kayceesrk. I updated the PR message with the new elements. |
The CI failures were due to me forgetting to push an updated |
Thanks for the trial PR! I think the PR text is well-written and gives a good intro with relevant background info.
|
The other thing that you should include is how TSAN instrumentation is used to detect data race under the Multicore OCaml primitives i.e, translation of OCaml memory model to C memory model. This came up in the discussion at https://discuss.ocaml.org/t/ann-threadsanitizer-support-for-ocaml-5-0-0-first-public-release/10975. It would be useful to add a section on how the translation works. Given the last open question from Guillaume, do you think we're at a stage where we are sure that the technical work has been completed and there are no open questions? If there are open questions, mention them explicitly in the PR message. |
@jmid Thank you for your comments, I will follow the suggestions.
Guillaume has pinpointed something that I overlooked. I completed the instrumentation of memory accesses when I was not aware of the subtleties of translating the OCaml memory model to C, and I did not go back to question this instrumentation later. As a result, the current mapping from OCaml operations to C11 operations, that is used to inform TSan, does not quite reflect the correct mapping (ocaml/ocaml#10995):
In bold are the operations for which the mapping from OCaml to C11 memory operations is not faithful to the one described in ocaml/ocaml#10995. In addition, Guillaume’s question is relevant and we did not lay it out clearly before: are we mapping OCaml memory operations to well-defined C programs (but at the risk that TSan does not report OCaml data races), or mapping OCaml memory operations to possibly buggy C programs (when there is an OCaml data race)? I do not think any of this is a huge blocker, but the technical question should indeed be answered, and the implementation possibly corrected. I am leaning in favor of mapping OCaml data races to buggy C programs, but I need to verify that race-free programs will then always be mapped to race-free C programs. |
I edited to fix some wrong lines in the table. |
Isn't this what we want? So that the race detector can tell us when things go wrong? Does that lead to more false positives? More generally, can you comment on the soundness of TSAN for C++ programs and our encoding of TSAN for OCaml programs? A sound race detector has no false positives. If the underlying system is unsound, then there is no point in going for soundness with our encoding. Of course, reducing false positives is a worthwhile effort. |
Yes, I think that mapping racy OCaml operations to buggy C programs is what we want.
TSan is claimed to have no false positives (there are no papers describing TSan’s current algorithm, but see this slide: https://youtu.be/5erqWdlhQLA?t=1975), except sometimes when “there is tiny probability to miss a data race” (https://github.com/google/sanitizers/wiki/ThreadSanitizerAlgorithm#state-machine).
I think it might, or at least I cannot find a good mapping right now. We could just take the correct mapping (i.e., not the one above) and turn the operation for non-atomic loads into a plain write; that would ensure that a conflicting read & write to the same non-atomic location are turned into a C11 data race, and I think we don’t lose any happens-before edges that way; but then two conflicting writes do not translate to a data race.
If we signal non-atomic stores as just plain writes, then I think we lose some happens-before edges, which can lead to false positives. It would be useful to have the input of someone more expert than me in memory models on this question. |
I think I can chime in here. What is ThreadSanitizer doing?The objective of the ThreadSanitizer instrumentation of OCaml programs is to detect a race. A race is defined if there exist two memory events to a non-atomic location, with at least one being a write. A race exists if there is no happens-before (hb) edge between these two events. The ThreadSanitizer runtime mechanism builds up the hb-relation dynamically and signals if two events satisfying the above condition exist. Why embed an OCaml program in a C program?We use the C11 primitives provided by the ThreadSanitizer library to signal the hb-edges between OCaml memory events. For this, we use the existing hb-edges defined between RC11 events. For example, a But, since the OCaml memory model has additional hb-edges than RC11, we have to add more synchronization. Mapping OCaml memory events to C11 primitivesThe correct encoding of OCaml memory events to RC11 memory events is as follows. According to the OCaml Memory Model paper, there are 3 conditions for hb -
The current mapping is incorrect for the below reasons -
@kayceesrk |
Thank you for helping with this.
I am surprised by this. Does you mean that @stedolan’s mapping at ocaml/ocaml#10995 is wrong? |
Sorry I edited the previous mapping. I hope that does not seem confusing. |
No, no. I'm wrong there sorry. Edit: I am just trying to remember where C11 is broken when you mix |
There is a bug in the PLDI 2018 paper.
For non-atomic stores, it would be useful to sub-classify them as I have done in ocaml/ocaml#11418 (comment)
Please write down the compilation strategy for each of these separately. It will be much clearer this way.
Atomic accesses should use Overall, I think we need to have a correctness argument for compiling OCaml loads to plain loads. Here's my attempt at the soundness theorem
What do you think? Does the current implementation have this property? |
Thank you. Slight digression, but it would be very useful to have a document that reflects the current state of the knowledge about the memory model. At least for unexperienced readers like me, it’s difficult to find all the information and keep it in mind. As you mention, the PLDI18 paper requires some errata, and there things it doesn’t cover, like the mapping with the C11 model, or other architectures, that are only covered in Github discussions. I also have an increasing number of bookmarks of Github discussions that explain a particular point (like the publication safety of initializing writes at ocaml/ocaml#10995 (comment)) or explain the need to add a fence (ocaml/ocaml#10972 (comment)) or discuss potential issues (ocaml/ocaml#10992).
I think the current implementation probably doesn’t have it because e.g. OCaml atomic loads are signaled to TSan as acquired loads. I’m hoping that it can be easily fixed to have it by fixing the mapping, though. |
Memory model is likely to remain a complex topic given the nature of it. It may be useful to write an extended version of the PLDI 18 paper. The right time to write the paper would be when we have a few more native backends (s390x and power) and we’ve had lockfree data structures used for real. It is possible that the compilation strategy may evolve or new primitive operations may added. |
To summarize, here is the current candidate mapping:
|
I have some more questions. The optimized mapping in ocaml/ocaml#10995 assumes that you can transform
into:
Does TSAN now understand There is also |
In the rest of the discussion, I'm assuming that In OCaml, initialising writes are plenty. But these writes are not visible to other threads unless the newly allocated object is published. So we can optimise initialising writes by pushing the fence after that write to where the publication of the new object can happen. With that here is what I think the mappings will be: Atomic load
See Atomic store
See Non-atomic load
Non-atomic store (initialising)
Non-atomic store (assignment, integer)
Non-atomic store (assignment, pointer)
See Non-atomic store (non-word-sized field)No guarantees are provided for these accesses by the OCaml memory model. Hence, compiling them to
should be ok as happens-before will have to be established for them by other means (mutex, atomic variables, etc). What do you think about the proposal above? |
Still not. A maintainer confirmed to me that it can be simulated using fake atomic accesses.
Looking at the implementation, it is a dummy function that does nothing regarding TSan’s knowledge of the HB relationships.
Publication safety seems to also be established by address depedencies by some reports (reference ocaml/ocaml#10995 (comment)). I don’t know by what means TSan could take those into account.
Thank you for the updated proposal. I do not think that I, personally, can have a firm opinion about the absence of false positives with this mapping as of now. It certainly looks reasonable, but to completely convince myself that the theorem holds, I would first need to review all the relations needed in the OCaml memory model and verify that they are exactly the relations established by TSan with this mapping. This will take me a bit of work as, as I previously expressed, a lot of the arguments are currently flying over my head, and the information is scattered. |
5b320a7
to
10187d2
Compare
This commit contains two changes to the root Makefile: 1. Removal of the following line: $(DEPDIR)/runtime/%.bpic.$(D): OC_CFLAGS += $(SHAREDLIB_CFLAGS) This line is useless because it makes no sense to alter any variable in the CFLAGS category for the computation of dependencies, since only the C preprocessor is involved in this stage and it does not take C flags into account anyway, only C preprocessor flags. 2. When computing the dependencies for $(DEPDIR)/runtime/%.npic.$(D), one should not refer to $(SHAREDLIB_CFLAGS), for a similar reason. It has been verified that SHAREDLIB_CFLAGS is either empty, or contains just -fPIC which is indeed not necessary for computing dependencies (it is indeed a C flag rather than a C preprocessor flag).
In this target-specific definition, C and C preprocessor flags were mixed. This commit distinguishes one form the other.
Given the convention that the OC_* build varialbes are reserved for the build system, it seems better to make sure all of them are defined in the private Makefile.build_config file, rather than in Makefile.config which gets installed and thus becomes public. This commit moves the definitions of OC_CFLAGS, OC_CPPFLAGS, OC_LDFLAGS, OC_DLL_LDFLAGS and OC_EXE_LDFLAGS from Makefile.config.in to Makefile.build_config.in. It also moves the defintion of MKEXE_VIA_CC, since this variable relies on private build varables and does not seem relevant or useful outside of the context of the build of the compiler itself.
The renamings done in this commit are: OC_COMMON_CFLAGS -> OC_COMMON_COMPFLAGS OC_COMMON_LDFLAGS -> OC_COMMON_LINKFLAGS OC_BYTECODE_LDFLAGS -> OC_BYTECODE_LINKFLAGS OC_NATIVE_CFLAGS -> OC_NATIVE_COMPFLAGS OC_NATIVE_LDFLAGS -> OC_NATIVE_LINKFLAGS
This means moving its definition from Makefile.common to Makefile.build_config.in
This is to let configure specify flags that will be used when compiling C files to be linked with native code. The variable is not used in the build system yet, just defined.
10187d2
to
00d1c60
Compare
As discussed collectively, rebased on ocaml/ocaml#12108 which allowed a much needed cleaning of the build system aspects of TSan support. |
c92436b
to
9f01960
Compare
When compiled for linking with native code, C files use both the common preprocessor flags and the native-specific cppflags. The same should happen for assembly files and this commit makes sure this is the case.
This flag is unused for now. Co-authored-by: Olivier Nicole <[email protected]>
Restore libunwind detection when TSan is enabled at configure time. This is a cherry-picking of b694e84 by with some adaptations: - libunwind detection is only attempted when tsan is enabled - if tsan is enabled, libunwind is requested and not optional - libunwind_include_dirs and libunwind_link_dirs become libunwind_cppflags and libunwind_ldflags, respectively Co-authored-by: Olivier Nicole <[email protected]>
ThreadSanitizer support is in two parts: instrumentation of the generated native code, and runtime support. The Cmm instrumentation pass is in asmcomp/thread_sanitizer.ml[i]. The new C file runtime/tsan.c handles the task of letting TSan know about function entries and exits when raising exceptions or handling effects. Finally, some of the instrumentation calls have to be inserted directly into the assembly routines of runtime/amd64.S. Co-authored-by: Fabrice Buoro <[email protected]> Co-authored-by: Anmol Sahoo <anmol.sahoo25@gmail>
- Add a new set of tests in testsuite/tsan/ - A small number of tests have to be disabled when --enable-tsan is set, due to the fact that a call tree of depth >64k causes the ThreadSanitizer runtime to crash. (This is a limitation on the ThreadSanitizer side.) - Add the no-tsan action to ocamltest, in order to test whether --enable-tsan is set. Co-authored-by: Olivier Nicole <[email protected]>
TSan warns about data races on these functions. We reported those warnings in #11040, and silence them to avoid facing users of TSan with data race reports that are not related to their code. This is done in two ways: either un-instrumenting those functions, or adding their name in __tsan_default_suppressions in tsan.c. Co-authored-by: Fabrice Buoro <[email protected]>
The functions that we un-instrumentation are called often but should not contain data races with user code. Co-authored-by: Olivier Nicole <[email protected]>
8dfcd65
to
0760b24
Compare
Closing this PR now that the actual PR ocaml/ocaml#12114 has been opened. |
This PR introduces a new configure-time flag
--enable-tsan
to enable compilation with ThreadSanitizer (TSan) instrumentation. This is a joint work with Fabrice Buoro (@fabbing), based on the work of Anmol Sahoo (@anmolsahoo25). We also had help from @jhjourdan and @maranget on memory models, @shindere on the build system, and are grateful to @art-w and @gadmm for their useful feedback.TSan is an approach developed by Google to locate data races in code bases. It works by instrumenting your executables to keep a history of previous memory accesses (at a certain performance cost), in order to detect data races, even when they have no visible effect on the execution. TSan instrumentation has been implemented in various compilers (gcc, clang, as well as the Go and Swift compilers), and has proved very effective in detecting hundreds of concurrency bugs in large projects.
Executables instrumented with TSan report data races without false positives. However, data races in code paths that are not visited will not be detected.
Example
A data race consists in two accesses to the same location, at least one of them being a write, without any order being enforced between them:
Compiling this program with a TSan-enabled compiler is done by the usual command
ocamlopt program.ml
. Running it will output a report looking like this:Many other examples can be found in the test suite. More information about ThreadSanitizer usage are available in the readme of the ocaml-tsan repo, which has been released as a fork based on 5.0.0 (release date December 16, 2022). This approach already allowed to detect a number of data races in OCaml libraries:
The generated executables should not be impacted by this change when
--enable-tsan
is not set. Compilation times are the same as before without--enable-tsan
.Caveats
How TSan works
Executables are instrumented with calls to the TSan runtime library, which tracks accesses to shared data and reports races.
Internally the runtime library associates with each word of application memory at least 2 "shadow words". Each shadow word contains information about a recent memory access to that word, including a "scalar clock". Those clocks serve to establish a happens-before relation.
This information is maintained as a "shadow state" in a separate memory region, and updated at every (instrumented) memory access. A data race is reported every time two memory accesses are made to overlapping memory regions, and:
Instrumentation pass
Instrumentation calls are inserted in several places:
Memory accesses are instrumented with calls to the TSan runtime, with functions of the form
__tsan_read8
,__tsan_atomic_store
; those calls are inserted into theCmm
representation of code. We exploit mutability information to improve performance: immutable loads are translated to non-instrumented memory reads.Function entries and exits are instrumented with calls to
__tsan_func_entry
and__tsan_func_exit
. This is used by TSan to provide backtraces in data race reports (including backtraces of past memory accesses).The C runtime is instrumented as well, using the built-in ThreadSanitizer support of GCC or Clang. This is necessary as
Enabling
tsan
also causes C code to be built with-fno-omit-frame-pointers
. The OCaml compiler's--enable-frame-pointers
configure flag is however independent of--enable-tsan
.Instrumentation of function entries and exits is simple in C or C++ (where raising an exception always unwinds the stack), but challenging in OCaml which has non-local control flow due to exceptions and effect handlers. In order to keep correct backtraces in all circumstances, it is necessary to emit ad hoc calls to
__tsan_func_entry
and/or__tsan_func_exit
when an exception is raised, an effect is performed, or a continuation is resumed, in order to keep TSan’s view of the call stack up-to-date.Doing this requires unwinding the stack. When possible, we use the already-present mechanism of
frame_descr
descriptors, embedded in the executable, to do it. However, exceptions can be raised across C stack frames, and frame descriptors only exist for OCaml frames. In order to unwind the C parts of the stack, we use the libunwind library. As a consequence, one must install libunwind to build OCaml with--enable-tsan
.__tsan_func_entry
takes as an argument the return address in the current stack frame. This required adding a constructorCreturn_addr
toCmm
expressions andIreturn_addr
to the typeMach.operation
, whose semantics is to fetch the return address from the stack frame.Embedding of the OCaml memory model into the C11 one
TSan’s underlying memory model is the C11 model. Therefore, we have to map OCaml memory operations to C11 operations. The relations between the two models have been the subject of many discussions (ocaml/ocaml#10995, ocaml/ocaml#10992, ocaml/ocaml#10972 (comment)), which we took as a basis for our work. Conceptually, the instrumentation translates OCaml memory operations to C11 operations. Race-free OCaml programs are translated to race-free C programs, while OCaml programs containing races (in the OCaml sense) are translated to C programs that are racy (in the C11 sense).
The resulting runtime analysis does not report false positives, in the sense that all races reported by TSan are indeed data races in the OCaml sense1.
The mapping between OCaml memory operations and C11 operations signaled to TSan can be found here. In the same thread, the previous post contains a justification of the absence of false positives. In a meeting, we presented these to @maranget and @jhjourdan, who agreed that it should have the correctness properties we claim it has.
Performance cost of the instrumentation
First of all, there is no compilation or runtime cost associated with the change unless the compiler is configured with
--enable-tsan
.When TSan instrumentation is enabled, it incurs a slowdown and increases memory consumption. Preliminary benchmarks show a slowdown in the range 7x-13x and a memory consumption increase in the range 2x-7x. Some pathological cases exist, such as
testsuite/tests/lf_skiplist/test_parallel.ml
for which time and memory usage blow up by 23x and 27x, respectively. (We haven’t investigated further yet.)The memory consumption increase is better than for C/C++/Go (this conference talk reports 5x-10x). The slowdown is in range of what is observed in C/C++/Go (5x-15x), but since OCaml programs are expected to perform less mutation than C on average, there is probably room for improvement. There are probably low-hanging fruits in terms of optimization, notably removing instrumentation from the runtime where it is not strictly necessary to correctly detect races in user programs; and investigate the aforementioned pathological case.
Organization of the changes
This PR is best reviewed commit by commit.
It is based on the commits of ocaml/ocaml#12108 which introduces native-only C compiler flags. As a consequence, the changes actually belonging to this PR start at the 10th commit, “Add tsan configure flag”.
More than 60 % of the diff for this PR is due to a new set of tests in
testsuite/tsan/
. A small number of tests have to be disabled when--enable-tsan
is set, due to the fact that a call tree of depth >64k causes the ThreadSanitizer runtime to crash. (This is a limitation on the ThreadSanitizer side.)In the main “Add ThreadSanitizer support” commit, big changes are limited to a few files: the
Cmm
instrumentation pass is inasmcomp/thread_sanitizer.ml[i]
. The new C fileruntime/tsan.c
handles the task of letting TSan know about function entries and exits when raising exceptions or handling effects. Finally, some of the instrumentation calls have to be inserted directly into the assembly routines ofruntime/amd64.S
.In a separate commit, we disable TSan on a number of functions in the runtime. This is because TSan warns about data races on these functions. We reported those warnings in ocaml/ocaml#11040, and silence them to avoid facing users of TSan with data race reports that are not related to their code. To save performance, we also un-instrument some functions that are executed often, but should not create races with user code.
Footnotes
Except maybe in some rare cases involving data dependencies or
volatile
, due to limitations of TSan itself. False positives due tovolatile
can only appear in legacy FFI code, not written with parallelism in mind. ↩