-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime/trace: TestTraceStress failed #9729
Comments
What hardware do you use? |
-> uname -a -> cat /proc/cpuinfo processor : 1 |
I see, thank you. |
On Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz $go version $go test -run TraceStress -v $./all.bash ok runtime/debug 0.012s .... $uname -a $ cat /proc/cpuinfo |
It is also failing for me, although with another error message ("failed to parse trace: bogus scanning end"). The same error was reported by one of the build bots:
|
similar problem
|
@dvyukov In case it helps, I can reproduce this issue 100% on my temporary Z620 in TOK... but not all on my permanent one back in SFO. SFO (does NOT repro):
TOK (DOES repro):
Let me know if granting you remote access would be helpful in investigating. I'll have it for at least two more weeks. |
I will appreciate access to a machine that can reproduce this. I will contact you offline. |
I can reproduce it on your machine. I wrote the following program to investigate it.
But diffs are never larger than ~250.
I hoped this problem is left in stone age. Linux must do its best to synchronize clocks today, so maybe it's bad BIOS or processor microcode. #define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>
#include <sched.h>
#include <errno.h>
void* thr(void *arg);
uint64_t sig, ticks0, ticks1;
int main() {
cpu_set_t mask0;
if (sched_getaffinity(0, sizeof(mask0), &mask0)) {
perror("sched_getaffinity");
exit(1);
}
int i;
for (i = 0; i < CPU_SETSIZE; i++) {
if (!CPU_ISSET(i, &mask0))
break;
int j;
for (j = 0; j < CPU_SETSIZE; j++) {
if (i == j || !CPU_ISSET(j, &mask0))
break;
printf(".");
fflush(0);
sig = ticks0 = ticks1 = 0;
pthread_t th[2];
pthread_attr_t attr[2];
cpu_set_t mask[2];
CPU_ZERO(&mask[0]);
CPU_SET(i, &mask[0]);
pthread_attr_init(&attr[0]);
pthread_attr_setaffinity_np(&attr[0], sizeof(attr[0]), &mask[0]);
pthread_create(&th[0], &attr[0], thr, (void*)1);
CPU_ZERO(&mask[1]);
CPU_SET(j, &mask[1]);
pthread_attr_init(&attr[1]);
pthread_attr_setaffinity_np(&attr[1], sizeof(mask[1]), &mask[1]);
pthread_create(&th[1], &attr[1], thr, 0);
pthread_join(th[0], 0);
pthread_join(th[1], 0);
pthread_attr_destroy(&attr[0]);
pthread_attr_destroy(&attr[1]);
if (ticks0 != 0)
printf("\n%d-%d: BROKEN %lu -> %lu (diff=%lu)\n", i, j, ticks0, ticks1, ticks0 - ticks1);
}
}
printf("\n");
return 0;
}
uint64_t cputicks() {
uint64_t low, high;
__asm__ __volatile__("lfence; rdtsc" : "=a" (low), "=d" (high));
return (high << 32) | low;
}
void* thr(void *arg) {
int i;
for (i = 0; i < 10000; i++) {
if (arg != 0) {
uint64_t t0 = cputicks();
__atomic_store_n(&sig, 1, __ATOMIC_RELEASE);
while (__atomic_load_n(&sig, __ATOMIC_RELAXED) == 1) {
}
uint64_t t1 = __atomic_load_n(&sig, __ATOMIC_ACQUIRE);
if (t1 <= t0) {
ticks0 = t0;
ticks1 = t1;
}
} else {
while (__atomic_load_n(&sig, __ATOMIC_ACQUIRE) != 1) {
}
__atomic_store_n(&sig, cputicks(), __ATOMIC_RELEASE);
}
}
return 0;
} |
We can manully synchronize RDTSC on different cores using RDTSCP instruction. Namely, measure RTT between all cores and offsetting values of RDTSCP by computed deltas. |
See the following issue for context: #9729 (comment) In short, RDTSC can produce skewed results without preceding LFENCE/MFENCE. Information on this matter is very scrappy in the internet. But this is what linux kernel does (see rdtsc_barrier). It also fixes the test program on my machine. Update #9729 Change-Id: I3c1ffbf129fdfdd388bd5b7911b392b319248e68 Reviewed-on: https://go-review.googlesource.com/5033 Reviewed-by: Ian Lance Taylor <[email protected]>
FYI the test still 100% fails on this AMD cpu:
LKML: "Brunner, Richard": TSC and Power Management Events on AMD Processors "this patch syncs the TSCs every 20 seconds. That is enough to sync |
Also saw on OpenBSD-amd64: https://storage.googleapis.com/go-build-log/114bbd11/openbsd-amd64-gce56_7da78022.log
|
FYI. Tried again today using a revision which includes the "fix cputicks on x86" patch, and it still fails.
This is on the same hardware as last time.
|
Same error on my very old AMD machine:
|
Dmitry, reassign as necessary. |
I don't believe this is the same problem as #10476. TestTraceStress had real (non-RDTSC-related) synchronization problems that Austin and Rick found and fixed. The failures in TestTraceStressStartStop are different, I believe. They're always about system calls, which are still not traced quite right. To avoid confusion, I'm going to close this and reopen #10476. |
Is it fixed? Dockerfile is in https://go-review.googlesource.com/#/c/9077/ In the container, the go version is: devel +3f4de49 Mon Apr 20 15:56:56 2015 +0000 root@0e6edb498432:/go/src/runtime/pprof# go test -run TraceStress -v |
No, it is not resolved. |
It doesn't fail outside of container (btw that is darwin/amd64). I don't see any failure reports from the build dashboard or my personal ubuntu machine currently. So, it's probably the docker container in my case. |
Yes, it fails on some (arguably misdesigned, see e.g. https://lkml.org/lkml/2005/11/4/173 ) CPUs. Nothing changed since the original report. This is bare Ubuntu LTS 14.04: ../bin/go version
|
It is vmware. Inside VM it says Intel Xeon CPU E5-2680 0 @ 2.70GHz, Windows 7.
It looks like my pc has 1 CPU only. Your program seems like it needs multiple CPUs. Isn't it? Can I use Go? Alex |
-> go version
go version devel +0e80b2e Thu Jan 29 13:07:30 2015 +0000 linux/amd64
-> go test -run TraceStress -v
=== RUN TestTraceStress
--- FAIL: TestTraceStress (1.71s)
trace_test.go:207: failed to parse trace: p 1 is running before start (offset 128960, time 1475855)
FAIL
exit status 1
FAIL runtime/pprof 1.718s
complete trace file: https://drive.google.com/file/d/0B7KKArVsnMd2WldQRDB4VXlOQXc/view?usp=sharing
The text was updated successfully, but these errors were encountered: