Investigate flaky test test-tick-processor #2471

joaocgreis · 2015-08-20T20:07:21Z

Examples of failures:

armv7-wheezy

The text was updated successfully, but these errors were encountered:

brendanashworth · 2015-08-21T04:06:47Z

I opened an issue (#2431) and it was thought not to be an issue, so I guess we'll see?

bnoordhuis · 2015-08-22T17:19:46Z

I just saw this on armv7-wheezy:

not ok 673 - test-tick-processor.js
#
#
##
## Fatal error in ../deps/v8/src/heap/heap.cc, line 4062
## Check failed: map->instance_type() == JS_OBJECT_TYPE || map->instance_type() == JS_ARRAY_TYPE.
##
#
#==== C stack trace ===============================
#
#(empty)
  ---
  duration_ms: 120.111
  ...

rvagg · 2015-08-23T13:52:27Z

it's flaking out on armv7-wheezy still https://jenkins-iojs.nodesource.com/job/node-test-commit-arm/267/nodes=armv7-wheezy/console

something's not happy there at all, fine on raspberry pi 2 though which is armv7 and wheezy-based, tho not exactly wheezy

targos · 2015-08-28T12:47:01Z

I got another error with the same test:

not ok 676 - test-tick-processor.js
# util.js:560
# name = name.substr(1, name.length - 2);
# ^
# 
# TypeError: name.substr is not a function
# at String.substr (native)
# at formatProperty (util.js:560:19)
# at util.js:433:12
# at Array.map (native)
# at formatObject (util.js:432:15)
# at formatValue (util.js:378:16)
# at formatProperty (util.js:537:15)
# at util.js:433:12
# at Array.map (native)
# at formatObject (util.js:432:15)
# child_process.js:458
# throw err;
# ^
# 
# Error: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/armv7-wheezy/out/Release/node -prof -pe function foo(n) {require('vm').runInDebugContext('Debug');return n < 2 ? n : setImmediate(function() { foo(n-1) + foo(n-2);}); };setTimeout(function() { process.exit(0); }, 2000);foo(40);
# util.js:560
# name = name.substr(1, name.length - 2);
# ^
# 
# TypeError: name.substr is not a function
# at String.substr (native)
# at formatProperty (util.js:560:19)
# at util.js:433:12
# at Array.map (native)
# at formatObject (util.js:432:15)
# at formatValue (util.js:378:16)
# at formatProperty (util.js:537:15)
# at util.js:433:12
# at Array.map (native)
# at formatObject (util.js:432:15)
# 
# at checkExecSyncError (child_process.js:435:13)
# at Object.execFileSync (child_process.js:455:13)
# at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-arm/nodes/armv7-wheezy/test/parallel/test-tick-processor.js:11:4)
# at Module._compile (module.js:434:26)
# at Object.Module._extensions..js (module.js:452:10)
# at Module.load (module.js:355:32)
# at Function.Module._load (module.js:310:12)
# at Function.Module.runMain (module.js:475:10)
# at startup (node.js:117:18)
# at node.js:951:3

https://jenkins-iojs.nodesource.com/job/node-test-commit-arm/374/nodes=armv7-wheezy/tapTestReport/test.tap-676/

joaocgreis · 2015-09-01T00:03:21Z

I ran this test repeatedly on Linux locally (Ubuntu 14.04). It failed 3 times out of 1188 runs (0.25%). It seems that this test is flaky at least on all Linux platforms (likely with a much higher failure rate in armv7).

matthewloring · 2015-09-01T23:17:37Z

I was able to reproduce this on Ubuntu locally by putting my machine under heavy load and gather some data. The raw ticks are too large for pastebin but here's the processed output:

http://pastebin.com/4rMfURK7

A single JS tick was collected and it did not include my test function. I need to investigate more to determine why it's missing foo.

Trott · 2015-09-02T01:37:10Z

I was able to get this one to blow up after 58 runs on OS X with this:

for i in {1..1000}; do echo $i; ./node test/parallel/test-tick-processor.js  || break; echo ====; done

Here's the error output:

assert.js:89
  throw new assert.AssertionError({
  ^
AssertionError: null == true
    at Object.<anonymous> (/Users/trott/io.js/test/parallel/test-tick-processor.js:27:1)
    at Module._compile (module.js:430:26)
    at Object.Module._extensions..js (module.js:448:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:471:10)
    at startup (node.js:117:18)
    at node.js:951:3

Trott · 2015-09-02T02:40:23Z

Not sure if it would address the indeterminacy (but I suspect it might), and I hope it's not a foolish question, but (ignoring the more-than-80-chars-in-a-line issue) is there any reason this part of the test:

cp.execFileSync(process.execPath, ['-prof', '-pe',
    'function foo(n) {' +
    'require(\'vm\').runInDebugContext(\'Debug\');' +
    'return n < 2 ? n : setImmediate(function() { foo(n-1) + foo(n-2);}); };' +
    'setTimeout(function() { process.exit(0); }, 2000);' +
    'foo(40);']);

...couldn't be something more like this?:

cp.execFileSync(process.execPath, ['-prof', '-pe',
    'function foo(n) {' +
    'require(\'vm\').runInDebugContext(\'Debug\');' +
    'return n < 2 ? process.exit(0); : setImmediate(function() { foo(n-1) + foo(n-2);}); };' +
    'foo(4);']);

...or this?:

cp.execFileSync(process.execPath, ['-prof', '-pe',
    'function foo(n) {' +
    'require(\'vm\').runInDebugContext(\'Debug\');' +
    'return n < 2 ? setImmediate(function () {process.exit(0);}) : setImmediate(function() { foo(n-1) + foo(n-2);}); };' +
    'foo(4);']);

/cc @matthewloring

matthewloring · 2015-09-02T03:13:38Z

The hope was to have the code run for a fixed time as opposed to a fixed number of cycles. We didn't want it to slow down too much on the slow machines or finish before sufficient samples had been taken on faster ones. We could try switching to something like this but it may be hard to find a sweet spot.

Trott · 2015-09-02T03:16:56Z

Inside of test/common.js, there's this:

exports.platformTimeout = function(ms) {
  if (process.arch !== 'arm')
    return ms;

  if (process.config.variables.arm_version === '6')
    return 7 * ms;  // ARMv6

  return 2 * ms;  // ARMv7 and up.
};

Maybe something similar to it could be created and used to tweak the number of iterations so ARM would only run (say) 20 times through the loop before process.exit() but everything else might go 200?

matthewloring · 2015-09-02T04:18:32Z

I'll look into it that but I'm worried something more tricky is going on. I tried both of the proposed alternatives above. The first flaked after 23 iterations and the second flaked after 166 iterations. Let me try to figure out why the C++/Javascript tick ratio is so high in the failing runs.

ofrobots · 2015-09-02T23:16:53Z

I looked into the logs posted by @matthewloring. The test wants to make sure it finds ticks in the javascript function foo; otherwise it fails.

Looking at the ticks recorded by V8, foo is definitely on the stack, just that it never happened to be at the top of the stack. The tick-processor by default prints stacks upto a depth of 5 which is barely not enough to find foo. I suspect this is the reason for the flakiness. If we are unlucky, we end up capturing only the ticks where foo never happens to be near the top of the stack.

The anonymous inner function at column 99 does show up higher up. The fix would be to either increase the callGraph size printed by the tick-processor or to grep for "LazyCompile.*[eval]:1". This should pattern match both the foo method and the inner anonymous function.

mhdawson · 2015-09-14T21:02:09Z

This test is also fails on PPC.

The issue for PPC is that on the V8 side the functionality used by test-tick-processor only includes symbols for functions if they are in the executable range and since PPC uses function descriptors as the entry points, and they are not in the executable range they symbols for most of the functions are discarded.

I was going to open a new issue but thought I'd mention it here first.

Per the discussion on #2471, the JS symbols checked for by this test were occasionally too deep in the stack and were being ignored by the tick processor. I have addressed this by increasing the stack depth inspected by the tick processor and looking for the eval symbol which is more likely to be present. Additional flakiness was caused by occasional misses of the code creation event for the JS function being executed. I now have separate code snippets to test for JS and C++ symbols and if the code creation event is missed for the JS symbol test then I check for a percentage of UNKNOWN symbols in processed output. This is considered a success as the processing scripts in the node repository are still correctly processing the ticks recieved from the v8 scripts. Further investigation is needed into the v8 profiling scripts to determine why code creation events are being missed. PR-URL: #2694 Reviewed-By: Ben Noordhuis <[email protected]>

matthewloring · 2015-09-14T22:19:35Z

#2694 was just landed to fix the flakiness I was aware of before this note on PPC. @mhdawson would you be able to post a raw tick log from a PPC machine so I can take a look? I don't have access to one locally.

Per the discussion on #2471, the JS symbols checked for by this test were occasionally too deep in the stack and were being ignored by the tick processor. I have addressed this by increasing the stack depth inspected by the tick processor and looking for the eval symbol which is more likely to be present. Additional flakiness was caused by occasional misses of the code creation event for the JS function being executed. I now have separate code snippets to test for JS and C++ symbols and if the code creation event is missed for the JS symbol test then I check for a percentage of UNKNOWN symbols in processed output. This is considered a success as the processing scripts in the node repository are still correctly processing the ticks recieved from the v8 scripts. Further investigation is needed into the v8 profiling scripts to determine why code creation events are being missed. PR-URL: #2694 Reviewed-By: Ben Noordhuis <[email protected]>

mhdawson · 2015-09-17T22:48:07Z

@dnakamura could you add a raw tick log

@matthewloring we have debugged the problem to the point were we believe the issue is due to PPC 64 BE using function descriptors (as does AIX 64 and 32 were we see the same problem). The issue seems to be on the V8 side where it only adds symbols in address ranges marked executable which is not the case for function descriptors. If we remove that check the test passes. We are still looking at how best to address the issue.

At this point I think we should likely create a new issue to track the PPC issue as opposed to using this one unless you think otherwise.

matthewloring · 2015-09-17T22:58:57Z

@mhdawson I think tracking it separately at this point makes sense, especially if the fixes will be going in on the V8 side.

ofrobots · 2015-09-17T23:14:25Z

Since the AIX problem isn't related to flakiness, I think it should be a separate issue.

Can this issue be closed now?

matthewloring · 2015-09-17T23:26:35Z

My fix is in and to my knowledge the flakiness is gone. Was hoping someone who has watched the CI since #2694 went in could confirm.

mhdawson · 2015-09-18T20:08:14Z

Created new issue for PPC failures here: #2957

mhdawson · 2015-10-22T23:40:19Z

At this point if nobody is complaining I think we should close this. Please re-open if you disagree

joaocgreis added build Issues and PRs related to build files or the CI. test Issues and PRs related to the tests. labels Aug 20, 2015

brendanashworth added the arm Issues and PRs related to the ARM platform. label Aug 21, 2015

orangemocha mentioned this issue Aug 31, 2015

use unsorted array in Timers._unrefActive() #2540

Closed

joaocgreis mentioned this issue Sep 3, 2015

build: Add build-with-asan + use leak-check before v8 teardown #2376

Closed

matthewloring mentioned this issue Sep 4, 2015

tools: fix flakiness in test-tick-processor #2694

Closed

mhdawson closed this as completed Oct 22, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate flaky test test-tick-processor #2471

Investigate flaky test test-tick-processor #2471

joaocgreis commented Aug 20, 2015

brendanashworth commented Aug 21, 2015

bnoordhuis commented Aug 22, 2015

rvagg commented Aug 23, 2015

targos commented Aug 28, 2015

joaocgreis commented Sep 1, 2015

matthewloring commented Sep 1, 2015

Trott commented Sep 2, 2015

Trott commented Sep 2, 2015

matthewloring commented Sep 2, 2015

Trott commented Sep 2, 2015

matthewloring commented Sep 2, 2015

ofrobots commented Sep 2, 2015

mhdawson commented Sep 14, 2015

matthewloring commented Sep 14, 2015

mhdawson commented Sep 17, 2015

matthewloring commented Sep 17, 2015

ofrobots commented Sep 17, 2015

matthewloring commented Sep 17, 2015

mhdawson commented Sep 18, 2015

mhdawson commented Oct 22, 2015

Investigate flaky test test-tick-processor #2471

Investigate flaky test test-tick-processor #2471

Comments

joaocgreis commented Aug 20, 2015

brendanashworth commented Aug 21, 2015

bnoordhuis commented Aug 22, 2015

rvagg commented Aug 23, 2015

targos commented Aug 28, 2015

joaocgreis commented Sep 1, 2015

matthewloring commented Sep 1, 2015

Trott commented Sep 2, 2015

Trott commented Sep 2, 2015

matthewloring commented Sep 2, 2015

Trott commented Sep 2, 2015

matthewloring commented Sep 2, 2015

ofrobots commented Sep 2, 2015

mhdawson commented Sep 14, 2015

matthewloring commented Sep 14, 2015

mhdawson commented Sep 17, 2015

matthewloring commented Sep 17, 2015

ofrobots commented Sep 17, 2015

matthewloring commented Sep 17, 2015

mhdawson commented Sep 18, 2015

mhdawson commented Oct 22, 2015