Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky test test-tick-processor #2471

Closed
joaocgreis opened this issue Aug 20, 2015 · 20 comments
Closed

Investigate flaky test test-tick-processor #2471

joaocgreis opened this issue Aug 20, 2015 · 20 comments
Labels
arm Issues and PRs related to the ARM platform. build Issues and PRs related to build files or the CI. test Issues and PRs related to the tests.

Comments

@joaocgreis
Copy link
Member

Examples of failures:

@joaocgreis joaocgreis added build Issues and PRs related to build files or the CI. test Issues and PRs related to the tests. labels Aug 20, 2015
@brendanashworth brendanashworth added the arm Issues and PRs related to the ARM platform. label Aug 21, 2015
@brendanashworth
Copy link
Contributor

I opened an issue (#2431) and it was thought not to be an issue, so I guess we'll see?

@bnoordhuis
Copy link
Member

I just saw this on armv7-wheezy:

not ok 673 - test-tick-processor.js
#
#
##
## Fatal error in ../deps/v8/src/heap/heap.cc, line 4062
## Check failed: map->instance_type() == JS_OBJECT_TYPE || map->instance_type() == JS_ARRAY_TYPE.
##
#
#==== C stack trace ===============================
#
#(empty)
  ---
  duration_ms: 120.111
  ...

@rvagg
Copy link
Member

rvagg commented Aug 23, 2015

it's flaking out on armv7-wheezy still https://jenkins-iojs.nodesource.com/job/node-test-commit-arm/267/nodes=armv7-wheezy/console

something's not happy there at all, fine on raspberry pi 2 though which is armv7 and wheezy-based, tho not exactly wheezy

@targos
Copy link
Member

targos commented Aug 28, 2015

I got another error with the same test:

not ok 676 - test-tick-processor.js
# util.js:560
# name = name.substr(1, name.length - 2);
# ^
# 
# TypeError: name.substr is not a function
# at String.substr (native)
# at formatProperty (util.js:560:19)
# at util.js:433:12
# at Array.map (native)
# at formatObject (util.js:432:15)
# at formatValue (util.js:378:16)
# at formatProperty (util.js:537:15)
# at util.js:433:12
# at Array.map (native)
# at formatObject (util.js:432:15)
# child_process.js:458
# throw err;
# ^
# 
# Error: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/armv7-wheezy/out/Release/node -prof -pe function foo(n) {require('vm').runInDebugContext('Debug');return n < 2 ? n : setImmediate(function() { foo(n-1) + foo(n-2);}); };setTimeout(function() { process.exit(0); }, 2000);foo(40);
# util.js:560
# name = name.substr(1, name.length - 2);
# ^
# 
# TypeError: name.substr is not a function
# at String.substr (native)
# at formatProperty (util.js:560:19)
# at util.js:433:12
# at Array.map (native)
# at formatObject (util.js:432:15)
# at formatValue (util.js:378:16)
# at formatProperty (util.js:537:15)
# at util.js:433:12
# at Array.map (native)
# at formatObject (util.js:432:15)
# 
# at checkExecSyncError (child_process.js:435:13)
# at Object.execFileSync (child_process.js:455:13)
# at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-arm/nodes/armv7-wheezy/test/parallel/test-tick-processor.js:11:4)
# at Module._compile (module.js:434:26)
# at Object.Module._extensions..js (module.js:452:10)
# at Module.load (module.js:355:32)
# at Function.Module._load (module.js:310:12)
# at Function.Module.runMain (module.js:475:10)
# at startup (node.js:117:18)
# at node.js:951:3

https://jenkins-iojs.nodesource.com/job/node-test-commit-arm/374/nodes=armv7-wheezy/tapTestReport/test.tap-676/

@joaocgreis
Copy link
Member Author

I ran this test repeatedly on Linux locally (Ubuntu 14.04). It failed 3 times out of 1188 runs (0.25%). It seems that this test is flaky at least on all Linux platforms (likely with a much higher failure rate in armv7).

@matthewloring
Copy link

I was able to reproduce this on Ubuntu locally by putting my machine under heavy load and gather some data. The raw ticks are too large for pastebin but here's the processed output:

http://pastebin.com/4rMfURK7

A single JS tick was collected and it did not include my test function. I need to investigate more to determine why it's missing foo.

@Trott
Copy link
Member

Trott commented Sep 2, 2015

I was able to get this one to blow up after 58 runs on OS X with this:

for i in {1..1000}; do echo $i; ./node test/parallel/test-tick-processor.js  || break; echo ====; done

Here's the error output:

assert.js:89
  throw new assert.AssertionError({
  ^
AssertionError: null == true
    at Object.<anonymous> (/Users/trott/io.js/test/parallel/test-tick-processor.js:27:1)
    at Module._compile (module.js:430:26)
    at Object.Module._extensions..js (module.js:448:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:471:10)
    at startup (node.js:117:18)
    at node.js:951:3

@Trott
Copy link
Member

Trott commented Sep 2, 2015

Not sure if it would address the indeterminacy (but I suspect it might), and I hope it's not a foolish question, but (ignoring the more-than-80-chars-in-a-line issue) is there any reason this part of the test:

cp.execFileSync(process.execPath, ['-prof', '-pe',
    'function foo(n) {' +
    'require(\'vm\').runInDebugContext(\'Debug\');' +
    'return n < 2 ? n : setImmediate(function() { foo(n-1) + foo(n-2);}); };' +
    'setTimeout(function() { process.exit(0); }, 2000);' +
    'foo(40);']);

...couldn't be something more like this?:

cp.execFileSync(process.execPath, ['-prof', '-pe',
    'function foo(n) {' +
    'require(\'vm\').runInDebugContext(\'Debug\');' +
    'return n < 2 ? process.exit(0); : setImmediate(function() { foo(n-1) + foo(n-2);}); };' +
    'foo(4);']);

...or this?:

cp.execFileSync(process.execPath, ['-prof', '-pe',
    'function foo(n) {' +
    'require(\'vm\').runInDebugContext(\'Debug\');' +
    'return n < 2 ? setImmediate(function () {process.exit(0);}) : setImmediate(function() { foo(n-1) + foo(n-2);}); };' +
    'foo(4);']);

/cc @matthewloring

@matthewloring
Copy link

The hope was to have the code run for a fixed time as opposed to a fixed number of cycles. We didn't want it to slow down too much on the slow machines or finish before sufficient samples had been taken on faster ones. We could try switching to something like this but it may be hard to find a sweet spot.

@Trott
Copy link
Member

Trott commented Sep 2, 2015

Inside of test/common.js, there's this:

exports.platformTimeout = function(ms) {
  if (process.arch !== 'arm')
    return ms;

  if (process.config.variables.arm_version === '6')
    return 7 * ms;  // ARMv6

  return 2 * ms;  // ARMv7 and up.
};

Maybe something similar to it could be created and used to tweak the number of iterations so ARM would only run (say) 20 times through the loop before process.exit() but everything else might go 200?

@matthewloring
Copy link

I'll look into it that but I'm worried something more tricky is going on. I tried both of the proposed alternatives above. The first flaked after 23 iterations and the second flaked after 166 iterations. Let me try to figure out why the C++/Javascript tick ratio is so high in the failing runs.

@ofrobots
Copy link
Contributor

ofrobots commented Sep 2, 2015

I looked into the logs posted by @matthewloring. The test wants to make sure it finds ticks in the javascript function foo; otherwise it fails.

Looking at the ticks recorded by V8, foo is definitely on the stack, just that it never happened to be at the top of the stack. The tick-processor by default prints stacks upto a depth of 5 which is barely not enough to find foo. I suspect this is the reason for the flakiness. If we are unlucky, we end up capturing only the ticks where foo never happens to be near the top of the stack.

The anonymous inner function at column 99 does show up higher up. The fix would be to either increase the callGraph size printed by the tick-processor or to grep for "LazyCompile.*[eval]:1". This should pattern match both the foo method and the inner anonymous function.

@mhdawson
Copy link
Member

This test is also fails on PPC.

The issue for PPC is that on the V8 side the functionality used by test-tick-processor only includes symbols for functions if they are in the executable range and since PPC uses function descriptors as the entry points, and they are not in the executable range they symbols for most of the functions are discarded.

I was going to open a new issue but thought I'd mention it here first.

bnoordhuis pushed a commit that referenced this issue Sep 14, 2015
Per the discussion on #2471, the JS symbols checked for by this test
were occasionally too deep in the stack and were being ignored by the
tick processor.

I have addressed this by increasing the stack depth inspected by the
tick processor and looking for the eval symbol which is more likely
to be present. Additional flakiness was caused by occasional misses
of the code creation event for the JS function being executed. I now
have separate code snippets to test for JS and C++ symbols and if
the code creation event is missed for the JS symbol test then I check
for a percentage of UNKNOWN symbols in processed output. This is
considered a success as the processing scripts in the node repository
are still correctly processing the ticks recieved from the v8
scripts. Further investigation is needed into the v8 profiling
scripts to determine why code creation events are being missed.

PR-URL: #2694
Reviewed-By: Ben Noordhuis <[email protected]>
@matthewloring
Copy link

#2694 was just landed to fix the flakiness I was aware of before this note on PPC. @mhdawson would you be able to post a raw tick log from a PPC machine so I can take a look? I don't have access to one locally.

Fishrock123 pushed a commit that referenced this issue Sep 15, 2015
Per the discussion on #2471, the JS symbols checked for by this test
were occasionally too deep in the stack and were being ignored by the
tick processor.

I have addressed this by increasing the stack depth inspected by the
tick processor and looking for the eval symbol which is more likely
to be present. Additional flakiness was caused by occasional misses
of the code creation event for the JS function being executed. I now
have separate code snippets to test for JS and C++ symbols and if
the code creation event is missed for the JS symbol test then I check
for a percentage of UNKNOWN symbols in processed output. This is
considered a success as the processing scripts in the node repository
are still correctly processing the ticks recieved from the v8
scripts. Further investigation is needed into the v8 profiling
scripts to determine why code creation events are being missed.

PR-URL: #2694
Reviewed-By: Ben Noordhuis <[email protected]>
rvagg pushed a commit that referenced this issue Sep 15, 2015
Per the discussion on #2471, the JS symbols checked for by this test
were occasionally too deep in the stack and were being ignored by the
tick processor.

I have addressed this by increasing the stack depth inspected by the
tick processor and looking for the eval symbol which is more likely
to be present. Additional flakiness was caused by occasional misses
of the code creation event for the JS function being executed. I now
have separate code snippets to test for JS and C++ symbols and if
the code creation event is missed for the JS symbol test then I check
for a percentage of UNKNOWN symbols in processed output. This is
considered a success as the processing scripts in the node repository
are still correctly processing the ticks recieved from the v8
scripts. Further investigation is needed into the v8 profiling
scripts to determine why code creation events are being missed.

PR-URL: #2694
Reviewed-By: Ben Noordhuis <[email protected]>
@mhdawson
Copy link
Member

@dnakamura could you add a raw tick log

@matthewloring we have debugged the problem to the point were we believe the issue is due to PPC 64 BE using function descriptors (as does AIX 64 and 32 were we see the same problem). The issue seems to be on the V8 side where it only adds symbols in address ranges marked executable which is not the case for function descriptors. If we remove that check the test passes. We are still looking at how best to address the issue.

At this point I think we should likely create a new issue to track the PPC issue as opposed to using this one unless you think otherwise.

@matthewloring
Copy link

@mhdawson I think tracking it separately at this point makes sense, especially if the fixes will be going in on the V8 side.

@ofrobots
Copy link
Contributor

Since the AIX problem isn't related to flakiness, I think it should be a separate issue.

Can this issue be closed now?

@matthewloring
Copy link

My fix is in and to my knowledge the flakiness is gone. Was hoping someone who has watched the CI since #2694 went in could confirm.

@mhdawson
Copy link
Member

Created new issue for PPC failures here: #2957

@mhdawson
Copy link
Member

At this point if nobody is complaining I think we should close this. Please re-open if you disagree

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arm Issues and PRs related to the ARM platform. build Issues and PRs related to build files or the CI. test Issues and PRs related to the tests.
Projects
None yet
Development

No branches or pull requests

9 participants