fix #13166 tioselectors flaky test on freebsd+OSX #14634

timotheecour · 2020-06-11T00:12:23Z

fix [CI] tests/async/tioselectors.nim flaky test for freebsd + OSX CI #13166 flaky test

looks like there is a lag involved, not quite sure why (I thought timers at least on OSX had high resolution).
note that running locally, I can reproduce the flake but only with smaller values close to the t0=100 set in var timer = selector.registerTimer(100, false, 0)

when running locally, proportion of success increases as t increases:

90 => (0, 200)
98 => (11, 200)
100 => (67, 200)
101 => (107, 200)
103 => (161, 200)
104 => (186, 200)
105 => (200, 200)
110 => (200, 200)
140 => (200, 200)

when running on CI, perhaps there is some extra lag (virtualization?)

import os, osproc, strutils
const Iterations = 200
proc test() =
  var count = 0
  let file = "tests/async/tioselectors.nim"
  let exe = file.changeFileExt(ExeExt)
  let opt = "--threads:on -d:threadsafe --excessivestacktrace -d:danger"
  let run = "nim c --lib:lib $1 $2" % [opt, file.quoteShell]
  let (output, status) = execCmdEx run
  doAssert status == 0, output
  var cmds: seq[string]
  for i in 1..Iterations:
    cmds.add exe
  var count1 = 0
  var count2 = 0

  proc afterRunEvent(idx: int; p: Process) =
    let code = p.peekExitCode
    doAssert code != -1
    echo (code, )
    count1 += ord(code == 0)
    count2 += 1

  when true:
    let status2 = execProcesses(cmds, afterRunEvent = afterRunEvent)
    echo (count1, count2)
    doAssert status2 == 0
  else:
    for cmd in cmds:
      let status2i = execProcesses(@[cmd], afterRunEvent = afterRunEvent)
    echo (count1, count2)
when isMainModule: test()

here is some research:

kqueue timer timeout period

For example, if I set date (the timeout period) to 20 milliseconds, I
often wait 21 milliseconds which is very undesirable for my application.

FreeBSD is not a real-time OS. The timeouts specified in various syscalls (eg kevent(EVFILT_TIMER), nanosleep(), select(), poll()) specify minimum timeouts. Once the timeout (rounded up to the next
tick) has expired, the process will be placed back into the queue of processes eligible to be run by the scheduler - which may impose a further arbitrary delay. Periodic timers are somewhat better behaved: Scheduler delays only impact process scheduling after the timeout expires and the average rate should be very close to that requested.

While it is certainly true that FreeBSD is not a real-time OS, this does not explain the timer problems.

2 or 3 month ago I did a simple test with select and poll: I observed a systematic error of about 3-5% of the waiting time. So when you wait for 20ms, you may get 21ms (if running with
a low HZ value) because of rounding. But if you wait for 100s, you get 103 or even 105s on a completly idle machine (all services disabled).

I think that this is not a problem with beeing non-realtime, but a problem with time-keeping. Shouldn't it be possible to do this better?

I don't think it has anything to do with realtime either. I've been using gentoo linux to run my application using timerfd_create/read for 20 millisecond timing without any problems.

http://lists.digium.com/pipermail/asterisk-dev/2015-March/073302.html

Yes, EVFILT_TIMER events are reliable. I mean the interaction of kqueue with a timer and poll -- the expectation
that poll will return for a kqueue timer expiration.

https://opensource.apple.com/source/xnu/xnu-4570.41.2/bsd/kern/kern_event.c.auto.html

E) Whether the timer continues to tick across sleep By default all three do not. For interval and absolute, NOTE_MACH_CONTINUOUS_TIME causes them to tick across sleep With NOTE_ABSOLUTE | NOTE_MACHTIME | NOTE_MACH_CONTINUOUS_TIME: expires when mach_continuous_time() is > the passed in value.

http://openbsd-archive.7691.n7.nabble.com/ITIMER-resolution-td357053.html#a357222

On amd64, where hz(9) is 100, 10 milliseconds is indeed the shortest interval possible for the per-process itimer(2) timers. Smaller intervals will be rounded up to 10ms when the timer is scheduled in
the kernel. This is the case for all platforms except for alpha (where hz is 1024) and sh/landisk (where hz is 64). Other timer interfaces will have the same restriction, e.g. the kevent(2) EVFILT_TIMER filter has the same granularity as setitimer(2).

/cc @cheatfate

timotheecour · 2020-06-12T00:11:34Z

@euantorano can you LGTM this?

euantorano · 2020-06-12T05:03:58Z

Yep, will have a look over it this weekend.

…

On Fri, 12 Jun 2020, at 01:11, Timothee Cour wrote: @euantorano <https://github.com/euantorano> can you LGTM this? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#14634 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAFW24NZOI633DAO3LQACL3RWFXEFANCNFSM4N25LHXQ>.

dom96

This overkill. Just up the timings in the select call, that's all that is needed here. All these changes here just make the code harder to understand and I cannot see them being necessary.

tests/async/tioselectors.nim

timotheecour · 2020-06-12T20:06:10Z

This overkill. Just up the timings in the select call, that's all that is needed here. All these changes here just make the code harder to understand and I cannot see them being necessary.

the extra code helps when investigating this issue so you can run the timer test N times without overhead and get statistical results on how many times it failed without stopping on 1st failure. There are still unknowns here:

registerTimer timer doesn't start immediately but select timer starts immediately
freebsd doesn't have this problem, but openbsd does have it, and OSX has it even more
the problem in CI is even worse even though IIUC OSX doesn't run on a VM in CI

dom96 · 2020-06-12T20:08:40Z

the extra code helps when investigating this issue so you can run the timer test N times without overhead and get statistical results on how many times

This should be a separate testing tool IMO.

examples/t14634.nim

timotheecour · 2020-06-13T02:10:01Z

done, PTAL

timotheecour · 2020-06-13T04:19:16Z

@Araq I'm merging this as all comments have been addressed and I'm keeping seeing CI errors due to this flaky test in other PR's

dom96 · 2020-06-13T10:18:32Z

Nice, thanks for taking my feedback on board.

(cherry picked from commit d149823)

timotheecour changed the title ~~fix tioselectors flaky test on freebsd+OSX~~ fix #13166 tioselectors flaky test on freebsd+OSX Jun 11, 2020

timotheecour force-pushed the pr_fix_13166_tioselectors branch from 638d8cf to b143fc6 Compare June 11, 2020 04:51

timotheecour requested a review from Araq June 11, 2020 05:40

dom96 requested changes Jun 12, 2020

View reviewed changes

tests/async/tioselectors.nim Show resolved Hide resolved

timotheecour added 5 commits June 12, 2020 12:43

fix tioselectors flaky test on freebsd+OSX

ab3a48a

fixup

f6ec36b

fixup

cefc0a1

fixup

effde8c

tweak timers to further reduce chance of flakes

ad231ca

timotheecour force-pushed the pr_fix_13166_tioselectors branch from b143fc6 to ad231ca Compare June 12, 2020 19:43

timotheecour added 4 commits June 12, 2020 13:08

disable broken fidget

08176b6

add examples/t14634.nim

5d02b7c

address comments

0108b7f

fixup

c52c892

Araq reviewed Jun 12, 2020

View reviewed changes

examples/t14634.nim Outdated Show resolved Hide resolved

examples/t14634.nim -> tests/m14634.nim

513d169

timotheecour merged commit d149823 into nim-lang:devel Jun 13, 2020

timotheecour deleted the pr_fix_13166_tioselectors branch June 13, 2020 04:19

timotheecour mentioned this pull request Jul 2, 2020

drain/hasPendingOperations seems broken: tests/async/testmanyasyncevents.nim recently very flaky on windows #14885

Open

narimiran pushed a commit that referenced this pull request May 17, 2021

fix #13166 tioselectors flaky test on freebsd+OSX (#14634)

6dde797

(cherry picked from commit d149823)

etan-status mentioned this pull request Oct 27, 2021

Merge branch 'version-1-2' into nimbus status-im/Nim#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix #13166 tioselectors flaky test on freebsd+OSX #14634

fix #13166 tioselectors flaky test on freebsd+OSX #14634

timotheecour commented Jun 11, 2020 •

edited

Loading

timotheecour commented Jun 12, 2020

euantorano commented Jun 12, 2020 via email

dom96 left a comment

timotheecour commented Jun 12, 2020

dom96 commented Jun 12, 2020

timotheecour commented Jun 13, 2020

timotheecour commented Jun 13, 2020 •

edited

Loading

dom96 commented Jun 13, 2020

fix #13166 tioselectors flaky test on freebsd+OSX #14634

fix #13166 tioselectors flaky test on freebsd+OSX #14634

Conversation

timotheecour commented Jun 11, 2020 • edited Loading

here is some research:

timotheecour commented Jun 12, 2020

euantorano commented Jun 12, 2020 via email

dom96 left a comment

Choose a reason for hiding this comment

timotheecour commented Jun 12, 2020

dom96 commented Jun 12, 2020

timotheecour commented Jun 13, 2020

timotheecour commented Jun 13, 2020 • edited Loading

dom96 commented Jun 13, 2020

timotheecour commented Jun 11, 2020 •

edited

Loading

timotheecour commented Jun 13, 2020 •

edited

Loading