Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix #13166 tioselectors flaky test on freebsd+OSX #14634

Merged
merged 10 commits into from
Jun 13, 2020

Conversation

timotheecour
Copy link
Member

@timotheecour timotheecour commented Jun 11, 2020

looks like there is a lag involved, not quite sure why (I thought timers at least on OSX had high resolution).
note that running locally, I can reproduce the flake but only with smaller values close to the t0=100 set in var timer = selector.registerTimer(100, false, 0)

when running locally, proportion of success increases as t increases:

90 => (0, 200)
98 => (11, 200)
100 => (67, 200)
101 => (107, 200)
103 => (161, 200)
104 => (186, 200)
105 => (200, 200)
110 => (200, 200)
140 => (200, 200)

when running on CI, perhaps there is some extra lag (virtualization?)

import os, osproc, strutils
const Iterations = 200
proc test() =
  var count = 0
  let file = "tests/async/tioselectors.nim"
  let exe = file.changeFileExt(ExeExt)
  let opt = "--threads:on -d:threadsafe --excessivestacktrace -d:danger"
  let run = "nim c --lib:lib $1 $2" % [opt, file.quoteShell]
  let (output, status) = execCmdEx run
  doAssert status == 0, output
  var cmds: seq[string]
  for i in 1..Iterations:
    cmds.add exe
  var count1 = 0
  var count2 = 0

  proc afterRunEvent(idx: int; p: Process) =
    let code = p.peekExitCode
    doAssert code != -1
    echo (code, )
    count1 += ord(code == 0)
    count2 += 1

  when true:
    let status2 = execProcesses(cmds, afterRunEvent = afterRunEvent)
    echo (count1, count2)
    doAssert status2 == 0
  else:
    for cmd in cmds:
      let status2i = execProcesses(@[cmd], afterRunEvent = afterRunEvent)
    echo (count1, count2)
when isMainModule: test()

here is some research:

For example, if I set date (the timeout period) to 20 milliseconds, I
often wait 21 milliseconds which is very undesirable for my application.

FreeBSD is not a real-time OS. The timeouts specified in various syscalls (eg kevent(EVFILT_TIMER), nanosleep(), select(), poll()) specify minimum timeouts. Once the timeout (rounded up to the next
tick) has expired, the process will be placed back into the queue of processes eligible to be run by the scheduler - which may impose a further arbitrary delay. Periodic timers are somewhat better behaved: Scheduler delays only impact process scheduling after the timeout expires and the average rate should be very close to that requested.

While it is certainly true that FreeBSD is not a real-time OS, this does not explain the timer problems.

2 or 3 month ago I did a simple test with select and poll: I observed a systematic error of about 3-5% of the waiting time. So when you wait for 20ms, you may get 21ms (if running with
a low HZ value) because of rounding. But if you wait for 100s, you get 103 or even 105s on a completly idle machine (all services disabled).

I think that this is not a problem with beeing non-realtime, but a problem with time-keeping. Shouldn't it be possible to do this better?

I don't think it has anything to do with realtime either. I've been using gentoo linux to run my application using timerfd_create/read for 20 millisecond timing without any problems.

Yes, EVFILT_TIMER events are reliable. I mean the interaction of kqueue with a timer and poll -- the expectation
that poll will return for a kqueue timer expiration.

E) Whether the timer continues to tick across sleep By default all three do not. For interval and absolute, NOTE_MACH_CONTINUOUS_TIME causes them to tick across sleep With NOTE_ABSOLUTE | NOTE_MACHTIME | NOTE_MACH_CONTINUOUS_TIME: expires when mach_continuous_time() is > the passed in value.

On amd64, where hz(9) is 100, 10 milliseconds is indeed the shortest interval possible for the per-process itimer(2) timers. Smaller intervals will be rounded up to 10ms when the timer is scheduled in
the kernel. This is the case for all platforms except for alpha (where hz is 1024) and sh/landisk (where hz is 64). Other timer interfaces will have the same restriction, e.g. the kevent(2) EVFILT_TIMER filter has the same granularity as setitimer(2).

/cc @cheatfate

@timotheecour timotheecour changed the title fix tioselectors flaky test on freebsd+OSX fix #13166 tioselectors flaky test on freebsd+OSX Jun 11, 2020
@timotheecour timotheecour requested a review from Araq June 11, 2020 05:40
@timotheecour
Copy link
Member Author

@euantorano can you LGTM this?

@euantorano
Copy link
Contributor

euantorano commented Jun 12, 2020 via email

Copy link
Contributor

@dom96 dom96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overkill. Just up the timings in the select call, that's all that is needed here. All these changes here just make the code harder to understand and I cannot see them being necessary.

tests/async/tioselectors.nim Show resolved Hide resolved
@timotheecour
Copy link
Member Author

This overkill. Just up the timings in the select call, that's all that is needed here. All these changes here just make the code harder to understand and I cannot see them being necessary.

the extra code helps when investigating this issue so you can run the timer test N times without overhead and get statistical results on how many times it failed without stopping on 1st failure. There are still unknowns here:

  • registerTimer timer doesn't start immediately but select timer starts immediately
  • freebsd doesn't have this problem, but openbsd does have it, and OSX has it even more
  • the problem in CI is even worse even though IIUC OSX doesn't run on a VM in CI

@dom96
Copy link
Contributor

dom96 commented Jun 12, 2020

the extra code helps when investigating this issue so you can run the timer test N times without overhead and get statistical results on how many times

This should be a separate testing tool IMO.

examples/t14634.nim Outdated Show resolved Hide resolved
@timotheecour
Copy link
Member Author

done, PTAL

@timotheecour
Copy link
Member Author

timotheecour commented Jun 13, 2020

@Araq I'm merging this as all comments have been addressed and I'm keeping seeing CI errors due to this flaky test in other PR's

@timotheecour timotheecour merged commit d149823 into nim-lang:devel Jun 13, 2020
@timotheecour timotheecour deleted the pr_fix_13166_tioselectors branch June 13, 2020 04:19
@dom96
Copy link
Contributor

dom96 commented Jun 13, 2020

Nice, thanks for taking my feedback on board.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] tests/async/tioselectors.nim flaky test for freebsd + OSX CI
4 participants