Fix #55, add timeout to SB receive #144

jphickey · 2021-03-29T19:39:39Z

Describe the contribution
Apps should not pend forever on software bus receive, because they should also periodically check CFE_ES_RunLoop even if no
commands were received.

Fixes #55

Testing performed
Build and sanity check CFE, confirm sample_app runs correctly

Expected behavior changes
Sample App will check CFE_ES_Runloop() (and therefore respond to a request from ES) even if it does not get any message on the software bus.

System(s) tested on
Ubuntu 20.04

Additional context
As a "best practice" all apps should do something like this, instead of pending forever.

Contributor Info - All information REQUIRED for consideration of pull request
Joseph Hickey, Vantage Systems, Inc.

Apps should not pend forever on software bus receive, because they should also periodically check CFE_ES_RunLoop even if no commands were received.

skliper · 2021-03-29T20:00:51Z

Is this really "best practice" vs just another possible implementation? Could be more confusing in terms of performance monitoring, really just another thing to monitor/manage when the app should be getting messages at a regular rate already, if that regular wake-up isn't happening there's bigger issues than allowing the app to process the run loop. The timeout really is now asynchronous relative to regular scheduling so could set up weird races in certain conditions...

jphickey · 2021-03-29T20:11:31Z

Is this really "best practice" vs just another possible implementation? Could be more confusing in terms of performance monitoring, really just another thing to monitor/manage when the app should be getting messages at a regular rate already, if that regular wake-up isn't happening there's bigger issues than allowing the app to process the run loop. The timeout really is now asynchronous relative to regular scheduling so could set up weird races in certain conditions...

Disagree on that, it follows the normal schedule unless there is no activity for 1000ms, then it wakes up to check CFE_ES_RunLoop and then continues waiting. It will still follow the precise schedule from SCH if the app is scheduled in this method. There really isn't much downside to this pattern that I can see.

The main point (and why I called it "best practice") is that apps should be checking CFE_ES_Runloop periodically, no matter what other work they are doing - because this is the only way they can "notice" an admin request like CFE_ES_STOP_APP_CC etc.

Sure, one could ensure that the app is also scheduled by SCH to do the same so it never pends forever inside CFE_SB_ReceiveBuffer, but now if SCH has an issue then one will lose the ability to control the app. And also how can you restart SCH itself?

I suppose the comment about the possible extra ticks in the performance log is valid, but if this is a concern we can always move the Entry/Exit to be inside the if (status == CFE_SUCCESS) branch, so it only shows up in perf log when a real command was pending. But I don't think this is a big concern, because it is "honest" - compared to the alternative of not being able to shut down the app from ES command.

skliper · 2021-03-29T20:40:45Z

Still not buying the justification for additional complexity and promoting as the "best practice". I see it as app developers (or operational app management owner's) choice. If you want to pend, pend. If you want to pend w/ timeout, do that. If you want to just poll your pipe and process low priority work in the background, then do that. It seems like the trade is well known, I don't see a huge benefit to timeout in the sample app unless there's a clear operational need. The app is already periodically checking CFE_ES_Runloop based on the periodic SCH message, it's only off-nominal cases where it wouldn't... and users can select whatever behavior they want based on their app implementation for those off-nominal cases.

skliper · 2021-03-29T20:48:37Z

Or are we saying every app should nominally check CFE_ES_Runloop at 1 Hz (either by a scheduled message or a time out)? Seems like it's still developer's choice as to how that is satisfied... but if there's a rate at which it should be checked maybe it makes more sense in that context vs if it's a timeout or not?

jphickey · 2021-03-29T20:50:12Z

Here's the thing - it's important to call CFE_ES_RunLoop(), regardless of whatever else your app my be doing.

That is just the design of CFE - apps need to cooperate in order to respond to an admin request, and this function is the only way that info is communicated to the app. So apps need to make sure they cycle through their run loop fairly regularly.

As a general design principle, the thread that is responsible for calling CFE_ES_RunLoop shouldn't ever "pend forever" on anything. It could be a file read/write, a socket operation, anything.

Rule of thumb should be:

If you pend on anything in main loop, use a timeout
If you want to pend forever on anything, use a child task to do so

This seems pretty simple to me, and makes sure your apps behave even if other apps do not behave (i.e. self sufficient design). Correct behavior of one app in this regard should not depend on external stimuli. But maybe that's just me.

This seemed like a very simple way to ensure that we call CFE_ES_RunLoop and can control that app even if SCH goes out to lunch. I am surprised it has met pushback. I don't really care that much, if you don't like it, drop this PR.

jphickey · 2021-03-29T20:57:46Z

The absolute maximum period that an app can delay between calls to CFE_ES_RunLoop() is the time delay of the background task between the request and the forced shutdown, which is about 5 seconds in default config. But that doesn't leave any time for a graceful shutdown. So in practice you'd want to use a shorter value. I just tossed in 1 second for example, but it can be anything.

In most non-trivial apps there is more work for the main loop to do than just pend on the SB receive. Even the CI_LAB/TO_LAB do more work here. So a general pattern of putting a timeout here in the main loop, again I can't see how it really hurts anything, I'd rather have it polling extra times than not enough.

skliper · 2021-03-29T21:12:20Z

I don't have a problem with a timeout, certainly doesn't hurt. Although the only "correct behavior" the change seems to supports is the apps response to management requests in the absence of it's nominally scheduled message. If you can pend forever in a child then you are setting up a similar failure scenario which requires forceful cleanup, just at the child level. Why force an app to have a child if the app really should just pend? Again, I have no problem at all providing this as an example or even recommendation, but it's not the only way to do things nor do I see it as a requirement that all apps time out (if you allow children to pend which could require forceful cleanup then why not allow apps to also rely on the forceful cleanup if that's the desired behavior?)

jphickey · 2021-03-29T21:27:36Z

True it is not the only way to do things, but in general one shouldn't rely on a complex path of dependent logic (i.e. SCH running, sending to the command, successful subscription, etc) in order meet the requirement of checking CFE_ES_RunLoop on a regular basis.

There are lots of ways were relying on SCH (or anything else) to make a forever pend into not being forever can go wrong. The MsgId could be wrong, the subscription can fail for a number of reasons. True that most/any of those conditions are bugs - but does one want to be left in a state where the app has essentially gotten "hung" and we need to force-delete?

I'm not as worried about pending forever in a child task because as long as the main task reacts to the pending ES request, it can do whatever it needs to do to wake/clean up the child task. The point is that the app itself is in a better position to know if its dangerous (or not) to do a force delete than ES is, in the course of a normal shutdown.

This is why I referred to it as "best practice" earlier - not implying there aren't other ways of doing it, but a robust design pattern shouldn't depend on several external apps all being running and properly configured in order make basic, important control operations like CFE_ES_STOP_APP_CC/RESTART_APP_CC/RELOAD_APP_CC etc work as intended.

skliper · 2021-03-30T18:49:15Z

I do agree it makes sense for sample app based on how it's currently scheduled and how we use it. I also agree any app for which forced delete should be avoided needs to check CFE_ES_RunLoop (regardless of external triggers) at a rate faster than the forced delete timeout. As long as the comments are updated to reflect that I'm fine with the change.

skliper

Update comments. "Best practice" should be to implement requirements and minimize complexity, not always timeout/never pend. There are plenty of cases where a pend is appropriate based on system context/design and comments shouldn't imply otherwise. Requiring a timeout is applicable to a system in which you need to do app management and graceful app cleanup in absence of the "trigger" (note likely still need to handle forceful shutdown/exit via exceptions/resets/power cycles/etc).

astrogeco · 2021-03-31T17:10:03Z

CCB:2021-03-31 APPROVED

is the documentation "too forceful"?

astrogeco · 2021-04-08T15:37:02Z

Update comments. "Best practice" should be to implement requirements and minimize complexity, not always timeout/never pend. There are plenty of cases where a pend is appropriate based on system context/design and comments shouldn't imply otherwise. Requiring a timeout is applicable to a system in which you need to do app management and graceful app cleanup in absence of the "trigger" (note likely still need to handle forceful shutdown/exit via exceptions/resets/power cycles/etc).

@zanzaben was this addressed?

skliper · 2021-04-08T15:40:04Z

Update comments. "Best practice" should be to implement requirements and minimize complexity, not always timeout/never pend. There are plenty of cases where a pend is appropriate based on system context/design and comments shouldn't imply otherwise. Requiring a timeout is applicable to a system in which you need to do app management and graceful app cleanup in absence of the "trigger" (note likely still need to handle forceful shutdown/exit via exceptions/resets/power cycles/etc).

@zanzaben was this addressed?

Those were my comments. Not addressed yet, I'll provide suggested updated comments (or feel free if anyone else wants to take a shot at it).

astrogeco · 2021-04-08T15:54:14Z

So, to merge or not to merge?

skliper · 2021-04-08T16:02:15Z

Don't merge until resolved. Thx.

skliper · 2021-04-28T16:36:50Z

Marking to ignore until we get the requested change.

astrogeco · 2021-04-28T16:36:52Z

CCB:2021-04-28 continue postponing

Fix nasa#55, add timeout to SB receive

531a47f

Apps should not pend forever on software bus receive, because they should also periodically check CFE_ES_RunLoop even if no commands were received.

jphickey added the CCB:Ready Pull request is ready for discussion at the Configuration Control Board (CCB) label Mar 29, 2021

skliper requested changes Mar 31, 2021

View reviewed changes

astrogeco added CCB:Approved Indicates code approval by CCB and removed CCB:Ready Pull request is ready for discussion at the Configuration Control Board (CCB) labels Apr 7, 2021

astrogeco changed the base branch from main to integration-candidate April 8, 2021 15:35

astrogeco added the IC:2021-04-06 label Apr 8, 2021

astrogeco added CCB:Approved Indicates code approval by CCB and removed CCB:Approved Indicates code approval by CCB IC:2021-04-06 labels Apr 8, 2021

astrogeco removed the CCB:Approved Indicates code approval by CCB label Apr 19, 2021

skliper added the CCB:Ignore Pull Request is NOT ready for discussion. Has open actions. Will be re-examined at by next CCB. label Apr 28, 2021

skliper self-requested a review April 28, 2021 16:44

dzbaker assigned jphickey Sep 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #55, add timeout to SB receive #144

Fix #55, add timeout to SB receive #144

jphickey commented Mar 29, 2021

skliper commented Mar 29, 2021

jphickey commented Mar 29, 2021

skliper commented Mar 29, 2021

skliper commented Mar 29, 2021

jphickey commented Mar 29, 2021

jphickey commented Mar 29, 2021

skliper commented Mar 29, 2021

jphickey commented Mar 29, 2021

skliper commented Mar 30, 2021

skliper left a comment

astrogeco commented Mar 31, 2021

astrogeco commented Apr 8, 2021

skliper commented Apr 8, 2021

astrogeco commented Apr 8, 2021

skliper commented Apr 8, 2021

skliper commented Apr 28, 2021

astrogeco commented Apr 28, 2021

Fix #55, add timeout to SB receive #144

Are you sure you want to change the base?

Fix #55, add timeout to SB receive #144

Conversation

jphickey commented Mar 29, 2021

skliper commented Mar 29, 2021

jphickey commented Mar 29, 2021

skliper commented Mar 29, 2021

skliper commented Mar 29, 2021

jphickey commented Mar 29, 2021

jphickey commented Mar 29, 2021

skliper commented Mar 29, 2021

jphickey commented Mar 29, 2021

skliper commented Mar 30, 2021

skliper left a comment

Choose a reason for hiding this comment

astrogeco commented Mar 31, 2021

astrogeco commented Apr 8, 2021

skliper commented Apr 8, 2021

astrogeco commented Apr 8, 2021

skliper commented Apr 8, 2021

skliper commented Apr 28, 2021

astrogeco commented Apr 28, 2021