Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #55, add timeout to SB receive #144

Open
wants to merge 1 commit into
base: integration-candidate
Choose a base branch
from

Conversation

jphickey
Copy link
Contributor

Describe the contribution
Apps should not pend forever on software bus receive, because they should also periodically check CFE_ES_RunLoop even if no
commands were received.

Fixes #55

Testing performed
Build and sanity check CFE, confirm sample_app runs correctly

Expected behavior changes
Sample App will check CFE_ES_Runloop() (and therefore respond to a request from ES) even if it does not get any message on the software bus.

System(s) tested on
Ubuntu 20.04

Additional context
As a "best practice" all apps should do something like this, instead of pending forever.

Contributor Info - All information REQUIRED for consideration of pull request
Joseph Hickey, Vantage Systems, Inc.

Apps should not pend forever on software bus receive, because
they should also periodically check CFE_ES_RunLoop even if no
commands were received.
@jphickey jphickey added the CCB:Ready Pull request is ready for discussion at the Configuration Control Board (CCB) label Mar 29, 2021
@skliper
Copy link
Contributor

skliper commented Mar 29, 2021

Is this really "best practice" vs just another possible implementation? Could be more confusing in terms of performance monitoring, really just another thing to monitor/manage when the app should be getting messages at a regular rate already, if that regular wake-up isn't happening there's bigger issues than allowing the app to process the run loop. The timeout really is now asynchronous relative to regular scheduling so could set up weird races in certain conditions...

@jphickey
Copy link
Contributor Author

Is this really "best practice" vs just another possible implementation? Could be more confusing in terms of performance monitoring, really just another thing to monitor/manage when the app should be getting messages at a regular rate already, if that regular wake-up isn't happening there's bigger issues than allowing the app to process the run loop. The timeout really is now asynchronous relative to regular scheduling so could set up weird races in certain conditions...

Disagree on that, it follows the normal schedule unless there is no activity for 1000ms, then it wakes up to check CFE_ES_RunLoop and then continues waiting. It will still follow the precise schedule from SCH if the app is scheduled in this method. There really isn't much downside to this pattern that I can see.

The main point (and why I called it "best practice") is that apps should be checking CFE_ES_Runloop periodically, no matter what other work they are doing - because this is the only way they can "notice" an admin request like CFE_ES_STOP_APP_CC etc.

Sure, one could ensure that the app is also scheduled by SCH to do the same so it never pends forever inside CFE_SB_ReceiveBuffer, but now if SCH has an issue then one will lose the ability to control the app. And also how can you restart SCH itself?

I suppose the comment about the possible extra ticks in the performance log is valid, but if this is a concern we can always move the Entry/Exit to be inside the if (status == CFE_SUCCESS) branch, so it only shows up in perf log when a real command was pending. But I don't think this is a big concern, because it is "honest" - compared to the alternative of not being able to shut down the app from ES command.

@skliper
Copy link
Contributor

skliper commented Mar 29, 2021

Still not buying the justification for additional complexity and promoting as the "best practice". I see it as app developers (or operational app management owner's) choice. If you want to pend, pend. If you want to pend w/ timeout, do that. If you want to just poll your pipe and process low priority work in the background, then do that. It seems like the trade is well known, I don't see a huge benefit to timeout in the sample app unless there's a clear operational need. The app is already periodically checking CFE_ES_Runloop based on the periodic SCH message, it's only off-nominal cases where it wouldn't... and users can select whatever behavior they want based on their app implementation for those off-nominal cases.

@skliper
Copy link
Contributor

skliper commented Mar 29, 2021

Or are we saying every app should nominally check CFE_ES_Runloop at 1 Hz (either by a scheduled message or a time out)? Seems like it's still developer's choice as to how that is satisfied... but if there's a rate at which it should be checked maybe it makes more sense in that context vs if it's a timeout or not?

@jphickey
Copy link
Contributor Author

Here's the thing - it's important to call CFE_ES_RunLoop(), regardless of whatever else your app my be doing.

That is just the design of CFE - apps need to cooperate in order to respond to an admin request, and this function is the only way that info is communicated to the app. So apps need to make sure they cycle through their run loop fairly regularly.

As a general design principle, the thread that is responsible for calling CFE_ES_RunLoop shouldn't ever "pend forever" on anything. It could be a file read/write, a socket operation, anything.

Rule of thumb should be:

  • If you pend on anything in main loop, use a timeout
  • If you want to pend forever on anything, use a child task to do so

This seems pretty simple to me, and makes sure your apps behave even if other apps do not behave (i.e. self sufficient design). Correct behavior of one app in this regard should not depend on external stimuli. But maybe that's just me.

This seemed like a very simple way to ensure that we call CFE_ES_RunLoop and can control that app even if SCH goes out to lunch. I am surprised it has met pushback. I don't really care that much, if you don't like it, drop this PR.

@jphickey
Copy link
Contributor Author

The absolute maximum period that an app can delay between calls to CFE_ES_RunLoop() is the time delay of the background task between the request and the forced shutdown, which is about 5 seconds in default config. But that doesn't leave any time for a graceful shutdown. So in practice you'd want to use a shorter value. I just tossed in 1 second for example, but it can be anything.

In most non-trivial apps there is more work for the main loop to do than just pend on the SB receive. Even the CI_LAB/TO_LAB do more work here. So a general pattern of putting a timeout here in the main loop, again I can't see how it really hurts anything, I'd rather have it polling extra times than not enough.

@skliper
Copy link
Contributor

skliper commented Mar 29, 2021

I don't have a problem with a timeout, certainly doesn't hurt. Although the only "correct behavior" the change seems to supports is the apps response to management requests in the absence of it's nominally scheduled message. If you can pend forever in a child then you are setting up a similar failure scenario which requires forceful cleanup, just at the child level. Why force an app to have a child if the app really should just pend? Again, I have no problem at all providing this as an example or even recommendation, but it's not the only way to do things nor do I see it as a requirement that all apps time out (if you allow children to pend which could require forceful cleanup then why not allow apps to also rely on the forceful cleanup if that's the desired behavior?)

@jphickey
Copy link
Contributor Author

True it is not the only way to do things, but in general one shouldn't rely on a complex path of dependent logic (i.e. SCH running, sending to the command, successful subscription, etc) in order meet the requirement of checking CFE_ES_RunLoop on a regular basis.

There are lots of ways were relying on SCH (or anything else) to make a forever pend into not being forever can go wrong. The MsgId could be wrong, the subscription can fail for a number of reasons. True that most/any of those conditions are bugs - but does one want to be left in a state where the app has essentially gotten "hung" and we need to force-delete?

I'm not as worried about pending forever in a child task because as long as the main task reacts to the pending ES request, it can do whatever it needs to do to wake/clean up the child task. The point is that the app itself is in a better position to know if its dangerous (or not) to do a force delete than ES is, in the course of a normal shutdown.

This is why I referred to it as "best practice" earlier - not implying there aren't other ways of doing it, but a robust design pattern shouldn't depend on several external apps all being running and properly configured in order make basic, important control operations like CFE_ES_STOP_APP_CC/RESTART_APP_CC/RELOAD_APP_CC etc work as intended.

@skliper
Copy link
Contributor

skliper commented Mar 30, 2021

I do agree it makes sense for sample app based on how it's currently scheduled and how we use it. I also agree any app for which forced delete should be avoided needs to check CFE_ES_RunLoop (regardless of external triggers) at a rate faster than the forced delete timeout. As long as the comments are updated to reflect that I'm fine with the change.

Copy link
Contributor

@skliper skliper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update comments. "Best practice" should be to implement requirements and minimize complexity, not always timeout/never pend. There are plenty of cases where a pend is appropriate based on system context/design and comments shouldn't imply otherwise. Requiring a timeout is applicable to a system in which you need to do app management and graceful app cleanup in absence of the "trigger" (note likely still need to handle forceful shutdown/exit via exceptions/resets/power cycles/etc).

@astrogeco
Copy link
Contributor

CCB:2021-03-31 APPROVED

  • is the documentation "too forceful"?

@astrogeco astrogeco added CCB:Approved Indicates code approval by CCB and removed CCB:Ready Pull request is ready for discussion at the Configuration Control Board (CCB) labels Apr 7, 2021
@astrogeco astrogeco changed the base branch from main to integration-candidate April 8, 2021 15:35
@astrogeco
Copy link
Contributor

Update comments. "Best practice" should be to implement requirements and minimize complexity, not always timeout/never pend. There are plenty of cases where a pend is appropriate based on system context/design and comments shouldn't imply otherwise. Requiring a timeout is applicable to a system in which you need to do app management and graceful app cleanup in absence of the "trigger" (note likely still need to handle forceful shutdown/exit via exceptions/resets/power cycles/etc).

@zanzaben was this addressed?

@skliper
Copy link
Contributor

skliper commented Apr 8, 2021

Update comments. "Best practice" should be to implement requirements and minimize complexity, not always timeout/never pend. There are plenty of cases where a pend is appropriate based on system context/design and comments shouldn't imply otherwise. Requiring a timeout is applicable to a system in which you need to do app management and graceful app cleanup in absence of the "trigger" (note likely still need to handle forceful shutdown/exit via exceptions/resets/power cycles/etc).

@zanzaben was this addressed?

Those were my comments. Not addressed yet, I'll provide suggested updated comments (or feel free if anyone else wants to take a shot at it).

@astrogeco
Copy link
Contributor

So, to merge or not to merge?

@skliper
Copy link
Contributor

skliper commented Apr 8, 2021

Don't merge until resolved. Thx.

@astrogeco astrogeco added CCB:Approved Indicates code approval by CCB and removed CCB:Approved Indicates code approval by CCB IC:2021-04-06 labels Apr 8, 2021
@astrogeco astrogeco removed the CCB:Approved Indicates code approval by CCB label Apr 19, 2021
@skliper skliper added the CCB:Ignore Pull Request is NOT ready for discussion. Has open actions. Will be re-examined at by next CCB. label Apr 28, 2021
@skliper
Copy link
Contributor

skliper commented Apr 28, 2021

Marking to ignore until we get the requested change.

@astrogeco
Copy link
Contributor

CCB:2021-04-28 continue postponing

@skliper skliper self-requested a review April 28, 2021 16:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CCB:Ignore Pull Request is NOT ready for discussion. Has open actions. Will be re-examined at by next CCB.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Apps should not pend forever on software bus
3 participants