
Integration reporting error before the progress deadline timeout expires #5552

Open

lsergio opened this issue May 27, 2024 · 8 comments
Labels: kind/bug Something isn't working

Comments

@lsergio (Contributor) commented May 27, 2024

What happened?

I have Camel-K running on an EKS cluster with autoscaling groups scaling up to 20 nodes.
At the moment this was reported, 8 nodes were running, and I created a new Integration object.

There was no room for a new pod on the running nodes, so the autoscaler spawned a new one. However, the Integration reported an Error immediately, even before the Deployment progress deadline had expired.

This is the Integration report:

  - lastTransitionTime: "2024-05-27T17:42:16Z"
    lastUpdateTime: "2024-05-27T17:42:16Z"
    message: '0/8 nodes are available: 1 node(s) were unschedulable, 7 Insufficient
      cpu. preemption: 0/8 nodes are available: 1 Preemption is not helpful for scheduling,
      7 No preemption victims found for incoming pod..'
    reason: Error
    status: "False"
    type: Ready

And this is the Deployment status:

status:
  conditions:
  - lastTransitionTime: "2024-05-27T17:42:16Z"
    lastUpdateTime: "2024-05-27T17:42:16Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2024-05-27T17:42:16Z"
    lastUpdateTime: "2024-05-27T17:42:16Z"
    message: ReplicaSet "deploy-4f2a232d-fec3-42ad-b437-b5c47fcf1804-copy-5dbc986949"
      is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing

As we can see, the deployment is still progressing.

I expected the status to be Error only after the progress deadline expired.

Steps to reproduce

No response

Relevant log output

No response

Camel K version

2.2.0

@lsergio added the kind/bug label on May 27, 2024
@lsergio changed the title from "Integration reporting timeout error before the timeout expires" to "Integration reporting error before the timeout expires" on May 27, 2024
@lsergio changed the title from "Integration reporting error before the timeout expires" to "Integration reporting error before the progress deadline timeout expires" on May 27, 2024
@lsergio (Contributor, Author) commented May 27, 2024

After the new node is created, the Integration status changes to Ready = true. The impact on my side is that the Error triggers the wrong workflow in my monitoring application.

@squakez (Contributor) commented May 28, 2024

I think the correct way to monitor a healthy Integration is to watch both .status.phase and .conditions[READY]==true, and ideally you should also include the readiness probe via the health trait to make sure that the Camel context is ready. This is because you probably don't want to blindly trust the Kubernetes Deployment (which, as you can see, is not reporting an error status) but rather the Camel context, which is the application-level component that knows whether something is healthy via its internal mechanisms.
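
For illustration, a rough sketch of such a check in Go, assuming the Integration has been fetched as an unstructured object via the dynamic client (the "Running" phase and the Ready condition values match the status excerpts in this thread; the function name is made up):

package monitoring

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// isIntegrationHealthy treats an Integration as healthy only when its phase is
// "Running" and the Ready condition is "True".
func isIntegrationHealthy(it *unstructured.Unstructured) bool {
	phase, _, _ := unstructured.NestedString(it.Object, "status", "phase")
	if phase != "Running" {
		return false
	}
	conditions, _, _ := unstructured.NestedSlice(it.Object, "status", "conditions")
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "Ready" && cond["status"] == "True" {
			return true
		}
	}
	return false
}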

@lsergio (Contributor, Author) commented May 28, 2024

In this specific case, I think the Deployment is correct in not reporting an error before the deadline expires. Per the docs, the Deployment status should change to ProgressDeadlineExceeded after the 10-minute default timeout (or the configured progressDeadlineSeconds value) expires.
And it does:

  conditions:
  - lastTransitionTime: "2024-05-28T11:23:11Z"
    lastUpdateTime: "2024-05-28T11:23:11Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2024-05-28T11:24:12Z"
    lastUpdateTime: "2024-05-28T11:24:12Z"
    message: ReplicaSet "deploy-2f9e3f35-141a-46e6-a264-f5b82ad00adb-55659fcd77" has
      timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing

For monitoring purposes, I have enabled the Health trait, and I consider the Integration to be healthy when the Ready condition is true and the KnativeServiceReady condition is also true when applicable. This allows me to detect when an Integration is successfully deployed.

My other use case, though, is to detect when an Integration is failing due to a bad component configuration that causes the CamelContext to not start. In this scenario, the Ready condition will be false, but I still need to check the reason or phase to distinguish between a Camel Context that is still starting up and one that has failed. When it fails, the reason changes to Error and I can trigger an alert.

Having that Error reason set while the Deployment is still progressing leads to a false alert.
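
As a rough sketch of that alerting decision (the condition values are as they appear in the excerpts above; the state and function names are made up for illustration):

package monitoring

// integrationState is the coarse classification used by the alerting flow
// described above.
type integrationState int

const (
	stateReady    integrationState = iota // Ready condition is True
	stateStarting                         // Ready is False, Camel context still coming up
	stateFailed                           // Ready is False with reason Error
)

// classifyReady maps the Integration Ready condition to one of the three states.
func classifyReady(status, reason string) integrationState {
	switch {
	case status == "True":
		return stateReady
	case reason == "Error":
		return stateFailed
	default:
		return stateStarting
	}
}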

@lsergio (Contributor, Author) commented May 28, 2024

I checked the source code, and it looks like the Deployment monitor waits for the ProgressDeadlineExceeded reason before reporting an Integration error.

It seems there's something else causing the Integration to report an Error.

@lsergio (Contributor, Author) commented May 28, 2024

After reading this method, I figured out what is happening:

While there are no available nodes, the Integration pod's phase is Pending and it reports:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-05-28T12:00:15Z"
    message: '0/6 nodes are available: 6 Insufficient cpu. preemption: 0/6 nodes are
      available: 6 No preemption victims found for incoming pod..'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

The monitor detects the Unschedulable reason and sets the Error reason in the Integration.
When the new node is ready, that condition changes to:

  - lastProbeTime: null
    lastTransitionTime: "2024-05-28T12:01:07Z"
    status: "True"
    type: PodScheduled

There is probably a good reason for checking the pending Pod statuses, but shouldn't it be enough to check the Deployment status? Any issue with the pods will (or should) be reflected in the Deployment status.

For my specific monitoring case, I will try and check the Deployment status. If it is still Progressing, I will ignore the Error reason.
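
Roughly something like this (only a sketch of the workaround on the monitoring side, assuming access to the backing Deployment; the function name is illustrative, not Camel K code):

package monitoring

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// shouldAlert suppresses the alert while the backing Deployment is still
// within its progress deadline, and only fires once it has stopped progressing.
func shouldAlert(integrationReason string, deploy *appsv1.Deployment) bool {
	if integrationReason != "Error" {
		return false
	}
	for _, c := range deploy.Status.Conditions {
		if c.Type == appsv1.DeploymentProgressing && c.Status == corev1.ConditionTrue {
			// The Deployment is still rolling out (e.g. reason ReplicaSetUpdated):
			// treat the Error reason as transient and do not alert yet.
			return false
		}
	}
	return true
}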

@squakez (Contributor) commented May 28, 2024

The problem is that we need to know whether the application is really starting or not, which is why we are checking the Pod as well. The Deployment would not report an application failure, only a "Deployment" failure (i.e., it cannot schedule for some reason).

@lsergio (Contributor, Author) commented May 28, 2024

I see. Well, one suggestion I have is to check the Deployment status Progressing condition. While it is True, keep checking the Pods, but do not check for the Unschedulable reason. This will still catch any more severe condition, like an ImagePullBackOff, and the Integration Ready condition would be false.

When the Deployment times out, the Deployment Progressing condition will change to False, and then we could check the Pods and get the error message from the Unschedulable ones, setting the Integration Ready condition to False with the Error reason.
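
Something along these lines (only a sketch of the proposed ordering, not the actual monitor code; it assumes the monitor already has the Deployment and its Pods at hand):

package monitoring

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// unschedulableError reports Unschedulable pods as an Integration error only
// after the Deployment has stopped progressing (i.e. the progress deadline has
// been exceeded). It returns the error message and whether an error applies.
func unschedulableError(deploy *appsv1.Deployment, pods []corev1.Pod) (string, bool) {
	for _, c := range deploy.Status.Conditions {
		if c.Type == appsv1.DeploymentProgressing && c.Status == corev1.ConditionTrue {
			// Still within the progress deadline: not an error yet.
			return "", false
		}
	}
	for _, p := range pods {
		for _, pc := range p.Status.Conditions {
			if pc.Type == corev1.PodScheduled && pc.Status == corev1.ConditionFalse &&
				pc.Reason == "Unschedulable" {
				return pc.Message, true
			}
		}
	}
	return "", false
}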

@lburgazzoli (Contributor) commented
The reason behind using the Pod and not the Deployment is that, depending on a number of factors, camel-k can generate a Knative Service, a Job, or a Deployment (and maybe another resource that I don't recall), which makes the Pod the only common denominator among the generated resources.

However, I think we do not handle such a case very well, and I agree: it should not mark the Integration as errored but rather as progressing or something along those lines. That said, I don't know how complex it would be.
