Skip to content

chore: add podAffinity and labels to identify the pipeline#3345

Closed
eliasscosta wants to merge 3 commits into
woodpecker-ci:mainfrom
eliasscosta:main
Closed

chore: add podAffinity and labels to identify the pipeline#3345
eliasscosta wants to merge 3 commits into
woodpecker-ci:mainfrom
eliasscosta:main

Conversation

@eliasscosta

@eliasscosta eliasscosta commented Feb 7, 2024

Copy link
Copy Markdown
Contributor

addressing #3344

The idea is to run in the same node when we need to run parallel or service steps and have a storage RWX is false.

@eliasscosta eliasscosta marked this pull request as ready for review February 7, 2024 11:55
Comment thread pipeline/backend/kubernetes/pod.go Outdated
Comment thread pipeline/backend/kubernetes/pod.go Outdated
Comment thread pipeline/backend/kubernetes/pod.go Outdated
Comment thread pipeline/backend/kubernetes/pod.go Outdated
Comment thread pipeline/backend/kubernetes/pod.go Outdated
Comment thread pipeline/backend/kubernetes/pod.go Outdated
Comment thread pipeline/backend/kubernetes/pod_test.go Outdated
Comment thread pipeline/backend/kubernetes/pod_test.go Outdated
@qwerty287 qwerty287 added the enhancement improve existing features label Feb 7, 2024
@eliasscosta eliasscosta requested a review from zc-devs February 7, 2024 19:35
@codecov

codecov Bot commented Feb 7, 2024

Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9e96c9c) 36.32% compared to head (677646b) 35.87%.

❗ Current head 677646b differs from pull request most recent head c2c7da1. Consider uploading reports for the commit c2c7da1 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3345      +/-   ##
==========================================
- Coverage   36.32%   35.87%   -0.46%     
==========================================
  Files         224      227       +3     
  Lines       14759    15145     +386     
==========================================
+ Hits         5361     5433      +72     
- Misses       9008     9313     +305     
- Partials      390      399       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pat-s pat-s left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Affinity does not guarantee that a pod is placed on the same node, it is just a "recommendation", if possible. Why did you choose this one instead of nodeSelector?

What if the user adds manual affinity or nodeSelector settings into backend_options?

(We use spaces instead of tabs - this is also enforced by the linter in CI. Please see the editorconfig & friends.)

@eliasscosta

Copy link
Copy Markdown
Contributor Author

Affinity does not guarantee that a pod is placed on the same node, it is just a "recommendation", if possible. Why did you choose this one instead of nodeSelector?

Yes, we already use the nodeSelector and it selects a node from a pool with the same name/label, not the same node where the initial (service/step) starts to run. Then when it happened we got the error to to mount the volume. The idea is to try to run a service or pod(steps) in parallel, to achieve it, we need to run the pods in the same node.

What if the user adds manual affinity or nodeSelector settings into backend_options?

Today we don't have the configuration to change the affinity in the backend_options, if someone sets the nodeSelector it continues to work in the same way. Because this configuration I proposal is just when we have Storage RWX as false

(We use spaces instead of tabs - this is also enforced by the linter in CI. Please see the editorconfig & friends.)

I don't get it, do you have the link to it?

@eliasscosta eliasscosta requested a review from pat-s February 8, 2024 12:36
@eliasscosta eliasscosta closed this Feb 8, 2024
@eliasscosta eliasscosta reopened this Feb 8, 2024
@zc-devs

zc-devs commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

Why did you choose this one instead of nodeSelector?

The nodeSelector doesn't guarantee the steps run in the same node
As I understand. Suppose, we have selector:

nodeSelector:
  arch: amd64
  cpu: amd

and have 50 nodes with AMD processors. Then volume and all steps may be scheduled on separate nodes anyway.

we need to run the pods in the same node
where the initial (service/step) starts to run

Why do you stick to the Pods, pipeline (pipeline_number, repository)?
You have the problem: Pods run on other node than Volume.
Then logical solution should be: run Pods on Node, where Volume is created. What do you think?

We use spaces instead of tabs

It was, as I remember, pretty long time ago. Some time it changed to tabs for a code.
And in test it was / is a mix.
As now pod_test.go uses Tabs only, it is not relevant to this PR, I think.

podPrefix = "wp-"
StepLabel = "step"
podPrefix = "wp-"
PipelineNumberLabel = "pipeline_number"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we use dashes, I suggest to change to pipeline-number and workflow-name.

@pat-s

pat-s commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

#3252 (comment)
As I understand. Suppose, we have selector

No, nodeSelector is a harder scheduling setting than affinity.

  • nodeSelector: pods are only required on nodes which match the key:value
  • affinity: pods are preferred to run on nodes which match the key:value but won't be blocked to also run on other nodes

In the end, it all comes down to the key:value labels used and how "unique" they are to specific nodes.

and have 50 nodes with AMD processors. Then volume and all steps may be scheduled on separate nodes anyway.

Yes, because the key:value applies to multiple nodes. Yet if any approach is desired to fully control pod scheduling then it should be nodeSelector and not affinity (regardless of the key:value pairs used).

@pat-s pat-s left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought about this proposal again, and I am pretty sure this PR shouldn't be merged and labeled as 'wontfix'. Here's why:

  • RWX is used primarily to solve exactly such cases, i.e. step pods are allowed to be scheduled on different nodes and RWX can be attached to different nodes, providing the required underlying sources to all nodes
  • A RWO workaround is not really feasible since the only unique identifier for a specific single node is its IP. An alternative for this (beside nodeSelector) is to use tains and tolerations, yet this is also just partly addressing the issue if there are multiple "CI nodes".

I don't see a dynamic workaround in which WP infers on which node it started and then uses the same one dynamically for future steps of the same pipeline (please correct me if I am overlooking something).

In the same way, I'd argue that having a RWX is a basic element of a cluster and that implementing complex RWO scheduling workarounds are not worth it in contrast to the potential gains.

@eliasscosta eliasscosta closed this Feb 9, 2024
@zc-devs

zc-devs commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

to fully control pod scheduling then it should be nodeSelector and not affinity

Yes, but we (users) don't need to full control. We don't want to manually pick up the Node. We want run this Pods on the same Node, where those Volume was created.

This is how K3s Local Path Provisioner works:

  1. Create PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
  namespace: test-woodpecker-runtime
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1G
  1. Kubernetes creates PVC in pending status.
  2. Create Pod:
apiVersion: v1
kind: Pod
metadata:
  name: pod-test
  namespace: test-woodpecker-runtime
spec:
  containers:
    - name: test
      image: alpine
      command:
        - /bin/sh
        - "-c"
      args:
        - sleep 300
      volumeMounts:
        - name: vol
          mountPath: /woodpecker
  volumes:
    - name: vol
      persistentVolumeClaim:
        claimName: pvc-test
  1. Kubernetes creates creates PV, binds PVC
  2. and creates Pod

All Pods run on the same node as Volume:

spec:
  nodeName: l7-1

I would like if it was implemented like above, except binding step:

  1. Create Volume.
  2. Get the Node.
  3. Create steps/Pods on the Node from step 2.

@zc-devs

zc-devs commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

I am pretty sure this PR shouldn't be merged and labeled as 'wontfix'

I disagree. I can translate current PR's implementation into run whole pipeline on the same Node requirements / feature request, which in itself deserves attention, although I prefer another solution.

@pat-s

pat-s commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

We want run this Pods on the same Node, where those Volume was created.

I understand what you want but still, affinity is the wrong approach for this as it is not at all guaranteeing this. When you're operating on more nodes, you have no guarantee for a pod to be scheduled on a specific node, yet at all to infer which one a specific PVC is attached to in a dynamic way.

To achieve what you want, you likely want to consider running all steps in a single pod instead of creating a new pod for each step (this is not possible right now). This has been proposed somewhere else already (can't find it), and there was some discussion about it. I think this should be a way to go as it would also address some requests related to caching.

Going this way, the whole pipeline would run on the same node without a complex logic to dynamically infer where the next upcoming pod should go to due to the need of using a RWO.
This would also speed up k8s execution in general since the mount operation and pod creation would only happen once.

@pat-s

pat-s commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

I disagree. I can translate current PR's implementation into run whole pipeline on the same Node requirements / feature request, which in itself deserves attention, although I prefer another solution.

So how should this be done? The only way I can think of would be

  1. Let the first pod be scheduled by the k8s scheduler
  2. From step 2 onwards, grep the node the PVC is being attached to
  3. Find a unique label to defined as nodeSelector and conditionally create upcoming pods with that nodeSelector

All of the above would require substantial code changes and would possibly conflict with the idea to move from the multi-pod approach to a multi-container approach, which would also solve the above.

@zc-devs

zc-devs commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

The only way I can think of would be

Seems, you are right in case I proposed. If it looks too complicated, I'm fine with current PR's implementation in general.
However, we can get rid of rwx/rwo condition logic and just bring new kubernetes backend option like try-to-run-on-same-node. Does it suit you @pat-s, @eliasscosta?

multi-container approach

This will implemented (if) in a couple years from now. User has a problem now and even proposed a fix.

@pat-s

pat-s commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

This will implemented (if) in a couple years from now. User has a problem now and even proposed a fix.

Not if someone pushed into this direction. It is actually not too complicated to get this implemented. Definitely not more complicated than implementing the workaround of this PR.

Seems, you are right in case I proposed. If it looks too complicated, I'm fine with current PR's implementation in general.
However, we can get rid of rwx/rwo condition logic and just bring new kubernetes backend option like try-to-run-on-same-node. Does it suit you @pat-s, @eliasscosta?

I am not fine with the current PR, though I am not in the position of solely blocking such PRs. But I see just this not being a solid solution for the issue at hand, it will cause additional issues in the future and makes use of affinity in a way it shouldn't be used.
If this gets merged, it will likely not be removed anymore either and being workaround another time and so forth. If it works for OP and his setup, this might be a candidate for a fork which needs to be synced with upstream.

Also, there are additional solutions which are actually more "simple": use a RWX, which should be a basic component on every cluster.

Overall, let me use a metaphor to describe the discussion here how it unfolds in my mind: If green and red cars exists, you also don't buy a red one and paint it green but rather sell the red one and buy a green one instead 🙃 (might not be a 100% fit but it should transport the message).

@eliasscosta

Copy link
Copy Markdown
Contributor Author

Thanks, everyone, it was a rich discussion and I decided to implement it outside to resolve our case. I already have an injector to add more labels for troubleshooting and then I will add affinity logic too.

We use AWS to run the k8s and RWX isn't an option for us, our primary tests the steps run too slowly using (EFS) and if we use other solutions (like: nfs operator) inside of the cluster maybe turn too complex the cluster management

I'm not comfortable with the multi-container approach for my case, I guess is too complex for a simple pod creation. But is an option to think about.

I try to follow the Kubernetes doc and use the options we have when is not possible to use the RWX.

Maybe If we have more labels on the pod and a way to set up the Affinity by an option in the agent would be great.

@pat-s

pat-s commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

I wonder what you're doing when RWX is limiting WRT to performance. Judging from my experience within the last years, I usually get suspicious when RWX is being blamed to be the major issue unless it comes to webservers and other tool operating with many small files which require very small latency.

Note that you can also attach additional volumes to a pod which can be used for certain compute tasks. Or the node's tempdir. You can also copy the repo to this volume before performing any tasks and clean up afterward (or set a storageClass policy to delete the volume after use). Not everything must necessarily happen on the RWX holding the repo data.

@eliasscosta

Copy link
Copy Markdown
Contributor Author

I wonder what you're doing when RWX is limiting WRT to performance. Judging from my experience within the last years, I usually get suspicious when RWX is being blamed to be the major issue unless it comes to webservers and other tool operating with many small files which require very small latency.

Note that you can also attach additional volumes to a pod which can be used for certain compute tasks. Or the node's tempdir. You can also copy the repo to this volume before performing any tasks and clean up afterward (or set a storageClass policy to delete the volume after use). Not everything must necessarily happen on the RWX holding the repo data.

Yes, the problem is the volume created by default for the pipeline, if we use a service the next pod try to get the same volume. If they don't run in the same node the pipeline get stuck running forever trying to mount.

@fernandrone

fernandrone commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

Thanks everyone for the comments. I'm a colleague to @eliasscosta so maybe I can bring some extra perspective.

I wonder what you're doing when RWX is limiting WRT to performance. Judging from my experience within the last years, I usually get suspicious when RWX is being blamed to be the major issue unless it comes to webservers and other tool operating with many small files which require very small latency.

We don't have any benchmarks at hand but our issue with Amazon EFS is that it was painfully slow for many common i/o intensive tasks. That would be the likes of npm install (which I think that classifies as requiring "many small files"). We're talking about 3-4x slower when compared to our default 3000 iops gp3 disks, so even small projects would take 20 minutes to install their dependencies.

It's possible that there's more that we could've done on that front, and unfortunately that was some while ago so I don't think I can get exact numbers and configuration to help. That's what ultimately pushed us away from EFS, so I'm not sure that would ever be an alternative for many workloads.

Note that you can also attach additional volumes to a pod which can be used for certain compute tasks. Or the node's tempdir. You can also copy the repo to this volume before performing any tasks and clean up afterward (or set a storageClass policy to delete the volume after use). Not everything must necessarily happen on the RWX holding the repo data.

Maybe I don't have the right picture of how this would work in practice, but if we assume that the NFS is unusably slow for many CI tasks, I can see this being a big burden on users:

  1. project is cloned to NFS volume
  2. copy project to separate ReadWriteOnce volume
  3. run npm install
  4. copy back to NFS so the result may be used by the next pod in the pipeline

And that assuming that the copy operation itself wouldn't be a bottleneck.

Affinity does not guarantee that a pod is placed on the same node, it is just a "recommendation", if possible

@pat-s it seems part of the issue here (not everything, I understand that you also have valid criticisms against the design itself, not just the implementation) is this, that affinity supposedly does not guarantee anything. But doesn't it?

I believe that preferredDuringSchedulingIgnoredDuringExecution indeed does not guarantee anything. However requiredDuringSchedulingIgnoredDuringExecution, as Elias implemented, should have guarantees, and will refuse to schedule if the rules are not met. It's explained here in the documentation:

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity

image

The quote above is for the nodeAffinity rules, but still, the same logic applies for podAffinity. Am I missing something? I understand if you prefer another design but I believe this code guarantees the pods are scheduled on the same node 100% of the time (or else they're not scheduled, which is technically a possibility, i.e. if a node does not have enough resources...).

What if the user adds manual affinity or nodeSelector settings into backend_options?

I agree this could conflict with this implementation. In fact, it seems this could cause a regression and break working pipelines. Thus if we were to move forward (and I have a vested interest in moving forward because I'd prefer to not have to run a fork) I do like the idea of an opt-in flag like try-to-run-on-same-node.

@eliasscosta eliasscosta reopened this Feb 9, 2024
@pat-s

pat-s commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

@fernandrone Thanks for the extensive reply, appreciated.

Yes, requiredDuringSchedulingIgnoredDuringExecution can used so that affinity behaves essentially like nodeSelector, yet this is usually not what it should be used for (even though, as noted, a more expressive syntax can be used).
Until now I was aware that this was the actual core idea of the implementation because when I read "affinity" I think of "soft affinity" and not the idea to invoke it as an implicit fixed nodeSelector.

RWX

Yes, RWX can be slow when installing many packages like node or R packages. Yet if this happens in parallel, and we're not talking about 1k packages, it is usually not adding up to 20 mins.
Of course the use case of copying to a RWO tempdir is not something that should become the default, yet in your case, I think it would help to workaround and provide a much easier solution than the proposed affinity solution.

Using RWO in general

I see the advantages of having a RWO as the base PV. And we surely want to make everything better and not limit users or provide a bottlenecked system in any way.
However, a workaround is often "quickly" implemented but not often the best overall solution.
I still think that using the "one-pod-many-containers" solution is by far the best solution for the underlying problems, and also comes with the benefit of less time spent between the pod creations. Which is why my proposal would be to invest time on this approach and not to implement the proposed solution of this PR as it would also interfere with user-provided backend_options overwrites.

Disclaimer: This is just my opinion, and I am only one maintainer. If others join the discussion and overwrite my vote, I am happy to pursue another way. Overall: Just here to help and make things better/less complicated in the long run, not to block anything from being improved. 🤞

@fernandrone

fernandrone commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

Of course the use case of copying to a RWO tempdir is not something that should become the default, yet in your case, I think it would help to workaround and provide a much easier solution than the proposed affinity solution.

It might be specific to our use case (although I think that would be an issue to any medium-sized team) but we're a small team maintaining Woodpecker for other engineering teams. If it were just 1 or 2 pipelines that would be ok. The issue is that we'd need to document and expect that multiple non-infrastructure-savvy developers, who own their pipelines, would need to use their own judgement for when to use the RWO volume or not. On top of that we're migrating from a Drone infrastructure where that's not required which just adds to the confusion.

From our point of view a relatively simple patch on the server is much easier than having to update dozens+ of pipelines (+ performance issues), or forgoing ever using services/depends_on (of course, this point is just to illustrate if the solution is easier or not, not if it's the correct solution).

However, a workaround is often "quickly" implemented but not often the best overall solution.

Agreed. Although I personally see the nodeAffinity solution as a perfectly fine solution, that uses requiredDuringSchedulingIgnoredDuringExecution as intended, not as a quick workaround.

I really like the idea of running every containers on the same pod and FWIW that is how Tekton worked when I used it a couple of years ago. There seems to be a lot of advantages to that approach. However to me it also sounds like a much more complex change. I'm not sure if is is already something that is on the roadmap of the project and universally agreed upon.

The way I see it is, as of now, Woodpecker on a Kubernetes cluster doesn't work fully without a NFS volume: both services and depends_on (as well as the deprecated group) will not work properly and may altogether fail to start if the pods are scheduled on the wrong node. A NFS volume fixes this but from my limited experience introduces performance issues on common CI workloads that many teams will not find tolerable.

My suggestion would be that, unless the maintainers are committed to change the architecture from one-pod-per-step to one-pod-per-workflow (or are confident that this change can be made reasonably soon), it's reasonable to find a good enough solution within the current architecture 👍🏻

@zc-devs

zc-devs commented Feb 9, 2024

Copy link
Copy Markdown
Contributor

The way I see it is, as of now, Woodpecker on a Kubernetes cluster doesn't work fully without a NFS volume: both services and depends_on (as well as the deprecated group) will not work properly and may altogether fail to start if the pods are scheduled on the wrong node. A NFS volume fixes this but from my limited experience introduces performance issues on common CI workloads that many teams will not find tolerable.

The simplest solution is using Local Path Provisioner. It provides local RWO Volumes and binds Pods to the same Node automagically. Works well in self-hosted environment (with Woodpecker too, of course). Don't know how it goes in a cloud, though. Seems, you have a reason not to use it...

@pat-s

pat-s commented Feb 10, 2024

Copy link
Copy Markdown
Contributor

@fernandrone My proposed workaround is of course nothing that scales and not a solution for many pipelines or general use. Yet, I didn't know your use cases and environment before.

I really like the idea of running every containers on the same pod and FWIW that is how Tekton worked when I used it a couple of years ago. There seems to be a lot of advantages to that approach. However to me it also sounds like a much more complex change. I'm not sure if is is already something that is on the roadmap of the project and universally agreed upon.

So far it has only been a rough idea, also to tackle caching issues. The RWX approach in place right now has been there before I joined the project actively. I don't know why that decision was made, and hence can't comment on it. I am aware that Drone uses a single pod approach and that transitioning from Drone to WP would be more seamless if WP would do the same.

The only downside of that approach would be that resource-based scheduling would only be possible on the pipeline level and not step level, but I personally would see that rather as an improvement as it avoids having to define backend_options for every step.

My suggestion would be that, unless the maintainers are committed to change the architecture from one-pod-per-step to one-pod-per-workflow (or are confident that this change can be made reasonably soon), it's reasonable to find a good enough solution within the current architecture 👍🏻

I see your need but I would still highly vote to push the "one-pod-many-containers" approach and stay a little patient until that is done (unless another maintainer is against it).
There could also be the option to keep the current model (for backward comp etc.) and add the "one-pod-many-containers" on the side with an option to switch to it. I would be happy to work on it, as it not only solves your problem, it simplifies caching and removes the need for a RWX/volume mount in the helm chart (as a pipeline could also just operate on emptyDir then within a pod).

Would you and your team be motivated to go this route? And to clarify again: I would not want to merge this one as it actually would imply updating the docs and the chart and then potentially revert everything if the "one-pod-many-containers" is put in place (as this one makes the other idea practically useless).

I am aware that you would like to see an instant merge to continue working - but until then, there's also your for which could be used in the meantime.

@zc-devs

zc-devs commented Feb 10, 2024

Copy link
Copy Markdown
Contributor

@6543

6543 commented Feb 10, 2024

Copy link
Copy Markdown
Member

If there is a lot of arguing aganst & for it, i would say it should be a config option

(In the backemd config i guess)

@pat-s

pat-s commented Feb 11, 2024

Copy link
Copy Markdown
Contributor

Let's continue discussing in #3366 (thanks @zc-devs).

@6543 With the little but important difference that it is not a yes/no discussion here but more about "which approach to use" and "not to hurry but avoiding adding a lot of code which then must be migrated/removed again for a (possibly) better solution".

Let's close here for now and find an agreement first which most/a majority is happy with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend/kubernetes enhancement improve existing features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants