chore: add podAffinity and labels to identify the pipeline#3345
chore: add podAffinity and labels to identify the pipeline#3345eliasscosta wants to merge 3 commits into
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3345 +/- ##
==========================================
- Coverage 36.32% 35.87% -0.46%
==========================================
Files 224 227 +3
Lines 14759 15145 +386
==========================================
+ Hits 5361 5433 +72
- Misses 9008 9313 +305
- Partials 390 399 +9 ☔ View full report in Codecov by Sentry. |
pat-s
left a comment
There was a problem hiding this comment.
Affinity does not guarantee that a pod is placed on the same node, it is just a "recommendation", if possible. Why did you choose this one instead of nodeSelector?
What if the user adds manual affinity or nodeSelector settings into backend_options?
(We use spaces instead of tabs - this is also enforced by the linter in CI. Please see the editorconfig & friends.)
Yes, we already use the
Today we don't have the configuration to change the affinity in the
I don't get it, do you have the link to it? |
The nodeSelector doesn't guarantee the steps run in the same node and have 50 nodes with AMD processors. Then volume and all steps may be scheduled on separate nodes anyway.
Why do you stick to the Pods, pipeline (
It was, as I remember, pretty long time ago. Some time it changed to tabs for a code. |
| podPrefix = "wp-" | ||
| StepLabel = "step" | ||
| podPrefix = "wp-" | ||
| PipelineNumberLabel = "pipeline_number" |
There was a problem hiding this comment.
As we use dashes, I suggest to change to pipeline-number and workflow-name.
No,
In the end, it all comes down to the key:value labels used and how "unique" they are to specific nodes.
Yes, because the key:value applies to multiple nodes. Yet if any approach is desired to fully control pod scheduling then it should be |
There was a problem hiding this comment.
I've thought about this proposal again, and I am pretty sure this PR shouldn't be merged and labeled as 'wontfix'. Here's why:
- RWX is used primarily to solve exactly such cases, i.e. step pods are allowed to be scheduled on different nodes and RWX can be attached to different nodes, providing the required underlying sources to all nodes
- A RWO workaround is not really feasible since the only unique identifier for a specific single node is its IP. An alternative for this (beside
nodeSelector) is to use tains and tolerations, yet this is also just partly addressing the issue if there are multiple "CI nodes".
I don't see a dynamic workaround in which WP infers on which node it started and then uses the same one dynamically for future steps of the same pipeline (please correct me if I am overlooking something).
In the same way, I'd argue that having a RWX is a basic element of a cluster and that implementing complex RWO scheduling workarounds are not worth it in contrast to the potential gains.
Yes, but we (users) don't need to full control. We don't want to manually pick up the Node. We want This is how K3s Local Path Provisioner works:
All Pods run on the same node as Volume: I would like if it was implemented like above, except binding step:
|
I disagree. I can translate current PR's implementation into |
I understand what you want but still, To achieve what you want, you likely want to consider running all steps in a single pod instead of creating a new pod for each step (this is not possible right now). This has been proposed somewhere else already (can't find it), and there was some discussion about it. I think this should be a way to go as it would also address some requests related to caching. Going this way, the whole pipeline would run on the same node without a complex logic to dynamically infer where the next upcoming pod should go to due to the need of using a RWO. |
So how should this be done? The only way I can think of would be
All of the above would require substantial code changes and would possibly conflict with the idea to move from the multi-pod approach to a multi-container approach, which would also solve the above. |
Seems, you are right in case I proposed. If it looks too complicated, I'm fine with current PR's implementation in general.
This will implemented (if) in a couple years from now. User has a problem now and even proposed a fix. |
Not if someone pushed into this direction. It is actually not too complicated to get this implemented. Definitely not more complicated than implementing the workaround of this PR.
I am not fine with the current PR, though I am not in the position of solely blocking such PRs. But I see just this not being a solid solution for the issue at hand, it will cause additional issues in the future and makes use of Also, there are additional solutions which are actually more "simple": use a RWX, which should be a basic component on every cluster. Overall, let me use a metaphor to describe the discussion here how it unfolds in my mind: If green and red cars exists, you also don't buy a red one and paint it green but rather sell the red one and buy a green one instead 🙃 (might not be a 100% fit but it should transport the message). |
|
Thanks, everyone, it was a rich discussion and I decided to implement it outside to resolve our case. I already have an injector to add more labels for troubleshooting and then I will add affinity logic too. We use AWS to run the k8s and RWX isn't an option for us, our primary tests the steps run too slowly using (EFS) and if we use other solutions (like: nfs operator) inside of the cluster maybe turn too complex the cluster management I'm not comfortable with the multi-container approach for my case, I guess is too complex for a simple pod creation. But is an option to think about. I try to follow the Kubernetes doc and use the options we have when is not possible to use the RWX. Maybe If we have more labels on the pod and a way to set up the Affinity by an option in the agent would be great. |
|
I wonder what you're doing when RWX is limiting WRT to performance. Judging from my experience within the last years, I usually get suspicious when RWX is being blamed to be the major issue unless it comes to webservers and other tool operating with many small files which require very small latency. Note that you can also attach additional volumes to a pod which can be used for certain compute tasks. Or the node's tempdir. You can also copy the repo to this volume before performing any tasks and clean up afterward (or set a storageClass policy to delete the volume after use). Not everything must necessarily happen on the RWX holding the repo data. |
Yes, the problem is the volume created by default for the pipeline, if we use a service the next pod try to get the same volume. If they don't run in the same node the pipeline get stuck running forever trying to mount. |
|
Thanks everyone for the comments. I'm a colleague to @eliasscosta so maybe I can bring some extra perspective.
We don't have any benchmarks at hand but our issue with Amazon EFS is that it was painfully slow for many common i/o intensive tasks. That would be the likes of It's possible that there's more that we could've done on that front, and unfortunately that was some while ago so I don't think I can get exact numbers and configuration to help. That's what ultimately pushed us away from EFS, so I'm not sure that would ever be an alternative for many workloads.
Maybe I don't have the right picture of how this would work in practice, but if we assume that the NFS is unusably slow for many CI tasks, I can see this being a big burden on users:
And that assuming that the copy operation itself wouldn't be a bottleneck.
@pat-s it seems part of the issue here (not everything, I understand that you also have valid criticisms against the design itself, not just the implementation) is this, that affinity supposedly does not guarantee anything. But doesn't it? I believe that https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity The quote above is for the nodeAffinity rules, but still, the same logic applies for podAffinity. Am I missing something? I understand if you prefer another design but I believe this code guarantees the pods are scheduled on the same node 100% of the time (or else they're not scheduled, which is technically a possibility, i.e. if a node does not have enough resources...).
I agree this could conflict with this implementation. In fact, it seems this could cause a regression and break working pipelines. Thus if we were to move forward (and I have a vested interest in moving forward because I'd prefer to not have to run a fork) I do like the idea of an opt-in flag like |
|
@fernandrone Thanks for the extensive reply, appreciated. Yes, RWXYes, RWX can be slow when installing many packages like node or R packages. Yet if this happens in parallel, and we're not talking about 1k packages, it is usually not adding up to 20 mins. Using RWO in generalI see the advantages of having a RWO as the base PV. And we surely want to make everything better and not limit users or provide a bottlenecked system in any way. Disclaimer: This is just my opinion, and I am only one maintainer. If others join the discussion and overwrite my vote, I am happy to pursue another way. Overall: Just here to help and make things better/less complicated in the long run, not to block anything from being improved. 🤞 |
It might be specific to our use case (although I think that would be an issue to any medium-sized team) but we're a small team maintaining Woodpecker for other engineering teams. If it were just 1 or 2 pipelines that would be ok. The issue is that we'd need to document and expect that multiple non-infrastructure-savvy developers, who own their pipelines, would need to use their own judgement for when to use the RWO volume or not. On top of that we're migrating from a Drone infrastructure where that's not required which just adds to the confusion. From our point of view a relatively simple patch on the server is much easier than having to update dozens+ of pipelines (+ performance issues), or forgoing ever using services/depends_on (of course, this point is just to illustrate if the solution is easier or not, not if it's the correct solution).
Agreed. Although I personally see the nodeAffinity solution as a perfectly fine solution, that uses I really like the idea of running every containers on the same pod and FWIW that is how Tekton worked when I used it a couple of years ago. There seems to be a lot of advantages to that approach. However to me it also sounds like a much more complex change. I'm not sure if is is already something that is on the roadmap of the project and universally agreed upon. The way I see it is, as of now, Woodpecker on a Kubernetes cluster doesn't work fully without a NFS volume: both My suggestion would be that, unless the maintainers are committed to change the architecture from one-pod-per-step to one-pod-per-workflow (or are confident that this change can be made reasonably soon), it's reasonable to find a good enough solution within the current architecture 👍🏻 |
The simplest solution is using Local Path Provisioner. It provides local RWO Volumes and binds Pods to the same Node automagically. Works well in self-hosted environment (with Woodpecker too, of course). Don't know how it goes in a cloud, though. Seems, you have a reason not to use it... |
|
@fernandrone My proposed workaround is of course nothing that scales and not a solution for many pipelines or general use. Yet, I didn't know your use cases and environment before.
So far it has only been a rough idea, also to tackle caching issues. The RWX approach in place right now has been there before I joined the project actively. I don't know why that decision was made, and hence can't comment on it. I am aware that Drone uses a single pod approach and that transitioning from Drone to WP would be more seamless if WP would do the same. The only downside of that approach would be that resource-based scheduling would only be possible on the pipeline level and not step level, but I personally would see that rather as an improvement as it avoids having to define
I see your need but I would still highly vote to push the "one-pod-many-containers" approach and stay a little patient until that is done (unless another maintainer is against it). Would you and your team be motivated to go this route? And to clarify again: I would not want to merge this one as it actually would imply updating the docs and the chart and then potentially revert everything if the "one-pod-many-containers" is put in place (as this one makes the other idea practically useless). I am aware that you would like to see an instant merge to continue working - but until then, there's also your for which could be used in the meantime. |
|
If there is a lot of arguing aganst & for it, i would say it should be a config option (In the backemd config i guess) |
|
Let's continue discussing in #3366 (thanks @zc-devs). @6543 With the little but important difference that it is not a yes/no discussion here but more about "which approach to use" and "not to hurry but avoiding adding a lot of code which then must be migrated/removed again for a (possibly) better solution". Let's close here for now and find an agreement first which most/a majority is happy with. |

addressing #3344
The idea is to run in the same node when we need to run parallel or service steps and have a storage RWX is false.