How to Implement the XCAT Service Node Pool #7473

Open
ornekofbf opened this issue Oct 10, 2024 · 3 comments

Comments

@ornekofbf

I am currently working on setting up a hierarchical cluster environment using xCAT and am interested in implementing a service node pool to improve the load balancing and management of our computing resources. Our goal is to ensure that tasks are distributed efficiently among our compute nodes to optimize performance and resource utilization.

The official 2.16.5 documentation proposes using a service node pool to achieve load balancing and high availability, but I have not found a specific method or steps for implementing one. I would greatly appreciate guidance on the following:
1. Could you outline the specific steps required to set up a service node pool for a hierarchical cluster in xCAT?
2. Is there any detailed documentation or tutorial that covers the setup process from start to finish?


@Obihoernchen
Member

How many clients do you have? I usually wouldn't recommend a hierarchical setup if you have fewer than 1000 clients.
In my opinion it's just a lot of added effort and complexity at that scale.

Did you check https://xcat-docs.readthedocs.io/en/stable/advanced/hierarchy/define_service_node.html and https://xcat-docs.readthedocs.io/en/stable/advanced/hierarchy/index.html?highlight=pool already? As far as I know there is no additional documentation or guide available.

But the pool does not do automatic load balancing. The SN to CN assignment is fixed once a node boots.
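
A minimal sketch of what the pool definition in the linked docs appears to amount to: the compute nodes' `servicenode` attribute holds a comma-separated list of service nodes, `xcatmaster` is left empty so the serving SN is picked when the node boots, and `sharedtftp` is set to 0 in the site table. The helper below only builds the `chdef` command lines and does not run them. The node group `compute` and the service node names `sn1` and `sn2` are hypothetical, and the attribute names are assumptions that should be checked against the docs for your xCAT version.

```python
import shlex
from typing import List


def pool_commands(noderange: str, service_nodes: List[str]) -> List[List[str]]:
    """Build (but do not execute) chdef commands for a service node pool.

    Assumption: a pool is expressed as a comma-separated `servicenode`
    list with `xcatmaster` left empty, plus `sharedtftp=0` on the site
    object. Verify this against the xCAT hierarchy docs before use.
    """
    if len(service_nodes) < 2:
        raise ValueError("a pool needs at least two service nodes")
    sn_list = ",".join(service_nodes)
    return [
        # Compute nodes list every SN in the pool; xcatmaster stays unset.
        ["chdef", "-t", "node", "-o", noderange,
         f"servicenode={sn_list}", "xcatmaster="],
        # Each SN serves its own TFTP directory instead of a shared one.
        ["chdef", "-t", "site", "-o", "clustersite", "sharedtftp=0"],
    ]


if __name__ == "__main__":
    # Hypothetical names; print the commands for review instead of running them.
    for cmd in pool_commands("compute", ["sn1", "sn2"]):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run on any machine; on a real management node they would be reviewed and then run by hand.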

One more question. You write:

Our goal is to ensure that tasks are distributed efficiently among our compute nodes to optimize performance and resource utilization.

Maybe I just misunderstand this sentence, but xCAT does not really help you with this. It deploys your servers, but once they are deployed you should use a workload manager such as Slurm to schedule work on the compute nodes.

@samveen
Member

samveen commented Oct 14, 2024

@ornekofbf To elaborate on Markus's comment,

  • An xCAT cluster has one essential set of services related to xCAT: cluster management. Everything else is user compute.
  • Compute and workload management is not and has never been part of xCAT's purview.
  • Service nodes are useful in 2 cases:
    • Depending on the size of the cluster, the management node may not be able to handle the load from the number of nodes under management. Service nodes take on that load, freeing up resources on the management node.
    • Nodes are on disjoint networks, and service nodes allow for management of nodes on those disjoint networks.
  • Workload management depends on the usage scenarios for the cluster.
    • A lot of HPC clusters use SLURM for workload management.
    • I have personally seen and worked with a custom workload management system implemented using Puppet/MCollective (by this guy for a large e-commerce platform).
    • It should be possible to implement a Kubernetes cluster on top of the xCAT cluster by adding postscripts to node deployments that initialize the node and add it to a Kubernetes cluster, using any of the processes listed in the K8S docs. In this example, xCAT does the physical infrastructure management, while K8S does compute resource management.
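
To make the last bullet a little more concrete, here is a hedged sketch of attaching a postscript to node definitions. `chdef -p` appends to a comma-separated attribute such as `postscripts`. The script name `join_k8s` and the node group `compute` are hypothetical: the script would have to exist in the postscripts directory on the management node, and what it does to join a Kubernetes cluster is out of scope here. The helper only builds the command line and does not run it.

```python
import shlex
from typing import List


def add_postscript_command(noderange: str, script: str) -> List[str]:
    """Build (but do not execute) a chdef command that appends a postscript.

    `-p` adds the value to the existing comma-separated postscripts list
    instead of overwriting it.
    """
    if not script or "," in script:
        raise ValueError("expected a single postscript name")
    return ["chdef", "-p", "-t", "node", "-o", noderange,
            f"postscripts={script}"]


if __name__ == "__main__":
    # Hypothetical names; printed for review rather than executed.
    print(shlex.join(add_postscript_command("compute", "join_k8s")))
```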

@ornekofbf
Author

Hello, our cluster has 10,000 nodes.
Sorry, I didn't describe the problem clearly before. I mainly want to know the specific steps for implementing a service node pool. Is it possible to implement one by following the instructions in the documentation (https://xcat-docs.readthedocs.io/en/stable/advanced/hierarchy/define_service_node.html)? Is any additional configuration needed?

[Image: 微信图片_20241016215237 (WeChat picture)]

In addition, I have another issue with an ordinary hierarchical cluster (no service node pool; a single service node is responsible for a group of compute nodes).
Should service nodes be configured with DHCP and other services? I found that if I don't configure the DHCP service on the service node, the management node still acts as the DHCP server when deploying the OS, and the service node doesn't seem to have any effect. But if I configure a DHCP server on the service node, the compute node gets stuck here when it reboots to install the OS. Can you provide a solution or suggestion? Thank you!
[Image: error]
