[feature] High Availability #144
Comments
I have been considering what it would take to provide HA for a cluster of nodes running WAMP routers. Here is a very rough design concept that tries to address what you are asking for. I would be interested in hearing feedback. |
We also thought about HA within WAMP, but came up with a slightly different approach. Our approach was to minimize the inter-router-communication required to maintain good performance and reliability. We split our services into smaller "application features", and wanted to use one "backend router" per "application feature". The backend router is where the backend services connect to.
However, I agree with @gammazero that some kind of federation service would certainly be required, regardless of the chosen implementation strategy. Additionally, I think nexus will need to open up some internal APIs, since the federation service would require deep interaction with the router itself. |
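To make the "one backend router per application feature" idea concrete, here is a minimal sketch of how a frontend might pick the backend router for a given WAMP URI. It assumes (this is not stated above) that each application feature owns a URI prefix; the prefixes, router URLs and the `pick_backend` helper are all hypothetical and not part of nexus.

```python
# Hypothetical mapping: URI prefix of an application feature -> backend router URL.
BACKENDS = {
    "com.example.chat": "ws://chat-router:8080/ws",
    "com.example.billing": "ws://billing-router:8080/ws",
    "com.example.billing.invoices": "ws://invoice-router:8080/ws",
}


def pick_backend(uri, backends=BACKENDS):
    """Return the backend router URL whose prefix is the longest match for uri.

    A prefix matches only on whole URI components, so "com.example.chatty"
    does not match the "com.example.chat" feature. Returns None if no
    feature owns the URI.
    """
    best_prefix = None
    for prefix in backends:
        if uri == prefix or uri.startswith(prefix + "."):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix = prefix
    return backends[best_prefix] if best_prefix is not None else None
```

With a static mapping like this, the frontend routers need no shared state to decide where a call or subscription goes, which is the point of minimizing inter-router communication. How the mapping is distributed and kept up to date is left open here.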
@gregkeys can you please provide some info/links about Crossbar HA mode? |
BTW, for whoever is in search of a multi-node router for scale/HA purposes, Wiola seems to fit the bill: |
@gammazero I'm still considering your proposal; at first glance it looks like a solid plan of action. @martin31821 are you building custom routers with nexus to accomplish this? If so, where is the strong interconnect taking place? Is that built into the frontend routers? @haizaar here is the link where Crossbar mentions access to HA mode: https://crossbario.com/products/enterprise-support/ |
At the moment, we have autobahnkreuz, but it does not include high availability yet. |
@gammazero how would you prevent subscription/registration loops when federation agents connect as normal clients to other routers? The problem seems to be this: clientA, connected to routerA, subscribes to a topic. Federation agentA (on routerA) detects that and subscribes to that topic on routerB, which it is connected to. Federation agentB runs on routerB; if we do not differentiate between clients and agents, it detects this subscription and subscribes to the topic on routerA, which it is connected to. clientA can then unsubscribe from the topic, but since agentB is still subscribed to it, agentA will not unsubscribe from it on routerB. |
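For illustration, here is a small in-memory sketch of one possible way to avoid the loop described above: sessions are tagged as client or agent, and an agent mirrors a topic to the remote router only while at least one non-agent session is subscribed locally. This is an assumption made for the example, not the design proposed in this thread and not nexus API; `Router`, `FederationAgent` and the `is_agent` flag are hypothetical names.

```python
class Router:
    """Toy router that tracks subscriptions only (no message delivery)."""

    def __init__(self, name):
        self.name = name
        self.subs = {}        # topic -> set of (session, is_agent)
        self.listeners = []   # federation agents attached to this router

    def _client_sessions(self, topic):
        return [s for s, is_agent in self.subs.get(topic, ()) if not is_agent]

    def subscribers(self, topic):
        return sorted(s for s, _ in self.subs.get(topic, ()))

    def subscribe(self, topic, session, is_agent=False):
        # Only the first *client* subscription is announced to agents;
        # subscriptions made by agents are never announced, which breaks the loop.
        first_client = not is_agent and not self._client_sessions(topic)
        self.subs.setdefault(topic, set()).add((session, is_agent))
        if first_client:
            for listener in self.listeners:
                listener.on_first_client_sub(topic)

    def unsubscribe(self, topic, session, is_agent=False):
        entries = self.subs.get(topic, set())
        if (session, is_agent) not in entries:
            return
        entries.remove((session, is_agent))
        # Announce only when the last *client* subscription is gone.
        if not is_agent and not self._client_sessions(topic):
            for listener in self.listeners:
                listener.on_last_client_unsub(topic)
        if not entries:
            del self.subs[topic]


class FederationAgent:
    """Mirrors client subscriptions from the local router to a remote router."""

    def __init__(self, local, remote):
        self.remote = remote
        self.session = f"agent@{local.name}"
        local.listeners.append(self)

    def on_first_client_sub(self, topic):
        self.remote.subscribe(topic, self.session, is_agent=True)

    def on_last_client_unsub(self, topic):
        self.remote.unsubscribe(topic, self.session, is_agent=True)


def demo():
    topic = "com.example.topic"
    router_a, router_b = Router("A"), Router("B")
    FederationAgent(router_a, router_b)  # agentA
    FederationAgent(router_b, router_a)  # agentB
    router_a.subscribe(topic, "clientA")
    after_sub = (router_a.subscribers(topic), router_b.subscribers(topic))
    router_a.unsubscribe(topic, "clientA")
    after_unsub = (router_a.subscribers(topic), router_b.subscribers(topic))
    return after_sub, after_unsub
```

In `demo()`, agentB never mirrors agentA's subscription back to routerA, so when clientA unsubscribes, agentA can remove its subscription on routerB and both routers end up with no subscribers. In a real router the client/agent distinction would have to come from something trustworthy, such as an authenticated role, which this sketch does not model.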
@gammazero has there been any work in HA, Load Balancing and Horizontal Scaling? |
@u774r4v4d1n there has been some research, primarily regarding state synchronization in a distributed scenario (for availability, not for scaling). cc'ing @fin-ger, his project https://github.com/fin-ger/building-a-distributed-wamp-router gives a nice overview of the results. |
We are currently using Crossbar in production but would like to switch to something that offers high availability. Crossbar has HA; however, it is prohibitively expensive at around $14,000 per node.
We run multi-tenanted clusters where each namespace represents a customer website; within each namespace is a deployment of Crossbar.
All of our services are capable of scaling up and down; however, at 65,000 active connections I'm expecting that we will begin to see problems as services and connected users begin to lose connectivity.
If the router crashes and restarts, there is no failover strategy: everything just goes into a retry cycle until Crossbar comes back online.
Nexus looks like a good option with its ability to drop sessions from the queue. This could prevent the restarts, but it still leaves the question of how we scale up to millions of active connections. We'd like to scale up our WAMP router so that it can spread the load across multiple instances of the deployment.
We'd also like to have instances of nexus running on multiple nodes so that if we lose a node we don't lose connectivity.