Specification
It is now time for our second attempt at testnet deployment.

We had previously done a PK deployment on AWS using ECS, back when PK was 0.0.41. While that AWS deployment worked, we hit a lot of problems, which meant we had to go through an 8-month-long refactoring process over the entire codebase. Now that the codebase is finally refactored, we're ready for the second attempt.

The AWS architecture is basically the same as before, but our configuration should be a lot simpler. There are some changes though:
- Before, we had to deal with Node root certificates; now root certificates are no longer relevant to the testnet/mainnet deployment.
- We are now separating into 2 clusters of PK seed nodes: `mainnet.polykey.io` and `testnet.polykey.io`. The `mainnet` is intended for production use; we will first prototype our testnet deployment, and the testnet will be where new versions of PK are tested before being released to production.
- Both mainnet and testnet seed nodes will be trusted by default, but PK releases should default to using the mainnet, with a switch to use the testnet.
- We don't know yet whether we should be using an NLB; we may decide not to use an NLB at all. But there shouldn't be any sort of session state required for P2P functionality.
- NLBs cannot be used with PK clients that are debugging the testnet/mainnet nodes, because they would resolve to any possible node, and in this case there is in fact network session state. Instead, PK client debugging has to be done with the container IPs.
- We know that IPv6 isn't supported yet, so we will have IPv4 and DNS support.
- We should be using the well-known ports `1314` UDP and `1315` TCP for the ingress port and the client port respectively.
- The PK nodes are not stateless; they do require node state. However, this node state is not important for us to persist, so any EBS volume mounted into the ECS container should work. Basically we just need a mutable temporary directory. What kind of mutations are there? Well, the Kademlia node graph is persisted at the moment rather than kept in-memory. (See the `docker run` sketch after this list.)
Additional context
- https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/issues/197 - old issue detailing how to configure the AWS infrastructure (we're doing it manually right now); refer to this when starting on this issue
- https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/issues/237 - old issue regarding the old task definition environment variables
Tasks
1. [ ] Upload the image to ECR ("Elastic Container Registry"); see the push sketch after this list
   - check https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/issues/197#note_496723119
   - make sure you have the right authentication details
2. [ ] Create an ECS ("Elastic Container Service") Task Definition for the new image uploaded to ECR; see the task definition sketch after this list
   - the Task Definition describes how to execute the container
   - just like how we executed with `docker run`, we will need the same parameters
   - the PK agent will need a writable directory for its node state; if we don't specify anything, this will just be a temporary scratch layer from ECS, so we should be using a volume mount of some sort; the data inside this PK agent is not important, therefore any AWS volume should be ok, however an NFS/EFS volume might help us in case we want to debug things
   - additional environment variables for unattended bootstrapping and port/host binding
3. [ ] Start the ECS service as a cluster of 1, and test that it is working by using the PK CLI to directly contact the ECS container IP address and the port given by `PK_PORT`; see the service sketch after this list
4. [ ] Integrate the firewall (security group), the NLB, and an elastic IP attached to the NLB, then attach the `testnet.polykey.io` domain to the NLB; see the provisioning sketch after this list
   - the NLB won't maintain a session between connections in order to point to the same agent; ideally we won't need a common agent for NAT-busting purposes, and this shouldn't be a problem for NAT-busting, for either hole-punching relay or actual relay, as far as I know
   - if it is a problem, an alternative to the NLB is domain-level load balancing, where multiple EIPs are presented by `testnet.polykey.io` and randomised; Cloudflare supports this: https://www.cloudflare.com/en-au/learning/performance/what-is-dns-load-balancing/ (resolve this in "Update testnet.polykey.io to point to the list of IPs running seed keynodes" #177)
5. [ ] Update the reference documentation with the testnet architecture including AWS, using a component diagram with relevant AWS resources
   - waiting on Pulumi... an intermediate diagram is available in "Testnet Node Deployment (testnet.polykey.io)" #194 (comment)
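A sketch of the ECR upload in task 1, assuming the repository already exists; the region, account ID, repository name, and tag are placeholders (see issue 197 for the real values):

```bash
# Authenticate Docker against ECR, then tag and push the PK image.
# Region, account ID, and repository name below are placeholders.
aws ecr get-login-password --region ap-southeast-2 \
  | docker login --username AWS --password-stdin \
      123456789012.dkr.ecr.ap-southeast-2.amazonaws.com

docker tag polykey/polykey:latest \
  123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/polykey:latest
docker push 123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/polykey:latest
```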
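For task 2, a sketch of a Task Definition carrying the same parameters we passed to `docker run`: the two port mappings, a scratch volume for the node state, and an environment variable for unattended bootstrapping. The `PK_PASSWORD` name, image URI, container path, and family name are assumptions, not confirmed values:

```bash
# Register a task definition sketch; values below are placeholders.
cat > task-definition.json <<'EOF'
{
  "family": "polykey-testnet",
  "containerDefinitions": [
    {
      "name": "polykey-agent",
      "image": "123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/polykey:latest",
      "essential": true,
      "portMappings": [
        { "containerPort": 1314, "protocol": "udp" },
        { "containerPort": 1315, "protocol": "tcp" }
      ],
      "environment": [
        { "name": "PK_PASSWORD", "value": "bootstrap-password" }
      ],
      "mountPoints": [
        { "sourceVolume": "pk-node-state", "containerPath": "/srv/polykey" }
      ]
    }
  ],
  "volumes": [
    { "name": "pk-node-state" }
  ]
}
EOF

aws ecs register-task-definition --cli-input-json file://task-definition.json
```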
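For task 3, a sketch of starting a single-task service and locating the container IP to contact directly; the cluster and service names are placeholders, and the `pk` invocation at the end is a hypothetical form of the client call, not a confirmed CLI signature:

```bash
# Run a cluster of 1 and find the task's IP for direct debugging.
aws ecs create-service \
  --cluster polykey-testnet \
  --service-name polykey-seed \
  --task-definition polykey-testnet \
  --desired-count 1

TASK_ARN="$(aws ecs list-tasks --cluster polykey-testnet \
  --query 'taskArns[0]' --output text)"
aws ecs describe-tasks --cluster polykey-testnet --tasks "$TASK_ARN"

# Hypothetical PK CLI check against the IP reported above, on the
# client port bound via PK_PORT:
# pk agent status --client-host <container-ip> --client-port 1315
```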
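For task 4, a sketch of the security group rules, the elastic IP, and the NLB wiring; all IDs, names, and subnets are placeholders, and attaching `testnet.polykey.io` to the NLB's address is a separate DNS step:

```bash
# Open the well-known ports on the seed nodes' security group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 --protocol udp --port 1314 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 --protocol tcp --port 1315 --cidr 0.0.0.0/0

# Allocate an elastic IP and bind it to a network load balancer.
ALLOC_ID="$(aws ec2 allocate-address --domain vpc \
  --query 'AllocationId' --output text)"
aws elbv2 create-load-balancer \
  --name polykey-testnet-nlb --type network \
  --subnet-mappings SubnetId=subnet-0123456789abcdef0,AllocationId="$ALLOC_ID"

# A UDP target group for the ingress port; a listener created with
# `aws elbv2 create-listener` would then forward 1314/udp to it.
aws elbv2 create-target-group \
  --name polykey-ingress --protocol UDP --port 1314 \
  --vpc-id vpc-0123456789abcdef0 --target-type ip
```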