nomad: deep dive into networking #30

Open
1 of 11 tasks
noahehall opened this issue Jan 27, 2023 · 0 comments

C

  • previously we made a decision to bake envoy + consul into each docker image; hopefully this doesn't backfire on us with nomad integration
    • any issues we're facing at the network layer are purely a knowledge gap
    • I'm sure the architecture is sound unless evidence proves otherwise
  • nomad has first-class consul (and vault) integration
  • however, we are using nomad to start the consul service: let's see how this chicken-and-egg dependency plays out
    • best case scenario:
      • we can leave the consul + envoy baked into each image, supporting interoperability between envs
      • we don't need to set up consul for nomad tasks
        • we just need to point upstreams to the consul allocation
        • this can be achieved via a template on each task that queries nomad service X to retrieve the service IP
      • register nomad clients with the consul agent for the task they're running
        • this is overkill: we just need to know where the services are deployed, then consul + envoy will take over
      • or perhaps set group.service.provider === nomad
        • worked perfectly (see the sketch below)
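
A minimal sketch of that best case, assuming hypothetical job/group/image names: the proxy group registers with provider = "nomad", and a template uses nomadService to find wherever core-consul was allocated.

```hcl
# sketch only: names, ports, and the image are placeholders
job "core" {
  group "proxy" {
    network {
      port "http" {}
    }

    # register with nomad's built-in service catalog instead of consul
    service {
      provider = "nomad"
      name     = "core-proxy"
      port     = "http"
    }

    task "proxy" {
      driver = "docker"

      config {
        image = "core-proxy:local" # hypothetical image with consul + envoy baked in
        ports = ["http"]
      }

      # point the baked-in consul agent at wherever nomad placed core-consul
      template {
        destination = "local/discovery.env"
        env         = true
        data        = <<EOH
{{- range nomadService "core-consul" }}
CONSUL_RETRY_JOIN={{ .Address }}:{{ .Port }}
{{- end }}
EOH
      }
    }
  }
}
```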

  • best case worked out perfectly: leaving this here for when I forget in the future
    • workaround scenario 1:
      • we create a user-defined network and have all clients join it
      • then upstreams can discover core-consul via nomad SRV records
      • all services use consul intentions anyway to manage authnz, so this shouldn't be too much of a security concern
    • workaround scenario 2:
      • we do a soft integration with consul + nomad, just for service discovery between allocations
      • one thing to watch for is redundant envoy + consul processes running
        • each container has a bootstrap file for managing the consul agent + envoy sidecar that's baked into the image
        • if we then run another consul + envoy process for nomad, that redundancy seems wasteful
    • worst case scenario: we have to remove consul + envoy from the image
      • this will require us to add additional docker services (1 for consul, 1 for envoy) for each application service in the compose file for development
        • definitely not something we want to do, hence why we baked them into the image
        • we will have to duplicate that logic in nomad for each env
          • not something we want to do, hence why we baked them into the image
    • less worst case, but still a worst case scenario
      • use nomad for development:
        • then having consul + envoy baked into the image will be the problem, instead of this ticket
        • we can configure consul + envoy as a system job and it will automatically be provisioned on each client (sketched below)
          • this is idiomatic nomad
      • not something we want to do; nothing beats plain docker for development
        • hence why we baked consul and envoy into the image
      • we have the validation env explicitly for running a prod-like environment without imposing restrictions/non-dev concerns on developers
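
A rough sketch of that system-job alternative, kept here for reference only (image tag and args are placeholders):

```hcl
# nomad schedules a system job on every eligible client, so each node gets one agent
job "consul-agent" {
  type = "system"

  group "consul" {
    task "agent" {
      driver = "docker"

      config {
        image        = "hashicorp/consul:1.14"  # hypothetical pin
        network_mode = "host"                   # agent should be reachable on the host
        args         = ["agent", "-retry-join=core-consul", "-client=0.0.0.0"]
      }
    }
  }
}
```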

T

  • docker tasks use the docker bridge and not the nomad bridge, so we need to configure it (see the sketch after this list)
    • group.service: attrs to review
      • x
    • group.network: attrs to review, and should be used instead of task when attrs clash
      • x
    • task.config.X:
      • attrs to review
        • extra_hosts
        • ports
          • do a manual review of this; the docker driver sets NOMAD_PORT_<label> in each container
        • network_aliases: unlike docker, we can use the nomad runtime vars to give each container a distinct alias; but this requires a user-defined network
      • attrs to avoid
        • hostname
        • privileged
        • ipc_mode
        • ipv4_address
        • ipv6_address
      • must be configured at group.network
        • dns_search_domains
        • dns_options
        • dns_servers
        • network_mode
    • docker plugin conf
      • check the infra_image attr; from the docs it appears nomad hardcodes it to 3.1
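
A sketch of how the attrs above map onto a job file; the group name, image, and addresses are placeholders. dns_* and network_mode live on group.network, while ports and extra_hosts stay on task.config.

```hcl
group "core-proxy" {
  network {
    mode = "bridge"               # network_mode is set here, not in task.config
    dns {
      servers  = ["172.17.0.1"]   # dns_servers -> network.dns.servers (placeholder)
      searches = ["service.search"]
      options  = ["ndots:1"]
    }
    port "serf" {
      static = 8301               # static so other allocations can find serf
    }
  }

  task "proxy" {
    driver = "docker"

    config {
      image       = "core-proxy:local"           # hypothetical image
      ports       = ["serf"]
      extra_hosts = ["core-consul:172.17.0.2"]   # placeholder address
      # avoid here: hostname, privileged, ipc_mode, ipv4_address, ipv6_address
    }
  }
}
```

If infra_image does need to change, it lives in the client's docker plugin config, not the job file:

```hcl
plugin "docker" {
  config {
    infra_image = "gcr.io/google_containers/pause-amd64:3.1"  # the default the docs mention
  }
}
```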

A


  • issue 1: chatter across allocations
  • this was expected, as the config is pretty much copy-pasted from the docker convert env file
  • core-consul (see below) hostname doesn't exist in validation
    • ^ it needs to point to the core-consul allocation ip
    • ^ or somehow discover on which client core-consul is allocated
  • sanity check:
    • set static port allocations for all core-consul (especially serf) ports
    • hard-code the core-consul addr in the core-proxy retry_join attr (see the sketch below)
    • makes sense that it works with hardcoded values: since everything's running on my machine
    • still a useful sanity check
  • real fix: discovery....
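
For the record, the sanity check amounted to roughly this in the proxy's baked-in consul agent config (the address is just whatever core-consul's allocation got on my machine, so it proves the plumbing, not discovery):

```hcl
# hardcoded sanity check only
retry_join = ["172.26.65.117"]   # core-consul allocation address (placeholder)

ports {
  serf_lan = 8301                # matches the static serf port pinned in group.network
}
```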
--
2023-01-27T02:16:17.643Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=intention-match error="No known Consul servers" index=0
2023-01-27T02:16:17.643Z [ERROR] agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=core-proxy-1-sidecar-proxy service_id=core-proxy-1-sidecar-proxy id=intentions error="error filling agent cache: No known Consul servers"
--
2023-01-27T02:15:12.560Z [INFO]  agent.client.serf.lan: serf: Attempting re-join to previously known node: core-vault-247bb920bc1a: 172.21.0.2:8301
2023-01-27T02:15:12.918Z [INFO]  agent: (LAN) joining: lan_addresses=["core-consul"]
2023-01-27T02:15:12.941Z [WARN]  agent.router.manager: No servers available
2023-01-27T02:15:12.978Z [WARN]  agent.client.memberlist.lan: memberlist: Failed to resolve core-consul: lookup core-consul on 192.168.0.1:53: no such host
2023-01-27T02:15:12.978Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0
  error=
  | 1 error occurred:
  | 	* Failed to resolve core-consul: lookup core-consul on 192.168.0.1:53: no such host
  | 
  
2023-01-27T02:15:12.978Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=10s
  error=
  | 1 error occurred:
  | 	* Failed to resolve core-consul: lookup core-consul on 192.168.0.1:53: no such host
  | 

-- issue: token/acl

--
2023-01-27T04:28:12.087Z [INFO]  agent.client.serf.lan: serf: Attempting re-join to previously known node: core-proxy-da6a390b2832: 172.22.0.3:8301
127.0.0.1:53492 [27/Jan/2023:04:28:12.107] edge forward_https/serverhttps 1/-1/+0 +0 -- 1/1/0/0/1 0/0
2023-01-27T04:28:13.388Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=172.26.65.117:8300 error="rpc error making call: Permission denied: token with AccessorID 'bdad85af-9fc8-e41d-593f-c73cebef40fc' lacks permission 'node:write' on \"core-proxy-4652f5c62fdf\""
2023-01-27T04:28:13.388Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=bdad85af-9fc8-e41d-593f-c73cebef40fc
--
127.0.0.1:39828 [27/Jan/2023:04:28:22.108] edge forward_https/serverhttps 1/-1/+0 +0 -- 1/1/0/0/1 0/0
2023-01-27T04:28:22.580Z [ERROR] agent.client: RPC failed to server: method=Catalog.Register server=172.26.65.117:8300 error="rpc error making call: Permission denied: token with AccessorID 'bdad85af-9fc8-e41d-593f-c73cebef40fc' lacks permission 'node:write' on \"core-proxy-4652f5c62fdf\""
2023-01-27T04:28:22.580Z [WARN]  agent: Node info update blocked by ACLs: node=3bb036b6-c034-7abb-42df-01c8f7a5b1ea accessorID=bdad85af-9fc8-e41d-593f-c73cebef40fc
--
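
The ACL errors above boil down to the agent token lacking node:write for the proxy's node name; a minimal consul policy sketch that would cover it (the node prefix is an assumption based on the node names in the logs):

```hcl
# attach this policy to the token the core-proxy agent runs with
node_prefix "core-proxy-" {
  policy = "write"
}

service_prefix "" {
  policy = "read"
}
```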

  • issue: vault backend
  • this makes sense because vault has been commented out
[NOTICE]   (15) : haproxy version is 2.7.1-3e4af0e
[NOTICE]   (15) : path to executable is /usr/local/sbin/haproxy
[WARNING]  (15) : config : [/var/lib/haproxy/configs/002-001-vault.cfg:19] : 'server lb-vault/core-vault-c-dns1' : could not resolve address 'core-vault.service.search', disabling server.
[WARNING]  (15) : config : [/var/lib/haproxy/configs/002-001-vault.cfg:20] : 'server lb-vault/core-vault-d-dns1' : could not resolve address 'core-vault', disabling server.
[NOTICE]   (15) : New worker (71) forked
[NOTICE]   (15) : Loading success.