Set function channel to idle to prevent DNS resolution of deleted pod #14750

@michaeljmarshall michaeljmarshall commented Mar 18, 2022

Motivation

When running the Kubernetes runtime and deleting a function, it is possible to observe the following error log:

15:36:17.025 [worker-scheduler-0] INFO  org.apache.pulsar.functions.utils.Actions - Sucessfully completed action [ Deleting statefulset for function ds/default/gluon-revolut-ord-rsp-assembler ]
15:36:17.405 [grpc-default-executor-307] WARN  io.grpc.internal.ManagedChannelImpl - [Channel<99>: (pf-ds-default-gluon-revolut-ord-rsp-assembler-0.pf-ds-default-gluon-revolut-ord-rsp-assembler.pulsar.svc.cluster.local:9093)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host pf-ds-default-gluon-revolut-ord-rsp-assembler-0.pf-ds-default-gluon-revolut-ord-rsp-assembler.pulsar.svc.cluster.local, cause=java.lang.RuntimeException: java.net.UnknownHostException: pf-ds-default-gluon-revolut-ord-rsp-assembler-0.pf-ds-default-gluon-revolut-ord-rsp-assembler.pulsar.svc.cluster.local: Name or service not known
	at io.grpc.internal.DnsNameResolver.resolveAll(DnsNameResolver.java:399)
	at io.grpc.internal.DnsNameResolver$Resolve.resolveInternal(DnsNameResolver.java:269)
	at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:225)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.UnknownHostException: pf-ds-default-gluon-revolut-ord-rsp-assembler-0.pf-ds-default-gluon-revolut-ord-rsp-assembler.pulsar.svc.cluster.local: Name or service not known
	at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
	at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)
	at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519)
	at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1509)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1368)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1302)
	at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:624)
	at io.grpc.internal.DnsNameResolver.resolveAll(DnsNameResolver.java:367)
	... 5 more
}

This error happens because the ManagedChannel is created to target the StatefulSet pods, and the function worker deletes the pods before shutting down the ManagedChannel. Note that nothing about pod deletion triggers the gRPC client to connect to the functions. The error comes from the way gRPC handles DNS: it re-resolves the target name frequently, and once the pods are gone the name no longer resolves.

There are two possible solutions: shut down the managed channel before deleting the pods, or set the channel to idle, which prevents any new DNS resolution (as long as there are no new RPCs). Given that the StatefulSet or the Service could fail to get deleted, it seems simpler to set the channel to idle and then shut it down after successfully deleting the function.
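As a rough illustration of the ordering described above, here is a minimal sketch that uses only the Java standard library. The `Channel` and `ResourceDeleter` interfaces, the `deleteFunction` helper, and the class name are hypothetical stand-ins and not the code in this PR; `enterIdle()` and `shutdown()` only mirror the names of the corresponding gRPC `ManagedChannel` methods.

```java
import java.util.ArrayList;
import java.util.List;

public class IdleBeforeDelete {
    /** Hypothetical stand-in for the two ManagedChannel methods used here. */
    interface Channel {
        void enterIdle();
        void shutdown();
    }

    /** Hypothetical stand-in for deleting the StatefulSet and Service. */
    interface ResourceDeleter {
        void delete() throws Exception;
    }

    /**
     * Idles every channel, then deletes the resources, and shuts the
     * channels down only if the deletion succeeded.
     */
    static boolean deleteFunction(List<Channel> channels, ResourceDeleter deleter) {
        for (Channel c : channels) {
            c.enterIdle(); // stop name resolution before the pods disappear
        }
        try {
            deleter.delete();
        } catch (Exception e) {
            return false; // deletion failed: channels stay idle but usable
        }
        for (Channel c : channels) {
            c.shutdown();
        }
        return true;
    }

    /** Records the order of calls for one simulated deletion. */
    static List<String> trace(boolean deletionFails) {
        List<String> events = new ArrayList<>();
        Channel ch = new Channel() {
            @Override public void enterIdle() { events.add("enterIdle"); }
            @Override public void shutdown() { events.add("shutdown"); }
        };
        deleteFunction(List.of(ch), () -> {
            events.add("delete");
            if (deletionFails) {
                throw new IllegalStateException("simulated deletion failure");
            }
        });
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace(false)); // [enterIdle, delete, shutdown]
        System.out.println(trace(true));  // [enterIdle, delete]
    }
}
```

The point of the sketch is the failure branch: if the deletion throws, the channel has only been idled, so a later RPC can still bring it back, which is the reason given above for preferring idle over an early shutdown.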

It is worth noting that enterIdle is labeled as experimental. However, it looks like it is implemented in ManagedChannelImpl to do exactly what we want: stop DNS resolution for as long as the channel stays in the "idle" state. Here is the associated Javadoc:

  /**
   * Invoking this method moves the channel into the IDLE state and triggers tear-down of the
   * channel's name resolver and load balancer, while still allowing on-going RPCs on the channel to
   * continue. New RPCs on the channel will trigger creation of a new connection.
   *
   * <p>This is primarily intended for Android users when a device is transitioning from a cellular
   * to a wifi connection. The OS will issue a notification that a new network (wifi) has been made
   * the default, but for approximately 30 seconds the device will maintain both the cellular
   * and wifi connections. Apps may invoke this method to ensure that new RPCs are created using the
   * new default wifi network, rather than the soon-to-be-disconnected cellular network.
   *
   * <p>No-op if not supported by implementation.
   *
   * @since 1.11.0
   */
  @ExperimentalApi("https://github.com/grpc/grpc-java/issues/4056")
  public void enterIdle() {}
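To make the behavior in that Javadoc concrete, the following toy state machine models a channel whose resolver exists only outside the idle state. It is a deliberately simplified model written for illustration, not gRPC's implementation: the class, its states, and the `refresh()` trigger are all assumptions, and real channels have more states and more re-resolution triggers.

```java
import java.util.Arrays;

public class IdleChannelModel {
    enum State { IDLE, READY, SHUTDOWN }

    private State state = State.IDLE;
    private int dnsLookups = 0;

    /** A new RPC leaves IDLE, which recreates the resolver and resolves the target. */
    void newRpc() {
        if (state == State.SHUTDOWN) {
            throw new IllegalStateException("channel is shut down");
        }
        if (state == State.IDLE) {
            state = State.READY;
            dnsLookups++;
        }
    }

    /** A re-resolution trigger only has an effect while a resolver exists. */
    void refresh() {
        if (state == State.READY) {
            dnsLookups++;
        }
    }

    /** Tears down the (modelled) resolver; no-op after shutdown. */
    void enterIdle() {
        if (state != State.SHUTDOWN) {
            state = State.IDLE;
        }
    }

    void shutdown() {
        state = State.SHUTDOWN;
    }

    int dnsLookups() {
        return dnsLookups;
    }

    /** Returns lookup counts before idling, while idle, and after a new RPC. */
    static int[] demo() {
        IdleChannelModel ch = new IdleChannelModel();
        ch.newRpc();   // first lookup
        ch.refresh();  // second lookup
        int before = ch.dnsLookups();
        ch.enterIdle();
        ch.refresh();  // ignored: no resolver while idle
        ch.refresh();  // ignored
        int whileIdle = ch.dnsLookups();
        ch.newRpc();   // leaves idle, third lookup
        return new int[] {before, whileIdle, ch.dnsLookups()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // [2, 2, 3]
    }
}
```

In this model, no lookups happen between `enterIdle()` and the next RPC, which corresponds to the window in which the function pods are deleted.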

Modifications

  • Set all channels to idle before deleting the function pods in the K8s function runtime
  • Fix some copy/pasted log lines that were a bit confusing
  • Fix misspellings of "success" (written as "sucess").

Verifying this change

This is a trivial change that will not affect the logic of function deletion. It just ensures graceful shutdown and avoids benign errors that might otherwise confuse users.

Does this pull request potentially affect one of the following parts:

This is a backwards compatible change.

Documentation

  • no-need-doc

This is an internal cleanup.

@michaeljmarshall michaeljmarshall added the type/cleanup and area/function labels Mar 18, 2022
@michaeljmarshall michaeljmarshall added this to the 2.11.0 milestone Mar 18, 2022
@michaeljmarshall michaeljmarshall self-assigned this Mar 18, 2022
@github-actions github-actions bot added the doc-not-needed label Mar 18, 2022
@codelipenghui codelipenghui requested a review from freeznet March 21, 2022 03:29
@codelipenghui

@freeznet @nlu90 Please help review this PR.

@freeznet freeznet left a comment

lgtm

@codelipenghui codelipenghui merged commit 1e619ce into apache:master Mar 21, 2022
@michaeljmarshall michaeljmarshall deleted the address-funciton-worker-grpc-dns-error branch March 21, 2022 16:36
4 participants