Skip to content
This repository was archived by the owner on Oct 12, 2023. It is now read-only.

Retry operation when network connection is down.#31

Merged
ramya-rao-a merged 12 commits intoAzure:masterfrom
ShivangiReja:NetworkConn_Error
Mar 12, 2019
Merged

Retry operation when network connection is down.#31
ramya-rao-a merged 12 commits intoAzure:masterfrom
ShivangiReja:NetworkConn_Error

Conversation

@ShivangiReja
Copy link
Copy Markdown
Member

@ShivangiReja ShivangiReja commented Mar 4, 2019

Description

Continue retry operation when internet connection goes down.

  • Created a new method that checks for internet connectivity.
  • If we are not connected with the internet then continue retry operation.

Reference to any github issues

@ramya-rao-a @AlexGhiondea

lib/retry.ts Outdated
async function checkInternet(): Promise<boolean> {
return new Promise((resolve) => {
require("dns").lookup("ms.portal.azure.com", function (err: any): void {
if (err && err.code === "ENOTFOUND") {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible that the this dns query resolves even when the connection is not there? For instance, if there is a dns cache it might be possible that the value is returned from the cache without doing a network call.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
You are right, the lookup method indeed reuses the dns cache and therefore not guaranteed to perform any network communication. Therefore, we will have to use the resolve method instead of lookup which will query the dns server and hence always use the network to perform queries.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still possible to "resolve" without having "internet" FWIW. This occurs often on our wifi network - DNS queries work as the DNS server is within our LAN (still a few hops away), but access to the outside world is broken. I think this feature should be understood as a 90% sort of solution for "ethernet unplugged" and "switching to/from wifi" scenarios.

Copy link
Copy Markdown
Contributor

@AlexGhiondea AlexGhiondea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a couple of comments

lib/retry.ts Outdated
const isConnected = await checkNetworkConnection();
if (!isConnected) {
err.name = "ConnectionDetachError";
err.retryable = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We make this check for every error seen by every retry attempt. Can we reduce the number of checks being made?
One example is when the existing error is already retryable, then we don't need to make this check.

Also, maybe we should only add this check when we see the ServiceCommunicationError which maps to the ENOTFOUND. Thoughts?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One example is when the existing error is already retryable, then we don't need to make this check - This makes sense. I'll add one more check that if existing error is already retryble then no need to do anything.

Also, maybe we should only add this check when we see the ServiceCommunicationError which maps to the ENOTFOUND. - This check would be really specific for our use case only. In future if user will get different non-retryable error then he will not be able to retry again.

Also, I believe that if the error is not retryable and if the network connection is down, we should anyways retry the operation. What do you think ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the network was down, then we wouldn't be getting any error from the service in the first place isn't it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean that we always get the same ServiceCommunicationError during a network failure ? Doesn't it depends on the program state when network goes down ?

Also, what I am proposing above is that it doesn't matter whether we always get the ServiceCommunicationError or not, we should retry if the error is not retryable and we are not connected to the internet.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the network is down and we make a network request, Nodejs will always give ENOTFOUND error and amqp-common translates that to ServiceCommunicationError. Nodejs doesn't care about the program state, it only sees a network request being made.

ENOTFOUND can happen due to multiple reasons, network being down is one of them.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one more concern about this :)

I was talking about the case when we have made connection with the service, sent a request and is now waiting for a response but network fails. What would happen in this case? Will we get the same ENOTFOUND error?

In this case, rhea will send a disconnected event first. This results in calling detached for all open senders and receivers. The detached function sees that the sdk didnt close the sender/receiver and will try to reconnect the sender/receiver. The first retry attempt to run _init to reconnect them will fail with the ServiceCommunicationError and that is when your network connection check will come in.

This case is specific for our Event Hub and Service bus SDK. If other user will use the same retry method of Amqp-common, will they get the same ENOTFOUND error?
In this case I was thinking It should either be ConnectionTimeout or SocketClose error?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: ENOTFOUND, neither Node docs, errno docs, or getaddrinfo docs say anything about this error. However, a search of GitHub code shows we are not the only ones responding to this kind of error, so I find it extremely unlikely Node would change this error in the future. I've sent some tweets on the topic, will get back here if I hear anything different.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ENOTFOUND is indeed not a real posix error and that explains why it's hard to find any info on it. It's actually a node-specific error that isn't documented. The source of the error is here. While the comment implies the error will go away, on twitter, Anna (Node core member) suggests the error won't change and agrees that the comment should be removed and the error documented. I can push on that if we find it important, but personally I'm satisfied that this behavior won't change because of all the things that would break.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I was thinking It should either be ConnectionTimeout or SocketClose error?

Are you saying that we should rename ServiceCommunicationError or the ConnectionDetachedError?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closing the loop here based on offline conversation

This case is specific for our Event Hub and Service bus SDK. If other user will use the same retry method of Amqp-common, will they get the same ENOTFOUND error?

amqp-common is tightly coupled with rhea-promise and rhea. So, all users of amqp-common will get the disconnected event when the network goes down, which they can then react to just how we do.

Also, lets rename ConnectionDetachedError to ConnectionLostError

@ramya-rao-a
Copy link
Copy Markdown
Contributor

This fixes #32

Copy link
Copy Markdown
Member

@ramya0820 ramya0820 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline - for the URL being pinged, depending on client side network setup there could only be access to azure service endpoint in context.
In such cases, using the right port number would be important as well (based on the network firewall configuration on client side).

@ramya-rao-a ramya-rao-a merged commit eee2aa2 into Azure:master Mar 12, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Client Issues that refer to the client sdk

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants