Skip to content

[Storage] retry on incomplete XML responses#13076

Merged
jeremymeng merged 9 commits intoAzure:masterfrom
jeremymeng:storage-retry-incomplete-xml
Jan 26, 2021
Merged

[Storage] retry on incomplete XML responses#13076
jeremymeng merged 9 commits intoAzure:masterfrom
jeremymeng:storage-retry-incomplete-xml

Conversation

@jeremymeng
Copy link
Copy Markdown
Member

@jeremymeng jeremymeng commented Jan 5, 2021

When service times out (default max 30s) it terminates the connection
but current stable version of node_fetch doesn't report
error. Instead it returns the incomplete response which leads to XML
parse error. It's unlikely that service would send back incomplete
response on purpose so it doesn't hurt to treat this error as a
TIMEOUT error and retry the request.

The deserialization policy factory needs to move below retry policy
factory so parse error from deserialization can be retried.

After changing the order of deserialization policy and retry policy,
error.code is now populated properly by deserialization policy. This
surfaces an issue where an error with code ResourceNotFound will
also be retried because it contains eNotFound and we use
error.code.toString().toUpperCase().includes() to see if the error
is in the list. It passed the check for the network error code
ENOUTFOUND. This change fixes it by using exact match when checking
error code.

When service times out (default max 30s) it terminates the connection
but current stable version of `node_fetch` doesn't report
error. Instead it returns the incomplete response which leads to XML
parse error.  It's unlikely that service would send back incomplete
response on purpose so it doesn't hurt to treat this error as a
`TIMEOUT` error and retry the request.

The deserialization policy factory needs to move below retry policy
factory so parse error from deserialization can be retried.
@ghost ghost added the Storage Storage Service (Queues, Blobs, Files) label Jan 5, 2021
@jeremymeng jeremymeng added the Client This issue points to a problem in the data-plane of the library. label Jan 5, 2021
@jeremymeng
Copy link
Copy Markdown
Member Author

/azp run js - storage-blob - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@jeremymeng
Copy link
Copy Markdown
Member Author

/azp run js - storage-blob - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

@HarshaNalluru HarshaNalluru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@jeremymeng
Copy link
Copy Markdown
Member Author

@ljian3377 please have a look. If looking good I will change queue/fileshare/etc.

@ljian3377
Copy link
Copy Markdown
Member

ljian3377 commented Jan 7, 2021

The ultimate fix for partial response should be upgrading to node-fetch 3.x? Please leave the original issue open till we do that.

The change looks good but I not sure if it's useful. Have you tested this in a real environment? Does this fix mitigate the issue?

@jeremymeng
Copy link
Copy Markdown
Member Author

The ultimate fix for partial response should be upgrading to node-fetch 3.x?

Possibly although I've not tested it. In v3.x we might get an error from node-fetch which we would retry, instead of incomplete response. v3.x is in beta now. I am not sure about its timeline.

The change looks good but I not sure if it's useful. Have you tested this in a real environment? Does this fix mitigate the issue?

Yes it helps on running the repro code. There's still possibility of getting the same error in each retry if unlucky, but it's better than before.

After changing the order of deserialization policy and retry policy,
`error.code` is now populated properly by deserialization policy. This
surfaces an issue where an error with code `ResourceNotFound` will
also be retried because it contains `eNotFound` and we use
`error.code.toString().toUpperCase().includes()` to see if the error
is in the list. It passed the check for the network error code
`ENOUTFOUND`.  This change fixes it by using exact match when checking
error code.
@jeremymeng
Copy link
Copy Markdown
Member Author

/azp run js - storage-blob - tests

@jeremymeng
Copy link
Copy Markdown
Member Author

/azp run js - storage-file-share - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

.toString()
.toUpperCase()
.includes(retriableError))
(err.code && err.code.toString().toUpperCase() === retriableError)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is right but let's also check with @XiaoningLiu

@ljian3377
Copy link
Copy Markdown
Member

Logged #13119

Copy link
Copy Markdown
Member

@xirzec xirzec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This ended up being a pretty elegant solution.

@jeremymeng jeremymeng merged commit ee90255 into Azure:master Jan 26, 2021
@jeremymeng jeremymeng deleted the storage-retry-incomplete-xml branch January 26, 2021 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Client This issue points to a problem in the data-plane of the library. Storage Storage Service (Queues, Blobs, Files)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants