[Networking] Application Gateway API is broken #2187

Closed
tombuildsstuff opened this issue Dec 20, 2017 · 23 comments
Labels
Network Service Attention Workflow: This issue is responsible by Azure service team.


@tombuildsstuff
Contributor

tombuildsstuff commented Dec 20, 2017

👋

When attempting to delete an Application Gateway via the Azure SDK for Go / the Azure APIs, the following HTTP request is generated:

DELETE /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/acctestrg-4072404901912593413/providers/Microsoft.Network/applicationGateways/acctestag-4072404901912593413?api-version=2017-09-01 HTTP/1.1
Host: management.azure.com
User-Agent: Go/go1.9.2 (amd64-darwin) go-autorest/8.0.0 Azure-SDK-For-Go/v11.1.1-beta arm-network/;HashiCorp-Terraform-v0.10.6
Accept-Encoding: gzip

which returns a Long Running Operation (HTTP 202 Accepted), as shown below:

HTTP/1.1 202 Accepted
Cache-Control: no-cache
Pragma: no-cache
Expires: -1
Location: https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.Network/locations/eastus/operationResults/2076bd9d-c8f3-44b0-896d-e5b067c51bb4?api-version=2017-09-01
Retry-After: 10
x-ms-request-id: 2076bd9d-c8f3-44b0-896d-e5b067c51bb4
Azure-AsyncOperation: https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.Network/locations/eastus/operations/2076bd9d-c8f3-44b0-896d-e5b067c51bb4?api-version=2017-09-01
Strict-Transport-Security: max-age=31536000; includeSubDomains
Server: Microsoft-HTTPAPI/2.0
Server: Microsoft-HTTPAPI/2.0
x-ms-ratelimit-remaining-subscription-writes: 1199
x-ms-correlation-request-id: daf1cc5a-a38d-42ef-8535-b68ec8aa0e7a
x-ms-routing-request-id: UKSOUTH:20171220T142556Z:daf1cc5a-a38d-42ef-8535-b68ec8aa0e7a
Date: Wed, 20 Dec 2017 14:25:56 GMT
Content-Length: 0
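The 202 response above carries two candidate polling URLs (Location and Azure-AsyncOperation) plus a Retry-After hint. The sketch below shows one way a client might choose between them. It assumes the client prefers Azure-AsyncOperation when present (the SDK request that follows does hit that URL) and otherwise falls back to Location; the function name and the 10-second fallback delay are hypothetical, not taken from the SDK.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical fallback delay; not a value taken from any Azure SDK.
DEFAULT_RETRY_AFTER_SECONDS = 10


def pick_polling_target(
    status_code: int, headers: Mapping[str, str]
) -> Optional[Tuple[str, int]]:
    """Return (polling_url, delay_seconds) for an accepted long-running operation.

    Assumption: Azure-AsyncOperation is preferred over Location. Returns None
    if the response is not a 201/202 or carries no usable polling header.
    """
    if status_code not in (201, 202):
        return None
    # HTTP header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    url = lowered.get("azure-asyncoperation") or lowered.get("location")
    if not url:
        return None
    retry_after = lowered.get("retry-after", "")
    delay = int(retry_after) if retry_after.isdigit() else DEFAULT_RETRY_AFTER_SECONDS
    return url, delay
```

Applied to the headers above, this would return the `.../operations/2076bd9d-...` URL with a 10-second delay.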

The Azure SDK for Go automatically polls the URL from the Azure-AsyncOperation header:

GET /subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.Network/locations/eastus/operations/2076bd9d-c8f3-44b0-896d-e5b067c51bb4?api-version=2017-09-01 HTTP/1.1
Host: management.azure.com
User-Agent: Go/go1.9.2 (amd64-darwin) go-autorest/8.0.0 Azure-SDK-For-Go/v11.1.1-beta arm-network/;HashiCorp-Terraform-v0.10.6
Accept-Encoding: gzip

...and it receives a response stating that the deletion is InProgress:

HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Transfer-Encoding: chunked
Content-Type: application/json; charset=utf-8
Content-Encoding: gzip
Expires: -1
Retry-After: 10
Vary: Accept-Encoding
x-ms-request-id: 653c3d14-5ce7-42dd-b052-b44249ab87f4
Strict-Transport-Security: max-age=31536000; includeSubDomains
Server: Microsoft-HTTPAPI/2.0
Server: Microsoft-HTTPAPI/2.0
x-ms-ratelimit-remaining-subscription-reads: 14980
x-ms-correlation-request-id: 8d70896c-50fb-4384-8bb4-209194a626df
x-ms-routing-request-id: UKSOUTH:20171220T142700Z:8d70896c-50fb-4384-8bb4-209194a626df
Date: Wed, 20 Dec 2017 14:26:59 GMT

{
  "status": "InProgress"
}

...eventually this returns Succeeded (which should mean that the Application Gateway no longer exists):

HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Transfer-Encoding: chunked
Content-Type: application/json; charset=utf-8
Content-Encoding: gzip
Expires: -1
Vary: Accept-Encoding
x-ms-request-id: f7e5a959-a3ca-4356-b559-d5bff8c8d914
Strict-Transport-Security: max-age=31536000; includeSubDomains
Server: Microsoft-HTTPAPI/2.0
Server: Microsoft-HTTPAPI/2.0
x-ms-ratelimit-remaining-subscription-reads: 14979
x-ms-correlation-request-id: d82f199f-c29f-4621-8076-41b18071b0d8
x-ms-routing-request-id: UKSOUTH:20171220T142711Z:d82f199f-c29f-4621-8076-41b18071b0d8
Date: Wed, 20 Dec 2017 14:27:11 GMT

{
  "status": "Succeeded"
}
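The polling the SDK performs between the two responses above can be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: `fetch` is a hypothetical callable that GETs the Azure-AsyncOperation URL and returns the decoded body plus the Retry-After value, and the set of terminal statuses (Succeeded, Failed, Canceled) is an assumption about the ARM async-operation convention.

```python
import time
from typing import Callable, Tuple

# Assumed terminal states for an ARM async operation.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Canceled"}


def poll_until_terminal(
    fetch: Callable[[], Tuple[dict, int]],
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 360,
) -> dict:
    """Poll until the operation's status leaves InProgress, then return the body.

    `fetch` (hypothetical) performs one GET and returns (json_body, retry_after).
    Raises TimeoutError if no terminal status is seen within `max_polls` polls.
    """
    for _ in range(max_polls):
        body, retry_after = fetch()
        if body.get("status") in TERMINAL_STATUSES:
            return body
        # Honour the server's Retry-After hint between polls.
        sleep(retry_after)
    raise TimeoutError(f"operation still not finished after {max_polls} polls")
```

As this issue shows, reaching Succeeded only means the operation reported completion; it does not by itself prove that dependent resources such as the Subnet have been released.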

At this point it should be possible to delete the Subnet (since there are no items left within it). Attempting to do so starts another Long Running Operation, which immediately fails:

HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Transfer-Encoding: chunked
Content-Type: application/json; charset=utf-8
Content-Encoding: gzip
Expires: -1
Vary: Accept-Encoding
x-ms-request-id: fa89ddc9-80f7-4f99-a14b-2bbc49c6cfca
Strict-Transport-Security: max-age=31536000; includeSubDomains
Server: Microsoft-HTTPAPI/2.0
Server: Microsoft-HTTPAPI/2.0
x-ms-ratelimit-remaining-subscription-reads: 14978
x-ms-correlation-request-id: fc8be827-c65c-4963-899c-9a29bce04598
x-ms-routing-request-id: UKSOUTH:20171220T142723Z:fc8be827-c65c-4963-899c-9a29bce04598
Date: Wed, 20 Dec 2017 14:27:23 GMT

{
  "status": "Failed",
  "error": {
    "code": "InternalServerError",
    "message": "An error occurred.",
    "details": []
  }
}
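Turning a failed status body like the one above into an error a caller can inspect might look like the sketch below. The error class and function are hypothetical; the message format simply mirrors the one quoted later in this thread (`Long running operation terminated with status 'Failed': ...`), and treating Canceled as a failure is an assumption.

```python
import json


class OperationFailedError(Exception):
    """Hypothetical error carrying the status, ARM error code and message."""

    def __init__(self, status: str, code: str, message: str, details: list):
        self.status, self.code, self.message, self.details = status, code, message, details
        super().__init__(
            f"Long running operation terminated with status '{status}': "
            f'Code="{code}" Message="{message}" Details={details}'
        )


def check_operation_body(raw: str) -> str:
    """Return the operation status, raising OperationFailedError on failure."""
    body = json.loads(raw)
    status = body.get("status", "")
    if status in ("Failed", "Canceled"):
        error = body.get("error") or {}
        raise OperationFailedError(
            status,
            error.get("code", ""),
            error.get("message", ""),
            error.get("details", []),
        )
    return status
```

Note that the only detail available here is the generic `InternalServerError` code, which is exactly why the failure is hard to act on.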

Since this is an InternalServerError we could retry the request, but we'd need to know whether this is actually a bug in the underlying service or a transient failure (which we can't easily determine). Investigating further by checking the Subnet's state via the Azure CLI returns a provisioningState of Failed, with no information about any items in the network that might still be present:

$ az network vnet subnet show --name internal --vnet-name=acctestvn-4072404901912593413 --resource-group acctestrg-4072404901912593413
{
  "addressPrefix": "10.254.0.0/24",
  "etag": "W/\"8385f6d6-1b24-41fa-8865-4cacb8490508\"",
  "id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/acctestrg-4072404901912593413/providers/Microsoft.Network/virtualNetworks/acctestvn-4072404901912593413/subnets/internal",
  "ipConfigurations": null,
  "name": "internal",
  "networkSecurityGroup": null,
  "privateAccessServices": null,
  "provisioningState": "Failed",
  "resourceGroup": "acctestrg-4072404901912593413",
  "resourceNavigationLinks": null,
  "routeTable": null
}
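Because a generic InternalServerError doesn't say whether the failure is transient, about the best a client can do is a bounded retry that eventually surfaces the error. A minimal sketch, assuming a hypothetical `TransientServiceError` raised for InternalServerError responses and an arbitrary exponential backoff starting at 10 seconds:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientServiceError(Exception):
    """Hypothetical error raised when an operation fails with InternalServerError."""


def retry_internal_errors(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on TransientServiceError with exponential backoff.

    A bounded retry cannot tell a transient failure from a service bug; it only
    limits how long we wait before re-raising the last error to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientServiceError:
            if attempt == attempts - 1:
                raise  # give up: likely a persistent service-side failure
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Bounding the attempts keeps a persistent service-side bug like this one from turning into an indefinite hang, at the cost of still failing after a few minutes.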

This is similar to #1233, which affects Virtual Network Gateways: in both cases the Networking APIs return incorrect information (reporting that resources have been deleted when they're actually still being deleted).
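One possible client-side mitigation for APIs that report deletion too early is to confirm that the deleted resource has actually disappeared (a GET returning 404 Not Found) before deleting anything that depends on it. The sketch below assumes a hypothetical `get_status_code` callable that issues that GET; it is not part of the Azure SDK, and given the Subnet output above, a 404 on the Application Gateway may still not guarantee that the Subnet can be deleted.

```python
import time
from typing import Callable


def wait_until_gone(
    get_status_code: Callable[[], int],
    interval: float = 10.0,
    max_checks: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a GET on the deleted resource returns 404 Not Found.

    `get_status_code` (hypothetical) issues a GET against the resource ID and
    returns the HTTP status code. Returns False if the resource is still
    visible after `max_checks` checks.
    """
    for check in range(max_checks):
        if get_status_code() == 404:
            return True
        # Avoid a pointless sleep after the final check.
        if check < max_checks - 1:
            sleep(interval)
    return False
```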

Would it be possible for someone to look into these bugs? From what I can ascertain, most of the complex Networking resources appear to share the same underlying bug; in this case 90% of our tests are failing consistently because of it.

Thanks!

@tombuildsstuff
Contributor Author

ping @alvadb :)

@genevieve

We hit this issue 3/5 times yesterday in West US.

@TechyMatt

ping @alvadb

@bingosummer
Member

I still hit the issue 4/4 times in Central US.

@tombuildsstuff
Contributor Author

@alvadb @mcardosos is there any chance we could get someone to prioritise a fix for this? This API is pretty unusable in its current state; unfortunately we've already shipped this resource, so it's become a major pain point for us 😞

@jaishals
Contributor

We have completed and committed the fix for this today. It should be available in production within a month or sooner. I will update the thread if I get exact dates.

@mcardosos
Contributor

Cool!
Thanks @jaishals!

@tombuildsstuff
Contributor Author

awesome, thanks @jaishals :)

@bingosummer
Member

Hi @jaishals , thanks for your fix. How is the roll-out going? It seems West US still has the issue.

@jaishals
Contributor

Hi @bingosummer, I checked, and the release containing my fix has not yet rolled out to most regions.

@tombuildsstuff
Contributor Author

@jaishals, is there a timeline for this? It appears we're still seeing the issue in West Europe; from the API response:

Long running operation terminated with status 'Failed': Code="InternalServerError" Message="An error occurred." Details=[]

@jaishals
Contributor

jaishals commented Jun 7, 2018

@tombuildsstuff, the fix has now been rolled out to West Europe.

@bingosummer
Member

I just tested creating/deleting an Application Gateway using Terraform in West Europe and still hit an error:

azurerm_application_gateway.azure_application_gateway: Still destroying... (ID: /subscriptions/xxx-...tionGateways/azure_application_gateway, 1m50s elapsed)
azurerm_application_gateway.azure_application_gateway: Destruction complete after 1m56s
azurerm_subnet.azure_subnet_appgw_in_default_rg: Destroying... (ID: /subscriptions/xxx-...re_bosh_network/subnets/azure_subnet_3)
azurerm_public_ip.azure_ip_application_gateway: Destroying... (ID: /subscriptions/xxx-...Addresses/azure_ip_application_gateway)
azurerm_public_ip.azure_ip_application_gateway: Destruction complete after 3s

Error: Error applying plan:

1 error(s) occurred:

* azurerm_subnet.azure_subnet_appgw_in_default_rg (destroy): 1 error(s) occurred:

* azurerm_subnet.azure_subnet_appgw_in_default_rg: Error waiting for completion for Subnet "azure_subnet_3" (VN "azure_bosh_network" / Resource Group "xxx"): Long running operation terminated with status 'Failed': Code="InternalServerError" Message="An error occurred." Details=[]

@tombuildsstuff
Contributor Author

@jaishals thanks for that. Unfortunately it appears we're still seeing the same issue in West Europe that @bingosummer has posted:

Error waiting for completion for Subnet "subnet-7866697221497721034" (VN "acctest-vnet-7866697221497721034" / Resource Group "acctestrg-7866697221497721034"): Long running operation terminated with status 'Failed': Code="InternalServerError" Message="An error occurred." Details=[]

@bingosummer
Member

@tombuildsstuff I'm working with @jaishals to troubleshoot it. I've provided Jaishal with the details (subscription ID, appgw name, ...) for the investigation.

@tombuildsstuff
Contributor Author

ping @bingosummer / @jaishals - any update here? :)

@jaishals
Contributor

@tombuildsstuff The fix will only be available once the new release rollouts start. For some reason it has taken longer this time; I am not sure of the cause. Sorry for the inconvenience and the delay.

@tombuildsstuff
Contributor Author

@jaishals thanks for the update :)

@salameer @devigned @kirthik sorry to pester, but would it be possible to escalate this release internally (for example, by adding the Service Team label)? This API bug was reported 7 months ago, and it sounds like a fix has been ready to ship for approximately 3 months.

Thanks!

@marstr
Member

marstr commented Jul 23, 2018

@tombuildsstuff, I started a conversation internally. It sounds like this is still 2-4 weeks out, and may take more time depending on the region. I'll try to keep this thread updated as I learn more.

@jaishals
Contributor

jaishals commented Aug 7, 2018

@tombuildsstuff The fix has been rolled out to production. Could you please retry the scenario you were testing? I created and deleted a deployment using Terraform, and the appgw / vnet / subnet were all deleted successfully.

@tombuildsstuff
Contributor Author

@jaishals, checking our nightly tests, this appears to be fixed. I'm going to leave this issue open for the moment and confirm for sure in a week, but this looks good at first glance :). Thanks for getting this fixed 👍

@bingosummer
Member

I also verified the fix in my pipeline. Looks good.

@tombuildsstuff
Contributor Author

👋 hey all

To give an update here: looking at our test results, this API appears to be fixed (the Application Gateway tests I've checked have passed consistently since 3rd August), so I'm going to close this issue.

Thanks for all the work to get this resolved :)

@bsiegel added the Service Attention label Sep 26, 2018
9 participants