net: 512 byte DNS response size limit causes "cannot unmarshal DNS" error #51127

Closed
AaronFriel opened this issue Feb 10, 2022 · 38 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@AaronFriel

AaronFriel commented Feb 10, 2022

So, you found this issue googling for "cannot unmarshal DNS"

There's good news: your issue has largely been fixed. I originally created the issue below because I hit it on my own network and operating system, but further investigation showed that it has affected every major OS, users of VPNs, DNS providers written in Go, and more.

If you are a maintainer of code and someone has reported this issue: if you can update your build system to use Go 1.16.15 or 1.17.8, or Go 1.18, then you should see this go away and solve your users' issues.
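For maintainers who want a quick check, here is a minimal sketch that compares the toolchain version against the releases named above. `hasDNSSizeFix` is a hypothetical helper, not a standard library function, and it only understands plain release strings such as `go1.17.6`; development and pre-release builds are reported as errors rather than guessed at.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// hasDNSSizeFix reports whether a Go release string such as "go1.17.6"
// is at least one of the releases named in this issue as carrying the
// fix: 1.16.15, 1.17.8, or 1.18 and later.
// Hypothetical helper for illustration only.
func hasDNSSizeFix(version string) (bool, error) {
	if !strings.HasPrefix(version, "go") {
		return false, fmt.Errorf("not a release version: %q", version)
	}
	// Ignore anything after the first space, then split "1.17.6" on dots.
	fields := strings.Fields(strings.TrimPrefix(version, "go"))
	if len(fields) == 0 {
		return false, fmt.Errorf("empty version: %q", version)
	}
	parts := strings.Split(fields[0], ".")
	if len(parts) < 2 || len(parts) > 3 {
		return false, fmt.Errorf("unexpected version format: %q", version)
	}
	nums := make([]int, 3) // a missing patch component counts as 0
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return false, fmt.Errorf("cannot parse %q: %w", version, err)
		}
		nums[i] = n
	}
	major, minor, patch := nums[0], nums[1], nums[2]
	switch {
	case major != 1:
		return major > 1, nil
	case minor >= 18:
		return true, nil
	case minor == 17:
		return patch >= 8, nil
	case minor == 16:
		return patch >= 15, nil
	default:
		return false, nil
	}
}

func main() {
	ok, err := hasDNSSizeFix(runtime.Version())
	if err != nil {
		fmt.Println("could not tell:", err)
		return
	}
	fmt.Printf("%s includes the fix: %v\n", runtime.Version(), ok)
}
```

Note that this reports the toolchain that built the program running the check, so it is only meaningful when run as part of the same build that produces your release binaries.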

If you are a user of a program and see this error, you need to ask the maintainer or creator of that package to do likewise. Unfortunately, there isn't a single set of instructions I can give for a workaround. If you're using a VPN, try using that program not on a VPN; that seems to be the most common user-reported scenario I've seen.


Original bug report:

What version of Go are you using (go version)?

$ go version
go version go1.17.6 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

Note: WSL2 on Windows. This is relevant, but not the sole scenario in which it can occur, see below.

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/friel/.cache/go-build"
GOENV="/home/friel/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/friel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/friel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/friel/.local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/friel/.local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/friel/go/src/github.com/pulumi/pulumi-yaml/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3112884807=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Use infrastructure as code tools to manage Azure, and/or attempt to execute net.LookupIP("management.azure.com").

Example program:

package main

import (
	"fmt"
	"net"
)

func main() {
	ips, err := net.LookupIP("management.azure.com")
	if err != nil {
		panic(err)
	}
	for _, ip := range ips {
		fmt.Println(ip)
	}
}

What did you expect to see?

I expected to see the current IP, 13.86.219.80, as shown by the last line of:

$ host management.azure.com
management.azure.com is an alias for management.privatelink.azure.com.
management.privatelink.azure.com is an alias for arm-frontdoor-prod.trafficmanager.net.
arm-frontdoor-prod.trafficmanager.net is an alias for westus.management.azure.com.
westus.management.azure.com is an alias for arm-frontdoor-westus.trafficmanager.net.
arm-frontdoor-westus.trafficmanager.net is an alias for westus.cs.management.azure.com.
westus.cs.management.azure.com is an alias for rpfd-prod-by-01.cloudapp.net.
rpfd-prod-by-01.cloudapp.net has address 13.86.219.80

What did you see instead?

$ go run resolve-test.go 
panic: lookup management.azure.com on 172.20.32.1:53: cannot unmarshal DNS message

goroutine 1 [running]:
main.main()
        /home/friel/c/resolve-test/resolve-test.go:11 +0xe8
exit status 2

Miscellany

It looks like this issue is widely affecting infrastructure as code tools such as Pulumi, Terraform, and others when they make API calls to Microsoft Azure on the Windows Subsystem for Linux 2, on Microsoft Windows.

This is a bit of a rock-and-a-hard-place situation. Microsoft is unlikely to update their DNS server to adhere to the pre-1999 DNS specification. The Go language team is in a position to be much more agile and issue a point release that supports a larger buffer size. Even going up to a single standard MTU of ~1500 bytes would resolve this issue in the near term.

As this problem primarily affects programs written in Go, in this author's estimation it seems unlikely a change in Windows' DNS server behavior could occur as quickly, even if the stars were to align on the need to change the implementation. Note that host, dig, nslookup, etc. all behave correctly.

Collected notes and root cause analysis:

DNS Flag Day 2020 had an explicit goal of ensuring that resolvers had a minimum accepted buffer size of 1232 bytes: https://dnsflagday.net/2020/#action-dns-resolver-operators

AaronFriel added a commit to AaronFriel/go that referenced this issue Feb 10, 2022
This resolves golang#51127 in the near term by defaulting to a larger buffer
size. This is not a permanent fix or implementation of EDNS(0) or [IETF
RFC6891](https://datatracker.ietf.org/doc/html/rfc6891).

These changes should be reviewed by someone with more experience than I.

:)

Signed-off-by: Aaron Friel <[email protected]>
@gopherbot
Contributor

Change https://go.dev/cl/384076 mentions this issue: net/dns: Increase UDP response buffer to 1232 bytes

@mvdan changed the title from "affected/package: net, 512 byte DNS response size limit causes error when making requests to Azure on WSL2." to "net: 512 byte DNS response size limit causes error when making requests to Azure on WSL2" on Feb 10, 2022
@mvdan
Member

mvdan commented Feb 10, 2022

cc @ianlancetaylor @neild

@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 10, 2022
@gutzi

gutzi commented Feb 10, 2022

Workaround: We were able to work around the problem by adding a DNS entry in the hosts file:
51.107.60.33 management.azure.com
When using WSL, the hosts file can be edited in Windows at %windir%\system32\drivers\etc\hosts; then restart WSL.
So at least we could use Terraform again.

@AaronFriel
Author

AaronFriel commented Feb 10, 2022

For what it's worth, there is no generally applicable workaround that fixes users' experience without other side effects and possible downsides.

That IP isn't the same IP I see, so I wonder if there's some geographic DNS response occurring.

@seankhliao
Member

previously #11070

@seankhliao
Member

Even from the linked site, the recommendation for the increased buffer size is for EDNS0 which is not implemented here (ref #6464). Equally important on their site is the support for TCP, and had WSL followed spec and returned a proper truncated response, it would have been retried gracefully.
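To make the spec point above concrete, here is a small sketch of the distinction being drawn: a response with the truncation (TC) bit set tells the client to retry over TCP, while a response larger than 512 bytes without TC (to a client that sent no EDNS(0) record) is out of spec. It parses only the fixed 12-byte header by hand; `classify` and the constant names are hypothetical, and this is not the Go resolver's code.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	classicUDPLimit = 512    // UDP message limit without EDNS(0), per RFC 1035
	flagTC          = 1 << 9 // truncation bit in the 16-bit header flags field
)

// classify inspects a raw DNS response received over UDP by a client
// that did not advertise a larger size via EDNS(0).
// Hypothetical helper for illustration only.
func classify(msg []byte) (string, error) {
	if len(msg) < 12 {
		return "", errors.New("message shorter than the 12-byte DNS header")
	}
	flags := binary.BigEndian.Uint16(msg[2:4])
	switch {
	case flags&flagTC != 0:
		return "truncated: retry over TCP", nil
	case len(msg) > classicUDPLimit:
		return "oversized without TC: out of spec", nil
	default:
		return "ok", nil
	}
}

func main() {
	// A fake 600-byte response: only the QR (response) bit is set.
	msg := make([]byte, 600)
	msg[2] = 0x80
	fmt.Println(classify(msg))

	// The same message with the TC bit set, as a conforming server would send.
	msg[2] |= 0x02
	fmt.Println(classify(msg))
}
```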

@AaronFriel
Author

AaronFriel commented Feb 10, 2022

@seankhliao

I would push back on the notion that this should be resolved elsewhere.

Go is the exception to behaving correctly: other userland programs such as dig(1), nslookup(1), host(1), as well as glibc API calls such as getaddrinfo(3) work. I can write Python, C#, Rust, C, etc, and those will work correctly in this networking environment.

Go is adhering strictly to an antiquated standard. EDNS0 has been a standard since 1999, and larger responses are not a new specification, the result of rapidly moving network standards, or the ground shifting under Go. Strict adherence to 512 byte responses is not followed by other tools in the same ecosystem. Go ought to "be liberal in what it accepts", within reason and, of course, unless doing so would violate memory safety or other safety criteria of the software.

End-users are not in a position to solve their upstream DNS server's issues, nor are software maintainers. We don't have control over our end user's DNS servers.

This error isn't unique to the situation I described, it's just most acute right now for those users in the specific scenario I documented. 112 issues have been reported on GitHub with the text "cannot unmarshal DNS", and a survey of those shows that they have occurred across all platforms and among extraordinarily widely used pieces of software across Mac, Windows, *nix. Those issues show that various other VPN providers, ISPs, routers, have all behaved similarly. And going back to the earlier points, users don't have control over those things and we shouldn't expect all Go software users to be software engineers or to be able to modify their DNS configuration.

Lastly, I strongly believe that software that works is superior to software that does not, and end-users of the software will not care what link in the chain is causing it not to work.

There is an opportunity to mitigate an issue end-users are facing in one place, and I think bringing Golang into alignment with the rest of the ecosystem will positively impact users.

@mdempsky
Contributor

Thanks for the report.

> Microsoft is unlikely to update their DNS server to adhere to the pre-1999 DNS specification.

Why not? It's been a while since I've read DNS RFCs, but my impression is still today that DNS servers are not allowed to send >512-byte responses unless the client explicitly indicates support for such using EDNS.

As such, I feel like emphasizing "pre-1999" is unfair. I think Microsoft should update their DNS server to adhere to the DNS specification. I'd prefer we don't add hacks to accommodate non-spec behavior.

However, #6464 remains open if someone wants to update Go's DNS client to use EDNS, and to support+advertise a larger buffer size. I think that's the standards-conforming way to address this issue, if folks aren't willing to wait on the issue being fixed in WSL2.

@AaronFriel
Author

Hey @mdempsky I would like this re-opened please. Any way we could get on a call to chat about this?

@leitzler
Contributor

Just for reference, there is a (currently open) issue over at WSL that should cover this issue: microsoft/WSL#7642. I'd suggest adding your findings there as well.

@AaronFriel
Author

Understood, though I'd like to chat with someone on the Go language team about the scope & impact of this issue. It's affecting customers of major Go language-built software & has for about seven years. It's particularly acute because, I suspect, none of the players wants to take responsibility for fixing this.

End users do not care why their software is broken, but we have an opportunity here to address, at least partially, thousands of issues raised by users over the past 7 years. And if the Pareto principle is applicable here, I suspect those users knowledgeable enough and motivated enough to comment on GitHub are just a fraction of those impacted.

@mdempsky
Contributor

> Hey @mdempsky I would like this re-opened please. Any way we could get on a call to chat about this?

Why? What do you hope these requests would accomplish?

As stated, the Go DNS client is spec-compliant to the best of my knowledge, and a feature request issue (#6464) already exists that I believe would make it more accommodating to non-compliant DNS software like WSL2. It just needs someone to implement it. I'm happy to review CLs.

@leitzler
Contributor

With all due respect, Go is an open source project and I think that your best bet to get a desired change through isn't via a private call with a maintainer.

@AaronFriel
Author

AaronFriel commented Feb 10, 2022

Other languages & libraries use larger buffers and accept larger responses in order to "be liberal in what they accept" to tolerate non-compliant implementations, and a concerted effort by a consortium of DNS implementations and stakeholders pushed for a larger acceptable buffer size in 2020, more than two decades after that specification was accepted.

And end users do not care why their software does not work. I think a phone call might be a better channel to have an empathetic conversation over the issues I've read & the litany of closed/unsolved issues reported against packages on GitHub, StackOverflow, and elsewhere.

Otherwise, I can keep replying, but I don't see any responses to my points on the merits so far. I would like to raise the bar from this text-based conversation to one that's more empathetic toward end-users.

I think we should try, here, to solve customer, end-user problems.

@mdempsky
Contributor

mdempsky commented Feb 10, 2022

> I think we should try, here, to solve customer, end-user problems.

We've identified two ways to do that already: have WSL2 fix their DNS server (microsoft/WSL#7642), or implement #6464.

@AaronFriel
Author

AaronFriel commented Feb 10, 2022

Would anything break by using a larger default buffer for responses? I think that's what glibc does, and as observed previously I think Go is an outlier here among languages & libraries in not tolerating a larger response.

@ianlancetaylor
Member

There is something I don't understand here. Apparently some DNS server is out of spec by sending packets greater than 512 bytes without setting the truncated bit. But it can't be the regular Microsoft server, or Go programs on all systems would be reporting problems, not just programs on WSL. Does WSL run a local name server? What is the nameserver entry in resolv.conf? What happens if you change it to 8.8.8.8 or 1.1.1.1?

CC @jstarks for WSL issue.
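The nameserver experiment suggested above can also be tried from inside a Go program without editing resolv.conf. This is a minimal sketch using `net.Resolver` with `PreferGo` and a custom `Dial`; `newResolver` is a hypothetical helper, and the server address is just one of the public resolvers mentioned in the comment. It is a diagnostic aid, not a recommended fix, since bypassing the configured resolver can break private or split-horizon DNS.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newResolver returns a resolver that uses Go's built-in DNS client and
// sends every query to server (host:port) instead of the nameserver
// configured on the system. Hypothetical helper for illustration only.
func newResolver(server string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			// Ignore the address Go picked from the system configuration
			// and dial the chosen server instead.
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, server)
		},
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	addrs, err := newResolver("1.1.1.1:53").LookupIPAddr(ctx, "management.azure.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, a := range addrs {
		fmt.Println(a.IP)
	}
}
```

If the lookup succeeds here but fails with the default resolver, that points at the locally configured DNS server rather than the upstream one.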

@AaronFriel
Author

AaronFriel commented Feb 11, 2022

@ianlancetaylor First, you're right, the WSL2 DNS server is out of spec. No question there.

Second, let's take a step back - this isn't a WSL2 specific issue. Fixing the acute issue users are facing in WSL2 is WSL2 specific, but I'd encourage you to read the many, many comments on GitHub issues. https://github.com/search?o=asc&q=%22cannot+unmarshal+DNS%22&s=created&type=Issues

Starting with these issues which predate WSL2.

I'm using a red circle to indicate that a user's problem was never solved, a yellow circle to indicate that a workaround was implemented to mitigate customer issues, but didn't root cause them, and a green circle when a project that is actually a DNS server solved the issue. I'm also using GitHub Markdown's list notation to provide partially unfurled data about the link destination via just pasting in URLs.

Consul

Confd

Docker

Kubernetes

Weave

rakyll/drive / odeke-em/drive

Mesos, again

Resolvable, a Docker DNS resolver

Goproxy

Moby / then Docker

  • 🔴 Various users report DNS not working, the workaround posted near the bottom can hardly be called such. 33 comments, more than a dozen users reporting issues. This is in 2016, so users were various Linux distributions. Embedded dns server breaks DIND moby/moby#20037
  • This is still an open issue.

freegeoip

  • 🔴 User gets an error when trying to perform a Get in a go application using Docker. cannot unmarshal DNS Message fiorix/freegeoip#160
  • "Could have been. I've literally just changed service providers from yesterday so I'm using different DNS servers. The error has gone away. Odd."

heroku

clair

Docker for Mac

gorush application server

Docker for Mac

@AaronFriel
Author

I think that software that works is better than software that doesn't work, and if a partial mitigation before EDNS0 support lands in Go would have prevented these issues, shouldn't it have been done? How many frustrated users is too many?

That's just the first two pages of results from the GitHub issues. I'll continue tomorrow.

@gopherbot
Contributor

Change https://go.dev/cl/385035 mentions this issue: net: send EDNS(0) packet length in DNS query

@ianlancetaylor
Member

@AaronFriel Can you or someone else with WSL see if https://go.dev/cl/385035 fixes the problem? That CL uses EDNS(0) to advertise a permitted packet size of 1232 bytes.

Although I have to say that if there are DNS servers out there that incorrectly send responses larger than 512 bytes in the absence of an EDNS(0) packet length, then I suspect that there are DNS servers out there that will simply ignore the EDNS(0) packet length and send whatever packet size they feel like. So I don't know how much this will actually help.

@AaronFriel
Author

@ianlancetaylor I can, with great enthusiasm, report that your CL causes the test case to pass in the issue.

🎉🎉🎉🎉🎉

It took me a bit to figure out how to check out the CL - I used the base64 encoded blob, not sure if that's the easiest way to do it - but I did build Go locally. And the result of running the test command is starkly different.

Go 1.17.6:

$ ~/.local/go/bin.1.17.6/go run resolve-test.go 
panic: lookup management.azure.com on 172.20.32.1:53: cannot unmarshal DNS message

goroutine 1 [running]:
main.main()
        /home/friel/c/tmp/resolve-test.go:11 +0xe8
exit status 2

With patch applied:

$ ~/c/gh/go/bin/go run resolve-test.go 
13.86.219.80

I rebuilt the Pulumi toolchain that a user reported this error on and which I was able to reproduce, and I can confirm that issue is mitigated as well.

I anticipate this would resolve issues for our friends and colleagues in the infra-as-code ecosystem, as well as anyone else using Go tooling to manage or authenticate with Azure, and likely many of the issues folks experienced with non-conforming DNS resolvers outside their control, whether part of a proxy, a VPN, their ISP's router, or otherwise.

If this could be included in the next dot release of Go, I would be eternally grateful. 🙇

@gopherbot
Contributor

Change https://go.dev/cl/386016 mentions this issue: net: send EDNS(0) packet length in DNS query

@gopherbot
Contributor

Change https://go.dev/cl/386014 mentions this issue: Revert "net: send EDNS(0) packet length in DNS query"

gopherbot pushed a commit that referenced this issue Feb 15, 2022
This reverts https://go.dev/cl/385035. For 1.18 we will use a simple
change to increase the accepted DNS packet size, to handle what appear
to be broken resolvers that don't honor the 512 byte limit. For 1.19
we will restore CL 385035 to make a proper EDNS request, so that it
has more testing time before it goes out in a release.

For #6464
For #21160
For #44135
For #51127
For #51153

Change-Id: Ie4a0eb85ca0a6a73bee5cd4cfc6b7d2a15ef259f
Reviewed-on: https://go-review.googlesource.com/c/go/+/386014
Trust: Ian Lance Taylor <[email protected]>
Reviewed-by: Matthew Dempsky <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
@gopherbot
Contributor

Change https://go.dev/cl/386034 mentions this issue: [release-branch.go1.16] net: increase maximum accepted DNS packet to 1232 bytes

@gopherbot
Contributor

Change https://go.dev/cl/386035 mentions this issue: [release-branch.go1.17] net: increase maximum accepted DNS packet to 1232 bytes

gopherbot pushed a commit that referenced this issue Feb 15, 2022
The existing value of 512 bytes is as specified by RFC 1035.
However, the WSL resolver reportedly sends larger packets without
setting the truncation bit, which breaks using the Go resolver.
For 1.18 and backports, just increase the accepted packet size.
This is what GNU glibc does (they use 65536 bytes).

For 1.19 we plan to use EDNS to set the accepted packet size.
That will give us more time to test whether that causes any problems.

No test because I'm not sure how to write one and it wouldn't really
be useful anyhow.

Fixes #6464
Fixes #21160
Fixes #44135
Fixes #51127
For #51153

Change-Id: I0243f274a06e010ebb714e138a65386086aecf17
Reviewed-on: https://go-review.googlesource.com/c/go/+/386015
Trust: Ian Lance Taylor <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
Reviewed-by: Matthew Dempsky <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
gopherbot pushed a commit that referenced this issue Feb 17, 2022
…1232 bytes

The existing value of 512 bytes is as specified by RFC 1035.
However, the WSL resolver reportedly sends larger packets without
setting the truncation bit, which breaks using the Go resolver.
For 1.18 and backports, just increase the accepted packet size.
This is what GNU glibc does (they use 65536 bytes).

For 1.19 we plan to use EDNS to set the accepted packet size.
That will give us more time to test whether that causes any problems.

No test because I'm not sure how to write one and it wouldn't really
be useful anyhow.

For #6464
For #21160
For #44135
For #51127
For #51153
Fixes #51162

Change-Id: I0243f274a06e010ebb714e138a65386086aecf17
Reviewed-on: https://go-review.googlesource.com/c/go/+/386015
Trust: Ian Lance Taylor <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
Reviewed-by: Matthew Dempsky <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
(cherry picked from commit 6e82ff8)
Reviewed-on: https://go-review.googlesource.com/c/go/+/386035
Reviewed-by: Dmitri Shuralyov <[email protected]>
gopherbot pushed a commit that referenced this issue Feb 17, 2022
…1232 bytes

The existing value of 512 bytes is as specified by RFC 1035.
However, the WSL resolver reportedly sends larger packets without
setting the truncation bit, which breaks using the Go resolver.
For 1.18 and backports, just increase the accepted packet size.
This is what GNU glibc does (they use 65536 bytes).

For 1.19 we plan to use EDNS to set the accepted packet size.
That will give us more time to test whether that causes any problems.

No test because I'm not sure how to write one and it wouldn't really
be useful anyhow.

For #6464
For #21160
For #44135
For #51127
For #51153
Fixes #51161

Change-Id: I0243f274a06e010ebb714e138a65386086aecf17
Reviewed-on: https://go-review.googlesource.com/c/go/+/386015
Trust: Ian Lance Taylor <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
Reviewed-by: Matthew Dempsky <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
(cherry picked from commit 6e82ff8)
Reviewed-on: https://go-review.googlesource.com/c/go/+/386034
Reviewed-by: Dmitri Shuralyov <[email protected]>
gopherbot pushed a commit that referenced this issue Mar 3, 2022
Advertise to DNS resolvers that we are willing and able to accept up
to 1232 bytes in a DNS packet. The value 1232 was chosen based on
https://dnsflagday.net/2020/.

For #6464
For #21160
For #44135
For #51127
Fixes #51153

Change-Id: If9182d5210bfe047cf0a4d46163effc6812ab677
Reviewed-on: https://go-review.googlesource.com/c/go/+/386016
Trust: Ian Lance Taylor <[email protected]>
Run-TryBot: Ian Lance Taylor <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@AaronFriel changed the title from "net: 512 byte DNS response size limit causes error when making requests to Azure on WSL2" to "net: 512 byte DNS response size limit causes error when making DNS requests" on Mar 15, 2022
@AaronFriel changed the title from "net: 512 byte DNS response size limit causes error when making DNS requests" to "net: 512 byte DNS response size limit causes "cannot unmarshal DNS" error" on Mar 15, 2022
bflad added a commit to hashicorp/terraform-provider-dns that referenced this issue Mar 21, 2022
Reference: golang/go#51127
Reference: #157
Reference: #188

Updates the testing and release processes to the latest 1.16.x version, which resolves a longstanding Go resolver issue with responses greater than 512 bytes. Verifies by enabling a previously skipped acceptance test.

Also adds CHANGELOG entries for upstream module updates which are bundled with this provider release, which may fix specific EDNS handling issues.
@golang golang locked and limited conversation to collaborators Mar 15, 2023