
Copy fails with "use of closed network connection" error when using a slow proxy #2105

Open
vinzent opened this issue Sep 1, 2023 · 9 comments


@vinzent

vinzent commented Sep 1, 2023

We have an internal Quay registry where we mirror external images. We need to use a proxy (ZScaler). Sometimes some blobs take quite a while to finish downloading, up to 3 minutes. I suspect the proxy runs some scans before the download can finish. It seems skopeo opens a connection to the target registry, where the TCP connection closes due to idleness before the download of the source layer finishes (just wild guessing).

Example of a failed copy:

sh-5.2$ rpm -q skopeo
skopeo-1.13.2-1.fc38.x86_64
sh-5.2$ no_proxy=.internal.domain https_proxy=zscaler:443 skopeo copy docker://docker.io/kibana:6.7.1 docker://quay.internal.domain/kibana/kibana:6.7.1
Getting image source signatures
Copying blob 5334d2bd41b7 done
Copying blob f0b755f02a6c done
Copying blob 41d16359541e done
Copying blob 2a0588ea5574 done
Copying blob 9de755da2607 done
Copying blob 8ba884070f61 skipped: already exists
Copying blob fc3de62ddb21 done
Copying blob 64b08a88c3fb done
FATA[0173] copying system image from manifest list: writing blob: Patch https://quay.internal.domain/v2/kibana/kibana/blobs/uploads/4211057a-e8d9-446a-ac54-cd6d157857ba: use of closed network connection

Workaround

Copy to local storage first, then to the internal registry.

https_proxy=zscaler:443 skopeo copy docker://docker.io/kibana:6.7.1 containers-storage:quay.internal.domain/kibana/kibana:6.7.1
skopeo copy containers-storage:quay.internal.domain/kibana/kibana:6.7.1 docker://quay.internal.domain/kibana/kibana:6.7.1

Versions

I used the following two versions:

  • skopeo-1.13.2-1.fc38.x86_64 (Fedora 38)
  • skopeo-1.11.2-0.2.module+el8.8+18251+ad5b274c (RHEL 8.8)

Other things

I see the same error when using the Red Hat Quay repo mirror feature (support case opened). Looking at the logs, it seems that skopeo is used in the background.

@mtrmac
Collaborator

mtrmac commented Sep 4, 2023

Thanks for your report.

It seems skopeo opens a connection to the target registry, where the TCP connection closes due to idleness before the download of the source layer finishes (just wild guessing).

The implementation always reads at least a few bytes from the source before initiating the upload to the destination; so it’s not that the source somehow spends a lot of time starting the download. The download is either delayed somewhere in between, or possibly at the very end, at EOF. In both cases I think there’s little the destination can do about a slow source, and the design of the copy implementation (with a possible compression/decompression transformation) makes it very hard, in general, to somehow restart the upload on failure.
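
As a rough illustration of that “read a few bytes before starting the upload” point (not the actual c/image code; detectThenUpload and the 8-byte peek are made up for this sketch):

import (
	"bufio"
	"io"
)

// detectThenUpload peeks at the start of the stream (for example to detect
// compression) without consuming it, so by the time the upload to the
// destination starts, the source has already produced some data.
func detectThenUpload(src io.Reader, upload func(io.Reader) error) error {
	br := bufio.NewReader(src)
	header, err := br.Peek(8) // enough for common magic numbers
	if err != nil && err != io.EOF {
		return err
	}
	_ = header // e.g. the gzip magic bytes 0x1f 0x8b would be checked here
	// The peeked bytes stay buffered in br, so the uploader still sees the
	// whole stream from the beginning.
	return upload(br)
}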

It would probably be valuable to gather some more data about the timing/cause. Is it a delay when starting the copy? When finishing? Is it just slow? Is the cause the source or the destination?


In this area, Skopeo is just a dumb wrapper around the c/image implementation, so I’m transferring the report there.

I also rather suspect that this is an error handling bug in the standard Go library. There are ~two network connections involved: the destination one is purely internal to the standard-library HTTP stack, and not managed by the c/image code; the source one is more of a responsibility of the c/image code, but errors from that one would include a “happened during read” annotation. I realize blaming the standard library is usually the wrong guess; investigating more…


WRT the workaround, I would recommend using dir: instead of containers-storage:, unless you need the image in local storage anyway. dir: is much less resource-intensive, and it can preserve the original representation of the image (notably including the original digests).
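
For example, the dir: variant of the workaround would look roughly like this (the /tmp/kibana-6.7.1 path is chosen purely for illustration):

https_proxy=zscaler:443 skopeo copy docker://docker.io/kibana:6.7.1 dir:/tmp/kibana-6.7.1
skopeo copy dir:/tmp/kibana-6.7.1 docker://quay.internal.domain/kibana/kibana:6.7.1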

@mtrmac mtrmac transferred this issue from containers/skopeo Sep 4, 2023
@mtrmac
Collaborator

mtrmac commented Sep 4, 2023

  • skopeo-1.13.2-1.fc38.x86_64 (Fedora 38)

Built with golang-1.20.7-1.fc38.

@mtrmac
Collaborator

mtrmac commented Sep 4, 2023

I also rather suspect that this is an error handling bug in the standard Go library.

To clarify, I suspect a mostly-cosmetic issue: there is probably some underlying cause (“connection closed by the remote server”, “timed out”) which isn’t reported but should be — but the underlying cause wouldn’t go away just because it would be correctly reported.

@mtrmac
Collaborator

mtrmac commented Sep 4, 2023

Randomly guessing a reproducer, yes, there is a standard library error handling problem:

// The test server reads a few bytes of the request body, hijacks the
// underlying connection and closes it while the client is still streaming
// the body; the client then fails with "use of closed network connection".
// (This runs in the c/image "docker" package; besides the standard library
// it uses logrus, testify/require and the c/image pkg/tlsclientconfig package.)
func TestUseOfClosedNetworkConnection(t *testing.T) {
	s := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var buf1 [5]byte
		n, err := r.Body.Read(buf1[:])
		require.NoError(t, err, "reading body")
		logrus.Errorf("Got %d bytes: %s", n, string(buf1[:n]))

		rc := http.NewResponseController(w)
		c, buf, err := rc.Hijack()
		require.NoError(t, err, "Hijack")
		bufLen := buf.Reader.Buffered()
		buf2 := make([]byte, bufLen)
		n, err = buf.Read(buf2)
		require.NoError(t, err)
		logrus.Errorf("Buffered %d bytes: %s", n, buf2[:n])
		logrus.Errorf("Closing")
		err = c.Close()
		require.NoError(t, err)
	}))
	defer s.Close()

	sourceReader, sourceWriter := io.Pipe()
	go func() {
		_, err := sourceWriter.Write([]byte("line 1\n"))
		require.NoError(t, err, "Writing line 1")
		time.Sleep(1 * time.Second)
		_, err = sourceWriter.Write([]byte("line 2\n"))
		require.NoError(t, err, "Writing line 2")
		err = sourceWriter.Close()
		require.NoError(t, err, "Closing writer")
	}()
	defer sourceReader.CloseWithError(errors.New("Test exited"))

	client := &http.Client{
		Transport: tlsclientconfig.NewTransport(),
	}
	req, err := http.NewRequestWithContext(context.Background(), http.MethodPatch, s.URL, sourceReader)
	require.NoError(t, err)
	res, err := client.Do(req)
	require.NoError(t, err)
	defer res.Body.Close()
	body, err := io.ReadAll(res.Body)
	require.NoError(t, err)
	logrus.Errorf("Response status %d, body \n%s\n", res.StatusCode, string(body))
}
% go test -v -v -timeout 600s -run ^TestUseOfClosedNetworkConnection$ ./docker                             
=== RUN   TestUseOfClosedNetworkConnection
time="2023-09-04T22:46:57+02:00" level=error msg="Got 5 bytes: line "
time="2023-09-04T22:46:57+02:00" level=error msg="Buffered 4 bytes: 1\n\r\n"
time="2023-09-04T22:46:57+02:00" level=error msg=Closing
    docker_image_dest_test.go:84: 
        	Error Trace:	…/docker/docker_image_dest_test.go:84
        	Error:      	Received unexpected error:
        	            	Patch "http://127.0.0.1:64956": write tcp 127.0.0.1:64957->127.0.0.1:64956: use of closed network connection
        	Test:       	TestUseOfClosedNetworkConnection
--- FAIL: TestUseOfClosedNetworkConnection (1.00s)
FAIL
FAIL	github.com/containers/image/v5/docker	1.457s
FAIL

though the error text is not exactly the same.

@mtrmac
Collaborator

mtrmac commented Sep 4, 2023

The basic cause is that the Go HTTP/1.1 implementation has separate “read” and “write” loops.

The read loop waits for responses, and that one terminates with “net/http: Transport failed to read from server: EOF” (compare golang/go#53472). It is the read end that is responsible for actually closing the socket. The read end does not know or care whether the write end is still sending the request; a “full duplex” operation, where an (error) response can be sent before the full request has been sent, is apparently an intentionally desired feature.

Meanwhile, the write loop is busy sending request data. And that code is unaware of the read end having closed the socket, hence the error text.
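
For reference, “use of closed network connection” is the message of net.ErrClosed, so code that wants to recognize this situation can check for that sentinel. A minimal sketch (isClosedConn is a hypothetical helper, and whether errors.Is reaches the sentinel depends on how the transport wraps the write error):

import (
	"errors"
	"net"
)

// isClosedConn reports whether err (e.g. the error returned by
// http.Client.Do) ultimately wraps net.ErrClosed, whose message is
// "use of closed network connection".
func isClosedConn(err error) bool {
	return errors.Is(err, net.ErrClosed)
}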

@mtrmac
Collaborator

mtrmac commented Sep 4, 2023

I have filed golang/go#62447 against the upstream Go standard library’s error reporting behavior.

Of course, that’s not directly helpful for making the copy work.

I’m not sure that is reasonably possible; understanding the timing/cause, as mentioned above, would be a prerequisite.

@vinzent
Author

vinzent commented Sep 7, 2023

Related Quay issue: https://issues.redhat.com/browse/RFE-4643

@vinzent
Author

vinzent commented Sep 7, 2023

Hi @mtrmac

It would probably be valuable to gather some more data about the timing/cause. Is it a delay when starting the copy? When finishing? Is it just slow? Is the cause the source or the destination?

The download is delayed by ZScaler at the end. Until that happens, it’s fast. This is what the skopeo copy output for the kibana:6.7.1 image looks like for about 2m40s:

Getting image source signatures
Copying blob f0b755f02a6c [=====================================>] 179.3MiB / 179.3MiB
Copying blob 41d16359541e done
Copying blob 2a0588ea5574 done
Copying blob 9de755da2607 done
Copying blob 5334d2bd41b7 done
Copying blob 8ba884070f61 skipped: already exists
Copying blob fc3de62ddb21 done
Copying blob 64b08a88c3fb done

WRT the workaround, I would recommend using dir: instead of containers-storage:,

Thanks for the hint!

@mtrmac
Collaborator

mtrmac commented Sep 7, 2023

Hum. Does the source actually deliver all data, and block before reporting EOF / success, or is there at least one byte outstanding that we have to block on? If all data is available, we could, in many but not all cases, finish processing the data without waiting for a successful EOF indication (probably defeating the point of the scan — we would consume the data even if the scan later detected an error).

I suppose longer-term we should add an option for downloading to a file locally, and pushing from that file — this is not the only case where significantly different transport speeds between sources and destinations cause trouble, although it’s probably the most notable one so far. That would not be too much code — I’m mostly concerned that we need to understand the root cause enough to rule out some other bug in our code that should be fixed instead.
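
A minimal sketch of that “download to a local file, then push” idea, assuming a hypothetical spoolToTempFile helper (not part of c/image):

import (
	"io"
	"os"
)

// spoolToTempFile copies the whole source stream to a temporary file and
// rewinds it, so a later upload reads from fast local storage instead of
// the slow source. The caller is responsible for closing and removing the
// returned file.
func spoolToTempFile(src io.Reader) (*os.File, int64, error) {
	f, err := os.CreateTemp("", "blob-")
	if err != nil {
		return nil, 0, err
	}
	n, err := io.Copy(f, src)
	if err == nil {
		_, err = f.Seek(0, io.SeekStart)
	}
	if err != nil {
		f.Close()
		os.Remove(f.Name())
		return nil, 0, err
	}
	return f, n, nil
}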
