Restarting a cdn instance causes image pulls to take a long time. #949

Closed
likunbyl opened this issue Dec 19, 2021 · 15 comments
Assignees: jim3ma

@likunbyl
Bug report:

I deployed dragonfly v2.0.2-alpha.2 with Helm chart 0.5.26. The only change I made is that dfdaemon does not discover the schedulers automatically through the manager; instead I specified the domain name of the schedulers.

Then I did some testing. In one test, I restarted a cdn pod; afterwards it sometimes took 1m to pull an image that normally takes only 7s.

From the logs, I noticed that dfdaemon tried to access the previous ip of the restarted cdn pod, not the new one:

{"level":"debug","ts":"2021-12-19 02:28:57.603","caller":"peer/piece_downloader.go:129","msg":"built request url: http://10.218.44.208:8001/download/d68/d68a4b905bb7f6c4aa470407b52f7c1a6287f2e88cf1cbe317acacf2b993c70d?peerId=10.218.44.246-9-d81ac508-845c-49b0-9e63-6085025afe50_CDN"}

In this log, 10.218.44.208 is the ip of the cdn pod before its restart, while 10.218.44.246, which comes from the peerId, is the new ip. In the end it timed out and eventually pulled the image from the backsource:

{"level":"error","ts":"2021-12-19 02:29:50.015","caller":"peer/peertask_base.go:790","msg":"get piece task from peer 10.218.44.246-9-d81ac508-845c-49b0-9e63-6085025afe50_CDN error: get client conn by conn dns:///10.218.44.208:8003: cannot found clientConn associated with node dns:///10.218.44.208:8003 and create client conn failed: context deadline exceeded, code: 4001","peer":"10.218.41.169-26171-c7852c5f-c3cc-4553-ae5d-0d4e56ed00d8","task":"d68a4b905bb7f6c4aa470407b52f7c1a6287f2e88cf1cbe317acacf2b993c70d","component":"streamPeerTask","stacktrace":"d7y.io/dragonfly/v2/client/daemon/peer.(*peerTask).preparePieceTasksByPeer\n\t/go/src/d7y.io/dragonfly/v2/client/daemon/peer/peertask_base.go:790\nd7y.io/dragonfly/v2/client/daemon/peer.(*peerTask).preparePieceTasks\n\t/go/src/d7y.io/dragonfly/v2/client/daemon/peer/peertask_base.go:728\nd7y.io/dragonfly/v2/client/daemon/peer.(*peerTask).pullPiecesFromPeers\n\t/go/src/d7y.io/dragonfly/v2/client/daemon/peer/peertask_base.go:437\nd7y.io/dragonfly/v2/client/daemon/peer.(*peerTask).pullSinglePiece\n\t/go/src/d7y.io/dragonfly/v2/client/daemon/peer/peertask_base.go:379"}
{"level":"info","ts":"2021-12-19 02:29:50.015","caller":"peer/peertask_base.go:585","msg":"start download from source due to base.Code_SchedNeedBackSource","peer":"10.218.41.169-26171-c7852c5f-c3cc-4553-ae5d-0d4e56ed00d8","task":"d68a4b905bb7f6c4aa470407b52f7c1a6287f2e88cf1cbe317acacf2b993c70d","component":"streamPeerTask"}

This happened a few minutes after the cdn pod was restarted; the new ip had already been in the database table for a while:

|  3 | 2021-12-13 15:31:36 | 2021-12-19 01:37:45 |      0 | dragonfly-cdn-2.cdn.dragonfly.svc.cluster.local |      |          | 10.218.44.246 | 8003 |          8001 | active |              1 |

If detailed logs are needed, I will provide them.

Expected behavior:

A cdn pod restart should not affect pull speed; dfdaemon should get the right cdn ip.

How to reproduce it:

  1. Deploy dragonfly v2.0.2-alpha.2 with Helm chart 0.5.26.
  2. Configure dfdaemon not to discover the schedulers automatically through the manager; instead specify the domain name of the schedulers.
  3. Restart a cdn pod.

Environment:

  • Dragonfly version: v2.0.2-alpha.2
  • OS: CentOS Linux 7 (Core)
  • Kernel (e.g. uname -a): 3.10.0-1160.31.1.el7.x86_64
  • Kubernetes version: v1.19.10
@likunbyl (Author)

Could someone look into this issue?

@likunbyl (Author)

I tried v2.0.2-alpha.6 with the same result: it occasionally takes a long time to pull an image.

@likunbyl (Author)

Also tried Helm chart version 0.5.16, same problem. I doubt it can be deployed to a production environment.

@likunbyl (Author)

Can anyone help? I deployed dragonfly v2.0.2-alpha.6 with the Helm chart 0.5.26, default settings, and got the same results.

I restarted 2 cdn instances, and then it took a long time to pull an image.

Four pods ran on four nodes; the times to pull a 143MB image were:
1m41.524286867
1m31.059427422
1m40.727282204
1m21.067155916

Restarting all the schedulers seems to have solved the problem.

@likunbyl (Author)

@jim3ma Could you please take a look at this issue? I plan to deploy to a production environment soon.

@jim3ma (Member) commented Dec 30, 2021

Can you upload the logs of cdn and dfdaemon?

@likunbyl (Author)

jim3ma self-assigned this Dec 30, 2021
@jim3ma (Member) commented Dec 30, 2021

Bug confirmed: the scheduler did not refresh the cdn status when the cdn ip changed, and returned the old ip to peers.
@gaius-qi will fix it.
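
For context on the class of fix described above: a client connection cached for a cdn host kept pointing at the pre-restart IP. Below is a minimal sketch of one general way to handle that, keying the cache by hostname and re-dialing when the advertised address changes. All names are hypothetical; this is an illustration, not the actual Dragonfly patch.

    package cdncache

    import (
        "fmt"
        "sync"

        "google.golang.org/grpc"
    )

    type cachedConn struct {
        addr string
        conn *grpc.ClientConn
    }

    // ConnCache caches one gRPC client connection per cdn hostname.
    type ConnCache struct {
        mu    sync.Mutex
        conns map[string]*cachedConn // keyed by hostname, not by IP
    }

    func NewConnCache() *ConnCache {
        return &ConnCache{conns: make(map[string]*cachedConn)}
    }

    // Get returns a connection for the cdn identified by hostname. If the
    // advertised addr (ip:port) changed since the last registration, the stale
    // connection is closed and a new one is dialed.
    func (c *ConnCache) Get(hostname, addr string) (*grpc.ClientConn, error) {
        c.mu.Lock()
        defer c.mu.Unlock()

        if cc, ok := c.conns[hostname]; ok {
            if cc.addr == addr {
                return cc.conn, nil
            }
            // Address changed after a restart: drop the old connection.
            cc.conn.Close()
            delete(c.conns, hostname)
        }

        conn, err := grpc.Dial(addr, grpc.WithInsecure())
        if err != nil {
            return nil, fmt.Errorf("dial %s (%s): %w", hostname, addr, err)
        }
        c.conns[hostname] = &cachedConn{addr: addr, conn: conn}
        return conn, nil
    }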

@jim3ma (Member) commented Jan 5, 2022

@likunbyl The latest code change fixes this issue; you can give it a try.

@likunbyl (Author) commented Jan 5, 2022

Thank you. I need a docker image to try it out; could it be provided?

@likunbyl (Author) commented Jan 5, 2022

By the way, I have another issue about a preheat enhancement; do you have a plan for it?

@jim3ma (Member) commented Jan 5, 2022

> Thank you. I need a docker image to try it out; could it be provided?

Just run make docker-build in the Dragonfly2 root directory to build all images.

@likunbyl (Author)

I have tried the newest code; when I restarted all three cdn pods, it again took more than 1m to pull an image.

Here are the logs:
core-cdn-2.log
core-cdn-1.log
core-cdn-0.log
core-dfdaemon-8bbps.log

@jim3ma

@jim3ma (Member) commented Jan 12, 2022

I will investigate it. Please give me some time.

@likunbyl (Author)

done, thanks.
