Restarting a cdn instance causes image pulls to take a long time. #949

Closed
likunbyl opened this issue Dec 19, 2021 · 15 comments
Assignees: jim3ma

@likunbyl
Bug report:

I deployed dragonfly v2.0.2-alpha.2 with Helm chart 0.5.26. The only change I made is that dfdaemon does not discover the schedulers automatically through the manager; instead I specified the domain name of the schedulers.

Then I did some testing. In one test, I restarted a cdn pod; afterwards it sometimes took 1m to pull an image that normally takes only 7s.

From the logs, I noticed that dfdaemon tried to access the previous ip of the restarted cdn pod, not the new one:

{"level":"debug","ts":"2021-12-19 02:28:57.603","caller":"peer/piece_downloader.go:129","msg":"built request url: http://10.218.44.208:8001/download/d68/d68a4b905bb7f6c4aa470407b52f7c1a6287f2e88cf1cbe317acacf2b993c70d?peerId=10.218.44.246-9-d81ac508-845c-49b0-9e63-6085025afe50_CDN"}

In this log, 10.218.44.208 is the ip of the cdn pod before its restart, while 10.218.44.246, which comes from the peerId, is the new ip. In the end it timed out and eventually pulled the image from the backsource:

{"level":"error","ts":"2021-12-19 02:29:50.015","caller":"peer/peertask_base.go:790","msg":"get piece task from peer 10.218.44.246-9-d81ac508-845c-49b0-9e63-6085025afe50_CDN error: get client conn by conn dns:///10.218.44.208:8003: cannot found clientConn associated with node dns:///10.218.44.208:8003 and create client conn failed: context deadline exceeded, code: 4001","peer":"10.218.41.169-26171-c7852c5f-c3cc-4553-ae5d-0d4e56ed00d8","task":"d68a4b905bb7f6c4aa470407b52f7c1a6287f2e88cf1cbe317acacf2b993c70d","component":"streamPeerTask","stacktrace":"d7y.io/dragonfly/v2/client/daemon/peer.(*peerTask).preparePieceTasksByPeer\n\t/go/src/d7y.io/dragonfly/v2/client/daemon/peer/peertask_base.go:790\nd7y.io/dragonfly/v2/client/daemon/peer.(*peerTask).preparePieceTasks\n\t/go/src/d7y.io/dragonfly/v2/client/daemon/peer/peertask_base.go:728\nd7y.io/dragonfly/v2/client/daemon/peer.(*peerTask).pullPiecesFromPeers\n\t/go/src/d7y.io/dragonfly/v2/client/daemon/peer/peertask_base.go:437\nd7y.io/dragonfly/v2/client/daemon/peer.(*peerTask).pullSinglePiece\n\t/go/src/d7y.io/dragonfly/v2/client/daemon/peer/peertask_base.go:379"}
{"level":"info","ts":"2021-12-19 02:29:50.015","caller":"peer/peertask_base.go:585","msg":"start download from source due to base.Code_SchedNeedBackSource","peer":"10.218.41.169-26171-c7852c5f-c3cc-4553-ae5d-0d4e56ed00d8","task":"d68a4b905bb7f6c4aa470407b52f7c1a6287f2e88cf1cbe317acacf2b993c70d","component":"streamPeerTask"}

This happened a few minutes after the cdn pod was restarted; the new ip had already been in the database table for a while:

|  3 | 2021-12-13 15:31:36 | 2021-12-19 01:37:45 |      0 | dragonfly-cdn-2.cdn.dragonfly.svc.cluster.local |      |          | 10.218.44.246 | 8003 |          8001 | active |              1 |

If detailed logs are needed, I will provide them.

Expected behavior:

A cdn pod restart should not affect pull speed; dfdaemon should get the right cdn ip.

How to reproduce it:

  1. Deploy dragonfly v2.0.2-alpha.2 with Helm chart 0.5.26.
  2. Configure dfdaemon not to discover the schedulers automatically through the manager; instead specify the domain name of the schedulers.
  3. Restart a cdn pod.

Environment:

  • Dragonfly version: v2.0.2-alpha.2
  • OS: CentOS Linux 7 (Core)
  • Kernel (e.g. uname -a): 3.10.0-1160.31.1.el7.x86_64
  • Kubernetes version: v1.19.10
@likunbyl (Author)

Could someone look into this issue?

@likunbyl (Author)

I tried v2.0.2-alpha.6 with the same result: it occasionally takes a long time to pull an image.

@likunbyl (Author)

Also tried Helm chart version 0.5.16, same problem. I doubt it can be deployed to a production environment.

@likunbyl (Author)

Can anyone help? I deployed dragonfly v2.0.2-alpha.6 with the Helm chart 0.5.26, default settings, and got the same results.

I restarted 2 cdn instances, and then it took a long time to pull an image.

Four pods ran on four nodes; the times to pull a 143MB image were:
1m41.524286867
1m31.059427422
1m40.727282204
1m21.067155916

Restarting all the schedulers seems to have solved the problem.

@likunbyl (Author)

@jim3ma Could you please take a look at this issue? I plan to deploy to a production environment soon.

@jim3ma (Member) commented Dec 30, 2021

Can you upload the logs of cdn and dfdaemon?

@likunbyl (Author)

jim3ma self-assigned this Dec 30, 2021
@jim3ma (Member) commented Dec 30, 2021

Bug confirmed: the scheduler did not refresh the cdn status when the cdn ip changed, and returned the old ip to peers.
@gaius-qi will fix it.
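
For context on the class of fix described above: a client connection cached for a cdn host kept pointing at the pre-restart IP. Below is a minimal sketch of one general way to handle that, keying the cache by hostname and re-dialing when the advertised address changes. All names are hypothetical; this is an illustration, not the actual Dragonfly patch.

    package cdncache

    import (
        "fmt"
        "sync"

        "google.golang.org/grpc"
    )

    type cachedConn struct {
        addr string
        conn *grpc.ClientConn
    }

    // ConnCache caches one gRPC client connection per cdn hostname.
    type ConnCache struct {
        mu    sync.Mutex
        conns map[string]*cachedConn // keyed by hostname, not by IP
    }

    func NewConnCache() *ConnCache {
        return &ConnCache{conns: make(map[string]*cachedConn)}
    }

    // Get returns a connection for the cdn identified by hostname. If the
    // advertised addr (ip:port) changed since the last registration, the stale
    // connection is closed and a new one is dialed.
    func (c *ConnCache) Get(hostname, addr string) (*grpc.ClientConn, error) {
        c.mu.Lock()
        defer c.mu.Unlock()

        if cc, ok := c.conns[hostname]; ok {
            if cc.addr == addr {
                return cc.conn, nil
            }
            // Address changed after a restart: drop the old connection.
            cc.conn.Close()
            delete(c.conns, hostname)
        }

        conn, err := grpc.Dial(addr, grpc.WithInsecure())
        if err != nil {
            return nil, fmt.Errorf("dial %s (%s): %w", hostname, addr, err)
        }
        c.conns[hostname] = &cachedConn{addr: addr, conn: conn}
        return conn, nil
    }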

@jim3ma (Member) commented Jan 5, 2022

@likunbyl The latest code change fixes this issue; you can give it a try.

@likunbyl (Author) commented Jan 5, 2022

Thank you. I need a docker image to try it out; could it be provided?

@likunbyl (Author) commented Jan 5, 2022

By the way, I have another issue about a preheat enhancement; do you have a plan for it?

@jim3ma (Member) commented Jan 5, 2022

> Thank you. I need a docker image to try it out; could it be provided?

Just run make docker-build in the Dragonfly2 root directory to build all images.

@likunbyl (Author)

I have tried the newest code; when I restarted all three cdn pods, it again took more than 1m to pull an image.

Here are the logs:
core-cdn-2.log
core-cdn-1.log
core-cdn-0.log
core-dfdaemon-8bbps.log

@jim3ma

@jim3ma (Member) commented Jan 12, 2022

I will investigate it. Please give me some time.

@likunbyl (Author)

done, thanks.
