x/build: run DragonflyBSD VMs on GCE? #23060
More background on the dfly users list:
In that thread, @rickard-von-essen says:
/cc @dmitshur
Update: I just ran DragonFly (5.2.2) at home on QEMU/KVM with virtio-scsi and virtio-net, and it works fine. So it should work fine on GCE, of course (which we already heard). At this point I'm thinking we should just do this builder "by hand" for now, with a README file of notes. I'll prepare the image by hand, then shut it down and copy its disk to a GCE image (uploading it as a sparse tarball). We can automate it with expect or whatnot later. Perfect is the enemy of good, etc.
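For reference, the disk-to-GCE-image step described above would look roughly like the sketch below. The disk and bucket names are placeholders; the main requirement on GCE's side is that the uploaded tarball contain a raw disk file literally named disk.raw.

```sh
# Convert the QEMU disk to raw format; GCE imports expect a file named disk.raw.
qemu-img convert -O raw dfly.qcow2 disk.raw

# Pack it as a sparse, gzipped tarball and upload it (bucket name is hypothetical).
tar -czSf dragonfly-image.tar.gz disk.raw
gsutil cp dragonfly-image.tar.gz gs://my-bucket/dragonfly-image.tar.gz

# Create a GCE image from the uploaded tarball.
gcloud compute images create dragonfly-amd64 \
    --source-uri gs://my-bucket/dragonfly-image.tar.gz
```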
I shut down my KVM/QEMU instance, copied its disk to a new GCE image, and created a GCE VM. It kernel panics on boot (over serial) with:
So, uh, not as easy as I'd hoped.
Perhaps if we already have to do the whole double virtualization thing for Solaris (#15581 (comment)) anyway, we could just reuse that mechanism to run Dragonfly in qemu/kvm under GCE.
I've tried working on this earlier this year (back in 2018-02), and had it scripted to make the image automatically, but I had the same issue: it would work on my machines with vanilla QEMU just fine, including with the disk being accessible on DFly through DragonFly's own driver. I've also tried running DragonFly BSD side by side with FreeBSD. Nested virtualisation sounds interesting. Does it require Linux on GCE, or would FreeBSD also work?
@cnst do you have instructions on how you tried DragonFly on GCE?
Change https://golang.org/cl/162959 mentions this issue:
…ting

This adds a linux-amd64 COS builder that should be just like our existing linux-amd64 COS builder except that it's using a forked image that has the VMX license bit enabled for nested virtualization. (GCE appears to be using the license mechanism as some sort of opt-in mechanism for features that aren't yet GA; might go away?)

Once this is in, it won't do any new builds as regular+trybot builders are disabled. But it means I can then use gomote + debugnewvm to work on preparing the other four image types.

Updates golang/go#15581 (solaris)
Updates golang/go#23060 (dragonfly)
Updates golang/go#30262 (riscv)
Updates golang/go#30267 (fuchsia)
Updates golang/go#23824 (android)

Change-Id: Ic55f17eea17908dba7f58618d8cd162a2ed9b015
Reviewed-on: https://go-review.googlesource.com/c/162959
Reviewed-by: Dmitri Shuralyov <[email protected]>
I've tried myself and it seems DragonFly is unable to find the disk.
Change https://golang.org/cl/163057 mentions this issue:
The COS image I'd forked from earlier didn't have CONFIG_KVM or CONFIG_KVM_INTEL enabled in its kernel, so even though I'd enabled the VMX license bit for the VM, the kernel was unable to use it. Now I've instead rebuilt the ChromiumOS "lakitu" board with a modified kernel config: https://cloud.google.com/container-optimized-os/docs/how-to/building-from-open-source

More docs later. Still tinkering. Nothing uses this yet.

Updates golang/go#15581 (solaris)
Updates golang/go#23060 (dragonfly)
Updates golang/go#30262 (riscv)
Updates golang/go#30267 (fuchsia)
Updates golang/go#23824 (android)

Change-Id: Id2839066e67d9ddda939d96c5f4287af3267a769
Reviewed-on: https://go-review.googlesource.com/c/163057
Reviewed-by: Dmitri Shuralyov <[email protected]>
Change https://golang.org/cl/163301 mentions this issue:
…d OS + vmx

This adds scripts to create a new builder host image that acts like Container-Optimized OS (has docker, runs konlet on startup) but with a Debian 9 kernel + userspace that permits KVM for nested virtualization.

Updates golang/go#15581 (solaris)
Updates golang/go#23060 (dragonfly)
Updates golang/go#30262 (riscv)
Updates golang/go#30267 (fuchsia)
Updates golang/go#23824 (android)

Change-Id: Ib1d3a250556703856083c222be2a70c4e8d91884
Reviewed-on: https://go-review.googlesource.com/c/163301
Reviewed-by: Dmitri Shuralyov <[email protected]>
Change https://golang.org/cl/202478 mentions this issue:
…ilder

From golang/go#34958 (comment):

> Go's DragonFly support policy is that we support the latest stable
> release primarily, but also try to keep DragonFly master passing, in
> prep for it to become the latest stable release.
>
> But that does mean we need one more builder at the moment.

Updates golang/go#34958
Updates golang/go#23060

Change-Id: I84be7c64eac593dee2252c397f9529deea13605a
Reviewed-on: https://go-review.googlesource.com/c/build/+/202478
Reviewed-by: Tobias Klauser <[email protected]>
Reviewed-by: Bryan C. Mills <[email protected]>
@tuxillo, looks like no progress on that bug, eh?
Thanks for the reminder, I had kind of forgotten about this one. It's a tough one anyway. I'll check with the team again next week to see if we can do something.
@bradfitz I have some time to work on it again, but my credits expired, and trying to sign up for a new account required some sort of extra verification. Is there a way to get the credits again to work on this? Also, is there any way to reproduce this bug outside of the Google environment? As per my 2018 comments, our driver works just fine in regular KVM using NetBSD's instructions for activating the codepath.
GCP has a Free Tier these days:
There's no way to reproduce it locally. GCP uses KVM but doesn't use QEMU, and its implementation of virtio-scsi etc. isn't open source.
@bradfitz How long does it take to recompile the kernel on this free instance? A few hours? It was already taking too long even on non-micro GCP instances compared to 15-year-old hardware. I think it'd be great if there were a way to reproduce this problem locally, because our virtio-scsi drivers work just fine with anything but the proprietary GCP implementation. Would it be helpful to provide automation for any other cloud provider?
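For context, the "works fine under vanilla QEMU/KVM" baseline mentioned above corresponds to an invocation roughly like the following (the disk image name is a placeholder). It exercises the same virtio-scsi and virtio-net guest drivers, just not Google's proprietary device models:

```sh
qemu-system-x86_64 \
    -enable-kvm -m 2048 -smp 2 \
    -device virtio-scsi-pci,id=scsi0 \
    -drive if=none,id=hd0,file=dfly.img,format=raw \
    -device scsi-hd,bus=scsi0.0,drive=hd0 \
    -netdev user,id=net0 \
    -device virtio-net-pci,netdev=net0 \
    -nographic
```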
@cnst, I didn't imagine you'd be using the f1-micro instance for compilation. I thought you'd use your normal development environment to build, and then use the f1-micro to test-boot the images on GCE until it worked.
@cnst what I did in my tests was to download the latest IMG, null-mount it, build a kernel with modifications, and install it into the mountpoint. Then I used gcloud/gsutil to upload the img and create the disk and the instance. You can retrieve the console output with gcloud, IIRC.
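The boot-and-inspect half of that workflow is roughly the sketch below (image, instance, and zone names are placeholders; the image-creation step is the same tarball upload shown earlier in the thread):

```sh
# Boot the image on GCE and see how far the kernel gets.
gcloud compute instances create dfly-test \
    --zone us-central1-f \
    --image dragonfly-amd64 \
    --machine-type f1-micro

# Pull the serial console output to look for boot messages or panics.
gcloud compute instances get-serial-port-output dfly-test --zone us-central1-f
```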
Did you try adding a static ARP entry after purging the discovered ones? Also, if there is any form of IPv6, how does that act? Have you tried pinging ff02::1? ;) Does putting the iface in promiscuous mode help?
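Concretely, those checks might look roughly like this on the DragonFly guest (the gateway address and MAC below are placeholders):

```sh
# Flush learned ARP entries, then pin a static one for the gateway.
arp -d -a
arp -s 10.128.0.1 42:01:0a:80:00:01

# See whether anything answers IPv6 link-local multicast on the interface.
ping6 ff02::1%vtnet0

# Force promiscuous mode and watch what actually hits the wire.
ifconfig vtnet0 promisc
tcpdump -ni vtnet0
```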
Thanks for the suggestions. Static ARP didn't help before, I hadn't tried IPv6, and tcpdump reported at startup that it could not put the interface in promiscuous mode at all. Oddly, in the hour or so I have left the VM sitting here, it has fixed itself for UDP. This is unfortunate in the sense that I don't know what changed, which won't help the next time I create a VM, but it's working at the moment. I can't see anything different (except obviously the lack of ARP messages and the presence of UDP traffic):
Now that I notice it, the line in the ARP output shows the entry as 'published'. There is no obvious explanation for what changed. The only traffic shown by the background tcpdump between an hour ago and when things were working just now is ARP requests from the router for the VM's IP address, and the VM replying, one round trip per minute like clockwork. I started two more VMs. One was working at boot (first time!). The other came up in the "TCP is fine, UDP is broken" state.
The 'published' flag might indicate ProxyARP, which would be quite interesting. It might be the case in your environment that you are in one VLAN and the gateway is actually in another VLAN. I guess your local box at least is not playing proxy_arp, but the remote one might be.
@rsc
Thanks @paulzhol, I will see what effect that has. I've found that ifdown/ifup/dhclient vtnet0 seems to "correct" the problem, so another option I am trying is just doing that as needed (up to 10 times) before trying to download the buildlet.
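That retry-until-it-works workaround amounts to something like the sketch below (the exact health check and retry count are illustrative; ifconfig down/up stands in for the ifdown/ifup shorthand):

```sh
# Bounce the interface and re-run DHCP until UDP works, up to 10 times.
for i in 1 2 3 4 5 6 7 8 9 10; do
    # A plain (UDP) DNS lookup is a quick proxy for "is the network healthy?".
    host -W 3 google.com && break
    ifconfig vtnet0 down
    ifconfig vtnet0 up
    dhclient vtnet0
done
```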
Disabling TXCSUM did not help, but thanks for the suggestion. I have left it disabled. I just did 10 runs of all.bash.
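For the record, disabling transmit checksum offload is presumably just the usual interface capability flag; the exact invocation isn't in the thread, so this is a sketch:

```sh
# Turn off TX checksum offload on the virtio NIC (could also be baked into rc.conf).
ifconfig vtnet0 -txcsum
```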
Another point to consider:
Change https://go.dev/cl/419083 mentions this issue:
Change https://go.dev/cl/419081 mentions this issue:
Change https://go.dev/cl/419084 mentions this issue:
Glad to see progress on this issue! I see there are problems with UDP and vtnet, but it is not clear to me how to reproduce them. Is there anything we, from DragonFlyBSD, should do or investigate? I've also seen that you created a GCE image for 6.2.2; are you going to follow RELEASE only? And what do we do with the reverse builder?
@tuxillo, would you be willing to review https://go-review.googlesource.com/c/build/+/419081/ to see if it looks like it makes sense? To answer your questions:

When the image boots on GCP (a completely standard build; a VM installed from just the DragonFly install CD should be enough to reproduce it), it just can't do any UDP traffic at all. UDP traffic triggers ARP requests for the gateway instead. So 'host -W 3 google.com' times out, for example, but 'host -T -W 3 google.com' works fine. This is the state after bringing up vtnet0 at boot on something like half the times it boots. I don't understand what could possibly cause that failure mode, honestly. It could be DragonFly or it could be something about the virtio network device on Google Cloud's side.

I used a standard release for reproducibility. Over at https://farmer.golang.org/builders we have a list of the builders for other systems, and we typically have a few different release versions as needed for supportability. The idea is that we'd add a new builder for new releases and retire the old ones. Does that seem like a reasonable plan to you?

We haven't changed over from the reverse builder yet, but once we do I will post here. At that point you can retire the reverse builder, with our gratitude for keeping it running for so long. Thanks!
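The broken-UDP state described above can be probed directly from a shell on the guest; the commands are the ones mentioned in the comment (-T forces the lookup over TCP):

```sh
# Times out in the broken state: the UDP query never gets out,
# and tcpdump shows ARP requests for the gateway instead.
host -W 3 google.com

# Works even in the broken state: the same query over TCP.
host -T -W 3 google.com

# Watch both the ARP traffic and any DNS traffic on the interface.
tcpdump -ni vtnet0 'arp or udp port 53'
```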
@rsc the patch looks good to me and it's far better than what I could provide, which was nothing :-) It also helps me understand the image creation process from your side.
I can see you're using "DHCP mtu 1460" when setting up the vtnet network interface, but I don't know why. We have two DHCP clients in base (dhclient and dhcpcd).

Is there a way I can pick up the already-generated image and boot it myself in GCP so I can try? Or should I generate a new one myself? Also, I'd need the network configuration you used in GCP to get a setup as close as possible to the one you had.
Our release model is very typical: a point release which is the stable version, i.e. RELEASE-6.2, which is then tagged for minor releases (.2, .3, and so on), and this is done twice a year. Then we have our "master" branch, which is what you'd call "tip" I think, but the difference is that most of the DFly developers use this one, so normally it is pretty stable. Ideally, if you don't mind, under amd64 (we only support one arch at the moment) we'd have something like what the freebsd builder has, for example "6_2" and "BE" (bleeding-edge) or tip, whatever you want to call it.
Sure thing, thanks!
Thanks for reporting, created: https://bugs.dragonflybsd.org/issues/3320
I tried that because FreeBSD was setting the smaller MTU as well. Not setting it didn't help.
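For reference, the setting being discussed is presumably the rc.conf line below. GCE's virtual NICs have historically advertised a 1460-byte MTU, so the image pins the interface to match; this is a sketch of the likely configuration, not a quote from the actual image.

```sh
# /etc/rc.conf in the builder image (sketch): DHCP plus an explicit GCE-sized MTU.
ifconfig_vtnet0="DHCP mtu 1460"
```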
Thanks for this tip. I will give dhcpcd a try.
The only problem with bleeding-edge is that it means we have to keep rebuilding the image at regular intervals, which we could do, but it's a bit of a pain. It also means that results change when the builder changes, whereas we try to keep the builder constant and have only our Go tree changing. For comparison, as I understand it we do not have any FreeBSD builder tracking the dev branch, just numbered releases. I will work on getting you precise directions for GCP.
This bug is going to auto-close in a little while, but we still won't have moved off the reverse builder yet. I'll post here when we have.
Now that Dragonfly runs on GCE, we can do that and retire the one very slow reverse builder we are using today.

For golang/go#23060.

Change-Id: I2bd8c8be6735212ba6a8023327864b79dea08cf3
Reviewed-on: https://go-review.googlesource.com/c/build/+/419081
Auto-Submit: Russ Cox <[email protected]>
Run-TryBot: Russ Cox <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
A good compromise perhaps is to rebuild bleeding-edge only when we bump the __DragonFly_version macro (https://github.com/DragonFlyBSD/DragonFlyBSD/blob/master/sys/sys/param.h#L244), which we only do when there are significant changes; you can see the version history in that header file.
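If that policy were adopted, the rebuild trigger could be as small as the sketch below; the state file and the exact awk pattern are made up, and the rebuild step itself is elided.

```sh
# Rebuild the bleeding-edge image only when __DragonFly_version has changed.
current=$(curl -s https://raw.githubusercontent.com/DragonFlyBSD/DragonFlyBSD/master/sys/sys/param.h |
    awk '/#define[[:space:]]+__DragonFly_version/ {print $3; exit}')
last=$(cat last-built-version 2>/dev/null)
if [ "$current" != "$last" ]; then
    echo "__DragonFly_version bumped to $current; rebuilding image"
    # ... rebuild and upload the bleeding-edge builder image here ...
    echo "$current" > last-built-version
fi
```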
Okay, thanks.
Sure, let me know.
Change https://go.dev/cl/420756 mentions this issue:
Two reasons: first, the builder is pinned to 6.2.2. Second, the reverse builder is still dialing in and confusing the coordinator. Make a clean break with the past.

For golang/go#23060.

Change-Id: Ia19cb6ef3fefef323b41c14298ef8dbc90a6e27b
Reviewed-on: https://go-review.googlesource.com/c/build/+/420756
Reviewed-by: Dmitri Shuralyov <[email protected]>
Run-TryBot: Russ Cox <[email protected]>
Auto-Submit: Russ Cox <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
I started the GCE builder, but it has yet to complete a build. After make.bash it is supposed to upload the tree back to the coordinator, and in 5 minutes it was only able to transfer about 130 MB, which turns out not to be the whole thing. Perhaps this is the MTU thing, or perhaps it is something else. I am going to try to reproduce slow network uploads in a simpler context. We may turn the reverse builder back on in the interim. I will keep this issue posted.
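One way to reproduce the slow upload in isolation, without the coordinator involved, might be a raw TCP throughput test between the VM and an outside host; the port, transfer size, and host name below are arbitrary, and the exact netcat flags depend on the variant installed on the receiving side.

```sh
# On the receiving host (outside GCE); some netcat variants need "-l -p 9000".
nc -l 9000 > /dev/null

# On the DragonFly VM: push ~500 MB of zeros and time it.
time sh -c 'dd if=/dev/zero bs=1m count=500 | nc receiver.example.com 9000'
```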
We have our first 'ok' on build.golang.org for dragonfly-amd64-622. We still need to figure out the upload slowness (worked around for now by disabling that upload) and perhaps also the boot-time network issue (which may be related), but it's working, and much more scalable. @tuxillo, please feel free to shut down the reverse builder, and thanks again for keeping it running for so long!
Leaving this issue open for the networking issues.
This should be fixed with https://gitweb.dragonflybsd.org/dragonfly.git/commit/20bf50996e30140ca0d813694090469045bba0c4, for what it's worth. It has also been merged into the DragonFly_RELEASE_6_4 branch.
Looks like Dragonfly now supports virtio:
https://leaf.dragonflybsd.org/cgi/web-man?command=virtio&section=4
So it should run on GCE?
If somebody could prepare a make.bash script that scripts the install and produces bootable images, we could run it on GCE.
See the netbsd, openbsd, and freebsd directories as examples: https://github.com/golang/build/tree/master/env
(The script must run on Linux and use qemu to do the image creation.)
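A very rough skeleton of what such a script might look like, modeled on the existing env/*/make.bash scripts; every file name and flag here is illustrative, and the serial-console install automation (e.g. with expect, as suggested earlier in the thread) is the part left out:

```sh
#!/bin/bash
# Sketch: build a DragonFly GCE builder image under QEMU on a Linux host.
set -e

ISO=dfly-x86_64-latest_REL.iso   # installer ISO (placeholder name)
DISK=disk.raw

# Create an empty raw disk as the install target.
qemu-img create -f raw $DISK 16G

# Boot the installer; the actual install would be driven over the serial console.
qemu-system-x86_64 \
    -enable-kvm -m 2048 \
    -cdrom $ISO \
    -drive if=virtio,file=$DISK,format=raw \
    -net nic,model=virtio -net user \
    -nographic

# Package the installed disk for GCE: a sparse tarball containing disk.raw.
tar -czSf dragonfly-amd64-gce.tar.gz $DISK
```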
/cc @tdfbsd