Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow multiple firecracker test shards to run concurrently on a single machine #7439

Merged
merged 1 commit into from
Sep 23, 2024

Conversation

bduffany
Copy link
Member

@bduffany bduffany commented Sep 12, 2024

Follow-up from #7429

  • Add optional flock-based IP range locking to prevent concurrent test shards from trying to use the same IP range concurrently.
    • This is useful for firecracker tests, as well as running multiple firecracker executors locally.
  • Don't try to clean up network namespaces in tests, because they might be in use.
    • I don't expect that too many of these will accumulate, but if it does happen, we can find a way to clean up "orphaned" namespaces, maybe by associating them with lockfiles.
  • Don't try to clean up the shared executor root directory in tests either, to prevent converted ext4 image files from disappearing mid-test.
    • Added a TODO to provision these images as data deps of the test instead of converting during the test.
  • Set shard_count on firecracker_test. (This won't actually buy us a ton of concurrency until we disable exclusive task scheduling on the bare pool.)

Related issues: https://github.com/buildbuddy-io/buildbuddy-internal/issues/3758

Comment on lines -38 to -48

// Set up a symlink in PATH so that 'iptables' points to 'iptables-legacy'.
// Our Firecracker setup does not yet have nftables enabled and can't use
// the newer iptables.
iptablesLegacyPath, err := exec.LookPath("iptables-legacy")
require.NoError(t, err)
overrideBinDir := testfs.MakeTempDir(t)
err = os.Symlink(iptablesLegacyPath, filepath.Join(overrideBinDir, "iptables"))
require.NoError(t, err)
err = os.Setenv("PATH", overrideBinDir+":"+os.Getenv("PATH"))
require.NoError(t, err)
Copy link
Member Author

@bduffany bduffany Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got rid of this hack because it broke networking on my local machine (because the firewall rules were only being added to the legacy tables rather than nftables). It seems like this is no longer needed for our networking tests to pass, maybe because we updated our firecracker kernel config.

@bduffany bduffany force-pushed the fc-test-shard branch 3 times, most recently from bc5e8a1 to 3f00ceb Compare September 12, 2024 14:29
return nil, fmt.Errorf("make lock dir %q: %s", *networkLockDir, err)
}
path := filepath.Join(*networkLockDir, fmt.Sprintf("ip_range.%d.lock", netIdx))
f, err := os.Create(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check - if path already exists (because another process already holds the lock), will os.Create return the same file descriptor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'll return a different file descriptor (file descriptors are local to a process) but the returned descriptor will point to the same underlying file (inode), so flock should work here. (I also checked this with some debug logging)

Copy link
Contributor

@maggie-lou maggie-lou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - but maybe check with Iain because I remember he had some thoughts/alternate ideas for other ways to implement this

@bduffany
Copy link
Member Author

bduffany commented Sep 12, 2024

LGTM - but maybe check with Iain because I remember he had some thoughts/alternate ideas for other ways to implement this

Sure - for reference, I dug up the thread: https://buildbuddy-corp.slack.com/archives/C04R9V7H28N/p1697734759336109

A couple of thoughts after reading through it:

  • We talked about tying namespaces to IP addresses by using a naming convention like /var/run/netns/bb-executor-192-168-XXX-XXX, but in the new world of network pooling, IP addresses and net namespaces are no longer 1-1 (because we unassign/reassign IPs when adding to/taking from the pool, but keep the net namespace around). So, I think the problem is now mostly scoped to just locking IP ranges, and net namespaces are not really as relevant.
  • Towards the end of the thread, we had mostly settled on the flock approach, but we talked about possibly using a library for flocking or rolling our own. I think the current implementation of just making a flock() system call is simple enough that pulling in a dependency is not warranted, and it also doesn't seem worth spending time rolling our own. For cross-process coordination that requires more complexity than just acquiring a lock (e.g. storing data), I think sqlite is probably a better solution - e.g. podman uses sqlite to maintain container state without needing a daemon process.

@bduffany bduffany merged commit c31f453 into master Sep 23, 2024
15 checks passed
@bduffany bduffany deleted the fc-test-shard branch September 23, 2024 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants