Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for adding unique ID into volume mount names #62

Closed
Chuxel opened this issue Jul 15, 2022 · 22 comments
Closed

Support for adding unique ID into volume mount names #62

Chuxel opened this issue Jul 15, 2022 · 22 comments
Assignees
Labels
active proposal Still under discussion, collecting feedback
Milestone

Comments

@Chuxel
Copy link
Member

Chuxel commented Jul 15, 2022

This proposal is an enhancement idea to the current dev container features proposal.


This spec proposal is a port of the proposed solution to microsoft/vscode-remote-release#5679 and relates to #2 (comment).

For both Dev Container Features (#61) and normal devcontainer.json scenarios, it is sometimes necessary to have a unique identifier generated that can be referenced in different places.

The most critical example of this is named volumes. When you have multiple dev containers running in the same environment (e.g. locally), the docker-in-docker feature needs to be able to create unique volume names for each to store docker data. Any feature that references a mount point will face the same challenge.

The proposed solution is to introduce a variable like ${devcontainerId} that is a unique identifier for the resulting container. The format of the identifier does not matter as much as the fact it can be broadly used. So generally it should be alpha numeric. Features can then reference this identifier where appropriate (like in mounts).

@alexdima @chrmarti @jkeech @edgonmsft - This has been the most consistent ask from early adopters of dev container features to date. So I think we'll want to cover it in the spec.

@Chuxel Chuxel added the proposal Still under discussion, collecting feedback label Jul 15, 2022
@bamurtaugh bamurtaugh added this to the July 2022 milestone Jul 21, 2022
@bamurtaugh bamurtaugh modified the milestones: July 2022, August 2022 Jul 28, 2022
@stuartleeks
Copy link

Is there any consensus on how the devcontainerId value would be generated?

My initial thought when thinking about this a while back was to use the name from devcontainer.json. The downside is that changing the name would break the volume links, which isn't ideal. Especially as I often start working with a dev container for a while before noticing that I haven't put something meaningful in the name value so end up changing it later to something to help identify the project.

Another idea is to add an id property to devcontainer.json and have the CLI/extension populate it with a new randomly generated value if it doesn't exist. This could then be the value used for devcontainerId and users could still rename the dev container without breaking the volume links in features.

@stuartleeks
Copy link

An alternative to generating the id and modifying a user's devcontainer.json would be to use id if it exists and name (or a derivative of it) as a fallback. In this case, it might help steer into the pit of success if the extension auto-generated a random id value when a user adds a devcontainer.json via the "Remote-Containers: Add Development Container Configuration Files..." command?

@chrmarti
Copy link
Contributor

I would use a similar approach to how we name images which uses vsc-${workspaceFolderBasename}-${workspaceFolderHash}.

To also support the repository in volume case in Remote-Containers (and possibly other cases) we need to base the hash on the id labels we use to identify the dev container (the id label includes the workspace folder path in the local case). It would be nice to have a human-readable part, but that might not be easy to derive in all cases.

So my proposal is to use a sha-256 hash of the id labels (there can be multiple such labels and they all need to match to identify the dev container).

@stuartleeks
Copy link

When I look at a dev container running locally, I have the following labels:

  "com.visualstudio.code.devcontainers.id": "python-3",
  "com.visualstudio.code.devcontainers.release": "v0.231.6",
  "com.visualstudio.code.devcontainers.source": "https://github.com/microsoft/vscode-dev-containers/",
  "com.visualstudio.code.devcontainers.timestamp": "Wed, 13 Apr 2022 00:28:49 GMT",
  "com.visualstudio.code.devcontainers.variant": "3.10-bullseye",
  "desktop.docker.io/wsl-distro": "Ubuntu",
  "devcontainer.local_folder": "\\\\wsl.localhost\\Ubuntu\\home\\stuart\\source\\my-project",
  "version": "0.203.5"

Which of these do you suggest are included in the hash?

My goal is to have an ID that is scoped to an individual dev container and that doesn't change across dev container renames (i.e. changing name in devcontainer.json) or VS Code release versions etc

@chrmarti
Copy link
Contributor

In this case only "devcontainer.local_folder": "\\\\wsl.localhost\\Ubuntu\\home\\stuart\\source\\my-project" would go into the hash.

A container created from a repository cloned to a volume has different id labels, e.g.:

"vsch.local.repository": "https://github.com/chrmarti/vscode-regex.git/tree/main",
"vsch.local.repository.folder": "vscode-regex",
"vsch.local.repository.volume": "vscode-regex-main-29a8aa263ac65c28ae2f5568b6fd157f"

In the code these labels are passed around as idLabels IIRC.

@Chuxel
Copy link
Member Author

Chuxel commented Aug 11, 2022

@chrmarti I assume this would also work if a volume was used for the source code in the Remote - Containers case, correct?

@chrmarti
Copy link
Contributor

@Chuxel Yes, the variations I'm aware of are:

Local folder:

"devcontainer.local_folder": "\\\\wsl.localhost\\Ubuntu\\home\\stuart\\source\\my-project"

"Clone Repository in Container Volume":

"vsch.local.repository": "https://github.com/chrmarti/vscode-regex.git/tree/main",
"vsch.local.repository.folder": "vscode-regex",
"vsch.local.repository.volume": "vscode-regex-main-29a8aa263ac65c28ae2f5568b6fd157f"

When inspecting a volume:

"vsch.local.volume": "myvolume"

Codespaces (here the unique id might not be relevant at the moment):

"Type": "codespaces"

(Using a fixed id label is fine when that uniquely identifies the dev container among other containers on the same Docker host.)

The Dev Container CLI only has the local folder case built-in, in all other cases we pass id labels to the CLI as command line options (--id-label).

@chrmarti
Copy link
Contributor

So we could use a hash of the id labels and then base64 encode that like:

const crypto = require('crypto');

// Array of label=value.
const idLabels = ['"devcontainer.local_folder=C:\\Users\\chrmarti\\repos\\hello'];

const hash = crypto.createHash('sha256')
	.update(JSON.stringify(idLabels.sort())) // sort to avoid order dependency
	.digest('base64url'); // omits padding =

// For a volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed,
// base64 includes "[a-zA-Z0-9+/]".
const uniqueId = 'a' + hash.replace(/\+/g, '_')
	.replace(/\//g, '-');

console.log(uniqueId); // Example prints: adg5DShAK-VwSz7TkEPJqOyzEmKEAJc3EKAsKsiCfsRE

Notes:

  • Is sha256 too long?
    • Do we need a cryptographic hash?
  • JSON.stringify needs to be specified in abstract terms as not including any additional whitespace (to avoid JavaScript specifics in the spec).
  • The result works for volumes, other names might have other limitations on allowed characters.
    • Should compare with, e.g., container and image name limitations.
    • Includes upper- and lowercase characters.

@tschaffter
Copy link

@chrmarti

What do you think about removing the dash from the set of available chars so that the unique ID could be easily selected whenever needed by double clicking on it?

@chrmarti
Copy link
Contributor

@tschaffter We could use dot instead of dash (to still match base64).

@tschaffter
Copy link

tschaffter commented Aug 15, 2022

Thanks for the prompt response! Dot would lead to the same issue, but it's maybe better than using the dash since VS Code uses the dash as separator in the name of Docker images. For example: vsc-my-project-e7f0e6fdd3e51765526df2844ad10816-features-uid. I realize that the double-click feature I mentioned depends on where the string is double clicked. But if VS Code use dash as separator in resource names, maybe it's best to exclude it from the generation of the hash.

@alexdima
Copy link
Member

alexdima commented Aug 17, 2022

Some time ago I also needed to encode a sha256 in a short string (a subdomain name which are under 63 chars). I went with encoding the number in base32 which requires at most 52 chars:

function sha256AsBase32(bytes: ArrayBuffer): string {
	const array = Array.from(new Uint8Array(bytes));
	const hexArray = array.map(b => b.toString(16).padStart(2, '0')).join('');
	// sha256 has 256 bits, so we need at most ceil(lg(2^256-1)/lg(32)) = 52 chars to represent it in base 32
	return BigInt(`0x${hexArray}`).toString(32).padStart(52, '0');
}

Here's a code pointer (uses crypto browser API, but the principle is the same for nodejs): https://github.com/microsoft/vscode/blob/6fa43c8e9dd85f83ec9f3bbd48f0eeb24db3c76b/src/vs/workbench/browser/webview.ts#L5-L26

@chrmarti
Copy link
Contributor

chrmarti commented Sep 1, 2022

@alexdima I like that idea. It uses [0-9a-v] to represent numbers 0-31 and avoids punctuation characters and case-sensitivity.

@chrmarti
Copy link
Contributor

chrmarti commented Sep 5, 2022

Updated code using base32:

const crypto = require('crypto');

function uniqueIdForLabels(idLabels) {
	const stringInput = JSON.stringify(idLabels, Object.keys(idLabels).sort()); // sort properties
	const bufferInput = Buffer.from(stringInput, 'utf-8');

	const hash = crypto.createHash('sha256')
		.update(bufferInput)
		.digest();

	const uniqueId = BigInt(`0x${hash.toString('hex')}`)
		.toString(32)
		.padStart(52, '0');
	return uniqueId;
}

const examples = [
	{
		'devcontainer.local_folder': '\\\\wsl.localhost\\Ubuntu\\home\\stuart\\source\\my-project'
	},
	{
		'vsch.local.repository.volume': 'vscode-regex-main-29a8aa263ac65c28ae2f5568b6fd157f',
		'vsch.local.repository': 'https://github.com/chrmarti/vscode-regex.git/tree/main',
		'vsch.local.repository.folder': 'vscode-regex'
	},
	{
		'vsch.local.volume': 'myvolume'
	},
	{
		'Type': 'codespaces'
	}
];

const uniqueIds = examples.map(uniqueIdForLabels);
console.log(uniqueIds);
/* Prints:
[
  '1rr0ifggvslov34du7rckm9v3v1vq76fgstqpg49k5ettifijujv',
  '0ogk7ui0niiilqvq4n3cstdcmg5fa4fc2ldsetncf60o3jvu33j3',
  '09h2k0dvao1m9l93kqpcs7kenvhoeik9jlgu3q6k7l4k3nold7ce',
  '0hcbhh2c7vldoj773drm1bjldnb89u0rt96sl9nju22d9ou8d14n'
]
*/

@chrmarti
Copy link
Contributor

chrmarti commented Sep 7, 2022

Goal

Allow features to refer to an identifier that is unique to the dev container they are installed into.

E.g., the docker-in-docker feature needs a way to mount a volume per dev container.

Proposal

The identifier will be refered to as ${devcontainerId} in the feature metadata and that will be replaced with the dev container's id. It should only be used in parts of the metadata that is not used for building the image because that would otherwise prevent prebuilding the image at a time when the dev container's id is not known yet.

The identifier is derived from a set of container labels that uniquely identify the dev container on a Docker host. It is up to the implementation to choose these labels. (E.g., a single label with the workspace folder as its value, or a set of labels identifying a Git repository and a volume to clone the Git repository into.)

E.g., the docker-in-docker feature could use the dev container id with:

{
    "id": "docker-in-docker",
    "version": "1.0.4",
    // ...
    "mounts": [
        {
            "source": "dind-var-lib-docker-${devcontainerId}",
            "target": "/var/lib/docker",
            "type": "volume"
        }
    ]
}

Computing the Identfier

  • Input the labels as a JSON object with the object's keys being the label names and the object's values being the labels' values.
    • To ensure all implemenations get to the same result, the object keys must be sorted and any optional whitespace outside of the keys and values must be removed.
  • Compute a SHA-256 hash from the UTF-8 encoded input string.
  • Use a base-32 encoded representation left-padded with '0' to 52 characters as the result.

JavaScript implemenation taking an object with the labels as argument and returning a string as the result:

const crypto = require('crypto');

function uniqueIdForLabels(idLabels) {
	const stringInput = JSON.stringify(idLabels, Object.keys(idLabels).sort()); // sort properties
	const bufferInput = Buffer.from(stringInput, 'utf-8');

	const hash = crypto.createHash('sha256')
		.update(bufferInput)
		.digest();

	const uniqueId = BigInt(`0x${hash.toString('hex')}`)
		.toString(32)
		.padStart(52, '0');
	return uniqueId;
}

@jkeech
Copy link
Contributor

jkeech commented Sep 7, 2022

Overall looks good. This will be a useful feature.

It should only be used in parts of the metadata that is not used for building the image because that would otherwise prevent prebuilding the image at a time when the dev container's id is not known yet.

Can we formalize this to make it a requirement? For instance, we could only support that variable expansion in devcontainer properties that apply to container runtime and not image build time (mounts, runArgs, etc).

The identifier is derived from a set of container labels that uniquely identify the dev container on a Docker host. It is up to the implementation to choose these labels.

To ensure all implemenations get to the same result, the object keys must be sorted and any optional whitespace outside of the keys and values must be removed.

These two things seem to be in opposition to each other. Unless we define the labels that must be used, implementations will use different labels as the input and get different results, even if they follow the same hashing logic. I would suggest that

  • If we care about implementation interoperability, then we must define the set of labels that are used for the ID so that implementations all agree and use the same inputs
  • If we don't care about interoperability, then we also don't need to define the hashing logic and can instead just document that the labels should be hashed to produce a unique identifier for that instance of the devcontainer. We can provide an example hashing algorithm, but it's ultimately an implementation detail of the implementor at that point.

@edgonmsft
Copy link
Contributor

+1 on the idea of defining the set of labels explicitly. Besides that this would be a great addition.

@Chuxel
Copy link
Member Author

Chuxel commented Sep 8, 2022

It should only be used in parts of the metadata that is not used for building the image because that would otherwise prevent prebuilding the image at a time when the dev container's id is not known yet.

Can we formalize this to make it a requirement? For instance, we could only support that variable expansion in devcontainer properties that apply to container runtime and not image build time (mounts, runArgs, etc).

@jkeech @edgonmsft @chrmarti Why put a limit on it? Mounts is the primary use case and we do need this to work from image labels as well so that would apply to the image. Also, if this affects the contents of the image, then the ID would be isolated to the image, which would work great. The main place there's a challenge is runtime params.

@chrmarti
Copy link
Contributor

chrmarti commented Sep 8, 2022

It should only be used in parts of the metadata that is not used for building the image because that would otherwise prevent prebuilding the image at a time when the dev container's id is not known yet.

Can we formalize this to make it a requirement? For instance, we could only support that variable expansion in devcontainer properties that apply to container runtime and not image build time (mounts, runArgs, etc).

Sounds good.

The identifier is derived from a set of container labels that uniquely identify the dev container on a Docker host. It is up to the implementation to choose these labels.

To ensure all implemenations get to the same result, the object keys must be sorted and any optional whitespace outside of the keys and values must be removed.

These two things seem to be in opposition to each other. Unless we define the labels that must be used, implementations will use different labels as the input and get different results, even if they follow the same hashing logic. I would suggest that

  • If we care about implementation interoperability, then we must define the set of labels that are used for the ID so that implementations all agree and use the same inputs
  • If we don't care about interoperability, then we also don't need to define the hashing logic and can instead just document that the labels should be hashed to produce a unique identifier for that instance of the devcontainer. We can provide an example hashing algorithm, but it's ultimately an implementation detail of the implementor at that point.

The labels to identify a dev container seem to depend on how the dev container is created (in its context, e.g., a Codespace implies a single dev container, so the label can be constant, see #62 (comment)). Other implementations might come up with additional ways to create dev containers, so we probably don't want to limit the allowed labels.

We could specify that implementations can use the devcontainer.local_folder label for interoperability when a dev container is based on a local folder and that they can choose other labels as desired.

@jkeech
Copy link
Contributor

jkeech commented Sep 8, 2022

@jkeech @edgonmsft @chrmarti Why put a limit on it? Mounts is the primary use case and we do need this to work from image labels as well so that would apply to the image.

@Chuxel, what @chrmarti and I are suggesting is that the devcontainer ID is tied to a single instance of the devcontainer. You can have multiple instances of the same image running on the same machine, so if the ID is evaluated in an image, it's no longer unique. Therefore, the ID can only be evaluated at the time when a devcontainer is spun up, and it only applies to properties at that point which are tied to the container rather than the image. That's why we should restrict the property in the spec to only be allowed in runtime properties like mounts.

Even though the ID is evaluated at runtime does not mean that the ID variable cannot be present in image metadata. It would just remain as ${devcontainerId} in the image label (for the allowed properties like mounts) and then be replaced/evaluated by the CLI at runtime when spinning up the containers.

@chrmarti
Copy link
Contributor

Posted the proposal with the discussed clarifications in PR #96 for review.

@Chuxel
Copy link
Member Author

Chuxel commented Oct 24, 2022

//cc @bamurtaugh FYI - Another for the list of items to add to the reference over the next week or so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
active proposal Still under discussion, collecting feedback
Projects
None yet
Development

No branches or pull requests

8 participants