Skip to content

Latest commit

 

History

History
172 lines (135 loc) · 6.66 KB

03-cli.md

File metadata and controls

172 lines (135 loc) · 6.66 KB

AXLearn CLI

Table of Contents

Section Description
Overview Overview of the CLI.
Usage Usage Instructions.

Overview

The AXLearn CLI provides utilities for launching and managing resources across different cloud environments.

Taking inspiration from Python virtual environments, the CLI provides functionality to "activate" different cloud environments, allowing users to manage resources in these environments seamlessly.

As a motivating example, suppose that users have access to two different TPU clusters tpu-cluster1 and tpu-cluster2, possibly across different availability zones.

A common usage pattern may look like:

# Authenticate to the cloud provider.
axlearn gcp auth

# "Activate" `tpu-cluster1` configuration.
axlearn gcp config activate --label=tpu-cluster1

# Launch jobs to `tpu-cluster1`.
axlearn gcp launch ...

# "Activate" `tpu-cluster2` configuration.
axlearn gcp config activate --label=tpu-cluster2

# Launch jobs to `tpu-cluster2` with the same launch command.
axlearn gcp launch ...

Users therefore do not need to worry about the underlying configuration details of each cluster.

While the above example uses the gcp commands, the CLI is agnostic to the underlying cloud provider and can support arbitrary configurations.


Preparing the CLI

To setup the CLI, we'll need to first create a config file under at least one of the following paths:

  • .axlearn/axlearn.default.config in the current working directory, or
  • .axlearn/.axlearn.config in the current working directory, or
  • ~/.axlearn.config in your home directory.

To create the config, you can copy from the template config from the root of the repo:

cp .axlearn/axlearn.default.config .axlearn/.axlearn.config

Tip: In an organization setting, you can directly modify .axlearn/axlearn.default.config, which can be stored in the root of your git repository. This way, developers will automatically pick up the latest CLI configurations on each git pull.

Tip: You can always run axlearn gcp config cleanup to delete all AXLearn config files from your system.

Here's a sample config file for launching v4-tpus in us-central2-b, under the project my-gcp-project. You may recognize it as a toml file:

[gcp."my-gcp-project:us-central2-b"]
# Basic project configs.
project = "my-gcp-project"
zone = "us-central2-b"
network = "projects/my-gcp-project/global/networks/default"
subnetwork = "projects/my-gcp-project/regions/us-central2/subnetworks/default"
# Used when launching VMs and TPUs.
service_account_email = "[email protected]"
# Used for permanent artifacts like checkpoints. Should be writable by users who intend to launch jobs.
permanent_bucket = "public-permanent-us-central2"
# Used for private artifacts, like quota files. Should be readable by users who intend to launch jobs.
private_bucket = "private-permanent-us-central2"
# Used for temporary artifacts, like logs. Should be writable by users who intend to launch jobs.
ttl_bucket = "ttl-30d-us-central2"
# (Optional) Used by the AXLearn CLI.
labels = "v4-tpu"
# (Optional) Used for pushing docker images.
docker_repo = "us-docker.pkg.dev/my-gcp-project/axlearn"
# (Optional) Configure whether to use on-demand or reserved TPUs.
reserved_tpu = true
# (Optional) Configure a default Dockerfile to use when launching jobs with docker.
default_dockerfile = "Dockerfile"
# (Optional) Enable VertexAI Tensorboard support during training.
vertexai_tensorboard = "1231231231231231231"
vertexai_region = "us-central1"

To confirm that the CLI can locate your config file, run:

# Lists all environments that the CLI is aware of.
$ axlearn gcp config list
No GCP project has been activated; please run `axlearn gcp config activate`.
Found default config at /path/to/axlearn/.axlearn/axlearn.default.config
Found user config at /path/to/axlearn/.axlearn/.axlearn.config
[ ] my-gcp-project:us-central2-b [v4-tpu]

As the output indicates, we have not yet activated a project. To do so, run:

# Activate a specific environment.
$ axlearn gcp config activate
...
Setting my-gcp-project:us-central2-b to active.
Configs written to /path/to/axlearn/.axlearn/.axlearn.config

You can also directly target a config by specifying --label:

# Activate the environment with label "v4-tpu".
axlearn gcp config activate --label=v4-tpu

In this case we only have one config, so --label is redundant.


Usage

The CLI is structured as a tree and is intended to be self-documenting. The tree can be traversed simply by invoking the CLI with --help.

For instance, running the root command with --help prints available sub-commands:

$ axlearn --help
usage: axlearn [-h] [--helpfull] {gcp} ...

AXLearn: An Extensible Deep Learning Library.

positional arguments:
  {gcp}

options:
  -h, --help  show this help message and exit
  --helpfull  show full help message and exit

We can traverse the tree by running a subcommand with --help.

$ axlearn gcp --help
usage: axlearn gcp [-h] [--helpfull] [--project PROJECT] [--zone ZONE]
                   {config,sshvm,sshtpu,bundle,launch,tpu,vm,bastion,dataflow,auth} ...

positional arguments:
  {config,sshvm,sshtpu,bundle,launch,tpu,vm,bastion,dataflow,auth}
    config              Configure GCP settings
    sshvm               SSH into a VM
    sshtpu              SSH into a TPU-VM
    bundle              Bundle the local directory
    launch              Launch arbitrary commands on remote compute
    tpu                 Create a TPU-VM and execute the given command on it
    vm                  Create a VM and execute the given command on it
    bastion             Launch jobs through Bastion orchestrator
    dataflow            Run Dataflow jobs locally or on GCP
    auth                Authenticate to GCP

options:
  -h, --help            show this help message and exit
  --helpfull            show full help message and exit
  --project PROJECT
  --zone ZONE

The leaves of the command tree may be implementation dependent. Typically, they correspond to an abseil-py module. For example, axlearn gcp launch simply maps to:

gcp_cmd.add_cmd_from_module(
"launch",
module="axlearn.cloud.gcp.jobs.launch",
help="Launch arbitrary commands on remote compute",
)

In some cases, they can map to shell commands. For example, axlearn gcp auth simply maps to:

auth_command = "gcloud auth login && gcloud auth application-default login"
if docker_repo:
# Note: we currently assume that docker_repo is a GCP one.
auth_command += f" && gcloud auth configure-docker {registry_from_repo(docker_repo)}"
gcp_cmd.add_cmd_from_bash(
"auth",
command=auth_command,
help="Authenticate to GCP",
# Match no flags -- `gcloud auth ...` doesn't support `--project`, `--zone`, etc.
filter_argv="a^",
)

In general, when in doubt, run --help.