allow runtime selection of worker classes through training config (fixes #300) (#532)

* Rename worker_env.py to the more general worker_selection.py to accommodate upcoming runtime worker selection work

* allow runtime selection of worker classes through training config

This patch adds a new `worker-classes` section to the training config, which selects the "class" of worker to use (currently GCP spot or GCP standard) by kind, with support for a default. This gives us quite flexible configuration, e.g. using spot instances for translation and standard ones for training. We will also be able to add one or more classes for Snakepit machines when we bring those online. The default value is `{"default": "gcp-spot"}`, which means that we'll use spot machines for everything by default. (We can change the default in `config.prod.yml` if desired.)
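
As a rough illustration of the per-kind lookup with a default (a sketch only, not the code from this patch; the helper name `worker_class_for_kind` is hypothetical), the selection can be thought of as a dictionary lookup that falls back to `default`:

```python
def worker_class_for_kind(worker_classes: dict, kind: str) -> str:
    """Return the worker class configured for `kind`, else the default."""
    # Hypothetical helper. Assumes `worker_classes` always has a "default"
    # key, as in the default value `{"default": "gcp-spot"}`.
    return worker_classes.get(kind, worker_classes["default"])


# Example config: standard instances for teacher training, spot for the rest.
worker_classes = {"default": "gcp-spot", "train-teacher": "gcp-standard"}

print(worker_class_for_kind(worker_classes, "train-teacher"))  # gcp-standard
print(worker_class_for_kind(worker_classes, "bicleaner"))      # gcp-spot
```

Any kind not listed explicitly inherits the `default` class, which is why a config containing only `default` is enough.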

Most of this patch is quite a boring addition of the new `worker_selection` transform to the pipeline kinds. The most notable part otherwise is a fairly big rework of most things to do with workers in `config.yml`:
* The taskgraph-required `workers.aliases` is now a very simple, straightforward list of all available worker types.
* A new `local-worker-aliases` has been introduced. This maps the generic names like `b-largegpu` to concrete worker types `by-worker-class`. This removes the need for `-standard` variants for the generic names.
* This necessitated a new transform function that looks up the concrete worker type in each kind before we hand off to the `task` transforms. (Previously, that transform simply looked up things like `b-largegpu` in the `workers.aliases` block. If we had the ability to feed it `worker-class` information we could have kept all our mappings there - but I couldn't come up with a reasonable way to do this upstream.)
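
The alias resolution described in the bullets above could look roughly like this. It is a simplified sketch under stated assumptions, not the transform added in this patch: `resolve_worker_type` and the in-memory `LOCAL_WORKER_ALIASES` dict are hypothetical, although the alias values are copied from the `local-worker-aliases` block in `config.yml`.

```python
# Hypothetical in-memory copy of two `local-worker-aliases` entries.
LOCAL_WORKER_ALIASES = {
    "b-cpu-xlargedisk": {
        "by-worker-class": {
            "gcp-standard": "b-linux-large-gcp-d2g-1tb-standard",
            "default": "b-linux-large-gcp-d2g-1tb",
        }
    },
    "b-largegpu-xlargedisk": {
        "by-worker-class": {
            "gcp-standard": "b-linux-v100-gpu-4-1tb-standard",
            "default": "b-linux-v100-gpu-4-1tb",
        }
    },
}


def resolve_worker_type(alias: str, worker_class: str, aliases=LOCAL_WORKER_ALIASES) -> str:
    """Map a generic alias such as `b-largegpu-xlargedisk` to a concrete worker type."""
    by_class = aliases[alias]["by-worker-class"]
    # Classes without an explicit entry (e.g. gcp-spot here) use `default`.
    return by_class.get(worker_class, by_class["default"])


print(resolve_worker_type("b-largegpu-xlargedisk", "gcp-standard"))
print(resolve_worker_type("b-largegpu-xlargedisk", "gcp-spot"))
```

The resolved name is one of the entries in `workers.aliases`, so the later `task` transforms only ever see concrete worker types.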
bhearsum authored May 9, 2024
1 parent 4b63598 commit 067ce65
Showing 49 changed files with 229 additions and 105 deletions.
99 changes: 63 additions & 36 deletions taskcluster/config.yml
@@ -60,78 +60,105 @@ valid-stages:

workers:
aliases:
# Use for quick tasks that don't require GPUs, eg: linting, tests
b-cpu:
b-linux-large-gcp-d2g:
provisioner: '{trust-domain}-{level}'
implementation: docker-worker
os: linux
worker-type: 'b-linux-large-gcp-d2g'
# Use for tasks that don't require GPUs, but need lots of disk space
# eg: dataset cleaning & merging
b-cpu-largedisk:
worker-type: '{alias}'
b-linux-large-gcp-d2g-300gb:
provisioner: '{trust-domain}-{level}'
implementation: docker-worker
os: linux
worker-type: 'b-linux-large-gcp-d2g-300gb'
# Use for tasks that don't require GPUs, but need immense amounts of disk space
# eg: alignments
b-cpu-xlargedisk:
worker-type: '{alias}'
b-linux-large-gcp-d2g-1tb:
provisioner: '{trust-domain}-{level}'
implementation: docker-worker
os: linux
worker-type: 'b-linux-large-gcp-d2g-1tb'
# Use for tasks that don't require GPUs, but need immense amounts of disk space
# and higher reliability
b-cpu-xlargedisk-standard:
worker-type: '{alias}'
b-linux-large-gcp-d2g-1tb-standard:
provisioner: '{trust-domain}-{level}'
implementation: docker-worker
os: linux
worker-type: 'b-linux-large-gcp-d2g-1tb-standard'
# Use for quick tasks that need a GPU, eg: evaluate
b-gpu:
worker-type: '{alias}'
b-linux-v100-gpu:
provisioner: '{trust-domain}-{level}'
implementation: generic-worker
os: linux
worker-type: 'b-linux-v100-gpu'
# Use for tasks that need lots of GPU power, but not lots of disk space
# eg: translation & scoring
b-largegpu:
worker-type: '{alias}'
b-linux-v100-gpu-4:
provisioner: '{trust-domain}-{level}'
implementation: generic-worker
os: linux
worker-type: 'b-linux-v100-gpu-4'
# Use for tasks that need lots of GPU power and increased disk space
# eg: bicleaner
b-largegpu-largedisk:
worker-type: '{alias}'
b-linux-v100-gpu-4-300gb:
provisioner: '{trust-domain}-{level}'
implementation: generic-worker
os: linux
worker-type: 'b-linux-v100-gpu-4-300gb'
# Use for tasks that need lots of GPU power and immense amounts of disk space
# eg: training
b-largegpu-xlargedisk:
worker-type: '{alias}'
b-linux-v100-gpu-4-300gb-standard:
provisioner: '{trust-domain}-{level}'
implementation: generic-worker
os: linux
worker-type: 'b-linux-v100-gpu-4-1tb'
# Use for tasks that need lots of GPU power, increased disk space, and higher reliability
b-largegpu-largedisk-standard:
worker-type: '{alias}'
b-linux-v100-gpu-4-1tb:
provisioner: '{trust-domain}-{level}'
implementation: generic-worker
os: linux
worker-type: 'b-linux-v100-gpu-4-300gb-standard'
# Use for tasks that need lots of GPU power, increased disk space, and higher reliability
b-largegpu-xlargedisk-standard:
worker-type: '{alias}'
b-linux-v100-gpu-4-1tb-standard:
provisioner: '{trust-domain}-{level}'
implementation: generic-worker
os: linux
worker-type: 'b-linux-v100-gpu-4-1tb-standard'
worker-type: '{alias}'
images:
provisioner: '{trust-domain}-{level}'
implementation: docker-worker
os: linux
worker-type: '{alias}-gcp'

# Ideally these would be in `workers.aliases` above, but those aliases are
# resolved by Taskgraph, which is unaware of the `worker-class` lookups
# we need to do below.
local-worker-aliases:
# Use for quick tasks that don't require GPUs, eg: linting, tests
b-cpu:
by-worker-class:
gcp-standard: 'b-linux-large-gcp-d2g'
default: 'b-linux-large-gcp-d2g'
b-cpu-largedisk:
by-worker-class:
gcp-standard: 'b-linux-large-gcp-d2g-300gb'
default: 'b-linux-large-gcp-d2g-300gb'
# Use for tasks that don't require GPUs, but need immense amounts of disk space
# eg: alignments
b-cpu-xlargedisk:
by-worker-class:
gcp-standard: 'b-linux-large-gcp-d2g-1tb-standard'
default: 'b-linux-large-gcp-d2g-1tb'
# Use for quick tasks that need a GPU, eg: evaluate
b-gpu:
by-worker-class:
gcp-standard: 'b-linux-v100-gpu'
default: 'b-linux-v100-gpu'
# Use for tasks that need lots of GPU power, but not lots of disk space
# eg: translation & scoring
b-largegpu:
by-worker-class:
gcp-standard: 'b-linux-v100-gpu-4'
default: 'b-linux-v100-gpu-4'
# Use for tasks that need lots of GPU power and increased disk space
# eg: bicleaner
b-largegpu-largedisk:
by-worker-class:
gcp-standard: 'b-linux-v100-gpu-4-300gb-standard'
default: 'b-linux-v100-gpu-4-300gb'
# Use for tasks that need lots of GPU power and immense amounts of disk space
# eg: training
b-largegpu-xlargedisk:
by-worker-class:
gcp-standard: 'b-linux-v100-gpu-4-1tb-standard'
default: 'b-linux-v100-gpu-4-1tb'

# Keys are worker types, and align with the `worker-type` entries in the
# `workers.aliases` above.
worker-configuration:
2 changes: 2 additions & 0 deletions taskcluster/configs/config.ci.yml
@@ -78,3 +78,5 @@ datasets:
target-stage: all
taskcluster:
split-chunks: 2
worker-classes:
default: gcp-spot
13 changes: 13 additions & 0 deletions taskcluster/configs/config.prod.yml
@@ -218,3 +218,16 @@ taskcluster:
# then split into an even number of chunks.
# Adjust depending on the amount of data to translate
split-chunks: 20
# Worker classes by `kind`, and a default for `kinds` not specified.
# Available options are in `taskcluster/translations_taskgraph/actions/train.py`.
# By default we like to use `gcp-spot`, which is the cheapest option. To use
# standard (non-spot) instances for all training tasks you would configure
# as follows:
# worker-classes:
# finetune-student: gcp-standard
# train-backwards: gcp-standard
# train-teacher: gcp-standard
# train-student: gcp-standard
# default: gcp-spot
worker-classes:
default: gcp-spot
1 change: 1 addition & 0 deletions taskcluster/kinds/alignments-backtranslated/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/alignments-original/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/alignments-student/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/analyze-corpus/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- translations_taskgraph.transforms.from_datasets:per_dataset
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/analyze-mono/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- translations_taskgraph.transforms.from_datasets:mono
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/bicleaner-model/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/bicleaner/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- translations_taskgraph.transforms.from_datasets:per_dataset
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/cefilter/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/clean-corpus/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- translations_taskgraph.transforms.from_datasets:per_dataset
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/clean-mono/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- translations_taskgraph.transforms.from_datasets:mono
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/collect-corpus/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.from_deps
- taskgraph.transforms.run:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/collect-mono-src/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.from_deps
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/collect-mono-trg/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.from_deps
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/dataset/kind.yml
@@ -10,6 +10,7 @@ loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.from_datasets:per_dataset
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
2 changes: 1 addition & 1 deletion taskcluster/kinds/evaluate-quantized/kind.yml
@@ -7,7 +7,7 @@ loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.from_datasets:per_dataset
- translations_taskgraph.transforms.worker_env
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
2 changes: 1 addition & 1 deletion taskcluster/kinds/evaluate-teacher-ensemble/kind.yml
@@ -7,7 +7,7 @@ loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.from_datasets:per_dataset
- translations_taskgraph.transforms.worker_env
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.from_deps
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
2 changes: 1 addition & 1 deletion taskcluster/kinds/evaluate/kind.yml
@@ -7,7 +7,7 @@ loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.from_datasets:per_dataset
- translations_taskgraph.transforms.worker_env
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- translations_taskgraph.transforms.cast_to
- taskgraph.transforms.chunking
1 change: 1 addition & 0 deletions taskcluster/kinds/export/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/extract-best/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- translations_taskgraph.transforms.cast_to
- taskgraph.transforms.chunking
2 changes: 1 addition & 1 deletion taskcluster/kinds/finetune-student/kind.yml
@@ -7,7 +7,7 @@ loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.marian_args:transforms
- translations_taskgraph.transforms.worker_env
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/merge-corpus/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- translations_taskgraph.transforms.find_upstreams:by_locales
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/merge-devset/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- translations_taskgraph.transforms.find_upstreams:by_locales
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/merge-mono/kind.yml
@@ -5,6 +5,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- translations_taskgraph.transforms.find_upstreams:mono
- taskgraph.transforms.run:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/merge-translated/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/quantize/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
2 changes: 1 addition & 1 deletion taskcluster/kinds/score/kind.yml
@@ -6,7 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_env
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/shortlist/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms
1 change: 1 addition & 0 deletions taskcluster/kinds/split-corpus/kind.yml
@@ -6,6 +6,7 @@
loader: taskgraph.loader.transform:loader

transforms:
- translations_taskgraph.transforms.worker_selection
- taskgraph.transforms.task_context
- taskgraph.transforms.run:transforms
- translations_taskgraph.transforms.cached_tasks:transforms