
Conversation

@jeffra (Collaborator) commented Dec 14, 2020

Supports scaling training up or down across compatible GPU counts. Adds a new 'elasticity' key to the DeepSpeed config JSON. Users specify the maximum train batch size they will accept and the micro batch sizes they can support, and DeepSpeed finds a train batch size that is valid across the largest set of compatible GPU counts (a sketch of this selection logic follows the example config below). The intended consumers of this API and JSON addition are both user training code and the infrastructure scheduler.

    "elasticity": {
        "enabled": true,
        "max_train_batch_size": 2000,
        "micro_batch_sizes": [2,4,6],
        "min_gpus": 1,
        "max_gpus" : 10000,
        "min_time": 20,
        "version": 0.1
    }
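
For illustration, here is a minimal Python sketch of the kind of selection this enables. A valid global train batch size is micro_batch * gradient_accumulation_steps * gpu_count; the sketch enumerates candidates under the user's limit and picks the one compatible with the most GPU counts. The helper name find_elastic_batch_size and the brute-force enumeration are assumptions for this sketch, not the actual DeepSpeed implementation.

    # Sketch only: not DeepSpeed's real algorithm or API.
    def find_elastic_batch_size(max_train_batch_size,
                                micro_batch_sizes,
                                min_gpus,
                                max_gpus):
        # Map each candidate global batch size to the set of GPU counts
        # that can realize it via some (micro batch, grad accum) pair.
        candidates = {}
        for gpus in range(min_gpus, max_gpus + 1):
            for micro in micro_batch_sizes:
                gas = 1  # gradient accumulation steps
                while micro * gas * gpus <= max_train_batch_size:
                    batch = micro * gas * gpus
                    candidates.setdefault(batch, set()).add(gpus)
                    gas += 1
        # Pick the batch size usable with the largest number of GPU
        # counts; break ties in favor of the larger batch size.
        return max(candidates, key=lambda b: (len(candidates[b]), b))

    # Using the example config above:
    print(find_elastic_batch_size(2000, [2, 4, 6], 1, 10000))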

@g-karthik

@jeffra I haven't looked at this closely, but am I right to assume this requires the user to also use the training_data argument of deepspeed.initialize()? Also, how does the infrastructure scheduler tie into this config?
