Trainingjob-operator is designed for EDL (elastic deep learning) and supports multiple frameworks. It provides automatic fault tolerance and flexible strategies for pod completion and termination, and has been tested with the PaddlePaddle, TensorFlow, and Python frameworks.
Trainingjob-operator can be deployed by compiling it and running the resulting executable:
git clone
cd trainingjob-operator/cmd
go build -o trainingjob-operator
./trainingjob-operator --master ${master_ip}:${port} --v 4 --thread-num 1000 --logtostderr --leader-elect=true --enable-creating-failed=true
kubectl apply -f
kubectl get aitj
kubectl describe aitj paddle-mnist
kubectl delete -f
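The run command above depends on two shell variables, `master_ip` and `port`, that must be set beforehand. A minimal sketch of one way to make that explicit follows. The helper name `operator_cmd` and the default values `127.0.0.1` and `8080` are placeholders chosen for illustration and are not part of the project. The flags are copied unchanged from the command above. The script only prints the command line so it can be checked before it is run.

```shell
#!/bin/sh
set -eu

# Placeholder defaults (assumptions, not project values); override them with
# the address and port of your own Kubernetes API server.
master_ip="${master_ip:-127.0.0.1}"
port="${port:-8080}"

# Hypothetical helper: print the operator command line for a given
# master address ($1) and port ($2). Flags are copied from the command above.
operator_cmd() {
  printf '%s' "./trainingjob-operator --master ${1}:${2} --v 4 --thread-num 1000 --logtostderr --leader-elect=true --enable-creating-failed=true"
}

# Print the fully expanded command for inspection.
operator_cmd "$master_ip" "$port"
echo
```

Setting `master_ip` and `port` in the environment before calling the script replaces the placeholder defaults.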