RFC: TensorFlow on DirectML #243

wchao1115 · 2020-05-12T06:24:37Z

This RFC will be open for comment until Monday, May 25th, 2020.

Status	Proposed
RFC #	243
Author(s)	Chai Chaoweeraprasit ([email protected]), Justin Stoecker ([email protected]), Adrian Tsai ([email protected]), Patrice Vignola ([email protected])
Sponsor	Penporn Koanantakool ([email protected])
Updated	2020-06-08

Objective

Implement a new TensorFlow device type and a new set of kernels based on DirectML, a hardware-accelerated machine learning library on the DirectX 12 Compute platform. This change broadens the reach of TensorFlow beyond its existing GPU footprint and enables high-performance training and inferencing on Windows devices with any DirectX12-capable GPU.

RFC: TensorFlow on DirectML

googlebot · 2020-05-12T06:24:43Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

googlebot · 2020-05-12T22:33:47Z

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

googlebot · 2020-05-13T06:29:13Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

penpornk · 2020-05-14T16:37:11Z

Thank you for your interest!

For the current TensorFlow stack, we recommend adding the new device and kernels as external modules, connecting to TensorFlow through C APIs. Please see the Modular TensorFlow RFC for more context. Kernels and ops can be registered through the APIs in this ongoing RFC. We will share more updates on device addition APIs next week. In addition, there are a few other uncertainties with this approach:

Modular TF cannot guarantee compatibility with the new TensorFlow runtime (TFRT) (but we will try to maximize compatibility if possible). TFRT will be modular, but it is fundamentally different from the current runtime and many things are still evolving.
The C APIs will initially be subject to change; we cannot yet guarantee API stability.

What is your integration timeline? TensorFlow is transitioning to a new backend based on TFRT and MLIR, which will not be compatible with the current stack. If your goal is to get your integration productized sometime in 2021, we recommend waiting to integrate with TFRT and MLIR to save throw-away efforts.

wchao1115 · 2020-05-14T23:33:05Z

Thank you for your comment @penpornk. Our initial goal is to be compatible with the existing runtime behavior as that's what our audience is expecting. We aim to integrate a new backend to achieve broader hardware and platform reach while maintaining compatibility as a top priority.

We definitely want to learn more about TFRT. Based on the info in the blog, it does look like the refactoring will actually make it easier for us to integrate our device runtime and kernels. So, definitely interested to be in the loop for future development. We have not finalized our initial release timeline, but we have some good progress in our fork to date.

Given the size of the change, I would also like to know your recommendation on how we should think about the code integration strategy. The code change is large but mostly additive.

mihaimaruseac · 2020-05-15T00:32:07Z

@wchao1115 regarding CLA, you will have to sign with all associated email accounts

mihaimaruseac · 2020-05-15T16:07:42Z

Just checked the internal data. Your email at ntdev.microsoft.com has not signed the CLA

penpornk · 2020-05-15T20:33:31Z

I manually checked and don't see the CLA for [email protected]. Did you sign with this address or a different alias?

I noticed that both your personal emails are associated with the same github account. Maybe you can add your corp email as the third one (in Github settings) too. (Or do we still require a separate CLA for a corp email address, @mihaimaruseac ?)

yongtang · 2020-05-15T20:37:45Z

I think an easy way is to modify the commit and reset the author to associate the email that signed CLA instead.

You can use git commit -s --amend --reset-author to reset the last commit's author info.

To modify all 9 commits, you can use:

git rebase -i HEAD~9 -x "git commit --amend --author 'Author Name <[email protected]>' --no-edit"
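As a rough sketch of how the rewrite above could be checked before pushing, the script below builds a throwaway repository, rewrites the author on the last two commits with the same `git rebase -x` pattern, and counts how many commits in that range still carry a different author email. The names, email addresses, and the two-commit range are hypothetical placeholders, not values from this PR, and `GIT_SEQUENCE_EDITOR=true` is used here only so the interactive rebase runs without opening an editor.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical identities for illustration; use the name and email that signed the CLA.
SIGNED_NAME="Author Name"
SIGNED_EMAIL="signed@example.com"
OTHER_EMAIL="unsigned@example.com"

# Create a throwaway repository with three commits authored under the other email.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "$SIGNED_NAME"
git config user.email "$SIGNED_EMAIL"
for i in 1 2 3; do
  echo "$i" > "file$i.txt"
  git add "file$i.txt"
  git commit -q -m "commit $i" --author "$SIGNED_NAME <$OTHER_EMAIL>"
done

# Rewrite the author of the last two commits. GIT_SEQUENCE_EDITOR=true accepts
# the generated todo list unchanged, so the interactive rebase runs unattended.
GIT_SEQUENCE_EDITOR=true git rebase -q -i HEAD~2 \
  -x "git commit -q --amend --author '$SIGNED_NAME <$SIGNED_EMAIL>' --no-edit"

# Count rewritten commits whose author email still differs from the signed one.
remaining="$(git log --format='%ae' HEAD~2..HEAD | grep -vcx "$SIGNED_EMAIL" || true)"
echo "commits in range with a different author email: $remaining"
```

On a real branch, only the final `git log` check would be needed after the rebase; because the rebase rewrites history, the branch then has to be force-pushed for the CLA bot to see the new commits.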

penpornk · 2020-05-15T20:41:59Z

@yongtang That's a great idea. Thank you!

googlebot · 2020-05-16T06:23:24Z

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

penpornk · 2021-05-10T21:35:40Z

@wchao1115 Nice timing! I was planning to update you on this once TF 2.5 is released too. It's the first release that supports PluggableDevices. :)

We're nearing the completion of our work in the 1.15 fork

That's great news! Looking forward to it!

Do you recommend that we look at Intel's fork or just focus on what's already in master, given that a handful of PRs have already been merged? Also, what is the timeline for the pluggable graph change?

Please focus on the master branch. Most implementations are already in there, including the graph optimization C API.

PluggableDevice support in TF 2.5:

StreamExecutor C API: Complete
PluggableDevice: TensorFlow can run full end-to-end model inference on PluggableDevices (and also training, just not updating resources on the device). TensorFlow can't yet override manually placed nodes on 1st-party GPUs to run on PluggableDevices. (Nodes that aren't manually placed by users can use PluggableDevice just fine.)
- Main implementation PR.
Graph optimization C API: Complete

Additional PluggableDevice support in master:

Minor change to graph optimization C API

Upcoming PRs

Let PluggableDevices override user-placed nodes (e.g., "GPU"). This will be an alternative to the original approach here.
Updating grappler plug-in configs warning messages

Upcoming RFCs

RFC: Kernel C API Extension for Variable Ops: To properly support training.
An RFC to fix buffer inplacement issue.
Another RFC in a few weeks (can't say what it's about yet).

We also have a tutorial for PluggableDevice developers under construction here.

Disclaimer: PluggableDevice hasn't really been tested with Windows yet. We would love your help testing it. :)

Regarding the blog post, I'm sorry to say that there is still no update. (It's beyond my control.)

wchao1115 · 2021-05-10T23:38:20Z

@penpornk Thanks for the update. We'll check this out. The tutorial would be awesome. Once it's available, I'll help review it against our work. While I have you here, is there a plan to support an ARM64 build either on Windows or the Mac OS? I think we'll eventually need that on Windows.

penpornk · 2021-05-21T02:47:55Z

is there a plan to support an ARM64 build either on Windows or the Mac OS? I think we'll eventually need that on Windows.

@wchao1115 There is no set plan that we can share yet. But please stay tuned. :)

In case this is useful: We have an aarch64 build config (--config=mkl_aarch64) recently added by @cfRod and @alenik01. This build uses oneDNN with Arm Compute Library support.

penpornk · 2021-06-08T17:02:47Z

Hi @wchao1115,
A friendly reminder that we plan to discuss opportunities for improving TF Windows build infrastructure at the SIG Build meeting today at 2pm PT. If you are interested, please see how to join the call here. :)

wchao1115 · 2021-06-08T18:57:34Z

Thanks @penpornk for the reminder. I indeed plan to attend this call at 2 today. Someone else at MS has routed this meeting to me due to our current work on getting TensorFlow to run on WSL.

wchao1115 · 2021-06-08T23:28:46Z

@penpornk You mentioned toward the end of the SIG call today that there is a conversation on the Windows build topic that I'm part of. That's true, but it seems I do not have permission to post in that group. Is there any step I need to take to request permission?

penpornk · 2021-06-08T23:38:39Z

@wchao1115 I think you'll need to join the group first. If you go to the group page, there should be a button "Join group" after the group name.

wchao1115 · 2021-09-18T20:39:37Z

Just wanted to drop a quick update for the community that the official release of TensorFlow-DirectML 1.15.5 is now available on PyPI. More updates can be found here.

We're moving right along to enable DirectML support for the TF2 code base, with a focus on the TensorFlow pluggable device for DirectML, which runs on the framework's latest version from the official branch on both Windows and WSL.

ematejska · 2022-01-31T18:46:29Z

@penpornk What's the status of this RFC?

bhack · 2022-02-04T13:02:15Z

What is your integration timeline? TensorFlow is transitioning to a new backend based on TFRT and MLIR, which will not be compatible with the current stack. If your goal is to get your integration productized sometime in 2021, we recommend waiting to integrate with TFRT and MLIR to save throw-away efforts.

Can we have a fresh overview of this claim? What is the status of TF/TFRT integration in 2022?

penpornk · 2022-02-04T13:36:00Z

@ematejska Sorry for the late reply! We are taking a slightly different approach than initially proposed in this RFC, i.e., making a TF-DirectML PluggableDevice plug-in instead of directly adding a new device type DmlDevice in the TF source code. So far there is no need for a design review yet and this PR has mostly been used for communication / giving updates.

Can we have a fresh overview of this claim? What is the status of TF/TFRT integration in 2022?

@bhack I believe @wchao1115 and his team are working on a TF-DirectML PluggableDevice plug-in. There is no ~~device API~~ device plug-in C API for TFRT yet.

Update history:
Edited "device API" to "device plug-in C API".

bhack · 2022-02-04T14:10:05Z

Thanks for the update. In TFRT, I see that we already have NVIDIA CUDA and AMD HIP driver wrappers at:

https://github.com/tensorflow/runtime/blob/master/backends/gpu/lib/wrapper/driver_wrapper.cc

See more at:
https://github.com/tensorflow/runtime/blob/master/backends/gpu/README.md

penpornk · 2022-02-04T16:11:05Z

Thanks for the update. In TFRT, I see that we already have NVIDIA CUDA and AMD HIP driver wrappers at:

https://github.com/tensorflow/runtime/blob/master/backends/gpu/lib/wrapper/driver_wrapper.cc

See more at: https://github.com/tensorflow/runtime/blob/master/backends/gpu/README.md

@bhack Oops. Sorry. I meant TFRT doesn't have a device plug-in C API yet. (Fixed my earlier post as well.)

bhack · 2022-02-04T16:22:56Z

@penpornk Thanks. So I suppose that, as with the original point, we don't know anything about a future TFRT device plug-in C API or its compatibility with the PluggableDevice design, and it will probably be necessary to integrate twice. Right?

wchao1115 · 2022-02-04T18:43:51Z

@bhack As @penpornk mentioned here, I just wanted to confirm that we have no plan in place as of now to work on TFRT. Our focus at the moment is on getting a DirectML-based plug-in working with TF2 official builds, both on Linux and on Windows.

@penpornk Given that we've taken your feedback on this RFC and pivoted toward the pluggable device on TF2 now, shall we move ahead and approve this RFC for the record? This is a long road that we started a while back, and we would love to close on it.

bhack · 2022-02-04T19:01:41Z

@wchao1115 Thanks, that is clear. I also see some (related) pending MS PRs in TF.
My point was just to understand whether, when TFRT is ready, it will require a brand new integration effort like this one.

penpornk · 2022-02-07T06:54:46Z

@penpornk Thanks. So I suppose that, as with the original point, we don't know anything about a future TFRT device plug-in C API or its compatibility with the PluggableDevice design, and it will probably be necessary to integrate twice. Right?

@bhack Yes. The future MLIR/TFRT stack will likely require a separate integration. If possible, we would like to make it so that it doesn't require too big of a migration effort, but there is no guarantee.

@penpornk Given that we've taken your feedback on this RFC and pivoted toward the pluggable device on TF2 now, shall we move ahead and approve this RFC for the record? This is a long road that we started a while back, and we would love to close on it.

@wchao1115 Sorry about that! I originally kept the PR open in case we'd need a redesign, but didn't think of closing it once you decided to go with PluggableDevice. Let me double check with the team if we need to update the RFC content to reflect the new development.

ematejska · 2022-02-08T18:43:39Z

@penpornk How about updating the design document with the current state and we can mark as accepted per your recommendation?

penpornk · 2022-02-08T20:08:07Z

@ematejska Sounds great! Thank you, Ewa!
@wchao1115 Could you please help update the RFC file to explicitly mention the plug-in approach?

boyedarat · 2022-02-27T11:02:21Z

@wchao1115 Can we have an update on PluggableDevice? Your last comment was that it was something your team had been working on for TF 2.4. Since the API for PluggableDevice has been official since TF 2.5, is there a preview build of some kind available? I've been waiting for DirectML for TF2 for a very long time.
Thanks.

wchao1115 · 2022-02-27T17:20:35Z

We are working on it. Should be out some time soon.

wchao1115 · 2022-03-04T08:26:00Z

@penpornk Please take a look at the design change update in the latest commit.

penpornk

Thank you very much for the update! And looking forward to the plug-in release! :D

wchao1115 added 2 commits May 11, 2020 23:03

Proposal to implement a new DirectML-based backend for TensorFlow.

852bb1f

Merge pull request #1 from wchao1115/tensorflow_directml_rfc

31edfce

RFC: TensorFlow on DirectML

wchao1115 requested review from ematejska, ewilderj, martinwicke and theadactyl as code owners May 12, 2020 06:24

googlebot added the cla: no label May 12, 2020

googlebot added cla: yes and removed cla: no labels May 12, 2020

Update the RFC link.

bd34ae7

googlebot added cla: no and removed cla: yes labels May 13, 2020

wchao1115 and others added 6 commits May 12, 2020 23:40

Update to trigger new commit email.

e354918

Update

38fb290

Attempt to trigger a new CLA bot run with a commit with proper commit…

451bac1

…'s name and email alias.

Merge pull request #2 from tensorflow/master

e5a72c2

pull from community

Update RFC

a425273

Merge pull request #3 from wchao1115/tfdml_update

5d36ec3

Update RFC

googlebot removed the cla: no label May 16, 2020

wchao1115 added 2 commits March 4, 2022 00:19

Design update for TF2.

a6725df

Update last timestamp and sponsor

fc39873

penpornk approved these changes Mar 4, 2022

Update 20200511-tensorflow-on-directml.md

8203ae8

ematejska approved these changes Mar 4, 2022

ematejska merged commit eff9c82 into tensorflow:master Mar 4, 2022

ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Mar 4, 2022
