
RFC: Modular TensorFlow #77

Merged
merged 9 commits into from
Nov 26, 2019
Conversation

gunan
Contributor

@gunan gunan commented Mar 12, 2019

Comment period is open until 2019-04-04

Modular TensorFlow

| Status    | Accepted                          |
| --------- | --------------------------------- |
| Author(s) | Gunhan Gulsoy ([email protected])   |
| Sponsor   | Martin Wicke ([email protected])     |
| Updated   | 2019-03-06                        |

Objective: Propose an architecture where smaller, more focused TensorFlow modules can be created, maintained and released separately.

gunan and others added 4 commits March 5, 2019 21:56
Here are my proposed changes, to streamline the introduction and correct a few typos.
Update 20190305-modular-tensorflow.md

The ML ecosystem is constantly expanding. New hardware for accelerating ML applications is being worked on by many teams inside and outside Google. As the most popular machine learning framework, TF is expected to add support for many of these hardware platforms as quickly as possible.

Currently, this means that all such hardware developers need to check their code into the [main tensorflow repository](https://github.com/tensorflow/tensorflow), which means all changes are required to go through the TF team's review. This can make merging support for new hardware very difficult.

... or maintain a fork of TF, right?

Contributor
@byronyi byronyi Mar 13, 2019

Going the kernel route without GPL enforcement ends you up with Caffe; as the saying goes, “there are a thousand versions of Caffe in a thousand contributors’ eyes”.


1. These cannot export any symbols into global namespace to avoid symbol collisions.
1. They need to communicate to TF through any provided C APIs.
1. They can link anything they like, anyway they like in addition to TF.


How would TF core APIs be called from within a plugin? e.g. If an MKL plugin signals that it can handle a usage, but as part of that usage, the MKL plugin needs to call the eigen implementation in TF core, how would that handoff be managed?

Contributor Author

This is an interesting use case. I can think of two potential solutions:

1. The MKL plugin can use the global kernel registry to fetch the Eigen kernel and execute it.
2. The MKL plugin can rebuild all the Eigen kernels and distribute them as well.

We can brainstorm more on this issue; I think it is certainly worth considering.
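To make option 1 a bit more concrete, here is a minimal sketch of what delegating to the default kernel through a global registry could look like. The `TF_LookupDefaultKernel`, `TF_RunKernel`, `TF_KernelHandle`, and `InputsAreMklFriendly` names are hypothetical placeholders for an API this RFC would still need to define:

```cpp
// Hypothetical sketch of option 1: inside the MKL plugin's compute path,
// fall back to the stock (Eigen) kernel fetched from a global registry.
void MklMatMulCompute(TF_OpKernelContext* ctx, TF_Status* status) {
  if (!InputsAreMklFriendly(ctx)) {
    // Placeholder registry-lookup API; not part of TF today.
    TF_KernelHandle* eigen_kernel =
        TF_LookupDefaultKernel("MatMul", /*device=*/"CPU", status);
    if (TF_GetCode(status) == TF_OK) {
      TF_RunKernel(eigen_kernel, ctx, status);  // delegate to the Eigen kernel
    }
    return;
  }
  // ... otherwise run the MKL-DNN implementation ...
}
```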


A global kernel registry sounds interesting. @agramesh1, @nhasabni do you see performance implications here?


* Core TF functionality will be implemented in C++
* Core TF functionality can be extended using shared objects.
* On top of the core C++ libraries, we will have the language bindings (Using the C API)
Contributor

As discussed in this developer thread, it is still not very clear to me how protobufs will be exposed by the C API.

For example, the language bindings will need to retrieve the list of available operations, which is currently represented by a list of OpDef described in op_def.proto. How will that information cross the C API?

  1. by returning the protobufs as a series of bytes that should be deserialized by the bindings, like in TF_GetAllRegisteredKernels?
  2. by exposing a snapshot of the protobufs as a text file that should be parsed by bindings, like in ops.pbtxt?
  3. by wrapping protobuf messages in a C structure exposed in the API?

In other words, can a .proto file of TensorFlow be part of the contract with its modules or should they be used only internally by the core?

Contributor Author

Our solution to this will be 3.
We would like to remove all protobufs from our APIs.
We will push to use protobufs only as serialization formats, for RPCs or for data we store on disk.
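As an illustration of option 3, the C API could hand bindings a plain struct view of registered ops instead of a serialized OpDef proto. Everything below (struct and function names alike) is a hypothetical sketch, not an existing or planned API surface:

```cpp
// Hypothetical: a protobuf-free C view of op metadata at the API boundary.
typedef struct TF_OpDefinition {
  const char* name;                 // e.g. "MatMul"
  int num_inputs;
  int num_outputs;
  const char* const* input_names;   // array of length num_inputs
  const char* const* output_names;  // array of length num_outputs
} TF_OpDefinition;

// Hypothetical accessors: language bindings enumerate registered ops
// without ever linking against protobuf.
extern "C" int TF_GetRegisteredOpCount(TF_Status* status);
extern "C" const TF_OpDefinition* TF_GetRegisteredOp(int index,
                                                     TF_Status* status);
```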


@gunan It is not clear how to add new operators in a module plugin. Currently, we can create new operators through OpDef and optionally set their visibility to hidden. We have been doing this for the MKL-DNN quantization work.

Contributor Author

The idea is to use a mechanism similar to custom ops, but simplified. With a C ABI, we will be able to make custom ops much easier to build and distribute.
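For reference, this is roughly what the current C++ custom-op path looks like; the proposal would keep this flow but move the surface to a C ABI so a plugin .so can be built and shipped independently of how TF core was compiled. The op and class names below are purely illustrative:

```cpp
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

// Illustrative custom op, registered from a separately built shared object.
REGISTER_OP("PluginIdentity")
    .Input("x: float")
    .Output("y: float");

class PluginIdentityOp : public OpKernel {
 public:
  explicit PluginIdentityOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
  void Compute(OpKernelContext* ctx) override {
    // Pass the input tensor through unchanged.
    ctx->set_output(0, ctx->input(0));
  }
};

REGISTER_KERNEL_BUILDER(Name("PluginIdentity").Device(DEVICE_CPU),
                        PluginIdentityOp);
```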

Contributor

Quick question, how will the TensorFlow libraries be published to be used by language bindings?

Right now, pretty much all bindings have their own distribution (pip package, Maven artifacts...) that bundles both the client and the TF core libraries. But with modular TF, will the core libraries be built, tested and published by Google so that clients just need to pick them up for distribution?


**Modules:** These are the components of core TF library that will "accept" plugins to expand their capabilities. Examples of modules are networking, filesystem, (graph) optimizers.

**Plugins:** Plugins are extensions to different modules. For example, the filesystem module can have a GCP plugin, an S3 plugin, and an HDFS plugin.


This is defining an SDK. When can we get a drop? That will generate lots of feedback/questions...

Contributor Author

@sjamesr is working on this. We are trying to build the API based on OpKernel and OpKernelContext classes.

Contributor Author

A further clarification.
The SDK will start as what it is today.
We will try to make all the kernel code work as is. Therefore, you can think of the current TF C++ headers as v0 of what is to come for TF C++ APIs.



1. These cannot export any symbols into global namespace to avoid symbol collisions.
1. They need to communicate to TF through any provided C APIs.


Thread-pool management? How do we coordinate between TF Core and plugins so as to not over/undersubscribe?

Contributor Author

CC @sjamesr and @m3bm3b to comment on this.


Our expectation for the moment is that threadpool management will be much as it is today. That is, plugins will typically use a threadpool provided to them (such as via OpKernelContext); the key difference would be that the interface would be provided as part of a stable ABI, so that it can be used across the boundaries of shared libraries. Thus, we don't expect that interactions will be substantially better or worse than today.

We cannot prevent the author of a plugin from creating a threadpool local to that plugin, but, as you are perhaps suggesting, doing so risks making interactions with other threadpools within the address space less predictable. That might well make such a plugin less desirable to clients.
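Concretely, the path described above already exists in the C++ kernel API. Here is a rough sketch of a plugin kernel reusing the pool TF provides rather than creating its own; the class name is illustrative, and the ABI-stable form of this interface is still to be designed:

```cpp
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/lib/core/threadpool.h"

using namespace tensorflow;

// Sketch: a plugin kernel schedules work on the threadpool TF already hands
// it through OpKernelContext, instead of owning threads of its own.
class MyPluginOp : public OpKernel {  // illustrative class name
 public:
  explicit MyPluginOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
  void Compute(OpKernelContext* ctx) override {
    thread::ThreadPool* pool =
        ctx->device()->tensorflow_cpu_worker_threads()->workers;
    pool->Schedule([] {
      // ... one shard of the plugin's computation runs on TF's pool ...
    });
  }
};
```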


Following on from @claynerobison's comment: currently we use different default threadpool settings and also set some MKL-specific settings when MKL is enabled (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/threadpool_device.cc). Would there be a way to do this from the plugins?

#### Plugins

Plugins need to include an implementation of the interfaces declared by one module. If the module interface requires Init and Compute methods, the plugin will need to implement these two functions, plus a TF_InitPlugin function that will be called at load time. This function will also need to register the plugin as prescribed by the module.
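Below is a minimal sketch of that shape, where everything except the `TF_InitPlugin` entry point is a made-up placeholder; the real per-module registration interfaces would each be defined in their own RFC:

```cpp
// Hypothetical module interface: a table of functions the module expects
// every plugin to provide.
typedef struct TF_ExamplePluginOps {
  void (*Init)(TF_Status* status);
  void (*Compute)(TF_OpKernelContext* ctx, TF_Status* status);
} TF_ExamplePluginOps;

static void MyInit(TF_Status* status) { /* set up plugin state */ }
static void MyCompute(TF_OpKernelContext* ctx, TF_Status* status) {
  /* do the actual work */
}

// Called once by TF core when the shared object is loaded; registers the
// plugin with its module and returns. TF_RegisterExamplePlugin is
// hypothetical.
extern "C" void TF_InitPlugin() {
  static TF_ExamplePluginOps ops = {&MyInit, &MyCompute};
  TF_RegisterExamplePlugin("my_plugin", &ops);
}
```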


SDK/API architecture has the potential to influence the performance of plugins for better or worse. If an API design favors one plugin and hurts another, what is the plan to resolve the conflict?

Contributor Author

This is a great question. From our side, we will try to be as neutral as possible. The initial APIs will likely be based on the current C++ APIs TF has, but the API designs will be individually reviewed through different RFCs. We hope to evaluate all concerns during these reviews. If we make mistakes that favor one plugin over another, we hope to address them through the addition of new symbols, or through complete redesigns in the next major version releases.

* Each package will set version boundaries for each of its dependencies.
* Each package is responsible for ensuring that all of its public APIs keep working without any changes until the next major release.
* Packages do not need to modify their minimum version requirements unless they start using newly introduced public API symbols.
* TF metapackage releases may choose to hold back individual packages in favor of faster releases, but dependency requirements have to be respected when doing so.


What validation will the TF Core team do on plugins to ensure interoperability?

Contributor Author

For a select number of plugins, we will run continuous integration tests. For filesystems, these will be as simple as reading and writing files. For ops and kernels, we will run convergence and performance testing on select models.


Do you know what the "select number of plugins" includes?

Contributor Author

At the moment, "select number of plugins" will be the "features currently officially supported by TF", such as CUDA, cloud file systems, GRPC based networking.
This list may grow or shrink over time.

* tf-gpu: With GPU only, for different operating systems.
* tf-estimator: Only for different python versions

When testing a package that has dependencies, such as tf-estimator or tf-gpu, tensorflow-base will be installed with its latest stable release, to avoid any flakiness introduced by this package.


How will CI be done? When someone submits a PR to TF Core, can plugins trigger CI to ensure that the PR doesn't cause regressions in the plugin?

Contributor

Each plugin shall manage the TF version it depends on by itself.

Contributor Author
@gunan gunan Mar 20, 2019

EDIT: replied to the wrong comment. Moved the previous comment here to the docs question below.

Contributor Author

No, our goal is for each repository to run only its own unit tests for PRs. Currently, TF's situation is becoming unmanageable, and it is a big developer concern on the TF side.
Plugins will rely on TF core keeping its API syntactically and semantically compatible, and then they will run their own tests.
If the integration tests run after PRs are merged uncover backwards-incompatible changes in TF or in other dependencies in the ecosystem, those changes will be rolled back.

* tf-gpu: With GPU only, for different operating systems.
* tf-estimator: Only for different python versions

When testing a package that has dependencies, such as tf-estimator or tf-gpu, tensorflow-base will be installed with its latest stable release, to avoid any flakiness introduced by this package.


And vice versa: when someone submits a PR to a plugin, how does TF Core prevent regressions?

Contributor Author

If such regressions can be detected with unit tests, those would be the detectors at presubmit time. If not, we will run these after merging changes. Most of our regression tests run after merging PRs right now.


* tensorflow-base: Operating systems, compiler versions and python versions only with CPU
* tf-gpu: With GPU only, for different operating systems.
* tf-estimator: Only for different python versions


Who will own publication of TensorFlow + MKL binaries?

  • anaconda packages
  • pypi packages
  • container images

Contributor Author

Describing just the current state (as official builds do not have MKL enabled yet): the TF team releases the main TF package, and Intel releases the tf-mkl plugin. Again, this reflects only the current state and does not take into account any potential changes to MKL support in TF; we can evaluate those changes separately. To clarify the three points you raised under the scenario described above:

  • Anaconda packages: the TF team does not own any Anaconda packages. We are evaluating Anaconda separately, so it is an orthogonal issue.
  • PyPI packages: the tensorflow PyPI package is owned by the TF team, and tensorflow-mkl is a plugin owned by Intel.
  • Container images: same as PyPI packages, but Intel has the option to extend the TF container images with just a pip install tensorflow-mkl command to build the MKL images on top of the TF images.



When this plugin is loaded by TF at runtime, the `TF_InitPlugin` method will be called. This method will register the plugin as prescribed by the module, and exit.


How will conflicts and priority be decided between plugins (e.g. CPU and GPU)?

Contributor Author

We will try to avoid conflicts.
For ops, we have this rough idea: each op plugin is required to use a separate "namespace" assigned to it for creating new ops. But we will have to improve on this idea; a toy illustration follows below.
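As a toy illustration of the namespace idea (the prefix convention here is hypothetical, not something the RFC has settled on):

```cpp
// Hypothetical: each op plugin may only register ops under its assigned
// prefix, so two plugins cannot collide on an op name.
REGISTER_OP("MklPlugin_FusedMatMul")
    .Input("a: float")
    .Input("b: float")
    .Output("product: float");
```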


Clarification: if two plugins register the ability to handle the same op (or whatever granularity of operation), how will priority be decided between them?

Contributor Author

In the current implementation of the kernel registry, you can add extra labels to each kernel being registered, and use those to select which kernels you would like to pick:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/kernel_def.proto#L35
Our initial idea is to leverage this, but @sjamesr may have other ideas.
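For reference, the label mechanism looks like this in today's C++ API; which kernel a node gets is then controlled by matching the node's `_kernel` attribute against the registered label. The kernel class names and the label string below are illustrative:

```cpp
// Two kernels for the same op and device, distinguished only by label.
// The unlabeled kernel is the default; a node has to ask for "mkl_plugin"
// explicitly to get the plugin's kernel.
REGISTER_KERNEL_BUILDER(Name("MatMul").Device(DEVICE_CPU),
                        EigenMatMulOp);   // default (Eigen) kernel
REGISTER_KERNEL_BUILDER(Name("MatMul").Device(DEVICE_CPU).Label("mkl_plugin"),
                        MklMatMulOp);     // plugin-provided kernel
```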


#### Scenario 3: C and B depend on different minimum versions of A

As both C and B have to define a range of versions they require from A, the max version should satisfy both constraints.


How do plugins catch these dependency chain conflicts with sufficient time to resolve them before a release cycle?

Contributor Author

What kind of conflicts did you have in mind?
As described here, when we set the maximum required version to the next major release, plugins can release independently, and because each plugin releases independently, pip should be able to take care of satisfying the constraints on 'A'. With the maximum required version of A the same for all plugins, the required version intervals for 'A' are guaranteed to have an intersection. For example, if B requires A>=1.4,<3 and C requires A>=1.6,<3, any A release in [1.6, 3) satisfies both.


This is a pip issue. To help diagnose the problem, B and C should print that A is missing and that the user needs to install it to use B or C.



Who owns documentation for plugins? Is there a plan for a shared site for plugin contributions?

Contributor

Each plugin resides in its own repository, and so does its documentation site (separate from the TF core docs).

Contributor Author

CC @lamberta
It is actually a mix. There will be some plugins, like estimator, that we will declare part of the core TF APIs and document as such. For some other plugins, what @byronyi said will apply.


1. These cannot export any symbols into global namespace to avoid symbol collisions.
1. They need to communicate to TF through any provided C APIs.
1. They can link anything they like, anyway they like in addition to TF.


For MKL, right now we depend on many cc_library targets from core in the bazel build files. With a separate MKL plugin, would we be able to do the same? Or will the core build files somehow expose the libraries to depend on?

Contributor Author

We are currently working to create a C API for all libraries accessible through the OpKernel and OpKernelContext classes. We are hoping that will cover all the required libraries for kernels. We can discuss whether MKL kernels need more symbols.
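For context, an early version of such a C API already exists in `tensorflow/c/kernels.h`; the sketch below assumes roughly that header's shape, so treat the exact signatures as approximate rather than authoritative:

```cpp
#include "tensorflow/c/c_api.h"
#include "tensorflow/c/kernels.h"

// Sketch of a kernel written only against the C kernel API.
static void* MyCreate(TF_OpKernelConstruction* ctx) { return nullptr; }
static void MyDelete(void* kernel) {}

static void MyCompute(void* kernel, TF_OpKernelContext* ctx) {
  TF_Status* s = TF_NewStatus();
  TF_Tensor* input = nullptr;
  TF_GetInput(ctx, 0, &input, s);   // fetch input 0 (caller owns the tensor)
  if (TF_GetCode(s) == TF_OK) {
    TF_SetOutput(ctx, 0, input, s); // pass it straight through as output 0
  }
  if (input != nullptr) TF_DeleteTensor(input);
  TF_DeleteStatus(s);
}

void RegisterPluginIdentityKernel() {
  TF_Status* s = TF_NewStatus();
  TF_KernelBuilder* builder = TF_NewKernelBuilder(
      "PluginIdentity", "CPU", &MyCreate, &MyCompute, &MyDelete);
  TF_RegisterKernelBuilder("PluginIdentityCPU", builder, s);
  TF_DeleteStatus(s);
}
```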

#### Long build times

Many volunteers and developers outside Google use their laptops for development. On such systems, TF development cycles require building all of TensorFlow, which takes around 2 hours. While there is an argument to be made that bazel caching should ensure they build all of TF only once, without forge, bazel caching does not work as [well as expected](https://docs.bazel.build/versions/master/remote-caching.html#known-issues).


With the new design, is using bazel still a must for building the separate plugins (like MKL), or can other build tools (like cmake) be used?

Contributor Author

Plugins that are not in the main tensorflow repository are free to use any build system they would like.

1. These cannot export any symbols into global namespace to avoid symbol collisions.
1. They need to communicate to TF through any provided C APIs.
1. They can link anything they like, anyway they like in addition to TF.
1. They can be built and distributed separately


Can compilers other than GCC be used to build plugins (e.g., the Intel ICC compiler)?

Contributor Author

Yes. But when using different compilers, it becomes very important to only use C symbols at the ABI level.


#### Flexibility for collaborators

Currently, any partner or contributor who would like to work with us is subject to all the rules of the main repository. Some of these can be relaxed through modularization, where work can happen in a separate repository.


Where do we plan on hosting the split-up modular repos?

Contributor Author

It will be up to the owners of each split repository.
They can be under the GitHub tensorflow org, they can be closed source, they can be on GitHub under a different org, or they can be hosted on different websites.


More can be added to this list where we have to rebuild TensorFlow to support different network architectures, CUDA versions, SIMD instruction sets, etc.

Having a monolithic repository means we need to rebuild all of our code for all of these different combinations. However, it makes no sense to rebuild all of our C++ code if the only difference is the Python version. Or rebuild all of our CPU kernels for different CUDA versions. Modularizing our code means we only need to rebuild and test the modules that are directly impacted by the dimensions we are changing in the support matrix.


Can you expand a little more on the PR process envisioned with this change?
Currently Intel pushes changes to Google, and Google owns the process of verifying:

  1. coding standards
  2. API compatibility
  3. backwards compatibility
  4. unit and regression tests
  5. others

For future PRs to the Intel MKL repo, will Intel police its own PRs into this separate module repo?

Contributor Author

If Intel would like to own the MKL repository, then yes, Intel will have full control over all of these. You will have to provide backwards compatibility for your public APIs.
Each repository owner will have full control over their repository.
For regression testing, we will work together to make sure the continuous integration of the base TF packages and all plugins continues to work.

* Architecture (x86, arm, ppc64, …)
* Accelerator (CPU, GPU, TPU)
* Compiler (GCC, Clang, MSVC)
* Python version (2, 3.4, 3.5, 3.6, 3.7)
@nammbash nammbash Mar 21, 2019

After the modularized approach (let's say for the Intel TF-MKL module):

  1. Who will be running the tests for this support matrix, or do we care only about the Linux environment?
  2. Who kicks off the CI process to test MKL, Intel or Google?

Contributor Author

For the MKL module, assuming Intel takes full ownership of it (which we can discuss), both will be Intel's decision. If Intel only wants to support Linux, sure. If Intel wants to support all operating systems, sure.
TF and MKL presubmits will be independent, and their CIs will be owned by the respective repository owners. The integration tests will be a collaboration.

* Compiler (GCC, Clang, MSVC)
* Python version (2, 3.4, 3.5, 3.6, 3.7)

More can be added to this list where we have to rebuild TensorFlow to support different network architectures, CUDA versions, SIMD instruction sets, etc.
@nammbash nammbash Mar 21, 2019

How do we handle TF-MKL PRs from outsiders? What should Intel do?

Contributor Author

That would be entirely Intel's call.
Intel may choose to completely close the source of the MKL plugin, or manage the plugin code just like any other open source project, for example: https://github.com/intel/mkl-dnn

* Major releases still need to be coordinated.


## Packaging Issue Scenarios


How do we maintain release dependencies?
Does Intel sync dot releases with Google's major releases?
If Google releases TF 3.0 with an MKL 2.17 dependency, is the next MKL release 2.18 (working with both TF 2.0 and TF 3.0), or is it tf-mkl 2.18, or 3.1, or both?

Contributor Author

I am not sure what you mean. The main TF package can either depend on MKL, or be depended on by MKL; we will not accept a circular dependency.
If TF depends on MKL, MKL will depend on tensorflow-base. If MKL chooses to depend on TF, it will work like an add-on.

What we would like to do, just as described here, is to enable everyone to do independent releases. Therefore, avoiding circular dependencies is a must.

@bhack
Contributor

bhack commented Aug 2, 2019

What is the status of this?

@mihaimaruseac
Contributor

It is a work in progress; the first module will be the filesystems (#101).

@bhack
Contributor

bhack commented Oct 20, 2019

Is this mergeable or not?

@gunan
Contributor Author

gunan commented Oct 21, 2019

It will be merged, but I have to edit some of the sections.
I am sorry for taking so long to merge this proposal.
While this is not yet merged, as you can see from the other RFCs the work is definitely under way.

@bhack bhack mentioned this pull request Nov 19, 2019
@theadactyl
Contributor

@gunan any way we could resolve this RFC?

@bhack
Contributor

bhack commented Nov 23, 2019

Can we try to extend the deadline if needed? It is still set to 2019-04-04.

Labels: cla: yes · RFC: Accepted (RFC Design Document: Accepted by Review)