Skip to content

[WIP] tensorflow: re-enable build from source#63616

Closed
timokau wants to merge 11 commits intoNixOS:masterfrom
timokau:tensorflow-source-build
Closed

[WIP] tensorflow: re-enable build from source#63616
timokau wants to merge 11 commits intoNixOS:masterfrom
timokau:tensorflow-source-build

Conversation

@timokau
Copy link
Member

@timokau timokau commented Jun 21, 2019

Motivation for this change

This resurrects the tensorflow source build. Its WIP because it builds an unreleased tensorflow version (current release didn't support the current bazel) and needs some cleanup, but it works. Builds upon #63580.

My motivation for this was that I need to run tensorflow on a VM without avx instructions while the official wheel requires avx (tensorflow/tensorflow#17411). There are still cpus sold without AVX support, so we shouldn't require it by default.

Not sure yet if I'll actually push it over the finish line, since it "works for me" for now. But anyway, here it is.

CC @volth, @abbradar who have touched the source build before.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Assured whether relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@timokau timokau requested a review from FRidh as a code owner June 21, 2019 12:55
@timokau timokau changed the title tensorflow: re-enable build from source [WIP] tensorflow: re-enable build from source Jun 21, 2019
@ofborg ofborg bot added 6.topic: python Python is a high-level, general-purpose programming language. 8.has: clean-up This PR removes packages or removes other cruft labels Jun 21, 2019
@ofborg ofborg bot requested a review from abbradar June 21, 2019 13:08
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. labels Jun 21, 2019
@timokau timokau force-pushed the tensorflow-source-build branch from 7a588e1 to e7c7310 Compare June 22, 2019 12:26
@timokau
Copy link
Member Author

timokau commented Jun 22, 2019

Of course a mix of perfectionism and sunk-cost fallacy drove me to continue to work on it.

I've included a tf 1.14 update and an update of the dependencies. Turns out the source build for 1.14 works after all, we just have to tell tensorflow not to check the maximum bazel version.

The value of taking some libs from the system is a bit questionable, but it's probably the right thing to do anyways.

@timokau timokau force-pushed the tensorflow-source-build branch from e7c7310 to 94357aa Compare June 22, 2019 12:30
@ofborg ofborg bot added 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. and removed 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. labels Jun 22, 2019
@abbradar
Copy link
Member

Wow! That's really cool, I didn't expect anyone else having enough patience to dive into this :D

I'll hopefully be able to test during this week when replacement part for my PC arrives - Tensorflow is painfully slow to build otherwise.

This was referenced Jun 24, 2019
@timokau
Copy link
Member Author

timokau commented Jun 24, 2019

Thanks! Apparently at least two people are #63208

This isn't quite done yet, had to debug a nasty issue with the cuda build (#38471 (comment)). Then some cleanup.

@timokau timokau force-pushed the tensorflow-source-build branch from 639cb01 to a70d14f Compare June 24, 2019 21:48
@timokau timokau requested a review from Profpatsch as a code owner June 24, 2019 21:48
@ofborg ofborg bot added the 2.status: merge conflict This PR has merge conflicts with the target branch label Jun 24, 2019
@Profpatsch
Copy link
Member

Hm, can you explain why you need gcc and binutils in the bazel derivation for this to work? What’s the errors you are getting otherwise?

@timokau
Copy link
Member Author

timokau commented Jun 25, 2019

That is to fix the issue I explained here, where bazel tries to take the default locations for certain tools from PATH during build time.

Unfortunately that still doesn't fix the underlying problem, which is the build (only with cuda for some reason) failing because bazel tries to execute /usr/bin/ar. More details here: bazelbuild/bazel#8715

@timokau timokau force-pushed the tensorflow-source-build branch from a70d14f to 8d237ee Compare June 25, 2019 21:49
@timokau
Copy link
Member Author

timokau commented Jun 25, 2019

Cuda build finally works! Cleanup still needed..

@timokau timokau force-pushed the tensorflow-source-build branch from 8d237ee to 21db4a7 Compare June 26, 2019 21:50
@timokau
Copy link
Member Author

timokau commented Jun 26, 2019

Some cleanup done.

@Profpatsch the bazel change ended up not being necessary here, since tensorflow sets its own toolchain. It may still be a good idea to fix up bazels default toolchain though. Unfortunately I couldn't figure out where the actual result ends up, so I couldn't verify my patches.

@ofborg ofborg bot removed the 2.status: merge conflict This PR has merge conflicts with the target branch label Jun 26, 2019
@timokau
Copy link
Member Author

timokau commented Jun 27, 2019

Forked out the update to #63873, because that is ready for merge now.

@abbradar
Copy link
Member

I work on merging both @timokau and @yorickvP's changes into one derivation which produces both a wheel and a library.

@timokau
Copy link
Member Author

timokau commented Jul 12, 2019

Thanks!

This PR isn't quite done however. The python2 build still sometimes fails, I'll probably need to add another fix to python2 (in addition to #64067). The reason I haven't worked on this is that I currently need the hardware for more important things (i.e. actually using tensorflow as opposed to packaging it). I'll likely get back to it some time, but if you want to take over I certainly won't complain either :)

abbradar added 3 commits July 14, 2019 11:26
Timestamp verification skip is no longer needed (not sure why). Generally we
better off always using the environment hack for all packages because that
ensures all NIX_* flags are correctly applied.

One possible improvement in future is to filter only NIX_* variables to
passthru in Bazel.
This merges work done by yorickvP and timokau in NixOS#63208 and NixOS#63616 respectively.
Now the derivation builds both libtensorflow and the Python package and puts them into
different outputs.

Quite a bit of improvements were done on the top, including:

* Use official tag revision as source, not a branch;
* Use all system libraries possible (before only one was actually used);
* Move various environment variables to the derivation itself from hooks;
* Use source Python build instead of wheel build to ensure fixup hooks do their important jobs on libraries;
* And more that I forgot!
@timokau timokau force-pushed the tensorflow-source-build branch from 21db4a7 to 9a35096 Compare July 14, 2019 16:38
@ofborg ofborg bot added the 2.status: merge conflict This PR has merge conflicts with the target branch label Jul 14, 2019
Without this Bazel always picks Python 3 which breaks Python 2 packages.
Strangely enough just dropping this patch works, with all `bazel.tests`
passing.
+ /* first construct the filename with a .tmp suffix */
+ cpathname_tmp = PyMem_MALLOC(strlen(cpathname) + 5);
+ if (cpathname_tmp == NULL) {
+ PyErr_Clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, we need to clear the error here because this function is okay to fail. I updated my patch.

I didn't use goto because the original author preferred less control flow and I wanted to keep the original patch's spirit.

timokau and others added 4 commits July 18, 2019 11:00
This is python bug https://bugs.python.org/issue13146. Fixed since
python 3.4. It makes pyc creation atomic, preventing a race condition.
The patch has been rebased on our deterministic build patch.

It wasn't backported to python 2.7 because there was a complaint about
changed semantics. Since files are now created in a temporary directory
and then moved, symlinks will be overridden. See
https://bugs.python.org/issue17222.

That is an edge-case however. Ubuntu and debian have backported the fix
in 2013 already, making it mainstream enough for us to adopt.

(cherry picked from commit 9db3a58)
Turns out fixing this only in importlib is not sufficient and we
need to backport CPython part of the fix too.

This patch is based on https://hg.python.org/cpython/rev/c16063765d3a
but because the code around is different there are some changes (C-strings
instead of Python objects etc.)

With this patch Tensorflow builds successfully on many-core machine.

(cherry picked from commit da295a1)
They sometimes take separate flags.
Now need to be passed in the configure phase.
@timokau timokau force-pushed the tensorflow-source-build branch from 9a35096 to 0c3db9a Compare July 18, 2019 11:50
abbradar added a commit to abbradar/nixpkgs that referenced this pull request Jul 31, 2019
This merges work done by yorickvP and timokau in NixOS#63208 and NixOS#63616 respectively.
Now the derivation builds both libtensorflow and the Python package and puts them into
different outputs.

Quite a bit of improvements were done on the top, including:

* Use official tag revision as source, not a branch;
* Use all system libraries possible (before only one was actually used);
* Move various environment variables to the derivation itself from hooks;
* Use source Python build instead of wheel build to ensure fixup hooks do their important jobs on libraries;
* And more that I forgot!
@timokau
Copy link
Member Author

timokau commented Aug 8, 2019

Superseded by #64716, thanks @abbradar!

@timokau timokau closed this Aug 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2.status: merge conflict This PR has merge conflicts with the target branch 6.topic: python Python is a high-level, general-purpose programming language. 8.has: clean-up This PR removes packages or removes other cruft 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants