Skip to content

Build Tensorflow and libtensorflow from source#64716

Merged
abbradar merged 9 commits intoNixOS:masterfrom
abbradar:tensorflow-revive
Jul 31, 2019
Merged

Build Tensorflow and libtensorflow from source#64716
abbradar merged 9 commits intoNixOS:masterfrom
abbradar:tensorflow-revive

Conversation

@abbradar
Copy link
Member

@abbradar abbradar commented Jul 13, 2019

Motivation for this change

Restore Tensorflow source build while leaving binary builds . Based on @yorickvP and @timokau work.with some parts taken from both.

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

cc @uri-canva for changes regaring Bazel.

Tested both CUDA and non-CUDA builds of the Python library, both source and binary builds. Didn't test libtensorflow, I assume that if Python library works and ldd output is okay libtensorflow is okay too.

@abbradar
Copy link
Member Author

Forgot to mention actual previous PRs: #63208 and #63616.

@ofborg ofborg bot added the 6.topic: python Python is a high-level, general-purpose programming language. label Jul 13, 2019
@abbradar
Copy link
Member Author

Darwin still uses binary builds for now. I don't have a Darwin machine so I can't test if source build works and fix it if it doesn't - unless someone wants to take on that role!

@abbradar abbradar requested a review from timokau July 13, 2019 22:48
@abbradar
Copy link
Member Author

(For some reason I cannot request a review from @yorickvP on Github)

@abbradar abbradar force-pushed the tensorflow-revive branch from df1b94b to 568648c Compare July 14, 2019 07:25
@ofborg ofborg bot added 8.has: clean-up This PR removes packages or removes other cruft 11.by: package-maintainer This PR was created by a maintainer of all the package it changes. 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. labels Jul 14, 2019
@abbradar abbradar force-pushed the tensorflow-revive branch from 568648c to 45298f1 Compare July 14, 2019 07:59
@abbradar
Copy link
Member Author

Updated binary libtensorflow doesn't build for Darwin. Can someone with Darwin knowledge help with it?

@danieldk
Copy link
Contributor

danieldk commented Jul 14, 2019

Updated binary libtensorflow doesn't build for Darwin. Can someone with Darwin knowledge help with it?

It seems that Google now build on newer macOS versions, I had the same problem with Tensorflow 1.13.1 when building on my own MacBook with Mojave. A newer cctools is needed, see #49371. I think this is the relevant PR: #61172

@yorickvP
Copy link
Contributor

This is great! I'm hoping to get a fully static build at some point, but having the same protobuf on tensorflow as the rest of the code should fix our linker errors! :)

You can't request a review from me because I don't have commit rights, I suppose.

Copy link
Contributor

@yorickvP yorickvP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoever built tensorflow upstream must have patched out that saved_model_portable_proto line. it's been in there for several version snow.

@yorickvP yorickvP mentioned this pull request Jul 14, 2019
10 tasks
@timokau
Copy link
Member

timokau commented Jul 14, 2019

As I mentioned in #63616, there are still issues with the python2 build. Seems to be a general problem with parallelism on python2, only manifests itself here because its such a big project and basically hopeless to build without a lot of parallelism (used 12 cores).

I already added one patch to python2 (#64067), but I still saw the issue (sometimes, its transient) after that. I suspect the second part of the fix in the python ticket is also needed. That was never backported by debian, so I had to do it myself (being not very experienced with C). I've pushed those changes to the branch of the other PR. Its not very polished, which is why I didn't push it before.

I don't have time to work on this or give extensive review right now. But I can answer questions if you have any.

@abbradar
Copy link
Member Author

@timokau I saw that one but IIUC this is a general Python problem, not Tensorflow-specific one, so I didn't really look into it. It shouldn't happen during the "big" Tensorflow build because I only build "source" package now and build the wheel separately with our buildPythonProgram. I didn't observe the problem on 8-core (16-thread) machine during ~10 builds too.

@abbradar abbradar force-pushed the tensorflow-revive branch from 45298f1 to 7b5a242 Compare July 14, 2019 17:28
@timokau
Copy link
Member

timokau commented Jul 14, 2019

Neat, so you only need bazel for the C part and don't need to unnecessarily rebuild everything for libtensorflow, py2 tensorflow and py3 tensorflow?

@abbradar
Copy link
Member Author

Sadly no, you still need to rebuild everything because their C libraries link to different Pythons and I presume use different APIs. Still, in the end you get something that resembles a usual source Python package with setup.py etc. I use this as src for buildPythonPackage so any Python bugs should be isolated there.

@timokau
Copy link
Member

timokau commented Jul 14, 2019

Oh, so you mean you just skip the final bdist_wheel as it was done previously? I actually changed that on purpose, so that we can unify the binary and source builds. That way we wouldn't have all the duplication between default.nix and bin.nix.

I don't think the bdist_wheel step alone was responsible for the build failures, but who knows.

@abbradar
Copy link
Member Author

The primary reason I switched back to the previous behavior was to ensure fixupPhase runs on unpacked Tensorflow libraries. This ensures RPATH is set up correctly.

@abbradar abbradar closed this Jul 14, 2019
@abbradar abbradar reopened this Jul 14, 2019
@yorickvP
Copy link
Contributor

Will this work with bazel 0.28.0? #64633

@timokau
Copy link
Member

timokau commented Jul 15, 2019

My PR had some issues with bazel 0.27, not sure if @abbradar worked around those. But the bazel team has promised no further backwards incompatibilities for the next 3 releases (then they'll break backwards compatibility once more and release 1.0).

@abbradar
Copy link
Member Author

But the bazel team has promised no further backwards incompatibilities for the next 3 releases (then they'll break backwards compatibility once more and release 1.0).

Nice news! I didn't test 0.28 though yet, wanted to do it today but sadly I won't have time for one more rebuild.

some issues with bazel 0.27, not sure if @abbradar worked around those

Huh, didn't notice anything. Maybe you remember something specific? I apply patch for 0.27 from upstream and also add a flag to enable backwards compatibility with some behavior.

@abbradar
Copy link
Member Author

Now that other parts are here I'll wait till staging goes into master and merge this (so that we don't break Python 2 + Tensorflow).

@timokau
Copy link
Member

timokau commented Jul 17, 2019

Why not just retarget this PR to staging and merge now? Then we'll get it automatically on the next staging-next merge.

@abbradar
Copy link
Member Author

abbradar commented Jul 17, 2019 via email

@timokau
Copy link
Member

timokau commented Jul 17, 2019

Fair enough.

@timokau
Copy link
Member

timokau commented Jul 18, 2019

I tried to build tensorflow with SSE4.1 support, but unfortunately the build failed. Apparently that has to be specified at configure time now. This is my attempt to get it working:
https://github.com/timokau/nixpkgs/commits/3323284
The patch to buildBazelPackages is necessary since tensorflow won't accept the --config=opt flag for fetching.

Feel free to cherry-pick and/or rebase.

@abbradar abbradar force-pushed the tensorflow-revive branch 2 times, most recently from 81651a2 to 388a1ed Compare July 25, 2019 13:24
@ofborg ofborg bot added the 2.status: merge conflict This PR has merge conflicts with the target branch label Jul 25, 2019
@abbradar abbradar force-pushed the tensorflow-revive branch from 388a1ed to 5f49e3c Compare July 31, 2019 08:34
@ofborg ofborg bot removed the 2.status: merge conflict This PR has merge conflicts with the target branch label Jul 31, 2019
timokau and others added 9 commits July 31, 2019 13:28
This also changes it to a from-source build.
This merges work done by yorickvP and timokau in NixOS#63208 and NixOS#63616 respectively.
Now the derivation builds both libtensorflow and the Python package and puts them into
different outputs.

Quite a bit of improvements were done on the top, including:

* Use official tag revision as source, not a branch;
* Use all system libraries possible (before only one was actually used);
* Move various environment variables to the derivation itself from hooks;
* Use source Python build instead of wheel build to ensure fixup hooks do their important jobs on libraries;
* And more that I forgot!
They sometimes take separate flags.
Now need to be passed in the configure phase.

abbradar: Don't change CUDA build hash.
@abbradar abbradar force-pushed the tensorflow-revive branch from 5f49e3c to cd0e461 Compare July 31, 2019 10:34
@ofborg ofborg bot added the 8.has: package (new) This PR adds a new package label Jul 31, 2019
@ofborg ofborg bot requested a review from basvandijk July 31, 2019 10:45
@abbradar abbradar merged commit 6ee9799 into NixOS:master Jul 31, 2019
@timokau
Copy link
Member

timokau commented Aug 24, 2019

@abbradar in case you haven't noticed yet, the tensorflow build (py2 and py3, but there are different issues for py2 and py3 I think) is currently failing. Unfortunately I don't have access to the necessary compute resources to debug this right now.

@abbradar
Copy link
Member Author

Fixed in 7109546. Thanks for the heads up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

6.topic: python Python is a high-level, general-purpose programming language. 8.has: clean-up This PR removes packages or removes other cruft 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. 11.by: package-maintainer This PR was created by a maintainer of all the package it changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants