Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

4.1.0rc1 regression: Extraneous cache misses with --remote_download_toplevel #13262

Closed
jlaxson opened this issue Mar 24, 2021 · 3 comments
Closed
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@jlaxson
Copy link
Contributor

jlaxson commented Mar 24, 2021

Description of the problem / feature request:

When building with --remote_download_toplevel, some actions that are definitely in the cache are missed.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

% bazel clean
% bazel build <target> --remote_download_toplevel
INFO: Elapsed time: 306.725s, Critical Path: 44.48s
INFO: 1004 processes: 167 remote cache hit, 9 internal, 828 darwin-sandbox.
INFO: Build completed successfully, 1004 total actions
% bazel clean
% bazel build <target> --remote_download_toplevel
INFO: Elapsed time: 100.161s, Critical Path: 75.20s
INFO: 1066 processes: 902 remote cache hit, 64 internal, 100 darwin-sandbox.
INFO: Build completed successfully, 1066 total actions

The second build should have 100% hit rate, but does not. At the end of this is a build log diff [1] showing a rebuild where all the actions are identical.

Anecdotally, the hit rate increases on each iteration of clean+build. On the second build, maybe 10% of actions are missed, and it seems like they are fairly deep in the action graph - usually library linking and the like. On the third, it will be far fewer. It seems like actions that have inputs which are present only in the cache (vs locally present) end up generating different hashes

What operating system are you running Bazel on?

OSX 10.14 + Xcode 11.3.1

What's the output of bazel info release?

The issue does not occur with release 4.0.0 but does occur with release 4.1.0rc1 (#13099) as well as with master.

Have you found anything relevant by searching the web?

No

Any other information, logs, or outputs that you want to share?

[1] https://gist.github.com/jlaxson/aec2813d19a8494b237e15093e0359a3
This is a side by side diff (left is first build when cache is totally empty, right is clean+rebuild) with two actions excerpted. First one was a cache hit and shows no actual_outputs in the event, second is a lib made from that object, and was a cache miss despite all inputs being the same

--announce_rc output:

INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=234
INFO: Reading rc options for 'build' from /Users/me/repo/.bazelrc:
  Inherited 'common' options: --experimental_allow_tags_propagation --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /Users/me/repo/.bazelrc:
  'build' options: --cxxopt=-std=c++14 --host_cxxopt=-std=c++14 --platforms=//:macos -c opt --copt=-g --features=-static_link_cpp_runtimes --dynamic_mode=fully --build_tag_filters=-halide_generator --experimental_cc_shared_library --experimental_guard_against_concurrent_changes --define open_source_build=true --per_file_copt=tensorflow/.*@-g0
INFO: Reading rc options for 'build' from /Users/me/.bazelrc:
  'build' options: --sandbox_writable_path=/Users/me/.ccache --repository_cache=/Users/me/Library/Caches/bazel_repository --distdir=/Users/me/Library/Caches/bazel_dist
INFO: Found applicable config definition common:gcs_cache in file /Users/me/repo/.bazelrc: --google_default_credentials
INFO: Found applicable config definition common:gcs_cache in file /Users/me/.bazelrc: --google_credentials=/creds_file.json
INFO: Found applicable config definition build:gcs_cache in file /User/me/.bazelrc: --remote_cache=https://storage.googleapis.com/cache-bucket
@oquenchil oquenchil added team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug untriaged labels Mar 24, 2021
@jlaxson
Copy link
Contributor Author

jlaxson commented Mar 24, 2021

A git bisect points to this commit: ceaac96 (release-4.1.0rc1) and b6e3ba8 (master) from #12820 cc @coeuvre

@coeuvre coeuvre self-assigned this Mar 25, 2021
@coeuvre coeuvre added P1 I'll work on this now. (Assignee required) and removed untriaged labels Mar 25, 2021
@coeuvre
Copy link
Member

coeuvre commented Mar 25, 2021

Thanks for reporting! I can reproduce this bug locally. Just need more time to investigate the root cause.

coeuvre added a commit to coeuvre/bazel that referenced this issue Mar 29, 2021
… is different. bazelbuild#13262

bazelbuild#12820 changed to set executable bit of input files based on its real value. However, this causes cache misses in --remote_download_toplevel mode since executable bit is changed after action execution by ActionMetadataHandler#getMetadata. This change effectively rolls back bazelbuild#12820.
@coeuvre
Copy link
Member

coeuvre commented Mar 29, 2021

The root cause is that the executable bit of an output file is changed after action execution by ActionMetadataHandler#getMetadata.

Repro:

$ cat BUILD

genrule(
  name = "dep",
  srcs = [],
  outs = ["dep"],
  cmd = "echo 'dep' > $@",
)

genrule(
  name = "test",
  srcs = [":dep"],
  outs = ["out"],
  cmd = "cat $(SRCS) > $@",
)

$ bazel build --remote_cache=xxx --remote_download_toplevel :test
$ bazel clean
$ bazel build --remote_cache=xxx --remote_download_toplevel :test

Explain:

  1. To build :test, we need the output of :dep.
  2. The first build
    1. Build :dep and upload its output dep to the remote cache with the executable bit set to false.
    2. Change file mode of dep with chmod 555.
    3. Build :test with dep which executable bit is true by reading from file system. Compute the digest and upload the result.
  3. The second build
    1. Build :dep but find cached result. Since --remote_download_toplevel is enabled, only download metadata of dep which executable bit is false.
    2. Compute the digest for :test, since this time executable bit of dep is false which is different from last time, we won't hit the cache.

We can't use the real executable bit until the behavior of ActionMetadataHandler is changed. #13276 rolls back the change.

@coeuvre coeuvre mentioned this issue Mar 31, 2021
9 tasks
coeuvre added a commit to coeuvre/bazel that referenced this issue Mar 31, 2021
… is different. bazelbuild#13262

bazelbuild#12820 changed to set executable bit of input files based on its real value. However, this causes cache misses in --remote_download_toplevel mode since executable bit is changed after action execution by ActionMetadataHandler#getMetadata. This change effectively rolls back bazelbuild#12820.
coeuvre added a commit to coeuvre/bazel that referenced this issue Mar 31, 2021
… is different. bazelbuild#13262

bazelbuild#12820 changed to set executable bit of input files based on its real value. However, this causes cache misses in --remote_download_toplevel mode since executable bit is changed after action execution by ActionMetadataHandler#getMetadata. This change effectively rolls back bazelbuild#12820.
philwo pushed a commit that referenced this issue Apr 19, 2021
… is changed

Fixes #13262.

#12820 changed to set executable bit of input files based on its real value. However, this causes cache misses in `--remote_download_toplevel` mode since executable bit is changed after action execution by `ActionMetadataHandler#getMetadata`.

This change effectively rolls back #12820.

Closes #13276.

PiperOrigin-RevId: 367009617
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants