Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid action conflicts due to different configuration hashes #14236

Closed
torgil opened this issue Nov 5, 2021 · 40 comments
Closed

Avoid action conflicts due to different configuration hashes #14236

torgil opened this issue Nov 5, 2021 · 40 comments
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug

Comments

@torgil
Copy link
Contributor

torgil commented Nov 5, 2021

Description of the problem / feature request:

Bazel errors out if the same action (with identical action hash) is created with different configuration hash. Configuration hash should be irrelevant when considering if two actions are identical.

The graph in the minimal example below shows the issue where B and C sets different values of a build setting that controls the output of E:
A -> B -> D -> E
A -> C -> D -> E

Case 1: Using bazel with current ST-folders on above graph yields duplicated actions for target D:

$ bazel build //:A1
  bazel-out/k8-fastbuild-ST-49bba2382c95/bin/libE1_bs1_B.a
  bazel-out/k8-fastbuild-ST-49bba2382c95/bin/libD1.a
  bazel-out/k8-fastbuild-ST-1f07a369810f/bin/libE1_bs1_C.a
  bazel-out/k8-fastbuild-ST-1f07a369810f/bin/libD1.a

Case 2: To avoid duplicated actions in D, we can control the "output directory name" through transitions and thus remove the ST-hash. E avoids action conflicts by consider the build setting value in it's output path. This produces the the error in the topic.

$ bazel build //:A2
ERROR: file '_objs/D2/D2.pic.o' is generated by these conflicting actions:
Label: //:D2
RuleClass: library_input_outdir rule
Configuration: 169a94e6fa80a9d7d69db38a12e45a0f126e39b4e60d76c09ff240fd4812eddb, 9ae0f89835b3c1d0ad9e25dcc04eeb78120ba952c193e68b4a176c42a2734fd0
Mnemonic: CppCompile
Action key: 83337fb9f28e5353fb2871bf039e7dcdb7c41898cabc4b205d23cc7961d232b2
Progress message: Compiling D2.c
PrimaryInput: File:[[<execution_root>]bazel-out/any/bin]D2.c
PrimaryOutput: File:[[<execution_root>]bazel-out/any/bin]_objs/D2/D2.pic.o
Owner information: ConfiguredTargetKey{label=//:D2, config=BuildConfigurationKey[169a94e6fa80a9d7d69db38a12e45a0f126e39b4e60d76c09ff240fd4812eddb]}, ConfiguredTargetKey{label=//:D2, config=BuildConfigurationKey[9ae0f89835b3c1d0ad9e25dcc04eeb78120ba952c193e68b4a176c42a2734fd0]}
MandatoryInputs: are equal
Outputs: are equal
ERROR: com.google.devtools.build.lib.skyframe.ArtifactConflictFinder$ConflictException: com.google.devtools.build.lib.actions.MutableActionGraph$ActionConflictException: for _objs/D2/D2.pic.o, previous action: action 'Compiling D2.c', attempted action: action 'Compiling D2.c'
INFO: Elapsed time: 0.048s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 0 targets configured)

Note that his works for bazel 4.x but this functionality was dropped in af0e20f. Later versions need a revert of that commit to work.

Case3: Today we need, as a workaround, have the same value of the conflicting build setting in all paths to D. In this example we reset the value to default value at the input of D. This has the unwanted consequence that E can no longer be a dependency of D.

$ bazel build //:A3
  bazel-out/any/bin/libE3_bs1_default.a
  bazel-out/any/bin/libD3.a

Both versions of E are gone and both B and C links to an unintended default version.
To workaround 3, we have to rearrange the dependency graph in a non-optimal way, possibly with more targets than desired.

Feature requests: what underlying problem are you trying to solve with this feature?

To allow functionality as explained in the example without duplicated actions or action conflicts.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Start with an empty directory and run the following script inside it. For usage, see description above.

config_hash_action_conflict_example_setup.sh.txt

touch WORKSPACE

cat <<EOF > BUILD.bazel
load(":rules.bzl", "build_setting", "library_no_transition", "library_input_outdir", "library_input_outdir_bs1")
load(":setup_graph.bzl", "setup_graph")

build_setting(
    name = "bs1",
    build_setting_default = "bs1_default",
)

[
    setup_graph(case=case, rule=rule, rule_args=rule_args)

    for case, rule, rule_args in [
        ("1", library_no_transition, {}),
        ("2", library_input_outdir, {
            "input_settings": {
                "//command_line_option:output directory name": "any",
            }},
        ),
        ("3", library_input_outdir_bs1, {
            "input_settings": {
                "//command_line_option:output directory name": "any",
                "//:bs1": "bs1_default",
            }},
        ),
    ]
]
EOF

cat <<EOF > setup_graph.bzl
load(":rules.bzl", "no_transition", "input_bs1", "library_input_bs1")

def setup_graph(*, case, rule, rule_args):
    no_transition(
        name = "A" + case,
        deps = [":B" + case, ":C" + case],
    )

    input_bs1(
        name = "B" + case,
        deps = [":D" + case],
        input_settings = {"//:bs1": "bs1_B"},
    )

    input_bs1(
        name = "C" + case,
        deps = [":D" + case],
        input_settings = {"//:bs1": "bs1_C"},
    )

    rule(
        name = "D" + case,
        deps = [":E" + case],
        **rule_args
    )

    library_input_bs1(
        name = "E" + case,
        path_resolution = ["//:bs1"],
    )
EOF

cat <<EOF > rules.bzl
load("@bazel_tools//tools/cpp:toolchain_utils.bzl", "find_cpp_toolchain")

BuildSettingInfo = provider(fields = ["value"])

def _build_setting_impl(ctx):
    return BuildSettingInfo(value = ctx.build_setting_value)

def _dummy_rule_impl(ctx, files = []):
    transitive = [dep[DefaultInfo].files for dep in ctx.attr.deps]
    return [DefaultInfo(files = depset(direct = files, transitive = transitive))]

def _library_rule_impl(ctx):
    path_resolution = "".join(["_%s" % dep[BuildSettingInfo].value for dep in ctx.attr.path_resolution])
    name = ctx.label.name + path_resolution
    src = ctx.actions.declare_file(name + ".c")
    ctx.actions.write(src, "int %s() {return 1;}\n" % name)
    cc_toolchain = find_cpp_toolchain(ctx)
    feature_configuration = cc_common.configure_features(ctx = ctx, cc_toolchain = cc_toolchain)
    cc_args = {"actions": ctx.actions, "feature_configuration": feature_configuration, "cc_toolchain": cc_toolchain}
    compilation_context, compilation_outputs = cc_common.compile(
        name = name, srcs = [src], **cc_args)
    linking_context, linking_outputs = cc_common.create_linking_context_from_compilation_outputs(
        name = name, compilation_outputs = compilation_outputs, **cc_args)
    library_to_link = linking_outputs.library_to_link
    files = []
    files += [library_to_link.static_library] if library_to_link.static_library else []
    files += [library_to_link.pic_static_library] if library_to_link.pic_static_library else []
    return _dummy_rule_impl(ctx, files = files) + [CcInfo(
        compilation_context = compilation_context, linking_context = linking_context)]

def _input_transition_impl(settings, attr):
    settings = {setting: settings[setting] for setting in attr._input_settings}
    settings.update(attr.input_settings)
    return settings

def _create_custom_rule(implementation, input = [], cpp = False):
    attrs = {"deps": attr.label_list(), "path_resolution": attr.label_list()}
    args = {"attrs": attrs, "implementation": implementation}
    if cpp:
        args["fragments"] = ["cpp"]
        args["toolchains"] = ["@bazel_tools//tools/cpp:toolchain_type"]
        attrs["_cc_toolchain"] = attr.label(default = "@bazel_tools//tools/cpp:current_cc_toolchain")

    if input:
        attrs.update({
            "input_settings": attr.string_dict(),
            "_input_settings": attr.string_list(default = input),
            "_whitelist_function_transition": attr.label(
                default = "@bazel_tools//tools/whitelists/function_transition_whitelist"),
        })
        args["cfg"] = transition(inputs = input, outputs = input, implementation = _input_transition_impl)
    return rule(**args)

build_setting = rule(
    implementation = _build_setting_impl,
    build_setting = config.string(flag = False),
)

outdir = "//command_line_option:output directory name"
no_transition = _create_custom_rule(_dummy_rule_impl)
input_bs1 = _create_custom_rule(_dummy_rule_impl, ["//:bs1"])
library_no_transition = _create_custom_rule(_library_rule_impl, cpp = True)
library_input_bs1 = _create_custom_rule(_library_rule_impl, ["//:bs1"], cpp = True)
library_input_outdir = _create_custom_rule(_library_rule_impl, [outdir], cpp = True)
library_input_outdir_bs1 = _create_custom_rule(_library_rule_impl, [outdir, "//:bs1"], cpp = True)
EOF

What operating system are you running Bazel on?

Ubuntu 20.04

What's the output of bazel info release?

development version

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

Used current HEAD from master branch on github

$ git checkout e8a066e
$ git revert af0e20f
$ # fix conflicts
$ git revert --continue

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

$ git remote get-url bazelbuild; git rev-parse bazelbuild/master ; git rev-parse HEAD ; git rev-parse HEAD~1
https://github.com/bazelbuild/bazel.git
e8a066e
4d2f7a74ead44aecf80f1b0b271f2b9fa2816d01
e8a066e

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

Similar to #14023, solution to this issue should make #13587 obsolete.

@torgil torgil changed the title Avoid action conflicts if configuration hash is the only difference Avoid action conflicts due to different configuration hashes Nov 5, 2021
@gregestren
Copy link
Contributor

I'm a bit lost in the conversation history here and at #13587.

I didn't fully follow Case 3. For example, can you elaborate "This has the unwanted consequence that E can no longer be a dependency of D"?

And what is the exact goal? All of @sdtwigg 's ongoing work should eliminate action conflicts more fundamentally. Is the primary concern of this issue action conflicts or duplicated action performance issues or both?

@torgil
Copy link
Contributor Author

torgil commented Nov 8, 2021

I'm a bit lost in the conversation history here and at #13587.

#14023 (and @stdwigg's solution) takes care of the "directory name fragment" problem while this issue takes care of some other issues discussed in #13587 . Currently this a real issue to us (see below) while we have #13587 as sufficient workaround for #14023.

Note also that there were also other namespace aspects discussed in #13587, like the IDE/debug-issue, not covered in either of these issues. This means solving both of these issues is required but not necessarily sufficient to deprecate the "output directory name" transition.
Edit: Different actions within a single rule may also have different namespace requirements.

I didn't fully follow Case 3. For example, can you elaborate "This has the unwanted consequence that E can no longer be a dependency of D"?

Sure. Since D (or edge to D) needs to reset build-setting values to avoid above conflict it cannot depend on any target dependent on these build-settings (E above). The build-setting dependency can arise both through use of native functions like cc_common.compile and/or the use of custom build-settings. This means that the graph has to be rewritten such that E is not a dependency of D (for instance moved to B and C in the example above).

And what is the exact goal? All of @sdtwigg 's ongoing work should eliminate action conflicts more fundamentally. Is the primary concern of this issue action conflicts or duplicated action performance issues or both?

The goal with this is issue is to remove the need for current and upcoming workarounds (technical debt). Workarounds is chosen case by case depending on situation but is typically:

  • duplicated actions due to different output paths, similar to case1 above
  • need for edge transitions resetting build-settings to various dependencies
  • non-optimal graphs like case3 discussed above. D might for instance be a subsystem with many actions and high duplication cost

@gregestren
Copy link
Contributor

I didn't fully follow Case 3. For example, can you elaborate "This has the unwanted consequence that E can no longer be a dependency of D"?

Sure. Since D (or edge to D) needs to reset build-setting values to avoid above conflict it cannot depend on any target dependent on these build-settings (E above). The build-setting dependency can arise both through use of native functions like cc_common.compile and/or the use of custom build-settings. This means that the graph has to be rewritten such that E is not a dependency of D (for instance moved to B and C in the example above).

That makes more sense, thank you.

So in the above example, the sole purpose of the build setting is for B/C to set it and for E to consume it? And you desire both of

  • For both B->D and C->D any artifact d.out produced by any action in D has the path bazel-out/<same prefix>/foo/d.out. This is safe because that action doesn't depend on the build setting
  • For B->D->E, any artifact e.out produced by E has path bazel-out/<prefix1>/foo/e.out. For C->D->E, it's bazel-out/<prefix2>/foo/e.out. This is because each version of e.out has different content since it depends on the build setting.

The goal with this is issue is to remove the need for current and upcoming workarounds (technical debt). Workarounds is chosen case by case depending on situation but is typically:

  • duplicated actions due to different output paths, similar to case1 above
  • need for edge transitions resetting build-settings to various dependencies
  • non-optimal graphs like case3 discussed above. D might for instance be a subsystem with many actions and high duplication cost

I support this goal. But there's an intrinsic tension between efficiency (non-duplication, efficient graphs) and correctness (no action conflicts). We can't confidently know if D's actions depend on the build setting without having fine-grained insight into whether it's actions consume anything coming from E. The current Starlark APIs don't provide enough isolation to make that easy: D's rule implementation runs a bunch of Starlark that presumably includes E somehow (otherwise why would E be a dep?) and also creates a bunch of actions. If any of these are intertwined we get implicit dependencies that require D's outputs to be different.

If you really know for a fact this isn't a concern for your rules, we can consider some kind of special tagging. But that won't apply generally, and that puts more responsibility on you as rule designer to ensure these tags are correct and don't fall out of date.

There's also the stripped output path approach, which I've long been interested in: don't worry about these issues, but before sending the work to the executor strip all config paths completely. The net effect is actions won't be executed redundantly. But the build graph itself will still retain redundancies, so it's only a partial solution. Whether that's enough by itself of course depends on specific use cases.

@eric-skydio
Copy link

The lens that I have found most clear for thinking about this is thinking about how to handle a single rule with multiple actions, some of which depend on the build configuration while others don't. For example, a (custom) C++ rule that wanted to compile only once but link in several different configurations for different targets. In this view, an ideal API would ctx.declare_output() to specify configuration keys that should or should not be included in the output directory root for this specific file, lowering the concept of "output root" to the action level instead of the rule level. The same issue applies when a rule's analysis-time behavior depends on configuration but execution-time behavior does not.

@gregestren correctly points out that a naive implementation of this breaks correctness if the rule author misses some subtle way that the configuration value can affect the action parameters. At minimum, the rule user can always pass srcs = select(...) where select is operating on build configuration that's not supposed to affect this action. As previously established in great depth, automatically de-conflicting when and only when this changes the output is very hard.

For our use case (@torgil can you confirm for yours) I think we can avoid this problem by instead just requiring that all instances of an action that generate a given file in a given output root must be identical, and failing the build if that is not the case. That way, if someone tries to get tricky with select statements, it just fails.

This proposed rule (all actions generating a given file in a single build must be identical) is looser than the current rule in Bazel master (only one action may generate a file) but stricter than that in 4.x (hahaha, go for it, hopefully the outputs are equivalent). It still leaves the onus on rule authors to determine what files are "supposed" to depend on what configuration values, but that's acceptable for us.

@torgil
Copy link
Contributor Author

torgil commented Nov 9, 2021

So in the above example, the sole purpose of the build setting is for B/C to set it and for E to consume it? And you desire both of

  • For both B->D and C->D any artifact d.out produced by any action in D has the path bazel-out/<same prefix>/foo/d.out. This is safe because that action doesn't depend on the build setting
  • For B->D->E, any artifact e.out produced by E has path bazel-out/<prefix1>/foo/e.out. For C->D->E, it's bazel-out/<prefix2>/foo/e.out. This is because each version of e.out has different content since it depends on the build setting.

Yes, but it can also be the case that E sends build-setting based information up to B/C to consume. It can (depending on use-case) also make sense for E to avoid name-collisions in the local file path, eg bazel-out/<prefix>/foo/e.out-suffix1 (see example above).

I support this goal. But there's an intrinsic tension between efficiency (non-duplication, efficient graphs) and correctness (no action conflicts). We can't confidently know if D's actions depend on the build setting without having fine-grained insight into whether it's actions consume anything coming from E. The current Starlark APIs don't provide enough isolation to make that easy: D's rule implementation runs a bunch of Starlark that presumably includes E somehow (otherwise why would E be a dep?) and also creates a bunch of actions. If any of these are intertwined we get implicit dependencies that require D's outputs to be different.

A common situation leading here is that variation tend to appear in lower level nodes (E). D may be library without need for variation itself but it calls functions that is in turn has configuration specific functionality depending on os, platform, hardware or other. B and C may want to control this environment/configuration through a build-setting. The object files and static library D produces are independent of the build-setting but when B or C link binaries they need to link against the correct (configuration dependent) libraries given by E.

@torgil
Copy link
Contributor Author

torgil commented Nov 9, 2021

@gregestren correctly points out that a naive implementation of this breaks correctness if the rule author misses some subtle way that the configuration value can affect the action parameters.

With "output directory name" transition (no hashes), this fails with action conflicts today.

For our use case (@torgil can you confirm for yours) I think we can avoid this problem by instead just requiring that all instances of an action that generate a given file in a given output root must be identical, and failing the build if that is not the case. That way, if someone tries to get tricky with select statements, it just fails.

Yes. This will produce an action conflict the rule author needs to take care of. Can you elaborate on how a user could break correctness with tricky select statements?

This proposed rule (all actions generating a given file in a single build must be identical) is looser than the current rule in Bazel master (only one action may generate a file) but stricter than that in 4.x (hahaha, go for it, hopefully the outputs are equivalent).

In case3 above, D have two identical actions generating the same file. It worked on master as of Friday. The "output directory name" patch didn't add anything enforce this.

@gregestren
Copy link
Contributor

gregestren commented Nov 10, 2021

This proposed rule (all actions generating a given file in a single build must be identical) is looser than the current rule in Bazel master (only one action may generate a file) but stricter than that in 4.x (hahaha, go for it, hopefully the outputs are equivalent). It still leaves the onus on rule authors to determine what files are "supposed" to depend on what configuration values, but that's acceptable for us.

Which change from 4.x to master do you mean? Removal of the output directory name flag?

This is similar to action sharing, although my reading is you're saying something slightly different.

@eric-skydio
Copy link

In case3 above, D have two identical actions generating the same file. It worked on master as of Friday. The "output directory name" patch didn't add anything enforce this.

I may be wrong about how this works, and apologies for taking so man words to explain this, but what I think you're referring to with case 3 is where all configuration settings on the rule are identical, in which case the analysis-time rule graph only has a single node, and I've been referring to that as a "single rule" and "single action". I'm instead trying to refer to a situation where multiple differently-configured rules produce actions with identical parameters, because this particular action in the rule doesn't depend on the configuration values that differ. This was possible to accomplish in 4.x using output directory name, but I don't know of a way to accomplish it on master today.
My proposal is to 1. add an API to make this possible again on a per-output-file or per-action basis and 2. enforce that when it happens, the two actions involved must have identical parameters.

This is similar to action sharing, although my reading is you're saying something slightly different.

I hadn't heard of this before, but it seems like it might be exactly what I'm looking for here. Multiple differently-configured versions of a rule should be able to declare that a specific output file does not depend on the configuration, in which case the output root should be configuration-blind and any actions producing that file should be "shared" between the differently-configured rule instances and only executed once (with an error if the action key is different).
This is different than Case 3 because the same rules that "share" this action might also be doing other things like declaring other actions that do depend on the configuration or passing data through providers from other rules that do depend on the configuration.

@eric-skydio
Copy link

For our use case (@torgil can you confirm for yours) I think we can avoid this problem by instead just requiring that all instances of an action that generate a given file in a given output root must be identical, and failing the build if that is not the case. That way, if someone tries to get tricky with select statements, it just fails.

Yes. This will produce an action conflict the rule author needs to take care of. Can you elaborate on how a user could break correctness with tricky select statements?

In Bazel 4.x, we were using output directory name to create code-generation rules that always put their outputs into bazel-out/unconfigured no matter what configuration they were built in. This would result in two instances of the same rule producing the same output filepath, which worked fine as long as those outputs were in fact identical. I was using "tricky select statements" to refer to one way that (unexpectedly to the rule author) this could fail to be the case.

# This rule is written assuming that its output files will always be
# bit-for-bit identical regardless of the active configuration
generate_code(
	srcs = select({"@platforms//os:linux": ["a.in"], ...})
)

I could have sworn that I saw this happen once without Bazel raising an error, and instead just picking one configuration of the rule to "win", although it's possible that even in 4.x there were safety guards against action conflicts like this and I'm misremembering. If those guards are in place, I think they should still be sufficient going forward to make sure that no two different-key actions in the same build produce the same file, which is the only way correctness issues arise.

I'm not aware of any way to produce this sort of conflict in 5.x prereleases today.

@torgil
Copy link
Contributor Author

torgil commented Nov 11, 2021

I'm instead trying to refer to a situation where multiple differently-configured rules produce actions with identical parameters, because this particular action in the rule doesn't depend on the configuration values that differ. This was possible to accomplish in 4.x using output directory name, but I don't know of a way to accomplish it on master today.

Alright. This is exactly what I'm trying to solve with output directory name in combination with the solution to this issue which if I recall right also was an issue in 4.x with maybe the difference that the configuration hash also was included in the action hash.

My proposal is to 1. add an API to make this possible again on a per-output-file or per-action basis and 2. enforce that when it happens, the two actions involved must have identical parameters.

This should also be hit by above config hash issue. It would simplify deduplication of actions in cases where a rule creates both platform dependent and platform independent actions. The alternates I've considered are less appealing:

  • generic output directory name + path resolution in output artifact names -> affects dependencies, single folder with everything
  • Splitting rules with different output directory name-> complicates both graph and code

Is there another way to deal with this today?

I don't see this replace output directory name transition as the naming do not propagate to dependencies or third party rules if you for instance are using different build-settings for different parts of the system and the different parts depend on the same third party codebase (like grpc or llvm) or creates artifacts using third-party rules (like rust or python).

@torgil
Copy link
Contributor Author

torgil commented Nov 17, 2021

I looked little further in the case 2 action conflict and noticed it failed because the compile action(s) for D2.c weren't shareable. The following two patches seems to fix case 2:

  • Make cpp compile actions shareable
  • Revert "Remove outdated "output directory name" option."

With these patchs, aquery says there is one compile action for D2.c and one link action for libD2.a. Building A2 seems to have a glitch that it shows multiple versions of the same file in the command line output:

$ bazel build //:A2
INFO: Analyzed target //:A2 (35 packages loaded, 303 targets configured).
INFO: Found 1 target...
Target //:A2 up-to-date:
  bazel-out/any/bin/libE2_bs1_B.a
  bazel-out/any/bin/libD2.a
  bazel-out/any/bin/libE2_bs1_C.a
  bazel-out/any/bin/libD2.a
INFO: Elapsed time: 0.513s, Critical Path: 0.09s
INFO: 13 processes: 7 internal, 6 linux-sandbox.
INFO: Build completed successfully, 13 total actions

Actions seems shared (building A1 has 13 processes, 8 linux-sandbox).

Patches are here (based on master today):
https://github.com/torgil/bazel/commits/cpp-actions-shareable
git fetch https://github.com/torgil/bazel.git refs/heads/cpp-actions-shareable

@gregestren Do we know why Cpp actions are not shareable?

Except for the glitch, are there any other problems with the above solution for case 2?

@gregestren gregestren added team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug untriaged labels Nov 18, 2021
@gregestren
Copy link
Contributor

C++ actions aren't shareable because of include scanning: finding dependencies by reading #include <foo.h> in source files. That creates a mutable dependency graph: some dependencies are added (or even removed) after action execution gets a better sense of these dependencies.

It's all annoyingly complicated but the main takeaway is it's harder to know at analysis time if actions are really the same when the above means an action may actually change later in the build.

We've talked about ways of loosening this restriction but it's a challenging issue.

@eric-skydio
Copy link

@torgil

I don't see this replace output directory name transition as the naming do not propagate to dependencies or third party rules if you for instance are using different build-settings for different parts of the system and the different parts depend on the same third party codebase (like grpc or llvm) or creates artifacts using third-party rules (like rust or python).

For cases where the new directory name should propogate all the way through the dependency tree (third party code), what are your use cases that aren't solved by #14023 combined with using a transition so you can always depend on the k8-fastbuild version of the third party dependency? We use that approach, and it also avoids running the analysis phase multiple times for those rules. In particular, for third party code we often don't want to unify some pieces of configuration (cpu/os) but do want to unify our starlark configuration.

I need to do more thinking about how to handle third-party rules like rust, maybe rules should have a default attribute to specify config keys to ignore in all actions/files they generate?

@torgil
Copy link
Contributor Author

torgil commented Nov 19, 2021

For cases where the new directory name should propogate all the way through the dependency tree (third party code), what are your use cases that aren't solved by #14023 combined with using a transition so you can always depend on the k8-fastbuild version of the third party dependency?

Yes. This is similar to what we do today but it's non-ideal:

  • Added complexity with extra transitions and attributes separating different kinds of dependencies
  • Rule/transitions for target D above need knowledge about unrelated build-settings
    This could be hard when integrating different build-graphs with a common dependency together, for instance deploying different subsystems to an image or different programs to a filesystem.
  • If target D above is a third-party library that needs custom code to function (for instance custom memory-allocations or modules/plugins in it's design), we're back at the issue in case 3.

We use that approach, and it also avoids running the analysis phase multiple times for those rules. In particular, for third party code we often don't want to unify some pieces of configuration (cpu/os) but do want to unify our starlark configuration.

Good point. In this case Bazel should have the knowledge to drop those build-settings in the third party target configuration without the need to explicitly reset them in a transition since they're not included in the third party code dependency tree.

@torgil
Copy link
Contributor Author

torgil commented Nov 22, 2021

C++ actions aren't shareable because of include scanning: finding dependencies by reading #include <foo.h> in source files. That creates a mutable dependency graph: some dependencies are added (or even removed) after action execution gets a better sense of these dependencies.

Could two actions that has identical action hash after the analyze phase end up as different actions in the execution phase?

It's all annoyingly complicated but the main takeaway is it's harder to know at analysis time if actions are really the same when the above means an action may actually change later in the build.

Is it a problem that two actions that are different in the analyze phase ends up the same after include scanning?

We've talked about ways of loosening this restriction but it's a challenging issue.

I've updated my branch above with a new core option --experimental_shareable_cpp_compile_actions. Would something like this (along with output directory name transition) be a feasible solution?

@katre katre added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed untriaged labels Nov 29, 2021
@katre
Copy link
Member

katre commented Nov 29, 2021

@sdtwigg is working on the overall action conflict issue: let's wait for that to land and then re-address this.

@torgil
Copy link
Contributor Author

torgil commented Dec 7, 2021

Rebased output directory name revert can be found here: https://github.com/torgil/bazel/commits/output-directory-name-option
Both directory name, cpp shareable and a few other (coverage, dep-files) patches here: https://github.com/torgil/bazel/commits/master

Edit: The patch from #13587 is still needed (included in master above) as the fix for #12731 somehow triggered changes in "affected by starlark transition" for different paths in some graphs. These changes ended up in the action-hash through ST-hashes appended to "internal/_middlemen" path-names which resulted in action conflicts.

@torgil
Copy link
Contributor Author

torgil commented Nov 24, 2022

@sdtwigg is working on the overall action conflict issue: let's wait for that to land and then re-address this.

I've now created #16844 on this issue since

Is it possible to solve this within the 6.0 cycle?

Edit: question/suggestion about initial configuration removed, need to rethink that a couple of iterations

@torgil
Copy link
Contributor Author

torgil commented Nov 28, 2022

Made a pull request on the --experimental_shareable_cpp_compile_actions flag (see above): #16863

@torgil
Copy link
Contributor Author

torgil commented Nov 30, 2022

My proposal is to 1. add an API to make this possible again on a per-output-file or per-action basis

@eric-skydio @gregestren What about an option to omit the "output directory name" completely and let rules resolve path conflicts at package level (eg bazel-out/external/... or bazel-out/allconfig/external/... if too many scripts break) ?

@gregestren
Copy link
Contributor

@torgil this is complicated and long-standing because the issue is fundamentally complicated. And it's not a single issue: it's a related collection of issues with varying levels of support and safety challenges. We don't want to make choices that compromise correctness. It's extremely easy to do that with some of these ideas. We're obligated to find a balance that doesn't make it easy for users to run builds with bad results (by "bad", I don't mean action conflicts, I mean builds that say they're successful but produce wrong output).

bazel-out/foo-bar-ST-<hash>/ is annoying. Many people here are discovering all the intricacies of that. But there's a reason, for all its imperfections, that foo-bar-ST-<hash> exists. Changing output directory name or making C++ actions shareable crudely removes basic safety checks for build correctness. I respect your own technical expertise, and that that's an effective solution for your builds for lack of better options. But these aren't safe patterns to promote to less experienced builders.

Maybe we could make shared C++ actions work. But we should really understand why they're not shareable today and how such a flag affects that. We haven't discussed in these threads, for example, how Skyframe (Bazel's core execution engine), might get confused and build the wrong targets because of that. That's the kind of problem we need some answer for to promote a path like that.

I realize this is directly pertinent to you. All theory aside, you want builds that work and are fast. And that's compromised right now. I'm totally on board with trying to help with that. And I'm open to more hacky solutions if they're practically helfpul for lack of better solutions now. But we really have to be cautious, clear, and methodical about exactly what problems we're solving with any solution and what side effects it has.

Examples of related but distinct problems:

  1. too fine path names: the same target in the same configuration produces different output paths because of different transitions
  2. too course path names: the same non-C++ target in different configurations produces action conflicts
  3. too course C++ path names: the same C++ target in different configurations produces action conflicts
  4. unnecessary (untrimmed) configuration: A -> D -> E and B -> D > E. A and B set different transitions that only matter to D. Yet E still builds twice
  5. non-functional actions: A -> D -> E and B -> D > E. A and B set different transitions that only matter to E. D's actions don't care about E's output but Ds actions still build twice
  6. dropped analysis cache: $ bazel build //:foo --define a=1; bazel build /:foo --define=2: second build drops and reconstructs the entire build graph from scratch
  7. large build graphs: All this inefficiency creates larger build graphs which use more memory and take more time
  8. bad remote caching: All this inefficiency slows down remote caching and executors because varying path names cause cache misses on equivalent actions
  9. slow builds: most pertinent. May be caused by the above issues, may be resolvable by better build design, may have completely unrelated causes

We're looking at all of these and making at least some progress. To be fair, I don't think we've communicated that well: it's mostly been ad hoc conversations with whoever happens to be part of those conversations. And there's been lots of behind-the-scenes work that hasn't yet manifested into user-accessible Bazel features.

Here are some examples of progress:

  • too fine path names: --experimental_output_directory_naming_scheme=diff_against_baseline should mostly solve Avoid different transition output directories when build settings have the same values #14023. We're negotiating now about whether we can flip that for Bazel 6.0. Which is just a matter of release policy. The flip itself is technically trivial. This solves 1. for the vast majority of cases.
  • bad remote caching: I've done a lot of behind the scenes work to heavily mitigate this. It's not accessible now because it needs a plugin for the remote executor API. @fmeum is stepping up on that and he and I are going to make this accessible to Bazel. There are caveats which require additional work. But @tjgq and others are on those.
  • unnecessary (untrimmed) configuration: I recently prototyped a way for Bazel to automatically remove flags for targets that don't need them. So changed C++ flags wouldn't affect targets with no C++ in their transitive deps. This comes at its own cost that has to be balanced against expected benefit. We also have --trim_test_configuration from Bazel 5.0.

It's also possible we can think of creative new ways to define transitions that might avoid some of the propagation issues you and others experience.

As stated by @eric-skydio and others upthread, non-C++ actions should be perfectly shareable as long as they're the same actions (I think there's one other class of non-shareable actions, but most are shareable). And for problems like non-functional actions, a better Starlark API for writing actions and containing their inputs could address that. That's likely the only practical approach toward that problem.

We had some great chats at this last BazelCon. As one followup I'd like to promote a clearer common view of the problems and how we imagine addressing them. I think we need a better interface than GitHub issues because they quickly get unfocused and out-of-date. And it's hard to reason precisely about all the challenges without really precise common understandings of the problems.

@gregestren
Copy link
Contributor

Oh, I forgot: 7f51c8b could help with rules that really have no configuration needs. That's limited to solving these problems with more transitions, and would need a Starlark hook to be generally usable.

@eric-skydio
Copy link

@gregestren Great job with the diff_against_baseline work, and sorry for checking out of this conversation. Let me know if you have specific questions or want to brainstorm.

In testing on our repo with the 6.0 release candidates, it initially seems to work really well but exposed one other tiny bug that needs to get fixed for us to see all the benefits (#16911). Without that fixed, I can't find a way to unify the configuration of a generated code file that's depended on by both a executable tool and an output target.

@gregestren
Copy link
Contributor

Thanks for quickly tackling #16911 @fmeum.

@torgil
Copy link
Contributor Author

torgil commented Dec 7, 2022

@gregestren Thank you for your detailed response. I totally respect the complexity regarding these issues (and I have seen most of them), especially for the general case and I appreciate your patience towards a sound solution. I've wrapped my head into these issues numerous times and the problem domain seems to get more complex for each round (probably ~80% of the times I need to fire up jdb on bazel is due to these issues).

Of your enumerated issues I prefer to get action conflicts over silent code duplication (to later detect you have built a random tool multiple times with flags intended for target code testing). It forces you to learn about your build and deal with it accordingly. I understand that this cannot be forced on the "casual user" as a default behavior.

The "allconfig" suggestion above was meant to address the problem with different naming needs for different actions within a single target, not as a configuration transition (which you still need for other reasons, like third-party dependencies).

5 unnecessary (untrimmed) configuration: A -> D -> E and B -> D > E. A and B set different transitions that only matter to D. Yet E still builds twice

If the transitions are affecting a starlark rule build setting both A, B and D (but not E) will have a path to it in the dependency graph. Is it possible to use that information?

too fine path names: --experimental_output_directory_naming_scheme=diff_against_baseline

This is certainly a big step forward and it also removes all "affected by starlark transition" issues. Good job!
I've yet to understand all the changes of the 7.0 branch as I'm going slow, bisecting behavior changes in the build one at a time to understand them in detail.

Oh, I forgot: 7f51c8b could help with rules that really have no configuration needs.

Would this run into the same issues as the "host" config where transitions on dependencies silently gets dropped?

@gregestren
Copy link
Contributor

@gregestren Thank you for your detailed response. I totally respect the complexity regarding these issues (and I have seen most of them), especially for the general case and I appreciate your patience towards a sound solution. I've wrapped my head into these issues numerous times and the problem domain seems to get more complex for each round (probably ~80% of the times I need to fire up jdb on bazel is due to these issues).

Thanks for your patience!

Could something like ctexplain help?

The idea behind that is a tool you could run over any build that reports precise information on where transitions happen and where they're wasteful. Such a tool could help us quickly target a build's exact waste profile which could help us more precisely evaluate and propose fixes. Part of what makes these issues hard, IMO, is the impact of these issues really varies by build. So it's hard to know which approaches are best without concrete data.

I wrote a functional v1 of ctexplain a while back. But it only partially landed in @bazelbuild/bazel (more PRs need merging). I'd love to integrate it more deeply but haven't had the personal bandwidth to push this forward.

5 unnecessary (untrimmed) configuration: A -> D -> E and B -> D > E. A and B set different transitions that only matter to D. Yet E still builds twice

If the transitions are affecting a starlark rule build setting both A, B and D (but not E) will have a path to it in the dependency graph. Is it possible to use that information?

If I'm reading you correctly, that's extremely valuable information. Yes, we can work with that. The challenge is how big that subpath may be. If it's really short (one node to its direct dep), I can imagine some pretty effective solutions. If it's arbitrarily long, we run into the tension between performance (extra computation needed to analyze a big subgraph) vs. maintainability (users can manually tag these edges to avoid that computation, but that can be a big user burden across dependencies, edges, different projects, etc.).

My auto-trimming prototype I mentioned above is an attempt to automatically solve that when the computation cost is acceptable.

Oh, I forgot: 7f51c8b could help with rules that really have no configuration needs.

Would this run into the same issues as the "host" config where transitions on dependencies silently gets dropped?

Yes. The intention of this config is that once you're in it, that's it. It's terminal. No more transitions.

@torgil
Copy link
Contributor Author

torgil commented Dec 19, 2022

Could something like ctexplain help?

Yes. Interesting. I ran the analyze function below on the examples in the description using a Bazel from Dec 8 master branch.

$ action_keys () { bazel aquery --skyframe_state --output=jsonproto  | jq '.actions[].actionKey'; }
$ action_count () { echo "Action count: $(action_keys | wc -l)"; }
$ unique_action_count () { echo "Unique action count: $(action_keys | sort -u | wc -l)"; }
$ analyze () { bazel clean 2>/dev/null; ctexplain -b "$1"; action_count; unique_action_count; }

ctexplain is a wrapper function for ctexplain.py to find it deps. It also needed a few source changes for my python version, eg the usual "make random python script work" stuff.

Output for case 1 (no transition):

$ analyze //:A1
Collecting configured targets for //:A1... done in 0.44 s.

Configurations: 6
Targets: 30
Configured targets: 67 (+123.3% vs. targets)
Targets with multiple configs: 20

Action count: 28
Unique action count: 27

The shared action above is a file write on "Dl.c".

Output for case 2 (output directory name transition):

$ analyze //:A2
Collecting configured targets for //:A2... done in 0.49 s.

Configurations: 8
Targets: 30
Configured targets: 77 (+156.7% vs. targets)
Targets with multiple configs: 20

Action count: 28
Unique action count: 22

If I'm reading you correctly, that's extremely valuable information. Yes, we can work with that.

cquery --show_config_fragments show that //:bs_1 is a configuration for //:E1 and //:E2 while CppConfiguration does not trickle down to it's dependencies (seen if changing rule for E1/E2 to "no_transition").

users can manually tag these edges to avoid that computation

Do you mean adding an edge transition to set the "baseline" value here?

more PRs need merging

Are these PRs available somewhere?

@torgil
Copy link
Contributor Author

torgil commented Jan 2, 2023

@gregestren I pushed some small updates I needed to run ctexplain here: #17109

@gregestren
Copy link
Contributor

@eric-skydio
Copy link

Yes. The intention of this config is that once you're in it, that's it. It's terminal. No more transitions.

Just chiming in quickly to point out that this limitation prevents us from using this to solve our issues with duplicating code generation work. In particular, we have code generation actions that don't depend on build configuration and we'd like to avoid running many copies of. Our current workaround is to manually starlark-transition every config value that gets set anywhere in our build tree to a sane default value for the codegen targets, creating the nocpu-fastbuild configuration that is unified. Obviously it would be nicer to not need to track all that manually.

Unfortunately, since the codegen targets depend on tools that need to be built in exec configuration to work, we do need transitions to keep working at least somewhat underneath that layer.

@gregestren
Copy link
Contributor

@eric-skydio naively I'd say the codegen tools are configuration-dependent, since some of their inputs are configuration-dependent. So what you're saying is that the differences don't matter? i.e. if toolA has two different content hashes, you have a contract guarantee that those differences don't mean toolA will produce different results?

@eric-skydio
Copy link

Sorry, just noticed this. None of the tools nor their outputs depend on the target configuration, but they do depend on their own exec configuration being set correctly. This isn't the exact use case, but as an example, imagine that we have a C++ executable emit_readme that can only build with opt, and a genrule which invokes that executable to create a README, where the contents of the readme are entirely configuration-independent.
To make this work, we put an incoming edge transition on emit_readme that sets the build mode to opt. In a normal build, this would make it impossible for emit_readme to compile in a non-opt configuration. However, if the genrule itself were set to a terminal configuration, that would be infectious to the emit_readme executable, and it would end up getting compiled for a configuration it was never designed to work in.

In general, there's lots of ways that a tool can require transitions to work correctly between it and its dependencies, at minimum any cfg="exec" dependency won't work correctly if all transitions are silent no-ops.

@eric-skydio
Copy link

Or to put it differently

graph LR
	multi[":multi_target"] -- transition --> x86
	multi[":multi_target"] -- transition --> arm
    x86[":target (x86)"] -- srcs --> gen[":generated_code (arch-agnostic)"]
    arm[":target (arm)"] -- srcs --> gen
    gen -- tool --> tool[":tool (x86)"]
	tool -- dep --> lib[":lib (x86)"]
Loading

It's important that :tool and :lib are built in an x86 configuration so that we can actually execute them. In general, they might depend on other scripts that need to be built in their own configurations as well.

Our current solution of explicitly transitioning to "nocpu" for :generated_code handles this fine, and I'm not seeing a compelling reason to switch to something else.

The thing we can't do currently (and this might require a separate issue) is handle cases where some actions in a rule depend on (a piece of) the configuration but others don't. This is particularly common for C++, where you might want to build an object file once then link it against several different versions of its dependencies depending on the build configuration, and there's potentially a lot of speed improvement to be had from deduplicating that compile action. Solving this is a hard problem, unfortunately.

@fmeum
Copy link
Collaborator

fmeum commented Jun 13, 2024

@gregestren Should we consider this completed?

@gregestren
Copy link
Contributor

Sure. This has become a Frankenstein bug: covers lots of interesting themes but with lost focus.

Let's continue open-ended discussion on paths & efficiency on a https://github.com/bazelbuild/bazel/discussions. And keep specific actionables in focused new issues.

Apologies if this comment misses someone's ongoing actionable concern: please remind us.

As an aside, I'd still like to flip --experimental_platform_in_output_dir. That helps with the opposite problem of legitimately shareable actions not sharing because of forked paths. I almost got it flipped in Google a few weeks ago. But we found unexplained extra remote executor load that we need to diagnose before properly landing.

@torgil
Copy link
Contributor Author

torgil commented Oct 30, 2024

This "Frankenstein bug" have come back to haunt me a another lap. Could you point me to relevant discussions?

@gregestren What was the problem with allowing devs to set "output directory name" and let them control their own namespace while protected by action conflicts?

I'm not too excited about the new path-mapping track as it adds complexity, is limited in scope and doesn't address the design issue.

@gregestren
Copy link
Contributor

@gregestren What was the problem with allowing devs to set "output directory name" and let them control their own namespace while protected by action conflicts?

Unless I'm misunderstanding this will increase complexity and risk of action conflicts because it shifts the burden to devs to do it right. How do we ensure two transitions that set distinct configs don't set the same "output directory name" that gets consumed by some common library which then crashes?

@torgil
Copy link
Contributor Author

torgil commented Oct 31, 2024

@gregestren What was the problem with allowing devs to set "output directory name" and let them control their own namespace while protected by action conflicts?

Unless I'm misunderstanding this will increase complexity and risk of action conflicts because it shifts the burden to devs to do it right. How do we ensure two transitions that set distinct configs don't set the same "output directory name" that gets consumed by some common library which then crashes?

Causal users will not use the "output directory name" feature, once you use it is because you have analyzed your build and got into the details, possibly after transitions and system level multi-configuration builds. I see/use action conflicts as a feature, they force you to investigate the "why" and go from there as it has either detected a configuration issue or a namespace issue.

Your example with the "common library" is a perfect (and common) example. If it's not detected by build system tests the "crash" here is an action conflict and that will lead to a resolution according to above, a new build system test and maybe the need for a transition on the edge to the common library.

The opposite side of the coin is to not detect these issues, like configuration leaking down to dependencies it shouldn't or unnecessarily duplicating builds multiple times. This can even go undetected unless you hit a cache or disc space issue. It's non-trivial for a developer in a big project to even know about these consequences when adding a build-setting, transition or command line option.

@gregestren
Copy link
Contributor

I see/use action conflicts as a feature,

I sympathize with this idea: that action conflicts can be a meaningful signal and not an obtuse failure.

I also sympathize that there's a tradeoff: auto-avoiding conflicts has its own efficiency consequences that are hard to trace. Not just efficiency but even correctness if a build just fails because of an OOM.

My biggest concern about this approach is connecting these failures with the people who are best informed and qualified to address them. This can get really hard when building projects with external dependencies that cross team and org boundaries. We've seen this sort of thing a lot at Google: even qualified language maintainers struggled to understand where conflicts happen or what to do about them when random project XYZ fails. Reducing the possibility of action conflicts seems to have helped with that repo.

Of course not all repos are the same. I'm genuinely curious how your perspective differs. And we of course have to be careful setting precedents that would apply to other orgs & repos with their own distributions of expertise.

@torgil
Copy link
Contributor Author

torgil commented Nov 14, 2024

even qualified language maintainers struggled to understand where conflicts happen or what to do about them when random project XYZ fails

I sympathize with this. I've thought long and hard on these issues and put considerable time on tooling and integration support. My experience is that these issues have similar complexity in root cause analysis and resolution as the issues on the "other side of the coin" as described above. Going with- or without- transitions for various things was for me once even an unstable state during a build-system migration. There was no one solution fits all either.

Of course not all repos are the same. I'm genuinely curious how your perspective differs. And we of course have to be careful setting precedents that would apply to other orgs & repos with their own distributions of expertise.

For illustration, let's scale up issue in the problem description with subsystems instead of targets to "bazel build //car" level which I recently saw in a presentation at a fair. The build system details were not presented but lets take it all the way with one build tree for everything. A developer in this scenario is tuning a new ignition control algorithm and wants to change a few lines of code and do:
$ bazel run @car//:test_in_car --@engine//:ignition_control=@ignition//ignition_algorithms:new

With this perspective it's easy to see the consequences if the developer adds a "-c dbg" or "--features=sanitize" on the command line. It will trickle down to everything, including dashboard software and if lucky the build fails on a flash size check somewhere.

If we consider all transitions needed for different hardware- and configuration- setups throughout this system we can also imagine what the ST-hashes in the output directory name is doing for code duplication on a common dependency with only a few configurations of it's own. I've seen tools and common libs being generated and built in multiple ST- and exec- output folders on a much smaller system. What if the same developer above makes a test with a transition that sweeps a build setting value affecting a scheduler parameter N times?

As it might be hard to resolve an action conflict for a linking step in athe ignition control software above it can be equally hard to detect the consequences of an inefficient namespace and even harder to resolve it as it may require patches in the end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Configurability platforms, toolchains, cquery, select(), config transitions type: bug
Projects
None yet
Development

No branches or pull requests

5 participants