
Running demo's train_locally.py: Failed to create saved model evaluator #148

Open
xuc-X opened this issue Sep 24, 2022 · 12 comments

@xuc-X

xuc-X commented Sep 24, 2022

Hello, I'm running this command

rm -rf $OUTPUT_DIR && \
  PYTHONPATH=$PYTHONPATH:. python3 \
  compiler_opt/rl/train_locally.py \
  --root_dir=$OUTPUT_DIR \
  --data_path=$CORPUS \
  --gin_bindings=clang_path="'$LLVM_INSTALLDIR/bin/clang'" \
  --gin_bindings=llvm_size_path="'$LLVM_INSTALLDIR/bin/llvm-size'" \
  --num_modules=100 \
  --gin_files=compiler_opt/rl/inlining/gin_configs/ppo_nn_agent.gin \
  --gin_bindings=train_eval.warmstart_policy_dir="$WARMSTART_OUTPUT_DIR/saved_policy"

The script told me --num_modules can't be used, so I changed it to --num_workers=100. But then I get the following errors:

2022-09-24 07:12:57.902522: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2078] Estimated count of arithmetic ops: 0.011 M ops, equivalently 0.005 M MACs
I0924 07:12:58.107042 139987454576448 local_data_collector.py:78] Waiting for pending work from last iteration took 0.000004
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpeb9wk1gz/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpsc32ijpx/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmp0xz05pcf/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.

Do you have any idea?

@boomanaiden154
Collaborator

We refactored the --num_modules flag to be set in the gin config for each specific problem, due to it being a pretty critical value for reproducibility in the regalloc case. It looks like I forgot to update the documentation accordingly. You can just omit the flag. I'd recommend not setting the --num_workers flag unless you have a compelling reason to do so; it sets a completely different parameter than what --num_modules used to modify. In regards to the specific error you're seeing, it seems like the script isn't able to pick up the BC model. Did you perform the behavioral cloning step? And if so, what files are present in the directory mentioned by the gin binding flag setting that variable?
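As a sketch of what the refactored invocation could look like: the gin parameter name train_eval.num_modules is taken from the train_eval parameter dump later in this thread, but the exact binding syntax here is an assumption, and the snippet only assembles and prints the command rather than running it:

```shell
# Hypothetical sketch: drop the removed --num_modules flag and supply the
# module count as a gin binding instead (parameter name per the train_eval
# parameter dump in this thread). Prints the command; does not execute it.
CMD="python3 compiler_opt/rl/train_locally.py \
  --root_dir=\$OUTPUT_DIR \
  --data_path=\$CORPUS \
  --gin_files=compiler_opt/rl/inlining/gin_configs/ppo_nn_agent.gin \
  --gin_bindings=train_eval.num_modules=100"
echo "$CMD"
```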

@xc303919323

xc303919323 commented Sep 25, 2022

Is the BC model the LLVM bytecode model?
I ran train_bc.py successfully. The problem is in train_locally.py.
The command output shows the model loads successfully.
This is the full log:

performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
W0925 04:03:23.532205 140184883676992 ppo_agent.py:342] Only tf.keras.optimizers.Optimiers are well supported, got a non-TF2 optimizer: <tensorflow.python.training.adam.AdamOptimizer object at 0x7f7e9dd49460>
I0925 04:03:24.762801 140184883676992 common.py:1009] No checkpoint available at /code/model
I0925 04:03:26.191171 140184883676992 train_locally.py:101] Loading module specs from corpus at /code/corpus.
I0925 04:03:30.300293 140184883676992 train_locally.py:107] Done loading module specs from corpus.
I0925 04:03:30.300908 140184883676992 train_locally.py:133] Loaded Reward Stat Map from disk, containing 0 modules
I0925 04:03:30.514247 140184883676992 train_locally.py:152] Last iteration took: 0.004603
W0925 04:03:32.547599 140184883676992 save.py:271] Found untraced functions such as ActorDistributionNetwork_layer_call_fn, ActorDistributionNetwork_layer_call_and_return_conditional_losses, ConstantValueNetwork_layer_call_fn, ConstantValueNetwork_layer_call_and_return_conditional_losses, EncodingNetwork_layer_call_fn while saving (showing 5 of 92). These functions will not be directly callable after loading.
/root/.local/lib/python3.8/site-packages/tensorflow/python/saved_model/nested_structure_coder.py:521: UserWarning: Encoding a StructuredValue with type tfp.distributions.Deterministic_ACTTypeSpec; loading this StructuredValue will require that this type be imported and registered.
warnings.warn("Encoding a StructuredValue with type %s; loading this "
INFO:tensorflow:Assets written to: /code/model/policy/0/saved_policy/assets
I0925 04:03:33.073540 140184883676992 builder_impl.py:779] Assets written to: /code/model/policy/0/saved_policy/assets
2022-09-25 04:03:34.994831: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:362] Ignored output_format.
2022-09-25 04:03:34.994904: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:365] Ignored drop_control_dependency.
2022-09-25 04:03:34.995828: I tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: /code/model/policy/0/saved_policy
2022-09-25 04:03:35.000722: I tensorflow/cc/saved_model/reader.cc:89] Reading meta graph with tags { serve }
2022-09-25 04:03:35.000781: I tensorflow/cc/saved_model/reader.cc:130] Reading SavedModel debug info (if present) from: /code/model/policy/0/saved_policy
2022-09-25 04:03:35.017182: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:365] MLIR V1 optimization pass is not enabled
2022-09-25 04:03:35.023192: I tensorflow/cc/saved_model/loader.cc:229] Restoring SavedModel bundle.
2022-09-25 04:03:35.092413: I tensorflow/cc/saved_model/loader.cc:213] Running initialization op on SavedModel bundle at path: /code/model/policy/0/saved_policy
2022-09-25 04:03:35.147566: I tensorflow/cc/saved_model/loader.cc:305] SavedModel load for tags { serve }; Status: success: OK. Took 151744 microseconds.
2022-09-25 04:03:35.242257: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2022-09-25 04:03:35.444218: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2078] Estimated count of arithmetic ops: 0.011 M ops, equivalently 0.005 M MACs
W0925 04:03:37.566624 140184883676992 save.py:271] Found untraced functions such as ActorDistributionNetwork_layer_call_fn, ActorDistributionNetwork_layer_call_and_return_conditional_losses, ConstantValueNetwork_layer_call_fn, ConstantValueNetwork_layer_call_and_return_conditional_losses, EncodingNetwork_layer_call_fn while saving (showing 5 of 92). These functions will not be directly callable after loading.
/root/.local/lib/python3.8/site-packages/tensorflow/python/saved_model/nested_structure_coder.py:521: UserWarning: Encoding a StructuredValue with type tfp.distributions.Categorical_ACTTypeSpec; loading this StructuredValue will require that this type be imported and registered.
warnings.warn("Encoding a StructuredValue with type %s; loading this "
INFO:tensorflow:Assets written to: /code/model/policy/0/saved_collect_policy/assets
I0925 04:03:38.054838 140184883676992 builder_impl.py:779] Assets written to: /code/model/policy/0/saved_collect_policy/assets
2022-09-25 04:03:40.066622: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:362] Ignored output_format.
2022-09-25 04:03:40.066686: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:365] Ignored drop_control_dependency.
2022-09-25 04:03:40.066882: I tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: /code/model/policy/0/saved_collect_policy
2022-09-25 04:03:40.071930: I tensorflow/cc/saved_model/reader.cc:89] Reading meta graph with tags { serve }
2022-09-25 04:03:40.071989: I tensorflow/cc/saved_model/reader.cc:130] Reading SavedModel debug info (if present) from: /code/model/policy/0/saved_collect_policy
2022-09-25 04:03:40.093924: I tensorflow/cc/saved_model/loader.cc:229] Restoring SavedModel bundle.
2022-09-25 04:03:40.173268: I tensorflow/cc/saved_model/loader.cc:213] Running initialization op on SavedModel bundle at path: /code/model/policy/0/saved_collect_policy
2022-09-25 04:03:40.228462: I tensorflow/cc/saved_model/loader.cc:305] SavedModel load for tags { serve }; Status: success: OK. Took 161578 microseconds.
2022-09-25 04:03:40.557391: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2078] Estimated count of arithmetic ops: 0.011 M ops, equivalently 0.005 M MACs
I0925 04:03:40.805665 140184883676992 local_data_collector.py:78] Waiting for pending work from last iteration took 0.000003
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpgjfr19pm/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpw8v7dxdu/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmp_qg7173y/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmps93tsj2r/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpndhn0nu2/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpne14xdzf/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmp5doda41v/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
3 errors generated.
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpupt7jlc5/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpl5mm4t7i/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.

So I tried printing the command line in inline_runner.py with the following code:

try:
  command_line = []
  if self._launcher_path:
    command_line.append(self._launcher_path)
  command_line.extend([self._clang_path] + list(module_spec.exec_cmd) + [
      '-mllvm', '-enable-ml-inliner=development', '-mllvm',
      '-training-log=' + log_path, '-o', output_native_path
  ])
  if tf_policy_path:
    command_line.extend(
        ['-mllvm', '-ml-inliner-model-under-training=' + tf_policy_path])
  # Debug print of the clang invocation before it is launched.
  print("command_line1\n", command_line)
  compilation_runner.start_cancellable_process(command_line,
                                               self._compilation_timeout,
                                               self._cancellation_manager)
  # Debug print of the llvm-size invocation.
  command_line = [self.llvm_size_path, output_native_path]
  print("command_line2\n", command_line)

I then ran the printed command, which looks like this:

'/code/llvm-install/bin/clang' '-cc1' '-triple' 'x86_64-unknown-fuchsia' '-emit-obj' '-massembler-fatal-warnings' '--mrelax-relocations' '-disable-free' '-clear-ast-before-backend' '-disable-llvm-verifier' '-discard-value-names' '-main-file-name' 'block-device-manager.cc' '-mrelocation-model' 'pic' '-pic-level' '2' '-pic-is-pie' '-mframe-pointer=all' '-ffp-contract=off' '-fno-rounding-math' '-mconstructor-aliases' '-funwind-tables=2' '-target-cpu' 'x86-64-v2' '-mllvm' '-x86-branches-within-32B-boundaries' '-tune-cpu' 'generic' '-mllvm' '-treat-scalable-fixed-error-as-warning' '-debug-info-kind=constructor' '-dwarf-version=5' '-debugger-tuning=gdb' '-mllvm' '-crash-diagnostics-dir=clang-crashreports' '-ffunction-sections' '-fdata-sections' '-fcoverage-compilation-dir=.' '-resource-dir' '../../../llvm-install/lib/clang/15.0.1' '-dependency-file' 'obj/src/storage/fshost/block-watcher.block-device-manager.cc.o.d' '-MT' 'obj/src/storage/fshost/block-watcher.block-device-manager.cc.o' '-sys-header-deps' '-D' '_LIBCPP_DISABLE_VISIBILITY_ANNOTATIONS' '-D' '_LIBCPP_REMOVE_TRANSITIVE_INCLUDES' '-D' '_LIBCPP_ENABLE_THREAD_SAFETY_ANNOTATIONS=1' '-D' 'ZX_ASSERT_LEVEL=2' '-D' 'ALL_SOURCE' '-D' 'FIDL_TRACE_LEVEL=0' '-I' '../..'
'-I' 'gen' '-I' 'obj' '-I' '../../sdk' '-I' 'gen/sdk' '-I' 'fidling/gen/sdk/fidl/fuchsia.inspect/fuchsia.inspect/hlcpp' '-I' '../../sdk/lib/fidl_base/include' '-I' 'gen/include' '-I' '../../src/zircon/lib/zircon/include' '-I' 'fidling/gen/sdk/fidl/fuchsia.mem/fuchsia.mem/hlcpp' '-I' '../../sdk/lib/fit/include' '-I' '../../sdk/lib/stdcompat/include' '-I' '../../sdk/lib/fit-promise/include' '-I' '../../sdk/lib/fidl/include' '-I' '../../zircon/system/ulib/zx/include' '-I' '../../zircon/system/ulib/async/include' '-I' '../../zircon/system/ulib/async-default/include' '-I' '../../zircon/system/ulib/inspect/include' '-I' 'fidling/gen/sdk/fidl/fuchsia.io/fuchsia.io/hlcpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.unknown/fuchsia.unknown/hlcpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.sys/fuchsia.sys/hlcpp' '-I' '../../sdk/lib/fdio/include' '-I' 'fidling/gen/sdk/fidl/fuchsia.boot/fuchsia.boot/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.io/fuchsia.io/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.unknown/fuchsia.unknown/cpp' '-I' '../../sdk/lib/fidl/cpp/wire/include' '-I' '../../zircon/system/ulib/zxc/include' '-I' '../../zircon/system/ulib/sync/include' '-I' '../../zircon/system/ulib/fbl/include' '-I' '../../zircon/system/ulib/fzl/include' '-I' 'fidling/gen/sdk/fidl/fuchsia.hardware.block.volume/fuchsia.hardware.block.volume/c' '-I' 'fidling/gen/sdk/fidl/fuchsia.hardware.block/fuchsia.hardware.block/c' '-I' 'fidling/gen/sdk/fidl/fuchsia.io/fuchsia.io/c' '-I' 'fidling/gen/sdk/fidl/fuchsia.unknown/fuchsia.unknown/c' '-I' 'fidling/gen/zircon/vdso/zx/zx/c' '-I' 'fidling/gen/sdk/fidl/fuchsia.storage.metrics/fuchsia.storage.metrics/c' '-I' 'fidling/gen/sdk/fidl/fuchsia.hardware.block.partition/fuchsia.hardware.block.partition/c' '-I' 'fidling/gen/sdk/fidl/fuchsia.device/fuchsia.device/c' '-I' 'fidling/gen/sdk/fidl/fuchsia.hardware.block.volume/fuchsia.hardware.block.volume/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.hardware.block/fuchsia.hardware.block/cpp' '-I' '../../src/lib/fidl/cpp/include' '-I' 
'x64-shared/gen/sdk' '-I' 'fidling/gen/sdk/fidl/fuchsia.storage.metrics/fuchsia.storage.metrics/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.hardware.block.partition/fuchsia.hardware.block.partition/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.device/fuchsia.device/cpp' '-I' 'fidling/gen/src/storage/fidl/fuchsia.fs.startup/fuchsia.fs.startup/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.fs/fuchsia.fs/cpp' '-I' '../../zircon/system/ulib/fidl-async/include' '-I' '../../zircon/system/ulib/trace/include' '-I' '../../zircon/system/ulib/trace-engine/include' '-I' 'fidling/gen/sdk/fidl/fuchsia.feedback/fuchsia.feedback/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.math/fuchsia.math/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.mem/fuchsia.mem/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.fshost/fuchsia.fshost/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.process.lifecycle/fuchsia.process.lifecycle/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.ldsvc/fuchsia.ldsvc/cpp' '-I' 'fidling/gen/src/storage/fxfs/fuchsia.fxfs/cpp' '-I' '../../zircon/system/ulib/async-loop/include' '-I' '../../zircon/system/ulib/fdio-caller/include' '-I' '../../zircon/system/ulib/service/include' '-I' 'fidling/gen/sdk/fidl/fuchsia.fs/fuchsia.fs/hlcpp' '-I' '../../zircon/system/public' '-I' '../../zircon/system/ulib/storage/buffer/include' '-I' '../../zircon/system/ulib/storage/operation/include' '-I' '../../src/lib/storage/block_client/cpp/include' '-I' '../../zircon/system/ulib/range/include' '-I' '../../zircon/system/ulib/storage-metrics/include' '-I' '../../src/storage/lib/disk_inspector/include' '-I' '../../src/storage/lib/watchdog/include' '-I' '../../zircon/system/ulib/syslog/include' '-I' '../../zircon/system/ulib/bitmap/include' '-I' '../../zircon/system/ulib/id_allocator/include' '-I' '../../zircon/third_party/ulib/safemath/include' '-I' 'fidling/gen/src/storage/blobfs/fuchsia.blobfs.internal/hlcpp' '-I' 'fidling/gen/src/storage/blobfs/fuchsia.blobfs.internal/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.blobfs/fuchsia.blobfs/cpp' '-I' 
'fidling/gen/sdk/fidl/fuchsia.device.manager/fuchsia.device.manager/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.driver.framework/fuchsia.driver.framework/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.component/fuchsia.component/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.component.decl/fuchsia.component.decl/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.data/fuchsia.data/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.url/fuchsia.url/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.process/fuchsia.process/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.component.runner/fuchsia.component.runner/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.diagnostics.types/fuchsia.diagnostics.types/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.driver.host/fuchsia.driver.host/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.hardware.power.statecontrol/fuchsia.hardware.power.statecontrol/cpp' '-I' 'fidling/gen/src/sys/pkg/fidl/fuchsia.update.verify/fuchsia.update.verify/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.hardware.block.encrypted/fuchsia.hardware.block.encrypted/cpp' '-I' 'fidling/gen/sdk/fidl/fuchsia.hardware.block.verified/fuchsia.hardware.block.verified/cpp' '-I' '../../src/lib/storage/ramdevice_client/cpp/include' '-I' 'fidling/gen/sdk/fidl/fuchsia.hardware.nand/fuchsia.hardware.nand/c' '-I' '../../src/storage/gpt/include' '-I' '../../zircon/system/ulib/zircon-internal/include' '-I' '../../zircon/system/ulib/explicit-memory/include' '-D' 'FIDL_ALLOW_DEPRECATED_C_BINDINGS' '-D' 'FIDL_ALLOW_DEPRECATED_C_BINDINGS' '-isysroot' 'gen/zircon/public/sysroot/cpp' '-internal-isystem' '../../../llvm-install/bin/../include/x86_64-unknown-fuchsia/c++/v1' '-internal-isystem' '../../../llvm-install/bin/../include/c++/v1' '-internal-isystem' '../../../llvm-install/lib/clang/15.0.1/include' '-internal-externc-isystem' 'gen/zircon/public/sysroot/cpp/include' '-Os' '-ffuchsia-api-level=4294967295' '-std=c++17' '-fdeprecated-macro' '-fdebug-compilation-dir=.' 
'-ferror-limit' '19' '-fvisibility' 'hidden' '-fvisibility-inlines-hidden' '-fsanitize=safe-stack' '-stack-protector' '2' '-ftrivial-auto-var-init=pattern' '-fno-rtti' '-fgnuc-version=4.2.1' '-fcolor-diagnostics' '-vectorize-loops' '-vectorize-slp' '-fembed-bitcode=all' '-debug-info-kind=constructor' '-faddrsig' '-D' '__GCC_HAVE_DWARF2_CFI_ASM=1' '' '-x' 'ir' '/code/corpus/obj/src/storage/fshost/block-watcher.block-device-manager.cc.o.bc' '-mllvm' '-enable-ml-inliner=development' '-mllvm' '-training-log=/tmp/tmp6dd7o0lh/log' '-o' '/tmp/test.aa'

I get the error:

fatal error: error in backend: IO failure on output stream: Bad file descriptor

But when I delete '-mllvm' '-enable-ml-inliner=development' '-mllvm' '-training-log=/tmp/tmp6dd7o0lh/log' '-o' '/tmp/test.aa', the command runs successfully.

I'm using LLVM 15; this is the commit:

commit b73d2c8c720a8c8e6e73b11be4e27afa6cb75bdf (HEAD -> release/15.x, tag: llvmorg-15.0.1, origin/release/15.x)
Author: Florian Hahn [email protected]
Date: Mon Sep 19 18:14:34 2022 +0100

[LV] Keep track of cost-based ScalarAfterVec in VPWidenPointerInd.

Epilogue vectorization uses isScalarAfterVectorization to check if
widened versions for inductions need to be generated and bails out in
those cases.

At the moment, there are scenarios where isScalarAfterVectorization
returns true but VPWidenPointerInduction::onlyScalarsGenerated would
return false, causing widening.

This can lead to widened phis with incorrect start values being created
in the epilogue vector body.

This patch addresses the issue by storing the cost-model decision in
VPWidenPointerInductionRecipe and restoring the behavior before 151c144.
This effectively reverts 151c144, but the long-term fix is to properly
support widened inductions during epilogue vectorization

Fixes #57712.

@mtrofin
Collaborator

mtrofin commented Sep 26, 2022

The reason you get Bad file descriptor when trying to debug is that /tmp/tmp6dd7o0lh/log doesn't exist; more specifically, the first part of the path, /tmp/tmp6dd7o0lh, was a temporary directory created from Python (via tempfile) and has since been deleted. Try pointing -training-log to output somewhere else, like /tmp/this_is_the.log, i.e. under an existing dir.
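A minimal Python sketch of that failure mode (names are illustrative, not from the repo): a directory made with tempfile.TemporaryDirectory is deleted when its context exits, so any later attempt to write a file under it fails, just like clang's -training-log did here.

```python
import os
import tempfile

# Create a temporary directory the way the training tooling does.
with tempfile.TemporaryDirectory() as tmpdir:
    log_path = os.path.join(tmpdir, "log")
    assert os.path.isdir(tmpdir)  # exists while the context is open

# After the context exits, the directory is gone, so writing to a
# path underneath it fails.
print(os.path.isdir(tmpdir))  # False: the directory was deleted
try:
    open(log_path, "w")
except FileNotFoundError as e:
    print("cannot write:", e.strerror)
```

Pointing the log at a path under an existing directory (e.g. /tmp/this_is_the.log) sidesteps the problem when reproducing the command by hand.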

Now for the first part: that seems to be about the model passed to clang during training being invalid. I'm assuming you're at or near HEAD of this (ml-compiler-opt) repo. Under your $OUTPUT_DIR, do you see a bunch of saved model directories? You should see a policy dir, under which you should see a bunch of numbered dirs. Pick one of the latter; under it you should see a saved_policy and a saved_collect_policy. What do you see under them?

@xuc-X
Author

xuc-X commented Sep 26, 2022

Yes, you are right. For the Bad file descriptor problem, I used /tmp/test.log and the command ran successfully.
This is my $OUTPUT_DIR:
[screenshot]
This is my policy dir:
[screenshot]
How can I debug the Python or C++ code to test the trained model?

@mtrofin
Collaborator

mtrofin commented Sep 26, 2022

What happens if you use the same command line that works, and add -mllvm -ml-inliner-model-under-training=/code/model/policy/0/saved_collect_policy?

@xuc-X
Author

xuc-X commented Sep 26, 2022

It shows Status: success!

2022-09-26 16:50:22.840280: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /code/model/policy/0/saved_collect_policy
2022-09-26 16:50:22.847266: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2022-09-26 16:50:22.860368: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-09-26 16:50:22.870020: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3700070000 Hz
2022-09-26 16:50:22.872757: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555d0f8a0370 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-09-26 16:50:22.872793: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2022-09-26 16:50:22.906265: I tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2022-09-26 16:50:22.957064: I tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: /code/model/policy/0/saved_collect_policy
2022-09-26 16:50:22.996176: I tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 155897 microseconds.

@mtrofin
Collaborator

mtrofin commented Sep 26, 2022

OK, and it compiles, I assume. Hmm. Ah, I see what happens. You're building the compiler with the TensorFlow C APIs, not tflite, right? We haven't updated the documentation yet, but here's how to build with tflite:

1. Make a directory somewhere, e.g. /tmp/tflitebuild && cd /tmp/tflitebuild.
2. Run buildbot/build_tflite.sh (this takes a bit - it git clones a bunch of repos and builds them).
3. Notice that a /tmp/tflitebuild/tflite.cmake was created.
4. For your cmake invocation (best to wipe out the build dir and re-issue cmake): instead of passing -DTENSORFLOW_C_LIB_PATH, pass -C /tmp/tflitebuild/tflite.cmake.
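Step 4 could be sketched roughly as below; the project list and source path are illustrative assumptions (adapt them to your existing cmake invocation), and the snippet only prints the command so the flags can be checked before running:

```shell
# Sketch of step 4: point cmake at the generated tflite.cmake cache file
# instead of passing -DTENSORFLOW_C_LIB_PATH. Prints the command only;
# the generator, project list, and source path are placeholders.
TFLITE_CACHE=/tmp/tflitebuild/tflite.cmake
CMAKE_CMD="cmake -G Ninja -C ${TFLITE_CACHE} -DLLVM_ENABLE_PROJECTS=clang ../llvm"
echo "${CMAKE_CMD}"
```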

That's it!

@mtrofin
Collaborator

mtrofin commented Sep 26, 2022

I've now updated the demo - @boomanaiden154 had a PR open (#131) for a while and we forgot to merge it. Sorry.

@xuc-X
Author

xuc-X commented Sep 26, 2022

Thanks, I tried it. I recompiled my LLVM project and the Fuchsia project, but the problem still happens. Is there something I can check or debug in the code? I have no idea.

Command is:

rm -rf $OUTPUT_DIR && \
  PYTHONPATH=$PYTHONPATH:. python3 \
  compiler_opt/rl/train_locally.py \
  --root_dir=$OUTPUT_DIR \
  --data_path=$CORPUS \
  --gin_bindings=clang_path="'$LLVM_INSTALLDIR/bin/clang'" \
  --gin_bindings=llvm_size_path="'$LLVM_INSTALLDIR/bin/llvm-size'" \
  --gin_files=compiler_opt/rl/inlining/gin_configs/ppo_nn_agent.gin \
  --gin_bindings=train_eval.warmstart_policy_dir="$WARMSTART_OUTPUT_DIR/saved_policy"

Log is:

Parameters for train_eval:

==============================================================================

train_eval.agent_name = %compiler_opt.rl.constant.AgentName.PPO
train_eval.batch_size = 256
train_eval.deploy_policy_name = 'saved_collect_policy'
train_eval.moving_average_decay_rate = 0.8
train_eval.num_iterations = 300
train_eval.num_modules = 100
train_eval.num_policy_iterations = 3000
train_eval.train_sequence_length = 16
train_eval.use_random_network_distillation = False
train_eval.warmstart_policy_dir = '/code/warmstart/saved_policy'

2022-09-26 17:53:44.895495: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-26 17:53:45.034631: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.034828: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.035015: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.035185: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.035344: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.035499: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.625412: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.625633: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.625834: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.626000: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.626171: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.626332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10082 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:24:00.0, compute capability: 8.6
2022-09-26 17:53:45.626555: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-09-26 17:53:45.626697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 10188 MB memory: -> device: 1, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:2d:00.0, compute capability: 8.6
2022-09-26 17:53:46.251521: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:629] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
W0926 17:53:46.293635 140348014933824 ppo_agent.py:342] Only tf.keras.optimizers.Optimiers are well supported, got a non-TF2 optimizer: <tensorflow.python.training.adam.AdamOptimizer object at 0x7fa49a380970>
I0926 17:53:46.903522 140348014933824 common.py:1009] No checkpoint available at /code/model
I0926 17:53:47.646316 140348014933824 train_locally.py:101] Loading module specs from corpus at /code/corpus.
I0926 17:53:51.522883 140348014933824 train_locally.py:107] Done loading module specs from corpus.
I0926 17:53:52.110074 140348014933824 local_data_collector.py:73] prefetching took 0
I0926 17:53:52.122872 140348014933824 train_locally.py:152] Last iteration took: 0.012367
W0926 17:53:53.189572 140348014933824 save.py:271] Found untraced functions such as ActorDistributionNetwork_layer_call_fn, ActorDistributionNetwork_layer_call_and_return_conditional_losses, ConstantValueNetwork_layer_call_fn, ConstantValueNetwork_layer_call_and_return_conditional_losses, EncodingNetwork_layer_call_fn while saving (showing 5 of 92). These functions will not be directly callable after loading.
/root/.local/lib/python3.8/site-packages/tensorflow/python/saved_model/nested_structure_coder.py:521: UserWarning: Encoding a StructuredValue with type tfp.distributions.Deterministic_ACTTypeSpec; loading this StructuredValue will require that this type be imported and registered.
warnings.warn("Encoding a StructuredValue with type %s; loading this "
INFO:tensorflow:Assets written to: /code/model/policy/0/saved_policy/assets
I0926 17:53:53.458599 140348014933824 builder_impl.py:779] Assets written to: /code/model/policy/0/saved_policy/assets
2022-09-26 17:53:54.306021: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:362] Ignored output_format.
2022-09-26 17:53:54.306056: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:365] Ignored drop_control_dependency.
2022-09-26 17:53:54.306634: I tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: /code/model/policy/0/saved_policy
2022-09-26 17:53:54.308837: I tensorflow/cc/saved_model/reader.cc:89] Reading meta graph with tags { serve }
2022-09-26 17:53:54.308854: I tensorflow/cc/saved_model/reader.cc:130] Reading SavedModel debug info (if present) from: /code/model/policy/0/saved_policy
2022-09-26 17:53:54.314542: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:365] MLIR V1 optimization pass is not enabled
2022-09-26 17:53:54.315878: I tensorflow/cc/saved_model/loader.cc:229] Restoring SavedModel bundle.
2022-09-26 17:53:54.345653: I tensorflow/cc/saved_model/loader.cc:213] Running initialization op on SavedModel bundle at path: /code/model/policy/0/saved_policy
2022-09-26 17:53:54.365923: I tensorflow/cc/saved_model/loader.cc:305] SavedModel load for tags { serve }; Status: success: OK. Took 59290 microseconds.
2022-09-26 17:53:54.404422: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2022-09-26 17:53:54.513431: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2078] Estimated count of arithmetic ops: 0.011 M ops, equivalently 0.005 M MACs
W0926 17:53:55.616633 140348014933824 save.py:271] Found untraced functions such as ActorDistributionNetwork_layer_call_fn, ActorDistributionNetwork_layer_call_and_return_conditional_losses, ConstantValueNetwork_layer_call_fn, ConstantValueNetwork_layer_call_and_return_conditional_losses, EncodingNetwork_layer_call_fn while saving (showing 5 of 92). These functions will not be directly callable after loading.
/root/.local/lib/python3.8/site-packages/tensorflow/python/saved_model/nested_structure_coder.py:521: UserWarning: Encoding a StructuredValue with type tfp.distributions.Categorical_ACTTypeSpec; loading this StructuredValue will require that this type be imported and registered.
warnings.warn("Encoding a StructuredValue with type %s; loading this "
INFO:tensorflow:Assets written to: /code/model/policy/0/saved_collect_policy/assets
I0926 17:53:55.860256 140348014933824 builder_impl.py:779] Assets written to: /code/model/policy/0/saved_collect_policy/assets
2022-09-26 17:53:56.730252: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:362] Ignored output_format.
2022-09-26 17:53:56.730288: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:365] Ignored drop_control_dependency.
2022-09-26 17:53:56.730407: I tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: /code/model/policy/0/saved_collect_policy
2022-09-26 17:53:56.732630: I tensorflow/cc/saved_model/reader.cc:89] Reading meta graph with tags { serve }
2022-09-26 17:53:56.732646: I tensorflow/cc/saved_model/reader.cc:130] Reading SavedModel debug info (if present) from: /code/model/policy/0/saved_collect_policy
2022-09-26 17:53:56.737559: I tensorflow/cc/saved_model/loader.cc:229] Restoring SavedModel bundle.
2022-09-26 17:53:56.766411: I tensorflow/cc/saved_model/loader.cc:213] Running initialization op on SavedModel bundle at path: /code/model/policy/0/saved_collect_policy
2022-09-26 17:53:56.786771: I tensorflow/cc/saved_model/loader.cc:305] SavedModel load for tags { serve }; Status: success: OK. Took 56365 microseconds.
2022-09-26 17:53:56.948630: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2078] Estimated count of arithmetic ops: 0.011 M ops, equivalently 0.005 M MACs
I0926 17:53:57.091879 140348014933824 local_data_collector.py:134] resolving prefetched sample took: 0 seconds
I0926 17:53:57.092738 140348014933824 local_data_collector.py:73] prefetching took 0
I0926 17:53:57.092979 140348014933824 local_data_collector.py:91] Waiting for pending work from last iteration took 0.000001
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpbxfbj34s/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpx5fxbs2c/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpm2gq9m6x/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpa_m6lua9/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpiec123gk/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpjfas9ul9/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpx83gfm17/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
3 errors generated.
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpd6fx2uw3/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmpqer841ol/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
Could not find SavedModel .pb or .pbtxt at supplied export directory path: /tmp/tmp56v46hye/policyCould not find TF_Output named: StatefulPartitionedCallerror: Failed to create saved model evaluator
error: Could not load or create model evaluator.
error: Could not setup Inlining Advisor for the requested mode and/or options
3 errors generated.
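
The error lines above say clang could not find a `saved_model.pb` (or `.pbtxt`) in the policy directory it was handed. A quick diagnostic sketch to check whether an exported policy actually contains one — `POLICY_DIR` is a placeholder; point it at the directory from your own log:

```shell
#!/bin/sh
# Check that a TF SavedModel directory has the file clang's evaluator needs.
# POLICY_DIR is a placeholder -- substitute the path from your log output.
POLICY_DIR="${POLICY_DIR:-/code/model/policy/0/saved_policy}"
if [ -f "$POLICY_DIR/saved_model.pb" ] || [ -f "$POLICY_DIR/saved_model.pbtxt" ]; then
  echo "SavedModel found in $POLICY_DIR"
else
  echo "No saved_model.pb/.pbtxt in $POLICY_DIR -- clang cannot create the evaluator"
fi
```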

xuc-X commented Sep 26, 2022

I did not delete my llvm-project build directory; I just re-ran the cmake command directly and rebuilt with ninja. Maybe that caused the problem. I will delete it and try again.

mtrofin commented Sep 26, 2022

You may need to delete the build directory, re-create it, and re-run the correct (new) cmake command. After that, and after rebuilding clang, try the clang invocation we tested in isolation (the one that included the path to the training model).
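
A clean-rebuild sketch of the steps above — all paths are placeholders, and the cmake flags follow the ml-compiler-opt demo setup (your configuration may differ):

```shell
# 1. Delete and re-create the build directory, then re-run cmake from scratch.
#    $LLVM_SRCDIR, $LLVM_BUILDDIR, $TENSORFLOW_C_LIB_PATH and
#    $WARMSTART_OUTPUT_DIR are placeholders for your own paths.
rm -rf "$LLVM_BUILDDIR" && mkdir -p "$LLVM_BUILDDIR" && cd "$LLVM_BUILDDIR"
cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS=clang \
  -DTENSORFLOW_C_LIB_PATH="$TENSORFLOW_C_LIB_PATH" \
  "$LLVM_SRCDIR/llvm"

# 2. Rebuild clang.
ninja clang

# 3. Re-try the isolated clang invocation with the training policy,
#    before running train_locally.py again.
"$LLVM_BUILDDIR/bin/clang" -O2 -c test.c \
  -mllvm -enable-ml-inliner=development \
  -mllvm -ml-inliner-model-under-training="$WARMSTART_OUTPUT_DIR/saved_policy"
```

If step 3 reproduces the "Failed to create saved model evaluator" error on its own, the problem is in the exported policy or the clang build rather than in the training script.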

@xc303919323

OK, thanks!
