
Conversation

@jbdalido

Hi, we found that 961e5c2 caused a significant performance drop (over 10%) in our LLM server on CUDA and ROCm.

Either re-adding pipeline.AddPass<TransposeFolding>(CanFoldTransposeOperandIntoDot); at its previous place (without removing the new one) or reverting to the previous state fixes the issue.

If you need anything from us for testing, please do not hesitate to ping. Thanks!

@google-cla

google-cla bot commented Jan 30, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@pifon2a
Member

pifon2a commented Jan 31, 2025

@metaflow

@metaflow metaflow self-assigned this Jan 31, 2025
@metaflow
Contributor

@jbdalido thank you for the report! The reason for 961e5c2 was that DotDecomposer, AlgebraicSimplifier, and TransposeFolding were fighting with each other in a loop: the HLO never reached a fixed point and instead switched between two states on even/odd cycles. The fact that it got slower in some cases is an indication of how brittle the state was before.

With that:

  1. Could you please provide a repro HLO for your issue?
  2. Please run the compilation (with transpose folding added back) with the --xla_unsupported_crash_on_hlo_pass_fix_max_iteration flag; it should show whether the pass pipeline converges for you.
  3. An alternative fix I had for this issue was to remove DotDecomposer, like so:
--- compiler/xla/service/gpu/gpu_compiler.cc
+++ compiler/xla/service/gpu/gpu_compiler.cc
@@ -790,7 +790,6 @@ absl::Status RunOptimizationPasses(
     pipeline.AddPass<BitcastDtypesExpander>();
     // AlgebraicSimplifier may add contracting dimensions to a dot.
     pipeline.AddPass<DotDimensionSorter>();
-    pipeline.AddPass<DotDecomposer>();
     // Only merge "smallish" dots.  This threshold defaults to 32MB today, with
     // a flag to override.
     // Do not merge dots when they are assigned different stream ids.
@@ -818,6 +817,7 @@ absl::Status RunOptimizationPasses(
     pipeline.AddPass<HloConstantFolding>();
     pipeline.AddPass<ConditionalSimplifier>();
     pipeline.AddPass<RealImagExpander>();
+    pipeline.AddPass<TransposeFolding>(CanFoldTransposeOperandIntoDot);
     pipeline.AddPass<HloCSE>(/*is_layout_sensitive=*/false);
     pipeline.AddPass<HloDCE>();
   }();
@@ -831,7 +831,6 @@ absl::Status RunOptimizationPasses(
     pipeline.AddPass<ConvertMover>();
     pipeline.AddPass<GpuAlgebraicSimplifier>(layout_insensitive_algsimp_opts,
                                              gpu_version);
-    pipeline.AddPass<TransposeFolding>(CanFoldTransposeOperandIntoDot);
   }();
 
   pipeline.AddPass<HloComputationDeduplicator>(

(DotDecomposer is already present earlier in the pipeline.) Could you please try that change (with the flag above as well) and see if the situation improves for you?
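The non-convergence described above can be sketched with a toy fixed-point driver. This is a hypothetical Python sketch, not XLA code; the pass names and the `(ops, changed)` convention are stand-ins. The point is that two passes that undo each other both report a change on every iteration, so a driver that reruns the pipeline until no pass reports a change never terminates:

```python
def fold_transpose(ops):
    """Stand-in for TransposeFolding: fold a transpose into the dot."""
    if ops == ["transpose", "dot"]:
        return ["dot_folded"], True   # (new ops, changed?)
    return ops, False

def decompose_dot(ops):
    """Stand-in for DotDecomposer: expand the folded dot back out."""
    if ops == ["dot_folded"]:
        return ["transpose", "dot"], True
    return ops, False

def run_to_fixed_point(ops, passes, max_iterations=25):
    """Toy fixed-point driver: rerun the pipeline while any pass reports a change."""
    for _ in range(max_iterations):
        any_changed = False
        for run_pass in passes:
            ops, changed = run_pass(ops)
            any_changed = any_changed or changed
        if not any_changed:
            return ops                # fixed point reached
    raise RuntimeError("no fixed point: passes keep undoing each other")

# Each pass alone converges...
assert run_to_fixed_point(["transpose", "dot"], [fold_transpose]) == ["dot_folded"]

# ...but together the two passes oscillate between the two states forever,
# so the driver hits its iteration cap.
try:
    run_to_fixed_point(["transpose", "dot"], [fold_transpose, decompose_dot])
except RuntimeError as e:
    print(e)
```

Note that after a full iteration the IR is back in its starting state, yet both passes reported changes, which is exactly the even/odd-cycle oscillation described above.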

copybara-service bot pushed a commit that referenced this pull request Feb 4, 2025
That seems to be a better approach than moving TransposeFolding to simplification-2 in 961e5c2:

1. There is a report that the previous approach resulted in a perf degradation: #22081

2. I found another case where DotDecomposer competes with algsimp. Added a test for that.

Overall, having a pass that expands operations alongside passes that try to simplify them is strange.

PiperOrigin-RevId: 721736945
copybara-service bot pushed a commit that referenced this pull request Feb 5, 2025
That seems to be a better approach than moving TransposeFolding to simplification-2 in 961e5c2:

1. There is a report that the previous change resulted in a perf degradation: #22081

2. I found another case where DotDecomposer competes with algsimp. Added a test for that.

Overall, having a pass that expands operations alongside passes that try to simplify them asks for such infinite loops.

PiperOrigin-RevId: 721736945
@metaflow
Contributor

metaflow commented Feb 5, 2025

@jbdalido could you please share the HLO that got slower for you? Have you tried the alternative tensorflow/tensorflow#86443?

copybara-service bot pushed a commit that referenced this pull request Feb 5, 2025
That seems to be a better approach than moving TransposeFolding to simplification-2 in 961e5c2:

1. There is a report that the previous change resulted in a perf degradation: #22081

2. I found another case where DotDecomposer competes with algsimp. Added a test for that.

Overall, having a pass that expands operations alongside passes that try to simplify them asks for such infinite loops.

---

For archeologists:

The passes DotDimensionSorter and DotDecomposer were added along with GpuAlgebraicSimplifier because the simplifier could previously add multiple contracting dimensions to a dot, and cuDNN does not support dots with 2+ contracting dimensions, forcing us to use a less efficient loop emitter. That is what the "// AlgebraicSimplifier may add contracting dimensions to a dot." comment was about.

After a while the simplifier started to use supports_non_canonical_dots to guard against this case, so it should be safe to remove the dot decomposer and friends.

PiperOrigin-RevId: 721736945
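For illustration, a dot with multiple contracting dimensions can be canonicalized into an ordinary matmul with a single contracting dimension by collapsing the contracting dimensions, which is conceptually what the dot decomposer does. A NumPy sketch (the shapes are made up for the example):

```python
import numpy as np

# A "dot" contracting over two dimensions: axes (1, 2) of a against
# axes (0, 1) of b.
a = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
b = np.arange(3 * 4 * 5, dtype=np.float64).reshape(3, 4, 5)
multi = np.tensordot(a, b, axes=([1, 2], [0, 1]))      # shape (2, 5)

# Canonical form: collapse the two contracting dims (3 * 4 = 12) into one,
# leaving a plain matmul with a single contracting dimension.
single = a.reshape(2, 12) @ b.reshape(12, 5)

assert np.allclose(multi, single)
```

Backends that only handle single-contracting-dimension dots (as cuDNN did in the case above) can then use their fast path on the canonical form.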
copybara-service bot pushed a commit that referenced this pull request Feb 5, 2025
That seems to be a better approach than moving TransposeFolding to simplification-2 in 961e5c2:

1. There is a report that the previous change resulted in a perf degradation: #22081

2. I found another case where DotDecomposer competes with algsimp. Added a test for that.

Overall, having a pass that expands operations alongside passes that try to simplify them asks for such infinite loops.

---

For archeologists:

The passes DotDimensionSorter and DotDecomposer were added along with GpuAlgebraicSimplifier because the simplifier could previously add multiple contracting dimensions to a dot, and cuDNN does not support dots with 2+ contracting dimensions, forcing us to use a less efficient loop emitter. That is what the "// AlgebraicSimplifier may add contracting dimensions to a dot." comment was about.

After a while the simplifier started to use supports_non_canonical_dots to guard against this case, so it should be safe to remove the dot decomposer and friends.

PiperOrigin-RevId: 723470593
MichaelHudgins pushed a commit to tensorflow/tensorflow that referenced this pull request Feb 5, 2025
723477680  by A. Unique TensorFlower<[email protected]>:

    [XLA] Tag timeout tests as `not_run:arm`

    Similarly to cl/722883015 tagging also:

    * //third_party/tensorflow/compiler/xla/python/transfer:socket_bulk_transport_test
    * //third_party/tensorflow/compiler/xla/python/transfer:socket-server_test
    * //third_party/tensorflow/compiler/xla/python/transfer:event_loop_test

--
723476384  by A. Unique TensorFlower<[email protected]>:

    Parse XLA_FLAGS environment variable every time, conditionally on xla_flags_reset flag.

--
723471749  by A. Unique TensorFlower<[email protected]>:

    [XLA:GPU] Rename `IsSyncCollective` and move to a GPU specific file.

    The implementation is specific to the GPU backend.

--
723470593  by A. Unique TensorFlower<[email protected]>:

    [XLA:GPU] move DotDecompose out of simplification pipeline

    That seems to be a better approach than moving TransposeFolding to simplification-2 in 961e5c25fbd4082a1ac4f2e0865ad28163d12f7d:

    1. There is a report that the previous change resulted in a perf degradation: openxla/xla#22081

    2. I found another case where DotDecomposer competes with algsimp. Added a test for that.

    Overall, having a pass that expands operations alongside passes that try to simplify them asks for such infinite loops.

    ---

    For archeologists:

    The passes DotDimensionSorter and DotDecomposer were added along with GpuAlgebraicSimplifier because the simplifier could previously add multiple contracting dimensions to a dot, and cuDNN does not support dots with 2+ contracting dimensions, forcing us to use a less efficient loop emitter. That is what the "// AlgebraicSimplifier may add contracting dimensions to a dot." comment was about.

    After a while the simplifier started to use supports_non_canonical_dots to guard against this case, so it should be safe to remove the dot decomposer and friends.

--
723469960  by A. Unique TensorFlower<[email protected]>:

    PR #22334: [ROCm] Fix flaky gpu compiler test when building with rocm

    Imported from GitHub PR openxla/xla#22334

    This change fixes the flaky gpu compiler test used to run on rocm CI pipeline gate.
    The Triton pipeline was wrongly using the TritonGPUAccelerateMatmul pass, which supports CUDA only.
    In ROCm there is a different pass, which is now used in the ROCm pipeline.

    https://github.com/triton-lang/triton/blob/main/third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp
    Copybara import of the project:

    --
    c5f600f03aa87d155bb624bedb0584e635af190e by Alexandros Theodoridis <[email protected]>:

    Fix flaky gpu compiler test when building with rocm

    Merging this change closes #22334

--
723453199  by A. Unique TensorFlower<[email protected]>:

    Automated Code Change

--
723445422  by A. Unique TensorFlower<[email protected]>:

    Automated Code Change

--
723443292  by A. Unique TensorFlower<[email protected]>:

    [pjrt] Removed deprecated `PjRtBuffer::CopyToDevice`

--
723434255  by A. Unique TensorFlower<[email protected]>:

    Automated Code Change

--
723430683  by A. Unique TensorFlower<[email protected]>:

    Automated Code Change

--
723426786  by A. Unique TensorFlower<[email protected]>:

    PR #22258: [GPU][NFC] Avoid always printing complete PGLE profiles.

    Imported from GitHub PR openxla/xla#22258

    Copybara import of the project:

    --
    025352635a155e447559d83c471369559aad5981 by Ilia Sergachev <[email protected]>:

    [GPU][NFC] Avoid always printing complete PGLE profiles.

    Merging this change closes #22258

--
723426773  by A. Unique TensorFlower<[email protected]>:

    PR #21375: [ds-fusion] Get While loop analysis with copy fusion

    Imported from GitHub PR openxla/xla#21375

    In later stages of optimization, there are instances of copy fusion on the parameter of the while body. With this, we need to allow inlining of fusions while getting the induction variable index, otherwise we cannot deduce the tuple index.
    Copybara import of the project:

    --
    3147ec926aa1c6fdfa2f4376668434c9a2fbeb87 by Shraiysh Vaishay <[email protected]>:

    [ds-fusion] Get While loop analysis with copy fusion

    In later stages of optimization, there are instances of copy fusion on
    the parameter of the while body. With this, we need to allow inlining of
    fusions while getting the induction variable index, otherwise we cannot
    deduce the tuple index.

    --
    a435fbd2eadc17269d7bccbe141dcf7a21cc20e8 by Shraiysh Vaishay <[email protected]>:

    Relay control dependencies while converting fusion to call (extractor)

    Merging this change closes #21375

--
723425710  by A. Unique TensorFlower<[email protected]>:

    [XLA] Add const reference versions of `ForEachInstructionWithPred` and `ForEachInstructionWithOpcode`.

    These are more permissive and semantically equivalent.

--
723425622  by A. Unique TensorFlower<[email protected]>:

    Remove dead code (NFC)

    We compute the total number of tiles in a variable `num_tiles` but then never
    use it. So remove it.

--
723419822  by A. Unique TensorFlower<[email protected]>:

    Automated Code Change

--
723402058  by A. Unique TensorFlower<[email protected]>:

    compat: Update forward compatibility horizon to 2025-02-05

--
723401869  by A. Unique TensorFlower<[email protected]>:

    Update GraphDef version to 2129.

--
723396271  by A. Unique TensorFlower<[email protected]>:

    [XLA] Support different operand and result types in AlgebraicSimplifierVisitor::HandlePad.

    I checked that none of the other cases in HandlePad require any adjustments.

--
723389764  by A. Unique TensorFlower<[email protected]>:

    Automated Code Change

--
723370461  by A. Unique TensorFlower<[email protected]>:

    Use matchers_oss in vendor code

--
723367856  by A. Unique TensorFlower<[email protected]>:

    Update users of TSL headers and targets to new location in XLA

    Updating:
     - `env.h`
     - `env_time.h`
     - `errors.h`
     - `file_statistics.h`
     - `file_system.h`
     - `file_system_helper.h`
     - `logging.h`
     - `macros.h`
     - `status.h`
     - `status_matchers.h`
     - `status_to_from_proto.h`
     - `statusor.h`
     - `test.h`
     - `test_benchmark.h`
     - `threadpool.h`
     - `threadpool_async_executor.h`
     - `threadpool_interface.h`
     - `threadpool_options.h`
     - `types.h`

    and associated targets.

--
723349025  by A. Unique TensorFlower<[email protected]>:

    Fix inference request analysis aggregated on batch size by aggregating only the requests included in a single batch, as a large request split into multiple batches would introduce confusing results (e.g. the device time would be the sum of the two batches' processing).

--
723344172  by A. Unique TensorFlower<[email protected]>:

    Automated Code Change

--
723340771  by A. Unique TensorFlower<[email protected]>:

    Automated Code Change

--
723337100  by A. Unique TensorFlower<[email protected]>:

    Automated Code Change

--
723321370  by A. Unique TensorFlower<[email protected]>:

    Stop modifying the TraceEventsContainer in DoStoreAsLevelDbTable. This behavior
    is not intuitive (modifying a const value that was passed in) and unnecessary.

--
723307829  by A. Unique TensorFlower<[email protected]>:
    Automated rollback of changelist 723246423.

723278167  by A. Unique TensorFlower<[email protected]>:

    Update users of TSL headers and targets to new location in XLA

    Updating:
     - `env.h`
     - `env_time.h`
     - `errors.h`
     - `file_statistics.h`
     - `file_system.h`
     - `file_system_helper.h`
     - `logging.h`
     - `macros.h`
     - `status.h`
     - `status_matchers.h`
     - `status_to_from_proto.h`
     - `statusor.h`
     - `test.h`
     - `test_benchmark.h`
     - `threadpool.h`
     - `threadpool_async_executor.h`
     - `threadpool_interface.h`
     - `threadpool_options.h`
     - `types.h`

    and associated targets.

--
723265881  by A. Unique TensorFlower<[email protected]>:

    Add the list of Qualcomm SoCs supporting NPU.

--
723248792  by A. Unique TensorFlower<[email protected]>:

    Add Q/DQ annotation lowering support.

    LowerQuantAnnotationsPass now supports quant.quantize and quant.dequantize composite lowering. These patterns make adjustments to the function signatures if necessary.

--
723246423  by A. Unique TensorFlower<[email protected]>:

    PR #85476: Support Qnn Wrappers for LiteRt

    Imported from GitHub PR #85476

    # WHAT
    - Basic wrapper for QNN types, handle dynamic resources along with wrapper instances.
    - Make these wrappers independent to LiteRT/tflite
    - Only depend on QNN and STL

    ### `ScalarParamWrapper`
    - Wrap `Qnn_Param_t` with `QNN_PARAMTYPE_SCALAR` for `paramType`
    - Choose correct `QNN_DATATYPE` based on the data type

    ### `TensorParamWrapper`
    - Wrap `Qnn_Param_t` with `QNN_PARAMTYPE_TENSOR` for `paramType`

    ### `UndefinedQuantizeParamsWrapper`
    - Wrap `Qnn_QuantizeParams_t`
    - Default for quantization parameter

    ### `ScaleOffsetQuantizeParamsWrapper`
    - Wrap `Qnn_QuantizeParams_t` for per-tensor quantization

    ### `AxisScaleOffsetQuantizeParamsWrapper`
    - Wrap `Qnn_QuantizeParams_t`  for per-axis quantization

    ### `TensorWrapper`
    - Wrap `Qnn_TensorType_t`
    - Handle dynamic resource, e.g. name, dimensions, weight data.

    ### `OpWrapper`
    - Wrap `Qnn_OpConfig_t`
    - Handle dynamic resource, e.g. name, input output tensors, params
    Copybara import of the project:

    --
    4833a20 by weilhuan-quic <[email protected]>:

    LiteRt Qualcomm wrappers

    --
    725f571 by weilhuan-quic <[email protected]>:

    TensorWrapper GetDataTypeSize() return bytes instead of bits

    --
    dd3f251 by weilhuan-quic <[email protected]>:

    comment qnn_lib_headers

    --
    06e0616 by weilhuan-quic <[email protected]>:

    Change license

    Merging this change closes #85476

--

PiperOrigin-RevId: 723477680
@metaflow
Contributor

metaflow commented Feb 5, 2025

@jbdalido I have just landed #86443; please tell us what effects you see.

@jbdalido
Author

jbdalido commented Feb 5, 2025

Hey @metaflow, sorry, my kid was sick. Running a quick benchmark now.

@metaflow
Contributor

Closing this PR for now. Please tell us if you still see a performance impact from the pass changes.

@metaflow metaflow closed this Feb 10, 2025