
param_variational_noise causes recursion limit error in TF backend #1616

Closed
mmz33 opened this issue Sep 5, 2024 · 10 comments · Fixed by #1619 or #1620

mmz33 (Member) commented Sep 5, 2024

I set param_variational_noise for almost all layers in the encoder. The network I am using has only 4 encoder layers, and I am getting this Python exception:

Exception RecursionError('maximum recursion depth exceeded while calling a Python object') in step 0.

Increasing the stack limit does not fix the issue, because it seems there is an infinite loop in the gradient-checkpointing logic, as you can see in the log here: https://gist.github.com/mmz33/547a099d050983ab71c8fc7d5ca87c62
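
(By "increasing the stack limit" I mean something like this at the top of the config; the value is just an example, and it only delays the crash here, it does not fix the loop:)

import sys

sys.setrecursionlimit(100000)  # example value only; does not fix the underlying loop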

Here is the last gradient-checkpoint call before crashing. It mentions op .../Switch_994, so something is wrong.

........

  File "/nas/models/asr/mzeineldeen/setups/spanish/2023-10-20--att-i6/returnn/returnn/tf/util/gradient_checkpoint.py", line 116, in prepare_gradient_checkpointing.<locals>._set_wrapped_grad_func.<locals>._WrappedOp.__init__
    line: self._inputs = tuple(_map_tensor(x) for x in op.inputs)
    locals:
      self = <local> <returnn.tf.util.gradient_checkpoint.prepare_gradient_checkpointing.<locals>._set_wrapped_grad_func.<locals>._WrappedOp object at 0x7fab2a857b20>
      self._inputs = <local> !AttributeError: '_WrappedOp' object has no attribute '_inputs'
      tuple = <builtin> <class 'tuple'>
      _map_tensor = <local> <function prepare_gradient_checkpointing.<locals>._map_tensor at 0x7fab258c3370>
      x = <not found>
      op = <local> <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch_994' type=Switch>
      op.inputs = <local> (<tf.Tensor 'conv0/W_variational_noise/cond/ReadVariableOp/Switch:1' shape=() dtype=resource>, <tf.Tensor 'conv0/W_variational_noise/cond/pred_id:0' shape=() dtype=bool>)

It looks like it is trying to apply gradient checkpointing to the Switch op, and this loops indefinitely.
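
Schematically, the recursion pattern from the traceback looks roughly like this (a simplified sketch, not the actual RETURNN code):

# Simplified sketch of the pattern in prepare_gradient_checkpointing, based on the
# traceback above (not the actual code): _WrappedOp.__init__ and _map_tensor call
# each other, so wrapping a long chain of ops (here the repeated
# .../cond/ReadVariableOp/Switch copies) recurses once per op until the limit is hit.
class _WrappedOp:
    def __init__(self, op):
        self._inputs = tuple(_map_tensor(x) for x in op.inputs)  # line from the traceback

def _map_tensor(x):
    return _WrappedOp(x.op)  # recurses into the op that produced x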

This is the returnn network: https://gist.github.com/mmz33/840033656b97b7e6e415c9a2b46fe75a

albertz (Member) commented Sep 5, 2024

What TensorFlow version?

albertz (Member) commented Sep 5, 2024

I'm not sure this is an infinite recursion error. Maybe the graph is just really big.

The question is where those <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch> ops are created. You can inspect that by checking/printing op.traceback.
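
E.g. roughly like this (just a sketch, assuming graph is the current tf.Graph, e.g. session.graph; the exact type of op.traceback differs between TF versions, but the frames can be printed directly):

# Rough sketch: find the suspicious Switch ops and print where they were created.
for op in graph.get_operations():
    if op.type == "Switch" and "_variational_noise/cond" in op.name:
        print(op.name)
        for frame in op.traceback:
            print("  ", frame)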

albertz (Member) commented Sep 6, 2024

Is this about variational noise or about weight dropout? I see you have both. Does it still occur when you have only variational noise?

(And side remark: Does it make sense to have both?)

albertz (Member) commented Sep 6, 2024

I tested this extended test case, and it works:

def test_param_weight_dropout_and_variational_noise():
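    # Note: Config, TFNetwork, tf_compat, make_scope, make_feed_dict and tempfile are
    # assumed to be available as usual in RETURNN's test suite (tests/test_TFNetworkLayer.py).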
    from returnn.tensor import Dim, batch_dim
    from returnn.tf.util.basic import print_graph_output, find_ops_with_tensor_input
    from returnn.tf.util.gradient_checkpoint import prepare_gradient_checkpointing

    time_dim = Dim(None, name="time")
    feature_dim = Dim(7, name="feature")
    classes_dim = Dim(13, name="classes")

    config = Config(
        {
            "param_dropout": 0.1,
            "param_variational_noise": 0.075,
            "extern_data": {
                "data": {
                    "dim_tags": [batch_dim, time_dim, feature_dim],
                    "time_dim_axis": 1,
                    "feature_dim": feature_dim,
                    "dtype": "float32",
                },
                "classes": {"dim_tags": [batch_dim, time_dim], "sparse_dim": classes_dim, "dtype": "int32"},
            },
        }
    )
    with make_scope() as session:
        network = TFNetwork(config=config, train_flag=True)
        # Do subnetwork by intention, to test when we have multiple variable scopes.
        network.construct_from_dict(
            {
                "output": {
                    "class": "linear",
                    "out_dim": classes_dim,
                    "activation": "softmax",
                    "from": "data",
                    "loss": "ce",
                    "target": "classes",
                }
            }
        )
        loss = network.get_total_loss()

        prepare_gradient_checkpointing()
        opt = tf_compat.v1.train.GradientDescentOptimizer(learning_rate=0.1)
        opt_op = opt.minimize(loss)
        print("optimizer:")
        print_graph_output(opt_op)

        tf_log_dir = tempfile.mkdtemp()
        print("TF log dir:", tf_log_dir)
        writer = tf_compat.v1.summary.FileWriter(logdir=tf_log_dir, graph=session.graph, session=session)
        params = network.get_params_list()
        print("params:", params)
        assert len(params) == 2  # weights and bias
        for param in params:
            print("param:", param)
            ops = find_ops_with_tensor_input(param, fetches=opt_op)
            print("param graph:")
            print_graph_output(ops)
            # There can be multiple ops due to gradient checkpointing.
            assert (
                1 <= len(ops)
                and all("_variational_noise/" in op.name or "/ResourceApply" in op.name for op in ops)
                and any("_variational_noise/" in op.name for op in ops)
            ), f"ops: {ops}"

        network.initialize_params(session=session)

        run_metadata = tf_compat.v1.RunMetadata()
        run_options = tf_compat.v1.RunOptions(trace_level=tf_compat.v1.RunOptions.FULL_TRACE)
        session.run(
            opt_op, feed_dict=make_feed_dict(network.extern_data), options=run_options, run_metadata=run_metadata
        )
        writer.add_run_metadata(run_metadata, tag="step_0")
        writer.close()
        print("TF log dir:", tf_log_dir)

Edit: Sorry, actually, it does not. Weight dropout is not applied here at all. This check does not apply:

            if (
                param_dropout
                and param.dtype.is_floating
                and isinstance(param, tf.Variable)
                and param.shape.ndims >= param_dropout_min_ndim
            ):

because at that point, param is not a tf.Variable anymore...

So I will extend the check. But that is just a separate, additional bug. It still means there might be a problem with variational noise alone.
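
Roughly, extending it would go in this direction (just a sketch of the idea, not necessarily the actual fix):

            # Sketch only: also accept the tensor that results after variational noise
            # was already applied to the param, instead of requiring a tf.Variable.
            if (
                param_dropout
                and param.dtype.is_floating
                and isinstance(param, (tf.Variable, tf.Tensor))
                and param.shape.ndims >= param_dropout_min_ndim
            ):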

mmz33 (Member, Author) commented Sep 6, 2024

Is this about variational noise or about weight dropout?

It is about variational noise. Before, I used only weight dropout, and that worked fine.

Does it still occur when you only have variational noise?

Yes, it still occurs.

mmz33 (Member, Author) commented Sep 6, 2024

What TensorFlow version?

I use TF 2.13.

albertz (Member) commented Sep 6, 2024

I pushed some change where I avoid the recursion and do a flat construction instead, so there should never be a maximum-recursion-depth-exceeded error. However, logically, nothing should be different from before. So even before, with a high enough recursion limit, it should have worked. Not sure if this really changes something for you now (except that you don't get the recursion error; instead it might just hang and slowly run out of memory?). But can you just try? My hypothesis is still that the graph might just be very big.
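
For illustration, the general idea of the flat construction is roughly this (a generic sketch, not the actual code):

def wrap_ops_flat(root_op):
    # Generic sketch: walk the op graph with an explicit worklist instead of
    # recursion, so graph depth no longer translates into Python stack depth.
    pending = [root_op]
    visited = {}
    while pending:
        op = pending.pop()
        if op.name in visited:
            continue
        visited[op.name] = op
        for x in op.inputs:
            pending.append(x.op)
    return visited  # all reachable ops, to be wrapped afterwards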

mmz33 (Member, Author) commented Sep 6, 2024

I tried with the master branch. It fails due to an op-copy error:

....

  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 148, in <genexpr>
    line: self._inputs = tuple(_map_tensor(x) for x in op.inputs)
    locals:
      self = <not found>
      self._inputs = <not found>
      tuple = <builtin> <class 'tuple'>
      _map_tensor = <local> <function prepare_gradient_checkpointing.<locals>._map_tensor at 0x7f12907f5ab0>
      x = <local> <tf.Tensor 'conv0/W_variational_noise/cond/ReadVariableOp/Switch:1' shape=() dtype=resource>
      op = <not found>
      op.inputs = <not found>
  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 128, in prepare_gradient_checkpointing.<locals>._map_tensor
    line: x_op_copy = _copy_op(x.op)
    locals:
      x_op_copy = <not found>
      _copy_op = <local> <function prepare_gradient_checkpointing.<locals>._copy_op at 0x7f12907f5990>
      x = <local> <tf.Tensor 'conv0/W_variational_noise/cond/ReadVariableOp/Switch:1' shape=() dtype=resource>
      x.op = <local> <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch>
  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 93, in prepare_gradient_checkpointing.<locals>._copy_op
    line: raise _DeepCopyError(op)
    locals:
      _DeepCopyError = <local> <class 'returnn.tf.util.gradient_checkpoint.prepare_gradient_checkpointing.<locals>._DeepCopyError'>
      op = <local> <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch>
_DeepCopyError: deep copy err: name: "conv0/W_variational_noise/cond/ReadVariableOp/Switch"

albertz (Member) commented Sep 6, 2024 via email

albertz (Member) commented Sep 6, 2024

I pushed another small change. Can you test?
