
param_variational_noise causes recursion limit error in TF backend #1616

Closed
mmz33 opened this issue Sep 5, 2024 · 10 comments · Fixed by #1619 or #1620

mmz33 (Member) commented Sep 5, 2024

I set param_variational_noise for almost all layers in the encoder. The network I am using has only 4 encoder layers, and I am getting this Python exception:

Exception RecursionError('maximum recursion depth exceeded while calling a Python object') in step 0.

Increasing the stack limit does not fix the issue, because it seems there is an infinite loop in the gradient-checkpointing logic, as you can see in the log here: https://gist.github.com/mmz33/547a099d050983ab71c8fc7d5ca87c62
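
(By "increasing the stack limit" I mean something like this at the top of the config; the value is just an example, and it only delays the crash here, it does not fix the loop:)

import sys

sys.setrecursionlimit(100000)  # example value only; does not fix the underlying loop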

Here is the last gradient-checkpoint call before crashing. It mentions op .../Switch_994, so something is wrong.

........

  File "/nas/models/asr/mzeineldeen/setups/spanish/2023-10-20--att-i6/returnn/returnn/tf/util/gradient_checkpoint.py", line 116, in prepare_gradient_checkpointing.<locals>._set_wrapped_grad_func.<locals>._WrappedOp.__init__
    line: self._inputs = tuple(_map_tensor(x) for x in op.inputs)
    locals:
      self = <local> <returnn.tf.util.gradient_checkpoint.prepare_gradient_checkpointing.<locals>._set_wrapped_grad_func.<locals>._WrappedOp object at 0x7fab2a857b20>
      self._inputs = <local> !AttributeError: '_WrappedOp' object has no attribute '_inputs'
      tuple = <builtin> <class 'tuple'>
      _map_tensor = <local> <function prepare_gradient_checkpointing.<locals>._map_tensor at 0x7fab258c3370>
      x = <not found>
      op = <local> <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch_994' type=Switch>
      op.inputs = <local> (<tf.Tensor 'conv0/W_variational_noise/cond/ReadVariableOp/Switch:1' shape=() dtype=resource>, <tf.Tensor 'conv0/W_variational_noise/cond/pred_id:0' shape=() dtype=bool>)

It looks like it is trying to apply gradient checkpointing to the Switch op, and this loops indefinitely.
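
Schematically, the recursion pattern from the traceback looks roughly like this (a simplified sketch, not the actual RETURNN code):

# Simplified sketch of the pattern in prepare_gradient_checkpointing, based on the
# traceback above (not the actual code): _WrappedOp.__init__ and _map_tensor call
# each other, so wrapping a long chain of ops (here the repeated
# .../cond/ReadVariableOp/Switch copies) recurses once per op until the limit is hit.
class _WrappedOp:
    def __init__(self, op):
        self._inputs = tuple(_map_tensor(x) for x in op.inputs)  # line from the traceback

def _map_tensor(x):
    return _WrappedOp(x.op)  # recurses into the op that produced x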

This is the returnn network: https://gist.github.com/mmz33/840033656b97b7e6e415c9a2b46fe75a

albertz (Member) commented Sep 5, 2024

What TensorFlow version?

albertz (Member) commented Sep 5, 2024

I'm not sure this is an infinite recursion error. Maybe the graph is just really big.

The question is where those <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch> ops are created. You can inspect that by checking/printing op.traceback.
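
E.g. roughly like this (just a sketch, assuming graph is the current tf.Graph, e.g. session.graph; the exact type of op.traceback differs between TF versions, but the frames can be printed directly):

# Rough sketch: find the suspicious Switch ops and print where they were created.
for op in graph.get_operations():
    if op.type == "Switch" and "_variational_noise/cond" in op.name:
        print(op.name)
        for frame in op.traceback:
            print("  ", frame)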

albertz (Member) commented Sep 6, 2024

Is this about variational noise or about weight dropout? I see you have both. Does it still occur when you have only variational noise?

(And side remark: Does it make sense to have both?)

albertz (Member) commented Sep 6, 2024

I tested this extended test case, and it works:

def test_param_weight_dropout_and_variational_noise():
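    # Note: Config, TFNetwork, tf_compat, make_scope, make_feed_dict and tempfile are
    # assumed to be available as usual in RETURNN's test suite (tests/test_TFNetworkLayer.py).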
    from returnn.tensor import Dim, batch_dim
    from returnn.tf.util.basic import print_graph_output, find_ops_with_tensor_input
    from returnn.tf.util.gradient_checkpoint import prepare_gradient_checkpointing

    time_dim = Dim(None, name="time")
    feature_dim = Dim(7, name="feature")
    classes_dim = Dim(13, name="classes")

    config = Config(
        {
            "param_dropout": 0.1,
            "param_variational_noise": 0.075,
            "extern_data": {
                "data": {
                    "dim_tags": [batch_dim, time_dim, feature_dim],
                    "time_dim_axis": 1,
                    "feature_dim": feature_dim,
                    "dtype": "float32",
                },
                "classes": {"dim_tags": [batch_dim, time_dim], "sparse_dim": classes_dim, "dtype": "int32"},
            },
        }
    )
    with make_scope() as session:
        network = TFNetwork(config=config, train_flag=True)
        # Do subnetwork by intention, to test when we have multiple variable scopes.
        network.construct_from_dict(
            {
                "output": {
                    "class": "linear",
                    "out_dim": classes_dim,
                    "activation": "softmax",
                    "from": "data",
                    "loss": "ce",
                    "target": "classes",
                }
            }
        )
        loss = network.get_total_loss()

        prepare_gradient_checkpointing()
        opt = tf_compat.v1.train.GradientDescentOptimizer(learning_rate=0.1)
        opt_op = opt.minimize(loss)
        print("optimizer:")
        print_graph_output(opt_op)

        tf_log_dir = tempfile.mkdtemp()
        print("TF log dir:", tf_log_dir)
        writer = tf_compat.v1.summary.FileWriter(logdir=tf_log_dir, graph=session.graph, session=session)
        params = network.get_params_list()
        print("params:", params)
        assert len(params) == 2  # weights and bias
        for param in params:
            print("param:", param)
            ops = find_ops_with_tensor_input(param, fetches=opt_op)
            print("param graph:")
            print_graph_output(ops)
            # There can be multiple ops due to gradient checkpointing.
            assert (
                1 <= len(ops)
                and all("_variational_noise/" in op.name or "/ResourceApply" in op.name for op in ops)
                and any("_variational_noise/" in op.name for op in ops)
            ), f"ops: {ops}"

        network.initialize_params(session=session)

        run_metadata = tf_compat.v1.RunMetadata()
        run_options = tf_compat.v1.RunOptions(trace_level=tf_compat.v1.RunOptions.FULL_TRACE)
        session.run(
            opt_op, feed_dict=make_feed_dict(network.extern_data), options=run_options, run_metadata=run_metadata
        )
        writer.add_run_metadata(run_metadata, tag="step_0")
        writer.close()
        print("TF log dir:", tf_log_dir)

Edit: Sorry, actually, it does not. Weight dropout is not applied here at all. This check does not apply:

            if (
                param_dropout
                and param.dtype.is_floating
                and isinstance(param, tf.Variable)
                and param.shape.ndims >= param_dropout_min_ndim
            ):

because at that point, param is not a tf.Variable anymore...

So I will extend the check. But that is just a separate, additional bug. It still means there might be a problem with variational noise alone.
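
Roughly, extending it would go in this direction (just a sketch of the idea, not necessarily the actual fix):

            # Sketch only: also accept the tensor that results after variational noise
            # was already applied to the param, instead of requiring a tf.Variable.
            if (
                param_dropout
                and param.dtype.is_floating
                and isinstance(param, (tf.Variable, tf.Tensor))
                and param.shape.ndims >= param_dropout_min_ndim
            ):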

mmz33 (Member, Author) commented Sep 6, 2024

Is this about variational noise or about weight dropout?

It is about variational noise. Before, I used only weight dropout, and that worked fine.

Does it still occur when you only have variational noise?

Yes, it still occurs.

mmz33 (Member, Author) commented Sep 6, 2024

What TensorFlow version?

I use TF 2.13.

albertz (Member) commented Sep 6, 2024

I pushed some change where I avoid the recursion and do a flat construction instead, so there should never be a maximum-recursion-depth-exceeded error. However, logically, nothing should be different from before. So even before, with a high enough recursion limit, it should have worked. Not sure if this really changes something for you now (except that you don't get the recursion error; instead it might just hang and slowly run out of memory?). But can you just try? My hypothesis is still that the graph might just be very big.
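
For illustration, the general idea of the flat construction is roughly this (a generic sketch, not the actual code):

def wrap_ops_flat(root_op):
    # Generic sketch: walk the op graph with an explicit worklist instead of
    # recursion, so graph depth no longer translates into Python stack depth.
    pending = [root_op]
    visited = {}
    while pending:
        op = pending.pop()
        if op.name in visited:
            continue
        visited[op.name] = op
        for x in op.inputs:
            pending.append(x.op)
    return visited  # all reachable ops, to be wrapped afterwards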

mmz33 (Member, Author) commented Sep 6, 2024

I tried with the master branch. It fails due to an op-copy error:

....

  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 148, in <genexpr>
    line: self._inputs = tuple(_map_tensor(x) for x in op.inputs)
    locals:
      self = <not found>
      self._inputs = <not found>
      tuple = <builtin> <class 'tuple'>
      _map_tensor = <local> <function prepare_gradient_checkpointing.<locals>._map_tensor at 0x7f12907f5ab0>
      x = <local> <tf.Tensor 'conv0/W_variational_noise/cond/ReadVariableOp/Switch:1' shape=() dtype=resource>
      op = <not found>
      op.inputs = <not found>
  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 128, in prepare_gradient_checkpointing.<locals>._map_tensor
    line: x_op_copy = _copy_op(x.op)
    locals:
      x_op_copy = <not found>
      _copy_op = <local> <function prepare_gradient_checkpointing.<locals>._copy_op at 0x7f12907f5990>
      x = <local> <tf.Tensor 'conv0/W_variational_noise/cond/ReadVariableOp/Switch:1' shape=() dtype=resource>
      x.op = <local> <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch>
  File "/u/zeineldeen/dev/returnn/returnn/tf/util/gradient_checkpoint.py", line 93, in prepare_gradient_checkpointing.<locals>._copy_op
    line: raise _DeepCopyError(op)
    locals:
      _DeepCopyError = <local> <class 'returnn.tf.util.gradient_checkpoint.prepare_gradient_checkpointing.<locals>._DeepCopyError'>
      op = <local> <tf.Operation 'conv0/W_variational_noise/cond/ReadVariableOp/Switch' type=Switch>
_DeepCopyError: deep copy err: name: "conv0/W_variational_noise/cond/ReadVariableOp/Switch"

albertz (Member) commented Sep 6, 2024 via email

albertz (Member) commented Sep 6, 2024

I pushed another small change. Can you test?
