param_variational_noise causes recursion limit error in TF backend #1616
Comments
What TensorFlow version?
I'm not sure this is an infinite recursion error. Maybe the graph is just really big. The question is, where are those …
Is this about variational noise or about weight dropout? I see you have both. Does it still occur when you only have variational noise? (And side remark: does it make sense to have both?)
I tested this extended test case and it works:

```python
import tempfile
from returnn.config import Config
from returnn.tf.network import TFNetwork
from returnn.tf import compat as tf_compat
# Context assumed from RETURNN's test suite: make_scope and make_feed_dict are
# test helpers (e.g. in tests/test_TFNetworkLayer.py).


def test_param_weight_dropout_and_variational_noise():
    from returnn.tensor import Dim, batch_dim
    from returnn.tf.util.basic import print_graph_output, find_ops_with_tensor_input
    from returnn.tf.util.gradient_checkpoint import prepare_gradient_checkpointing

    time_dim = Dim(None, name="time")
    feature_dim = Dim(7, name="feature")
    classes_dim = Dim(13, name="classes")
    config = Config(
        {
            "param_dropout": 0.1,
            "param_variational_noise": 0.075,
            "extern_data": {
                "data": {
                    "dim_tags": [batch_dim, time_dim, feature_dim],
                    "time_dim_axis": 1,
                    "feature_dim": feature_dim,
                    "dtype": "float32",
                },
                "classes": {"dim_tags": [batch_dim, time_dim], "sparse_dim": classes_dim, "dtype": "int32"},
            },
        }
    )
    with make_scope() as session:
        network = TFNetwork(config=config, train_flag=True)
        # Do subnetwork by intention, to test when we have multiple variable scopes.
        network.construct_from_dict(
            {
                "output": {
                    "class": "linear",
                    "out_dim": classes_dim,
                    "activation": "softmax",
                    "from": "data",
                    "loss": "ce",
                    "target": "classes",
                }
            }
        )
        loss = network.get_total_loss()
        prepare_gradient_checkpointing()
        opt = tf_compat.v1.train.GradientDescentOptimizer(learning_rate=0.1)
        opt_op = opt.minimize(loss)
        print("optimizer:")
        print_graph_output(opt_op)
        tf_log_dir = tempfile.mkdtemp()
        print("TF log dir:", tf_log_dir)
        writer = tf_compat.v1.summary.FileWriter(logdir=tf_log_dir, graph=session.graph, session=session)
        params = network.get_params_list()
        print("params:", params)
        assert len(params) == 2  # weights and bias
        for param in params:
            print("param:", param)
            ops = find_ops_with_tensor_input(param, fetches=opt_op)
            print("param graph:")
            print_graph_output(ops)
            # There can be multiple ops due to gradient checkpointing.
            assert (
                1 <= len(ops)
                and all("_variational_noise/" in op.name or "/ResourceApply" in op.name for op in ops)
                and any("_variational_noise/" in op.name for op in ops)
            ), f"ops: {ops}"
        network.initialize_params(session=session)
        run_metadata = tf_compat.v1.RunMetadata()
        run_options = tf_compat.v1.RunOptions(trace_level=tf_compat.v1.RunOptions.FULL_TRACE)
        session.run(
            opt_op, feed_dict=make_feed_dict(network.extern_data), options=run_options, run_metadata=run_metadata
        )
        writer.add_run_metadata(run_metadata, tag="step_0")
        writer.close()
        print("TF log dir:", tf_log_dir)
```

Edit: Sorry, actually, it does not. Weight dropout is not applied here at all. This check here does not apply:

```python
if (
    param_dropout
    and param.dtype.is_floating
    and isinstance(param, tf.Variable)
    and param.shape.ndims >= param_dropout_min_ndim
):
```

because at that point … So I will extend the check. But that's just a separate additional bug. It still means there might be a problem with variational noise only.
It is about variational noise. Before, I only used weight dropout and that worked fine.
Yes, it still occurs.
I use TF 2.13.
I pushed some change where I avoid the recursion and do a flat construction instead, so there should never be a maximum recursion depth exceeded error. However, logically, nothing should be different from before; even before, with a high enough recursion limit, it should have worked. I'm not sure if this really changes anything for you now (except that you don't get the recursion error; instead it might just hang and slowly run OOM?). But can you just try? My hypothesis is still that the graph might just be very big.
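(Not the actual RETURNN change, just a sketch of the general idea: a recursive graph walk replaced by an iterative, stack-based one, so Python's recursion limit never comes into play.)

```python
def collect_ops_flat(root_op):
    """Depth-first walk over a TF graph done iteratively with an explicit stack,
    so very deep graphs cannot raise "maximum recursion depth exceeded"."""
    visited = set()
    stack = [root_op]
    order = []
    while stack:
        op = stack.pop()
        if op in visited:
            continue
        visited.add(op)
        order.append(op)
        # op.inputs are tensors; each tensor's producing op is inp.op.
        stack.extend(inp.op for inp in op.inputs)
    return order
```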
I tried with the master branch. It fails due to an op copy error: …
Ah, I think I know the problem. Can you post the full log/error (it shouldn't be so long now)? Btw, did you use an earlier TF version before?
I pushed another small change. Can you test?
I set `param_variational_noise` for almost all layers in the encoder. The network I am using has only 4 encoder layers, and I am getting a Python maximum-recursion-depth-exceeded exception. Increasing the stack limit does not fix the issue because there seems to be an infinite loop in the gradient checkpointing logic, as you can see in the log here: https://gist.github.com/mmz33/547a099d050983ab71c8fc7d5ca87c62
Here is the last grad checkpoint call before crashing. It mentions the op `.../Switch_994`, so something is wrong. It looks like it is trying to apply gradient checkpointing to the Switch op, and this loops indefinitely.
This is the RETURNN network: https://gist.github.com/mmz33/840033656b97b7e6e415c9a2b46fe75a
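(For reference, the "increasing the stack limit" workaround mentioned above would look roughly like the sketch below; the values are purely illustrative. As noted, this cannot help if the checkpointing logic really loops forever.)

```python
import sys
import threading

# Raise Python's recursion limit and the stack size for new threads (illustrative values).
# This only helps when the recursion is deep but finite; a genuinely infinite loop,
# such as the one suspected in the gradient checkpointing logic, will still crash or hang.
sys.setrecursionlimit(10_000)
threading.stack_size(64 * 1024 * 1024)  # must be called before starting the worker thread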