
[rllib] Autoregressive action distributions #5304

Merged 41 commits into ray-project:master on Aug 10, 2019

Conversation

@ericl (Contributor) commented Jul 29, 2019

What do these changes do?

  • Pass the parent model reference into the action distribution class, so it can use model variables for autoregressive sampling (see the sketch after this list)
  • Switch IMPALA / PPO to use the prev actions logp instead of logits for importance weighting
  • Add a simple example and documentation
  • Debug RNN nans
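For the first bullet, here is a condensed, non-authoritative sketch of the idea, based on the two-subaction example added in this PR (the import path and helper-method names are assumptions, and required methods such as logp are omitted): the distribution holds a reference to its parent model and calls the model's action_model submodel during sampling, so a2 can be conditioned on the already-sampled a1.

import tensorflow as tf
# Import path is an assumption; it may differ across Ray versions.
from ray.rllib.models.tf.tf_action_dist import ActionDistribution, Categorical


class BinaryAutoregressiveOutput(ActionDistribution):
    """Sketch: P(a1, a2) = P(a1) * P(a2 | a1), using the parent model."""

    def sample(self):
        # Sample a1 from P(a1 | ctx), then a2 from P(a2 | ctx, a1).
        a1_dist = self._a1_distribution()
        a1 = a1_dist.sample()
        a2_dist = self._a2_distribution(a1)
        a2 = a2_dist.sample()
        return (a1, a2)

    def _a1_distribution(self):
        BATCH = tf.shape(self.inputs)[0]
        # a1 depends only on the context, so feed zeros in the a1 slot.
        a1_logits, _ = self.model.action_model(
            [self.inputs, tf.zeros((BATCH, 1))])
        return Categorical(a1_logits)

    def _a2_distribution(self, a1):
        a1_vec = tf.expand_dims(tf.cast(a1, tf.float32), 1)
        _, a2_logits = self.model.action_model([self.inputs, a1_vec])
        return Categorical(a2_logits)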

Right now the custom action distribution is injected via an override_action_dist hack; this will be removed once #5164 merges.

Related issue number

Closes #4939
Closes #5419

Linter

  • I've run scripts/format.sh to lint the changes in this PR.


@ericl (Contributor, Author) commented Aug 2, 2019

> In common cases it's a trivial thing. But if we use BinaryAutoregressiveOutput, we do not have access to the input_dict, only the hidden state (self.inputs). Is there any chance we can access the input_dict in the custom distribution?

This is a bit hacky, but if you are defining a custom model, you can save the input dict in your forward() as, e.g., model.last_input_dict. Then you can access model.last_input_dict from your action distribution.
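A rough sketch of that hack, with hypothetical class names and a minimal base model (the attribute name last_input_dict is just a convention; any name works):

import tensorflow as tf
from ray.rllib.models.tf.tf_modelv2 import TFModelV2


class StashInputModel(TFModelV2):
    """Hypothetical custom model that keeps the last input dict around."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        super(StashInputModel, self).__init__(obs_space, action_space,
                                              num_outputs, model_config, name)
        obs_in = tf.keras.layers.Input(shape=obs_space.shape)
        ctx = tf.keras.layers.Dense(num_outputs, activation=tf.nn.tanh)(obs_in)
        value = tf.keras.layers.Dense(1)(ctx)
        self.base_model = tf.keras.Model(obs_in, [ctx, value])
        self.register_variables(self.base_model.variables)
        self.last_input_dict = None

    def forward(self, input_dict, state, seq_lens):
        # Stash the raw input dict so the action distribution can read it.
        self.last_input_dict = input_dict
        ctx, self._value_out = self.base_model(input_dict["obs"])
        return ctx, state

    def value_function(self):
        return tf.reshape(self._value_out, [-1])


# Inside the custom action distribution, the stashed dict is then reachable
# through the model reference, e.g. self.model.last_input_dict["obs"]
# (or a sub-key such as an action mask, if the observation space is a Dict).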

> Another issue I want to point out is that the hidden_state in the model has to have the same dimension as the model action output to pass the shape check at line 162 of modelv2.py

This should be addressed once #5164 is merged, since that PR allows custom action distributions to specify their own output size via required_model_output_shape. Your custom distribution can then specify whatever size it needs.
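For example, a hedged sketch of that interface (the exact signature should be checked against #5164):

# Import path assumed; 128 is an arbitrary illustrative embedding size.
from ray.rllib.models.tf.tf_action_dist import ActionDistribution


class MyAutoregressiveOutput(ActionDistribution):
    @staticmethod
    def required_model_output_shape(action_space, model_config):
        # Width of the model output (the context embedding fed into this
        # distribution), chosen independently of the action space.
        return 128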

Let me know if these work.

@ericl changed the title from "[rllib] Autoregressive action distributions" to "[rllib] [WIP] Autoregressive action distributions" on Aug 2, 2019
@yangysc commented Aug 8, 2019

Hi @ericl, sorry to trouble you again. The autoregressive action distribution is working well, but I think there is still one small functional issue: how to fetch the action logits of each sub-action, instead of having the hidden state returned as behaviour_logits at test time. This would be very helpful when debugging.

Here is what I have tried.

1. Add the logits directly as attributes of the model

I noticed that the function vf_preds_and_logits_fetches in ppo_policy.py is in charge of fetching these extra outputs, so I tried adding a1_logits and a2_logits to the model and returning them there:

def vf_preds_and_logits_fetches(policy):
    """Adds value function and logits outputs to experience batches."""
    return {
        SampleBatch.VF_PREDS: policy.value_function,
        BEHAVIOUR_LOGITS: policy.model_out,
        'action_mask': policy.model.last_input_dict,
        # 'ac_1_logits': policy.model.a1_logits,
        # 'ac_2_logits': policy.model.a2_logits,
    }

But it turns out that the BinaryAutoregressiveOutput distribution expects the hidden state and a1_input as its inputs, so this fails because nothing is fed for them.

2. Call the action_model directly

After that failed, I tried feeding ctx_input and a1_input directly after line 204 of rollout.py:

# Line 204 of rollout.py
a_action, p_state, info = agent.compute_action(
    a_obs,
    state=agent_states[agent_id],
    prev_action=prev_actions[agent_id],
    prev_reward=prev_rewards[agent_id],
    policy_id=policy_id,
    full_fetch=True)

input_size = 256
agent.get_policy(policy_id).model.action_model(
    [info['behaviour_logits'][None],
     np.zeros((1, input_size), dtype=np.float32)])

But this raises the following error:

{ValueError}Tensor("default_policy/a2_hidden/kernel/Read/ReadVariableOp:0", shape=(384, 128), dtype=float32) must be from the same graph as Tensor("model_1_1/concatenate_1/concat:0", shape=(1, 384), dtype=float32).

Here model_1 is the ParametricActionsModel; the hidden state size is 128 and the a1_input size is 256, so 384 == 128 + 256.

Do you have any thoughts on fetching the action logits? Training is not going well, and I hope watching the logits will help with debugging.

Sorry for disturbing you again.

Best wishes,

Shanchao

@ericl (Contributor, Author) commented Aug 8, 2019

@yangysc I would probably add some tf.Print calls inside the action distribution object itself, since you want to capture the logit outputs during the sampling process. It might also be possible to assign to self.model inside the action distribution object to capture the right conditioned output.
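For example, a minimal sketch of the first suggestion (a fragment meant to be dropped into the distribution class; the method name mirrors the a1 snippet in this diff, and tf.Print is the TF1-era identity op that logs to stderr):

def _a1_distribution(self):
    BATCH = tf.shape(self.inputs)[0]
    a1_logits, _ = self.model.action_model(
        [self.inputs, tf.zeros((BATCH, 1))])
    # tf.Print returns the same tensor but dumps its values every time the
    # sampling op runs, so the per-step a1 logits show up in the worker logs.
    a1_logits = tf.Print(a1_logits, [a1_logits], message="a1_logits: ")
    return Categorical(a1_logits)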

@ericl assigned richardliaw and unassigned hartikainen on Aug 8, 2019
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16203/

@ericl added the tests-ok label (the tagger certifies test failures are unrelated and assumes personal liability) on Aug 10, 2019
ModelCatalog.register_custom_model("autoregressive_model",
                                   AutoregressiveActionsModel)
ModelCatalog.register_custom_action_dist("binary_autoreg_output",
                                         BinaryAutoregressiveOutput)
Contributor:

why make this explicit?

Contributor Author (@ericl):

We could auto-register it, but that would be more effort for sure.
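For reference, a hypothetical sketch of how the registered names would be referenced from the trainer config (custom_model is the standard key; custom_action_dist is the key used once #5164 lands, while this PR's temporary override hack may differ):

config = {
    "model": {
        "custom_model": "autoregressive_model",
        "custom_action_dist": "binary_autoreg_output",
    },
}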

BATCH = tf.shape(self.inputs)[0]
a1_logits, _ = self.model.action_model(
    [self.inputs, tf.zeros((BATCH, 1))])
a1_dist = Categorical(a1_logits)
Contributor:

would this have issues? Like somehow adding nodes to the tf graph over and over?

Contributor Author (@ericl):

It's fine since this entire thing is only called once in graph mode.

# and state of each episode (i.e., for multiagent). You can do
# whatever is needed here, e.g., MCTS rollouts.
return action_batch
class BinaryAutoregressiveOutput(ActionDistribution):
Contributor:

why not use literalinclude so that this doesn't go out of sync?

@richardliaw (Contributor) left a comment:

Looks good though we should probably add a check to make sure the TF graph doesn't change over time...

@ericl merged commit a1d2e17 into ray-project:master on Aug 10, 2019