PPO

class PPO(model, clip_param=0.1, value_loss_coef=0.5, entropy_coef=0.01, initial_lr=0.00025, eps=1e-05, max_grad_norm=0.5, use_clipped_value_loss=True, norm_adv=True, continuous_action=False)[source]

Bases: parl.core.paddle.algorithm.Algorithm

__init__(model, clip_param=0.1, value_loss_coef=0.5, entropy_coef=0.01, initial_lr=0.00025, eps=1e-05, max_grad_norm=0.5, use_clipped_value_loss=True, norm_adv=True, continuous_action=False)[source]

PPO algorithm

Parameters:
  • model (parl.Model) – forward network of actor and critic.
  • clip_param (float) – the clipping parameter epsilon in the clipped surrogate loss.
  • value_loss_coef (float) – value function loss coefficient in the optimization objective.
  • entropy_coef (float) – policy entropy coefficient in the optimization objective.
  • initial_lr (float) – learning rate.
  • eps (float) – Adam optimizer epsilon.
  • max_grad_norm (float) – max gradient norm for gradient clipping.
  • use_clipped_value_loss (bool) – whether or not to use a clipped loss for the value function.
  • norm_adv (bool) – whether or not to use advantages normalization.
  • continuous_action (bool) – whether the environment has a continuous action space.
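
Example: a minimal construction sketch (not part of the generated API above). It assumes a hypothetical parl.Model subclass ActorCritic exposing policy(obs) and value(obs) heads for a discrete-action environment; the import paths follow PARL's usual layout.

    import paddle
    import paddle.nn as nn
    import parl
    from parl.algorithms import PPO

    class ActorCritic(parl.Model):
        """Hypothetical actor-critic network with a shared trunk."""
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.fc = nn.Linear(obs_dim, 64)
            self.policy_head = nn.Linear(64, act_dim)  # action logits
            self.value_head = nn.Linear(64, 1)         # state value

        def policy(self, obs):
            return self.policy_head(paddle.tanh(self.fc(obs)))

        def value(self, obs):
            return self.value_head(paddle.tanh(self.fc(obs)))

    model = ActorCritic(obs_dim=8, act_dim=4)
    algorithm = PPO(
        model,
        clip_param=0.1,           # epsilon in the clipped surrogate objective
        value_loss_coef=0.5,      # weight of the value-function loss
        entropy_coef=0.01,        # weight of the entropy bonus
        initial_lr=2.5e-4,
        continuous_action=False,  # discrete action space
    )
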
learn(batch_obs, batch_action, batch_value, batch_return, batch_logprob, batch_adv, lr=None)[source]

Update the model with the PPO algorithm.

Parameters:
  • batch_obs (paddle.Tensor) – shape([batch_size] + obs_shape)
  • batch_action (paddle.Tensor) – shape([batch_size] + action_shape)
  • batch_value (paddle.Tensor) – shape([batch_size])
  • batch_return (paddle.Tensor) – shape([batch_size])
  • batch_logprob (paddle.Tensor) – shape([batch_size])
  • batch_adv (paddle.Tensor) – shape([batch_size])
  • lr (float) – learning rate for this update; if None, the initial_lr set in the constructor is used.
Returns:
  • value_loss (float) – value loss
  • action_loss (float) – policy loss
  • entropy_loss (float) – entropy loss
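
Example: a hedged usage sketch continuing the construction above; the placeholder batches stand in for data produced by a rollout buffer, which is not shown.

    import paddle

    batch_size, obs_dim, act_dim = 64, 8, 4
    batch_obs = paddle.randn([batch_size, obs_dim])
    batch_action = paddle.randint(0, act_dim, shape=[batch_size])
    batch_value = paddle.randn([batch_size])
    batch_return = paddle.randn([batch_size])
    batch_logprob = paddle.randn([batch_size])
    batch_adv = paddle.randn([batch_size])

    value_loss, action_loss, entropy_loss = algorithm.learn(
        batch_obs, batch_action, batch_value, batch_return,
        batch_logprob, batch_adv)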

predict(obs)[source]

Use the model to predict actions.

Parameters: obs (paddle.Tensor) – observation, shape([batch_size] + obs_shape)
Returns:
action, shape([batch_size] + action_shape);
note that in the discrete case the argmax along the last axis is taken as the action.
Return type: action (paddle.Tensor)
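
Example: a hedged usage sketch continuing the setup above (an 8-dimensional observation and the hypothetical ActorCritic model).

    import paddle

    obs = paddle.randn([1, 8])       # a single observation with a batch dimension
    action = algorithm.predict(obs)  # greedy (argmax) action in the discrete case
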
sample(obs)[source]

Define the sampling process. This function returns the action according to the action distribution.

Parameters: obs (paddle.Tensor) – observation, shape([batch_size] + obs_shape)
Returns:
  • value (paddle.Tensor) – value, shape([batch_size, 1])
  • action (paddle.Tensor) – action, shape([batch_size] + action_shape)
  • action_log_probs (paddle.Tensor) – action log probabilities, shape([batch_size])
  • action_entropy (paddle.Tensor) – action entropy, shape([batch_size])
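
Example: a hedged usage sketch continuing the setup above; the four return values match the shapes listed in the Returns section.

    import paddle

    obs = paddle.randn([16, 8])
    value, action, action_log_probs, action_entropy = algorithm.sample(obs)
    # value: [16, 1], action: [16], action_log_probs: [16], action_entropy: [16]
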
value(obs)[source]

Use the model to predict the value of observations.

Parameters: obs (paddle.Tensor) – observation, shape([batch_size] + obs_shape)
Returns: value of obs, shape([batch_size])
Return type: value (paddle.Tensor)
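
Example: a hedged usage sketch continuing the setup above.

    import paddle

    obs = paddle.randn([16, 8])
    values = algorithm.value(obs)    # state-value estimates, shape [16]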