OAC

class OAC(model, gamma=None, tau=None, alpha=None, beta=None, delta=None, actor_lr=None, critic_lr=None)[source]

Bases: parl.core.paddle.algorithm.Algorithm

__init__(model, gamma=None, tau=None, alpha=None, beta=None, delta=None, actor_lr=None, critic_lr=None)[source]

OAC (Optimistic Actor-Critic) algorithm

Parameters:
  • model (parl.Model) – forward network of actor and critic.
  • gamma (float) – discount factor for reward computation
  • tau (float) – decay coefficient when updating the weights of self.target_model with self.model
  • alpha (float) – temperature parameter that determines the relative importance of the entropy term against the reward
  • beta (float) – determines the relative importance of sigma_Q, the uncertainty estimate of the Q-value, in the optimistic upper bound used for exploration
  • delta (float) – determines the maximal shift of the exploration policy's mean from the target policy's mean
  • actor_lr (float) – learning rate of the actor model
  • critic_lr (float) – learning rate of the critic model
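
A minimal construction sketch, not taken from the PARL source: OACModel here is a hypothetical user-defined parl.Model that wraps the actor (outputting the mean and log-std of a Gaussian policy) and the critics, following the convention of PARL's other actor-critic examples, and the beta/delta values are the ones reported in the OAC paper, so treat them as starting points:

    import parl
    from parl.algorithms import OAC  # assumed export path, matching other PARL algorithms

    # OACModel is a hypothetical parl.Model; obs_dim/action_dim are example values.
    model = OACModel(obs_dim=17, action_dim=6)

    algorithm = OAC(
        model,
        gamma=0.99,    # discount factor
        tau=0.005,     # soft-update coefficient for the target model
        alpha=0.2,     # entropy temperature
        beta=4.66,     # weight on sigma_Q in the optimistic upper bound
        delta=23.53,   # bounds the shift of the exploration mean
        actor_lr=3e-4,
        critic_lr=3e-4,
    )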
learn(obs, action, reward, next_obs, terminal)[source]

Define the loss function and create an optimizer to minimize the loss.
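
A usage sketch, assuming the algorithm instance built above and batches in the (batch_size, dim) layout used by PARL's off-policy examples:

    import numpy as np
    import paddle

    # Hypothetical batch; in practice it is sampled from a replay buffer.
    batch_size, obs_dim, action_dim = 256, 17, 6
    obs = paddle.to_tensor(np.random.randn(batch_size, obs_dim), dtype='float32')
    action = paddle.to_tensor(np.random.randn(batch_size, action_dim), dtype='float32')
    reward = paddle.to_tensor(np.random.randn(batch_size, 1), dtype='float32')
    next_obs = paddle.to_tensor(np.random.randn(batch_size, obs_dim), dtype='float32')
    terminal = paddle.to_tensor(np.zeros((batch_size, 1)), dtype='float32')

    # One optimization step on critic and actor; see the source for the
    # exact return values (typically the losses).
    algorithm.learn(obs, action, reward, next_obs, terminal)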

predict(obs)[source]

Refine the predicting process, e.g., use the policy model to predict actions.
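
For example, at evaluation time (env_obs is a hypothetical NumPy observation; a leading batch dimension is assumed to be required, as in PARL's other paddle algorithms):

    import paddle

    obs = paddle.to_tensor(env_obs.reshape(1, -1), dtype='float32')
    action = algorithm.predict(obs)  # deterministic action, typically the policy mean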

sample(obs)[source]

Define the sampling process. This function returns an action with noise to perform exploration.
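
A collection-time sketch (env_obs is again a hypothetical NumPy observation); in OAC the noise is not plain Gaussian noise around the target policy but a sample from the shifted, optimistic exploration distribution:

    import paddle

    obs = paddle.to_tensor(env_obs.reshape(1, -1), dtype='float32')
    # Stochastic action for data collection via OAC's optimistic exploration;
    # check the source for the exact return values.
    action = algorithm.sample(obs)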