MADDPG

class MADDPG(model, agent_index=None, act_space=None, gamma=None, tau=None, actor_lr=None, critic_lr=None)[source]

Bases: Algorithm

Q(obs_n, act_n, use_target_model=False)[source]

Use the critic (value) model to predict Q values.

Parameters:
  • obs_n (list of paddle tensor) – observations of all agents; a list whose length equals the number of agents, each tensor with shape [B] + shape of obs_n

  • act_n (list of paddle tensor) – actions of all agents; a list whose length equals the number of agents, each tensor with shape [B] + shape of act_n

  • use_target_model (bool) – whether to use target_model

Returns:

Q value of this agent, shape([B])

Return type:

Q (paddle tensor)
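
A minimal usage sketch, assuming maddpg is an already constructed MADDPG instance (see the __init__ example below); the number of agents and the per-agent dimensions are purely illustrative:

    import paddle

    # One batch of B joint observations/actions, one tensor per agent.
    B, n_agents = 32, 3
    obs_dims, act_dims = [10, 10, 10], [5, 5, 5]   # illustrative sizes
    obs_n = [paddle.randn([B, obs_dims[i]]) for i in range(n_agents)]
    act_n = [paddle.randn([B, act_dims[i]]) for i in range(n_agents)]

    # Q value of this agent's critic for the joint input, shape [B].
    q = maddpg.Q(obs_n, act_n)
    # Same prediction from the target critic.
    q_target = maddpg.Q(obs_n, act_n, use_target_model=True)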

__init__(model, agent_index=None, act_space=None, gamma=None, tau=None, actor_lr=None, critic_lr=None)[source]

MADDPG algorithm

Parameters:
  • model (parl.Model) – forward network of the actor and the critic. The model must implement get_actor_params().

  • agent_index (int) – index of this agent in the multi-agent environment

  • act_space (list) – action spaces of the agents (gym spaces)

  • gamma (float) – discount factor for reward computation

  • tau (float) – decay coefficient used when updating the weights of self.target_model with self.model

  • critic_lr (float) – learning rate of the critic model

  • actor_lr (float) – learning rate of the actor model
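
A hedged construction sketch; MAModel stands for a user-defined parl.Model implementing get_actor_params(), and env, obs_dims and act_dims are illustrative placeholders rather than names provided by the library:

    from parl.algorithms import MADDPG

    # MAModel: user-defined parl.Model holding this agent's actor and the
    # centralized critic (must implement get_actor_params()).
    model = MAModel(obs_dim=obs_dims[0], act_dim=act_dims[0],
                    critic_in_dim=sum(obs_dims) + sum(act_dims))

    maddpg = MADDPG(
        model,
        agent_index=0,               # which agent this instance controls
        act_space=env.action_space,  # gym action spaces of the agents
        gamma=0.95,                  # discount factor
        tau=0.01,                    # soft-update coefficient for target_model
        actor_lr=0.01,
        critic_lr=0.01)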

learn(obs_n, act_n, target_q)[source]

Update the actor and critic models with the MADDPG algorithm.
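
A hedged sketch of one update step for agent i, in which the target Q value is built from the target networks of all agents; algorithms (a list of per-agent MADDPG instances), obs_next_n, rew and done are illustrative names, not part of this API:

    # Next-step actions of every agent, drawn from their target policies.
    target_act_next_n = [alg.sample(obs_next, use_target_model=True)
                         for alg, obs_next in zip(algorithms, obs_next_n)]

    # Bootstrapped critic target: r + gamma * (1 - done) * Q_target(s', a').
    q_next = algorithms[i].Q(obs_next_n, target_act_next_n,
                             use_target_model=True)
    target_q = rew + gamma * (1.0 - done) * q_next

    # One MADDPG update of this agent's actor and critic.
    algorithms[i].learn(obs_n, act_n, target_q)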

predict(obs)[source]

Use the policy model to predict actions.

Parameters:

obs (paddle tensor) – observation, shape([B] + shape of obs_n[agent_index])

Returns:

action, shape([B] + shape of act_n[agent_index]); note that in the discrete case the argmax along the last axis is taken as the action

Return type:

act (paddle tensor)
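
For example, reusing maddpg, B and obs_dims from the sketches above:

    # Batch of this agent's own observations, shape [B] + obs shape.
    obs = paddle.randn([B, obs_dims[0]])
    act = maddpg.predict(obs)   # deterministic action, shape [B] + act shape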

sample(obs, use_target_model=False)[source]

Use the policy model to sample actions.

Parameters:
  • obs (paddle tensor) – observation, shape([B] + shape of obs_n[agent_index])

  • use_target_model (bool) – whether to use target_model

Returns:

action, shape([B] + shape of act_n[agent_index]); note that in the discrete case the argmax along the last axis is taken as the action

Return type:

act (paddle tensor)
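
For example, with the same obs as in the predict example:

    # Exploratory action from the online policy.
    act = maddpg.sample(obs)
    # The same call against the target policy, as used when forming target_q.
    act_target = maddpg.sample(obs, use_target_model=True)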

sync_target(decay=None)[source]

Update the target network with the training network.

Parameters:

decay (float) – decay factor used when updating the target network from the training network. 0 means direct assignment (a hard copy); None means a slow (soft) update whose rate depends on the hyperparameter tau.
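
For example, using the maddpg instance from the sketches above:

    # decay=0: hard copy, target weights become identical to the training model.
    maddpg.sync_target(decay=0)

    # Default (decay=None): slow soft update whose rate is controlled by tau.
    maddpg.sync_target()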