MADDPG
class MADDPG(model, agent_index=None, act_space=None, gamma=None, tau=None, actor_lr=None, critic_lr=None)

Bases: parl.core.paddle.algorithm.Algorithm
Q(obs_n, act_n, use_target_model=False)

Use the value model to predict Q values.

Parameters:
- obs_n (list of paddle tensors) – observations of all agents; a list of length num_agents, each element of shape [B] + shape of obs_n[i]
- act_n (list of paddle tensors) – actions of all agents; a list of length num_agents, each element of shape [B] + shape of act_n[i]
- use_target_model (bool) – whether to use the target model

Returns: Q value of this agent, shape [B]
Return type: Q (paddle tensor)
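The critic in MADDPG is centralized: it conditions on every agent's observation and action, not just this agent's. A minimal numpy sketch of that input layout (shapes and names are hypothetical, not the parl API):

```python
import numpy as np

# Hypothetical setup: 3 agents, batch size B=4, per-agent obs and act dims.
B = 4
obs_n = [np.random.rand(B, d) for d in (8, 8, 10)]   # all agents' observations
act_n = [np.random.rand(B, d) for d in (2, 2, 5)]    # all agents' actions

# A centralized critic concatenates every agent's obs and act along the
# feature axis before estimating this agent's Q value for the batch.
critic_input = np.concatenate(obs_n + act_n, axis=1)
print(critic_input.shape)  # (4, 35): 8+8+10 obs dims plus 2+2+5 act dims
```

The output Q value then has shape [B], one scalar per batch element.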
__init__(model, agent_index=None, act_space=None, gamma=None, tau=None, actor_lr=None, critic_lr=None)

MADDPG algorithm.

Parameters:
- model (parl.Model) – forward network of the actor and critic. The model must implement get_actor_params().
- agent_index (int) – index of this agent in the multi-agent environment
- act_space (list) – action spaces of all agents (gym spaces)
- gamma (float) – discount factor for reward computation
- tau (float) – decay coefficient used when updating the weights of self.target_model from self.model
- critic_lr (float) – learning rate of the critic model
- actor_lr (float) – learning rate of the actor model
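The discount factor gamma enters the critic's learning target in the usual DDPG fashion. A scalar numpy sketch of that TD target (illustrative only, not the parl implementation):

```python
import numpy as np

gamma = 0.95                      # discount factor, as passed to __init__
rewards = np.array([1.0, 0.5])    # rewards for a batch of two transitions
done = np.array([0.0, 1.0])       # 1.0 marks a terminal transition
q_next = np.array([2.0, 3.0])     # target critic's Q(s', a') estimates

# TD target: bootstrap from the target network unless the episode ended.
target_q = rewards + gamma * (1.0 - done) * q_next
print(target_q)  # [2.9 0.5]
```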
predict(obs)

Use the policy model to predict actions.

Parameters: obs (paddle tensor) – observation, shape [B] + shape of obs_n[agent_index]
Returns: action, shape [B] + shape of act_n[agent_index]; note that in the discrete case the argmax along the last axis is taken as the action
Return type: act (paddle tensor)
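The discrete-case convention above can be sketched in plain numpy (hypothetical logits, not the parl API):

```python
import numpy as np

# Hypothetical policy output for a batch of 2 over 4 discrete actions.
logits = np.array([[0.1, 2.0, -1.0, 0.3],
                   [0.0, -0.5, 1.5, 0.2]])

# Discrete case: the argmax along the last axis is taken as the action.
act = logits.argmax(axis=-1)
print(act)  # [1 2]
```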
sample(obs, use_target_model=False)

Use the policy model to sample actions.

Parameters:
- obs (paddle tensor) – observation, shape [B] + shape of obs_n[agent_index]
- use_target_model (bool) – whether to use the target model

Returns: action, shape [B] + shape of act_n[agent_index]; note that in the discrete case the argmax along the last axis is taken as the action
Return type: act (paddle tensor)
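One common way to reconcile exploratory sampling with "argmax along the last axis" is the Gumbel-max trick: adding Gumbel noise to the logits before the argmax draws a sample from the softmax distribution over actions. This is a sketch of that general scheme, an assumption about the exploration mechanism rather than parl's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logits for one observation over 3 discrete actions.
logits = np.array([0.2, 1.0, -0.5])

# Gumbel-max trick (assumed scheme): argmax over noisy logits samples from
# softmax(logits), so low-probability actions are still explored occasionally.
gumbel = -np.log(-np.log(rng.uniform(size=(1000, 3))))
acts = (logits + gumbel).argmax(axis=-1)

# Empirical action frequencies approach softmax(logits) ~ [0.27, 0.60, 0.13].
freqs = np.bincount(acts, minlength=3) / 1000.0
```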
sync_target(decay=None)

Update the target network with the training network.

Parameters: decay (float) – decay factor when updating the target network from the training network. A decay of 0 performs a direct assignment; None performs a slow (soft) update governed by the hyperparameter tau.
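The soft update described above can be sketched per weight tensor in plain numpy (illustrative; the actual update runs over the paddle model's parameters, and the decay used when None is passed presumably derives from tau):

```python
import numpy as np

def soft_update(target_w, source_w, decay):
    """Blend target weights toward source weights.

    decay == 0.0 reduces to a direct assignment (target <- source);
    a decay close to 1.0 moves the target only slightly per call.
    """
    return decay * target_w + (1.0 - decay) * source_w

target = np.array([1.0, 1.0])
source = np.array([0.0, 2.0])

# decay = 0: hard copy of the training network.
print(soft_update(target, source, 0.0))   # [0. 2.]
# decay = 0.99: a slow, stable drift toward the training network.
print(soft_update(target, source, 0.99))  # [0.99 1.01]
```

Keeping the target network close to a slowly moving average of the training network stabilizes the bootstrapped TD targets used by the critic.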