Getting Started

Goal of this tutorial:

  • Understand PARL’s abstraction at a high level
  • Train an agent to solve the CartPole problem with the Policy Gradient algorithm

This tutorial assumes basic familiarity with the policy gradient algorithm.


First, let’s build a Model that predicts an action given an observation. As an object-oriented programming framework, PARL builds models on top of parl.Model: subclass it and implement the forward function.

Here, we construct a neural network with two fully connected layers.

import parl
from parl import layers

class CartpoleModel(parl.Model):
    def __init__(self, act_dim):
        hid1_size = act_dim * 10

        self.fc1 = layers.fc(size=hid1_size, act='tanh')
        self.fc2 = layers.fc(size=act_dim, act='softmax')

    def forward(self, obs):
        out = self.fc1(obs)
        out = self.fc2(out)
        return out
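To make the forward pass concrete, here is a framework-free numpy sketch of the same computation: a tanh hidden layer of size act_dim * 10 followed by a softmax over actions. The weights here are random placeholders standing in for the fc layers' parameters; this is only an illustration, not PARL's API.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# CartPole: 4-dimensional observation, 2 discrete actions
obs_dim, act_dim = 4, 2
hid1_size = act_dim * 10  # same sizing rule as the model above

rng = np.random.default_rng(0)
W1 = rng.normal(size=(obs_dim, hid1_size))   # stands in for fc1's weights
W2 = rng.normal(size=(hid1_size, act_dim))   # stands in for fc2's weights

def forward(obs):
    out = np.tanh(obs @ W1)     # fc1: fully connected + tanh
    return softmax(out @ W2)    # fc2: fully connected + softmax

probs = forward(np.ones(obs_dim))
```

Whatever the input, the softmax output is a valid probability distribution over the two actions, which is what the sample and predict steps below rely on.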


An Algorithm updates the parameters of the model passed to it; in general, the loss function is defined inside the Algorithm. In this tutorial, we solve the CartPole benchmark with the Policy Gradient algorithm, which has already been implemented in our repository, so we can simply import it from parl.algorithms.

We have also published various algorithms in PARL; please visit this page for more details. If you want to implement a new algorithm, please follow this tutorial.

model = CartpoleModel(act_dim=2)
algorithm = parl.algorithms.PolicyGradient(model, lr=1e-3)

Note that each algorithm should implement two functions:

  • learn

    updates the model’s parameters given transition data

  • predict

    predicts an action given the current environment state.
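For intuition about what learn optimizes in the policy gradient case: the loss is the negative log-probability of the actions actually taken, weighted by their returns. A framework-free numpy sketch of that loss (the function name and batch layout here are illustrative, not PARL's API):

```python
import numpy as np

def policy_gradient_loss(act_prob, act, reward):
    """REINFORCE-style loss: -mean(log pi(a|s) * return).

    act_prob : (batch, act_dim) action probabilities from the model
    act      : (batch,) integer actions that were taken
    reward   : (batch,) returns (e.g. discounted, normalized rewards)
    """
    # pick out the probability of each taken action
    log_prob = np.log(act_prob[np.arange(len(act)), act])
    return -np.mean(log_prob * reward)

# two transitions where the taken action had probability 0.5 each time
act_prob = np.array([[0.5, 0.5], [0.5, 0.5]])
loss = policy_gradient_loss(act_prob, np.array([0, 1]), np.array([1.0, 1.0]))
# loss == -log(0.5) = log(2)
```

Minimizing this loss increases the probability of actions that led to high returns, which is exactly the gradient step the Algorithm applies to the model's parameters.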


Now we pass the algorithm to an agent, which interacts with the environment to generate training data. Users should build their agents on top of parl.Agent and implement four functions:

  • build_program

    defines the fluid programs. In general, two programs are built here: one for prediction and the other for training.

  • learn

    preprocesses transition data and feeds it into the training program.

  • predict

    feeds the current environment state into the prediction program and returns the action to execute.

  • sample

    samples an action given the current state; usually used for exploration during training.

import numpy as np
import paddle.fluid as fluid

class CartpoleAgent(parl.Agent):
    def __init__(self, algorithm, obs_dim, act_dim):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        super(CartpoleAgent, self).__init__(algorithm)

    def build_program(self):
        self.pred_program = fluid.Program()
        self.train_program = fluid.Program()

        with fluid.program_guard(self.pred_program):
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            self.act_prob = self.alg.predict(obs)

        with fluid.program_guard(self.train_program):
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            act = layers.data(name='act', shape=[1], dtype='int64')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            self.cost = self.alg.learn(obs, act, reward)

    def sample(self, obs):
        obs = np.expand_dims(obs, axis=0)
        act_prob = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.act_prob])[0]
        act_prob = np.squeeze(act_prob, axis=0)
        act = np.random.choice(range(self.act_dim), p=act_prob)
        return act

    def predict(self, obs):
        obs = np.expand_dims(obs, axis=0)
        act_prob = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.act_prob])[0]
        act_prob = np.squeeze(act_prob, axis=0)
        act = np.argmax(act_prob)
        return act

    def learn(self, obs, act, reward):
        act = np.expand_dims(act, axis=-1)
        feed = {
            'obs': obs.astype('float32'),
            'act': act.astype('int64'),
            'reward': reward.astype('float32')
        }
        cost = self.fluid_executor.run(
            self.train_program, feed=feed, fetch_list=[self.cost])[0]
        return cost

Start Training

First, let’s build an agent. As shown in the code below, we build a model, then an algorithm, and finally an agent.

model = CartpoleModel(act_dim=2)
alg = parl.algorithms.PolicyGradient(model, lr=1e-3)
agent = CartpoleAgent(alg, obs_dim=OBS_DIM, act_dim=2)

Then we use this agent to interact with the environment and train for around 1000 episodes, after which the agent can solve the problem.

def run_episode(env, agent, train_or_test='train'):
    obs_list, action_list, reward_list = [], [], []
    obs = env.reset()
    while True:
        obs_list.append(obs)
        if train_or_test == 'train':
            action = agent.sample(obs)
        else:
            action = agent.predict(obs)
        action_list.append(action)

        obs, reward, done, info = env.step(action)
        reward_list.append(reward)

        if done:
            break
    return obs_list, action_list, reward_list

import gym
from parl.utils import logger

env = gym.make("CartPole-v0")
for i in range(1000):
    obs_list, action_list, reward_list = run_episode(env, agent)
    if i % 10 == 0:
        logger.info("Episode {}, Reward Sum {}.".format(i, sum(reward_list)))

    batch_obs = np.array(obs_list)
    batch_action = np.array(action_list)
    batch_reward = calc_discount_norm_reward(reward_list, GAMMA)

    agent.learn(batch_obs, batch_action, batch_reward)
    if (i + 1) % 100 == 0:
        _, _, reward_list = run_episode(env, agent, train_or_test='test')
        total_reward = np.sum(reward_list)
        logger.info('Test reward: {}'.format(total_reward))
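The helper calc_discount_norm_reward and the constant GAMMA come from the full training script linked at the end of this tutorial. As a minimal sketch of what such a helper conventionally computes for REINFORCE (discounted returns, normalized to zero mean and unit variance; the exact implementation in the script may differ):

```python
import numpy as np

def calc_discount_norm_reward(reward_list, gamma):
    # discounted cumulative return, accumulated backwards through the episode
    discounted = np.zeros(len(reward_list), dtype='float32')
    running = 0.0
    for i in reversed(range(len(reward_list))):
        running = reward_list[i] + gamma * running
        discounted[i] = running
    # normalize to zero mean and unit variance to stabilize training
    return (discounted - discounted.mean()) / discounted.std()

# with gamma=1.0 and rewards [1, 1, 1], the raw returns are [3, 2, 1]
returns = calc_discount_norm_reward([1.0, 1.0, 1.0], 1.0)
```

Normalizing the returns keeps the scale of the policy gradient loss stable across episodes of different lengths.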



In this tutorial, we have shown how to build an agent step by step to solve the CartPole problem.

The complete training code can be found here. Give it a quick try by running a few commands:

# Install dependencies
pip install paddlepaddle

pip install gym
git clone
pip install .

# Train model
cd examples/QuickStart/