Navigating in Gridworld using Policy and Value Iteration

Learn how to use policy evaluation, policy iteration, and value iteration to find the shortest path in gridworld.

Navigating in Gridworld using Policy and Value Iteration

In reinforcement learning, we are interested in identifying a policy that maximizes the obtained reward. Assuming a perfect model of the environment as a Markov decision process (MDPs), we can apply dynamic programming methods to solve reinforcement learning problems.

In this post, I present three dynamic programming algorithms that can be used in the context of MDPs. To make these concepts more understandable, I implemented the algorithms in the context of a gridworld, which is a popular example for demonstrating reinforcement learning.

Before we being with the application, I would like to quickly provide the theoretical background that is required for the following work on the gridworld.

Key Reinforcement Learning Terms for MDPs

The following sections explain the key terms of reinforcement learning, namely:

  • Policy: Which actions the agent should execute in which state
  • State-value function: The expected value of each state with regard to future rewards
  • Action-value function: The expected value of performing a specific action in a specific state with regard to future rewards
  • Transition probability: The probability to transition from one state to another
  • Reward function: The reward that the agent obtains when transitioning between states


A policy, \(\pi(s,a)\), determines the probability of executing action \(a\) in state \(s\). In deterministic environments, a policy directly maps from states to actions.

State-Value Function

Given a policy \(\pi\), the state-value function \(V^{\pi}(s)\) maps each state \(s\) to the expected return that the agent can obtain in this state:

\[V^{\pi}(s) = E_{\pi} \{R_t | s_t = s\} = E_{\pi} \{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} | s_t = s\}\]

In the formula, \(s_t\) indicates the state at time \(t\). The parameter \(\gamma \in [0,1]\) is called the discount factor. It determines the impact of rewards in the future. If we set \(\gamma = 1\), this indicates that we are sure about the future because we do not have to discount future rewards. For \(\gamma < 1\), we take the uncertainty about the future into account and give greater weight to more recently earned rewards.

Action-Value Function

Given a policy \(\pi\), the action-value function \(Q{^\pi}(s,a)\) determines the expected reward when executing action \(a\) in state \(s\):

\[Q^{\pi}(s,a) = E_{\pi}\{R_t | s_t = s, a_t = a\} = E_{\pi} \{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} | s_t = s, a_t = a\} \]

Transition Probability

Executing an action \(a\) in state \(s\) may transition the agent to state \(s'\). The probability that this transition takes place is described by \(P_{ss'}^a\).

Reward Function

The reward function, \(R_{ss'}^a\), specifies the reward that is obtained when the agent transitions from state \(s\) to state \(s'\) via action \(a\).

Demonstration of Three Basic MDP Algorithms in Gridworld

In this post, you will learn how to apply three algorithms for MDPs in a gridworld:

  1. Policy Evaluation: Given a policy \(\pi\), what is the value function associated with \(\pi\)?
  2. Policy Iteration: Given a policy \(\pi\), how can we find the optimal policy \(\pi^{\ast}\)?
  3. Value Iteration: How can we find an optimal policy \(\pi^{\ast}\) from scratch?

In gridworld, the goal of the agent is to reach a specified location in the grid. The agent can either go north, go east, go south, or go west. These actions are represented by the set : \(\{N, E, S, W\}\). Note that the agent knows the state (i.e. its location in the grid) at all times.

To make life a bit harder, there are walls in the grid through which the agent cannot pass. Rather, when trying to move through a wall, the agent remains in its position. For example, when the agent chooses action \(N\) and there is a wall to the north of the agent, the agent will remain in its place.

Basic Gridworld Implementation

I’ve implemented the gridworld in an object-oriented manner. The following sections describe how I designed the code for the map and the policy entities.

Representing the Gridworld Map

To implement gridworld, the first thing I started with was a class for representing the map. I defined the following format to represent individual grid cells:

  • # indicate walls
  • X indicates the goal
  • Blanks indicate empty tiles

Relying on these symbols, I constructed the following map, which is used in the following work:

#                X#
#   ###########   #
#   #         #   #
#   # ####### #   #
#   # #     # #   #
#     #  #  #     #
#        #        #
#                 #
###             ###
#                 #

The map-related entities of the code are structured like this:

I implemented MapParser, which generates a Map object. The map object controls the access to the cells of the gridworld. Individual cell subclasses define the behavior of specific cells such as empty cells, walls, and the goal cell. Each cell can be identified using its row and column index.

With this setup, a map can be conveniently loaded:

parser = MapParser()
gridMap = parser.parseMap("../data/map01.grid")

Loading a Policy

For reinforcement learning, we need to be able to deal with a policy, \(\pi(s,a)\). In gridworld, each state \(s\) represents a position of the agent. The actions move the agent one step into one of the four geographic directions. We will use the following symbols to map a policy onto a map:

  • N for the action GO_NORTH
  • E for the action GO_EAST
  • S for the action GO_SOUTH
  • W for the action GO_WEST

Unknown symbols are mapped to a NONE action in order to obtain a complete policy.

Using these definitions, I defined the following policy:


Note that the policy file retains the walls and the goal cell for better readability. The policy was written with two goals in mind:

  1. The agent should be able to reach the goal. With a policy where this property is not fulfilled, policy evaluation will not give a reasonable result because the reward at the goal will never be earned.
  2. The policy should be suboptimal. This means that there are states where the agent doesn’t take the shortest path to the goal. Such a policy allows us to see the effect of algorithms that try to improve upon the initial policy.

To load the policy, I implemented a policy parser, which stores the policy as a policy object. Using these objects, we can load our initial policy like so:

policyParser = PolicyParser()
policy = policyParser.parsePolicy("../data/map01.policy")

The policy object has a function for modeling \(\pi(s,a)\):

def pi(self, cell, action):
    if len(self.policy) == 0:
        # no policy: try all actions (for value iteration)
        return 1

    if self.policyActionForCell(cell) == action:
        # policy allows this action
        return  1
        # policy forbids this action
        return 0

Preparations for Reinforcement Learning

To prepare the implementation of the reinforcement learning algorithms, we still need to provide a transition and a reward function.

Transition Function

To define the transition function \(P_{ss'}^a\), we first need to to differentiate between illegal and legal actions. In gridworld, there are two ways that an action can be illegal:

  1. If the action would take the agent off the grid
  2. If the action would move the agent into a wall

The gives us the first rule for the transition function:

1. When an action is illegal, the agent should remain in its previous state.

Additionally, we must also require:

2. When an action is illegal, transitions to states other than its previous state should be forbidden.

So what about legal actions? Of course, state transition must be valid with regard to the chosen action. Since each action moves the agent only a single position, a proposed state \(s'\) must have the agent in a cell that is adjacent to the one in state \(s\):

3. Only allow transitions through actions that would lead the agent to an adjacent cell.

For this rule, we assume that there is a predicate \(adj(s,s')\) to indicate whether the agent’s transition from \(s\) to \(s'\) involved adjacent cells.

Finally, once we have reached the goal state, \(s^{\ast}\), we don’t want the agent to move away again. To stipulate this, we introduce the final rule:

4. Don't transition from the goal cell.

Based on these four rules, we can define the transition function as follows:

\[P_{ss'}^a = \begin{cases} 1 & \forall \quad \text{illegal action with $s = s'$} & \text{Rule 1} \\ 0 & \forall \quad \text{illegal action with $s \neq s'$} &\text{Rule 2} \\ 1 & \forall \quad \text{legal action with adj(s,s')} & \text{Rule 3} \\ 0 & \forall \quad \text{if $s = s^{\ast}$} & \text{Rule 4} \end{cases} \]

The Python implementation provided by getTransitionProbability is a not as clear-cut as the mathematical formulation, but I will provide it nonetheless:

def transitionProbabilityForIllegalMoves(oldState, newState):
    if newState == oldState:
        # Rule 1: stay at oldState if action is illegal
        return 1
        # Rule 2: dont transition to 'newState' if action is illegal
        return 0

def getTransitionProbability(self, oldState, newState, action, gridWorld):
    proposedCell = gridWorld.proposeMove(action)
    if proposedCell is None:
        # Rule 1 and 2: illegal move 
        return transitionProbabilityForIllegalMoves(oldState, newState)
    if oldState.isGoal():
        # Rule 4: stay at goal
        return 0
    if proposedCell != newState:
        # Rule 3: move not possible
        return 0
        # Rule 3: move possible
        return 1

Note that proposeMove simulates the successful execution of an action and returns the new grid cell of the agent.

Reward Function

In gridworld, we want to find the shortest path to the terminal state. We want to maximize the obtained rewards, so the reward at the goal state, \(s^{\ast}\) should be higher than the reward at the other states. For the gridworld, we will use the following simple function:

\[R_{ss'}^a = \begin{cases} -1 & \forall s' \neq s^{\ast} \\ 0 & \forall s' = s^{\ast} \end{cases} \]

The Python implementation is given by

def R(self, oldState, newState, action):
        # reward for state transition from oldState to newState via action
        if newState and newState.isGoal():
            return 0
            return -1

Policy Evaluation

The goal of the policy evaluation algorithm is to evaluate a policy \(\pi(s,a)\), that is, to determine the value of all states in terms of \(V(s) \forall s\). The algorithm is based on the Bellman equation: \[V_{k+1}(s) = \sum_{a} \pi(s,a) \sum_{s'} P_{ss'}^a [R_{ss'}^a + \gamma V^{\pi}_k(s')] \] For iteration \(k+1\), the equation yields the value of state \(s\) via:

  • \(\pi(s,a)\): the probability of choosing action \(a\) in state \(s\)
  • \(P_{ss'}^a\): the probability of transitioning from state \(s\) to state \(s'\) using action \(a\)
  • \(R_{ss'}^a\): the expected reward when transitioning from state \(s\) to state \(s'\) using action \(a\)
  • \(\gamma\): the discount rate
  • \(V^{\pi}_k(s')\): the value of state \(s'\) in step \(k\), given the policy \(\pi\)

For a better understanding of the equations, let’s consider it piece by piece, in the context of gridworld:

  • \(\pi(s,a)\): Since we’re in a deterministic environment, a policy only specifies a single action, \(a\), with \(\pi(s,a) = 1\), while all other actions, \(a'\), have \(\pi(s,a') = 0\). So, the multiplication by \(\pi(s,a)\) just selects the action that is specified by the policy.
  • \(\sum_{s'}\): This sum is over all states, \(s'\), that can be reached from the current state \(s\). In gridworld, we merely need to consider adjacent cells and the current cell itself, i.e. \(s' \in \{x | adj(x,s) \lor x = s\}\).
  • \(P_{ss'}^a\): This is the probability of transitioning from state \(s\) to \(s'\) via action \(a\).
  • \(R_{ss'}^a\): This is the reward for the transition from \(s\) to \(s'\) via \(a\). Note that in gridworld, the reward is merely determined by the next state, \(s'\).
  • \(\gamma\): The discounting factor modulates the impact of the expected reward.
  • \(V_{k}(s')\): The expected reward at the proposed state, \(s'\). The presence of this term is why policy evaluation is dynamic programming: we are using previously computed values to update the current value.

We will use \(\gamma = 1\) because we are in an episodic setting where learning episodes stop when we reach the goal state. Because of this, the value function represents the length of the shortest path to the goal cell. More, precisely, let \(d(s,s^{\ast})\) indicate the shortest path from state \(s\) to the goal. Then, \(V^{\pi}(s) = -d(s, s^{\ast}) + 1\) for \(s \neq s^{\ast}\).

To implement policy evaluation, we will typically perform multiple sweeps of the state space. Each sweep requires the value function from the previous iteration. The difference between the new and old value function is often used as a stopping criterion for the algorithm:

def findConvergedCells(self, V_old, V_new, theta = 0.01):
    diff = abs(V_old-V_new)
    return np.where(diff < theta)[0]

The function determines the indices of grid cells whose value function difference was less than \(\theta\). When the values of all states have converged to a stable value, we can stop. Since this is not always the case (e.g. if the policy specifies states with actions that do not lead to the goal or when the transition probabilities/rewards are configured inappropriately), we also specify a maximal number of iterations.

Once the stopping criterion is reached, evaluatePolicy returns the latest state-value function:

def evaluatePolicy(self, gridWorld, gamma = 1):
    if len(self.policy) != len(gridWorld.getCells()):
        # sanity check whether policy matches dimension of gridWorld
        raise Exception("Policy dimension doesn't fit gridworld dimension.")
    maxIterations = 500
    V_old = None
    V_new = initValues(gridWorld) # sets value of 0 for viable cells
    iter = 0
    convergedCellIndices = np.zeros(0)
    while len(ConvergedCellIndices) != len(V_new):
        V_old = V_new
        iter += 1
        V_new = self.evaluatePolicySweep(gridWorld, V_old, gamma, convergedCellIndices)
        convergedCellIndices = self.findConvergedCells(V_old, V_new)
        if iter > maxIterations:
    print("Policy evaluation converged after iteration: " + str(iter))
    return V_new

Once sweep of policy evaluation is performed by the evaluatePolicySweep function. The function iterates over all cells in the grid and determines the new value of the state:

def evaluatePolicySweep(self, gridWorld, V_old, gamma, ignoreCellIndices):
    V = initValues(gridWorld)
    # evaluate policy for every state (i.e. for every viable actor position)
    for (i,cell) in enumerate(gridWorld.getCells()):
        if np.any(ignoreCellIndices == i):
            V[i] = V_old[i]
            if cell.canBeEntered():
                V_s = self.evaluatePolicyForState(gridWorld, V_old, gamma)
                V[i] = V_s
    return V

Note that the ignoreCellIndices parameter represents the indices of cells for which subsequent sweeps didn’t change the value function. These cells are ignored in further iterations to improve the performance. This is fine for our gridworld example because we are just interested in finding the shortest path. So, the first time that a state’s value function doesn’t change, this is its optimal value.

The value of a state is computed using the evaluatePolicyForState function. At its core, the function implements the equations from Bellman that we talked about earlier. An important idea for this function is that we do not want to scan all states \(s'\) when computing the value function for state \(s\). This is why the state generator generates only those states that can actually occur (i.e. have a transition probability greater than zero).

def evaluatePolicyForState(self, gridWorld, V_old, gamma):
    V = 0
    cell = gridWorld.getActorCell()
    stateGen = StateGenerator()
    transitionRewards = [-np.inf] * len(Actions)
    for (i, actionType) in enumerate(Actions):
        gridWorld.setActor(cell) # set state
        actionProb = self.pi(cell, actionType)
        if actionProb == 0 or actionType == Actions.NONE:
        newStates = stateGen.generateState(gridWorld, actionType, cell)
        transitionReward = 0
        for newActorCell in newStates:
            V_newState = V_old[newActorCell.getIndex()]
            # Bellman equation
            newStateReward = getTransitionProbability(cell, newActorCell, actionType, gridWorld) *\
                                (R(cell, newActorCell, actionType) +\
                                gamma * V_newState)
            transitionReward += newStateReward
        transitionRewards[i] = transitionReward
        V_a = actionProb * transitionReward
        V += V_a
    if len(self.policy) == 0:
        V = max(transitionRewards)
    return V

Results from Policy Evaluation

With the implementation in place, we can find the state-value function of our policy by executing:

V = policy.evaluatePolicy(gridMap)

To draw the value function together with the policy, we can use pyplot from matplotlib after converting the 1-D array that is used to represent the map to a 2D-array:

def drawValueFunction(V, gridWorld, policy):
    fig, ax = plt.subplots()
    im = ax.imshow(np.reshape(V, (-1, gridWorld.getWidth())))
    for cell in gridWorld.getCells():
        p = cell.getCoords()
        i = cell.getIndex()
        if not cell.isGoal():
            text = ax.text(p[1], p[0], str(policy[i]),
                       ha="center", va="center", color="w")
        if cell.isGoal():
            text = ax.text(p[1], p[0], "X",
                       ha="center", va="center", color="w")

Using the function, we can visualize the state-value function of our policy:

For non-goal cells, the plot is annotated with the actions specified by the policy. The goal in the upmost right cell is indicated by the X label. Walls (infinite value) are shown in white.

The value of the others cells is indicated by color. The worst states (with the lowest reward) are shown in purple, bad states in blue, intermediate states in turquois, good states in green, and very good states (with the highest reward) are shown in yellow.

Looking at the values, we can see that the results match the actions that are dictated by the policy. For example, the state directly to the west of the goal has a very low value because this state’s action (GO_WEST) leads to a long detour. The cell directly south of the goal has a very high value because its action (GO_NORTH) leads directly to the goal.

Note that, for our future work, the performance of evaluatePolicy is of critical concern because we will call it many times. For the computed example, the function requires 61 iterations, which translates to roughly half a second on my laptop. Note that the policy evaluation will require fewer iterations for policies that are closer to the optimal policy because values will propagate faster.

Being able to determine the state-value function is nice - now we can quantify the merit of a proposed policy is. However, we haven’t yet dealt with the problem of finding an optimal policy. This is where policy iteration comes into play.

Policy Iteration

Now that we are able to compute the state-value function, we should be able to improve an existing policy. A simple strategy for this is a greedy algorithm that iterates over all the cells in the grid and then chooses the action that maximizes the expected reward according to the value function.

This approach implicitly determines the action-value function, which is defined as

\[Q^{\pi}(s,a) = \sum_{s'} P_{ss'}^a [R_{ss'}^a + \gamma V^{\pi}(s')] \]

The improvePolicy function determines the value function of a policy (if it’s not available yet) and then calls findGreedyPolicy to identify the optimal action for every state:

def improvePolicy(policy, gridWorld, gamma = 1):
    policy = copy.deepcopy(policy) # dont modify old policy
    if len(policy.values) == 0:
        # policy needs to be evaluated first
    greedyPolicy = findGreedyPolicy(policy.getValues(), gridWorld, \
                    policy.gameLogic, gamma)
    return policy

def findGreedyPolicy(values, gridWorld, gameLogic, gamma = 1):
    # create a greedy policy based on the values param
    stateGen = StateGenerator()
    greedyPolicy = [Action(Actions.NONE)] * len(values)
    for (i, cell) in enumerate(gridWorld.getCells()):
        if not cell.canBeEntered():
        maxPair = (Actions.NONE, -np.inf)
        for actionType in Actions:
            if actionType == Actions.NONE:
            proposedCell = gridWorld.proposeMove(actionType)
            if proposedCell is None:
                # action is nonsensical in this state
            Q = 0.0 # action-value function
            proposedStates = stateGen.generateState(gridWorld, actionType, cell)
            for proposedState in proposedStates:
                actorPos = proposedState.getIndex()
                transitionProb = gameLogic.getTransitionProbability(cell, proposedState, actionType, gridWorld)
                reward = gameLogic.R(cell, proposedState, actionType)
                expectedValue = transitionProb * (reward + gamma * values[actorPos])
                Q += expectedValue
            if Q > maxPair[1]:
                maxPair = (actionType, Q)
        gridWorld.unsetActor(cell) # reset state
        greedyPolicy[i] = Action(maxPair[0])
    return greedyPolicy

What findGreedyPolicy does is to consider each cell and to select that action maximizing its expected reward, thereby constructing an improved version of the input policy. For example, after executing improvePolicy once and re-evaluating the policy, we obtain the following result:

In comparison to the original value function, all cells next to the goal are giving us a high reward now because the actions have been optimized. However, we can see that these improvements are merely local. So, how can we obtain an optimal policy?

The idea of the policy iteration algorithm is that we can find the optimal policy by iteratively evaluating the state-value function of the new policy and to improve this policy using the greedy algorithm until we’ve reached the optimum:

def policyIteration(policy, gridWorld):
    lastPolicy = copy.deepcopy(policy)
    lastPolicy.resetValues() # reset values to force re-evaluation of policy
    improvedPolicy = None
    while True:
        improvedPolicy = improvePolicy(lastPolicy, gridWorld)
        if improvedPolicy == lastPolicy:
        improvedPolicy.resetValues() # to force re-evaluation of values on next run
        lastPolicy = improvedPolicy

Results from Policy Iteration

Running the algorithm on the gridworld leads to the optimal solution within 20 iterations - about 4,5 seconds on my notebook. The termination after 20 iterations doesn’t come as a surprise: the width of the gridworld map is 19. So we need 19 iterations to optimize the values of the horizontal corridor. Then, we need one additional iteration to determine that the algorithm can terminate because the policy hasn’t changed.

A great tool for understanding policy iteration is by visualizing each iteration:

The following figure shows the optimal value function that has been constructed using policy iteration:

Visual inspection shows that the value function is correct, as it chooses the shortest path for each cell in the grid.

Value Iteration

With the tools we have explored until now, a new question arises: why do we need to consider an initial policy at all? The idea of the value iteration algorithm is that we can compute the value function without a policy. Instead of letting the policy, \(\pi\), dictate which actions are selected, we will select those actions that maximize the expected reward:

\[V_{k+1}(s) = \max_{a} \sum_{s'} P_{ss'}^a [R_{ss'}^a + \gamma V_k(s')] \]

Because the computations for value iteration are very similar to policy evaluation, I have already implemented the functionality for doing value iteration into the evaluatePolicyForState method that I defined earlier. I marked the relevant lines with >:

def evaluatePolicyForState(self, gridWorld, V_old, gamma):
    V = 0
    cell = gridWorld.getActorCell()
    stateGen = StateGenerator()
    transitionRewards = [-np.inf] * len(Actions)
    for (i, actionType) in enumerate(Actions):
        gridWorld.setActor(cell) # set state
        actionProb = self.pi(cell, actionType)
        if actionProb == 0 or actionType == Actions.NONE:
        newStates = stateGen.generateState(gridWorld, actionType, cell)
        transitionReward = 0
        for newActorCell in newStates:
            V_newState = V_old[newActorCell.getIndex()]
            # Bellman equation
            newStateReward = getTransitionProbability(cell, newActorCell, actionType, gridWorld) *\
                                (R(cell, newActorCell, actionType) +\
                                gamma * V_newState)
            transitionReward += newStateReward
>        transitionRewards[i] = transitionReward
        V_a = actionProb * transitionReward
        V += V_a
>   if len(self.policy) == 0:
>       V = max(transitionRewards)
    return V

This function performs the value iteration algorithm as long as no policy is available. In this case, len(self.policy) will be zero such that pi always returns a value of one and such that V is determined as the maximum over the expected rewards for all actions.

So, to implement value iteration, we don’t have to do a lot of coding. We just have to iteratively call the evaluatePolicySweep function on a Policy object whose value function is unknown until the procedure gives us the optimal result. Then, to determine the corresponding policy, we merely call the findGreedyPolicy function that we’ve defined earlier:

def valueIteration(self, gridWorld, gamma = 1):
    self.resetPolicy() # ensure empty policy before calling evaluatePolicy
    V_old = None
    V_new = np.repeat(0, gridWorld.size())
    convergedCellIndices = np.zeros(0)
    while len(convergedCellIndices) != len(V_new):
        V_old = V_new
        V_new = self.evaluatePolicySweep(gridWorld, V_old, gamma, convergedCellIndices)
        convergedCellIndices = self.findConvergedCells(V_old, V_new)
    greedyPolicy = findGreedyPolicy(V_new, gridWorld, self.gameLogic)

Results from Value Iteration

How does value iteration perform? For our gridworld example, only 25 iterations are necessary and the result is available within less than half a second. Remember that this is roughly the same time that was needed to do a single run of evaluatePolicy for our badly designed initial policy. The reason why value iteration is much faster than policy iteration is that we immediately select the optimal action rather than cycling between the policy evaluation and policy improvement steps.

When performing value iteration, the reward (high: yellow, low: dark) spreads from the terminal state at the goal (top right X) to the other states:


We have seen how reinforcement learning can be applied in the context of MDPs. We worked under the assumption that we have total knowledge of the environment and that the agent is fully aware of the environment. Based on this, we were able to facilitate dynamic programming to solve three problems. First, we used policy evaluation to determine the state-value function for a given policy. Next, we applied the policy iteration algorithm to optimize an existing policy. Third, we applied value iteration to find an optimal policy from scratch.

Since these algorithms require perfect knowledge of the environment, it’s not necessary for the agent to actually interact with the environment. Instead, we were able to precompute the optimal value function, which is also known as planning. Still, it is crucial to understand these concepts in order to understand reinforcement learning in settings with greater uncertainty, which will be the topic of another blog post.