https://www.youtube.com/watch?v=UwjvpYrCUZ0&ab_channel=TimMiller transcript on Markov Decision Processes (MDPs) and value iteration.
In this section we're going to look at Markov decision processes. Markov decision processes are models of sequential decision making, otherwise known as planning. We're going to look at a definition of a Markov decision process, we're going to look at what solutions for Markov decision processes are, and we'll investigate a key approach to solving Markov decision processes using dynamic programming, known as value iteration. Let's get started. There are five learning outcomes for this part. The first is to be able to define what a Markov decision process is. The second is to identify situations in which Markov decision processes are a suitable model of a problem. The third is to explain how Bellman equations are solutions to MDP problems. The fourth is the ability to apply value iteration to problems, in particular to manually apply value iteration to small problems and to implement value iteration to solve larger problems. And finally, the ability to extract policies from value functions. So what exactly is a Markov decision process? Before we look at a formal definition, let's just get a bit of intuition via this example. In this example we can see a grid that an agent has to navigate. In the bottom left hand corner we have the agent itself. In the top right corner we have the goal, given in green; the agent will get a score of one if it reaches the goal. Next to the goal we have a square in red which the agent has to avoid, say falling in a hole, and it gets a score of negative one if it falls in. The aim is for the agent to navigate from the bottom left, up and across (or across and up), in order to get to the top right and avoid falling in the hole. However, there's a catch to the problem. If the agent tries to go in a particular direction, let's say it goes up here, that will be successful with a probability of 0.8. However, there's a probability of 0.1 that if it tries to go up it will go right instead, and a probability of 0.1 that it will go left, at which point it will bounce off the wall and end up back in the same place. In any of these cells, if the agent tries to go in a certain direction like this, it'll be successful with probability 0.8, and it will slip right or left with probability 0.1 each. As such, when the agent takes an action, there's uncertainty about which the next state will be. So unlike in, for example, classical planning, where an agent that takes an action in a particular state can predict what the next state will be, in a Markov decision process there's uncertainty about which state that will be. This gives us the ability to model uncertainty in outcomes when we apply actions. There are other examples that might be useful as well. For example, what if we try to schedule something on a cloud server and we're unsure whether we'll be added to the queue or we'll be able to execute it straight away? A second example might be a robot working in a disaster scenario, where it's going around trying to detect whether there are dangerous substances on the ground. As it takes a sample and tries to measure that sample, it's unsure which substance it might detect: there might be, say, five or six different dangerous substances, and it's not sure which one of those it's going to detect a priori, but it might have some estimate of probability about which ones are likely to be in the environment. So this ability to model uncertainty is what separates Markov decision processes from, for example, classical search problems.
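To make that slip model concrete, here is a tiny sketch of how the outcome distribution for one intended move might be written down in Python. The `move_up_outcomes` dictionary and the (column, row) coordinate convention are illustrative assumptions, not something defined in the video.

```python
# Hypothetical outcome distribution for trying to move "up" from cell (0, 0)
# in the grid world described above: 0.8 intended, 0.1 slip to each side.
move_up_outcomes = {
    (0, 1): 0.8,  # moves up as intended
    (1, 0): 0.1,  # slips to the right
    (0, 0): 0.1,  # slips left, bounces off the wall, stays put
}

# The probabilities of all outcomes of one action must sum to one.
assert abs(sum(move_up_outcomes.values()) - 1.0) < 1e-9
```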
Now let's look at a more formal definition of a Markov decision process, and when we do this we're going to compare it to classical planning, or classical search. They're quite similar, but they have a few differences that are important to point out. In classical planning and in Markov decision processes we have a set of states S, which represents all the possible states of the world that we could possibly be in. We have an initial state, which we'll call s0, which is one of those states, and it's the state we're in at the start; in the grid world example we saw earlier, it's the bottom left hand corner. We have a set of actions A that are available to the agent, and we use the syntax A(s) to represent all the actions that are available in state s. In classical planning we have a transition function, which tells us that given a state s, if we apply action a we will end up in state s'. However, in Markov decision processes, remember, there's uncertainty about where we will end up, so we don't have a transition function like this; instead we have transition probabilities. These probabilities are expressed using the notation of conditional probability (you can find the background on conditional probability in the appendix of the notes). What this notation says is that if we apply action a in state s, then the probability that we'll end up in state s' is given by the expression P(s' | s, a). For any given state and action, if we sum up the probabilities over all possible outcome states, they must sum to one. This is how we define the uncertainty of action outcomes: we can have several outcome states where this probability is greater than zero. In classical planning we have a set of goal states, and all of our actions have costs associated with them. In Markov decision processes this is replaced with a reward function. A reward function gives us some reward, a positive or negative real number, that we receive if we transition from state s using action a and end up in state s'. In the grid world example, if we're in cell (2, 2) and we transition right into the cell in the top right hand corner, we'll get a reward of one; alternatively, if we transition into the state below that, we'll get a reward of negative one. So what we can actually say here is that these two points, the goals and the action costs, are somewhat subsumed by the reward function: if we have a goal we want to reach, we should give it a positive reward, and for action costs we can give a negative reward every time we take an action that doesn't lead to a goal. Markov decision processes have one other element, which is a discount factor gamma. The discount factor effectively allows us to prefer shorter plans over longer plans; we'll go into a bit more detail about that in a minute. What you need to know, though, is that a Markov decision process consists of these things: a set of states, an initial state, a set of actions, transition probabilities, a reward function and a discount factor.
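As a concrete reference, here is a minimal sketch of how those components might be packaged in Python. The class name `MDP`, the use of dictionaries for the transition and reward functions, and the grid-cell state type are illustrative assumptions rather than code from the video.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]   # grid cells, e.g. (column, row)
Action = str              # "up", "down", "left", "right"

@dataclass
class MDP:
    states: List[State]                                           # S
    initial_state: State                                          # s0
    actions: Dict[State, List[Action]]                            # A(s)
    transitions: Dict[Tuple[State, Action], Dict[State, float]]   # P(s' | s, a)
    rewards: Dict[Tuple[State, Action, State], float]             # r(s, a, s')
    gamma: float                                                   # discount factor

    def P(self, s: State, a: Action, s_next: State) -> float:
        """Probability of ending up in s_next after doing a in s (0 if unlisted)."""
        return self.transitions.get((s, a), {}).get(s_next, 0.0)

    def R(self, s: State, a: Action, s_next: State) -> float:
        """Reward for the transition s --a--> s_next (0 if unspecified)."""
        return self.rewards.get((s, a, s_next), 0.0)
```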
So let's go into a bit more detail about this idea of a discount factor. Recall from the previous slide that we have a discount factor gamma between 0.0 and 1.0. This discount factor models an assumption that we value rewards sooner rather than later, so we value rewards now more than we value rewards in the future. For example, if I promised to give you a hundred dollars, would you prefer it today, in a week, or in a year? You'd probably select to have it today; if you couldn't get it today, you'd like to have it in a week, and otherwise you'd then have it in a year. The idea here is that, as people, we tend to value rewards sooner rather than later. However, there's also a good technical reason, as we'll learn in this subject, why we might want to have a discount factor. So let's think about how it works. If we use the letter G to represent the rewards for an entire trace, then our rewards would look a little bit like this. Imagine we received a reward at step one, which we'll call reward one; we would get that reward. Then we received another reward at step two, after our second action. What we would do is multiply that reward by gamma, our discount factor. If gamma was strictly less than one, that would mean we'd value that reward slightly less than the reward at step one. At the next action we would multiply our reward, reward three, by gamma squared; again, we value that even less than we value reward two, and so on. At step four it would be cubed, etc. So in this case, if we received a reward of five at step one, then we receive that reward of five. If we received a reward of, for example, ten at step two and our discount factor was 0.9, we would receive a reward of nine rather than ten, and similarly for the other ones. What we can actually do is factor out this discount factor, like so, and we can see a pattern emerging, such that at each time step the reward we will receive, which we'll call G_t for the reward at time step t, is going to be equal to the reward that we receive at that time, plus the future reward we receive at time step t+1 and onwards, multiplied by the discount factor. So that is the idea of a discounted future reward: the discounted future reward G_t is the reward we receive now, plus any rewards we receive in the future, the rest of the plan, multiplied by our discount factor.
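Written out, the discounted return being described looks like this, using r_t for the reward received at step t; this is a standard rendering of the verbal description above rather than a verbatim copy of the slide.

```latex
G = r_1 + \gamma r_2 + \gamma^2 r_3 + \gamma^3 r_4 + \cdots
  = r_1 + \gamma \left( r_2 + \gamma r_3 + \gamma^2 r_4 + \cdots \right),
\qquad \text{so in general} \qquad
G_t = r_t + \gamma \, G_{t+1}.
```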
So what does the solution to an MDP look like? A solution to an MDP is known as a policy. A policy is similar to the idea of a plan. A plan is a sequence of actions that we can draw out that tells us exactly what actions we're going to take. However, remember, with an MDP the outcomes of our actions are uncertain, which means that we can't draw our plan as a sequence, because as we take that first action we don't know which state we're going to end up in, and depending on which state we end up in, that affects which action we'll choose next. So what does a policy look like? Well, a policy, similar to a plan, simply says that in a particular state, this is the action we're going to take. So if we look at our grid world example, we could represent our policy graphically by saying in this state we're going to take the action up, in this state here we might take the action left, in this state here we might take the action down, and in this state here we might take the action right. The important thing is that for every state we have to say which action we're going to take. For the grid world problem this is a complete policy: for every state it defines which action we will take. It's not a very good policy at the moment, as you can see; it doesn't go towards the goal. We'll look at what an optimal policy looks like in a second. The important thing to note with a policy, though, is that if we take the action that corresponds to our policy, and instead of going right, with the 0.1 slip probability it ends up transitioning to this state here, we can still recover: we know to take the next action, which is to go up. On the other hand, if we had just defined this as a sequence of actions, the plan would tell us to go left even though we had gone up. So our policy is more robust to these uncertainties than a plan. For this particular Markov decision problem, the optimal policy looks like this: from the bottom left we go up and across, and if we're in this state here we go around. As you can see, it tries to avoid going into the hole; instead of going straight up here, it goes around. We're going to study several techniques that take a description of a Markov decision process and produce a policy. The problem of reinforcement learning is to take a Markov decision process and produce the best policy we can. What does the best policy mean? We'll define that formally soon, but for now let's just have a look at two different types of policy that are important for MDPs. In MDPs there are two main types of policies: a deterministic policy and a stochastic policy. The policy that you saw on the previous slide is a deterministic policy. Deterministic simply means that in every state it gives us exactly one action that we should take. Let's define that a bit more formally. We're going to use the character pi to represent our policy, and pi is defined as a function from states to actions; a complete policy will define this for every possible state in the MDP. So with pi, what happens is we supply our policy with a state and it returns the action that we should apply in that state. It returns exactly one action, which makes it deterministic. The other type of policy is a stochastic policy. A stochastic policy doesn't give us one action; it gives us several actions with a probability distribution over those actions. Formally it's defined as pi, but it takes two parameters: it takes pairs of states and actions and returns a real number somewhere between zero and one, which gives the probability of selecting action a in state s. This is what makes it stochastic. We're not going to go into details now of when you would use a deterministic versus a stochastic policy, but later on we'll use stochastic policies in order to help us search for good solutions. The important thing to remember for now is that a deterministic policy takes a state and tells us which action to take, and a stochastic policy gives us a probability distribution over actions, and we should sample the action based on that probability distribution. Policies aren't usually graphical like the example we saw in the grid world, where we drew arrows on an image in order to represent a policy; they're represented in different ways. Here's a really simple way to represent a deterministic policy: just use something like a dictionary data structure. All we do is take the state, which in this case would be the agent at coordinates (0, 0), the bottom left coordinates of grid world, and map it to the action we should take in that state: move up. If we're at (2, 2), which is the state just to the left of the goal, it's going to tell us to move right. Later on in the subject we'll look at more sophisticated versions of policies that are actual functions that we learn using machine learning; in those cases we can represent the policy much more succinctly than we can here. For the grid world example a dictionary works just fine, but when we have a massive number of states, for example an infinite number of states, we can't represent it using a dictionary any longer. For now, just think of a policy as a mapping from a state to an action.
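Here's roughly what that dictionary representation might look like in Python, together with a stochastic counterpart; the specific coordinates and the nested-dictionary layout for the stochastic policy are illustrative assumptions.

```python
# Deterministic policy: state -> action (a partial sketch of a grid world policy).
deterministic_policy = {
    (0, 0): "up",
    (0, 1): "up",
    (0, 2): "right",
    (1, 2): "right",
    (2, 2): "right",   # the state just to the left of the goal
}

# Stochastic policy: state -> probability distribution over actions.
stochastic_policy = {
    (0, 0): {"up": 0.8, "right": 0.2},
    (2, 2): {"right": 1.0},
}

def act(state):
    """Look up the deterministic policy: pi(s) returns exactly one action."""
    return deterministic_policy[state]

print(act((0, 0)))  # -> "up"
```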
Next, we're going to define what an optimal solution is for an MDP, but before we do that, let's just revisit an idea that you might be familiar with, called the expected return; a very similar concept, known as expected reward, is important for MDPs. We're going to do it via a short quiz. Imagine that you've broken into a warehouse and you can steal a box of iPhones or a box of Samsung phones, and you're going to try to sell them. You can only carry one box, so you have to decide which one you steal: the iPhones or the Samsungs. For the iPhones, you believe there's about a 20% chance you can sell them each for five hundred dollars, or an 80% chance you can sell them each for 250 dollars; for now let's just assume that we're going to sell the entire box at the same price. Similarly, for the Samsungs, you believe there's about a 50% chance that you'll sell them each for 500, or a 50% chance that you'll sell them for two hundred dollars. Which box should you take if you want to maximize your expected return: the iPhones or the Samsungs? Okay, so how did you go? Did you choose the iPhones or the Samsungs? I'm going to show you how to briefly calculate this, just to give you a refresher on expected return. An expected return is simply taking all the possible options, taking the return that we get from each option, multiplying it by its probability, and summing these together. So, for example, in the iPhone case we have a 20% chance that we would sell the iPhones at 500, so we take 0.2 and multiply that by the 500. Then we add this to the 0.8, the 80% chance, that we'd sell the phones for 250, and the total that we'd receive for these two together would be three hundred dollars; I'll leave you to do the math for yourself. For the Samsungs we do a similar calculation: 0.5 times 500 plus 0.5 times 200. If we do the calculations correctly, we find that the expected return is 350 dollars. So in this case it's better to take the Samsungs and try to sell them, because our expected return is going to be higher: 350 instead of 300.
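If you want to check the arithmetic, a short helper like the following (a hypothetical `expected_return` function, not from the video) reproduces the two numbers:

```python
def expected_return(outcomes):
    """Sum of (probability * return) over all possible outcomes."""
    return sum(prob * value for prob, value in outcomes)

iphone  = expected_return([(0.2, 500), (0.8, 250)])   # 0.2*500 + 0.8*250 = 300
samsung = expected_return([(0.5, 500), (0.5, 200)])   # 0.5*500 + 0.5*200 = 350

print(iphone, samsung)  # 300.0 350.0 -> take the Samsungs
```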
Now, that's not going to be our actual return. It might turn out that we take the Samsungs and sell them for 200, when, if we had taken the iPhones, we would have sold them for five hundred dollars. But by multiplying by our probabilities we get our expected return. This concept of expected return is very similar to the concept of expected reward that we're going to use for MDPs, so it's important that we understand it. So let's discuss a little bit about what a good solution to an MDP should look like. What we want is a sequence of actions that maximizes the cumulative reward we would receive over it, in other words, that maximizes the sum of the rewards that we'd receive over that trace. However, there are two catches. Firstly, those rewards are discounted; remember that rewards in the future are worth less than rewards closer to the present, so what we want to do is maximize our discounted accumulated reward. But also remember that our actions have uncertainty in their outcomes, so we're not sure which sequence of actions we might choose in the future, because it depends on the outcomes of actions. What we want to do, then, is maximize our expected cumulative reward over that trace, or more precisely, our expected discounted accumulated reward. Let's try to formalize this notion. First, recall that we have a policy that we're trying to find, a policy that takes a state s and gives us back the action that we're going to select in state s. We're going to assume for now that our policies are deterministic, not stochastic. The solution to an MDP should try to maximize the rewards that we receive, and we're going to use the notation V^pi(s) to denote the value that policy pi can give us from state s. What do we mean by value? Well, let's define that a little bit more precisely as well. First, let's just assume we have no discount factor and no uncertainty in our action outcomes. What does the accumulated reward look like? Well, for a policy pi, at each step, which we will call i, we receive a reward for taking an action from a state and resulting in a different state; we'll say we receive the reward r(s_i, a_i, s_{i+1}) for transitioning from state s_i to state s_{i+1} using action a_i,
where that action a_i is the action given to us by our policy, like so. So what this says is that at step i we receive the reward defined by the Markov decision process for selecting the action that our policy tells us to select. But we have a sequence of actions, and what we want to do is sum up all of the rewards that we receive over that sequence; that's called our accumulated reward. We'll just use our summation operator and say that we sum over every step i in a trajectory of our MDP. So far that gives us our accumulated reward for this sequence as defined by policy pi, but remember, we have a discount factor in our MDP, so we have to discount rewards that are in the future. We do this simply by saying that at each step we multiply the reward by gamma to the power i; therefore rewards further in the future are worth less than rewards that are closer. So what we have now is our accumulated discounted reward for a trajectory given to us by our policy pi. Now let's relax the assumption that our actions are deterministic and reconsider the fact that our actions have uncertainty in their outcomes. What we have now is multiple such trajectories that our policy can give us. Remember, in our grid world example, when we try to go up we might go right instead; at that stage we've got the start of two different possible trajectories. We therefore want to maximize the expected reward over all those possible trajectories for policy pi. We model this using the expectation function E, and put the brackets all the way around this. What this means, then, is that if we sum up the rewards over every possible trajectory, multiply that by the probability that that trajectory would occur, and take the expectation of those, we end up with the expected reward of policy pi from state s. We define this as the value of the policy from state s. To solve an MDP, what we want to do is find the policy pi that maximizes V^pi, because it maximizes the value from the initial state s0 of our MDP. So let's recap what we've got. What we have is the ability to calculate the accumulated reward of a trace given by our policy pi, and if we assume that we have uncertainty in our action outcomes, we take the expected cumulative reward, discounted by our discount factor. The job of solving an MDP is to find the policy pi that maximizes this value. If we want to calculate the value of a policy pi, one way to do that is to look at every possible sequence of actions that's possible under pi and simply sum up the expected return from those. However, this is infeasible for even the smallest examples: if we look at grid world on the right here, which is tiny by realistic standards, we can see that the agent could simply iterate around this loop an infinite number of times, and therefore we have an infinite number of possible trajectories. Next up, we'll look at a more principled way to define the value of a policy from a state without having to sample all possible trajectories of an MDP.
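Putting the pieces of that description together, the value of a policy from a state can be written as follows; this is a standard way of writing the expected discounted return the video describes, so the exact symbols may differ slightly from the slides.

```latex
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\; \sum_{i=0}^{\infty} \gamma^{\,i}\, r\big(s_i, a_i, s_{i+1}\big) \;\middle|\; s_0 = s,\; a_i = \pi(s_i) \right]
```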
Okay, so what is an optimal solution to an MDP? An optimal solution to an MDP is defined by the Bellman equation. This is named after Richard E. Bellman, who came up with the idea of dynamic programming, and dynamic programming is one of the approaches we're going to use to solve MDPs as well. The Bellman equation defines the value V of every state s. If we have the value of every possible state, then the idea is that we can use that value in order to select the actions that maximize the value of the next state in our plan. How do we define the value of a state? Well, let's think a little bit about what we have to do. The value of a state is going to be based on the actions that we can take from that state. If there are good actions from that state that lead us to good rewards, then we're going to receive a good value for that state; if all the actions are bad, then that's obviously going to be a bad state. Let's think a little bit about how we might define the value of an action from a state. Okay, so thinking back to what happens when we select an action, what do we receive? Remember our discounted rewards: we receive the reward of that step plus a discounted future reward, so I'm going to define that first. We're going to receive the reward r(s, a, s') if we transition from state s to state s' using action a. So we select action a, and then we receive our discounted future reward, which we're going to call, for now, just G_{t+1}. That's what we get if we select action a and it transitions to state s'. Remember, however, that the action isn't necessarily going to go to that state, because the action outcomes are uncertain. So if we want to figure out what the expected return of that action is, we have to multiply by the probability that that actually happens. For this we use our transition probabilities defined by our MDP. Remember, this expression comes from our definition of an MDP, and it's like our transition function, but it's a probabilistic transition function. So here what we have is: if we select action a and it takes us from state s to state s', then take the probability that that would happen and multiply it by our discounted future reward. So far that makes sense, right? We select an action and it takes us to a state, and the probability of it taking us to that state is sometimes less than one, so we multiply the reward that we receive by that probability. We have to remember that if we select action a, there are multiple possible outcomes, and we don't know which one is going to happen before we execute the action. Similar to the idea of expected return, we can define the expected reward of an action by summing up the outcomes of that action multiplied by the probability that they will occur. Now, we already have the outcome of the action, the reward we receive, multiplied by the probability, so all we need to do is sum over the different possible outcomes that can occur. So we're going to use our summation operator sigma, which simply means add all of these together, and we're going to sum over every possible state in our system. So what this says is: if we take action a, look at every possible state in our MDP, take the probability that we transition to that state if we select action a from s, and multiply that by the discounted reward that we'd receive if we did. Now, this may seem a little bit, let's say, computationally difficult, to sum over every state; in the case of the grid world we can't move from coordinate (0, 0) in the bottom left up to the goal state straight away with one action. However, we're just defining what this is here; we're not computing it with the Bellman equation yet. So in that case, the probability that we move from the bottom left coordinate to the top right coordinate would be zero, and effectively we wouldn't receive any reward for it, because we couldn't do it. But for now we'll just say that we're going to go over every possible state.
When you compute it, it's possible to look at only the states that you could actually transition to. So let's just think about this now; we'll break it down a little bit. Here we have our discounted reward, or discounted future reward, if we selected action a and ended up in state s', and this greater expression here is simply the expected reward of choosing action a: we look at all the possible outcomes that could happen if we execute action a, we take the probabilities that those outcomes will occur and multiply them by the discounted reward, then we sum all those together, and we get the expected reward of executing action a. So if we're in state s and we did this for every action that was possible from state s, we would have the expected reward for each of those actions. Like I said earlier, the value of the state is going to depend on how good the actions are from that state. But how would we define precisely the value of that state? Well, Bellman defines it quite simply and just says it's the best action from that state: we take the action that maximizes our expected reward, and the value of that action is the value of the state. This is how Bellman defines the value of the state: take the action that maximizes our expected reward. If we had this for every possible state in the system, then we could effectively trace a plan through that tries to follow the high-reward states. However, there's a catch here. At this point I've defined G_{t+1} as being the future reward that we receive from state s', but how can we possibly estimate what that is? It's telling us the discounted future reward over a particular plan that we would take, but we don't know a priori which plan is going to be selected, so we can't use this value here. What can we use? Well, we can use the value of the state s', that is V(s'), which is defined itself by the Bellman equation. So now, when we talk about the discounted future reward, it's the reward we receive immediately, plus the value, discounted by gamma, of the state s' that we end up in, that we transition to. So we can see here that the Bellman equation is recursive: the value of state s can depend on other states such as s', but also the value of s' can depend on the value of s, so we have this recursive definition. It turns out this is part of what's tricky in solving MDPs, but we'll look at different techniques throughout this subject on how to solve it using various methods.
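Collecting all of that into one line, the Bellman equation being described is the following (again, a standard rendering of the verbal description rather than a verbatim copy of the slide):

```latex
V(s) \;=\; \max_{a \in A(s)} \; \sum_{s' \in S} P\big(s' \mid s, a\big)\,\Big[\, r(s, a, s') \;+\; \gamma\, V(s') \,\Big]
```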
Now let's look at an algorithm for solving MDPs. First off, we're going to look at value iteration. We choose value iteration because there's a very simple mapping from Bellman equations to value iteration; it's perhaps the most straightforward approach we can use to solve MDPs. It has some weaknesses that we'll discuss later, but value iteration is a very simple algorithm that applies the Bellman equation iteratively until it converges on a solution. Let's have a look. Here we have the algorithm for value iteration on the slide. I'm going to ignore some of the steps in the algorithm for now and come back to them later. The input to the value iteration algorithm is a Markov decision process; the output is a value function of the type that we've seen earlier. Value iteration does the following steps. First, as we can see on the first line, it initializes the value function to some arbitrary set of values for every state; let's assume for now that that initial value is zero, but the actual value doesn't matter. All we do now is repeat the following steps until we reach what we call convergence (we'll define what convergence means in a little bit; for now we're going to skip this line and look at what happens inside the iteration). In each iteration we go through every possible state, and all we do is apply the Bellman equation to that state: we define V'(s) to be the value of the action that maximizes the expected return at that state s. This is a simple application of the Bellman equation. In this case, though, because of the recursive nature of the Bellman equation, what we're actually doing is creating a new value function V' that uses the previous value function, V. Once we've done that for every possible state, at the end we simply say that V becomes V', and we then iterate over the loop again. This is what's known as dynamic programming: we're using the solution of the previous iteration to help solve the new iteration, which iteratively improves our solution as we go. We continue doing this until we reach convergence. What does convergence actually mean? Well, in theory, convergence means that we go around this loop over and over until the values in V and V' are the same at the end of a loop, in other words, until we get no change in the values of states. Once we reach a situation where V and V' are the same, any new iterations won't change the values in V; at that point we've converged to what's called an optimal solution, and we have an optimal value function. However, in practice that could be an infinitely long process: the values can get better and better, but only by very small amounts. We might be able to solve the problem quite well by saying that we'll terminate when the values in the value function change by just a little bit; we define that little bit as this parameter theta. When the values change by only a small amount, we think we're probably close enough to the optimal solution, and therefore we can terminate our iterations. How do we calculate whether we're there or not? Well, at the start of every loop we define delta as zero. Delta is what we use to represent the change between V and V'. For each state we calculate the difference between the new value and the old value, which is how much the value of state s has changed between the two iterations, and we take the absolute value of that. We use the max function to find the state with the biggest change in every iteration: because we go through every state, what we want to know is which state has changed the most and how big that delta is. At the end of the iteration, if that delta is less than our threshold, we end up with a converged solution within an error related to theta. So if we think about this as applied to the grid world example, at each iteration we go through each reachable state in our problem and apply the Bellman equation to it; at each point we calculate the value of that state and assign it to V'. Once we've done that for every state, we update V to be V' and we repeat the process, until the largest change across these states is less than theta. If we set theta to 0, then we simply do it until there's no change at all; however, pragmatically, we tend to find that we can solve the problem quite well with theta being a small value. But what's important to note here is that value iteration is simply applying the Bellman equation repeatedly until we get close enough to a solution.
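Here is a compact sketch of that algorithm in Python. It assumes the kind of dictionary-based MDP representation sketched earlier: lookups `P(s, a, s2)` and `R(s, a, s2)` and an action set `A(s)`. Those names are my assumptions, not code shown in the video.

```python
def value_iteration(states, A, P, R, gamma, theta=1e-6):
    """Repeatedly apply the Bellman equation until the largest change is below theta.

    states: list of states
    A(s):   actions available in state s
    P(s, a, s2): probability of reaching s2 when doing a in s
    R(s, a, s2): reward for that transition
    """
    V = {s: 0.0 for s in states}          # arbitrary initial value function
    while True:
        delta = 0.0
        V_new = {}
        for s in states:
            if not A(s):                  # no actions available: keep the previous value
                V_new[s] = V[s]
                continue
            # Bellman update: best expected return over all actions, using the OLD V
            V_new[s] = max(
                sum(P(s, a, s2) * (R(s, a, s2) + gamma * V[s2]) for s2 in states)
                for a in A(s)
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:                 # converged (to within theta)
            return V
```

Notice that the update for each state reads only from the old value function V, never from V_new; that detail matters in the worked quiz example below.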
Now let's apply the value iteration algorithm to the grid world example to illustrate how it works. At the top here I've got the Bellman equation, which I'm going to apply at each iteration to each state, and on the right here I've got the example we're going to apply it to. We're going to look at iteration two. After iteration one, if we initialize all the values to zero, we'll end up with this top right state, the green one, having a value of one, and this red state having a value of -1, and all the others will still be zero; you can verify this for yourself if you want. So we're going to look at what happens in that second iteration. In each iteration we have to go through every state and find the value of that state based on the previous values that we have. I'm going to look at only one state in detail, which is this state here, next to our goal state. Why am I going to look at that one? Because I know that it's the best state for this example. I'll briefly go over some of the other states, but we'll look at that one in detail. Now, what we have to do for this state is find the action that maximizes our expected future reward. There are four possible actions: up, down, left and right. To help illustrate what's going on, I'm going to divide this state into four sections based on the four different actions that are possible, left, right, up and down, and I'll fill those in with our values as we go. So in the second iteration we have the values of these states here, and what we're going to try to do is calculate the value of this state. I'm going to go through the four different actions. First, if we're in this state, what happens if we take the action right? Now, we can all see that's the best action, but a priori we wouldn't know this if we were using value iteration on a larger example, so let's calculate it anyway. If we take the action right from this state, we have to calculate what the future expected discounted reward is. For the right action we have three possible outcomes to think about: it's successful if it goes right, otherwise it slips to either side of the direction of travel, which would be slipping up or slipping down in this example. I'm going to try to line these values up with the corresponding column under here. Now, remember, if we try to go right we have a 0.8 chance of this action actually being successful, and if we go right, what we'll receive is our reward of zero initially, plus our discount factor, which we're assuming here is 0.9, multiplied by the value of the state that we transition to. Remember, this is iteration two, and at iteration one the value of that state is one, so that's multiplied by one. That's the return we would get if we selected right and we were successful. However, we have to remember that we might go up or down if we try to go right, so there are two other outcomes. If we slip and accidentally go up, which is a 0.1 chance, we get no reward, plus the discount factor of 0.9 times the value of the state we bounce back into; at the moment the value of that state is zero, so this is all zero. Similarly, if we slip down, which is a 0.1 chance, we get no reward plus our discount factor times the value of the state below, which is zero as well. So what's the expected return if we sum all these up? Well, if we multiply 0.8 by 0.9 and the value of one, we get 0.72, and clearly the
value of the other two terms is going to be zero; we've got zero plus zero, multiplied by 0.1. So the value of our state, the maximum value, is going to be 0.72. Let's fill in our value here as 0.72; that's the value of going right. However, when we're running the value iteration algorithm we don't know that that's necessarily going to be the best answer, so let's look at what happens if we evaluate the other actions as well. First we'll do the action going down. Again, when we go down it's going to be successful with a probability of 0.8; going down doesn't receive any particular reward, and then our discount factor of 0.9 is multiplied by the value of the state below if it's successful, and the value of that state is again zero. However, if we try to go down, remember, we might slip to either side. If we slip to one side, going this way, away from the goal, we get 0.1 probability, no reward, and our discount of 0.9 times the value of this state, which is zero, because we haven't learned any values for it yet. But we also might slip to the other side, towards the goal. If we slip that way, with probability 0.1, we get no reward plus our discounted future reward, discounted by 0.9, and the value we would get here would be one. So we can calculate our expected reward for selecting the action down by simply summing up these three terms. The first two terms are clearly going to be zero, and the last term is going to be 0.9 multiplied by one, multiplied by 0.1, so our answer is 0.09. Let's fill that in: 0.09. Without going into additional detail, we know that if we do the up action, we could go up with probability 0.8, in which case we bounce off the wall and end up back in this state; we might slip left with a 0.1 chance, in which case we get zero; or we might slip right with a 0.1 chance, in which case we get a discounted future reward of one, discounted by 0.9. Therefore that's going to be the same as going down, at least in this iteration. And finally, if we go left: with 0.8 we get a return of 0, with 0.1 we get 0, and with 0.1 we get zero again, so the value of going left is simply going to be zero. Okay, so let's fill in those two values. What we now have in this top right corner is the expected return of each of the actions if we chose them. As we knew intuitively, going right is the best action; we can see it gives us value 0.72, which is clearly the action that maximizes the expected future reward. Therefore, in this case, V(s) for that state would be 0.72. Once we've calculated all of them, effectively what we do is replace the entry in our value function with the value that we get here. Now, if we went through and applied this to every possible state in our MDP on iteration two, what we'd see is that these states around here would all still have a value of zero, simply because the value function of all the surrounding states is zero, so there's no reward to get. However, we now have 0.72 for this state here, and on the third iteration we'll see that the state to its left, if we go right, now gets some future discounted reward defined by V. What we see is that our value function is slowly going to learn good values for this problem.
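The calculation for that one cell can be written out directly. The snippet below reproduces the arithmetic above for the state next to the goal, using the iteration-one values (goal = 1, hole = -1, everything else 0); the cell labels and outcome dictionaries are my own shorthand for the slip model described earlier.

```python
gamma = 0.9
V1 = {"goal": 1.0, "hole": -1.0, "cell_2_2": 0.0, "cell_1_2": 0.0, "cell_2_1": 0.0}

# Outcome distributions for each action from the cell next to the goal, (2, 2):
# 0.8 intended direction, 0.1 slip to each side (bouncing off a wall stays put).
outcomes = {
    "right": {"goal": 0.8, "cell_2_2": 0.1, "cell_2_1": 0.1},
    "down":  {"cell_2_1": 0.8, "cell_1_2": 0.1, "goal": 0.1},
    "up":    {"cell_2_2": 0.8, "cell_1_2": 0.1, "goal": 0.1},
    "left":  {"cell_1_2": 0.8, "cell_2_2": 0.1, "cell_2_1": 0.1},
}

# Expected return of each action: sum over outcomes of P * (reward 0 + gamma * V1).
q = {a: sum(p * (0.0 + gamma * V1[s2]) for s2, p in dist.items())
     for a, dist in outcomes.items()}

print(q)                # right ~ 0.72, down ~ 0.09, up ~ 0.09, left = 0.0
print(max(q.values()))  # ~ 0.72 -> the new value of this cell at iteration two
```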
Okay, so now it's your turn to try this. I've got a short quiz here for you to try out. At the top I've still got the Bellman equation for your reference, but the question is: if V(s) = 0 for every state apart from the goal and the hole, and we're still at iteration two, what is the value V for the state (2, 1)? We've already shown you what the value for the state (2, 2) is going to be; now it's your turn to apply your understanding of value iteration and see if you can calculate this new value. Okay, so how did you go? Did you get the right answer? Let's go through, have a look at the answer you've got, and compare it to the answer that I get. So we're looking at this state here, and we have to think about the four possible actions. I'm going to do what I did before and draw the four possible actions on here, and we'll quickly go through the calculation and see if your answer comes out the same. If we evaluate the action right, what we get is 0.8, which is the probability that it's successful, multiplied by the reward that we get, which is 0, plus the discounted future reward, 0.9 times the value that we get, which in this case is going to be -1. We then sum this with the 0.1 term if we accidentally slip up and the 0.1 term if we accidentally slip down; we know those are going to be zero, because there's no value here and no value here to be gotten, so it's 0.1 times zero and 0.1 times zero as well. So what's the expected return going to be for that action, given the uncertainty in the action outcomes? Well, it looks pretty much like the mirror image of the one we had before: it's going to be negative 0.72. So let's fill that in: -0.72. Next, let's do the down action. The down action, again, with 0.8 probability is going to be successful, giving us no immediate reward, plus 0.9 times the value of the state down here, which is zero. Plus there's a 0.1 probability, if we try to go down, that it accidentally slips into the hole, which gives us no immediate reward plus a discounted future reward of 0.9 times -1. Plus there's a 0.1 chance that it slips to the other side, in which case it bounces into the wall and remains in this state: 0.9 times 0, which is simply zero. So the first expression and the last expression are both zero, and we only have to look at this middle expression, which is the -0.9 discounted future reward times 0.1, giving -0.09. We can see that if we use the up action we're going to similarly get -0.09. And if we use the left action, all that happens is that with 0.8 we end up back in this state ourselves, with 0.1 we can go down and with 0.1 we can go up, which all give us zero, so left is going to be zero. So, therefore, what is the solution to the Bellman equation here, given that we want the action that maximizes our value? Well, the maximum value for this particular state is still going to be zero, from the left action, so the answer to our question is zero. Is that the answer that you got? If not, let's look at why you might not have got the right answer. Okay, I've filled in this right hand side here with the different values. You might have answered negative 0.72. That intuitively seems like the right answer, because 0.72 was the value of the state next to the goal, and we got a lot of that value by going right into the good state; we can imagine that in the symmetric problem here, where we're going right into the red state, the value of that action would be -0.72. If you thought that, however, the one thing you missed was that we need to take the action that maximizes the future expected return. The action that maximizes the future expected return when we're in this state here is left, which is zero; that's not the same as the action that maximizes the future return in the state above, which is right. This is simply because there's a negative reward here and
a positive reward there, so that might have caught you out. If you said the answer was something along the lines of 0.528, you may have forgotten that we don't update the value function V' based on other values in V'; we update it based on V, which is the value function we got in the previous iteration. So in that case, when you evaluated, for example, the action up, you might have thought that 0.72 was the value of V(s'), but in iteration two the value of that state is still zero in V, even though it's 0.72 in V'. Don't forget that. And finally, you might have said that the answer is perhaps negative 0.09. Intuitively, what you might have said is: we know that the best action is to go up, therefore I'll choose the value of the up action from this state. And it's true that in the final policy the best action is to go up; however, at this particular point in time, at iteration two of value iteration, we don't yet know that, and in fact going up is not yet the best action. We will see later on that, as the algorithm converges towards the optimal solution, that does indeed become the best action, but it's not the case in iteration two. Okay, now that we've looked at an example manually, let's look at how value iteration converges towards the optimal solution, using the implementation and visualization from the notes. As we can see here, we're starting at iteration one, where the value of the top right state is 1 and the value of the state underneath it, the hole, is -1. We're going to step through and look at multiple iterations of value iteration. So if we step through here, after the next iteration we'll see that our value function is indeed 0.72 in the state next to the top right, which is what we manually calculated, and also that at this stage the state below it is still zero. If we step to the next iteration, we see that the two states on either side of the state we've just updated also receive some value; this is because the expression V(s'), the future expected reward, now has some value to reach from those states, via the one we've updated. We also see that the state next to the top right has converged a little bit further towards the optimal solution, going from 0.72 to 0.78. As we step through, we see that the value starts to spread from the rewards that we receive here out to the remaining states, and if we step through a little bit further, these values get closer and closer to the optimal values. If we just press play here and watch it iterate through, we can see that the values are slowly updating, but after around about 15 or 20 iterations we don't see any change. What's actually happening is that there is change going on underneath, but the changes are so small that they don't show up at two decimal places. So let's just pause it here and look back at what happened at the start. Let's watch this again and note how the rewards spread out from the state at the top right into the other states. In this situation we see why the discount factor gamma becomes useful. If we didn't have the discount factor gamma, each of these states would eventually converge to the value one, because when our discount factor is one we're saying that we don't value future rewards less than current rewards, and therefore it doesn't matter how long it takes us to get to the reward in the top right corner, as long as we get there at some point. This isn't very useful, because it doesn't give us any signal to go on: at each state we look around and we can't figure out which is the best state to go to.
So for this reason it's important that gamma is less than one in situations like this. We could model it slightly differently by giving a small negative reward every time we select an action that doesn't go into the goal state; this would have the same effect, in that it would encourage shorter paths rather than longer paths. The last thing to note here is that if theta, our error margin, is greater than zero, we're going to terminate after a certain number of iterations, let's say 10 iterations. At that point our value function won't be optimal; however, it might be good enough to select the optimal policy. If we have a look at our value function after 10 iterations, we can see that if we just followed the values from each state and picked the highest-value states, we would indeed pick out the optimal path towards the goal, and if we found ourselves in this state accidentally, we would follow the optimal path towards the goal from there. So despite the fact that, if we step forward to 100 iterations, our value function is closer to optimal after 100 iterations, the policy that we get from it is going to be exactly the same. Now let's look at the time complexity of the value iteration algorithm. We can see the algorithm here on the slide in front of us. For now we're just going to ignore the outer loop, the actual iteration part of the algorithm that repeats until we converge, and just analyze the body of a single iteration. First, we can see that in each iteration we have to loop over every state in S, so this loop is going to execute |S| times, where S is the set of states. For each one of these states, the main operation is the Bellman equation update, so we have to look at the complexity of calculating that expression. First, we have to find the best action, so we have to evaluate every single action; this is therefore of complexity |A|, where we evaluate each action. Then, for every action, we have to evaluate the expected return if we took that action, over every possible outcome; in the worst case we have to evaluate every possible state. Therefore the complexity of the Bellman equation update is the number of actions multiplied by the number of states, and we have to do this |S| times, so the complexity of a single iteration is O(|S|² × |A|). So far, then, what we have is the complexity of one iteration. Now what we have to think about is how many iterations this loop is going to run for. Well, that's the 64 million dollar question. The number of iterations, which we're going to call N, is simply the number of iterations to convergence, and there are three factors that influence how many iterations we need. The first is largely the number of states: as we saw in the convergence visualization, the rewards sort of propagate out from the reward states into the other states that have no reward, so the first factor is the number of states. The second factor is the value of theta: clearly, a very small theta leaves less room for error and is therefore going to require more iterations until we reach it, while a higher value of theta will terminate earlier. And the third is gamma, the discount factor. Why does this play a part? Well, intuitively, you can think of it like this: if gamma is very high, that means we have to search further into the future for rewards, and that might play a part. On the other hand, if gamma is very low, for example zero, we don't have to look into the future at all to find out what the future rewards are. Therefore gamma can play some part in how many iterations we require in value iteration. So what this really means is that the entire complexity here can be summarized as the number of states squared, multiplied by the number of actions, multiplied by the number of iterations.
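In symbols, that works out to the following (using N for the number of iterations to convergence, as in the video):

```latex
\underbrace{O\big(|S|^2 \cdot |A|\big)}_{\text{cost of one iteration}} \;\times\; N
\;=\; O\big(N \cdot |S|^2 \cdot |A|\big)
```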
Okay, so if we use value iteration, or in fact some of the other methods we'll talk about later, how do we use this information in order to decide how to act? Well, we can use a technique known as policy extraction, which allows us to extract a policy from a value function. So if we think about a policy, remember that we have the policy pi that we're going to extract, and we want to be able to extract it for each state s; we're assuming here that the policy is deterministic. What action should we select? Intuitively, what we have to do is select the action that maximizes our expected reward. What's that going to be? Well, it's more or less another application of the Bellman equation. What we have to select is the action, which we'll write as argmax over the actions a available from s: argmax gives us not the maximum value but the action that achieves that value, as opposed to max, which is the maximum value. So we select the action that maximizes this expression. This looks a lot like our Bellman equation; the only difference is that instead of selecting the value that maximizes it, we select the actual action. Therefore the policy is defined as: at state s, take the action that maximizes the expected return. Wait, wait, wait: we've just done value iteration, and then to extract the policy we do another iteration of value iteration? What new information does this give us? Why don't we just choose the action that takes us to the best state? Great question. The reason is that we learn a value function, so we just know the value of states. We could choose the action that leads to the best state, but remember, we don't know if that action will actually result in that state. What if the best state has, say, a low probability of occurring? So you mean that if we choose the action that leads to the best state, the probability of going to that state may be so low that the value of another action is in fact better? Exactly. Why don't we construct an example together? Okay, so we have two actions, A and B, each with two outcomes. Action A has outcome s with probability 0.1 and a value of one hundred, and also has outcome s' with probability 0.9 and value 0. Then action B has an outcome t with probability 0.5 and a value of 50, and an outcome t' with probability 0.5 and value 90. Good. Right, so action A leads to the best state, the one with value 100. What happens if we apply policy extraction? Okay, so we just calculate the expected reward, as in policy extraction: action A's expected reward is 0.1 times one hundred plus 0.9 times zero, which is ten; action B's is 0.5 times 50 plus 0.5 times 90, which is 70. Therefore action B is better. Good. So what do we learn from that? Well, we learn that because we have the values of the states in our value function, but we don't have the probabilities of those states occurring, we need to calculate the expected reward again using policy extraction. Perfect.
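As a code sketch, policy extraction is one more sweep of the same Bellman-style lookahead, keeping the argmax action rather than the max value. As before, the function and parameter names are my own assumptions on top of the dictionary-style MDP used in the earlier sketches.

```python
def extract_policy(states, A, P, R, gamma, V):
    """Greedy policy with respect to a value function V: pi(s) = argmax_a Q(s, a)."""
    def q(s, a):
        # One-step lookahead: expected immediate reward plus discounted value of s'.
        return sum(P(s, a, s2) * (R(s, a, s2) + gamma * V[s2]) for s2 in states)

    return {s: max(A(s), key=lambda a: q(s, a)) for s in states if A(s)}
```

Running something like `value_iteration` first and then `extract_policy` on its output should recover the kind of arrow diagram shown on the optimal-policy slide.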
So, in this section we looked at Markov decision processes, known as MDPs. We looked at what a solution to a Markov decision process should be, and we studied the Bellman equation, which helps us define what the value of states can be. Then we looked at value iteration, an algorithm based on dynamic programming that uses the Bellman equation and applies it repeatedly until it comes up with a value function that converges on the optimal value function, or on something that's hopefully very close to the optimal value function. We then looked at policy extraction, which, given a value function, allows us to extract a policy that can be used to define behavior for an agent. If the value function is optimal, then the policy will be optimal; if the value function is not optimal, the policy may still be optimal if our value function is good enough. Later on we'll be looking at other techniques for solving MDPs that relax some of the assumptions that value iteration has. They'll have different strengths and weaknesses compared to value iteration, and it really depends on the type of problem you have and the information you have at hand as to which technique you should choose.