All right. We’re gonna go ahead and get started. Um, homework two, it should be well underway. If you have any questions feel, feel free to reach out to us. Um, [NOISE] project proposals, if you have questions about that, feel free to come to our office hours or to reach out, um, via Piazza. Somebody have any other questions I can answer right now? All right. So today, we’re gonna start- this is a little bit loud. Um, today we’re gonna start talking about policy gradient methods. Um, policy gradient methods are probably the most well used in reinforcement learning right now. Um, so, I think they’re an incredibly useful thing to be familiar with. Um, [NOISE] whenever we talk about reinforcement learning, we keep coming back to these main properties that we’d like about agents that learn to make decisions about them, you know, to do this sort of optimization, handling delayed consequences, doing exploration, um, [NOISE] and do it all through statistically, and efficiently, in really high dimensional spaces. Um, and what we were sort of talking about last time in terms of imitation learning was sort of a different way to kind of provide additional structure or additional support for our agents, um, so that they could try to learn how to do things faster. Um, and imitation learning was one way to provide structural support by leveraging demonstrations from people. And we’ve seen other ways to sort of, um, encode structure or human prior knowledge, um, when we started talking about function approximation. So, when we think about how we define q, like when we define q as s, a, and w, where this was a set of parameters. We were implicitly making a choice about sort of imposing some structure in terms of how we are going to represent our value function, and that choice might be fairly strong like assuming it was linear. So, this is sort of a quite a strong assumption, or it might be a very weak assumption like using a deep neural net. And so, when we specify sort of these function approximations, and representations, we’re sort of implicitly making, uh, uh, choices about how much structure and how much domain knowledge we want to put in, um, in order for our agents to learn. So, what we’re gonna start to talk about today and we’re gonna talk about this week is policy search which is another place where it can be very natural to put in domain knowledge. I mean, we’ll see that in in some robotics examples today, and it can be also a very efficient way to try to learn. So, as I was saying, before we sort of we’re approximating where we’re doing model-free reinforcement learning, and when we started to try to scale up to really large state spaces. Um, I’ve been having several different people ask me about really large action spaces which is a really important topic. We’re not gonna talk too much about that in this quarter, but we will talk a little bit about when your action space is continuous but low-dimensional. But we have started to talk a lot about when the state space is really high-dimensional and, and, and really large. And so, we talked about approximating things, um, uh, with some sort of parameterization, like whether it will be parameters Theta or we often, or we often use w, but some sort of parameterization of the function. So, we used our value function, um, to define expected discounted sum of rewards from a particular state or state action, and then we could extract a policy from that value function or at least from a state action value function. And instead, what we’re gonna do today is just directly parameterize the policy. So, when we talked about tabular policies, our policy was just a mapping from states to actions. And in the tabular setting, we could just look- do that as a lookup table. For every single state, we could write down what action we would take. And what we’re going to do now is to say, “Well, it’s not gonna be feasible to write down our table of our policy, so instead what we’re going to do is parameterize it, and we’re gonna use a set of weights or Thetas.” Today, we’re mostly gonna use Thetas, but this could equally well think of this as weights. Um, just some way to parameterize our policy. We’ll talk more about particular forms of parameterization. Um, but just like what we saw for state action value functions, um, this is gonna have a big implication because this is effectively defining the space that you can learn over. So, it’s sort of, um, it’s determining the, the class of policies you could possibly learn. Um, [NOISE] and we’re again gonna sort of focus on model-free reinforcement learning, meaning that we’re not gonna assume that we have access to an a priori model of the dynamics or reward of the world. So, we had thrown some of these diagrams up at the start of the quarter, I just want to go back to it. Um, we’ve been talking about sort of value, we- we haven’t talked so much about models, the models are also super important. Um, but we’ve been talking a lot about sort of value function, based approaches which is this, and now we’re gonna talk about policy, um, direct policy search methods. And as you might expect, there’s a lot of work which tries to combine between the two of them, and these are often called actor-critic methods. Um, where you try to explicitly maintain a parameterized policy, and explicitly maintain a parameterized critic or value function. So, this is the policy, and this is a Q. Okay, so, we’re gonna start today and we’re gonna be talking about policy-based methods. So, why would you wanna do this? Um, [NOISE] well, uh, it actually goes back a little bit to also what we were talking about last week with imitation learning. For imitation learning, we talked about the fact that sometimes it’s hard for humans to write down a reward function, and so it might be easier for them just to demonstrate what the policy looks like. Similarly, in some cases, maybe it’s easier to write down sort of a parametrization of, um, the space of policies than it is to write down a parameterization of the space of state action value functions. Um, in addition, they’re often much more effective in high-dimensional or continuous action spaces, and they allow us to learn stochastic policies which we haven’t talked very much about so far, but I’m gonna give you some illustrations about where we definitely want stochastic policies. Um, [NOISE] and they sometimes have better convergent policy- convergence, uh, properties um, that can be a little bit debated, it depends exactly what- whether we’re comparing that to model-free or model-based approaches and how much computation we’re doing. Um so, this can be a little bit of a function of computation to computation can matter. One of the really big disadvantages is that they are typically only gonna converge to a local optimum. So, where you’re going to converge to something that is hopefully a pretty good policy, but we’re not generally guaranteed to converge to the global optima. [NOISE]. Now, there are some techniques that are guaranteed to converge to a local- to the global optima, and I’ll try to highlight some of those today, but generally, almost all of the methods that you see in like deep reinforcement learning that are policy gradient based, um, only converge to a local optima. Um, and then the other challenge is that typically we’re gonna do this by trying to evaluate a policy and then estimate its gradient, and often that can be somewhat sample inefficient. So, there might be quite a lot of data to estimate, um, what that gradient is when we’re taking a gradient-based approach. So, why might we want sort of a stochastic policy? So, in what I mentioned before, um, in the tabular setting, so let me just go back to here. So, in a- now, why do we want this? Do we want this? If you think back to the very first lectures, um, what I said is that if we have a tabular MDP, there exist a Pi which is deterministic and optimal. So, in the tabular MDP setting, we do not need, um, er, deter- we do not need stochastic policies because there always exists a policy that is deterministic that has the same value as the optimal policy does. So, this is not needed in the tabular Markov Decision Process case, but we don’t always- we’re not always acting in the tabular Markov Decision Process case. So, as an example, um, [NOISE] who here is familiar with rock-paper-scissors ? Okay. Most people. Um, er, possibly if you’re not, you might have played it by another name. So, in rock-paper-scissors, uh, it’s a two player game, um, [NOISE] everyone can either pick, uh, paper or scissors or rock. And you have to pick one of those, and scissors beats paper, paper- rock beats scissors, and paper beats rock. Um, and in this case, if you had a deterministic policy, you could lose a lot, you could easily be exploited by the other agent. Um, but a uniform random policy is basically optimal. What do I mean by optimality? In this case, I mean, that you, you could say a plus one if you win, and let say zero or minus one if you lose. We’re not gonna talk too much about multi-agent cases, um, er, in this class, but it’s a super interesting area of research. Um, and in this case, um, you know, the environment is not agnostic. Um, the environment can react to, uh, the policies that we’re doing and could be adversarial, and so we want a policy that, um, is robust to an adversary. So, a second case, um, is Aliased Gridword. So, um, so in this case, so why, you know, why is being stochastic important here? Well, because we’re not really in a stochastic setting, we are in an adversarial setting, and we have another agent that is playing with us and they can be non-stationary and changing their policy in response to ours. Um, so it’s not, uh, the environment doesn’t pick the next- doesn’t pick rock-paper-scissors regardless of our actions, um, in the past it can respond to those. [NOISE] Um, so it’s sort of got this non-stationarity or adversarial nature. Um, another case is where it’s not Markov, so it’s really partially observable, and you have aliasing, which means that we can’t distinguish between multiple states in terms of our sensors. So, we saw this before a little bit when we talked about robots that, you know, could have laser range finders and sort of tell where they were in a hallway. By how far away each of the, um, the, the first point of, um, uh, uh, obstacle was for all of their 180 degrees, and so that will look the same in lots of different hallways. Um, so this is a simple example of that. So, in an Aliased Gridword, um, let’s say that the agent because of their sensors cannot distinguish the gray states and they have features of a particular form. Um, they have a feature for whether or not there’s a wall to the North, um, er, or East, or South, or West. So, it can basically, like, if it’s here it can tell, like, “Oh I have walls to either side of me and not in front or behind me.” Um, but that could be the same over in that in the, in the other [NOISE], um, grey state. So, if we did a value-based reinforcement learning approach using some sort of approximate value function, um, it would take these features which are a combination of what action am I going to take? And whether there are walls around me or not. Um, or we could have a policy-based approach which also, um, takes some of these features but then just directly tries to make decisions about what action to take, and those actions can be stochastic. So, in this case, the agent is trying to figure out how to navigate in this world. It really wants to get to here. This is where there’s a large reward. So, this is good. It wants to avoid the skull and crossbones, and those will be negative reward. So, because of the aliasing, the agent can’t distinguish whether or not it’s here or here. Um, and so it has to do the same thing in both states. And so either it has to go left or it has to go right, Call it West or East [NOISE], um, and either way that’s not optimal because if it’s actually here it should be going that way, not over here and down. Um, and so it can distinguish whether it’s in here or here but it could just end up moving back and forth, er, or making very bad decisions. And so it can get stuck and never be able to know when it’s safe to go down and reach the money. So, it learns a near-determini- deterministic policy because that’s what we’ve normally been learning with these, um, and whether it’s greedy or e-greedy and generally it will do very poorly. But if you have a stochastic policy when you’re in a state where you’re aliased, you could just randomize. You’d say, “I’m not sure whether I’m actually in this state- in this state or this state, um, so I’ll just go, er, either East or West with 50 percent probability.” And then it’ll generally reach the goal state quickly. Because note, it can tell what it should do when it reaches here because that looks different than these two states. So, once it’s in the middle it knows exactly what to do. So, that’s just, again, an example where a stochastic policy has a way better value than a deterministic policy and that’s because the domain here is not Markov, it’s partially observable. [NOISE] Okay. So, that’s sort of one of the reasons why we might want to- some of the reasons why you want- might wanna be directly policy-based, and there’s a lot of other reasons. Um, so, so what does this mean? Well, er, we’re gonna have this parameterized policy and the goal is that we wanna find. Yeah, Like you said, can we conclude that when the world is not Markov, it is partially observed, stochastic policy is always better? Your name is ? I’m sorry yeah. So what said is can we conclude that, um, if the world is partially observable stochastic policies are always better. Um, I think it depends on the modeling you wanna do. I think, in this case, better than being stochastic because it’s still doing something, kind of, not very intelligent in the gray states, it’s just randomizing, would be to have a partially observable Markov decision process policy, um, and then you could track, uh, an estimate over where you are in the world. So, you can keep track of a belief state over what state you’re in, and then you could hopefully uniquely identify that, “Oh, if I was just in this state I have to be in the state now.” And then you can deterministically go to the right or left. [NOISE] So, it depends on, on the modeling one’s willing to do. Good question. Okay. So, when we, um, start to do the- go to parameterize policy search, what we’re gonna wanna do is find the parameters that yield the best value. The policy in the class with the best value, and so similar to what we’ve seen before we can, we can think about sort of episodic settings and infinite sort of continuing settings. So, in an episodic setting, that means that the agent will act for a number of time-steps often, let’s say, H steps. But it could be variable, like, it might be until you reach, you know, a terminal state. And then we can just consider what is the expected value, wha- wher- what is the value? What is the expected discounted sum of rewards we get from the start state or distribution of start states? And then what we wanna do is find the parameterized policy that has the highest value. Um, another option is that if we’re in a continuing environment which means we’re in the online setting, we don’t act for H steps we just act forever. There’s no terminal states and we can either use the, um, average value where we average over the distribution of states. So, this is, um, like what we saw before thinking about the distribution, the stationary distribution over the Markov chain that is induced by a particular policy. Because we talked about before about the fact that if you fix the policy, um, then basically, uh, you get into Markov reward process. You can also just think of the distribution of states you get is a Markov chain. So, um, if we’re acting forever, we’re gonna say sort of on average what is the value of the states that we reach under that stationary distribution? Um, and another way to do it is also to say we just look at sort of the average reward per time step. Now, for simplicity today we’re gonna focus almost exclusively on the episodic setting, but we can think about similar techniques for these other forms of settings. So, as before and this is an optimization problem similar to what we saw in the value function approximation case, uh, for linear value functions and using deep neural networks, um, we’re gonna wanna be doing optimization, er, which means that we need to do some sort of optimization tool to try to search for the best data. So, one option is to do gradient free optimization. We don’t tend to do this very much in policy search methods, um, but there are lots of different methods that are gradient free optimization. Just for us to find whatever parameters maximize this V Pi Theta. Um, and just to connect this- just like what we saw for Q functions, now we have Theta which is specifying a policy. And it maybe has some interesting landscape, and then we wanna be able to find where’s the max. So, we’re really trying to find the max of a function as efficiently as we can. And there are lots of methods for doing that that don’t rely on the function being differentiable. Um, and these actually can be very good in some cases. Um, so this is some nice work done by a colleague of mine and- We have developed a method for automatically identifying the exoskeleton assistance patterns that minimize metabolic energy costs for individual humans during walking [NOISE]. During optimization the user first experiences one control law while respiratory measurements are taken. Steady-state energy cost is estimated by fitting a first-order dynamical model to two minutes of transient data. The control law is then changed and metabolic rate is estimated again. This process is repeated for a prescribed number of control laws forming one generation. [NOISE] A covariance matrix adaptation evolution strategy is then used to create the next generation. The mean of each generation represents the best estimate of the optimal control parameter values. After about an hour of optimization, energy cost was reduced by an averageg of 24 percent, compared to no assistance. So this is work that’s done by my colleague uh, Steve Collins, um, who’s over in mechanical engineering and we’ve been collaborating some on whether you can train people to- do this better, um. So, the idea in this case is that, uh, there’s lots of instances for which you’d like to use exoskeletons. Um, a lot of people have strokes, a lot of people have mobility problems um, and of course there’s a lot of veterans that lose a limb. Um, and in these cases one of the challenges has been is how do you, sort of figure out what are the parameters of these exoskeletons in order to provide um, support for people walking and generally it varies on physiology and for many different people. They’re going to need different types of parameters, um, but you want do this really quickly. So, you want to be able to figure out very fast for each individual. What is the right control parameters in order to help them get the most assistance as they walk. Um, and so Steve’s lab treated this as, uh, sort of a policy, a policy search problem, where what you’re doing is you’re having somebody wear their device, you’re trying some, uh, control laws, um, that are providing a particular form of support in terms of their exoskeleton. You’re measuring their sort of, um, metabolic efficiency, which is, how do you- you know. How hard are they breathing? How hard are they having to work, compared to if they weren’t wearing this or under different control laws. And then you can use this information to figure out what’s the next set of control laws you use and do this all, in a, closed loop fashion as quickly as possible. Now one of the reasons I bring this up is, both because it was incredibly effective, it’s a really nice science paper that, um, illustrates how this could be much more effective than previous techniques. Um, and second because it was using CMA-ES, which is a gradient free approach. So even though most of what we’re gonna discuss today in class, is all with gradient based methods. There’s some really nice examples of not using gradient based methods also to do policy search for lots of other types of applications. So, I think it’s useful to sort of, know in your toolbox that, one doesn’t have to be constrained to gradient based methods, and one of the really nice things about, things like CMA-ES is that, they’re guaranteed to get towards, uh, to a global optima. So in some cases, uh, you might really want to be guaranteed that you’re doing that because it’s high stakes situation. Um, and in general, it sort of is, has been noticed repeatedly recently that sometimes these sort of approaches do work kind of embarrassingly well, um, uh, that they tend to be in some ways sort of a brute forced, a smart brute force way, um, that often can be very effective. So they’re good to consider, in terms of the applications you look at. But, you know, despite this, um, uh, even though, they can be really good and sometimes, um, they’re very, very helpful for parallelization. Um, uh, they’re generally not very sample efficient. And so depending on, the domain that you’re looking at and what sort of structure you have, often it’s useful to go to a gradient-based method, particularly if you might be satisfied with the local solution at the end. Sort of locally optimal. So what we’re going to talk about mostly today- just like what we did for, um, value, like, uh, value-based methods is, gradient descent, um, and gradient based methods, um and other methods that try to exploit the sequential structure of decision making problems. So CMA-ES doesn’t know anything about the fact that this- the world might be an MDP or any form of sort of, sequential stochastic process. And we’re gonna focus on ones that sort of, leverage the structure of the Markov decision process, in the decision process itself. So let’s talk about policy gradient methods. Um, where again just sort of, um, define things in terms of theta, so that we’re explicit about the parameter and we’re gonna focus on episodic MDPs, which means that, we’re gonna run our policy for a certain number of time steps, until we reach a terminal state or for certain you know, maybe h steps. Uh, we’re going to get some reward during that time period, and then we’re going to reset. So, we’re going to be looking for a local maximum, and we’re going to be taking the gradient, with respect to the parameters that, um, define the policy, and then use some small learning rate [NOISE]. So just this is- this should look very similar, very similar to the, similar to, uh, Q and V based search. And the main difference here is, that instead of taking, uh, the derivative with respect to parameters that define our q function, we’re taking them with respect to, the parameters that define our policy. So, the simplest thing to do here, is, um, to do finite differences. Um, so for each of your policy parameters, you just perturb it a little bit, um, and if you do that for every single one of the dimensions, um, that define your policy parameters, then you’re going to get an estimate of the gradient. Here, just doing sort of a finite differences estimate of the gradient. And you can use a certain number of evaluations for doing this, in each of the cases. So you can- let’s say you have this, um, k dimensional, uh, set of parameters that defined your policy, you try changing one of them a little bit, you repeat it, you get a bunch of samples for that, new policy. Um, you do that for all of the different dimensions, and now you have an approximation of the gradient. It’s very simple, it’s quite noisy, um, it’s not particularly efficient, but, it can sometimes be effective. I mean it was one of the earlier demonstrations of how policy gradient methods could be very useful, in an RL context. Um, and the nice thing is that the policy itself doesn’t have to be differentiable because, we’re just doing sort of a finite difference approximation of the gradient [NOISE]. So, one of the first examples that- I see- well, um, I think of when- I think of, sort of how policy, uh, gradient methods or how policy search methods can be really effective, is Peter Stone’s work on doing, uh Robocup and who here’s ever seen- like Robocup? Okay. A few people but not everybody. So let’s see if we can, get up like a short demonstration of like what these robots look like. So, let’s- ah, okay. So you probably can’t see it do you? We won’t do that right now, um, but essentially what you have is, there’s a bunch of different leagues of Robocup. One of the goals has been that, um, I think by 2050, the goal is that, we’re going to have, uh, a robotic soccer, uh, team that is going to be able to defeat- like able to, you know, win the, the World Cup. Um, so that’s been one of the driving goals, of this Ro- the Robocup initiative. Uh, and there’s lot of different leagues within this, and one of them is, these sort of quadruped robots, um, which try to score goals against each other. And one of the key challenges for this is, they look kind of like that. Um, and, you have to figure out the gait for walking, um, and you want them to be able to walk, quickly but you don’t want them to fall over. Um, and so just simply that question of like, how do you optimize the gait, is an important question in order, to win, because you need your robots to move fast on the field. So, Peter Stone has been a leading person in, in Robocup for a long time. Um, and their goal was simply to, learn a fast way for these AIBOs to walk. Um, and to do it by, uh, real experience um, and data’s really important here because it’s expensive um, you have these robots walking back up forth and you want them to very quickly, optimize their gait. Um, and you don’t want to have to keep changing batteries and things like that, so you really wanna do this with very little amounts of data. So, what they thought of doing in this case is sort of to, to do a parameterized policy and try to optimize those proper policies. So this is where significant domain knowledge came in and this is a way to inject domain knowledge. So they, um, specified it by this sort of continuous ellipse, of how gait works for, um, the small robot. And so they parameterized it by these 12 continuous parameters. And this completely defines the space of possible policies you could learn. This might not be optimal. Peter Stone and his group have a huge amount of experience on doing Robocup, um, at the time they were doing this paper and so they really had a lot of knowledge they could inject in here. And in some ways it’s a way to provide sort of this hierarchical structure about, what sort of policies might be good. And then what they did is they did just this method of finite differencing, in order to try to optimize for all of these parameters. [NOISE] So, one of the important things here, um, is that all of their policy evaluations were going to be done on actual real robots, um, and they just wanted to, have people inter- intervene every once in a while, in order to, replace batteries which took- happened about once an hour. Um, and so they did it on three AIBOs, very small amount of hardware. Um, they did about 15 policies per iteration, and, they evaluated each policy three times. So, it’s not very many but it can be a very noisy signal, um, and each iteration took about 7.5 minutes. So- and then they had to pick some learning rate. And so what do we see in this case? Well, we see that, in terms of the number of iterations that they have versus how quickly they’re- of course you have to define your optimization criteria in this case, they’re looking at speed of stable walking. Um, and a lot of people have been trying to figure out how to do these using hand tuning before, um, uh, including, so they’re the UT Austin Villa team. Um, including, them- in the past people have found different ways to sort of hand tune, um, I don’t know if we’ll be using unsupervised learning et cetera. And you can see, as they do multiple iterations of trying to, search for a better policy using this finite difference method, that they get to faster than everything else. And this is not that many iterations, um, so this is something that was happening over, you know, a few hours. So, I think this was a really compelling example of how, policy gradient methods could really do much better than what we- had happened before and they didn’t have to require an enormous amount of data. That’s very different than probably what you’re experiencing in assignment two, so this is, no total number of iterations. Um, uh, I think this was on the order of, let’s see, like, this is on the order of, you know, tens to hundreds of policies, not millions and millions of steps. So these things can be very data efficient. But there was also a lot of information that was given. So, um, if you think about sort of like, uh, I have a little bit on here. So, in their paper they discussed sort of what was actually impacting performance in this case, and there are a lot of things that impact performance. So, um, you know, how do we start? Um, so I may have a sense of why, you know, why does the initial policy parameters used matter for this type of method? Yeah, Well, because we’re not guaranteed to have a global optima, only a local optima, so your starting point is gonna affect which local optima you are able to find. Exactly. So what just said is that, um, because these methods are only guaranteed, particularly this method is only guaranteed, to, find a local optima and all of the sort of policy gradient style methods are. Um, then wherever you start you’re gonna get to the closest local optima and you have no guarantee that that’s the best global optima. Um, so it’s important to either try lots of random restarts here in this case or to have domain knowledge. Um, another important question here is, how much you’re perturbing sort of the size of your finite differences. And then I think, really most critical is this policy parameterization. Like just how are you writing down the space of possible policies that you can learn within because like, if that’s not a good policy space, then you’re just not going to learn anything. Um, yeah. on slide 26 What is an open loop policy can you explain a bit more on that. Yeah. Um, uh, so, question was about the open loop policy part. So, these policies that we’re learning don’t have to be adaptive. And open loop policy is essentially a plan. It’s a sequence of actions to take, um, regardless of any additional input that you might have. So, um, we typically have been thinking about policies as mappings from states to actions, but they can also just be a series of actions. And so, when we talk about an open loop policy, that’s a non-reactive policy because it’s just a sequence of actions that regardless of the state of the robot you just keep going. So, maybe there’s a really large wind in the middle, and the robot’s next action is the same whether there’s a lot of wind or not. It doesn’t have to be reactive. Okay. So, but in general, um, you know, finite differences is a reasonable thing to try. Um, often we’re gonna want to use gradient information and leverage the fact that our policy for function is actually differentiable. So, what we’re gonna do now is, um, compute the policy gradient analytically [NOISE] excuse me. This is most common, um, in most of the techniques that are used right now. Um, we’re gonna assume it’s differentiable wherever it is non-zero, um, and that we can explicitly compute this. So, when we say, what we- when we say know that means that this is computable. And we can compute this explicitly. And so, now we’re gonna be, um, thinking only about gradient-based methods. And so, we’re only, [NOISE] we’re gonna only converge to a local optima. Hopefully, hopefully, we’ll get to a local optima, that’s the best we can hope for in this case. Okay. So, we’re going to talk- people often talk about likelihood ratio policies, um, and they’re gonna proceed as follows. So, let- we’re thinking about the episodic case. So, we’re gonna think about it as having, um, trajectories. So, state action reward, next state, et cetera, all the way out to some terminal state. So, this is where we terminate. And we’re gonna use R of Tau to denote the sum of rewards for a trajectory. Okay. So, the policy value in this case is just gonna be the expected discounted sum of rewards we get by following this policy. And we can represent that as the probability that we observe a particular trajectory times the reward of that trajectory. So, it just says given under this policy what are the, you know, what’s the probability of seeing any trajectory, and then what would be the reward of that trajectory? Because the reward is the deterministic function of the trajectory. Once you know the state action rewards, et cetera, then your reward is, um, just the sum of all of those. And so now, in this particular notation, what our goal will be is to find policy parameters Theta, which, um are the arg max of this. Uh, and the reason we sort of- what have we changed here, um, the change now then has been the fact that we’ve gonna focus on here. So, notice now that the policy parameters only appear in terms of the distribution of trajectories that we might encounter under this policy. And this is, again, a little bit similar to what we talked about for imitation learning before or where in imitation learning, we talked a lot about distributions of states and distributions of states and actions, and trying to find a policy that would match the same state action distribution, as what was demonstrated by an expert. Um, today, we’re not gonna talk as much about sort of state action distributions but we are talking about sort of distributions of trajectories that we could encounter under our particular policy. So, what’s the gradient of this? Um, so, we wanna take the gradient of this function with respect to Theta. So, we’re gonna go for this as follows. We are gonna rewrite what is the probability of a trajectory under Theta. So, sum over Tau. I wanna do probability of Tau [NOISE] times. All right, first actually I’ll whip it in here. And then what we’re gonna do is, make sure I get the notation the same. Okay. So, then what we’re gonna do is, we’re gonna do something simple where we just multiply and divide by the same thing. So, we’re gonna put in probability of Tau given Theta, divided by probability of Tau given Theta, times the derivative of the probability of Tau given Theta. And the probability- if we instead had a log. So, if you’re taking the derivative of log of probability of Tau given Theta, that is exactly equal to the one over the probability of Tau given Theta times the derivative of p of Tau given Theta. So, we can re-express this as follows; Sum over Tau r of Tau, p of Tau given Theta times derivative with respect to log of p of Tau given Theta. Now, so far that doesn’t necessarily seem like that’s gonna be very useful. Um, [LAUGHTER] So, we’ve done that, that’s a reasonable transformation, but we’ll see shortly why that transformation is helpful. And in particular, the reason this transformation is helpful is it’s gonna be very useful when we think about wanting to do all of this without knowing the dynamics or reward models. Um, so, we’re gonna need to be able to, you know, get reward in terms of, uh, a trajectory, but we wanna be able to evaluate, um, the gradient of a policy without knowing the dynamics model, and this trick is gonna help us get there. So, when, when we do this, this is often, this is often referred to as the likelihood ratio. And we can convert it and just say, “Well, we noticed that by doing this, this is actually exactly the same as the log.” Now, why else does this start to look like something that might be useful? Well, what do we have here? We have, if we- this is the sum over all trajectories. Of course, we don’t necessarily have access to all possible trajectories, but we can sample them. So, you could imagine starting to be able to approximate this by running your policy a number of times, sampling a number of trajectories, looking at the reward of those, um, and then taking the derivative with respect to this probability of trajectory given Theta. So, typically we’re gonna do this by just running the policy m times. Um, and then, that p of the- Tau given Theta, we’re gonna just approximate that by the following. So, that part drops out, we’re just gonna weigh all of the trajectories that we got during our sampling uniformly, and then we look at the reward of that trajectory, and the log of p of, um, [NOISE] uh, Tau given Theta. So, what is happening in this case? Okay. So, this is saying that the gradient is this sort of, um, [NOISE] uh, the reward that we get Um, times the log of the probability of that trajectory for the reward with associated word times Theta. So, what’s happening in that case? So, in this case, we have a function which for our case is the reward, which is sort of measuring how good, um, that particular, um, you know, trajectory is or how good that sample is. And so, what this is doing is we’re just moving up and the trajectory of the log probability of that sample based on how good it is. So, we wanna sort of push up our parameters, that, um, are responsible for us getting samples which are good. So, um, we want to have parameters in our policy that are gonna cause us to execute trajectories that give us high reward. So, if we think of just sort of here f of x again is the reward. And we are- this is gonna be our policy or parameterized policy. We want to increase the weight of things in our space that lead to high reward. So, if this is our f of x, which is our reward function and this is the probability of our trajectories, then we wanna reweight our policy to try to increase the probability of trajectories that yield high reward. So, you would end up having larger gradients towards things that have high value, high reward. Okay. So, then the next question is, if I’m gonna do this then I have to be able to approximate the second term, which is this log. You know, the derivative with respect to the probability of a trajectory, um, under some parameters. So, I have to be able to figure out what is the probability of a trajectory under a set of parameters, and we can do that as follows. And so, this is gonna be Delta Theta of log of the pro- of Mu of S_0. So, this is our initial starting state, the probability of our initial starting state, times the product over j equals 0 to t minus 1 of the probability of observing the next state given the action that was taken, times the probability of taking that action under our current policy. So, there’s like another bracket at the end. Um, and so, since this is log, we can just decompose this. So, this is gonna be equal to Delta Theta of log of Mu of S naught plus Delta Theta sum over Delta Theta because it’s a log term, of J equals 0 to t minus 1 of log of [NOISE] the transition model. Remember, we don’t know this in general. This is unknown. You’re just gonna hopefully end up with an expression which means we don’t need to have it. Um, and what i is indexing here is which trajectory we’re on. Add sum over J equals 0 to t minus 1. And this is gonna be our actual policy parameters. All right, can anybody me why this is a useful decomposition? And whether or not it looks like we’re gonna need to, so, let me just parameterize all these things, um, does this look hopeful in terms of us not needing to know what the dynamics model is? [inaudible] How about everybody just take a second, talk to your neighbor and, um, then tell me which of these terms are gonna be zero. So we’re taking the derivative with respect to Theta. And which of these terms depend on Theta [OVERLAPPING]. Remember Theta is what determines your policy parameters. Theta is what determines what action you take in a given state. All right I’m gonna do a quick poll, um, so I’m gonna call these items one, two and three. Does the first term depend on Theta? Raise your hand if yes, raise your hand if no. Great, okay, yeah. So this is independent Theta. So this is gonna be zero. Raise your hand if the second term is independent of Theta. Great, so this goes to zero. So the only thing we have less is this, which is great. So, um, the nice thing and so now it sort of becomes more clear why we did this weird log, um, transformation, because when we did this weird log transformation, it allowed us to take this product of the probability of the action that we took in the state transitions, then instead we can decompose it into sums. And now once we see that we decompose it into sums, we can apply the derivative separately and that means some of these terms just directly disappear which is really cool. So, it means that we don’t actually need to know what the transition model is. Um, we don’t need to have a explicit representative. Yeah question and name first please. And the question is, I was wondering, doesn’t the dynamics of the system depend on the policy though, um, in general? Great question. So question is, does the dynamics of the system depend on the policy? Absolutely, but only through this part. So, it’s like um, the agent gets to pick what action they take, but once they pick that, the dynamics is independent of the agent and so it’s this de-coupling. So, if you have a different policy, you will absolutely get different trajectories but the way in which you get different trajectories, um, is only affected by the policy in terms of the actions that are selected and then the environment will determine sort of the next states that you get. And so we don’t need to know that in terms of, um, estimating the impact of the actions on the environment. It will also come through in terms of the rewards you get. Because the rewards you get are also a function of the state. So you’ll absolutely visit different parts of the state depending on the actions you take. Any other questions? I’m and I just want to make sure I understand how I get the estimate for the probability of how given Theta, I mean, most likely we are just saying if we took m episodes and this one showed up, you know, i times it’s gonna be i over m. It is that what we are doing here is correct? Great question. So, is asking, um, you know, this is, what I put here is just one of those internal terms that would cover one i, yes, So, what we’re doing here is, we’re saying, we’re gonna take this policy. We’re gonna run it m times. We’re probably not gonna get any trajectories that are identical. And what we’re gonna do is compute this log of the probability of trajectory for each of those separately. And then sum them up. You might end up with that, I mean, you know in deterministic cases you might, um, if your domain doesn’t have a lot of stochasticity and neither does your policy, you might end up with multiple trajectories that are identical. In general your trajectories will be completely different. And so will your estimate of their local gradients. So, this is really nice. We’re gonna end up with this situation where we only have to be able to have an analytic form for the derivative of our policy with respect to our parameters. So, we still need and we’ll talk about this a little bit more later. We still need this. We have to evaluate this. This is about how we parametrized our policy. And if we want this to be analytic, we need to have parametrized our policy in a way that we can compute this exactly for any state and action. So, we’ll talk more about some ways, you know, some policy parameterizations which make this computation analytic and nice. In other cases you might have to estimate this thing itself by, you know, brute force or computation or finite differences or something. But if we choose a particular form of parameterized policy, then this part is gonna be analytic. So, another thing is I don’t find this, um, er, I don’t find this additional terminology particularly helpful but it’s used all over the place. So I wanna introduce it which is people often call, um, this part a score function. Just the score function which is not particularly helpful, I think but nevertheless is often used is called this. So that’s the quantity that we were just talking about needing to be able to evaluate. So this really gets into, um, er, well, we’ll write it out again. So, um, when we take the derivative of the value function we approximate that by getting m samples and we sum over i equals one to m. And we look at the reward for that treject- um, trajectory and then we sum over these per step score functions. Can everybody read that in the back? Yeah. Okay great. Yeah so these are sort of our score functions. And these are our score functions, um, that can be evaluated over every single state action pair that we saw and we do not need to know the dynamics model. So, the policy gradient theorem slightly generalizes this. How is it gonna generalize this? Note in this case, what we’re doing here is we’re- this is for the episodic setting. And this is for when we just take our raw, our raw reward functions. So we look at the sum of rewards for that trajectory and then, um, we weigh it by this sort of derivative with respect to our policy parameters. Um, it turns out that we can also slightly generalize this. And let’s say I’m gonna call, so this is a value function, um, let’s say that we had slightly different objective functions. We talked before about how we could have episodic reward or average reward per time step or average value. So, we could either have our objective function be equal to our normal value for episodic or we could have it equal to what I’m gonna call J AVR which is average reward per time step or we could have it as average value. Let’s say we’re always continuing and we want to average over the distribution of states that we encounter. So we can think about that. It’s a good scenario too. It turns out that on all of those cases you can do a similar derivation to what we did here for the episodic case. And what we find is that we have the derivative of our objective function, which now can be kind of any one of these different objective functions, is equal to the expected value under that, the current policy of the derivative with respect to just those policy parameters times Q. And Sutton and Barto in Chapter 13 which we also reference on the schedule, um, have a nice discussion about a number of these different issues, um, and so again, we’re not gonna talk too much about these slightly other different objective functions but just know that this all can be extended to the continuing case. Okay, so what we’ve said here so far is that we have this approximation where what we do is we just take our policy, we run it out phi m times, for each of those m times we get a whole sequence of states and actions and rewards. And then we average. And this is an unbiased estimate of the policy gradient but it’s very noisy. So, this is gonna be unbiased and noisy. If you think about what we saw before for things like Monte Carlo methods, it should look vaguely familiar, same sort of spirit, right? We have, um, we’re just running out our policy. We’re gonna get some sum of rewards just like what we got in Monte Carlo, um, estimates. But, [NOISE] so, it’ll be unbiased estimate of the gradient. So, it’s unbiased estimate of the gradient, estimate of gradient. But noisy. So, what can make this actually practical? Um, there’s a number of different techniques for doing that. Um, but some of the things we’ll start to talk about today are temporal structure and baselines. [NOISE] Okay. So, how do we fix this? I’m gonna start to look at, you know, fixes, [NOISE] uh, temporal structure and baselines. And before we keep going on this, um, based on what I just said in terms of Monte Carlo estimates, um, what are some of you guys’ ideas for how we could maybe reduce the variance of this estimate? Based on stuff we’ve seen so far in class. Like, what are the alternative to cut up Monte Carlo methods? Yeah? We could use bootstraps. Um, can I get your name first. Oh, I’m . ? Yeah. What said is exactly, right. So, said we could use bootstrapping. Yeah. So, we’ve repeatedly seen that, um, we have this trade off between bias and variance, and that bootstrapping, um, like temporal difference methods that we saw in Q-learning that you’re doing in DQN, can be helpful in, um, reducing variance and, and speeding the spread of information. So, yeah. So, we could absolutely do things like bootstrapping, um, to kind of replace R with something else or use, um, a covariate in addition to R. To try to reduce the variance of R. Okay. All right. So, what we’re gonna do now is, we’re first gonna do something that doesn’t go all the way to there but tries to [NOISE] at least leverage the fact that we’re in a temporal, temporal, um, process. [NOISE] Okay. Um, and for any of you who have played around with importance sampling before, this is closely related to, um, per-decision importance sampling. And basically, the, um, thing that we’re going to exploit, is the fact that, um, the rewards, uh, can only, um, of the temporal structure domain. Oh, I’ll write it out first. Okay. So, what we had before, is we said that, um, the derivative with respect to Theta of the expected value of Tau of the return is equal to the expected value under the trajectories you could get of the sum over t equals 0 to t minus 1 of rt such as the sum of rewards you get [NOISE] times the sum over t equals 0 to t minus 1 of your derivative with respect to your policy parameters. [NOISE] Um, that’s what we had before. So, we just sum up all of our rewards, and then, we’d multiply that by the sum over all of the gradients of our policy at every single action state pair we got in that trajectory. [NOISE] Okay. So, let’s think about doing this for a single reward, instead of looking at the whole sum of rewards. So, let’s just look at. We take the derivative with respect to Theta of expected value of rt prime. [NOISE] So, this is just a single time step reward that we might encounter you know, along our trajectory, and that’s gonna be equal to the expected value of rt prime times sum over t equals zero to t prime of this derivative. So, this is gonna look almost exactly the same as before. [NOISE] Except, the only key difference here is that I’m only summing up to t prime. Okay. So, we’re only summing up, um, you could think of this as just like a shortened trajectory. I’m looking at the product of, um, the states, and the actions and the rewards that I, I reached all the way up to when I got to, um, rt prime. Okay. So, I don’t have to sum all the way over the future ones. So, we can take this expression, and now, we can sum over all time steps. So, this says, what’s the expected reward, uh, or the derivative with respect to the reward for time step t prime? Now, I’m gonna just sum that, and that’s gonna be the same as my first expressions. So, what I’m gonna do is I’m gonna say, [NOISE] V of Theta is equal to the derivative with respect to Theta of er, and I’m gonna sum up that internal expression. So, I’m gonna sum over t prime is equal to zero to t minus 1 rt prime, and then insert that second expression. Okay. So, all I did is I put that in there and I summed over t prime is equal to zero, all the way up to t minus 1, and then, what I’m gonna do is I’m going to reorder this and this by making the following observation. So, if we think about, how many terms one of these particular, um, log Pi Theta at st appears in? So, if we look at, um, [NOISE] log of Phi Theta of a_1 s_1. So if you look at how many times that appears, that appears for the early rewards and it appears for all the later rewards too. Okay. This is going to appear for r_1, it’s going to appear for r_2, it’s gonna appear all the way up to rt for t minus 1. Because we’re always summing over everything before that t prime. Okay. So, what we’re gonna do now is we’re gonna take those terms and we’re gonna just reorganize those. So, some of these terms appear a whole bunch of times, some of them, the last one, the [NOISE] log of Pi Theta of at minus 1, st minus 1, it’s only gonna appear once. It’s only re-responsible for helping dictate the very [NOISE] final reward. So, we can use that insight to just slightly, um, reorganize this equation as follows. [NOISE] So, now, we’re gonna say this is equal to the expected value of sum over t equals zero to t minus 1. So, notice before, I put the t prime on the outside and the t was on the inside, and now, what I’m gonna do is put the t on the outside, and I’m gonna say, [NOISE] got Delta Theta, log Pi Theta, at st times sum over t prime is equal to t all the way to t minus 1 of rt prime. Okay. So, all I’ve done is I’ve reorganized that sum. Yes? Is that ? Yeah. Yeah. Um, on the second line from the bottom, is it’s supposed to be the derivative that- is that a value function [NOISE] with respect to Theta? Um, at the very [NOISE] left. Yeah. Okay. Yes. Oh sorry. You mean[OVERLAPPING] it’s supposed to be the derivative of this? Yes. [NOISE] Yeah. Thank you. Okay. So, what we’ve done in this case was we’ve reorganized the sum. We-we’ve just recollected terms in a slightly different way. But it’s gonna be the- in a useful way. So, [NOISE] let’s move this up, and I’ll move this one down. [NOISE]. Okay. So, right now, we’re still working on the temporal structure. [NOISE] And what is this going to allow us to do? Well that second term there, should look somewhat familiar. What that’s saying here is that’s saying, what is the reward we get starting at time step, uh, t all the way to the end? And that’s just the return. So, we had previously defined that, um, when we are talking about, like, Monte Carlo methods, et cetera, [NOISE] that we could always just look at, um, rt prime at I. This is just equal to the return. The return for the rest of the episode starting in time step t on episode i. So, that, that should look very familiar to what we had seen in Monte Carlo methods, where we could always say from this state and action, what was the sum of rewards we get starting at that state and action until the end of the episode? Okay. So, that means we can re-express the derivative [NOISE] with respect to Theta as approximately one over m sum over all of the trajectories and we’re summing over, sum over all the time steps. The derivative with respect to Theta of our actual policy parameter times just the return. And this is gonna be a slightly lower variance estimate than before. Okay. So, instead of us having to sort of separately sum up all of our words, and then, we multiply that by the full sum of all of these derivatives at the logs, we are only kind of needing to take the sum of the logs, um, for some of the reward terms essentially, and so, we can reduce the variance in that case. Because in some ways, what this is doing, this is saying, like, for every single reward, because you could re-express this as a sum of rewards. For every single one of those rewards, you have to, um, sum it by sort of the full trajectory in terms of the derivative of the gradient, uh, the derivative of the policy parameters. And now, we’re saying, you don’t have to, uh, multiply that by all of those. You only have to multiply it by the ones that are relevant for that particular reward. That means that you’re gonna have a slightly lower variance estimator. [NOISE] Okay. So, when we do this we can end up with what’s known as REINFORCE which, um, who here has heard of REINFORCE? Yeah. Number of people, not everybody. REINFORCE is one of the most common reinforcement learning policy gradient algorithms. So, you get the REINFORCE algorithm. [NOISE] So how it works is you, um, then the algorithm is you initialize in it, theta randomly. [NOISE] You just always will have to first decide on how you’re parameterizing your policy, so somewhere you already defied- decided how you’re parameterizing your policy. Now, you’re gonna set the values for that policy randomly. And then for each episode, so you’re going to run an episode with that policy. [NOISE] Episode. [NOISE] And you’re gonna gather a whole bunch of actions and rewards, [NOISE] and this is sampled from your current policy. So, your sample, your current policy according to, um, sample from your current policy, you get a trajectory, and then for every step in that trajectory, you’re gonna update your policy parameters. [NOISE] So, [NOISE] for every time step inside of that episode, we’re gonna update our policy parameters. [NOISE] So, it’s going to be the same as before times some learning rate. [NOISE] I will not use W there, I use alpha, um, times the derivative with respect to Theta, log Pi Theta, st at Gt, where Gt is just in this episode, what is the sum of rewards from st at onwards? [NOISE]. So, that’s the just the normal return like, what we had with, um, Monte Carlo methods. So, just like, what we did when we were estimating like, the linear value function, and we were using rewards from the state and action onwards. We’re going to do the same thing here except for, um, we’re going to be updating the policy parameters, and we do this many, many, many times and then at the end, we return the Theta parameters. Yeah? [NOISE] I have a question. So, for each episode, do you sample from the updated, um, policy? We’re gonna talk with you. Hope you’re ready. Yes. Uh-um. [NOISE] Yeah. So, what just asked is, right. Um, so, in this case you, um, I- after you do all the updates for one episode, so you could do these incremental updates. Um, I- and then, a- at the end of doing all of your incremental updates, then you get another episode with your new updating parameters. Yeah, ? Um, since we’re doing every time updates would this be a biased method? Um, good question. So, since we’re doing every time estimates, um, this should be an unbiased estimate of the, um, I- It should still be an unbiased estimate of the gradient. It’s stochastic, um, but, um, we- there’s not a notion of state and actions in the same way. Um, this will be asymptotically consistent. It’s a good question. So, the notion of, um, a state and action in this case is different because we have just these policy parameters. So, we’re not estimating the value of a state and action here. Um, so, this is certainly asymptotically consistent. I think it’s still just unbiased. Um, if I, if I reconsider that later, I’ll send a Piazza post, um, but I think it’s still just an unbiased estimate of the gradient. It’s a good question. Okay. So, I go back to my slide notes. Um, I think the last thing I just wanted- well, I’ll- I mention the, uh, probably the things will, um, [NOISE] one critical question here is, whether or not, or how to compute this differential with respect to the policy parameters? So, I think it’s useful to talk about, you know, what are the classes of policies that often people consider, [NOISE], um, that have nice differentiable forms. So, um, some of the classes people considers are things like, Softmax, Gaussians, and Neural networks. Those are probably the most common. So, I- what do I mean by that? I mean, that’s how we’re [NOISE] going to actually, just parameterize our policy. So, let’s just look at an example. So, Softmax is where we simply [NOISE] have a linear combination of features, and we take sort of, um, an exponential weight of them. So, what we’re gonna do is we’re gonna have some features of our state and action space, and we’re gonna [NOISE] multiply them by some weights or parameters. These are our parameters. [NOISE] And then, to actually get a probability of taking an action, so if we want to have our policy where we say, what is the probability of action given the state? We’re gonna take the exponential of these weighted features. So, we have e [NOISE] to the phi t Theta, divided by the sum over all actions [NOISE]. So notice, this is a reasonable thing to do when our action space is discrete, action space discrete [NOISE]. So, [NOISE] a lot of Atari games, a lot of different, um, scenarios you do have a discrete action space, um, so you could just take this exponential, divide by the normalized. Um, sum over the exponential, and that immediately yields our parameterized policy class. And so, then, um, if we want to be able to take the derivative of this with respect to the log, that’s quite nice because we have exponentials here, and we have log term. We’re taking the log of this. So, um, we want to be able to compute this term from this sort of parameterized policy class. What we get is the derivative with respect to Theta of log of this type of parameterized policy [NOISE] is just equal to our features [NOISE]. So, this is whatever feature representation we’re using, like in the case of the, um, uh, locomotion robotic case, this would be like all the different. Um, uh, this could be something like, you know, angles, or joints, or things like that. Um, so, this is whatever featurization we’re using minus the expected value for Theta of [NOISE] your parameters, um, with an exponent with, um, your expected value [NOISE] over all the actions you might take under that policy. So, it’s sort of saying the features that you observed versus the sort of average feature, average, average over the action. Okay? So, that’s differentiable. Um, and you can solve it, then it gives you an analytic form. Um, another thing that’s really popular is a Gaussian policy. [NOISE] And why might this be good? Well, this might be good because often, we have continuous action spaces. So, this is good if we have discrete action spaces. Often, we have continuous actions spaces. This is very common in, um, controls and robotics. [NOISE] So, you have a number of different parameters and i- in their continuous scalar values that you wanna be able to set. Um, and what we could say here, let’s say, we use mu of s. It might be a linear combination of state features, times some parameter. Okay. And let’s imagine for simplicity right now that, um, we have a variance but that, that’s static. So, we could also consider the case where it’s not, but we’re gonna assume that we have some variance term that is fixed, [NOISE] so this is not a parameter. This is not something we’re gonna try to learn. We’re just gonna be trying to learn the Theta that’s defining our mu function, and then our policy is gonna be a Gaussian. So, a is going to be drawn from a Gaussian using this mean per features. Okay. So, we compare our current state to the mean, and then we select an action with respect to that, and the score function in this case is the derivative of this Gaussian. [NOISE] So, score’s a derivative of, uh, of this Gaussian function [NOISE], which ends up being a minus mu of s times our parameterization of the state features divided by sigma squared [NOISE]. So, again, it can be done analytically. And the other really common one is deep neural networks. So, those are, um, [NOISE] those are- those are sort of very common forms, um, I- that people use. We’re gonna talk next time about another common way. Before we finish, um, we’re gonna do, um, spend about five minutes, um, to do some early class feedback. It’s really helpful for us to figure out what’s helping you learn, what things do you think could be improved? Um, so, what I’m going to do right now is open, if you guys could go to Piazza. Um, It would be great if you could fill this out. All everything is anonymous. Um, and my goal is to get feedback for you guys on Wednesday about this. So, just- let me see if I could redo that. So, it’s posted. Okay. Let me see if I can- I’ll pin it, so it’s easier to find. [NOISE] So, it should be pinned now at the very top. Yeah. So, a class feedback survey, if you go to Piazza, um, it’s a very short survey. You can give us information back. That would be great. What we’ll do next time is we’ll continue to talk about policy search. We’re gonna talk about baselines, which is another way to reduce variance. Um, and this again is a super active area of research, so there’s, um, a ton of work on deep reinforcement learning and policy gradients, and so we’ll talk about some of that work, and then you’ll have a chance to, uh, play around with this a- and, uh, do get sort of hands-on experience with policy gradients after the mid-term. So we’re releasing the assignment about this at post the midterm, and that will be the third assignment.