Policy Iteration and Value Iteration for MDPs


MDP (Markov Decision Process)

Given the current state ${}^{t}\boldsymbol s$, the future ${}^{t+1}\boldsymbol s$ and the past ${}^{t-1}\boldsymbol s$ are independent. In an MDP, the outcome ${}^{t+1}\boldsymbol s$ of an action ${}^{t}\boldsymbol a$ depends only on the current state ${}^{t}\boldsymbol s$ and not on the past; this property is sometimes called "memorylessness" (the Markov property).
The process can be summarized by five components:

  • $\mathcal S$: the state set, $\mathcal S=\{\boldsymbol s_0,\boldsymbol s_1,\cdots\}$;
  • $\mathcal A$: the action set, $\mathcal A=\{\boldsymbol a_0,\boldsymbol a_1,\cdots\}$; the action executed in state $\boldsymbol s$ is $\boldsymbol a=\pi(\boldsymbol s)$;
  • $P_{\boldsymbol s,\boldsymbol a}$: the state-transition distribution, with $\sum_{\boldsymbol s^\prime\in \mathcal S}P_{\boldsymbol s,\boldsymbol a}(\boldsymbol s^\prime)=1$, $P_{\boldsymbol s,\boldsymbol a}(\boldsymbol s^\prime)\geq0$, $\boldsymbol s^\prime\sim P_{\boldsymbol s,\boldsymbol a}$; it gives the probability of entering state $\boldsymbol s^\prime$ after executing action $\boldsymbol a$ in state $\boldsymbol s$. If the estimate of this probability comes out as $\frac{0}{0}$ (the pair was never observed), a value can be taken from a default probability distribution;
  • $\Gamma$: the discount factor, $\Gamma\in[0,1)$; as time goes on, rewards count for less and less, and this value is set by hand;
  • $R$: the reward function; for a sequence of actions $\boldsymbol a_j,\ j=0,1,\cdots$ the total reward is $\sum_{j}R(\boldsymbol s_j)\Gamma^j$ (the superscript is a power), and this total reward is itself a random variable; the reward function is designed according to the needs of the application.

The MDP problem is to solve $\max \mathbb E\big[\sum_{j}R(\boldsymbol s_j)\Gamma^j\big]$, i.e., to maximize the expected total reward.
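To make the five components concrete, here is a minimal Python sketch (the names discounted_return and estimate_P are illustrative, not from the original text) of computing the total discounted reward of one trajectory and of estimating $P_{\boldsymbol s,\boldsymbol a}$ from randomly gathered experience, with a uniform distribution assumed as the "default" for the $\frac{0}{0}$ case:

    from collections import defaultdict

    def discounted_return(rewards, Gamma):
        """Total reward sum_j R(s_j) * Gamma**j for one trajectory of rewards."""
        return sum(r * Gamma ** j for j, r in enumerate(rewards))

    def estimate_P(experience, states, actions):
        """Estimate P_{s,a}(s') from observed (s, a, s') transitions gathered by
        acting at random; an unseen (s, a) pair (the 0/0 case) falls back to a
        uniform default distribution over all states (an assumption here)."""
        counts = defaultdict(lambda: defaultdict(int))
        for s, a, s_next in experience:
            counts[(s, a)][s_next] += 1
        P = {}
        for s in states:
            for a in actions:
                total = sum(counts[(s, a)].values())
                if total == 0:  # never visited: 0/0
                    P[(s, a)] = {sp: 1.0 / len(states) for sp in states}
                else:
                    P[(s, a)] = {sp: counts[(s, a)][sp] / total for sp in states}
        return P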

Bellman Equation

The value $V(\boldsymbol s)$ of the current state $\boldsymbol s$ equals the reward $R(\boldsymbol s)$ of the current state plus the rewards that may be obtained afterwards (recursively, without bound), i.e.:

$V(\boldsymbol s)=R(\boldsymbol s)+\Gamma \sum_{\boldsymbol s^\prime\in \mathcal S}P_{\boldsymbol s\to \boldsymbol s^\prime}V(\boldsymbol s^\prime)$

The action with the largest (optimal) reward in state $\boldsymbol s$ is:

$\boldsymbol a^*=\pi^*(\boldsymbol s)=\argmax_{\boldsymbol a}\sum_{\boldsymbol s^\prime\in \mathcal S}P_{\boldsymbol s,\boldsymbol a}(\boldsymbol s^\prime)V^*(\boldsymbol s^\prime)=\argmax_{\boldsymbol a}\mathbb E_{\boldsymbol s^\prime\sim P_{\boldsymbol s,\boldsymbol a}}\big[V^*(\boldsymbol s^\prime)\big]$

Here $\phi(\boldsymbol s)$ (a function that raises the dimension of $\boldsymbol s$) is a feature map of the state, $\boldsymbol a^*$ is the output chosen from among the actions $\boldsymbol a\in \mathcal A$, and the value function can be written as $V(\boldsymbol s)=\boldsymbol{\theta}^\intercal\phi(\boldsymbol s)$.
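As a small illustration, the two formulas above look like this in Python; this is a sketch assuming the tabular representation from the earlier snippet (P a dict mapping (s, a) to a distribution over next states, R and V dicts keyed by state, pi a dict mapping state to action):

    def bellman_value(s, R, P, V, pi, Gamma):
        """V(s) = R(s) + Gamma * sum_{s'} P_{s -> s'} V(s'), with s -> s' taken
        under the action pi[s] chosen in state s."""
        return R[s] + Gamma * sum(p * V[sp] for sp, p in P[(s, pi[s])].items())

    def greedy_action(s, P, V, actions):
        """a* = argmax_a sum_{s'} P_{s,a}(s') V(s'), i.e. argmax_a E[V(s')]."""
        return max(actions, key=lambda a: sum(p * V[sp] for sp, p in P[(s, a)].items()))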

Fitting

During training, actions $\boldsymbol a$ are first executed at random to gather "experience", from which the values of $P_{\boldsymbol s,\boldsymbol a}$ are estimated. Then one of the following iterative methods is chosen for fitting (a runnable Python sketch of both methods appears after this list):

  • Value iteration:
    Start with $V_s=0$; iteratively update $V_s^{new}=R(s)+\max_{\boldsymbol a}\Gamma\sum_{s^\prime\in \mathcal S}P_{\boldsymbol s,\boldsymbol a}(s^\prime)V(s^\prime)$ until convergence; once it converges, $V_s\approx V^*(s)$. Here the values $V$ themselves are operated on directly.
    function valueIteration($\mathcal A$, $\mathcal S$) {
        var V = zeros($\mathcal S$.length());   /* initialize to 0 */
        do {
            foreach ($\boldsymbol s$ in $\mathcal S$) {
                $V_{\boldsymbol s}$ = $\displaystyle R(s)+\max_{\boldsymbol a\in \mathcal A}\Gamma\sum_{s^\prime\in \mathcal S}P_{\boldsymbol s,\boldsymbol a}(s^\prime)\ V_{s^\prime}$;
            }
        } while (!isConverged(V));
        return V;
    }
  • Policy iteration:
    Randomly initialize $\pi$; iteratively update $\pi(\boldsymbol s)=\displaystyle\argmax_{\boldsymbol a}\sum_{s^\prime\in \mathcal S}P_{\boldsymbol s,\boldsymbol a}(s^\prime)V^\pi(s^\prime)$ until convergence (each step treats the current $\pi$ as if it were $\pi^*$; this assumption essentially holds once the iteration has converged).
    function policyIteration($\mathcal A$, $\mathcal S$) {
        var $\pi$ = new π(randoms($\mathcal S$.length()));   /* this π is executable */
        do {
            foreach ($\boldsymbol s$ in $\mathcal S$) {
                /* policy evaluation: an expression for the value when the decision function is π */
                lambda $V^\pi(s)$ = $\displaystyle R(s)+\Gamma\sum_{s^\prime\in \mathcal S}P_{\boldsymbol s,\pi(\boldsymbol s)}(s^\prime)\ V^\pi(s^\prime)$;
                /* policy improvement: whichever action reaches the largest value V is the optimal choice, so maximize the expectation */
                $\pi(\boldsymbol s)$ = $\displaystyle\argmax_{\boldsymbol a\in \mathcal A}\sum_{s^\prime\in \mathcal S}P_{\boldsymbol s,\boldsymbol a}(s^\prime)\ V^\pi(s^\prime)$;
            }
        } while (!isConverged($\pi$));
        return $\pi$;
    }

The difficulty that the recursion is infinite can be handled by specifying a maximum recursion depth (a finite horizon).
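As referenced above, here is a runnable Python sketch of both loops under the same tabular assumptions (P, R, states, actions as in the earlier snippets). It is only an illustration: convergence is checked with a numeric tolerance, and policy evaluation is approximated by repeated sweeps instead of solving the linear system for $V^\pi$ exactly.

    import random

    def value_iteration(states, actions, P, R, Gamma, tol=1e-6):
        """Repeat V_s := R(s) + max_a Gamma * sum_{s'} P_{s,a}(s') V(s')
        until the values stop changing; at that point V ≈ V*."""
        V = {s: 0.0 for s in states}                      # initialize to 0
        while True:
            delta = 0.0
            for s in states:
                new_v = R[s] + max(
                    Gamma * sum(p * V[sp] for sp, p in P[(s, a)].items())
                    for a in actions)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:                               # converged
                return V

    def policy_iteration(states, actions, P, R, Gamma, eval_sweeps=100):
        """Alternate (approximate) policy evaluation and policy improvement
        until the policy stops changing."""
        pi = {s: random.choice(actions) for s in states}  # random initial policy
        while True:
            # policy evaluation: V^pi(s) = R(s) + Gamma * sum_{s'} P_{s,pi(s)}(s') V^pi(s')
            V = {s: 0.0 for s in states}
            for _ in range(eval_sweeps):
                for s in states:
                    V[s] = R[s] + Gamma * sum(
                        p * V[sp] for sp, p in P[(s, pi[s])].items())
            # policy improvement: pi(s) = argmax_a sum_{s'} P_{s,a}(s') V^pi(s')
            new_pi = {
                s: max(actions,
                       key=lambda a: sum(p * V[sp] for sp, p in P[(s, a)].items()))
                for s in states}
            if new_pi == pi:                              # converged
                return pi
            pi = new_pi

For example, pi = policy_iteration(states, actions, estimate_P(experience, states, actions), R, 0.9) would produce a policy for the model estimated from experience.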

Continuous States

$\mathcal S$ may be an infinite set. For $m$ randomly sampled states, first compute approximate target values $y_i$, then use supervised learning to make $V$ fit $y$. Moreover, $\mathbb E_{\boldsymbol s^\prime\sim P_{\boldsymbol s,\boldsymbol a}}\big[V^*(\boldsymbol s^\prime)\big]\approx V^*(\mathbb E[\boldsymbol s^\prime])=V^*(f(\boldsymbol s,\boldsymbol a))$, where $\boldsymbol s^\prime=f(\boldsymbol s,\boldsymbol a)+\varepsilon$ and $\varepsilon$ is Gaussian noise that is swallowed up by the expectation.

function valueIteration($\boldsymbol\theta$, $\mathcal A$, $\mathcal S_全$, m) {
    var $\mathcal S$ = new RandomSubSet($\mathcal S_全$);   /* randomly pick m states from the full state space $\mathcal S_全$ */
    do {
        var $\mathcal y$ = new List(m);
        foreach (i in $\mathcal S$.sample(m)) {
            var $\boldsymbol s$ = $\mathcal S_i$;
            var q = new Q();
            foreach ($\boldsymbol a$ in $\mathcal A$) {
                var $\mathcal S^\prime$ = filter($\mathcal S$, $P_{\boldsymbol s,\boldsymbol a}$);   /* $\forall \boldsymbol s^\prime \in \mathcal S^\prime,\ \boldsymbol{s}^\prime\sim P_{\boldsymbol s^i,\boldsymbol a}$ */
                var k = $\mathcal S^\prime$.length();
                /* so $q(\boldsymbol a)$ is an estimate of $R(\boldsymbol s)+\Gamma\,\mathbb E_{\boldsymbol s^\prime\sim P_{\boldsymbol s,\boldsymbol a}}\big[V(\boldsymbol s^\prime)\big]$ */
                $q(\boldsymbol a)$ = $\displaystyle R(\boldsymbol s)+\frac{\Gamma}{k}\sum_{\boldsymbol s^\prime \in\mathcal S^\prime}V(\boldsymbol s^\prime)$;
            }
            /* $\mathcal y_i$ approximates $R(\boldsymbol s)+\Gamma\displaystyle\max_{\boldsymbol a\in\mathcal A}\mathbb E_{\boldsymbol s^\prime\sim P_{\boldsymbol s,\boldsymbol a}}\big[V(\boldsymbol s^\prime)\big]$; in the end $V(\boldsymbol s)\approx \mathcal y_i$ */
            $\mathcal y_i$ = $\displaystyle\max_{\boldsymbol a\in\mathcal A}q(\boldsymbol a)$;
        }
        /* In the original (discrete-state) value iteration we updated the value function directly with $V(\boldsymbol s):=\mathcal y_i$; here supervised learning (linear regression) is used instead so that $V(\boldsymbol s)\approx \mathcal y_i$. */
        $\boldsymbol\theta$ = $\displaystyle\argmin_{\boldsymbol\theta}\frac{1}{2}\sum_{i=1}^{m}\Big(\boldsymbol\theta^\intercal\phi(\boldsymbol s_i)-\mathcal y_i\Big)^2$;
    } while (!isConverged($\boldsymbol\theta$));
    return $\boldsymbol\theta$;
}
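A corresponding Python sketch of the fitted value iteration above, using NumPy: phi and sample_next_state are hypothetical stand-ins for the feature map $\phi(\boldsymbol s)$ and for drawing $\boldsymbol s^\prime\sim P_{\boldsymbol s,\boldsymbol a}$ from a learned or simulated model, and the supervised step solves the least-squares problem for $\boldsymbol\theta$ in closed form.

    import numpy as np

    def fitted_value_iteration(phi, sample_next_state, R, actions, sampled_states,
                               Gamma, k=10, iters=50):
        """phi(s) -> 1-D feature vector; sample_next_state(s, a) draws s' ~ P_{s,a}.
        Returns theta such that V(s) ≈ theta^T phi(s)."""
        theta = np.zeros(len(phi(sampled_states[0])))
        V = lambda s: theta @ phi(s)                      # V(s) = theta^T phi(s)
        for _ in range(iters):
            y = []
            for s in sampled_states:
                # q(a) estimates R(s) + Gamma * E_{s'~P_{s,a}}[V(s')] with k samples
                q = [R(s) + (Gamma / k) * sum(V(sample_next_state(s, a))
                                              for _ in range(k))
                     for a in actions]
                y.append(max(q))                          # y_i = max_a q(a)
            # theta = argmin_theta (1/2) * sum_i (theta^T phi(s_i) - y_i)^2
            Phi = np.array([phi(s) for s in sampled_states])
            theta, *_ = np.linalg.lstsq(Phi, np.array(y), rcond=None)
        return theta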
