add new column in [[🐕saprhd 👮‍♀️policy]]

|  | <mark class = "green">👁️state</mark> | <mark class = "orange">🤜action</mark> | <mark class = "purple">👀🤜P-transition</mark> | <mark class = "yellow">💰reward</mark>, obj (max long-term expected rwd) | ⏳timestep, horizon, $\gamma$; $H=1/(1-\gamma)$ | diagram | <mark class = "red">👮🏻policy</mark>, learning alg, Bellman eq (NOT MDP) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| e2a(`s`, <mark class = "purple">e2e</mark>(`s`, a2e(<mark class = "red">a2a</mark>(`s`, $\epsilon^a$), $\epsilon^e$))) | 👮🏻2👮🏻(👁️, $\epsilon^a$)<br>🕘 | 🌏2🌏(👁️,🤜)=👁️'<br>🕞update 👁️ | 🌏2👮🏻(👁️,🤜,👁️')=💰<br>🕒 | <mark class = "yellow">💰</mark>(<mark class = "green">👁️</mark>,<br><mark class = "purple">👀🤜</mark>(<mark class = "green">👁️</mark>,<br><mark class = "orange">🤜</mark>(=<mark class = "red">👮🏻</mark>(<mark class = "green">👁️</mark>, $\epsilon^a$)),<br>$\epsilon^e$))<br><mark class = "yellow">reward</mark>(<mark class = "green">state</mark>,<br><mark class = "purple">transition</mark>(<mark class = "green">state</mark>,<br><mark class = "orange">action</mark>=<mark class = "red">policy</mark>(<mark class = "green">state</mark>, $\epsilon^a$),<br>$\epsilon^e$))<br>s' info = $s, a, \epsilon^e$ | | | 👮🏻2🌏(👁️)=🤜<br>🕡update 👮🏻 |
| 🚢berth | 6 states<br><br>001, 002,<br>101 (=200), 102 (=201),<br>011 (=110), 012 (=111) | | ![[Pasted image 20240429072429.png\|100]]<br>🟦state transition w/o action<br>![[Pasted image 20240429072455.png\|100]] | avg time in system | $\infty$ | ![[Pasted image 20240428171534.png\|100]] | |
| 📦Inventory | Inventory level $S_t \in \{0, \ldots, M\}$ | Order quantity $a_t \in \{0, \ldots, M-S_t\}$ | $P_{ss'}=P(s_{t+1}=s' \mid s_t=s, a_t=a)$<br>$s_{t+1} = (s_t + a_t - D_t)^{+}$<br><mark class = "purple">stochastic demand $D_t$</mark> | $rev\left(s_t+a_t-s_{t+1}\right)-cost_{buy}\left(a_t\right)-cost_{holding}\left(s_t+a_t\right)$ | (1 month, 12 months, 0.92) | ![[Pasted image 20240428170551.png\|300]] | det. $\pi(s)=\begin{cases}C-s & \text{if } s<M/4 \\ 0 & \text{otherwise}\end{cases}$<br>stochastic $\pi(s_t)=\begin{cases}U(C-s_{t-1},\ C-s_{t-1}+10) & \text{if } s_t<s_{t-1}/2 \\ 0 & \text{otherwise}\end{cases}$<br>(simulation sketch below) |
| 👏rideshare matching | location, time, and vehicle type of an idle driver | request destination location/time | new idle state of the driver given the request (deterministic) | expected assignment earnings, or 0 for idle drivers<br>total trip duration, or 4 sec for idle drivers | | ![[Pasted image 20240429094453.png\|100]] | new idle state of the driver given the request (deterministic) |
| 🛑parking (optimal stopping) | $\{(0, T), (0, A), (1, T), (1, A), \ldots, (C, T), (C, A), \text{leave}, \text{park}\}$<br><br>parking stage; spot taken (T) or available (A) | $\begin{cases}\{\text{park}, \text{continue}\} & \text{if } s=(\cdot, A) \\ \{\text{continue}\} & \text{if } s=(\cdot, T) \\ \text{do nothing} & \text{o.w.}\end{cases}$ | | r(s,a) = k if s=(k,A) and a=park; −c on leave; 0 o.w. | | ![[Pasted image 20240503080731.png]] | ![[Pasted image 20240503080837.png]] |
| 🪂Loon | {balloon altitude, battery charge, solar elevation,<br>distance to station, relative bearing,<br>time from sunrise, navigation enabled,<br>has excess energy, descent cost,<br>internal pressure ratio, last command,<br>wind column (magnitude, relative bearing, uncertainty)} | {Ascend (no power cost),<br>Descend (power cost),<br>Stay (no power cost)} | not provided | distance-based reward with a penalty for angular velocity:<br>![[Pasted image 20240428151123.png]] | (3 minutes, 2 days, -; 1000) | | DQN |
| control 🚑 AV 🚓 road | $\{v_{lead}, h_{lead}, v_{follower}, h_{follower}, v_{ev}, l_{ev}, p_{ev}, v_{cav}, p_{cav}\}$<br>speeds of the 🚑's leading and following vehicles; speed, lane, and position of the EMS vehicle; speed and position of the CAV | acceleration command $\alpha \in (\alpha_{min}, \alpha_{max})$ for longitudinal control of the CAV | implicit; sample from a microscopic simulator, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ | $\begin{cases} \nu_1 v_{ev} + \nu_2 v_{cav} & \text{if } p_{cav} - p_{ev} \leq 0 \wedge l_{ev} \neq l_{cav} \\ \nu_3 v_{ev} + \nu_4 v_{cav} & \text{if } p_{cav} - p_{ev} > 0 \\ v_{cav} & \text{if } l_{ev} = l_{cav} \vee \text{no 🚑} \end{cases}$ | | ![[Pasted image 20240428153517.png\|300]] | |
| 🪫EV routing p3.1 | high, low, dead | $\{\text{search}, \text{stay}, \text{charge}\}$ | $p$, $q$ | $r_{stay}$: expected number of packages while staying<br>$r_{search}$: expected number of packages while searching<br>$-10$: AV is dead and needs rescue<br><br>$r_{search} > r_{stay}$ | | ![[Pasted image 20240428084418.png\|300]] | |
| ✈️Plane Loading<br>p3.2 | (remaining capacity, accumulated value) | $\{\text{Load } i \mid i \in [1, 5]\}$ | deterministic | $r\big((w, v) \xrightarrow[\text{item } i]{} (w-w_i,\ v+v_i)\big) = v_i$ | (1, 5, 1) | ![[Pasted image 20240428085545.png\|200]] | |
| 🏷️price setting<br>p3.3 | $\{x_i, i=1, \ldots, n\} \cup \{o, r\}$<br>$x_i$: setting the price $s_i$<br>$o$: riders drawn away by the competing service<br>$r$: invest in alternative transportation projects | $A(x_i)=\{A, J\}$, $A(o)=\{I, R\}$ | | | | ![[Pasted image 20240429072342.png\|300]] | $V^*(x_i)=\max \left\{s_i,\ \gamma\left(\beta V^*(o)+(1-\beta) \sum_{j=1}^n p_j V^*(x_j)\right)\right\}$<br>$V^*(o)=\max \left\{0,\ -c+\gamma\left((1-\alpha) V^*(o)+\alpha \sum_{j=1}^n p_j V^*(x_j)\right)\right\}$ |
| 🅿️easy parking (p3.4) | 1, 2, 3, 4, 5, 6, 7 | | | $r(1,2,3,5,6,7)=-1$, $r(4)=0$<br>sparse reward → shaping | $V^*(s) = -\lvert s-4 \rvert$ | ![[Pasted image 20240429094618.png\|300]] | (s, a, r) samples: (3, −1, −1), (2, 1, −1), (3, 1, −1), (4, 1, 0)<br>$Q(s, a) \leftarrow Q(s, a)+\alpha\left(r+\gamma \max_{a' \in \{-1,1\}} Q(s', a')-Q(s, a)\right)$<br>$\begin{aligned} Q(3,-1) & \leftarrow 0+\tfrac{1}{2}\left(-1+\max_{a'} Q(2, a')\right)=\tfrac{1}{2}(-1+0)=-\tfrac{1}{2} \\ Q(2,1) & \leftarrow 0+\tfrac{1}{2}\left(-1+\max_{a'} Q(3, a')\right)=\tfrac{1}{2}(-1+0)=-\tfrac{1}{2} \\ Q(3,1) & \leftarrow 0+\tfrac{1}{2}\left(-1+\max_{a'} Q(4, a')\right)=\tfrac{1}{2}(-1+0)=-\tfrac{1}{2} \end{aligned}$<br>$\hat{Q}(s, a) \xrightarrow{a.s.} \hat{Q}^*(s, a)$<br>$V_n(s)=\max_{a \in A}\left[r(s, a)+\gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[V_{n-1}(s')\right]\right]$<br>(Python sketches below) |
|  | <mark class = "green">👁️state</mark> | <mark class = "orange">🤜action</mark> | <mark class = "purple">👀🤜P-transition</mark> | <mark class = "yellow">💰reward</mark>, obj (max long-term expected rwd) | ⏳timestep, horizon, $\gamma$; $H=1/(1-\gamma)$ | diagram | <mark class = "red">👮🏻policy</mark>, learning alg, Bellman eq (NOT MDP) |
| 🚓(🚗🚦)<br>q21<br>🚓(🚗)<br>q23 | q21: $\{v_0, v_1, \ldots, v_{n-1}, x_0, x_1, \ldots, x_{n-1}, tl\}$, incl. traffic-light state $tl$<br>q23: $\{v_0, v_1, \ldots, v_{n-1}, x_0, x_1, \ldots, x_{n-1}\}$ | $a=(1-\alpha) a_0+\alpha a_{+}$ | $P(v_i' \mid s, a)=N(v_i+a_i \Delta t,\ \sigma_1)$<br>$P(x_i' \mid s, a)=N(x_i+v_i \Delta t+0.5 a_i \Delta t^2,\ \sigma_2)$ | $-w_f \frac{1}{n} \sum_{i=0}^{n} f_i+w_s \frac{1}{n} \sum_{i=0}^{n} v_i$ | | | |
| 🚁drone<br>q23 | A, B, C | ab, ba, bc, ca, cb | deterministic except C→A | r(ab) = −8, r(ba) = 2, r(ca) = 4, r(cb) = 8, r(bc) = −2 | | ![[Pasted image 20240428103737.png\|300]] | ![[Pasted image 20240501111555.png]] |
| signal<br>q23 | (green, stop), (green, go), (red, stop), (red, go) | | | 1 when (green, stop), 0 o.w. | | | |
| ☕️free coffee (Q, V) | $\{s_0, s_1, s_2\}$ | only one action available | $P(s_1 \mid s_0) = 1$, $P(s_2 \mid s_1) = p$, $P(s_1 \mid s_1) = 1-p$, $P(s_2 \mid s_2) = 1$ (absorbing state) | $r(s_1) = 1$, $r(s_0) = r(s_2) = 0$ | infinite | ![[Pasted image 20240429093200.png\|300]] | V with the optimal Bellman eq.<br>$V^{*}(s)=\max_{a \in A} \left[r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} [V^{*}(s')]\right]$<br>$V^{*}(s_2)=r(s_2)+ \gamma \cdot 1 \cdot V^{*}(s_2) \Rightarrow V^{*}(s_2)=0$<br>$V^{*}(s_1)=r(s_1)+\gamma\left[p \cdot V^{*}(s_1)+(1-p) \cdot V^{*}(s_2)\right] \Rightarrow V^{*}(s_1)=\frac{1}{1-\gamma p}$<br>$V^{*}(s_0)=\frac{\gamma}{1-\gamma p}$<br>Q with the Bellman consistency equation (Q = V, since there is only one action)<br>$Q^{*}(s,a)=r(s,a)+\gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} [V^{*}(s')]$<br>$Q^{*}(s_0,a)=r(s_0)+\gamma \cdot 1 \cdot V^{*}(s_1)=0+\gamma \cdot \frac{1}{1-\gamma p} =\frac{\gamma}{1-\gamma p}$ |
| 🏷️Asset pricing | Inventory level $\{S_0, S_0-1, \ldots, 0\}$ | set price (continuous) | $s_{t+1}=s_t-w_t(a_t)$<br>$w_t(a_t)= \begin{cases}1 & \text{with probability } \alpha e^{-a_t} \\ 0 & \text{with probability } 1-\alpha e^{-a_t}\end{cases}$ | $r_t=\begin{cases}a_t & \text{with probability } \alpha e^{-a_t} \\ 0 & \text{with probability } 1-\alpha e^{-a_t}\end{cases}$<br><br>objective: $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t a_t w_t\right]$ | | | among policy iteration, value iteration, and Q-iteration, Q-iteration is usually used here (due to the state/action structure and the expectation over $s_0, a_1, s_1, \ldots$) |
| 🔁value iter | {1, 2, ..., 16} | {Up, Down, Left, Right} | 1. stochastic: $P(s' \mid s,a) = 0.8$ for the desired direction, $0.1$ for each perpendicular direction; bumping into a wall leaves the agent in the same state<br>2, 3. deterministic: transitions follow the chosen action, staying in the same state when bumping into a wall | 1. $r_s=0$ for non-terminal states, $r_g=+5$ for the goal state (12), $r_r=-5$ for the red square of death (5)<br>2. $r_s=-1$ for non-terminal states, $r_g=+5$ for the goal state (12), $r_r=-5$ for the red square of death (5)<br>3. all rewards shifted by +2, so $r_s=+1$, $r_g=+7$, $r_r=-3$ | 1. $\gamma=0.9$<br>2, 3. $\gamma=1.0$<br><br>H: infinite | ![[Pasted image 20240503063028.png]] | $V_n(s) = \max_{a \in A} \left[r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}[V_{n-1}(s')]\right]$<br>- the Bellman operator is used to compute values when the policy is known<br>- the optimal policy encourages reaching the goal state in the minimum number of steps when $r_s=-1$<br>(value-iteration sketch below) |

[[📝🌊nailin_sail_eval]]
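
Below are a few Python sketches for rows above; all are illustrative readings of the table, not canonical implementations. First, a short rollout of the 📦Inventory row under a literal reading of $s_{t+1}=(s_t+a_t-D_t)^+$ and the deterministic order-up-to policy from its policy cell; the Poisson demand, the unit revenue/costs, and $C=M=20$ are my assumptions, not given in the note.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 20                                     # capacity (assumed value)
rev, cost_buy, cost_hold = 5.0, 2.0, 0.1   # assumed unit revenue / costs
C = M                                      # order-up-to level used by the policy (assumed)

def policy(s):
    """Deterministic policy from the row: order C - s when inventory is low, else 0."""
    return C - s if s < M / 4 else 0

def step(s, a):
    """One transition: s_{t+1} = (s_t + a_t - D_t)^+ with stochastic demand D_t."""
    d = rng.poisson(4)                     # assumed demand distribution
    s_next = max(s + a - d, 0)
    sold = s + a - s_next                  # units actually sold this period
    r = rev * sold - cost_buy * a - cost_hold * (s + a)
    return s_next, r

s = 0
for t in range(12):                        # 12 monthly steps, matching the horizon cell
    a = min(policy(s), M - s)              # order quantity is capped at M - s_t
    s, r = step(s, a)
    print(t, a, s, round(r, 2))
```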
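
Next, a minimal value-iteration sketch of the update $V_n(s)=\max_a[r(s,a)+\gamma\,\mathbb{E}[V_{n-1}(s')]]$ from the 🔁 value iter and 🅿️easy parking rows, run on my reading of the easy-parking corridor: states 1..7, actions ±1, deterministic moves clipped at the ends, reward −1 outside the absorbing goal state 4, and $\gamma=1$. It should reproduce $V^*(s)=-\lvert s-4\rvert$.

```python
# Assumed corridor MDP for 🅿️easy parking (p3.4): states 1..7, actions -1/+1,
# deterministic moves clipped at the ends, reward -1 outside the goal state 4,
# goal absorbing with reward 0, gamma = 1.
states = range(1, 8)
actions = (-1, +1)
gamma, goal = 1.0, 4

def step(s, a):
    """Deterministic transition; the goal is treated as absorbing."""
    return goal if s == goal else min(max(s + a, 1), 7)

def reward(s):
    return 0 if s == goal else -1

# Value iteration: V_n(s) = max_a [ r(s) + gamma * V_{n-1}(s') ]
V = {s: 0.0 for s in states}
for _ in range(20):
    V = {s: max(reward(s) + gamma * V[step(s, a)] for a in actions) for s in states}

print(V)  # expect V*(s) = -|s - 4|, e.g. V[1] == -3.0 and V[7] == -3.0
```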
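
Finally, the tabular Q-learning update from the 🅿️easy parking row, replayed on the listed samples with $\alpha=\tfrac{1}{2}$, $\gamma=1$, and $Q$ initialized to 0; taking the next state as $s+a$ is my assumption.

```python
# Tabular Q-learning update Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)),
# replayed on the (s, a, r) samples listed in the row (next state assumed to be s + a).
alpha, gamma = 0.5, 1.0
Q = {(s, a): 0.0 for s in range(1, 8) for a in (-1, +1)}

transitions = [(3, -1, -1, 2), (2, +1, -1, 3), (3, +1, -1, 4)]  # (s, a, r, s')
for s, a, r, s_next in transitions:
    target = r + gamma * max(Q[(s_next, -1)], Q[(s_next, +1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q[(3, -1)], Q[(2, +1)], Q[(3, +1)])  # -0.5 -0.5 -0.5, matching the hand computation
```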