add new column in [[🐕saprhd 👮‍♀️policy]]

|  | <mark class = "green">👁️state</mark> | <mark class = "orange">🤜action</mark> | <mark class = "purple">👀🤜P-transition</mark> | <mark class = "yellow">💰reward</mark>, obj (max long-term expected rwd) | ⏳timestep, horizon, $\gamma$; $H=1/(1-\gamma)$ | diagram | <mark class = "red">👮🏻policy</mark>, learning alg, Bellman eq (NOT MDP) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| e2a(`s`, <mark class = "purple">e2e</mark>(`s`, a2e(<mark class = "red">a2a</mark>(`s`, $\epsilon^a$), $\epsilon^e$))) | 👮🏻2👮🏻(👁️, $\epsilon^a$)<br>🕘 | 🌏2🌏(👁️,🤜)=👁️'<br>🕞update 👁️ | 🌏2👮🏻(👁️,🤜,👁️')=💰<br>🕒 | <mark class = "yellow">💰</mark>(<mark class = "green">👁️</mark>,<br><mark class = "purple">👀🤜</mark>(<mark class = "green">👁️</mark>,<br><mark class = "orange">🤜</mark>(=<mark class = "red">👮🏻</mark>(<mark class = "green">👁️</mark>, $\epsilon^a$)),<br>$\epsilon^e$))<br><mark class = "yellow">reward</mark>(<mark class = "green">state</mark>,<br><mark class = "purple">transition</mark>(<mark class = "green">state</mark>,<br><mark class = "orange">action</mark>=<mark class = "red">policy</mark>(<mark class = "green">state</mark>, $\epsilon^a$),<br>$\epsilon^e$))<br>s' info = $s, a, \epsilon^e$ | | | 👮🏻2🌏(👁️)=🤜<br>🕡update 👮🏻 |
| 🚢berth | 6 states<br><br>001, 002,<br>101 (=200), 102 (=201),<br>011 (=110), 012 (=111) | | ![[Pasted image 20240429072429.png\|100]]<br>🟦state transition w/o action<br>![[Pasted image 20240429072455.png\|100]] | avg time in system | $\infty$ | ![[Pasted image 20240428171534.png\|100]] | |
| 📦Inventory | Inventory level $S_t \in \{0, \ldots, M\}$ | Order quantity $a_t \in \{0, \ldots, M-S_t\}$ | $P_{ss'}=P(s_{t+1}=s' \mid s_t=s, a_t=a)$<br>$s_{t+1} = (s_t + a_t - D_t)^{+}$<br><mark class = "purple">stochastic demand $D_t$</mark> | $rev\left(s_t+a_t-s_{t+1}\right)-cost_{buy}\left(a_t\right)-cost_{holding}\left(s_t+a_t\right)$ | (1 month, 12 months, 0.92) | ![[Pasted image 20240428170551.png\|300]] | det. $\pi(s)=\begin{cases}C-s & \text{if } s<M/4 \\ 0 & \text{otherwise}\end{cases}$<br>stochastic $\pi(s_t)=\begin{cases}U(C-s_{t-1},\ C-s_{t-1}+10) & \text{if } s_t<s_{t-1}/2 \\ 0 & \text{otherwise}\end{cases}$<br>(simulation sketch below) |
| 👏rideshare matching | location, time, and vehicle type of an idle driver | request destination location/time | new idle state of the driver given the request (deterministic) | expected assignment earnings, or 0 for idle drivers<br>total trip duration, or 4 sec for idle drivers | | ![[Pasted image 20240429094453.png\|100]] | new idle state of the driver given the request (deterministic) |
| 🛑parking (optimal stopping) | $\{(0, T), (0, A), (1, T), (1, A), \ldots, (C, T), (C, A), \text{leave}, \text{park}\}$<br><br>parking stage; spot taken (T) or available (A) | $\begin{cases}\{\text{park}, \text{continue}\} & \text{if } s=(\cdot, A) \\ \{\text{continue}\} & \text{if } s=(\cdot, T) \\ \text{do nothing} & \text{o.w.}\end{cases}$ | | r(s,a) = k if s=(k,A) and a=park; −c on leave; 0 o.w. | | ![[Pasted image 20240503080731.png]] | ![[Pasted image 20240503080837.png]] |
| 🪂Loon | {balloon altitude, battery charge, solar elevation,<br>distance to station, relative bearing,<br>time from sunrise, navigation enabled,<br>has excess energy, descent cost,<br>internal pressure ratio, last command,<br>wind column (magnitude, relative bearing, uncertainty)} | {Ascend (no power cost),<br>Descend (power cost),<br>Stay (no power cost)} | not provided | distance-based reward with a penalty for angular velocity:<br>![[Pasted image 20240428151123.png]] | (3 minutes, 2 days, -; 1000) | | DQN |
| control 🚑 AV 🚓 road | $\{v_{lead}, h_{lead}, v_{follower}, h_{follower}, v_{ev}, l_{ev}, p_{ev}, v_{cav}, p_{cav}\}$<br>speeds of the 🚑's leading and following vehicles; speed, lane, and position of the EMS vehicle; speed and position of the CAV | acceleration command $\alpha \in (\alpha_{min}, \alpha_{max})$ for longitudinal control of the CAV | implicit; sample from a microscopic simulator, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ | $\begin{cases} \nu_1 v_{ev} + \nu_2 v_{cav} & \text{if } p_{cav} - p_{ev} \leq 0 \wedge l_{ev} \neq l_{cav} \\ \nu_3 v_{ev} + \nu_4 v_{cav} & \text{if } p_{cav} - p_{ev} > 0 \\ v_{cav} & \text{if } l_{ev} = l_{cav} \vee \text{no 🚑} \end{cases}$ | | ![[Pasted image 20240428153517.png\|300]] | |
| 🪫EV routing p3.1 | high, low, dead | $\{\text{search}, \text{stay}, \text{charge}\}$ | $p$, $q$ | $r_{stay}$: expected number of packages while staying<br>$r_{search}$: expected number of packages while searching<br>$-10$: AV is dead and needs rescue<br><br>$r_{search} > r_{stay}$ | | ![[Pasted image 20240428084418.png\|300]] | |
| ✈️Plane Loading<br>p3.2 | (remaining capacity, accumulated value) | $\{\text{Load } i \mid i \in [1, 5]\}$ | deterministic | $r\big((w, v) \xrightarrow[\text{item } i]{} (w-w_i,\ v+v_i)\big) = v_i$ | (1, 5, 1) | ![[Pasted image 20240428085545.png\|200]] | |
| 🏷️price setting<br>p3.3 | $\{x_i, i=1, \ldots, n\} \cup \{o, r\}$<br>$x_i$: setting the price $s_i$<br>$o$: riders drawn away by the competing service<br>$r$: invest in alternative transportation projects | $A(x_i)=\{A, J\}$, $A(o)=\{I, R\}$ | | | | ![[Pasted image 20240429072342.png\|300]] | $V^*(x_i)=\max \left\{s_i,\ \gamma\left(\beta V^*(o)+(1-\beta) \sum_{j=1}^n p_j V^*(x_j)\right)\right\}$<br>$V^*(o)=\max \left\{0,\ -c+\gamma\left((1-\alpha) V^*(o)+\alpha \sum_{j=1}^n p_j V^*(x_j)\right)\right\}$ |
| 🅿️easy parking (p3.4) | 1, 2, 3, 4, 5, 6, 7 | | | $r(1,2,3,5,6,7)=-1$, $r(4)=0$<br>sparse reward → shaping | $V^*(s) = -\lvert s-4 \rvert$ | ![[Pasted image 20240429094618.png\|300]] | (s, a, r) samples: (3, −1, −1), (2, 1, −1), (3, 1, −1), (4, 1, 0)<br>$Q(s, a) \leftarrow Q(s, a)+\alpha\left(r+\gamma \max_{a' \in \{-1,1\}} Q(s', a')-Q(s, a)\right)$<br>$\begin{aligned} Q(3,-1) & \leftarrow 0+\tfrac{1}{2}\left(-1+\max_{a'} Q(2, a')\right)=\tfrac{1}{2}(-1+0)=-\tfrac{1}{2} \\ Q(2,1) & \leftarrow 0+\tfrac{1}{2}\left(-1+\max_{a'} Q(3, a')\right)=\tfrac{1}{2}(-1+0)=-\tfrac{1}{2} \\ Q(3,1) & \leftarrow 0+\tfrac{1}{2}\left(-1+\max_{a'} Q(4, a')\right)=\tfrac{1}{2}(-1+0)=-\tfrac{1}{2} \end{aligned}$<br>$\hat{Q}(s, a) \xrightarrow{a.s.} \hat{Q}^*(s, a)$<br>$V_n(s)=\max_{a \in A}\left[r(s, a)+\gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[V_{n-1}(s')\right]\right]$<br>(Python sketches below) |
|  | <mark class = "green">👁️state</mark> | <mark class = "orange">🤜action</mark> | <mark class = "purple">👀🤜P-transition</mark> | <mark class = "yellow">💰reward</mark>, obj (max long-term expected rwd) | ⏳timestep, horizon, $\gamma$; $H=1/(1-\gamma)$ | diagram | <mark class = "red">👮🏻policy</mark>, learning alg, Bellman eq (NOT MDP) |
| 🚓(🚗🚦)<br>q21<br>🚓(🚗)<br>q23 | q21: $\{v_0, v_1, \ldots, v_{n-1}, x_0, x_1, \ldots, x_{n-1}, tl\}$, incl. traffic-light state $tl$<br>q23: $\{v_0, v_1, \ldots, v_{n-1}, x_0, x_1, \ldots, x_{n-1}\}$ | $a=(1-\alpha) a_0+\alpha a_{+}$ | $P(v_i' \mid s, a)=N(v_i+a_i \Delta t,\ \sigma_1)$<br>$P(x_i' \mid s, a)=N(x_i+v_i \Delta t+0.5 a_i \Delta t^2,\ \sigma_2)$ | $-w_f \frac{1}{n} \sum_{i=0}^{n} f_i+w_s \frac{1}{n} \sum_{i=0}^{n} v_i$ | | | |
| 🚁drone<br>q23 | A, B, C | ab, ba, bc, ca, cb | deterministic except C→A | r(ab) = −8, r(ba) = 2, r(ca) = 4, r(cb) = 8, r(bc) = −2 | | ![[Pasted image 20240428103737.png\|300]] | ![[Pasted image 20240501111555.png]] |
| signal<br>q23 | (green, stop), (green, go), (red, stop), (red, go) | | | 1 when (green, stop), 0 o.w. | | | |
| ☕️free coffee (Q, V) | $\{s_0, s_1, s_2\}$ | only one action available | $P(s_1 \mid s_0) = 1$, $P(s_2 \mid s_1) = p$, $P(s_1 \mid s_1) = 1-p$, $P(s_2 \mid s_2) = 1$ (absorbing state) | $r(s_1) = 1$, $r(s_0) = r(s_2) = 0$ | infinite | ![[Pasted image 20240429093200.png\|300]] | V with the optimal Bellman eq.<br>$V^{*}(s)=\max_{a \in A} \left[r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} [V^{*}(s')]\right]$<br>$V^{*}(s_2)=r(s_2)+ \gamma \cdot 1 \cdot V^{*}(s_2) \Rightarrow V^{*}(s_2)=0$<br>$V^{*}(s_1)=r(s_1)+\gamma\left[p \cdot V^{*}(s_1)+(1-p) \cdot V^{*}(s_2)\right] \Rightarrow V^{*}(s_1)=\frac{1}{1-\gamma p}$<br>$V^{*}(s_0)=\frac{\gamma}{1-\gamma p}$<br>Q with the Bellman consistency equation (Q = V, since there is only one action)<br>$Q^{*}(s,a)=r(s,a)+\gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} [V^{*}(s')]$<br>$Q^{*}(s_0,a)=r(s_0)+\gamma \cdot 1 \cdot V^{*}(s_1)=0+\gamma \cdot \frac{1}{1-\gamma p} =\frac{\gamma}{1-\gamma p}$ |
| 🏷️Asset pricing | Inventory level $\{S_0, S_0-1, \ldots, 0\}$ | set price (continuous) | $s_{t+1}=s_t-w_t(a_t)$<br>$w_t(a_t)= \begin{cases}1 & \text{with probability } \alpha e^{-a_t} \\ 0 & \text{with probability } 1-\alpha e^{-a_t}\end{cases}$ | $r_t=\begin{cases}a_t & \text{with probability } \alpha e^{-a_t} \\ 0 & \text{with probability } 1-\alpha e^{-a_t}\end{cases}$<br><br>objective: $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t a_t w_t\right]$ | | | among policy iteration, value iteration, and Q-iteration, Q-iteration is usually used here (due to the state/action structure and the expectation over $s_0, a_1, s_1, \ldots$) |
| 🔁value iter | {1, 2, ..., 16} | {Up, Down, Left, Right} | 1. stochastic: $P(s' \mid s,a) = 0.8$ for the desired direction, $0.1$ for each perpendicular direction; bumping into a wall leaves the agent in the same state<br>2, 3. deterministic: transitions follow the chosen action, staying in the same state when bumping into a wall | 1. $r_s=0$ for non-terminal states, $r_g=+5$ for the goal state (12), $r_r=-5$ for the red square of death (5)<br>2. $r_s=-1$ for non-terminal states, $r_g=+5$ for the goal state (12), $r_r=-5$ for the red square of death (5)<br>3. all rewards shifted by +2, so $r_s=+1$, $r_g=+7$, $r_r=-3$ | 1. $\gamma=0.9$<br>2, 3. $\gamma=1.0$<br><br>H: infinite | ![[Pasted image 20240503063028.png]] | $V_n(s) = \max_{a \in A} \left[r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}[V_{n-1}(s')]\right]$<br>- the Bellman operator is used to compute values when the policy is known<br>- the optimal policy encourages reaching the goal state in the minimum number of steps when $r_s=-1$<br>(value-iteration sketch below) |

[[📝🌊nailin_sail_eval]]
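
Below are a few Python sketches for rows above; all are illustrative readings of the table, not canonical implementations. First, a short rollout of the 📦Inventory row under a literal reading of $s_{t+1}=(s_t+a_t-D_t)^+$ and the deterministic order-up-to policy from its policy cell; the Poisson demand, the unit revenue/costs, and $C=M=20$ are my assumptions, not given in the note.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 20                                     # capacity (assumed value)
rev, cost_buy, cost_hold = 5.0, 2.0, 0.1   # assumed unit revenue / costs
C = M                                      # order-up-to level used by the policy (assumed)

def policy(s):
    """Deterministic policy from the row: order C - s when inventory is low, else 0."""
    return C - s if s < M / 4 else 0

def step(s, a):
    """One transition: s_{t+1} = (s_t + a_t - D_t)^+ with stochastic demand D_t."""
    d = rng.poisson(4)                     # assumed demand distribution
    s_next = max(s + a - d, 0)
    sold = s + a - s_next                  # units actually sold this period
    r = rev * sold - cost_buy * a - cost_hold * (s + a)
    return s_next, r

s = 0
for t in range(12):                        # 12 monthly steps, matching the horizon cell
    a = min(policy(s), M - s)              # order quantity is capped at M - s_t
    s, r = step(s, a)
    print(t, a, s, round(r, 2))
```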
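
Next, a minimal value-iteration sketch of the update $V_n(s)=\max_a[r(s,a)+\gamma\,\mathbb{E}[V_{n-1}(s')]]$ from the 🔁 value iter and 🅿️easy parking rows, run on my reading of the easy-parking corridor: states 1..7, actions ±1, deterministic moves clipped at the ends, reward −1 outside the absorbing goal state 4, and $\gamma=1$. It should reproduce $V^*(s)=-\lvert s-4\rvert$.

```python
# Assumed corridor MDP for 🅿️easy parking (p3.4): states 1..7, actions -1/+1,
# deterministic moves clipped at the ends, reward -1 outside the goal state 4,
# goal absorbing with reward 0, gamma = 1.
states = range(1, 8)
actions = (-1, +1)
gamma, goal = 1.0, 4

def step(s, a):
    """Deterministic transition; the goal is treated as absorbing."""
    return goal if s == goal else min(max(s + a, 1), 7)

def reward(s):
    return 0 if s == goal else -1

# Value iteration: V_n(s) = max_a [ r(s) + gamma * V_{n-1}(s') ]
V = {s: 0.0 for s in states}
for _ in range(20):
    V = {s: max(reward(s) + gamma * V[step(s, a)] for a in actions) for s in states}

print(V)  # expect V*(s) = -|s - 4|, e.g. V[1] == -3.0 and V[7] == -3.0
```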
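
Finally, the tabular Q-learning update from the 🅿️easy parking row, replayed on the listed samples with $\alpha=\tfrac{1}{2}$, $\gamma=1$, and $Q$ initialized to 0; taking the next state as $s+a$ is my assumption.

```python
# Tabular Q-learning update Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)),
# replayed on the (s, a, r) samples listed in the row (next state assumed to be s + a).
alpha, gamma = 0.5, 1.0
Q = {(s, a): 0.0 for s in range(1, 8) for a in (-1, +1)}

transitions = [(3, -1, -1, 2), (2, +1, -1, 3), (3, +1, -1, 4)]  # (s, a, r, s')
for s, a, r, s_next in transitions:
    target = r + gamma * max(Q[(s_next, -1)], Q[(s_next, +1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q[(3, -1)], Q[(2, +1)], Q[(3, +1)])  # -0.5 -0.5 -0.5, matching the hand computation
```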