Reinforcement Learning 2

Megerősítéses Tanulás = Reinforcement Learning (RL)
Szepesvári Csaba
Machine Learning and Human-Computer Interfaces Group, MTA SZTAKI
[email protected]  www.sztaki.hu/~szcsaba

Machine Learning and Human-Computer Interfaces Group, MTA SZTAKI, 2004
Learning: Kocsis Levente, PhD
Reinforcement learning: Szepesvári Csaba, PhD
Classification: Szamonek Zoltán, PhD student
Feature extraction: your name?
Application areas: control, games, speech

Natural language (NKFP project: NYELVBÁNYÁSZ)
Financial mathematics (portfolio optimization)

Today: Reinforcement Learning
Contents: motivation; algorithms, methods, tools; applications

AI - the big picture
Intelligence: learning
Programmer laziness + coping with the complexity of the tasks: learning that is as autonomous as possible

Where do we stand? (MLHCI Group)
Poker. Goals: master-level play; game aspects; opponent modelling
Car-racing simulator. Goals: master-level reproduction of human performance; driving in traffic

What is reinforcement learning (RL)?
A high degree of autonomy in learning
Information: punishment/reward signals; observations of the environment (states)

Goal: maximize a function of the rewards!
[Figure: a trajectory s1, a1, r1, s2, a2, r2, ..., s9, a9, r9 with example rewards +3, +50, -1, -1]

The k-armed bandit problem
Actions; average payoff (reward)
[Figure: an agent choosing among arms with sample payoff sequences such as 0, 0, 5, 10, 35 and 5, 10, -15, -15, -10, and average payoffs 10, -5, 100, 0]
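
The slides do not spell out a bandit algorithm, so here is a minimal sketch of a sample-average, ε-greedy agent for the k-armed bandit; the arm means, the noise model, and the value of ε are illustrative assumptions, not taken from the slides.

    import random

    # Hypothetical arms: each pull returns a noisy payoff around a fixed mean.
    ARM_MEANS = [10, -5, 100, 0]           # illustrative values only

    def pull(arm):
        return random.gauss(ARM_MEANS[arm], 5.0)

    def epsilon_greedy_bandit(n_steps=1000, epsilon=0.1):
        k = len(ARM_MEANS)
        q = [0.0] * k                      # sample-average estimate of each arm's payoff
        n = [0] * k                        # pull counts
        for _ in range(n_steps):
            if random.random() < epsilon:  # explore: random arm
                arm = random.randrange(k)
            else:                          # exploit: currently best-looking arm
                arm = max(range(k), key=lambda a: q[a])
            r = pull(arm)
            n[arm] += 1
            q[arm] += (r - q[arm]) / n[arm]   # incremental sample average
        return q

    print(epsilon_greedy_bandit())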

Markov Decision Processes (MDPs)
States, with stochastic transitions
The transition probabilities depend on the current state
[Figure: two states (1, 2), actions a1, a2, rewards r=0 and r=2]
Transition matrix P, and reward function
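
A minimal sketch of how the two-state MDP of the figure could be encoded as a transition matrix P and a reward function; the transition probabilities below are hypothetical, since only the rewards r=0 and r=2 are recoverable from the slide.

    import numpy as np

    # States 1 and 2 are indexed 0 and 1; actions a1 and a2 are indexed 0 and 1.
    # P[a, s, s'] = probability of moving from s to s' under action a (made-up numbers).
    P = np.array([
        [[0.9, 0.1],   # action a1
         [0.5, 0.5]],
        [[0.2, 0.8],   # action a2
         [0.0, 1.0]],
    ])
    # r[a, s] = immediate reward for taking action a in state s
    # (the figure's r=0 and r=2; their assignment to (a, s) pairs is assumed).
    r = np.array([
        [0.0, 2.0],    # action a1
        [0.0, 2.0],    # action a2
    ])
    assert np.allclose(P.sum(axis=2), 1.0)   # each (a, s) row is a probability distribution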

Long-term reward
The agent's policy π is fixed: the return R_t is the total reward received after time t
[Figure: trajectory with rewards r1, ..., r9, e.g. +3, +50, -1, -1]
Value = utility = expected return
R_t is a random variable
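
The slide defines the return only in words; a standard discounted formulation, assuming a discount factor 0 ≤ γ < 1 (the same γ that appears in the update rules below), is:

    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots
        = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

    V^{\pi}(s) = \mathbb{E}_{\pi}[ R_t \mid s_t = s ]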

We can take its expected value! R_t depends on the policy π!
Task: find the policy that maximizes the expected value in every state

The story so far...
The ingredients of an RL problem: multi-step decision tasks; goal: find π*; criterion: short-term vs. long-term
[Figure: trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}]

The Bellman equations
Because of the Markov property, the expected total reward can also be expressed by a recursive equation:

    V^{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s)) [ r(s, \pi(s), s') + \gamma V^{\pi}(s') ]

where s' ranges over the successor states. In other words: V = TV, or BV = 0.

Bellman equations - the optimal value function
Optimal value function
Greedy policy: always choose the best action according to Q*: argmax_a Q*(s,a). This is optimal!
Policy improvement algorithm: (evaluate, improve)*
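
The formulas for the optimal value function did not survive extraction; the standard Bellman optimality equations, written in the notation used above, are:

    V^*(s)    = \max_{a} \sum_{s'} P(s' \mid s, a) [ r(s, a, s') + \gamma V^*(s') ]
    Q^*(s, a) = \sum_{s'} P(s' \mid s, a) [ r(s, a, s') + \gamma \max_{b \in A} Q^*(s', b) ]
    \pi^*(s)  = \arg\max_{a} Q^*(s, a)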

Bootstrapping methods
Assuming that P and R are known: Dynamic Programming

    V(s_t) := \mathbb{E}_{\pi}[ r_{t+1} + \gamma V(s_{t+1}) ]

When P and R are not known: sampling; Temporal Difference learning
[Figure: transition s_t, a_t = π(s_t), r_{t+1}, s_{t+1}]

TD(0) learning: policy evaluation
    t := 0
    π is the policy to be evaluated
    Initialise V_t(s) arbitrarily for all s ∈ S
    Repeat
        select an action a_t from π(s_t)
        observe the transition (s_t, a_t, r_{t+1}, s_{t+1})
        update V(s_t) according to
            V_{t+1}(s_t) = V_t(s_t) + \alpha [ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) ]
        t := t + 1
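
A minimal tabular sketch of the TD(0) loop above; the environment interface (env.reset / env.step) and the values of the step size α and the discount γ are assumptions, not part of the slides.

    from collections import defaultdict

    def td0_evaluate(env, policy, episodes=100, alpha=0.1, gamma=0.95):
        """Tabular TD(0) policy evaluation, following the pseudocode on the slide."""
        V = defaultdict(float)                 # V(s) initialised arbitrarily (here: 0)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)                  # select a_t from pi(s_t)
                s_next, r, done = env.step(a)  # observe the transition
                # V(s_t) <- V(s_t) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t))
                V[s] += alpha * (r + gamma * V[s_next] - V[s])
                s = s_next
        return V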

On- and off-policy learning
On-policy: we evaluate the policy we are currently following, e.g. with TD learning:

    V_{t+1}(s_t) = V_t(s_t) + \alpha [ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) ]

Off-policy: we follow a different policy from the one whose value we are computing.
E.g. Q-learning:

    Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha [ r_{t+1} + \gamma \max_{b \in A} Q_t(s_{t+1}, b) - Q_t(s_t, a_t) ]

Off-policy learning: the advantages of Q-learning
It estimates the value of the optimal policy

while arbitrary (exploratory) actions can be executed.
ε-greedy exploration: greedy action with probability ε, random action with probability 1-ε
Guaranteed convergence, provided the MDP is explored sufficiently
Can π* be found with an on-policy algorithm?
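
A hedged sketch of tabular Q-learning with ε-greedy exploration, matching the off-policy update above; the environment interface and the hyperparameters are illustrative assumptions.

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
        Q = defaultdict(float)                                    # Q(s, a), initialised to 0
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < epsilon:                     # exploratory (random) action
                    a = random.choice(actions)
                else:                                             # greedy action w.r.t. Q
                    a = max(actions, key=lambda b: Q[(s, b)])
                s_next, r, done = env.step(a)
                best_next = max(Q[(s_next, b)] for b in actions)  # max_b Q(s', b)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q

Exploration happens through the behaviour policy, while the update always bootstraps from the greedy value; this is exactly what makes the method off-policy.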

On-policy learning: Sarsa
Drop the max operator! Evaluate the policy that is being followed:

    Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha [ r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) ]

[Figure: transition s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}]
Change the policy gradually, slowly
It converges! (Jaakkola, Singh, Littman, Szepesvári)

On-policy learning: Sarsa
    t := 0
    Initialise Q_t(s, a) arbitrarily for all s ∈ S, a ∈ A
    select an action a_t from explore(Q_t(s_t))
    Repeat
        observe the transition (s_t, a_t, r_{t+1}, s_{t+1})
        select an action a_{t+1} from explore(Q_t(s_{t+1}))
        update Q(s_t, a_t) according to
            Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha [ r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) ]
        t := t + 1
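
For comparison with Q-learning, a minimal Sarsa loop following the pseudocode above; explore() is taken to be ε-greedy, which is an assumption.

    import random
    from collections import defaultdict

    def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
        Q = defaultdict(float)

        def explore(s):                        # epsilon-greedy action selection
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda b: Q[(s, b)])

        for _ in range(episodes):
            s = env.reset()
            a = explore(s)
            done = False
            while not done:
                s_next, r, done = env.step(a)
                a_next = explore(s_next)
                # the update uses the action actually selected next, not the max
                Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
                s, a = s_next, a_next
        return Q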

Summary: TD, Q-learning, Sarsa

TD learning:
    V_{t+1}(s_t) = V_t(s_t) + \alpha [ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) ]

One-step Q-learning:
    Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha [ r_{t+1} + \gamma \max_{b \in A} Q_t(s_{t+1}, b) - Q_t(s_t, a_t) ]

Sarsa learning:
    Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha [ r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) ]

(Each update uses the observed transition s_t, a_t, r_{t+1}, s_{t+1} and, for Sarsa, the next action a_{t+1}.)

Second gear: eligibility traces, TD(λ)
With the TD error, TD learning modifies the value of only a single state:

    V_{t+1}(s_t) = V_t(s_t) + \alpha [ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) ]

[Figure: trajectory ..., s_{t-2}, a_{t-2}, r_{t-1}, s_{t-1}, a_{t-1}, r_t, s_t, a_t, r_{t+1}, s_{t+1}]
For every state we define an eligibility trace:

    e_{t+1}(s) = \gamma \lambda e_t(s) + 1   if s = s_t
    e_{t+1}(s) = \gamma \lambda e_t(s)       otherwise,   where 0 ≤ λ ≤ 1

Modify the value V_t(s) of every state in proportion to its eligibility:

    V_{t+1}(s) = V_t(s) + \alpha [ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) ] e_t(s)
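
A tabular sketch of the TD(λ) update described above, with accumulating traces (whether the slide used accumulating or replacing traces is not recoverable; accumulating traces are assumed here).

    from collections import defaultdict

    def td_lambda_evaluate(env, policy, episodes=100, alpha=0.1, gamma=0.95, lam=0.8):
        V = defaultdict(float)
        for _ in range(episodes):
            e = defaultdict(float)                     # eligibility trace e(s), reset per episode
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                delta = r + gamma * V[s_next] - V[s]   # TD error
                e[s] += 1.0                            # accumulate the trace of the visited state
                for state in list(e):
                    V[state] += alpha * delta * e[state]   # update every state in proportion to e
                    e[state] *= gamma * lam                # decay the trace
                s = s_next
        return V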

Eligibility traces in Q-learning: Q(λ)
It can be done in several ways. E.g., for every (s, a) pair:

    Q_{t+1}(s, a) = Q_t(s, a) + \alpha [ r_{t+1} + \gamma \max_{b \in A} Q_t(s_{t+1}, b) - Q_t(s_t, a_t) ] e_t(s, a)

[Figure: trajectory ..., s_{t-1}, a_{t-1}, r_t, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, a_greedy]
Information is propagated back even through non-greedy actions
The convergence guarantee is lost!
Watkins' proposed fix: after a non-greedy action, set e := 0
Problem: this reduces the efficiency. Bias-variance dilemma.

Sarsa(λ)
Another solution: use the Sarsa algorithm! For every (s, a) pair:

    Q_{t+1}(s, a) = Q_t(s, a) + \alpha [ r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) ] e_t(s, a)

[Figure: trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}]
The convergence property is preserved(?)

Approximate RL
Why? Time and memory constraints! (Bellman: the curse of dimensionality)
Generalization to new situations (insufficient sampling)
Solutions: value-function approximation; search in policy space; approximate models + planning

Linear approximation
Simple and useful! There are convergence results.
Now: linear TD(λ)
Weight vector at time t: θ_t = (θ_t(1), θ_t(2), ..., θ_t(n))
Feature vector of state s: φ_s = (φ_s(1), φ_s(2), ..., φ_s(n))
Estimate: V_t(s) = θ_t^T φ_s
Goal: minimize

    MSE(\theta_t) = \sum_{s \in S} P(s) [ V^{\pi}(s) - V_t(s) ]^2
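
A minimal sketch of the linear estimate V_t(s) = θ_t^T φ_s together with a (semi-)gradient TD(0) update; the feature map phi and the environment interface are assumptions, not part of the slides.

    import numpy as np

    def linear_td0(env, policy, phi, n_features, episodes=100, alpha=0.01, gamma=0.95):
        theta = np.zeros(n_features)                    # weight vector theta_t
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                v_s = theta @ phi(s)                    # V_t(s) = theta^T phi(s)
                v_next = 0.0 if done else theta @ phi(s_next)
                delta = r + gamma * v_next - v_s        # TD error
                theta += alpha * delta * phi(s)         # gradient of V_t(s) w.r.t. theta is phi(s)
                s = s_next
        return theta

With the binary CMAC features of the next slide, phi(s) would simply be a 0/1 vector indicating the active tiles.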

Value-function approximation: approximators
Popular choices: e.g. CMAC, RBF
CMAC: n tiles
Features φ_s = (φ_s(1), ..., φ_s(n)), with φ_s(i) = 1 or 0
Properties: coarse coding; regular tilings; good efficiency; random hashing reduces the memory requirements

Linear approximations
Gradient method for θ_t. The new form of the TD update:

    \theta_{t+1} = \theta_t + \alpha [ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) ] e_t

Now the eligibility trace is an n-dimensional vector, updated as:

    e_t = \gamma \lambda e_{t-1} + \varphi_{s_t}

Converges to θ*.

More self-promotion
William D. Smart, Cs. Szepesvári, ICML 2004: a form of Q-learning converges when used together with a suitable function approximator. It is not a gradient method. The convergence of the corresponding gradient method is not known. Conjecture: .... Convergent?

A particularly successful example: TD-Gammon
TD(λ) learning, a neural network with 1 hidden layer, backprop
1,500,000 games (played against itself)
Abilities on a par with the best (world-champion) players
The state space of backgammon: ~10^20, so DP is hopeless!!

Model-based RL: structured models
Dynamic Bayesian network to represent the state transitions P (in other words: a factored MDP)
V: a tree
Backup: goal regression
Similar to planning problems

RL with hidden state: POMDPs, k-Markov models
[Figure: observations o_t, o_{t+1}, o_{t+2}; hidden states s_t, s_{t+1}, s_{t+2}; actions a_t, a_{t+1}, a_{t+2}; rewards r_{t+1}, r_{t+2}]

In POMDPs planning is not feasible either (intractable)
Factored POMDPs: promising
Policy search is advantageous

Policy search (direct method)
Methods: gradient; evolutionary (other local/global search)

Applications

Robot navigation task (Sridhar Mahadevan, UMass)

Pavlov: Nomad 200 robot, Nomad 200 simulator

Hierarchical models for spatial modelling (Sridhar Mahadevan, UMass)
[Figure: the entire environment decomposed hierarchically; corridor and production areas; 575 and 1385 states]

Hierarchical models
Vertical transitions, entry states, exit states, horizontal transitions, abstract states, product states,

which generate observations

Internet traffic control (Yong Liu, Singapore)
Multi-protocol label switching
Goal: choose among the many possible routes so as to minimize the probability of blocking
[Figure: ingress routers and egress routers]

Jeremy Wyatt, Yoshiyuki Matsumura, Matthew Todd

University of Birmingham, School of Computer Science
Robot soccer: simulation league

    Situation (s)              Action (a)   Utility Q(s,a)
    Ball kickable, goal near   shoot        0.6
    Ball kickable, goal far    shoot        0.33
    Ball kickable, goal far    pass         0.4
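
The table can be read as a tiny tabular Q-function; the following sketch only illustrates greedy action selection over it, with the values copied from the slide.

    # Q(s, a) values from the slide's table
    Q = {
        ("ball kickable, goal near", "shoot"): 0.6,
        ("ball kickable, goal far", "shoot"): 0.33,
        ("ball kickable, goal far", "pass"): 0.4,
    }

    def greedy_action(situation):
        """Pick the highest-utility action recorded for the given situation."""
        candidates = {a: q for (s, a), q in Q.items() if s == situation}
        return max(candidates, key=candidates.get)

    print(greedy_action("ball kickable, goal far"))   # -> "pass" (0.4 > 0.33)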

The k-legged robot

Simultaneous (concurrent) actions. Example: driving.
    Head & eyes:  look in the mirror, look at the road, check the speed
    Right arm:    steer the wheel, put on high gear, steer the wheel
    Legs:         press brakes, accelerate, press brakes
(decision epochs run along the time axis)

M. L. Puterman, 2002: Applications (A to N)
Airline Meal Planning, Behavioural Ecology, Capacity Expansion,

Decision Analysis, Equipment Replacement, Fisheries Management, Gambling Systems, Highway Pavement Repair, Inventory Control, Job Seeking Strategies, Knapsack Problems, Learning, Medical Treatment, Network Control

M. L. Puterman, 2002: Applications (O to Z)
Option Pricing, Project Selection, Queueing System Control, Robotic Motion, Scheduling, Tetris,

User Modeling, Vision (Computer), Water Resources, X-Ray Dosage, Yield Management, Zebra Hunting

Some further RL applications
Elevator control (Barto & Crites)
Scheduling problems, space-shuttle payload packing (Zhang & Dietterich)
Dynamic channel allocation in mobile networks (Singh & Bertsekas)
Balancing: learning to walk, to ride a bicycle, to balance a broomstick,

and juggling
Predator-prey (PacMan)
Portfolio optimization

Active areas
Optimal exploration strategies
Structured models
Relational models
Continuous state and action spaces

Hierarchical RL: abstractions of states and actions (options, macros, ...)
Hidden state (e.g. POMDPs)
Predictive state representations
Policy search
Significance tests

Reinforcement Learning: key papers

Overviews
R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
J. Wyatt. Reinforcement Learning: A Brief Overview. In: Perspectives on Adaptivity and Learning. Springer Verlag, 2003.
L. Kaelbling, M. Littman and A. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Value Function Approximation
D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1998.

Eligibility Traces
S. Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.

Reinforcement Learning: key papers

Structured Models and Planning
C. Boutilier, T. Dean and S. Hanks. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage. Journal of Artificial Intelligence Research, 11:1-94, 1999.
R. Dearden, C. Boutilier and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49-107, 2000.
B. Sallans. Reinforcement Learning for Factored Markov Decision Processes. Ph.D. Thesis, Dept. of Computer Science, University of Toronto, 2001.
K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Thesis, University of California, Berkeley, 2002.

Policy Search
R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

R. Sutton, D. McAllester, S. Singh and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS 12, 2000.

Hierarchical Reinforcement Learning
R. Sutton, D. Precup and S. Singh. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 112:181-211, 1999.
R. Parr. Hierarchical Control and Learning for Markov Decision Processes. Ph.D. Thesis, University of California, Berkeley, 1998.
A. Barto and S. Mahadevan. Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems, 13:41-77, 2003.

Reinforcement Learning: key papers

Exploration
N. Meuleau and P. Bourgine. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35:117-154, 1999.
J. Wyatt. Exploration control in reinforcement learning using optimistic model selection. In: Proceedings of the 18th International Conference on Machine Learning, 2001.

POMDPs
L. Kaelbling, M. Littman and A. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101:99-134, 1998.
