# Lecture 8: Learning to Search / Imitation Learning in NLP

Kai-Wei Chang, CS @ UCLA ([email protected])
Course webpage: https://uclanlp.github.io/CS269-17/

## Learning to Search Approaches

Running example: a shift-reduce parser.
- Maintain a buffer and a stack.
- Make predictions from left to right.
- Three (four) types of actions: Shift, Reduce, Left, Right.
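The buffer/stack mechanics can be sketched as follows. This is an illustrative arc-standard-style simulator, not the lecture's exact transition system; the sentence and action sequence are assumptions for the demo.

```python
# Minimal sketch of a shift-reduce transition system (illustrative).
# The stack holds partially built subtrees; the buffer holds unread words.
# LEFT/RIGHT add a dependency arc between the top two stack items and
# reduce the child.

def shift_reduce(words, actions):
    """Apply a sequence of actions; return the dependency arcs (head, child)."""
    stack, buffer, arcs = [], list(words), []
    for act in actions:
        if act == "SHIFT":          # move next word from buffer to stack
            stack.append(buffer.pop(0))
        elif act == "LEFT":         # top of stack is head of second item
            child = stack.pop(-2)
            arcs.append((stack[-1], child))
        elif act == "RIGHT":        # second item is head of top of stack
            child = stack.pop()
            arcs.append((stack[-1], child))
    return arcs

# "I saw her": saw <- I (LEFT), saw -> her (RIGHT)
arcs = shift_reduce(["I", "saw", "her"],
                    ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT"])
print(arcs)  # [('saw', 'I'), ('saw', 'her')]
```

Each action is a local prediction made left to right, which is exactly the sequential structure that learning-to-search methods exploit.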

Credit: Google research blog

## Structured Prediction as a Search Problem

A decomposition of y often implies an ordering, i.e., a sequential decision-making process. Running example: tag each word of "I can can a can."

## Notations

- Input: x, e.g., "I can can a can"
- Truth: y, e.g., Pro Md Vb Dt Nn
- Predicted: ŷ
- Loss: loss(y, ŷ)
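The loss is left abstract here; a sketch using Hamming loss, an assumed but common choice for sequence labeling:

```python
# Hamming loss for sequence labeling: the number of positions where the
# prediction disagrees with the truth (one common choice of joint loss).

def hamming_loss(y, y_hat):
    return sum(t != p for t, p in zip(y, y_hat))

x     = "I can can a can".split()
y     = ["Pro", "Md", "Vb", "Dt", "Nn"]   # truth
y_hat = ["Pro", "Md", "Nn", "Dt", "Nn"]   # predicted: one mistake on the 3rd word

print(hamming_loss(y, y_hat))  # 1
```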

[Figure: lattice of candidate tags for each word of "I can can a can"]

Goal: make a joint prediction that minimizes a joint loss: find h such that ŷ = h(x) minimizes loss(y, ŷ), based on training samples.

## Credit Assignment Problem

When the model makes a mistake, which local decision should be blamed?

## An Analogy from Playing Mario

From the Mario AI competition, 2009.

- Output: Jump in {0,1}, Right in {0,1}, Left in {0,1}, Speed in {0,1}
- Input: 27K+ binary features extracted from the last 4 observations (14 binary features for every cell)
- High-level goal: watch an expert play and learn to mimic her behavior

Video credit: Stéphane Ross, Geoff Gordon, and Drew Bagnell

## Example of Search Space

Input: "I can can a can". Each state makes a decision for the next word by taking an action from the tag set {Pro, Md, Vb, Dt, Nn}.

[Figure: search lattice; each sequence of decisions/actions over the tag set {Pro, Md, Vb, Dt, Nn} is a trajectory]

An end state e encodes an output ŷ = y(e), from which loss(y, ŷ) can be computed (at training time).

## Policies

A policy maps an observation to an action:

π(obs) = a

The observation can include the input x, the timestep t, the partial trajectory, and anything else.

## Labeled Data → Reference Policy

Given a partial trajectory a_1, a_2, ..., a_{t-1} and the true label y, the minimum achievable loss is

min_{a_t, ..., a_T} loss(y, ŷ(a_1, ..., a_T)),

where ŷ(a) is the output encoded by the end state reached by the action sequence.
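For sequence labeling under Hamming loss (an assumption; the lecture leaves the loss abstract), this minimum has a simple closed form: past mistakes cannot be undone, so the minimum achievable loss equals the mistakes already made, and the optimal next action is the gold tag. A sketch:

```python
# Reference policy for sequence labeling under Hamming loss (a sketch).
# Mistakes in the partial trajectory are permanent, so the minimum
# achievable loss is the number of mistakes so far, and the optimal
# next action is simply the gold tag at the current timestep.

def min_achievable_loss(partial, y):
    """Mistakes so far = best loss reachable by completing optimally."""
    return sum(a != t for a, t in zip(partial, y))

def reference_policy(partial, y):
    """Optimal action: the gold tag at the next position."""
    return y[len(partial)]

y = ["Pro", "Md", "Vb", "Dt", "Nn"]
partial = ["Pro", "Nn"]                 # one mistake at position 2
print(min_achievable_loss(partial, y))  # 1
print(reference_policy(partial, y))     # Vb
```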

- The optimal action is the a_t achieving this minimum.
- The optimal policy is the policy that always selects the optimal action.
- In the training phase, a reference policy can be constructed from the gold labels.

## Ingredients for Learning to Search

Structured learning specifies the input x, the truth y, the outputs ŷ, and the loss loss(y, ŷ). Learning to search additionally specifies:

- a search space: states, actions, and end states;
- a policy class;
- a reference policy.

## A Simple Approach

- Collect trajectories from an expert (the reference policy).
- Store them as a dataset D.
- Train a classifier π on D.
- Let π play the game!

## Learning a Policy [Chang+ 15, Ross+ 15]

Roll in to a state "?" with one policy, try every one-step deviation at "?", and roll out each deviation with another policy until an end state E, recording its loss. This yields a cost-sensitive multi-class example, e.g., (?, [0, .2, .8]) when the three deviations reach end states with losses 0, .2, and .8.
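The simple approach (collect expert trajectories, train a classifier on them) can be sketched as follows; the expert here is the sequence-labeling reference policy, and the "classifier" is a trivial majority-vote table standing in for a real learned model, both assumptions for illustration.

```python
# Sketch of the simple approach (behavior cloning): roll out the expert,
# record (observation, expert action) pairs, and fit a classifier.

from collections import Counter, defaultdict

def collect_trajectories(data):
    """data: list of (words, gold_tags). Returns (observation, action) pairs."""
    D = []
    for words, tags in data:
        for t, word in enumerate(words):
            D.append((word, tags[t]))   # expert action = gold tag
    return D

def train_classifier(D):
    """Majority tag per word; stands in for a real learned policy."""
    votes = defaultdict(Counter)
    for word, tag in D:
        votes[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in votes.items()}

data = [("the monster ate the sandwich".split(),
         ["Dt", "Nn", "Vb", "Dt", "Nn"])]
policy = train_classifier(collect_trajectories(data))
print([policy[w] for w in "the sandwich".split()])  # ['Dt', 'Nn']
```

The weakness of this approach is the credit assignment problem above: the classifier only ever sees states the expert visits, so its own mistakes take it into states it was never trained on.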

## Example: Sequence Labeling

Receive input:
x = the monster ate the sandwich
y = Dt  Nn  Vb  Dt  Nn

Make a sequence of predictions (roll-in):
x = the monster ate the sandwich
ŷ = Dt  Dt  Dt  Dt  Dt

Pick a timestep and try all perturbations there (here, the tag of "monster"):
x     = the monster ate the sandwich
ŷ(Dt) = Dt  Dt  Vb  Dt  Nn
ŷ(Nn) = Dt  Nn  Vb  Dt  Nn
ŷ(Vb) = Dt  Vb  Vb  Dt  Nn

Compute the losses (l=1, l=0, l=1) and construct a cost-sensitive example:
( { w=monster, p=Dt, ... }, [1, 0, 1] )
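The roll-in / one-step-deviation / roll-out recipe can be sketched end to end. This sketch rolls in with the current policy, rolls out with the reference policy (the gold suffix), and uses Hamming loss; these choices are assumptions that reproduce the worked numbers above.

```python
# Sketch of constructing one cost-sensitive example: roll in with the
# learned policy, deviate at timestep t, roll out with the reference
# policy, and score each completed trajectory with Hamming loss.

def make_example(x, y, roll_in_policy, t, actions):
    """Try every action at timestep t; return (features, cost vector)."""
    prefix = [roll_in_policy(x, i) for i in range(t)]       # roll-in
    costs = []
    for a in actions:                                       # one-step deviations
        full = prefix + [a] + list(y[t + 1:])               # roll-out = gold suffix
        costs.append(sum(p != g for p, g in zip(full, y)))  # Hamming loss
    return {"w": x[t], "p": prefix[-1] if prefix else None}, costs

x = "the monster ate the sandwich".split()
y = ["Dt", "Nn", "Vb", "Dt", "Nn"]
bad_policy = lambda x, i: "Dt"          # current policy predicts Dt everywhere
feats, costs = make_example(x, y, bad_policy, t=1, actions=["Dt", "Nn", "Vb"])
print(feats, costs)  # {'w': 'monster', 'p': 'Dt'} [1, 0, 1]
```

The resulting example ({w=monster, p=Dt, ...}, [1, 0, 1]) is then fed to a cost-sensitive multi-class learner.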

## Analysis [ICML 15]: Learning to Search Better Than Your Teacher

## Analysis

- Roll-in with Ref: unbounded structured regret.
- Roll-out with Ref: no local optimum if the reference policy is sub-optimal.
- Roll-in and roll-out with the current policy: ignores Ref entirely.
- The proposed approach minimizes a combination of regret to Ref and regret to its own one-step deviations: it competes with Ref when Ref is good, and competes with local deviations to improve on a sub-optimal Ref.

## How to Program?

Sample code is available in VW (Vowpal Wabbit).
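One way to realize the last configuration, following the LOLS-style recipe of [Chang+ 15], is to roll in with the current policy and roll out with a stochastic mixture of the reference and current policies. A sketch of the selection logic only (beta and the toy policies are assumptions):

```python
import random

# Sketch of a mixture roll-out policy: follow the reference policy with
# probability beta, otherwise follow the current (learned) policy.
# beta = 1 recovers roll-out with Ref; beta = 0 recovers roll-out with
# the current policy.

def rollout_policy(reference, current, beta, rng=random):
    def policy(state):
        chosen = reference if rng.random() < beta else current
        return chosen(state)
    return policy

ref = lambda state: "ref-action"
cur = lambda state: "cur-action"
always_ref = rollout_policy(ref, cur, beta=1.0)
always_cur = rollout_policy(ref, cur, beta=0.0)
print(always_ref(None), always_cur(None))  # ref-action cur-action
```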
