Iterative Hard Thresholding for Sparse/Low-rank Linear Regression
Prateek Jain (Microsoft Research, India), Ambuj Tewari (University of Michigan), Purushottam Kar (MSR India), Praneeth Netrapalli (MSR New England)

Microsoft Research India

Our vectors of impact: foundations, systems, applications, interplay of society and technology, academic and government outreach, research impact, company impact.

Machine Learning and Optimization @ MSR India: Manik Varma, Prateek Jain, Ravi Kannan, Amit Deshpande, Navin Goyal, Sundararajan S., Vinod Nair, Sreangsu Acharyya, Kush Bhatia, Aditya Nori, Raghavendra Udupa, Purushottam Kar.

Research areas: high-dimensional learning and optimization, extreme classification, online learning / multi-armed bandits, learning with structured losses, distributed machine learning, probabilistic programming, privacy-preserving machine learning, SVMs and kernel learning.

We are hiring! Interns, postdocs, applied researchers, and full-time researchers.

Overview

Need to solve: $\min_{w \in \mathcal{C}} f(w)$, where $\mathcal{C}$ can be:
- the set of sparse vectors,
- the set of group-sparse vectors,
- the set of low-rank matrices.

These constraint sets are non-convex, and the computational complexity is NP-hard in general.

Most popular approach: convex relaxation. Solvable in poly-time, with guarantees under certain assumptions, but slow in practice.

Goal: practical and theoretically provable algorithms for high-dimensional ML problems.

Results

- Sparse regression [Garg & Khandekar, ICML 2008; J., Kar, Tewari, NIPS 2014]: sparsity constraint
- Matrix completion/regression [J., Netrapalli, Sanghavi, STOC 2013; Hardt & Wootters, COLT 2014]: low-rank constraint
- Robust regression [Loh & Wainwright, NIPS 2013; Bhatia, J., Kar, submitted 2015]
- Tensor factorization and completion [Anandkumar et al., arXiv 2012; J., Oh, NIPS 2014]: low tensor-rank constraint
- Dictionary learning [Agarwal et al., COLT 2014; Arora et al., COLT 2014]: non-convex bilinear form + sparsity constraint
- Phase sensing [Netrapalli, J., Sanghavi, NIPS 2013; Candès et al., arXiv 2014]: system of quadratic equations
- Low-rank matrix approximation [Bhojanapalli, J., Sanghavi, SODA 2015]

Outline

- Sparse linear regression: Lasso, iterative hard thresholding, our results
- Low-rank matrix regression
- Low-rank matrix completion
- Conclusions

Sparse Linear Regression

$y = Xw^*$, with $X \in \mathbb{R}^{n \times d}$ and $n \ll d$. The system is underdetermined, but $w^*$ is sparse: at most $s$ non-zeros.

Sparse Linear Regression

$\min_w \|y - Xw\|_2^2$ subject to $\|w\|_0 \le s$, where $\|w\|_0$ is the number of non-zeros of $w$.

This is an NP-hard problem in general: $\|w\|_0$ is a non-convex function.

Convex Relaxation

Relaxed problem: $\min_w \|y - Xw\|_2^2 + \lambda \|w\|_1$ (the Lasso). The $\ell_1$ norm is non-differentiable but known to promote sparsity.

Pros: (a) principled approach, (b) solid theoretical guarantees. Cons: slow to optimize.

Our Approach: Projected Gradient Descent

[Jain, Tewari, Kar 2014]

How do we project onto the $\ell_0$ ball $\{w : \|w\|_0 \le s\}$? Hard Thresholding: keep the $s$ largest-magnitude entries and zero out the rest.

Example with $s = 3$: given $[0.1,\ 0.01,\ 1,\ 1,\ 0.9,\ 0.1]$, sort by magnitude to get $[1,\ 1,\ 0.9,\ 0.1,\ 0.1,\ 0.01]$; the projection $P_3(w)$ retains the entries $1,\ 1,\ 0.9$ and zeroes out the other three.
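This projection is cheap to compute: selecting the $s$ largest-magnitude entries needs only a partial sort, $O(d)$ on average. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def hard_threshold(w, s):
    """Keep the s largest-magnitude entries of w (in place), zero the rest."""
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(w)
    if s <= 0:
        return out
    top = np.argpartition(np.abs(w), -s)[-s:]  # indices of the s largest |w_i|
    out[top] = w[top]
    return out

# the slide's example, with s = 3
w = np.array([0.1, 0.01, 1.0, 1.0, 0.9, 0.1])
print(hard_threshold(w, 3))  # entries 1, 1, 0.9 survive at their positions
```

Note that `argpartition` avoids the full sort shown on the slide; only the top-$s$ set matters, not its internal order.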

Convex vs. Non-convex Projections

Projection onto a convex set yields a first-order optimality condition; projection onto a non-convex set yields only a zeroth-order optimality condition.

Is the zeroth-order condition sufficient for convergence of projected gradient descent? In general, NO. But for certain specially structured problems, YES!

Restricted Isometry Property (RIP): $X$ satisfies RIP if it acts as a near-isometry on all sparse vectors. Formally, for all $s$-sparse $w$:
$(1 - \delta_s)\,\|w\|_2^2 \;\le\; \|Xw\|_2^2 \;\le\; (1 + \delta_s)\,\|w\|_2^2.$

Popular RIP Ensembles

The most popular examples are matrices with i.i.d. (appropriately scaled) Gaussian or sub-Gaussian entries, which satisfy RIP with high probability once $n = O(\delta_s^{-2}\, s \log d)$.

Proof under RIP
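A quick numerical illustration of the RIP claim, not a proof: with a Gaussian design and $n$ a small multiple of $s \log d$, the ratio $\|Xw\|_2^2 / \|w\|_2^2$ concentrates near 1 over random sparse vectors (all sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 400, 2000, 5

# Gaussian ensemble, scaled so that E[||Xw||^2] = ||w||^2
X = rng.standard_normal((n, d)) / np.sqrt(n)

ratios = []
for _ in range(200):
    w = np.zeros(d)
    supp = rng.choice(d, size=s, replace=False)
    w[supp] = rng.standard_normal(s)          # random s-sparse test vector
    ratios.append(np.linalg.norm(X @ w) ** 2 / np.linalg.norm(w) ** 2)

print(min(ratios), max(ratios))               # both close to 1
```

A true RIP check would take the worst case over all $\binom{d}{s}$ supports, which is itself intractable; random test vectors only show the typical behavior.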

Assume $y = Xw^*$ with $w^*$ $s$-sparse, and recall the update $w_{t+1} = P_s\big(w_t - \eta\, X^\top (X w_t - y)\big)$. Combining the hard-thresholding step (whose projection error is no worse than that of the feasible point $w^*$), the triangle inequality, and RIP yields a geometric decrease of $\|w_t - w^*\|_2$ whenever the isometry constant is small enough [Blumensath & Davies 2009; Garg & Khandekar 2009].

What if RIP is not possible?

The eigenvalues of the design covariance can be spread out (condition number $\kappa > 1$), so RIP does not hold even with infinitely many samples. Yet the problem is solvable with $n = O(d)$ samples using standard regression.

Iterative Hard Thresholding with Larger Sparsity: for $t = 1, 2, \ldots$, run the same update but project onto a larger sparsity level $s' \gg s$ that grows with the condition number.

Stronger projection guarantee: projecting onto $s' > s$ coordinates loses far less, relative to the best $s$-sparse point, than projecting onto exactly $s$ coordinates does.

Statistical Guarantees for $\min_w f(w) = \|y - Xw\|_2^2$: statistically, the recovered $s'$-sparse solution matches the known guarantees for the Lasso, while each iteration costs only a gradient step plus a hard threshold [J., Tewari, Kar 2014].
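Putting the pieces together, the full IHT loop $w_{t+1} = P_s\big(w_t - \eta\, X^\top(Xw_t - y)\big)$ is only a few lines. A hedged sketch on noiseless synthetic data (the step-size choice and problem sizes are illustrative, not from the talk):

```python
import numpy as np

def hard_threshold(w, s):
    out = np.zeros_like(w)
    top = np.argpartition(np.abs(w), -s)[-s:]
    out[top] = w[top]
    return out

def iht(X, y, s, iters=500):
    """IHT for min ||y - Xw||^2 s.t. ||w||_0 <= s."""
    n, d = X.shape
    eta = 1.0 / np.linalg.norm(X, 2) ** 2   # conservative fixed step size
    w = np.zeros(d)
    for _ in range(iters):
        w = hard_threshold(w - eta * X.T @ (X @ w - y), s)
    return w

# noiseless recovery: n samples, ambient dimension d, true sparsity s
rng = np.random.default_rng(1)
n, d, s = 400, 1000, 10
X = rng.standard_normal((n, d)) / np.sqrt(n)
w_star = np.zeros(d)
supp = rng.choice(d, size=s, replace=False)
w_star[supp] = rng.choice([-1.0, 1.0], size=s) * (1.0 + rng.random(s))
y = X @ w_star
w_hat = iht(X, y, s)
print(np.linalg.norm(w_hat - w_star))       # near-exact recovery
```

The step size here is deliberately conservative (inverse squared spectral norm); the theory permits larger steps based on restricted eigenvalues, which converge faster.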

General Result for Any Function

If $f$ satisfies restricted strong convexity / restricted smoothness (RSC/RSS), i.e., its curvature over sparse directions is bounded below by $\alpha$ and above by $L$, then the IHT guarantee of [J., Tewari, Kar 2014] applies: after $O\big((L/\alpha)\log(1/\epsilon)\big)$ steps, with the projected sparsity $s'$ a suitable multiple (depending on $L/\alpha$) of $s$, the iterate is $\epsilon$-accurate.

Extension to Other Non-convex Procedures

- IHT-Fully Corrective, HTP [Foucart 2012]
- CoSaMP [Needell & Tropp 2008]
- Subspace Pursuit [Dai & Milenkovic 2008]
- OMPR [J., Tewari, Dhillon 2010]
- Partial hard thresholding and the two-stage family [J., Tewari, Dhillon 2010]

Empirical Results

[Figure: runtime comparison; hard thresholding is roughly 350x and 90x faster than the greedy baselines in the two settings shown, with $n = 2 s \log d$, $s = 300$ and $s = 20{,}000$, condition number $\kappa = 1$.]

More Empirical Results

[Figure: recovery results, including the poor-condition-number regime.]
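The RSC/RSS result means the same gradient-plus-threshold loop applies beyond least squares. A hedged sketch for sparse logistic regression, a standard RSC/RSS example (the data generation and all settings are illustrative):

```python
import numpy as np

def hard_threshold(w, s):
    out = np.zeros_like(w)
    top = np.argpartition(np.abs(w), -s)[-s:]
    out[top] = w[top]
    return out

def iht_logistic(X, y, s, eta=0.5, iters=300):
    """IHT on the mean logistic loss, labels y in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))     # sigmoid(Xw) = P(label = +1)
        grad = X.T @ (p - (y + 1) / 2) / n   # gradient of the mean logistic loss
        w = hard_threshold(w - eta * grad, s)
    return w

rng = np.random.default_rng(2)
n, d, s = 500, 200, 5
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[rng.choice(d, size=s, replace=False)] = 2.0
y = np.sign(X @ w_star)
w_hat = iht_logistic(X, y, s)
acc = np.mean(np.sign(X @ w_hat) == y)
print(acc)                                    # training accuracy
```

Only the gradient changed relative to the regression case; the projection step and the convergence argument (via RSC/RSS over sparse directions) are the same.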

Low-rank Matrix Regression

Observations $y_i = \langle X_i, W^* \rangle$ with $W^* \in \mathbb{R}^{d_1 \times d_2}$ low-rank; solve $\min_W \|y - \mathcal{A}(W)\|_2^2$ subject to $\mathrm{rank}(W) \le r$.

Convex relaxation: the trace (nuclear) norm, i.e., the sum of singular values of $W$. Several interesting results: [Recht et al. 2007; Negahban et al. 2009].

Projected Gradient Descent: $W_{t+1} = P_r\big(W_t - \eta\, \nabla f(W_t)\big)$, where $P_r$ is the projection onto rank-$r$ matrices, computed via the truncated SVD. Statistical guarantees analogous to the sparse case hold [J., Tewari, Kar 2014].

Low-rank Matrix Completion

Observe a set $\Omega$ of known entries of $W^*$; recover the rest.

Matrix completion is a special case of low-rank matrix regression. However, the assumptions required by the regression analysis are not satisfied: the entry-sampling operator is not a restricted isometry over all low-rank matrices.

Guarantees for projected gradient descent: $\epsilon$-approximate recovery in $O(\log(1/\epsilon))$ iterations, assuming $W^*$ is incoherent and the entries are sampled uniformly at random. This gives the first near-linear-time algorithm for exact matrix completion with a finite number of samples.

[J., Netrapalli 2015]

A Tale of Two Lemmas

- Lemma 1: a perturbation bound with entrywise ($\ell_\infty$-style) control, where standard bounds only give spectral-norm control; it applies when the perturbation is incoherent, zero-mean, and of small variance.
- Lemma 2: a Davis-Kahan-style result for matrix perturbation.

Empirical Results: Matrix Regression
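The projected gradient descent for completion (a gradient step on the observed entries, then a truncated-SVD projection back to rank $r$, in the style of singular value projection) can be sketched as follows; the step-size scaling by the sampling probability and all sizes are illustrative assumptions:

```python
import numpy as np

def project_rank(W, r):
    """Project onto rank-r matrices via the truncated SVD."""
    U, sv, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * sv[:r]) @ Vt[:r]

def complete(M_obs, mask, r, iters=200):
    """mask[i, j] = 1 iff entry (i, j) is observed; M_obs is zero elsewhere."""
    p = mask.mean()                 # fraction of observed entries
    W = np.zeros_like(M_obs)
    for _ in range(iters):
        G = mask * (W - M_obs)      # gradient of 0.5 * ||P_Omega(W - M)||_F^2
        W = project_rank(W - G / p, r)
    return W

rng = np.random.default_rng(3)
d, r = 50, 2
M = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))  # rank-2 target
mask = (rng.random((d, d)) < 0.5).astype(float)
W = complete(mask * M, mask, r)
rel_err = np.linalg.norm(W - M) / np.linalg.norm(M)
print(rel_err)                      # small relative error on ALL entries
```

The structural parallel to the sparse case is exact: hard thresholding of entries becomes hard thresholding of singular values.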

[Figure: hard thresholding vs. trace-norm minimization as the number of data points $n$ varies; $d = 100$, rank $= 5$.]

Summary

- High-dimensional problems: need to impose structure on the parameters.
- Typical structures: sparsity, low rank, low rank + sparsity.
- These are non-convex sets, but admit easy projections.
- Proof of convergence at a linear rate, under suitable generative model assumptions.

Future Work

- A generalized theory for such provable non-convex optimization.
- Performance analysis on different models.
- Empirical comparisons on real-world datasets.

Questions?