A Quest for a Universal Model for Signals: From Sparsity to ConvNets
Yaniv Romano, Department of Electrical Engineering, Technion - Israel Institute of Technology
Advisor: Prof. Michael Elad
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP/2007-2013), ERC grant agreement ERC-SPARSE 320649.

Modeling Data
Data sources are structured: stock market, text documents, matrix data, biological signals, social networks, seismic data, still images, videos, radar imaging, traffic info, voice signals, 3D objects, medical imaging. This structure, when identified, is the engine behind our ability to process such data.
Almost every task in data processing requires a model; this is true for denoising, deblurring, super-resolution, compression, recognition, prediction, and more. We will put a special emphasis on image denoising.

Models
A model is a mathematical description of the underlying signal of interest, describing our beliefs regarding its structure. The underlying idea: generative modeling of data sources enables
o a systematic algorithm development, and
o a theoretical analysis of their performance.

Sparse-Land
We will concentrate on sparse and redundant representations:
- a universal model,
- adaptive (via learning) to the data source at hand,
- with an intersection with machine learning.
How can we treat high-dimensional signals? We face the curse of dimensionality, as we need to learn a great many parameters.

Part I: Local Modeling
The traditional approach to treating images reduces the dimensions by working locally on small image patches. In the first part of this talk we study the limitations of this local approach and suggest different ways to overcome them. As a result, we push denoising algorithms to new heights: improving the model, improving the performance.

Part II: Global Sparse Models
We proceed by revisiting the traditional local modeling approach for treating high-dimensional signals, and ask: why can we use a local prior to solve a global problem? This will take us to Convolutional Sparse Coding (CSC), a global model that is based on a local sparsity prior. The convolutional structure is the key to bypassing the curse of dimensionality.

Part III: Going Deeper
We then propose a multi-layered extension of the CSC model, called ML-CSC, which is tightly connected to Convolutional Neural Networks (CNN): sparse representation theory meets CNN. The ML-CSC model enables a theoretical analysis of the performance of deep-learning algorithms (e.g., the forward pass).

*Our analysis holds true for fully connected networks as well.

The contributions, mapped:
Denoising via local modeling (improving local modeling; improved denoising performance):
- Boosting of Image Denoising Algorithms [SIAM IS 15]
- Patch-Disagreement as a Way to Improve K-SVD Denoising [IEEE-ICASSP 15]
- Gaussian Mixture Diffusion [IEEE-ICSEE 16]
- Improving K-SVD Denoising by Post-Processing its Method-Noise [IEEE-ICIP 13]
- Con-Patch: When a Patch Meets its Context [IEEE-TIP 16]
Global sparse modeling (global sparse priors, and the connection to deep learning):
- Convolutional Neural Networks Analyzed via Convolutional Sparse Coding [JMLR 16]
- Convolutional Dictionary Learning via Local Processing [ICCV 17]
- On the Global-Local Dichotomy in Sparsity Modeling [Compressed-Sensing 17]
Inverse problems (leveraging denoising algorithms to treat other inverse problems; better treatment of inverse problems by improving local priors):
- The Little Engine that Could: Regularization by Denoising (RED) [SIAM IS 17]
- Turning a Denoiser into a Super-Resolver using Plug and Play Priors [IEEE-ICIP 16]
- RAISR: Rapid and Accurate Image Super Resolution [IEEE-TCI 16]
- Example-Based Image Synthesis via Randomized Patch-Matching [IEEE-TIP 17]
- Single Image Interpolation via Adaptive Non-Local Sparsity-Based Modeling [IEEE-TIP 14]

Part I: Image Denoising via Local Modeling

Image Denoising
A noisy image is the original image contaminated by white Gaussian noise. Leading image denoising methods are built upon powerful patch-based local models.

The Sparse-Land Model
Assumes that every patch is a linear combination of a few columns, called atoms, taken from a matrix $D$ called a dictionary. The operator $R_i$ extracts the $i$-th $n$-dimensional patch from the image $X$.
Sparse coding: $(P_0): \min_{\alpha_i} \|\alpha_i\|_0 \ \text{s.t.}\ R_i X = D\alpha_i$

Patch Denoising
Given a noisy patch $R_i Y$, solve
$(P_0^\epsilon): \hat{\alpha}_i = \arg\min_{\alpha_i} \|\alpha_i\|_0 \ \text{s.t.}\ \|R_i Y - D\alpha_i\|_2 \le \epsilon$
The clean patch is then estimated as $D\hat{\alpha}_i$. Both $(P_0)$ and $(P_0^\epsilon)$ are hard to solve; in practice one uses greedy methods, such as Orthogonal Matching Pursuit (OMP) or Thresholding, or convex relaxations, such as Basis Pursuit (BP):
$(P_1): \min_{\alpha_i} \lambda\|\alpha_i\|_1 + \tfrac{1}{2}\|R_i Y - D\alpha_i\|_2^2$

Recall K-SVD Denoising [Elad & Aharon, 06]
Starting from the noisy image and an initial dictionary, alternate: denoise each patch using OMP, and update the dictionary using K-SVD; then reconstruct the image by averaging the overlapping denoised patches. Despite its simplicity, this is a very well-performing algorithm. We refer to this framework as patch averaging.

What is Missing?
Many researchers kept revisiting this algorithm with a feeling that key features are still lacking. Here is what WE thought of: the Local-Global Gap — efficient independent local patch processing vs. the global need to model the entire image.
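The patch-denoising step $(P_0^\epsilon)$, approximated here by OMP, can be sketched in a few lines. This is a minimal illustration, not the K-SVD implementation: the dictionary, patch size, amplitudes, and error threshold below are all hypothetical.

```python
import numpy as np

def omp(D, y, eps):
    """Greedy OMP: approximate argmin ||a||_0 s.t. ||y - D a||_2 <= eps."""
    residual = y.copy()
    support = []
    a = np.zeros(D.shape[1])
    while np.linalg.norm(residual) > eps and len(support) < D.shape[0]:
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares fit of y on the currently selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        a[:] = 0.0
        a[support] = coef
        residual = y - D @ a
    return a

# toy example: a random unit-norm dictionary and a noisy 2-sparse patch
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)                  # atoms of unit norm
a_true = np.zeros(128); a_true[[3, 40]] = [1.5, -2.0]
y = D @ a_true + 0.01 * rng.standard_normal(64)  # noisy patch
a_hat = omp(D, y, eps=0.2)
patch_hat = D @ a_hat                            # the denoised patch
```

In the full K-SVD scheme this pursuit runs over all overlapping patches, followed by the dictionary update and patch averaging.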

Remedies for the local-global gap:
- Whitening the residual image [Romano & Elad 13]
- Exploiting self-similarities [Romano, Protter & Elad 14]
- Leveraging the context of the patch [Romano & Elad 16], [Romano, Isidoro & Milanfar 16]
- Enforcing the local model on the final patches (EPLL) [Ren, Romano & Elad 17]
- SOS-Boosting [Romano & Elad 15]

SOS-Boosting [Romano & Elad 15]
Given any denoiser $f(\cdot)$, how can we improve its performance? Strengthen the signal, Operate the denoiser, and Subtract the previous result:
$\hat{X}^{k+1} = f(Y + \rho\hat{X}^k) - \rho\hat{X}^k$
Intuitively, an improvement is expected since the strengthened signal $Y + \rho\hat{X}^k$ has a better SNR than $Y$.
We study the convergence of the SOS using only the linear part of the denoiser, $\hat{X} = f(Y) = WY$. This holds for K-SVD, EPLL, NLM, BM3D and others, where the overall processing is divided into a non-linear stage of decisions, followed by a linear filtering.
Relation to graph theory: the SOS minimizes a Laplacian regularization functional of the form $\hat{X} = \arg\min_X \|X - Y\|_2^2 + \rho\, X^T(I - W)X$.
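The SOS recursion above is denoiser-agnostic, so it can be sketched with any black-box $f$. Here a simple soft-thresholding denoiser stands in for K-SVD/BM3D, and the signal, noise level, and $\rho$ are hypothetical, chosen only to exercise the Strengthen-Operate-Subtract loop:

```python
import numpy as np

def sos_boost(f, y, rho=1.0, n_iters=10):
    """SOS-boosting: Strengthen (y + rho*x), Operate (denoise),
    Subtract (rho*x):  x_{k+1} = f(y + rho*x_k) - rho*x_k."""
    x = f(y)                          # initial denoised estimate
    for _ in range(n_iters):
        x = f(y + rho * x) - rho * x
    return x

def soft_threshold_denoiser(v, beta=0.5):
    """Toy stand-in denoiser: elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - beta, 0.0)

rng = np.random.default_rng(1)
x_clean = np.zeros(100); x_clean[::10] = 3.0        # sparse "signal"
y = x_clean + 0.3 * rng.standard_normal(100)        # noisy measurement
x_hat = sos_boost(soft_threshold_denoiser, y)
```

Any off-the-shelf denoiser can be plugged in as `f`; the recursion itself is the whole algorithm.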

Relation to game theory: the SOS encourages overlapping patches to reach a consensus.

What is Missing? (revisited)
Missing a theoretical backbone! Why can we use a local prior to solve a global problem?

Part II: Convolutional Sparse Coding
Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding — Vardan Papyan, Jeremias Sulam and Michael Elad

Convolutional Dictionary Learning via Local Processing — Vardan Papyan, Yaniv Romano, Jeremias Sulam, and Michael Elad
Prior work: [LeCun, Bottou, Bengio & Haffner 98], [Lewicki & Sejnowski 99], [Hashimoto & Kurata 00], [Morup, Schmidt & Hansen 08], [Zeiler, Krishnan, Taylor & Fergus 10], [Jarrett, Kavukcuoglu, Gregor & LeCun 11], [Heide, Heidrich & Wetzstein 15], [Gu, Zuo, Xie, Meng, Feng & Zhang 15]

Intuitively
The global signal is a superposition of local filters convolved with sparse feature maps (the first filter, the second filter, and so on).

Convolutional Sparse Coding (CSC)
Writing the convolutions in matrix form, $X = D\Gamma$, where $D$ is a banded convolutional dictionary built from all the shifts of a local dictionary of $m$ filters of length $n$.

Convolutional Sparse Representation
Every patch of the signal satisfies $R_i X = \Omega\gamma_i$, where $\Omega$ is the stripe dictionary, of size $n \times (2n-1)m$, and $\gamma_i$ is the corresponding stripe vector. Adjacent representations overlap, as they skip by $m$ items as we sweep through the patches of $X$.

CSC Relation to Our Story
- A clear global model: every patch has a sparse representation w.r.t. the same local dictionary $\Omega$, just as we have assumed
- No notion of disagreement on the patch overlaps
- Related to the current common practice of patch averaging ($R_i^T$ puts the patch back in the $i$-th location of the global vector)
What about the pursuit? Patch averaging performs independent sparse coding for each patch, whereas CSC should seek all the representations together. Is there a bridge between the two? We'll come back to this later.
What about the theory? Until recently, little was known regarding the theoretical aspects of CSC.

Classical Sparse Theory (Noiseless)
$(P_0): \min_\Gamma \|\Gamma\|_0 \ \text{s.t.}\ X = D\Gamma$
Mutual coherence: $\mu(D) = \max_{i \ne j} |d_i^T d_j|$
Theorem [Donoho & Elad 03], [Tropp 04]: OMP and BP are guaranteed to recover the true sparse code assuming that $\|\Gamma\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$.
By the Welch bound [Welch 74], the mutual coherence of the global convolutional dictionary cannot be small, so uniqueness and success of the pursuits are guaranteed only for a handful of non-zeros — less than 8 non-zeros GLOBALLY are allowed! This is a very pessimistic result.
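For intuition, the mutual coherence and the resulting classical bound are simple to compute. The dictionary below is a random Gaussian one, purely for illustration:

```python
import numpy as np

def mutual_coherence(D):
    """mu(D) = max_{i != j} |d_i^T d_j| for a column-normalized dictionary."""
    D = D / np.linalg.norm(D, axis=0)   # normalize the atoms
    G = np.abs(D.T @ D)                 # absolute Gram matrix
    np.fill_diagonal(G, 0.0)            # ignore the trivial i == j entries
    return G.max()

rng = np.random.default_rng(2)
D = rng.standard_normal((64, 128))
mu = mutual_coherence(D)
bound = 0.5 * (1.0 + 1.0 / mu)  # OMP/BP succeed if ||Gamma||_0 < bound
```

For highly coherent dictionaries (such as the global convolutional one), `mu` is large and `bound` collapses to a handful of non-zeros, which is exactly the pessimism noted above.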

Classic Sparse-Land theory cannot provide good explanations for the CSC model.

Moving to Local Sparsity: Stripes
Define the stripe sparsity $\|\Gamma\|_{0,\infty}^s = \max_i \|\gamma_i\|_0$, and consider
$(P_{0,\infty}): \min_\Gamma \|\Gamma\|_{0,\infty}^s \ \text{s.t.}\ X = D\Gamma$
If $\|\Gamma\|_{0,\infty}^s$ is low, all the stripes $\gamma_i$ are sparse, i.e., every patch has a sparse representation over $\Omega$.
Theorem [Papyan, Sulam & Elad 16]: If $\Gamma$ is locally sparse, $\|\Gamma\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$, then:
- The solution of $(P_{0,\infty})$ is necessarily unique
- The global OMP and BP are guaranteed to recover it
This result poses a local constraint for a global guarantee, and as such, the guarantees scale linearly with the dimension of the signal.

From Ideal to Noisy Signals
In practice, $Y = X + E$, where $E$ is due to noise or model deviations. Consider
$(P_{0,\infty}^\epsilon): \min_\Gamma \|\Gamma\|_{0,\infty}^s \ \text{s.t.}\ \|Y - D\Gamma\|_2 \le \epsilon$
How close is the estimate $\hat{\Gamma}$ to the true $\Gamma$?

Theorem [Papyan, Sulam & Elad 16]: If $\Gamma$ is locally sparse and the noise is bounded, then:
- The solution of $(P_{0,\infty}^\epsilon)$ is stable
- The solution obtained via global OMP/BP is stable
i.e., the true and estimated representations are close.

Global Pursuit via Local Processing
$(P_1): \hat{\Gamma}_{BP} = \arg\min_\Gamma \tfrac{1}{2}\|Y - D\Gamma\|_2^2 + \lambda\|\Gamma\|_1$
Decompose the global signal as $X = D\Gamma = \sum_i R_i^T \alpha_i$, where the $\alpha_i = D_L s_i$ are slices, not patches ($D_L$ is the local dictionary and $s_i$ the local sparse codes).
Using variable splitting,
$\min_{\{\alpha_i\},\{s_i\}} \tfrac{1}{2}\big\|Y - \sum_i R_i^T \alpha_i\big\|_2^2 + \lambda\sum_i\|s_i\|_1 \ \text{s.t.}\ \alpha_i = D_L s_i$
These two problems are equivalent, and convex w.r.t. their variables. The new formulation targets the local slices and their sparse representations, and can be solved via ADMM by replacing the constraint with a penalty:
$\min_{\{\alpha_i\},\{s_i\}} \tfrac{1}{2}\big\|Y - \sum_i R_i^T \alpha_i\big\|_2^2 + \lambda\sum_i\|s_i\|_1 + \tfrac{\rho}{2}\sum_i\|\alpha_i - D_L s_i + u_i\|_2^2$
ADMM alternates between:
- Sparse update: $\min_{\{s_i\}} \lambda\sum_i\|s_i\|_1 + \tfrac{\rho}{2}\sum_i\|\alpha_i - D_L s_i + u_i\|_2^2$ — separable local problems, solved e.g. via LARS. This allows us to solve the global problem while operating locally, with global guarantees.
- Slice update: $\min_{\{\alpha_i\}} \tfrac{1}{2}\big\|Y - \sum_i R_i^T \alpha_i\big\|_2^2 + \tfrac{\rho}{2}\sum_i\|\alpha_i - D_L s_i + u_i\|_2^2$ — simple L2-based aggregation and averaging.
Comment: one iteration of this procedure amounts to the very same patch-averaging algorithm we started with.

Two Comments About this Scheme
- The proposed scheme can be used for dictionary learning: a slice-based DL algorithm using standard patch-based tools, leading to a faster and simpler method compared to existing ones.
- We work with slices and not patches.
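The two ADMM sub-steps above can be sketched in one dimension. This is an illustrative toy, not the paper's algorithm: circular (wrap-around) slices keep the aggregation closed-form, a few ISTA steps stand in for the exact LARS/BP local solver, and the sizes, $\lambda$, and $\rho$ are hypothetical:

```python
import numpy as np

def soft(v, t):
    """Elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(3)
n, m, N = 8, 16, 64                 # patch size, # local atoms, signal length
DL = rng.standard_normal((n, m))
DL /= np.linalg.norm(DL, axis=0)    # unit-norm local dictionary
Y = rng.standard_normal(N)          # stand-in for an input signal
lam, rho = 0.1, 1.0
idx = (np.arange(N)[:, None] + np.arange(n)[None, :]) % N  # circular slices

s = np.zeros((N, m))                # local sparse codes (one per slice)
u = np.zeros((N, n))                # scaled ADMM dual variables
alpha = np.zeros((N, n))            # the slices themselves
L = np.linalg.norm(DL, 2) ** 2      # Lipschitz constant for the ISTA steps

for _ in range(30):
    # sparse update: independent tiny pursuits, one per slice
    for _ in range(5):
        grad = (s @ DL.T - (alpha + u)) @ DL
        s = soft(s - grad / L, lam / (rho * L))
    # slice update: closed-form L2-based aggregation and averaging
    # (with circular slices, sum_i R_i^T R_i = n*I, so X has a closed form)
    q = s @ DL.T - u                    # per-slice targets D_L s_i - u_i
    Q = np.zeros(N)
    np.add.at(Q, idx, q)                # sum_i R_i^T q_i
    X = (rho * Q + n * Y) / (rho + n)   # aggregated global estimate
    alpha = q + (Y - X)[idx] / rho
    u += alpha - s @ DL.T               # dual update

X_hat = X                               # reconstructed global signal
```

The key point survives the simplification: the expensive sparse coding is strictly local and separable, while globality enters only through the cheap averaging step.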

[Figure: patches extracted from natural images, and their corresponding slices, comparing Heide et al., Wohlberg [Wohlberg, 2016], and ours. Observe how the slices are far simpler than, and contained in, their corresponding patches.]

Part III: Going Deeper
Convolutional Neural Networks Analyzed via Convolutional Sparse Coding — joint work with Vardan Papyan and Michael Elad

CSC and CNN
There is an analogy between CSC and CNN:
- Convolutional structure
- Data-driven models
- ReLU is a sparsifying operator
We propose a principled way to analyze CNN through the eyes of sparse representations. But first, a short review of CNN.

Forward Pass of CNN
$f(X, \{W_i\}, \{b_i\}) = \text{ReLU}\big(b_2 + W_2^T\, \text{ReLU}(b_1 + W_1^T X)\big)$

Training Stage of CNN
Consider the task of classification. Given a set of signals $\{X_j\}$ and their corresponding labels $\{h(X_j)\}$, the CNN learns an end-to-end mapping:
$\min_{\{W_i\},\{b_i\},U} \sum_j \ell\big(h(X_j),\, U f(X_j, \{W_i\}, \{b_i\})\big)$
(true label; classifier $U$; output of the last layer)

Back to CSC

Convolutional sparsity (CSC) assumes an inherent structure is present in natural signals: $X = D_1\Gamma_1$. We propose to impose the same structure on the representations themselves: $\Gamma_1 = D_2\Gamma_2$, and so on. This is the Multi-Layer CSC (ML-CSC) model.

Intuition: From Atoms to Molecules
- Columns in $D_1$ are convolutional atoms
- Columns in $D_1 D_2$ combine the atoms in $D_1$, creating more complex structures: the effective dictionary $D_1 D_2$ is a superposition of the atoms of $D_1$
- The size of the effective atoms is increased throughout the layers (receptive field)
One could chain the multiplication of all the dictionaries into one effective dictionary, $D_{\text{eff}} = D_1 D_2 \cdots D_K$, and proceed as in Sparse-Land with $X = D_{\text{eff}}\Gamma_K$. However, a key property in this model is the sparsity of each intermediate representation (the feature maps).

A Small Taste: Model Training (MNIST)
MNIST dictionary:
- D1: 32 filters of size 7x7, with stride of 2 (dense); receptive field 7x7
- D2: 128 filters of size 5x5x32, with stride of 1 — 99.09% sparse; receptive field 15x15
- D3: 1024 filters of size 7x7x128 — 99.89% sparse; receptive field 28x28

A Small Taste: Pursuit
Decomposing an input digit $Y$:

$\Gamma_1$: 94.51% sparse (213 nnz); $\Gamma_2$: 99.52% sparse (30 nnz); $\Gamma_3$: 99.51% sparse (5 nnz)

Deep Coding Problem (DCP)
Noiseless pursuit: find $\{\Gamma_j\}_{j=1}^K$ such that
$X = D_1\Gamma_1,\ \Gamma_1 = D_2\Gamma_2,\ \ldots,\ \Gamma_{K-1} = D_K\Gamma_K$, with $\|\Gamma_j\|_{0,\infty}^s \le \lambda_j$ for every layer $j$.
Noisy pursuit: find $\{\Gamma_j\}_{j=1}^K$ such that
$\|Y - D_1\Gamma_1\|_2 \le \mathcal{E}_0,\ \|\Gamma_1 - D_2\Gamma_2\|_2 \le \mathcal{E}_1,\ \ldots,\ \|\Gamma_{K-1} - D_K\Gamma_K\|_2 \le \mathcal{E}_{K-1}$, with the same sparsity constraints.

Deep Learning Problem (DLP)
Supervised dictionary learning (task-driven dictionary learning [Mairal, Bach, Sapiro & Ponce 12]):
$\min_{\{D_i\}_{i=1}^K, U} \sum_j \ell\big(h(X_j),\, U\, \Gamma_K^\star(X_j, \{D_i\})\big)$
(true label; classifier $U$; $\Gamma_K^\star$ is the deepest representation obtained from the DCP)
Unsupervised dictionary learning: find $\{D_i\}_{i=1}^K$ such that, for every training example $X_j$, there exist representations satisfying all the ML-CSC constraints above.

ML-CSC: The Simplest Pursuit
The simplest pursuit algorithm (single-layer case) is the THR algorithm, which operates on a given input signal $Y = D\Gamma + E$, where $\Gamma$ is sparse, by
$\hat{\Gamma} = S_\beta^+(D^T Y)$
Restricting the coefficients to be nonnegative does not restrict the expressiveness of the model, and the soft nonnegative thresholding operator $S_\beta^+$ is nothing but a ReLU.
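The identity $S_\beta^+(z) = \text{ReLU}(z - \beta)$ can be verified directly; thresholding $D^T Y$ is exactly one ReLU layer whose bias is $-\beta$. The sizes, dictionary, and threshold below are hypothetical:

```python
import numpy as np

def soft_nonneg_threshold(z, beta):
    """Soft nonnegative thresholding: shift by beta, clip negatives to zero."""
    return np.maximum(z - beta, 0.0)

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(4)
D = rng.standard_normal((50, 80))   # dictionary / transposed weight matrix
y = rng.standard_normal(50)
beta = 0.3                          # threshold, playing the role of a negative bias

gamma_thr = soft_nonneg_threshold(D.T @ y, beta)  # sparse coding by thresholding
gamma_cnn = relu(D.T @ y - beta)                  # one CNN layer: ReLU(W^T y + b), b = -beta
assert np.allclose(gamma_thr, gamma_cnn)          # the very same operator
```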

Consider this for Solving the DCP
Layered thresholding (LT): estimate $\Gamma_1$ via the THR algorithm, and then estimate $\Gamma_2$ from it:
$\hat{\Gamma}_2 = S_{\beta_2}^+\big(D_2^T\, S_{\beta_1}^+(D_1^T Y)\big)$
Compare with the forward pass of CNN:
$f(X) = \text{ReLU}\big(b_2 + W_2^T\, \text{ReLU}(b_1 + W_1^T X)\big)$
Translating between the two languages: CNN's ReLU, bias, weights, and forward pass correspond to Sparse-Land's soft nonnegative thresholding, thresholds, dictionary, and layered soft nonnegative THR.
The layered (soft nonnegative) thresholding and the forward-pass algorithm are the very same thing!

Consider this for Solving the DLP
DLP (supervised): $\min_{\{D_i\}_{i=1}^K, U} \sum_j \ell\big(h(X_j),\, U\, \Gamma_K^\star(X_j, \{D_i\})\big)$
The thresholds for the DCP should also be learned. Estimating $\Gamma_K^\star$ via the layered THR algorithm, this becomes CNN training:
$\min_{\{W_i\},\{b_i\},U} \sum_j \ell\big(h(X_j),\, U f(X_j, \{W_i\}, \{b_i\})\big)$
The problem solved by the training stage of CNN and the DLP are equivalent as well, assuming that the DCP is approximated via the layered thresholding algorithm.
*Recall that for the ML-CSC there exists an unsupervised avenue for training the dictionaries that has no simple parallel in CNN.
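The two-layer equivalence above can be checked numerically: layered soft nonnegative thresholding with thresholds $\beta_i$ coincides with a forward pass whose biases are $-\beta_i$. Dictionaries and thresholds here are random and purely illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(5)
N, m1, m2 = 60, 100, 140
D1 = rng.standard_normal((N, m1))
D2 = rng.standard_normal((m1, m2))
beta1, beta2 = 0.2, 0.1
Y = rng.standard_normal(N)

# layered soft nonnegative thresholding (ML-CSC pursuit)
gamma1 = np.maximum(D1.T @ Y - beta1, 0.0)
gamma2 = np.maximum(D2.T @ gamma1 - beta2, 0.0)

# forward pass of a two-layer network with W_i = D_i and b_i = -beta_i
f_out = relu(D2.T @ relu(D1.T @ Y - beta1) - beta2)
assert np.allclose(gamma2, f_out)   # identical outputs, layer by layer
```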

Theoretical Questions
Given a signal from the ML-CSC model (every $\Gamma_i$ is sparse), what can we say about the estimates $\{\hat{\Gamma}_i\}_{i=1}^K$ obtained via layered thresholding (the forward pass), or via other pursuits?

Uniqueness of the DCP
Theorem: If a set of solutions $\{\Gamma_i\}_{i=1}^K$ is found for the DCP such that $\|\Gamma_i\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right)$ for every layer, then these are necessarily the unique solution to this problem. [In the spirit of the mutual-coherence results of Donoho & Elad 03]
The feature maps CNN aims to recover are unique.

Stability of the DCP
Suppose that we manage to solve the noisy DCP (somehow) and find a feasible set of representations $\{\hat{\Gamma}_i\}_{i=1}^K$ satisfying all the conditions. How close is $\hat{\Gamma}_i$ to $\Gamma_i$?
Theorem: If the true representations satisfy $\|\Gamma_i\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right)$, then $\|\hat{\Gamma}_i - \Gamma_i\|_2 \le \mathcal{E}_i$, where the error levels $\mathcal{E}_i$ grow with the depth.
The problem CNN aims to solve is stable under certain conditions; however, observe this annoying effect of error magnification as we dive into the model.

Local Noise Assumption
Our analysis relied on the local sparsity of the underlying solution, enforced through the $\|\cdot\|_{0,\infty}^s$ norm. In what follows, we present stability guarantees that also depend on the local energy of the noise vector, enforced via the $\|\cdot\|_{2,\infty}^p$ norm, defined as $\|E\|_{2,\infty}^p = \max_i \|R_i E\|_2$.

Stability of Layered-THR
Theorem: If the representations are locally sparse enough — in a manner that depends on the coherence $\mu(D_i)$, the contrast $|\Gamma_i^{\min}| / |\Gamma_i^{\max}|$, and the local noise level — then the layered hard THR (with properly chosen thresholds) finds the correct supports, and the recovery error is bounded.
The stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded.
Problems: 1. Contrast 2. Error growth 3. Error even if no noise

Layered Basis Pursuit (Noisy)
We chose the thresholding algorithm due to its simplicity, but we do know that there are better pursuit methods; how about using them? Let's use the Basis Pursuit instead:
$\hat{\Gamma}_1^{LBP} = \arg\min_{\Gamma_1} \tfrac{1}{2}\|Y - D_1\Gamma_1\|_2^2 + \xi_1\|\Gamma_1\|_1$
$\hat{\Gamma}_2^{LBP} = \arg\min_{\Gamma_2} \tfrac{1}{2}\|\hat{\Gamma}_1^{LBP} - D_2\Gamma_2\|_2^2 + \xi_2\|\Gamma_2\|_1$
This is reminiscent of deconvolutional networks [Zeiler, Krishnan, Taylor & Fergus 10].

Stability of Layered BP
Theorem: Assuming the representations are locally sparse and the noise is locally bounded, then for correctly chosen $\{\xi_i\}$ we are guaranteed that:
1. The support of $\hat{\Gamma}_i^{LBP}$ is contained in that of $\Gamma_i$
2. The error is bounded: $\|\hat{\Gamma}_i^{LBP} - \Gamma_i\|_{2,\infty}^p \le \mathcal{E}_i$, where $\mathcal{E}_i$ grows with the depth
3. Every entry in $\Gamma_i$ greater than $\mathcal{E}_i / \sqrt{\|\Gamma_i\|_{0,\infty}^s}$ will be found
Problems: 1. Contrast 2. Error growth 3. Error even if no noise

Layered Iterative Thresholding
The layered BP can be solved via layered iterative soft-thresholding:
$\Gamma_j^t = S_{\xi_j / c_j}\left(\Gamma_j^{t-1} + \frac{1}{c_j} D_j^T\big(\hat{\Gamma}_{j-1} - D_j\Gamma_j^{t-1}\big)\right)$
Note that unrolling these iterations implies that groups of layers share the same dictionaries; this can be seen as a recurrent neural network [Gregor & LeCun 10].
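The layered BP above can be sketched by running plain ISTA per layer and feeding each layer's solution into the next pursuit. The dictionaries, $\xi_j$, and iteration count are hypothetical; in practice accelerated solvers (FISTA, LARS) would be used:

```python
import numpy as np

def soft(v, t):
    """Elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, xi, n_iters=100):
    """Basis pursuit for one layer: min_g 0.5*||x - D g||^2 + xi*||g||_1."""
    L = np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the gradient
    g = np.zeros(D.shape[1])
    for _ in range(n_iters):
        g = soft(g + D.T @ (x - D @ g) / L, xi / L)
    return g

rng = np.random.default_rng(6)
D1 = rng.standard_normal((60, 90));  D1 /= np.linalg.norm(D1, axis=0)
D2 = rng.standard_normal((90, 120)); D2 /= np.linalg.norm(D2, axis=0)
Y = rng.standard_normal(60)

# layered BP: solve layer 1, then pursue its representation in layer 2
gamma1 = ista(D1, Y, xi=0.1)
gamma2 = ista(D2, gamma1, xi=0.1)
```

Unrolling the inner ISTA loop turns each layer into a stack of tied-weight sub-layers, which is exactly the recurrent-network view mentioned above.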

This Talk
The underlying idea: modeling the data source in order to be able to propose and theoretically analyze processing algorithms.
- We described the theoretical limitations of independent patch-based processing, and presented several ways to overcome these (local sparsity)
- We presented a theoretical study of the CSC, and a practical algorithm that works locally while guaranteeing to solve the global problem (Convolutional Sparse Coding)
- We proposed a multi-layer extension of CSC, shown to be tightly connected to CNN (Multi-Layer Convolutional Sparse Coding) — an extension of the classical sparse theory to a multi-layer setting
- This led us to several interesting connections between CSC and CNN: the ML-CSC was shown to enable a theoretical study of CNN, along with new insights

Questions?