SIGNAL PROCESSING AND NETWORKING FOR BIG DATA APPLICATIONS LECTURES 12-13: REGULARIZATION AND OPTIMIZATION ZHU HAN UNIVERSITY OF HOUSTON 1 THANKS XUSHENG DU AND KEVIN TSAI FOR SLIDE PREPARATION

PART 1 REGULARIZATION OUTLINE Parameter Norm Penalties Dataset Augmentation Noise Robustness Semi-Supervised Learning Multi-Task Learning Early Stopping Parameter Tying and Parameter Sharing

Sparse Representations Bagging and Other Ensemble Methods Dropout Adversarial Training Federated Training 2

REGULARIZATION Reduce test error Solve overfitting (Pediatric ADHD) Regularization of an estimator works by trading increased bias for reduced variance (mention CRLB vs. MUSIC) 3 PARAMETER NORM PENALTIES Norm penalty where is the parameters/weights Note: In this chapter, we ignore bias for simplicity, so

indicate weights should be affected by a norm penalty where includes and unregularized parameters Objective function Regularized objective function is a hyperparameter results in no regularization 4 PARAMETER NORM PENALTIES

Minimize will decrease both original objective and the size of the parameters (or some subset) Typically penalizes only the weights and leaves the biases unregularized Each bias controls only a single variable. Regularizing the bias parameters can introduce a significant amount of underfitting

5 PARAMETER NORM PENALTIES L2 PARAMETER REGULARIZATION Weight decay Ridge regression or Tikhonov regularization Regularization term Total objective function minimize Parameter gradient Update weights

6 PARAMETER NORM PENALTIES L1 PARAMETER REGULARIZATION Regularization term minimize Total objective function Parameter gradient Gradient no longer scales linearly

Usually use as a feature selection mechanism (LASSO) 7 DATASET AUGMENTATION 8 DATASET AUGMENTATION Limited data create fake data Very effective technique for

Object Recognition Injecting noise add noise to data Difficult for some task: density estimation, etc. (solve problem first) 9 DATASET AUGMENTATION

THINGS TO LOOK-OUT Need carefully generate data Image rotation produce different classes images Too many noise And more 10

NOISE ROBUSTNESS 11 NOISE ROBUSTNESS Noise injection can be much more powerful than shrinking the parameters Add to inputs or weights Org. object function:

Object function with noise: 12 SEMI-SUPERVISED LEARNING 13 SEMI-SUPERVISED LEARNING Goal is to classify the inputs have similar representation to the same class

Learn from both labeled and unlabeled input to estimate Usually construct separate models for supervised and unsupervised with shared parameters 14 SEMI-SUPERVISED LEARNING Compare labeled input with unlabeled input. Fill in missing data.

Usually used for preprocessing F1 F2 F3

Y (class) A B C 1

A B C More detail: Salakhutdinov and Hinton (2008) and Chapelle et al. (2006) 15 MULTI-TASK LEARNING

16 MULTI-TASK LEARNING When part of a model is shared across tasks, that part of the model is more constrained towards good values Generic parameters Task-specific parameters Deep learning point of view: among the

factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks 17 EARLY STOPPING 18 EARLY STOPPING

Training error decreases but validation set error increases Stop training when reach local minimum of validation error (for certain amount of time when error not improve) Save parameters when validation error improves 19

EARLY STOPPING ADVANTAGES & DISADVANTAGES ADVANTAGES Reduces computational cost Will not damage the learning dynamics DISADVANTAGES

Requires a validation set Memory to save parameters Early stop checking process can run parallel with training process (ideally with CPU or GPU) 20 EARLY STOPPING

ALGORITHM 21 PARAMETER TYING & PARAMETER SHARING 22 PARAMETER TYING We might not know precisely

what values the parameters should take From knowledge of the domain and model architecture, there should be some dependencies between the model parameters 23

PARAMETER TYING EXAMPLE Two models ( and ) performing the same classification task Assumptions: different input distributions Two models map the input to different but related outputs: and should be close to where we can use a penalty:

24 PARAMETER SHARING Force sets of parameters to be equal Lead to significant reduction in the memory footprint of the model. Ex: convolutional neural networks (CNNs) 25

SPARSE REPRESENTATIONS 26 SPARSE REPRESENTATIONS Place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse Sparse parametrization many of the

parameters become zero (or close to zero) Penalty derived from a Student-t prior (Olshausen and Field, 1996; Bergstra, 2011), KL divergence penalties (Larochelle and Bengio, 2008), etc. 27 BAGGING

28 BAGGING Def: technique for reducing generalization error by combining several models Model averaging: train several models and have them vote on the output

Different models will usually not make same errors on test set means models avg. not help. means errors and perfectly uncorrelated Variances Covariances 29

BAGGING EXAMPLE Disadv.: Training multiple models and evaluating multiple models on each test example 30 DROPOUT 31

DROPOUT Removing non-output units by multiplying the unit output value by zero Unlike bagging, dropout aims for exponentially large number of neural networks 32

DROPOUT Minibatch-based learning algorithm Randomly sample a different binary mask when load an when load an example into a minibatch 80% include input unit and 50% include hidden unit

33 DROPOUT Let be the mask vector and be the parameters Dropout training consists in minimizing (expectation of loss function)

Contains exponentially many terms but can obtain unbiased estimate 34 DROPOUT ADVANTAGE & DISADVANTAGE Adv.:

Computationally cheap per example per update Not significantly limit the type of model as long as it can be trained with stochastic gradient decent Optimal validation set error is much lower Disadv.: Large model Much more iterations when training 35

ADVERSARIAL TRAINING 36 ADVERSARIAL TRAINING Check the networks level of understanding of a task Use optimization procedure to search for input that make the neural network fail the task in order to TRAIN the network 37

ADD-ON MATERIAL FEDERATED LEARNING COLLABORATIVE MACHINE LEARNING WITHOUT CENTRALIZED TRAINING DATA 38 TYPICAL DEEP LEARNING MODEL Data from numerous phones Train in the cloud server Model back to phone Same model to all users

39 FEDERATED LEARNING Model to phone Train in phone Summarize changes Summary back to server 40

ADVANTAGE Smarter models Lower latency Less power consumption Privacy Personalized model, immediate usability 41 TECHNIQUE CHALLENGES optimization algorithm

SGD: high iteration Devices are not even, it requires low throughput Federated Averaging algorithm 10-100x less communication compared to a naively one use the powerful processors in modern mobile devices to compute higher quality updates than simple gradient steps. Upload speed << download speed Compressing updates

high-dimensional sparse convex models: click-through-rate prediction 42 SECURITY No user data stored in the cloud Secure Aggregation protocol Only can decrypt the average when 100s or 1000s of users participated

No individuals data can be inspected before averaging. 43 SCENARIO the device is idle, plugged in free wireless connection

44 PART II: OPTIMIZATION OUTLINE Difference between pure optimization and optimization in training Challenges in optimization of neural networks Introduction to some optimization algorithms 45

GOAL OF OPTMIZATION IN TRAINING Finding the parameters of a neural network that significantly reduce a cost function J(), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms. 46 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN

TRAINING Pure optimization acts directly while machine learning (ML) acts indirectly Training Set Suppose we have performance measurement P, which is optimized on training set, but our goal is to let it does good job on testing set.

ML Testing Set 47 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Pure optimization acts directly while machine learning (ML) acts indirectly

empirical distribution per-example loss function target predicted output the data generating

distribution 48 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Empirical Risk Minimization the data generating distribution

target predicted output per-example loss function The goal of ML is to reduce the expected generalization error given by , the quantity of which is also known as risk

But ML only knows the distribution of training samples, if the true distribution is known, then ML is converted to an optimization problem. 49 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Empirical Risk Minimization Convert ML to optimization problem, which minimize the expected loss on the training set

However, this approach is often overfitting. Give up generalization too much. 50 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Surrogate loss function EX: 0-1 loss function: Use negative log-likelihood as surrogate loss function.

The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if the model can do that well, then it can pick the classes that yield the least classification error in expectation. 51 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Surrogate loss function

EX: 0-1 loss function: However, sometimes, loss is not simply 0 or 1. Even in the same class, every sample may have its own value of loss compared to other samples 52 EX: diagnose in hospital, it's more costly to miss a postive case of disease (false negative) than to falsely dignose disease (false positive).

Loss of false negative > Loss of false positive 53 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Surrogate loss function EX: 0-1 loss function: When 0-1 loss function reaches 0, the loss of loglikelihood continues decreasing, which guarantees

robustness of the classifier by further pushing the classes apart by considering the actual loss (some mistakes in diagnose will be more severe than others) 54 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Batch and minibatch

One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples. 55 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING

Batch and minibatch = 56 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Batch and minibatch optimize

Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset. 57 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING

Batch and minibatch In a huge dataset, there're big amounts of redundancy. Assume a worst case, all m samples in the training set could be identical copies of each other. A sampling based estimate of the gradient could compute the correct gradient with a single sample, using m times less computation than the naive approach. To solve this problem, sampling is needed. 58

DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Batch and minibatch Batch training: put all training samples into training at once Stochastic training (online training): fetch water from a stream Deep Learning use something in between (batch and stoachastic), which is called minibatch. 59

DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Batch and minibatch The size of minibatch is driven by following factors: Larger batches provide a more accurate estimate of the gradient, but with less than linear returns. Multicore architectures are usually underutilized by

extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch. If all examples in the batch are to be processed in parallel (as is typically 60 DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN

TRAINING Batch and minibatch Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models. 61

DIFFERENCE BETWEEN PURE OPTIMIZATION AND OPTIMIZATION IN TRAINING Batch and minibatch Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a

small batch size might require a small learning rate to maintain stability due to the high variance in the estimate of the gradient. The total runtime can be very high due to the need to make more steps, both because of the 62 OPTIMIZATION OUTLINE Difference between pure optimization and optimization in

training Challenges in optimization of neural networks Introduction to some optimization algorithms 63 ILL-CONDITIONING Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix H.

Ill-conditioning can manifest by causing SGD to get stuck in the sense that even very small steps increase the cost function. 64 ILL-CONDITIONING Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix H.

a second-order Taylor series expansion of the cost function predicts that a gradient descent step of will add to the cost 65 ILL-CONDITIONING

Ill-conditioning of the gradient becomes a problem when exceeds . To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm and the term. 66

ILL-CONDITIONING In many cases, the gradient norm does not shrink significantly throughout learning, but the more than order of magnitude. grows by The result is that learning becomes very slow despite the presence of a strong gradient because the learning rate must be shrunk to compensate for even stronger curvature.

67 ILL-CONDITIONING Gradient descent often does not arrive at a critical point of any kind Unlike the KKT condition in optimization 68 LOCAL MINIMA One of the most prominent features of a convex optimization

problem is that it can be reduced to the problem of finding a local minimum. When optimizing a convex function, we know that we have reached a good solution if we find a critical point of any kind. 69 LOCAL MINIMA Non-identifiability of neural networks causes a lot of local minima: For example, we could take a neural network and modify layer 1 by

swapping the incoming weight vector for unit i with the incoming weight vector for unit j, then doing the same for the outgoing weight vectors. If we have m layers with n units each, then there are ways of arranging the hidden units. This kind of non-identifiability is known as weight space symmetry. 70 LOCAL MINIMA Local minima can be problematic if they have high cost in comparison to the global minimum.

If local minima with high cost are common, this could pose a serious problem for gradient-based optimization algorithms. 71 LOCAL MINIMA The problem remains an active area of research, but experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value, and that it is not important to find a true global minimum rather than to find a

point in parameter space that has low but not minimal cost 72 CLIFFS AND EXPLODING GRADIENTS Neural networks with many layers often have extremely steep regions resembling cliffs. These result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off of the cliff structure altogether.

73 CLIFFS AND EXPLODING GRADIENTS 74 LONG-TERM DEPENDENCIES RNN as example Any eigenvalues that are not near an absolute value of 1

will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude. 75 LONG-TERM DEPENDENCIES 76 THEORETICAL LIMITS OF OPTIMIZATION Several theoretical results show that there are limits on the performance of any optimization algorithm we might design

for neural networks Theoretical analysis of whether an optimization algorithm can accomplish this goal is extremely difficult. 77 OPTIMIZATION OUTLINE Difference between pure optimization and optimization in training

Challenges in optimization of neural networks Introduction to some optimization algorithms 78 STOCHASTIC GRADIENT DESCENT Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular. It is possible to obtain an unbiased estimate of the gradient

by taking the average gradient on a minibatch of m examples drawn i.i.d. from the data generating distribution. 79 STOCHASTIC GRADIENT DESCENT 80 STOCHASTIC GRADIENT DESCENT Because the SGD gradient estimator introduces a source of

noise (the random sampling of m training examples) that does not vanish even when we arrive at a minimum, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at iteration k as 81 STOCHASTIC GRADIENT DESCENT A sufficient condition to guarantee convergence of SGD is that

82 STOCHASTIC GRADIENT DESCENT In practice, it is common to decay the learning rate linearly until iteration : with . After iteration , it is common to leave constant. 83 MOMENTUM

A concept comes from physics The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. 84 MOMENTUM The velocity v accumulates the gradient elements The larger is relative to , the more previous gradients affect

the current direction. 85 MOMENTUM We can see that a poorly conditioned quadratic objective looks like a long, narrow valley or canyon with steep sides. Momentum correctly traverses the canyon lengthwise, while gradient steps waste time moving back and forth across the narrow axis of the canyon. 86

MOMENTUM The step size is the largest when many successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient g, then it will accelerate in the direction of -g , until reaching a terminal velocity where the size of each step is It is thus helpful to think of the momentum hyperparameter in terms of . For example, = 0.9 corresponds to multiplying the maximum speed by 10 relative to the gradient descent algorithm.

87 MOMENTUM 88 NESTEROV MOMENTUM The update rules in this case are given by: 89

NESTEROV MOMENTUM The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. With Nesterov momentum the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum. 90

NESTEROV MOMENTUM 91 PARAMETERS INITIALIZATION STRATEGY Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations. 92

PARAMETERS INITIALIZATION STRATEGY Some heuristics are available for choosing the initial scale of the weights. One heuristic is to initialize the weights of a fully connected layer with m inputs and n outputs by sampling each weight from Glorot and Bengio (2010) suggest using the normalized initialization: 93

ADAPTIVE LEARNING RATE Neural network researchers have long realized that the learning rate was reliably one of the hyperparameters that is the most difficult to set because it has a significant impact on model performance. Recently, a number of incremental (or mini-batch-based) methods have been introduced that adapt the learning rates of model parameters. 94

ADAGRAD The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values 95 ADAGRAD 96

ADAGRAD The AdaGrad algorithm enjoys some desirable theoretical properties. However, empirically it has been found thatfor training deep neural network modelsthe accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate. 97 RMSPROP The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to

perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average. RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl. 98 RMSPROP

99 RMSPROP 100 RMSPROP Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks. It is currently one of the go-to optimization methods being employed

routinely by deep learning practitioners. 101 ADAM The name Adam derives from the phrase adaptive moments. First, in Adam, momentum is incorporated directly as an estimate of the first order moment (with exponential weighting) of the gradient. Second, Adam includes bias corrections to the estimates of both

the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origion 102 ADAM 103 WHICH ALGORITHM TO CHOOSE Currently, the most popular optimization algorithms actively in use

include SGD, SGD with momentum, RMSProp, RMSProp with momentum and Adam. Unfortunately, there is currently no consensus on this point. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide range of learning tasks. While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed 104 fairly robustly, no single best algorithm has emerged.

THANKS 105