Transfer Learning
Delasa Aghamirzaie, Abraham Lama Salomon
Deep Learning for Perception, 9/15/2015

Outline
Convolutional Neural Networks: AlexNet

AlexNet maps an input image to class labels (e.g. "lion"). Krizhevsky, Sutskever, Hinton, NIPS 2012. (slide credit: Jason Yosinski)

Learned filters visualized layer by layer (Zeiler et al., arXiv 2013 / ECCV 2014): layer 1 filters are Gabor filters and color blobs; layer 2, layer 5, and the last layer become progressively more class-specific. Gabor filter: a linear filter used for edge detection, with orientation representations similar to those in the human visual system. (slide credit: Jason Yosinski)

Features transition from general to specific as the layer number increases (Nguyen et al., arXiv 2014). (slide credit: Jason Yosinski)

Transfer Learning Overview

Experimental setup:
- ImageNet (Deng et al., 2009) is split into dataset A (500 classes) and dataset B (500 classes).
- Base networks baseA and baseB are trained on A images / A labels and B images / B labels respectively, using the Caffe framework (Jia et al.).
- Transfer network AnB: the first n layers are copied from baseA and frozen; the remaining layers are trained on B images / B labels with back-propagation.
- AnB+: as AnB, but the transferred layers are also fine-tuned by back-propagation.
- Selffer network BnB: a control in which the first n layers are copied from baseB itself and frozen, and the remaining layers are retrained on B.
- Hypothesis: if the transferred features are specific to task A, performance drops; otherwise performance should be the same as baseB. Compare transfer AnB to baseB.

Performance drops due to:
- fragile co-adaptation (neighboring layers that were trained together are split apart)
- representation specificity

Transfer + fine-tuning improves generalization. (slide credit: Jason Yosinski) A code sketch of the frozen vs. fine-tuned settings follows.
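To make the AnB / AnB+ distinction concrete, here is a minimal sketch written in PyTorch rather than the Caffe setup used in the experiments; the cut point n, the 500-class task B, and the optimizer settings are illustrative, and a pretrained torchvision AlexNet stands in for baseA.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Stand-in for baseA: a network already trained on dataset A.
baseA = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

def make_transfer_net(n_transfer, num_classes_B, fine_tune):
    """Build an AnB / AnB+ style network: the first n_transfer modules of the
    convolutional stack are copied from baseA; everything above them is freshly
    initialized and trained on task B. fine_tune=False -> AnB, True -> AnB+."""
    net = models.alexnet(weights=None, num_classes=num_classes_B)  # random init
    for i in range(n_transfer):  # n_transfer counts modules (conv/ReLU/pool) in .features
        net.features[i].load_state_dict(baseA.features[i].state_dict())
        if not fine_tune:        # AnB: freeze the transferred layers
            for p in net.features[i].parameters():
                p.requires_grad = False
    return net

AnB      = make_transfer_net(n_transfer=6, num_classes_B=500, fine_tune=False)
AnB_plus = make_transfer_net(n_transfer=6, num_classes_B=500, fine_tune=True)

# Train on dataset B: only parameters with requires_grad=True receive updates.
optimizer = torch.optim.SGD((p for p in AnB.parameters() if p.requires_grad),
                            lr=1e-3, momentum=0.9)
```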

ImageNet has many related categories, so the 1000 classes can be split in two ways:
- Random split (similar A/B): categories such as gecko, garbage truck, fire truck, radiator, binoculars, baseball, lion, panther, gorilla, toucan, bookshop, and rabbit are assigned to A or B at random.
- Man-made vs. natural split (dissimilar A/B): dataset A gets the man-made categories (garbage truck, fire truck, radiator, baseball, binoculars, bookshop) and dataset B the natural ones (gecko, gorilla, toucan, rabbit, panther, lion).
(slide credit: Jason Yosinski)

Results: transfer performance for similar A/B splits, dissimilar A/B splits, and random untrained filters (Jarrett et al., 2009). (slide credit: Jason Yosinski)

Conclusions:
- The general-to-specific transition can be measured layer by layer.
- Transferability is governed by lost co-adaptations, representation specificity, and the difference between the base and target datasets.
- Fine-tuning helps, even on a large target dataset.

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell
(Yangqing Jia is the author of Caffe and its precursor DeCAF.)

Problem: performance with conventional visual representations had reached a plateau. Solution: discover effective representations that capture the salient semantics of a given task. Can deep architectures do this?

Why deep models?
- Deep architectures should be able to capture the salient aspects of a given domain [Krizhevsky NIPS 2012][Singh ECCV 2012].
- They perform better than traditional hand-engineered representations [Le CVPR 2011].
- They had been applied to large-scale visual recognition tasks.
- However: with limited training data, fully supervised deep architectures generally overfit, and many visual recognition challenges have tasks with few training examples.

Approach: train a deep convolutional model in a fully supervised setting using Krizhevsky's method and the ImageNet database [Krizhevsky NIPS 2012].

Then extract features from different layers of the network and evaluate their efficacy on generic vision tasks (a feature-extraction sketch follows).
Questions:
- Do features extracted from the CNN generalize to other datasets?
- How does performance vary with network depth?
- How does performance vary with network architecture?
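A minimal sketch of the feature-extraction step, assuming a pretrained torchvision AlexNet stands in for the original Krizhevsky/DeCAF Caffe model; the preprocessing values and layer indexing follow standard torchvision conventions, not the original implementation.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained AlexNet as a stand-in for the Krizhevsky-style network used by DeCAF.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_fc_feature(image_path, layer_index=4):
    """Return activations from one of the fully-connected layers
    (roughly analogous to a DeCAF6-style feature taken before the final classifier)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        x = net.features(x)                  # convolutional layers
        x = torch.flatten(net.avgpool(x), 1)
        x = net.classifier[:layer_index](x)  # stop partway through the FC stack
    return x.squeeze(0)                      # 4096-dim feature vector

# These features are then fed to simple classifiers (SVM, logistic regression) on the target task.
```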

Adopted network: the deep CNN architecture proposed by Krizhevsky [Krizhevsky NIPS 2012]:
- 5 convolutional layers (with pooling and ReLU) followed by 3 fully-connected layers
- won the ImageNet Large Scale Visual Recognition Challenge 2012 with a top-1 validation error rate of 40.7%
- the architecture and training protocol are followed with two differences: input images are 256 x 256 rather than 224 x 224, and no data-augmentation trick is used.

Qualitative and quantitative evaluation:
- Comparison with GIST features [Oliva & Torralba, 2001] and LLC features [Wang et al., 2010]

- Use of the t-SNE algorithm [van der Maaten & Hinton, 2008]
- Use of the ILSVRC-2012 validation set to avoid overfitting (150,000 photographs)
- Use of the SUN-397 dataset to evaluate how dataset bias affects results

Feature Generalization and Visualization: the t-SNE Algorithm

t-SNE feature visualizations on the ILSVRC-2012 validation set, comparing LLC features, GIST features, DeCAF1 features, and DeCAF6 features (a sketch of how such a plot can be produced follows).
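A minimal sketch of producing such a visualization, assuming the feature vectors have already been extracted (e.g. with a function like the one above) and saved; the file names are hypothetical, and scikit-learn's t-SNE stands in for the original implementation.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# features: (n_images, 4096) array of extracted DeCAF-style features
# labels:   (n_images,) integer class labels, used only to color the points
features = np.load("decaf6_features.npy")   # hypothetical file from the extraction step
labels = np.load("labels.npy")

# Project the high-dimensional features to 2-D for visualization.
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=3, cmap="tab20")
plt.title("t-SNE of DeCAF6 features (illustrative)")
plt.show()
```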

DeCAF6 features trained on ILSVRC-2012 generalize to SUN-397 when considering semantic groupings of labels. SUN-397: large-scale scene recognition from abbey to zoo (899 categories and 130,519 images).

Computational time: breakdown of the computation time, analyzed using the DeCAF framework. The convolutional and fully-connected layers take most of the time to run, which is understandable as they involve large matrix-matrix multiplications (a per-layer timing sketch follows).
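A minimal sketch of how such a per-layer timing breakdown could be reproduced; this is not the DeCAF profiling code, just PyTorch forward hooks accumulating wall-clock time per module type for a dummy batch.

```python
import time
from collections import defaultdict
import torch
import torchvision.models as models

net = models.alexnet(weights=None).eval()
times, starts = defaultdict(float), {}

def pre_hook(module, inputs):
    starts[module] = time.perf_counter()

def post_hook(module, inputs, output):
    times[type(module).__name__] += time.perf_counter() - starts[module]

# Attach hooks to every leaf module (Conv2d, Linear, ReLU, pooling, ...).
for m in net.modules():
    if len(list(m.children())) == 0:
        m.register_forward_pre_hook(pre_hook)
        m.register_forward_hook(post_hook)

with torch.no_grad():
    net(torch.randn(16, 3, 224, 224))   # one forward pass on a dummy batch

for name, t in sorted(times.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {t*1000:8.1f} ms")
```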

Experimental comparison:
- Features from earlier layers in the CNN are not evaluated, as they do not contain a rich semantic representation.
- Results on multiple datasets evaluate the strength of DeCAF for basic object recognition (Caltech-101), domain adaptation (Office), fine-grained recognition (Caltech-UCSD), and scene recognition (SUN-397).

Experiments: Object Recognition (Caltech-101)

Also compared with the two-layer convolutional network of Jarrett et al. (2009).

Experiments: Domain Adaptation
Office dataset (Saenko et al., 2010), which has 3 domains:
- Amazon: images taken from amazon.com
- Webcam and DSLR: images taken in an office environment using a webcam or an SLR camera

Findings:
- DeCAF is robust to resolution changes.
- DeCAF provides better category clustering than SURF.
- DeCAF clusters instances of the same category across domains.
(Feature visualizations: DeCAF6 features vs. GIST features.)

Experiments: Subcategory Recognition
Fine-grained recognition involves recognizing subclasses of the same object class, such as different bird species, dog breeds, or flower types. Caltech-UCSD birds dataset.

Two approaches:
- First, adopt an ImageNet-like pipeline: DeCAF6 features and a multi-class logistic regression.
- Second, adopt the deformable part descriptors (DPD) method [Zhang et al., 2013].

Experiments: Scene Recognition
Goal: classify the scene of the entire image. SUN-397 large-scale scene recognition database.

Outperforms Xiao et al. (2010), the then state-of-the-art method. DeCAF demonstrates:
- the ability to generalize to other tasks
- representational power compared to traditional hand-engineered features

Can the features extracted by a deep network be exploited for a wide variety of vision tasks? A CNN representation can replace the pipelines of state-of-the-art (s.o.a.) methods and achieve better results.

OverFeat: a publicly available trained CNN whose structure follows Krizhevsky et al., trained for image classification on ImageNet ILSVRC 2013 (1.2 million images, 1000 categories). The features extracted from the OverFeat network are used as a generic image representation: the CNN features are trained only on ImageNet data, while the simple classifiers are trained on images specific to each task's dataset.

Experimental comparison: results on multiple different recognition tasks:
- visual classification (Pascal VOC 2007, MIT-67)

- fine-grained recognition (Caltech-UCSD, Oxford 102)
- attribute detection (UIUC 64, H3D)
- visual image retrieval (Oxford5k, Paris6k, Sculptures6k, Holidays, and UKBench)

Visual Classification
The feature vector is L2-normalized to unit length for all experiments. The 4096-dimensional feature vector is used in combination with a Support Vector Machine (SVM) to solve the different classification tasks (CNN-SVM). The training set is also augmented by adding cropped and rotated samples (CNNaug+SVM). A sketch of this pipeline follows.
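A minimal sketch of the CNN-SVM pipeline, assuming features for the training and test images have already been extracted into arrays (the file names are hypothetical); scikit-learn's LinearSVC stands in for whatever SVM implementation the authors used.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import normalize

# X_*: (n, 4096) arrays of CNN features; y_*: class labels.
# Hypothetical files produced by a separate feature-extraction step.
X_train, y_train = np.load("feat_train.npy"), np.load("y_train.npy")
X_test, y_test = np.load("feat_test.npy"), np.load("y_test.npy")

# L2-normalize each feature vector to unit length.
X_train = normalize(X_train, norm="l2")
X_test = normalize(X_test, norm="l2")

# Linear SVM on top of the fixed CNN representation (CNN-SVM).
clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```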

Databases:
- Pascal VOC 2007 for object image classification: about 10,000 images of 20 classes, including animals and man-made and natural objects.
- MIT-67 indoor scenes for scene recognition: 15,620 images of 67 indoor scene classes.
In contrast to object detection, object image classification requires no localization of the objects.

Pascal VOC 2007 image classification results, compared to other methods that also use training data outside VOC; note that the CNN representation is not tuned for the Pascal VOC dataset.

Evolution of the mean image classification AP (average precision) over the PASCAL VOC 2007 classes as a deeper representation from the OverFeat CNN, trained on the ILSVRC dataset, is used (a layer-sweep sketch follows).
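A minimal sketch of such a depth sweep, assuming a torchvision AlexNet as a stand-in for OverFeat and hypothetical pre-loaded image/label arrays; features are taken after successive blocks of the convolutional stack and each depth is scored with a linear SVM. Real VOC evaluation is multi-label mean AP; single-label accuracy is used here only to illustrate the sweep.

```python
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.svm import LinearSVC
from sklearn.preprocessing import normalize

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

def features_up_to(x, n_feature_modules):
    """Run the input through the first n modules of the convolutional stack
    and flatten the result, giving a 'depth-n' representation."""
    with torch.no_grad():
        h = nn.Sequential(*list(net.features)[:n_feature_modules])(x)
    return torch.flatten(h, 1).numpy()

# Hypothetical data prepared elsewhere: images (N, 3, 224, 224), y integer labels.
images, y = torch.load("voc_images.pt"), np.load("voc_labels.npy")

for depth in (3, 6, 8, 13):          # increasingly deep cut points in net.features
    X = normalize(features_up_to(images, depth), norm="l2")
    # Illustrative fixed train/test split.
    score = LinearSVC().fit(X[:800], y[:800]).score(X[800:], y[800:])
    print(f"depth {depth:2d}: accuracy {score:.3f}")
```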

Confusion matrix for the MIT-67 indoor dataset. Some of the off-diagonal confused classes have been annotated; these particular cases could be hard even for a human to distinguish.

Results of MIT-67 scene classification: using a CNN off-the-shelf representation with linear SVM training significantly outperforms the majority of the baselines. Performance is measured by the average classification accuracy over classes (the mean of the confusion matrix diagonal).

Fine-Grained Recognition
Results on the CUB 200-2011 Birds dataset and on the Oxford 102 Flowers dataset.

Attribute Detection
An attribute is a semantic or abstract quality that different instances/categories share. Databases:
- UIUC 64 object attributes dataset, with 3 categories of attributes: shape (e.g. "is 2D boxy"), part (e.g. "has head"), and material (e.g. "is furry").
- H3D dataset, which defines 9 attributes for a subset of the person images from Pascal VOC 2007, ranging from "has glasses" to "is male".

Results on the UIUC 64 object attributes dataset and the H3D human attributes dataset.

Visual Image Retrieval
The result of object retrieval on the 5 datasets (a retrieval sketch follows).
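A minimal sketch of retrieval with off-the-shelf CNN features, assuming pre-extracted feature arrays (the file names are hypothetical); cosine similarity, which is a dot product on L2-normalized vectors, ranks the database images for each query.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical pre-extracted CNN features for the query and database images.
queries = normalize(np.load("query_features.npy"), norm="l2")     # (n_queries, 4096)
database = normalize(np.load("database_features.npy"), norm="l2") # (n_db, 4096)

# On unit-length vectors, the dot product equals cosine similarity.
similarity = queries @ database.T            # (n_queries, n_db)
ranking = np.argsort(-similarity, axis=1)    # database indices, best match first

print("top-5 matches for query 0:", ranking[0, :5])
```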

Image Representation:
- Shallow features: handcrafted classical representations, e.g. the Improved Fisher Vector (IFV).
- Deep features: CNN-based representations.

Comparison:

ConvNet-based feature representations with different pre-trained network architectures and different learning heuristics.

CNN-F Network (Fast Architecture): similar to Krizhevsky et al. (ILSVRC-2012 winner); fast processing is ensured by the 4-pixel stride in the first convolutional layer.

CNN-M Network (Medium Architecture): similar to Zeiler & Fergus (ILSVRC-2013 winner); smaller receptive window size and stride in conv1.

CNN-S Network (Slow Architecture): similar to the OverFeat accurate network (ICLR 2014); smaller stride in conv2.

VGG Very Deep Network: Simonyan & Zisserman (ICLR 2015); smaller receptive window size and stride, and a deeper network. A sketch of how the conv1 settings affect compute appears below.

Data Augmentation: given a pre-trained ConvNet, augmentation is applied at test time as well (a sketch follows).
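A minimal sketch of test-time augmentation, assuming a pretrained torchvision classifier; crops and horizontal flips are used as illustrative transforms (the paper's exact augmentations may differ), and the resulting predictions are averaged. Input normalization is omitted for brevity.

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

def predict_with_tta(image):
    """image: (3, 256, 256) tensor. Average predictions over several crops and their flips."""
    views = []
    for top, left in [(0, 0), (0, 32), (32, 0), (32, 32), (16, 16)]:  # corner + center crops
        crop = TF.crop(image, top, left, 224, 224)
        views += [crop, TF.hflip(crop)]
    batch = torch.stack(views)
    with torch.no_grad():
        probs = torch.softmax(net(batch), dim=1)
    return probs.mean(dim=0)            # averaged class probabilities

scores = predict_with_tta(torch.rand(3, 256, 256))
print("predicted class:", int(scores.argmax()))
```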

Fine-tuning: TN-CLS fine-tunes the target network with a classification loss; TN-RNK fine-tunes it with a ranking loss (a sketch of a ranking loss follows).
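A minimal sketch of a pairwise ranking loss of the kind used in TN-RNK-style fine-tuning, using PyTorch's MarginRankingLoss; the toy embedding network, margin, and pairing scheme are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Toy embedding network; in practice this is the pre-trained ConvNet being fine-tuned.
embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
rank_loss = nn.MarginRankingLoss(margin=1.0)

def ranking_step(anchor, positive, negative):
    """Push the anchor closer to a relevant (positive) example than to an irrelevant (negative) one."""
    a, p, n = embed(anchor), embed(positive), embed(negative)
    sim_pos = nn.functional.cosine_similarity(a, p)
    sim_neg = nn.functional.cosine_similarity(a, n)
    # target=+1 means sim_pos should exceed sim_neg by at least the margin.
    return rank_loss(sim_pos, sim_neg, torch.ones_like(sim_pos))

loss = ranking_step(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
loss.backward()   # gradients flow into the embedding network, as in fine-tuning
print(float(loss))
```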

Evolution of performance on PASCAL VOC-2007 over the recent years.

Key points:
- Features learned by a CNN can perform semantic visual discrimination tasks using simple linear classifiers.
- CNN features tend to cluster images into interesting semantic categories on which the network was never explicitly trained.
- Performance improves across a spectrum of visual recognition tasks.
- Data augmentation helps a lot, both for deep and shallow features.
- Fine-tuning makes a difference, and should use a ranking loss where appropriate.

CloudCV DeCAF Server: http://www.cloudcv.org/decaf-server/

Questions
