
Transfer learning

Making your ConvNets great since 2014

Convolutional neural networks (CNN): the context

Many deep neural networks trained on natural images exhibit a curious phenomenon in common: in their first layer they learn features similar to Gabor filters and color blobs. The appearance of these filters is so common that it led researchers to understand that such first layers are not specific to a particular task or dataset, but can truly serve as the foundation of a general learning system. This phenomenon occurs not only across different datasets, but even with very different training objectives, including supervised image classification (Krizhevsky et al., 2012), unsupervised density learning (Lee et al., 2009), and unsupervised learning of sparse representations (Le et al., 2011). On the other hand, this generality is progressively lost in the final layers of the network, as neurons become more and more specialized and tied to the specific classification problem they were trained on.


Fig.1: The first convolutional layers in convnets tend to learn features similar to Gabor filters and color blobs
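To give a concrete sense of what a Gabor filter looks like, here is a minimal NumPy sketch: a sinusoid windowed by a Gaussian envelope. The parameter values are arbitrary, chosen only for illustration; real first-layer filters are of course learned, not constructed.

```python
import numpy as np

def gabor_kernel(size=11, wavelength=4.0, theta=0.0, sigma=2.5):
    """Return a size x size Gabor kernel: a sinusoid windowed by a Gaussian."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates by theta (the filter's orientation)
    x_t = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))  # Gaussian window
    carrier = np.cos(2 * np.pi * x_t / wavelength)      # oriented sinusoid
    return envelope * carrier

kernel = gabor_kernel()
print(kernel.shape)  # (11, 11)
```

Varying `theta` rotates the stripes, which is exactly the kind of oriented edge detector the first convolutional layers tend to converge to.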

What is ‘transfer learning’?

It is often difficult (and computationally very expensive) to collect the amount of data needed to train a CNN from scratch (on the order of tens of thousands of images). Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images in 1000 categories) and then use the ConvNet as a fixed feature extractor for the task of interest (Razavian et al., 2014).

This means: take a ConvNet that has been pre-trained on ImageNet (a huge repository of images collected for this purpose), remove the last fully-connected layer (which, as we have seen, is specific to the task the network was trained for), then treat the rest of the ConvNet as a feature extractor for the new dataset. Once you have extracted the features for all images, train a classifier on the new dataset.
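The idea can be sketched in plain NumPy: a frozen "feature extractor" (here a fixed random projection followed by a ReLU, standing in for the pre-trained convolutional stack) produces features, and only a small logistic classifier is trained on top of them. This is a toy illustration of the workflow, not real image data.

```python
import numpy as np

rng = np.random.default_rng(0)

# A frozen 'feature extractor' standing in for the pre-trained conv layers:
# a fixed random projection followed by a ReLU (these weights are never updated)
W_frozen = rng.normal(size=(64, 16)) / np.sqrt(64)
extract = lambda X: np.maximum(X @ W_frozen, 0.0)

# Toy two-class problem: 200 'images' of 64 raw pixels each
X = rng.normal(size=(200, 64))
y = (X[:, 0] > 0).astype(float)

# Train only the new classifier (logistic regression) on the extracted features
F = extract(X)
w, b = np.zeros(16), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # sigmoid output
    w -= 0.1 * F.T @ (p - y) / len(y)        # gradient steps on the
    b -= 0.1 * (p - y).mean()                # cross-entropy loss

final_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

The frozen weights `W_frozen` are never touched by training; only `w` and `b` (the new classifier) are fitted, which is exactly the division of labor in the Keras code below.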

The effectiveness of transfer learning is supported by a vast amount of evidence, and some recent (empirical) findings have spurred much interest in the methodology.


Fig.2: Transfer learning: the green layers are borrowed from the first network and supplied as feature extractors to the second network (with frozen weights)

Why it is useful

Transfer learning approaches

Transfer learning takes two slightly different approaches (http://cs231n.github.io/transfer-learning/): using the ConvNet as a fixed feature extractor (freeze the pre-trained convolutional layers and train only a new classifier on top of them), and fine-tuning (additionally unfreeze some of the later pre-trained layers and continue backpropagation through them with a small learning rate). Both are used in the tutorial below.

Transfer learning using inception-v3

The following code builds a model that classifies pictures of Fender Stratocaster, Gibson Les Paul and Gibson SG guitars (in this regard, the Flickr APIs, Instagram or Google Image results are invaluable resources for collecting training sets for ConvNets).

Model building is performed through two subsequent operations: first, the pre-trained convolutional base is frozen and a new classifier is trained on top of it; then, some of the top convolutional layers are unfrozen and fine-tuned together with the classifier.

The pre-trained ConvNet we will use for this tutorial is Inception-v3, a ConvNet developed by Google (its default input size is 299 x 299 with three channels).
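Besides the 299 x 299 size, Inception-v3 expects pixel values scaled to the [-1, 1] range (this is what `keras.applications.inception_v3.preprocess_input` does). A NumPy equivalent, for illustration:

```python
import numpy as np

def preprocess(images):
    """Scale uint8 RGB pixels in [0, 255] to the [-1, 1] range Inception-v3 expects."""
    return images.astype('float32') / 127.5 - 1.0

batch = np.zeros((1, 299, 299, 3), dtype='uint8')  # a dummy all-black image
x = preprocess(batch)
print(x.min(), x.max())  # -1.0 -1.0
```

Forgetting this scaling is a common source of silently poor accuracy when feeding your own images to a pre-trained model.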

(The following code has been written using the Keras 2.0.6 library with TensorFlow backend)

Let’s first import the necessary libraries and instantiate the Inception-v3 model supported by Keras:

from keras.applications.inception_v3 import InceptionV3
from keras.preprocessing import image
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D
from keras import backend as K

# Instantiate the Inception-v3 pre-trained model
base_model = InceptionV3(weights='imagenet', include_top=False)

include_top is set to False in order to exclude the top classification layers (including the final softmax layer with its 1000 ImageNet output classes). We are basically 'chopping off' the final layers to replace them with a classifier of our choice, to be fitted on a new training dataset.

We now stack our classifier on top of the 'chopped' Inception-v3 model. x = GlobalAveragePooling2D()(x) converts the convolutional output to the correct shape for the dense layer to handle: it 'bridges' from the convolutional Inception-v3 layers to the Dense layers that we use to classify our guitars.
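Concretely, GlobalAveragePooling2D simply averages each feature map over its spatial dimensions, turning a (height, width, channels) tensor into a flat (channels,) vector per image. A NumPy equivalent (the 8 x 8 x 2048 shape matches the output of Inception-v3's last convolutional block for a 299 x 299 input):

```python
import numpy as np

# A fake convolutional output: batch of 2 feature maps, 8x8 spatial, 2048 channels
features = np.random.rand(2, 8, 8, 2048)

# GlobalAveragePooling2D is just a mean over the two spatial axes
pooled = features.mean(axis=(1, 2))

print(pooled.shape)  # (2, 2048)
```

Compared to Flatten, this pooling discards spatial layout but drastically reduces the number of inputs to the Dense layer (2048 instead of 8 * 8 * 2048), which keeps the new classifier small.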

x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer
x = Dense(1024, activation='relu')(x)
# and a softmax layer with 3 classes (Stratocaster vs Les Paul vs SG)
predictions = Dense(3, activation='softmax')(x)
# let's create the Keras model to be trained
model = Model(inputs=base_model.input, outputs=predictions)

All the convolutional Inception-v3 layers are frozen, so that training updates only the new external layers:

for layer in base_model.layers: 
    layer.trainable = False
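Numerically, "freezing" just means a layer's weights are excluded from the optimizer's update step, so backpropagation leaves them untouched. A toy NumPy illustration of the effect (not Keras internals):

```python
import numpy as np

# Two 'layers' as plain weight matrices
frozen_w = np.ones((4, 4))      # pretend this is a pre-trained, frozen layer
trainable_w = np.ones((4, 4))   # the new classifier layer

# Pretend gradients arriving from backprop
grad_trainable = np.full((4, 4), 0.1)

lr = 0.01
# The optimizer only steps the trainable parameters;
# frozen_w is skipped entirely, as layer.trainable = False removes it
# from the list of weights the optimizer updates
trainable_w -= lr * grad_trainable
```

After the step, `trainable_w` has moved while `frozen_w` is bit-for-bit identical to its pre-trained values.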

Now our model is ready to be trained on the ‘guitar’ dataset:

# compile the model
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# train the model on the new data for a few epochs
model.fit_generator(...)

Now freeze the first ~150 layers (a hyperparameter to be tuned) and 'unfreeze' the rest of the layers to perform fine-tuning:

for layer in model.layers[:150]:
    layer.trainable = False

for layer in model.layers[150:]:
    layer.trainable = True

We train again, this time updating the Dense layers on top together with the unfrozen Inception layers. When fine-tuning, it is important to use a learning rate lower than the one used when training from scratch (here lr=0.0001); otherwise the optimization could destabilize and the loss diverge.
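For reference, SGD with momentum maintains a velocity that accumulates past gradients: v = momentum * v - lr * grad, then w = w + v. A NumPy sketch on a toy quadratic loss (values are illustrative only):

```python
import numpy as np

lr, momentum = 0.0001, 0.9
w = np.array([1.0, -2.0])   # toy 'weights'
v = np.zeros_like(w)        # velocity, initially zero

def grad(w):
    # gradient of the toy loss 0.5 * ||w||^2
    return w

for _ in range(3):
    v = momentum * v - lr * grad(w)  # velocity accumulates past gradients
    w = w + v                        # parameter step
```

With such a small learning rate each step barely nudges the weights, which is precisely what you want when fine-tuning: large steps would quickly destroy the pre-trained features.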

# We re-train the model using SGD
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9),
              loss='categorical_crossentropy')
        
# we train our model again (fine-tuning)
# alongside the top Dense layers 
model.fit_generator(...) 

We now have a deep network trained on a new domain starting from the standard Inception-v3 network, although (of course) there are many hyperparameters that need to be tuned in order to achieve a good level of accuracy.

References (cited papers are not reported)