#
C2 Deep Learning
Deep learning is a class of machine learning algorithms that uses multiple stacked layers of processing units to learn high-level representations from unstructured data.
#
Data for DL
There are mainly 2 types of data:
- Structured - Tabular data as input, arranged into columns of features that describe each observation.
- Unstructured
  - Data that is not naturally arranged into columns of features, such as images, audio, and text.
  - Individual pixels/characters/etc. are uninformative. The granularity of the data combined with the high degree of spatial dependence destroys the concept of the pixel or character as an informative feature in its own right.
  - A deep learning model, on the other hand, can learn how to build high-level informative features by itself, directly from the unstructured data, whereas traditional models would fail to do so.
#
Deep NN
#
NN
- A neural network consists of a series of stacked layers. Each layer contains units that are connected to the previous layer's units through a set of weights. The most common type is the dense/fully connected layer, in which every unit is connected to every unit in the previous layer.
- A NN in which all adjacent layers are fully connected is called a multilayer perceptron (MLP).
- The input is transformed by each layer in turn (forward pass through the network), until it reaches the output layer. Specifically, each unit applies a transformation to a weighted sum of its inputs and passes the output through to the subsequent layer. The final output layer is the culmination of this process.
- Training the NN is the process of finding the set of weights for each layer that gives the most accurate predictions.
- The error in the prediction is propagated backward through the network, adjusting each set of weights a small amount in the direction that improves the prediction most significantly. This process is called backpropagation.
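The forward pass described above can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: the tiny network shape, the weights, and the helper name `dense_forward` are all hypothetical values chosen so the arithmetic is easy to follow.

```python
def relu(z):
    return max(0.0, z)

def dense_forward(inputs, weights, biases, activation):
    """One dense layer: each unit applies `activation` to a weighted sum of its inputs."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

# Hypothetical tiny network: 3 inputs -> 2 hidden units -> 1 output unit
x = [1.0, 2.0, 3.0]
hidden_w = [[0.5, -1.0, 0.5], [1.0, 1.0, -1.0]]  # one row of weights per hidden unit
hidden_b = [0.0, 0.5]
output_w = [[2.0, -1.0]]
output_b = [0.1]

# The input is transformed by each layer in turn
hidden = dense_forward(x, hidden_w, hidden_b, relu)
output = dense_forward(hidden, output_w, output_b, lambda z: z)  # linear output unit
print(hidden, output)
```

Training would then compare `output` with the ground truth and use backpropagation to nudge `hidden_w`, `hidden_b`, `output_w` and `output_b`.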
#
Learning features
- The critical property of a NN is learning features from data.
- Units in each subsequent layer are able to represent increasingly sophisticated aspects of the original input, by combining lower-level features from the previous layer.
import numpy as np
import tensorflow as tf
keras = tf.keras
from keras import datasets, utils, layers, models, optimizers
import matplotlib.pyplot as plt
# Training an MLP
# We scale the pixel values to lie between 0 and 1, as NNs work best when the absolute value of each input is less than 1.
# We also one-hot encode the labels, because the network outputs a probability for each class
(x_train, y_train), (x_test, y_test) = datasets.cifar10.load_data()
NUM_CLASSES = 10  # separate name, since CLASSES is later used for the array of label names
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
y_train = utils.to_categorical(y_train, NUM_CLASSES) # (50000, 10)
y_test = utils.to_categorical(y_test, NUM_CLASSES)
# Sequential API
model = models.Sequential(
    [
        layers.Flatten(input_shape=(32, 32, 3)),
        layers.Dense(200, activation="relu"),
        layers.Dense(150, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ]
)
# Functional API
# Many models require that the output from a layer is passed to multiple subsequent layers
# or conversely, that a layer receives input from multiple preceding layers.
# For these models, the Sequential class is not suitable and we use the functional API instead.
input_layer = layers.Input(shape=(32, 32, 3))
x = layers.Flatten()(input_layer)
# Can also define activation per layer
# x = layers.Dense(units=200)(x)
# x = layers.Activation('relu')(x)
x = layers.Dense(units=200, activation="relu")(x)
x = layers.Dense(units=150, activation="relu")(x)
output_layer = layers.Dense(units=10, activation="softmax")(x)
model = models.Model(input_layer, output_layer)
model.summary()
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) [(None, 32, 32, 3)] 0
flatten_5 (Flatten) (None, 3072) 0
dense_15 (Dense) (None, 200) 614600
dense_16 (Dense) (None, 150) 30150
dense_17 (Dense) (None, 10) 1510
=================================================================
Total params: 646260 (2.47 MB)
Trainable params: 646260 (2.47 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
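The parameter counts in the summary can be checked by hand: a dense layer has one weight per (input, unit) pair plus one bias per unit. The helper name `dense_params` below is just an illustrative choice.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

flattened = 32 * 32 * 3  # size of the vector produced by Flatten
counts = [
    dense_params(flattened, 200),
    dense_params(200, 150),
    dense_params(150, 10),
]
total = sum(counts)
print(flattened, counts, total)
```

The three counts and their total match the `Param #` column and the `Total params` line of the summary.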
#
Layers
We use 3 types of layers
- Input - Entry point to network which specifies input shape
- Flatten - Flatten input into a vector as Dense requires a flat vector
- Dense - A fully connected NN layer. The weighted sum of inputs is passed through an activation function to get output
#
Activation functions
We use the following AFs
- ReLU (Rectified Linear Unit)
  - f(x) = max(0, x)
  - ReLU units can sometimes die: if a large bias pushes the pre-activation values negative, the unit always outputs 0.
  - The gradient is then 0, so no error is back-propagated through the unit.
- LeakyReLU
  - f(x) = \left\{\begin{array}{ll}x & x \geq 0 \\ ax & x < 0\end{array}\right.
  - Returns a small negative value proportional to the input when x < 0, so the gradient is never 0. This fixes the dying ReLU problem.
- Sigmoid
  - f(x) = \frac 1 {1 + e^{-x}}
  - Scales the output between 0 and 1
  - Used in binary/multilabel classification
- Softmax
  - f(x)_i = \frac {e^{x_i}} {\sum_{j=1}^J e^{x_j}}
  - The output probabilities sum to 1
  - Used for multiclass classification
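As a rough sketch, the four activation functions can be written in a few lines of plain Python. The slope `a=0.1` for LeakyReLU and the example `logits` are arbitrary illustrative values, not settings taken from these notes.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, a=0.1):
    # `a` is the small slope applied to negative inputs (assumed value)
    return x if x >= 0 else a * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(relu(-2.0), leaky_relu(-2.0), sigmoid(0.0))
print(probs, sum(probs))
```

Note that sigmoid treats each output independently, while softmax normalizes across all outputs, which is why it suits multiclass problems.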
opt = optimizers.Adam(learning_rate=0.0005)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
#
Loss function
Compares predicted to ground truth. We use the following (p_i = predicted value)
- Mean Squared Error
  - MSE = \frac 1n \sum_{i=1}^n {(y_i - p_i)}^2
  - Used in regression
- Categorical cross entropy
  - $ = - \sum_{i=1}^n y_i \log(p_i)$
  - Used in classification, each observation belongs to a class
- Binary cross entropy
  - $ = -\frac 1n \sum_{i=1}^n (y_i \log(p_i) + (1-y_i) \log(1-p_i))$
  - Used in binary classification with one output unit
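The three losses translate directly into plain Python. This is a minimal sketch that follows the formulas above term by term; the example targets and predictions are made up, and the predictions are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def mse(y, p):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def categorical_cross_entropy(y, p):
    # y is a one-hot vector, p the predicted class probabilities
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

def binary_cross_entropy(y, p):
    terms = [yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi, pi in zip(y, p)]
    return -sum(terms) / len(y)

print(mse([1.0, 2.0], [1.0, 4.0]))
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
```

With a one-hot target, categorical cross entropy reduces to the negative log of the probability assigned to the true class.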
#
Optimizers
The optimizer is the algorithm that will be used to update the weights in the neural network based on the gradient of the loss function.
We use
- Adam (Adaptive Moment Estimation)
- RMSProp (Root Mean Squared Propagation)
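To make the idea of a gradient-based weight update concrete, here is a sketch of a single Adam step for one scalar weight. The toy loss f(w) = (w - 3)^2, the function name `adam_step`, and the `beta1`, `beta2` and `eps` values are assumptions chosen for illustration (they are commonly used defaults); only the learning rate 0.0005 comes from the notes.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.0005, beta1=0.9, beta2=0.999, eps=1e-7):
    # Exponential moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction, which matters most in the first few steps (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem (assumed): minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```

Because the update divides by the root of the squared-gradient average, each step has a size of roughly `lr` regardless of how large the raw gradient is.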
model.fit(x_train, y_train, batch_size=32, epochs=10, shuffle=True)
Epoch 1/10
1563/1563 [==============================] - 12s 7ms/step - loss: 1.8447 - accuracy: 0.3356
Epoch 2/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.6592 - accuracy: 0.4070
Epoch 3/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.5806 - accuracy: 0.4379
Epoch 4/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.5312 - accuracy: 0.4563
Epoch 5/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.4902 - accuracy: 0.4686
Epoch 6/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.4584 - accuracy: 0.4808
Epoch 7/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.4317 - accuracy: 0.4903
Epoch 8/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.4098 - accuracy: 0.4964
Epoch 9/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.3873 - accuracy: 0.5052
Epoch 10/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.3689 - accuracy: 0.5133
<keras.src.callbacks.History at 0x7f97ed613510>
#
Training
- Weights are initialized randomly
- In each training step, one batch of images is passed and errors are backpropagated to update weights
- This continues until all the data has been passed through once, which completes 1 epoch
- The process repeats for the specified number of epochs
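The `1563/1563` in the training log above follows from the batch arithmetic: 50000 training images in batches of 32. A small sketch (the helper name `steps_per_epoch` is illustrative):

```python
import math

def steps_per_epoch(num_samples, batch_size):
    # The last batch may be smaller than batch_size, hence the ceiling
    return math.ceil(num_samples / batch_size)

train_steps = steps_per_epoch(50000, 32)
last_batch_size = 50000 - (train_steps - 1) * 32
print(train_steps, last_batch_size)
```

So each epoch consists of 1562 full batches and one final batch of 16 images.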
model.evaluate(x_test, y_test)
313/313 [==============================] - 1s 2ms/step - loss: 1.4679 - accuracy: 0.4732
[1.467934012413025, 0.4731999933719635]
CLASSES = np.array(
[
"airplane",
"automobile",
"bird",
"cat",
"deer",
"dog",
"frog",
"horse",
"ship",
"truck",
]
)
preds = model.predict(x_test)
preds_single = CLASSES[np.argmax(preds, axis=-1)]
actual_single = CLASSES[np.argmax(y_test, axis=-1)]
313/313 [==============================] - 1s 2ms/step
n_to_show = 10
indices = np.random.choice(range(len(x_test)), n_to_show)
plt.style.use('dark_background')
fig = plt.figure(figsize=(15, 3))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for i, idx in enumerate(indices):
    img = x_test[idx]
    ax = fig.add_subplot(1, n_to_show, i + 1)
    ax.axis("off")
    # Predicted label, drawn just below the image
    ax.text(
        0.5,
        -0.35,
        "pred = " + str(preds_single[idx]),
        fontsize=10,
        ha="center",
        transform=ax.transAxes,
    )
    # Actual label, drawn below the prediction
    ax.text(
        0.5,
        -0.7,
        "act = " + str(actual_single[idx]),
        fontsize=10,
        ha="center",
        transform=ax.transAxes,
    )
    ax.imshow(img)
#
CNN
One of the reasons our network isn’t yet performing as well as it might is because there isn’t anything in the network that takes into account the spatial structure of the input images.
#
Convolution Layers
- The convolution is performed by multiplying the filter pixelwise with the portion of the image, and summing the results.
- The output is more positive when the portion of the image closely matches the filter
- A convolutional layer is simply a collection of filters, where the values stored in the filters are the weights that are learned by the neural network through training.
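- We can stack convolutional layers so that later layers capture increasingly higher-level features, which makes the NN more powerful.
- For an RGB image, each filter has one slice per input channel (e.g. 3 x 3 x 3). The products from all channels are summed to give a single output value per position.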
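The multiply-and-sum operation can be sketched for a single-channel image in plain Python. This is an illustration only (no padding, one filter, one channel); the image and the vertical-edge filter are made-up values, and `conv2d_single` is a hypothetical helper name.

```python
def conv2d_single(image, kernel, stride=1):
    """Slide `kernel` over `image`, multiplying pixelwise and summing at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Dark (0) on the left, bright (1) on the right
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
# Filter that responds to a dark-to-bright vertical edge
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d_single(image, kernel)
print(feature_map)
```

The output is largest where the patch straddles the edge and 0 where the patch is uniform, which is the sense in which the output is more positive when the image portion matches the filter.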
#
Stride
- The step size used by the layer to move kernels across the input.
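- Increasing the stride reduces the spatial size of the output tensor; for example, with strides = 2 the height and width are halved. The number of channels is set by the number of filters, which is typically increased as the spatial size shrinks.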
#
Padding
- "same" padding pads the input with zeros so that the output has the same height and width as the input when strides = 1
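The effect of stride and padding on the output size can be written as two small formulas. This sketch assumes the TensorFlow convention for "same" and "valid" padding; the function names are illustrative.

```python
import math

def same_padding_output_size(input_size, stride):
    # With "same" padding the kernel size does not affect the output size
    return math.ceil(input_size / stride)

def valid_padding_output_size(input_size, kernel_size, stride):
    # With "valid" (no) padding the kernel must fit entirely inside the input
    return (input_size - kernel_size) // stride + 1

print(same_padding_output_size(32, 1))      # stride 1 keeps the size
print(same_padding_output_size(32, 2))      # stride 2 halves it
print(same_padding_output_size(16, 2))
print(valid_padding_output_size(32, 3, 1))  # without padding the output shrinks
```

Applying stride 2 twice to a 32 x 32 input therefore gives 16 x 16 and then 8 x 8 feature maps.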
input_layer = layers.Input(shape=(32, 32, 3))
conv_layer_1 = layers.Conv2D(filters=10, kernel_size=(4, 4), strides=2, padding="same")(
    input_layer
)
conv_layer_2 = layers.Conv2D(filters=20, kernel_size=(3, 3), strides=2, padding="same")(
    conv_layer_1
)
flatten_layer = layers.Flatten()(conv_layer_2)
output_layer = layers.Dense(units=10, activation="softmax")(flatten_layer)
model = models.Model(input_layer, output_layer)
model.summary()
Model: "model_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_5 (InputLayer) [(None, 32, 32, 3)] 0
conv2d_2 (Conv2D) (None, 16, 16, 10) 490
conv2d_3 (Conv2D) (None, 8, 8, 20) 1820
flatten_7 (Flatten) (None, 1280) 0
dense_19 (Dense) (None, 10) 12810
=================================================================
Total params: 15120 (59.06 KB)
Trainable params: 15120 (59.06 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
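The parameter counts in this summary can be reproduced by hand. Each filter has one weight per kernel position per input channel, plus one bias. The helper names below are illustrative.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # Each filter: kernel_h * kernel_w * in_channels weights plus 1 bias
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units

conv_1 = conv2d_params(4, 4, 3, 10)
conv_2 = conv2d_params(3, 3, 10, 20)
dense = dense_params(8 * 8 * 20, 10)  # Flatten turns (8, 8, 20) into 1280 values
total = conv_1 + conv_2 + dense
print(conv_1, conv_2, dense, total)
```

Because the filter weights are shared across all spatial positions, the two convolutional layers need far fewer parameters than the dense layers of the MLP.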
#
Batch normalization
- If the weights start to become too large, the network can suffer from the exploding gradient problem: the gradients calculated in backprop grow exponentially large, causing wild fluctuations in the weight values.
#
Covariate shift
- Scaling the input ensures a stable start to training; unscaled input can create huge activation values, leading to exploding gradients
- Even with scaled input, the activation distributions of later layers can drift away from a well-scaled range as the weights are updated during training. This drift is called covariate shift
#
Training
- Batch norm is a technique that reduces this problem
- During training, the layer calculates the mean and SD of each input channel across the batch and normalizes by subtracting the mean and dividing by the SD
- There are 2 learned parameters for each channel, the scale (\gamma) and shift (\beta), which are applied to the normalized values
- We place batch norm layers after Dense/Conv layers
#
Prediction
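- During prediction we may only have a single observation, so there is no batch over which to calculate the mean and SD
- During training, the batch norm layer therefore also calculates a moving average of the mean and SD of each channel and stores these values for use at prediction time
- The moving mean and SD are non-trainable; together with the trainable scale and shift, this gives 4 params for each channel in the layer
- Momentum is the weight given to the previous value when updating these 2 moving averages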
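The training and prediction behaviour can be sketched for a single channel in plain Python. This is an illustration under stated assumptions, not the Keras implementation: the class name `BatchNormChannel` is hypothetical, the `momentum=0.99` and `epsilon=1e-3` values are assumed defaults, and the sketch tracks a moving variance (the square of the SD) rather than the SD itself.

```python
import math

class BatchNormChannel:
    """Batch norm for one channel: 2 trainable and 2 non-trainable values."""

    def __init__(self, momentum=0.99, epsilon=1e-3):
        self.gamma = 1.0        # trainable scale
        self.beta = 0.0         # trainable shift
        self.moving_mean = 0.0  # non-trainable, updated during training
        self.moving_var = 1.0   # non-trainable, updated during training
        self.momentum = momentum
        self.epsilon = epsilon  # avoids division by zero

    def __call__(self, values, training):
        if training:
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            # Momentum is the weight given to the previous moving value
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # No batch statistics at prediction time: use the stored values
            mean, var = self.moving_mean, self.moving_var
        scale = math.sqrt(var + self.epsilon)
        return [self.gamma * (v - mean) / scale + self.beta for v in values]

bn = BatchNormChannel()
train_out = bn([2.0, 4.0, 6.0, 8.0], training=True)
single_out = bn([5.0], training=False)
print(train_out, bn.moving_mean, bn.moving_var, single_out)
```

In training mode the output of the batch has zero mean; in prediction mode a single value can be normalized because the statistics come from the stored moving averages.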
#
Dropout
- A form of regularization to counter overfitting
- During training, each dropout layer chooses a random set of units from the preceding layer and sets their output to 0
- At test time the dropout layer does nothing: it passes the full output through unchanged
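A minimal sketch of this behaviour in plain Python follows. It assumes the "inverted dropout" convention, in which the kept values are scaled up by 1 / (1 - rate) during training so that the expected sum is unchanged; the function name and the example values are illustrative.

```python
import random

def dropout(activations, rate, training, rng=random):
    if not training:
        # At test time the values pass through unchanged
        return list(activations)
    scale = 1.0 / (1.0 - rate)
    # Each unit is zeroed with probability `rate`; survivors are scaled up
    return [0.0 if rng.random() < rate else a * scale for a in activations]

rng = random.Random(0)  # seeded only to make the example repeatable
units = [1.0] * 8
train_out = dropout(units, rate=0.5, training=True, rng=rng)
test_out = dropout(units, rate=0.5, training=False)
print(train_out)
print(test_out)
```

Because a different random subset is dropped on every training step, the network cannot rely too heavily on any single unit, which is what counters overfitting.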
input_layer = layers.Input((32, 32, 3))
x = layers.Conv2D(filters=32, kernel_size=3, strides=1, padding="same")(input_layer)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(filters=32, kernel_size=3, strides=2, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(filters=64, kernel_size=3, strides=1, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)
x = layers.Conv2D(filters=64, kernel_size=3, strides=2, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)
x = layers.Flatten()(x)
x = layers.Dense(128)(x)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)
x = layers.Dropout(rate=0.5)(x)
output_layer = layers.Dense(10, activation="softmax")(x)
model = models.Model(input_layer, output_layer)
model.summary()
Model: "model_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_6 (InputLayer) [(None, 32, 32, 3)] 0
conv2d_4 (Conv2D) (None, 32, 32, 32) 896
batch_normalization (Batch (None, 32, 32, 32) 128
Normalization)
leaky_re_lu (LeakyReLU) (None, 32, 32, 32) 0
conv2d_5 (Conv2D) (None, 16, 16, 32) 9248
batch_normalization_1 (Bat (None, 16, 16, 32) 128
chNormalization)
leaky_re_lu_1 (LeakyReLU) (None, 16, 16, 32) 0
conv2d_6 (Conv2D) (None, 16, 16, 64) 18496
batch_normalization_2 (Bat (None, 16, 16, 64) 256
chNormalization)
leaky_re_lu_2 (LeakyReLU) (None, 16, 16, 64) 0
conv2d_7 (Conv2D) (None, 8, 8, 64) 36928
batch_normalization_3 (Bat (None, 8, 8, 64) 256
chNormalization)
leaky_re_lu_3 (LeakyReLU) (None, 8, 8, 64) 0
flatten_8 (Flatten) (None, 4096) 0
dense_20 (Dense) (None, 128) 524416
batch_normalization_4 (Bat (None, 128) 512
chNormalization)
leaky_re_lu_4 (LeakyReLU) (None, 128) 0
dropout (Dropout) (None, 128) 0
dense_21 (Dense) (None, 10) 1290
=================================================================
Total params: 592554 (2.26 MB)
Trainable params: 591914 (2.26 MB)
Non-trainable params: 640 (2.50 KB)
_________________________________________________________________
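The totals in this summary, including the split between trainable and non-trainable parameters, can be checked with a short calculation. The helper names are illustrative; the layer sizes are read off the summary above.

```python
def conv2d_params(kernel_size, in_channels, filters):
    # Square kernel: kernel_size^2 * in_channels weights plus 1 bias per filter
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units

conv = [
    conv2d_params(3, 3, 32),
    conv2d_params(3, 32, 32),
    conv2d_params(3, 32, 64),
    conv2d_params(3, 64, 64),
]
dense = [dense_params(8 * 8 * 64, 128), dense_params(128, 10)]

# Batch norm: per channel, 2 trainable (scale, shift) and 2 non-trainable (moving stats)
bn_channels = [32, 32, 64, 64, 128]
bn_trainable = sum(2 * c for c in bn_channels)
bn_non_trainable = sum(2 * c for c in bn_channels)

trainable = sum(conv) + sum(dense) + bn_trainable
non_trainable = bn_non_trainable
print(conv, dense, trainable, non_trainable, trainable + non_trainable)
```

The 640 non-trainable parameters are exactly the stored moving statistics of the five batch norm layers.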
opt = optimizers.Adam(learning_rate=0.0005)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=10, shuffle=True)
Epoch 1/10
1563/1563 [==============================] - 66s 41ms/step - loss: 1.5301 - accuracy: 0.4658
Epoch 2/10
1563/1563 [==============================] - 62s 40ms/step - loss: 1.1268 - accuracy: 0.6033
Epoch 3/10
1563/1563 [==============================] - 60s 38ms/step - loss: 0.9870 - accuracy: 0.6543
Epoch 4/10
1563/1563 [==============================] - 59s 38ms/step - loss: 0.9071 - accuracy: 0.6860
Epoch 5/10
1563/1563 [==============================] - 59s 38ms/step - loss: 0.8473 - accuracy: 0.7054
Epoch 6/10
1563/1563 [==============================] - 59s 38ms/step - loss: 0.7903 - accuracy: 0.7258
Epoch 7/10
1563/1563 [==============================] - 56s 36ms/step - loss: 0.7507 - accuracy: 0.7360
Epoch 8/10
1563/1563 [==============================] - 56s 36ms/step - loss: 0.7111 - accuracy: 0.7518
Epoch 9/10
1563/1563 [==============================] - 56s 36ms/step - loss: 0.6723 - accuracy: 0.7652
Epoch 10/10
1563/1563 [==============================] - 59s 38ms/step - loss: 0.6429 - accuracy: 0.7757
<keras.src.callbacks.History at 0x7f96e2507010>
model.evaluate(x_test, y_test, batch_size=1000)
10/10 [==============================] - 2s 197ms/step - loss: 0.8337 - accuracy: 0.7240
[0.8337039351463318, 0.7239999771118164]
CLASSES = np.array(
[
"airplane",
"automobile",
"bird",
"cat",
"deer",
"dog",
"frog",
"horse",
"ship",
"truck",
]
)
preds = model.predict(x_test)
preds_single = CLASSES[np.argmax(preds, axis=-1)]
actual_single = CLASSES[np.argmax(y_test, axis=-1)]
313/313 [==============================] - 3s 11ms/step
n_to_show = 10
indices = np.random.choice(range(len(x_test)), n_to_show)
plt.style.use('dark_background')
fig = plt.figure(figsize=(15, 3))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for i, idx in enumerate(indices):
    img = x_test[idx]
    ax = fig.add_subplot(1, n_to_show, i + 1)
    ax.axis("off")
    # Predicted label, drawn just below the image
    ax.text(
        0.5,
        -0.35,
        "pred = " + str(preds_single[idx]),
        fontsize=10,
        ha="center",
        transform=ax.transAxes,
    )
    # Actual label, drawn below the prediction
    ax.text(
        0.5,
        -0.7,
        "act = " + str(actual_single[idx]),
        fontsize=10,
        ha="center",
        transform=ax.transAxes,
    )
    ax.imshow(img)