# C2 Deep Learning

Deep learning is a class of machine learning algorithms that uses multiple stacked layers of processing units to learn high-level representations from unstructured data.

# Data for DL

There are two main types of data:

Structured - Tabular data, arranged into columns of features that describe each observation.
Unstructured
- Data that is not naturally arranged into columns of features, such as images, audio, and text.
- Individual pixels, characters, etc. are uninformative on their own. The fine granularity of the data, combined with its high degree of spatial dependence, means a single pixel or character is not an informative feature in its own right.
- A deep learning model can learn to build high-level informative features by itself, directly from the unstructured data, whereas traditional models that rely on informative input features tend to struggle.

# Deep NN

# NN

A neural network consists of a series of stacked layers. Each layer contains units that are connected to the previous layer’s units through a set of weights. The most common type is the dense (fully connected) layer, in which every unit is connected to every unit in the previous layer.
A NN in which all adjacent layers are fully connected is called a multilayer perceptron (MLP).
The input is transformed by each layer in turn (forward pass through the network), until it reaches the output layer. Specifically, each unit applies a transformation to a weighted sum of its inputs and passes the output through to the subsequent layer. The final output layer is the culmination of this process.
Finding the set of weights that gives the most accurate predictions is what we mean by training the NN.
The error in the prediction is propagated backward through the network, adjusting each set of weights a small amount in the direction that improves the prediction most significantly. This process is called backpropagation.
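
The forward pass described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the layer sizes (4 -> 3 -> 2), the random weights, and the helper name `forward` are all hypothetical, and ReLU is used as the transformation in every layer for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs -> 3 hidden units -> 2 outputs
sizes = [4, 3, 2]
weights = [rng.normal(size=(n_in, n_out)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x, weights, biases):
    """Pass a batch through each layer in turn and return every layer's output."""
    outputs = []
    for w, b in zip(weights, biases):
        z = x @ w + b            # weighted sum of the previous layer's outputs
        x = np.maximum(0.0, z)   # nonlinear transformation (ReLU here)
        outputs.append(x)
    return outputs

batch = rng.normal(size=(5, 4))   # 5 observations with 4 features each
hidden, output = forward(batch, weights, biases)
print(hidden.shape, output.shape)  # (5, 3) (5, 2)
```

Training would then adjust `weights` and `biases` using the gradient of a loss computed on `output`; that step is left to the framework in these notes.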

# Learning features

The critical property of a NN is learning features from data.
Units in each subsequent layer are able to represent increasingly sophisticated aspects of the original input, by combining lower-level features from the previous layer.

import numpy as np
import tensorflow as tf

keras = tf.keras
from keras import datasets, utils, layers, models, optimizers
import matplotlib.pyplot as plt

# Training an MLP

# We scale the pixel values to lie between 0 and 1, as NNs work best when the absolute value of each input is less than 1.
# We also one-hot encode the integer labels, because the network outputs one probability per class.

(x_train, y_train), (x_test, y_test) = datasets.cifar10.load_data()
CLASSES = 10
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

y_train = utils.to_categorical(y_train, CLASSES)  # (50000, 10)
y_test = utils.to_categorical(y_test, CLASSES)
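
As a rough illustration of what the two preprocessing steps do, the sketch below reproduces them in NumPy on a few made-up values. The arrays `labels` and `pixels` are hypothetical stand-ins for the real dataset, and indexing `np.eye` is just one way to get one-hot rows, not necessarily how `to_categorical` is implemented.

```python
import numpy as np

# Made-up integer labels in the same (n, 1) layout as the CIFAR-10 labels
labels = np.array([[3], [0], [9]])
one_hot = np.eye(10, dtype="float32")[labels.ravel()]   # shape (3, 10)

# Made-up 8-bit pixel intensities scaled into [0, 1]
pixels = np.array([0, 128, 255], dtype="uint8")
scaled = pixels.astype("float32") / 255.0
```

Each row of `one_hot` contains a single 1 in the column of its class, which is the format the softmax output is compared against.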

# Sequential API
model = models.Sequential(
    [
        layers.Flatten(input_shape=(32, 32, 3)),
        layers.Dense(200, activation="relu"),
        layers.Dense(150, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ]
)

# Functional API
# Many models require that the output from a layer is passed to multiple subsequent layers
# or conversely, that a layer receives input from multiple preceding layers.
# For these models, the Sequential class is not suitable and we use the functional API instead.
input_layer = layers.Input(shape=(32, 32, 3))
x = layers.Flatten()(input_layer)

# Can also define activation per layer
# x = layers.Dense(units=200)(x)
# x = layers.Activation('relu')(x)
x = layers.Dense(units=200, activation="relu")(x)

x = layers.Dense(units=150, activation="relu")(x)
output_layer = layers.Dense(units=10, activation="softmax")(x)
model = models.Model(input_layer, output_layer)

model.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, 32, 32, 3)]       0         
                                                                 
 flatten_5 (Flatten)         (None, 3072)              0         
                                                                 
 dense_15 (Dense)            (None, 200)               614600    
                                                                 
 dense_16 (Dense)            (None, 150)               30150     
                                                                 
 dense_17 (Dense)            (None, 10)                1510      
                                                                 
=================================================================
Total params: 646260 (2.47 MB)
Trainable params: 646260 (2.47 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
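
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

flat = 32 * 32 * 3                       # Flatten output: 3072 values
counts = [
    dense_params(flat, 200),             # 614600
    dense_params(200, 150),              # 30150
    dense_params(150, 10),               # 1510
]
total = sum(counts)                      # 646260
```

Almost all of the parameters sit in the first dense layer, because every one of the 3072 input values connects to every one of its 200 units.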

# Layers

We use 3 types of layers

Input - Entry point to network which specifies input shape
Flatten - Flatten input into a vector as Dense requires a flat vector
Dense - A fully connected NN layer. The weighted sum of inputs is passed through an activation function to get output

# Activation functions

We use the following AFs

ReLU (Rectified Linear Unit)
- f(x) = \max(0, x)
- ReLU units can sometimes die: if a large negative bias keeps the pre-activation values negative, the unit always outputs 0.
- In that case the gradient is 0 and no error is backpropagated through the unit.
LeakyReLU
- f(x) = \left\{\begin{array}{ll}x & x \geq 0 \\ ax & x < 0\end{array}\right.
- The small slope a for negative inputs keeps the gradient nonzero, which fixes the dying ReLU problem
Sigmoid
- f(x) = \frac 1 {1 + e^{-x}}
- Scales output between 0 and 1
- Used in binary/multilabel classification
Softmax
- f(x)_i = \frac {e^{x_i}} {\sum_{j=1}^J e^{x_j}}
- Total sum of output probabilities = 1
- Used for multiclass classification
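
The four activation functions above can be written directly in NumPy. This is a minimal sketch rather than the Keras implementation; the slope `a=0.3` is an illustrative value, and the max-subtraction in `softmax` is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.3):
    # a is the slope applied to negative inputs (illustrative value)
    return np.where(x >= 0, x, a * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))  # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([-2.0, 0.0, 3.0])
relu_out = relu(x)            # values: 0, 0, 3
leaky_out = leaky_relu(x)     # values: -0.6, 0, 3
probs = softmax(x)            # non-negative values that sum to 1
```

Note how `relu` discards the negative input entirely, while `leaky_relu` keeps a scaled-down copy of it, so some gradient can still flow.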

opt = optimizers.Adam(learning_rate=0.0005)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# Loss function

The loss function compares the predicted output to the ground truth. We use the following (y_i = true value, p_i = predicted value)

Mean Squared Error
- MSE = \frac 1n \sum_{i=1}^n {(y_i - p_i)}^2
- Used in regression
Categorical cross entropy
- $ = -\sum_{i=1}^n y_i \log(p_i)$
- Used in classification, each observation belongs to a class
Binary cross entropy
- $ = -\frac 1n \sum_{i=1}^n (y_i \log(p_i) + (1-y_i) \log(1-p_i))$
- Used in binary classification with one output unit
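
A minimal NumPy sketch of the three losses follows, assuming `y` holds the true values and `p` the predictions. The function names and the clipping constant `eps` (used only to avoid taking the log of 0) are assumptions made for this example, not taken from the notes.

```python
import numpy as np

def mean_squared_error(y, p):
    return np.mean((y - p) ** 2)

def categorical_crossentropy(y, p, eps=1e-7):
    p = np.clip(p, eps, 1.0)                      # avoid log(0)
    return -np.sum(y * np.log(p), axis=-1)

def binary_crossentropy(y, p, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)                # avoid log(0) on both terms
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

mse = mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 4.0]))   # 2.0

y_onehot = np.array([0.0, 1.0, 0.0])
p_softmax = np.array([0.1, 0.7, 0.2])
cce = categorical_crossentropy(y_onehot, p_softmax)   # equals -log(0.7)

bce = binary_crossentropy(np.array([1.0, 0.0]), np.array([0.9, 0.2]))
```

With a one-hot target, only the predicted probability of the true class contributes to the categorical cross entropy, which is why `cce` reduces to `-log(0.7)` here.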

# Optimizers

The optimizer is the algorithm that will be used to update the weights in the neural network based on the gradient of the loss function.

We use

Adam (Adaptive Moment Estimation)
RMSProp (Root Mean Squared Propagation)
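
To give a feel for what an optimizer does with a gradient, here is a sketch of a single Adam update in its commonly cited textbook form. It is not the Keras implementation, which may differ in details. The learning rate matches the value used in these notes; `beta1`, `beta2`, `eps`, and the toy loss (the sum of squared weights) are assumptions for the example.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.0005, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad         # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2    # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize sum(w**2), whose gradient is 2 * w
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t)
```

Because the update is divided by the root of the squared-gradient average, each weight moves by roughly `lr` per step regardless of the raw gradient size, so both weights creep toward 0 at a similar pace.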

model.fit(x_train, y_train, batch_size=32, epochs=10, shuffle=True)

Epoch 1/10
1563/1563 [==============================] - 12s 7ms/step - loss: 1.8447 - accuracy: 0.3356
Epoch 2/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.6592 - accuracy: 0.4070
Epoch 3/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.5806 - accuracy: 0.4379
Epoch 4/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.5312 - accuracy: 0.4563
Epoch 5/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.4902 - accuracy: 0.4686
Epoch 6/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.4584 - accuracy: 0.4808
Epoch 7/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.4317 - accuracy: 0.4903
Epoch 8/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.4098 - accuracy: 0.4964
Epoch 9/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.3873 - accuracy: 0.5052
Epoch 10/10
1563/1563 [==============================] - 11s 7ms/step - loss: 1.3689 - accuracy: 0.5133





<keras.src.callbacks.History at 0x7f97ed613510>

# Training

Weights are initialized randomly.
In each training step, one batch of images is passed through the network and the errors are backpropagated to update the weights.
This continues until every observation has been seen once, which completes 1 epoch.
The process repeats for the specified number of epochs.
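
The step counts shown in the Keras progress bars follow from the dataset and batch sizes. The arithmetic below assumes the standard CIFAR-10 split of 50,000 training and 10,000 test images, and a default evaluation batch size of 32.

```python
import math

n_train, n_test = 50000, 10000
batch_size, epochs = 32, 10

steps_per_epoch = math.ceil(n_train / batch_size)   # 1563, last batch is partial
eval_steps = math.ceil(n_test / batch_size)         # 313
total_updates = steps_per_epoch * epochs            # 15630 weight updates
```

So ten epochs with a batch size of 32 correspond to 15,630 weight updates in total.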

model.evaluate(x_test, y_test)

313/313 [==============================] - 1s 2ms/step - loss: 1.4679 - accuracy: 0.4732

[1.467934012413025, 0.4731999933719635]

CLASSES = np.array(
    [
        "airplane",
        "automobile",
        "bird",
        "cat",
        "deer",
        "dog",
        "frog",
        "horse",
        "ship",
        "truck",
    ]
)
preds = model.predict(x_test)

preds_single = CLASSES[np.argmax(preds, axis=-1)]
actual_single = CLASSES[np.argmax(y_test, axis=-1)]

313/313 [==============================] - 1s 2ms/step

n_to_show = 10
indices = np.random.choice(range(len(x_test)), n_to_show)

plt.style.use('dark_background')
fig = plt.figure(figsize=(15, 3))
fig.subplots_adjust(hspace=0.4, wspace=0.4)

for i, idx in enumerate(indices):
    img = x_test[idx]
    ax = fig.add_subplot(1, n_to_show, i + 1)
    ax.axis("off")
    ax.text(
        0.5,
        -0.35,
        "pred = " + str(preds_single[idx]),
        fontsize=10,
        ha="center",
        transform=ax.transAxes,
    )
    ax.text(
        0.5,
        -0.7,
        "act = " + str(actual_single[idx]),
        fontsize=10,
        ha="center",
        transform=ax.transAxes,
    )
    ax.imshow(img)

# CNN

One of the reasons our network isn’t yet performing as well as it might is because there isn’t anything in the network that takes into account the spatial structure of the input images.

# Convolution Layers

A convolution is performed by multiplying the filter elementwise with a portion of the image and summing the results.
The output is more positive when the portion of the image closely matches the filter, and more negative when it is the inverse of the filter.
A convolutional layer is simply a collection of filters, where the values stored in the filters are the weights that are learned by the neural network through training.
We can stack convolutional layers so that later layers capture increasingly high-level features, which makes the NN more powerful.
For an RGB image, each filter has three channels, one per input channel, and the per-channel products are summed into a single output value.
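
The multiply-and-sum operation can be sketched for a single-channel image with stride 1 and no padding. The function name `conv2d_single`, the toy image, and the hand-written vertical-edge filter are all hypothetical; in a real layer the filter values are learned.

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide the kernel over a 2D image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # elementwise multiply, then sum
    return out

# Toy image: dark left half, bright right half
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
# Hand-written filter that responds to a dark-to-bright vertical edge
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)

response = conv2d_single(image, kernel)   # shape (3, 3)
```

The response is 2 in the middle column, where the patch straddles the edge and matches the filter, and 0 elsewhere, where the patch is uniform.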

# Stride

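The step size used by the layer to move the filters across the input.
Increasing the stride reduces the spatial size of the output tensor; with strides = 2, the height and width are halved. The number of output channels is set by the number of filters, not by the stride, so strided layers are often given more filters to shrink the spatial size while increasing the number of channels.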

# Padding

padding = "same" pads the input data with zeros so that, when strides = 1, the output has the same height and width as the input.
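
The effect of stride and padding on the output size can be computed with a small helper. The function name `conv_output_size` is hypothetical; the formulas are the usual ones for "same" and "valid" padding, stated here as an assumption about how the layer behaves rather than taken from the notes.

```python
import math

def conv_output_size(n, kernel, stride, padding):
    """Output height/width of a convolution along one spatial dimension."""
    if padding == "same":
        return math.ceil(n / stride)            # independent of kernel size
    if padding == "valid":
        return (n - kernel) // stride + 1       # no zero padding
    raise ValueError(f"unknown padding: {padding}")

first = conv_output_size(32, kernel=4, stride=2, padding="same")      # 16
second = conv_output_size(first, kernel=3, stride=2, padding="same")  # 8
unpadded = conv_output_size(32, kernel=4, stride=2, padding="valid")  # 15
```

With "same" padding and a stride of 2, a 32 x 32 input shrinks to 16 x 16 and then 8 x 8 over two layers.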

input_layer = layers.Input(shape=(32, 32, 3))
conv_layer_1 = layers.Conv2D(filters=10, kernel_size=(4, 4), strides=2, padding="same")(
    input_layer
)
conv_layer_2 = layers.Conv2D(filters=20, kernel_size=(3, 3), strides=2, padding="same")(
    conv_layer_1
)
flatten_layer = layers.Flatten()(conv_layer_2)
output_layer = layers.Dense(units=10, activation="softmax")(flatten_layer)

model = models.Model(input_layer, output_layer)
model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_5 (InputLayer)        [(None, 32, 32, 3)]       0         
                                                                 
 conv2d_2 (Conv2D)           (None, 16, 16, 10)        490       
                                                                 
 conv2d_3 (Conv2D)           (None, 8, 8, 20)          1820      
                                                                 
 flatten_7 (Flatten)         (None, 1280)              0         
                                                                 
 dense_19 (Dense)            (None, 10)                12810     
                                                                 
=================================================================
Total params: 15120 (59.06 KB)
Trainable params: 15120 (59.06 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
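
The parameter counts in this summary can be reproduced by hand. Each filter has one weight per kernel position per input channel, plus one bias. The helper names `conv2d_params` and `dense_params` are hypothetical.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """(weights per filter + 1 bias) * number of filters."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

conv_1 = conv2d_params(4, 4, 3, 10)      # 490
conv_2 = conv2d_params(3, 3, 10, 20)     # 1820
dense = dense_params(8 * 8 * 20, 10)     # 12810
total = conv_1 + conv_2 + dense          # 15120
```

The depth of each filter in the second layer is 10 because it must span all 10 channels produced by the first layer.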

# Batch normalization

If the weights become too large, the network can suffer from the exploding gradient problem: the gradients calculated during backprop can grow exponentially large, causing wild fluctuations in the weight values.

# Covariate shift

Scaling the input ensures a stable start to training; unscaled input can create huge activation values, leading to exploding gradients.
We assume the activations of each layer stay reasonably well scaled, but as training proceeds and the weights move away from their initial values, the activation distributions can drift away from this assumption. This drift is called covariate shift.

# Training

Batch norm is a technique that reduces this problem.
During training, the layer calculates the mean and SD of each input channel across the batch and normalizes by subtracting the mean and dividing by the SD.
There are 2 learned parameters for each channel: the scale (\gamma) and the shift (\beta). The output is the normalized input scaled by \gamma and shifted by \beta.
We place this layer after Dense/Conv layers.

# Prediction

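During prediction we may want to predict on a single observation, so there is no batch over which to calculate the mean and SD.
During training, the batch norm layer therefore also maintains a moving average of the mean and SD of each channel and stores these values for use at prediction time.
The moving mean and SD are derived from the data rather than trained, so each channel has 4 params in total: 2 trainable (\gamma and \beta) and 2 non-trainable.
Momentum is the weight given to the previous value when updating these 2 moving averages.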
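
The training and prediction behaviour can be sketched as a small NumPy class. This is a simplified illustration, not the Keras layer: the class name `BatchNorm`, the `momentum` and `eps` values, and the choice to track a moving variance (rather than the SD itself) are assumptions made for the example.

```python
import numpy as np

class BatchNorm:
    def __init__(self, channels, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(channels)          # trainable scale
        self.beta = np.zeros(channels)          # trainable shift
        self.moving_mean = np.zeros(channels)   # non-trainable
        self.moving_var = np.ones(channels)     # non-trainable
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            axes = tuple(range(x.ndim - 1))     # every axis except channels
            mean, var = x.mean(axis=axes), x.var(axis=axes)
            # Momentum weights the previous moving value against the new batch value
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm(channels=4)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8, 8, 4))

train_out = bn(x, training=True)       # normalized with batch statistics
test_out = bn(x[:1], training=False)   # single image, uses stored moving statistics
```

In training mode each channel of `train_out` has mean close to 0 and SD close to 1, while the single-image call works without any batch statistics.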

# Dropout

A form of regularization to counter overfitting.
During training, each dropout layer chooses a random set of units from the preceding layer and sets their output to 0.
At test time the layer drops no units and passes its input through unchanged.
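
A minimal sketch of dropout in NumPy is shown below. The rescaling of the surviving units by 1 / (1 - rate) is not mentioned in the notes above; it is included on the assumption that the layer keeps the expected sum of activations unchanged, as the Keras documentation describes. The function name and the `rng` argument are hypothetical.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Zero a random subset of units during training; pass through otherwise."""
    if not training:
        return x
    keep = rng.random(x.shape) >= rate     # True for units that survive
    return x * keep / (1.0 - rate)         # rescale survivors (assumption, see above)

rng = np.random.default_rng(0)
activations = np.ones(1000)

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)
```

With `rate=0.5`, roughly half of `train_out` is 0 and the rest is 2, while `test_out` is identical to the input.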

input_layer = layers.Input((32, 32, 3))

x = layers.Conv2D(filters=32, kernel_size=3, strides=1, padding="same")(input_layer)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)

x = layers.Conv2D(filters=32, kernel_size=3, strides=2, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)

x = layers.Conv2D(filters=64, kernel_size=3, strides=1, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)

x = layers.Conv2D(filters=64, kernel_size=3, strides=2, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)

x = layers.Flatten()(x)
x = layers.Dense(128)(x)
x = layers.BatchNormalization()(x)
x = layers.LeakyReLU()(x)
x = layers.Dropout(rate=0.5)(x)

output_layer = layers.Dense(10, activation="softmax")(x)

model = models.Model(input_layer, output_layer)
model.summary()

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_6 (InputLayer)        [(None, 32, 32, 3)]       0         
                                                                 
 conv2d_4 (Conv2D)           (None, 32, 32, 32)        896       
                                                                 
 batch_normalization (Batch  (None, 32, 32, 32)        128       
 Normalization)                                                  
                                                                 
 leaky_re_lu (LeakyReLU)     (None, 32, 32, 32)        0         
                                                                 
 conv2d_5 (Conv2D)           (None, 16, 16, 32)        9248      
                                                                 
 batch_normalization_1 (Bat  (None, 16, 16, 32)        128       
 chNormalization)                                                
                                                                 
 leaky_re_lu_1 (LeakyReLU)   (None, 16, 16, 32)        0         
                                                                 
 conv2d_6 (Conv2D)           (None, 16, 16, 64)        18496     
                                                                 
 batch_normalization_2 (Bat  (None, 16, 16, 64)        256       
 chNormalization)                                                
                                                                 
 leaky_re_lu_2 (LeakyReLU)   (None, 16, 16, 64)        0         
                                                                 
 conv2d_7 (Conv2D)           (None, 8, 8, 64)          36928     
                                                                 
 batch_normalization_3 (Bat  (None, 8, 8, 64)          256       
 chNormalization)                                                
                                                                 
 leaky_re_lu_3 (LeakyReLU)   (None, 8, 8, 64)          0         
                                                                 
 flatten_8 (Flatten)         (None, 4096)              0         
                                                                 
 dense_20 (Dense)            (None, 128)               524416    
                                                                 
 batch_normalization_4 (Bat  (None, 128)               512       
 chNormalization)                                                
                                                                 
 leaky_re_lu_4 (LeakyReLU)   (None, 128)               0         
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_21 (Dense)            (None, 10)                1290      
                                                                 
=================================================================
Total params: 592554 (2.26 MB)
Trainable params: 591914 (2.26 MB)
Non-trainable params: 640 (2.50 KB)
_________________________________________________________________
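
The non-trainable total in this summary comes entirely from the batch normalization layers: each channel carries 4 parameters, of which the moving mean and moving SD are not trained. The channel list below is read off the summary above; the variable names are made up for the check.

```python
# Channels seen by each of the five BatchNormalization layers
bn_channels = [32, 32, 64, 64, 128]

per_layer = [4 * c for c in bn_channels]           # 128, 128, 256, 256, 512
non_trainable = sum(2 * c for c in bn_channels)    # moving mean + moving SD

total_params = 592554                              # from the summary
trainable = total_params - non_trainable
```

This gives 640 non-trainable and 591,914 trainable parameters, matching the summary.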

opt = optimizers.Adam(learning_rate=0.0005)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=32, epochs=10, shuffle=True)

Epoch 1/10
1563/1563 [==============================] - 66s 41ms/step - loss: 1.5301 - accuracy: 0.4658
Epoch 2/10
1563/1563 [==============================] - 62s 40ms/step - loss: 1.1268 - accuracy: 0.6033
Epoch 3/10
1563/1563 [==============================] - 60s 38ms/step - loss: 0.9870 - accuracy: 0.6543
Epoch 4/10
1563/1563 [==============================] - 59s 38ms/step - loss: 0.9071 - accuracy: 0.6860
Epoch 5/10
1563/1563 [==============================] - 59s 38ms/step - loss: 0.8473 - accuracy: 0.7054
Epoch 6/10
1563/1563 [==============================] - 59s 38ms/step - loss: 0.7903 - accuracy: 0.7258
Epoch 7/10
1563/1563 [==============================] - 56s 36ms/step - loss: 0.7507 - accuracy: 0.7360
Epoch 8/10
1563/1563 [==============================] - 56s 36ms/step - loss: 0.7111 - accuracy: 0.7518
Epoch 9/10
1563/1563 [==============================] - 56s 36ms/step - loss: 0.6723 - accuracy: 0.7652
Epoch 10/10
1563/1563 [==============================] - 59s 38ms/step - loss: 0.6429 - accuracy: 0.7757





<keras.src.callbacks.History at 0x7f96e2507010>

model.evaluate(x_test, y_test, batch_size=1000)

10/10 [==============================] - 2s 197ms/step - loss: 0.8337 - accuracy: 0.7240

[0.8337039351463318, 0.7239999771118164]

CLASSES = np.array(
    [
        "airplane",
        "automobile",
        "bird",
        "cat",
        "deer",
        "dog",
        "frog",
        "horse",
        "ship",
        "truck",
    ]
)
preds = model.predict(x_test)

preds_single = CLASSES[np.argmax(preds, axis=-1)]
actual_single = CLASSES[np.argmax(y_test, axis=-1)]

313/313 [==============================] - 3s 11ms/step

n_to_show = 10
indices = np.random.choice(range(len(x_test)), n_to_show)

plt.style.use('dark_background')
fig = plt.figure(figsize=(15, 3))
fig.subplots_adjust(hspace=0.4, wspace=0.4)

for i, idx in enumerate(indices):
    img = x_test[idx]
    ax = fig.add_subplot(1, n_to_show, i + 1)
    ax.axis("off")
    ax.text(
        0.5,
        -0.35,
        "pred = " + str(preds_single[idx]),
        fontsize=10,
        ha="center",
        transform=ax.transAxes,
    )
    ax.text(
        0.5,
        -0.7,
        "act = " + str(actual_single[idx]),
        fontsize=10,
        ha="center",
        transform=ax.transAxes,
    )
    ax.imshow(img)