# C3 Variational Autoencoders
# Autoencoders
- An autoencoder is simply a neural network that is trained to encode and then decode an item, such that the output of this process is as close to the original item as possible.
- Because we can decode any point in the latent space into novel data, an autoencoder can be used as a generative model.
- The encoder maps each data observation to a point in latent space. This vector representation is known as an embedding, because the encoder tries to pack as much information as possible into it.
- The decoder maps an embedding back to the original data domain.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
keras = tf.keras
from keras import datasets, layers, optimizers, models, backend, metrics, losses
(x_train, y_train), (x_test, y_test) = datasets.fashion_mnist.load_data()
def preprocess(imgs):
    imgs = imgs.astype("float32") / 255.0
    # Pad images from 28x28 to 32x32 so the spatial dims halve cleanly through the strided conv layers
    imgs = np.pad(imgs, ((0, 0), (2, 2), (2, 2)), constant_values=0.0)
    imgs = np.expand_dims(imgs, -1)
    return imgs
x_train = preprocess(x_train)
x_test = preprocess(x_test)
# Autoencoder architecture
- Made of 2 parts:
    - Encoder - compresses a high-dimensional input into a lower-dimensional embedding vector
    - Decoder - decompresses the embedding back to the original domain
- The autoencoder is trained to reconstruct an image after it has passed through the encoder and decoder. The reconstruction itself is not the goal; we are interested in the embedding/latent space, which allows us to generate new data.
- The embedding (z) is a compression of the original data into a lower-dimensional latent space. Choosing any point in this space and passing it through the decoder creates new data.
# Encoder
- The encoder takes an input image and maps it to an embedding vector in latent space.
- We use a standard CNN architecture here.
# Decoder
- A mirror image of the encoder, used to decode embeddings back into images.
- We use Conv2DTranspose layers, which increase the spatial size of the image (rather than decrease it) when the stride is greater than 1 - see the shape check below.
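A quick, self-contained check of that shape behaviour (a sketch only; the tensor and layer sizes here are arbitrary and not part of the model):
# With padding="same" and strides=2, Conv2DTranspose doubles the spatial dimensions
t = tf.random.normal((1, 8, 8, 16))
u = layers.Conv2DTranspose(32, (3, 3), strides=2, padding="same")(t)
print(u.shape)  # (1, 16, 16, 32)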
# Combining the two
- The output of the autoencoder is simply the input image after it has been passed through the encoder and then the decoder.
- The loss chosen is usually one of:
    - RMSE
        - Output is symmetrically distributed around the average pixel values (overestimates and underestimates are penalised equally)
        - Leads to pixelized edges
    - Binary cross-entropy
        - Asymmetrical - penalises errors towards the extremes more heavily than errors towards the centre
        - Ex: if the true value is 0.7, predicting 0.8 is penalised more than predicting 0.6 (see the quick check after this list)
        - Produces blurrier images
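A quick numeric check of that asymmetry (a sketch; the pixel values are made up purely for illustration):
# BCE penalises the 0.8 overestimate (~0.639) more than the 0.6 underestimate (~0.633),
# while the RMSE of both predictions is identical (0.1)
y_true = np.array([[0.7]])
for y_pred in (np.array([[0.6]]), np.array([[0.8]])):
    bce = losses.binary_crossentropy(y_true, y_pred).numpy()[0]
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    print(f"pred={y_pred[0, 0]}  bce={bce:.4f}  rmse={rmse:.4f}")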
# Reconstructing images
- The reconstruction is not perfect and many details, such as text or logos, are lost, since the latent space is only 2D.
- We can visualise the embeddings by plotting the encoder output for a sample of test images.
encoder_input = layers.Input(shape=(32, 32, 1), name="encoder_input")
x = layers.Conv2D(32, (3, 3), strides=2, activation="relu", padding="same")(
encoder_input
)
x = layers.Conv2D(64, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2D(128, (3, 3), strides=2, activation="relu", padding="same")(x)
shape_before_flattening = backend.int_shape(x)[1:]
x = layers.Flatten()(x)
encoder_output = layers.Dense(2, name="encoder_output")(x)
encoder = models.Model(encoder_input, encoder_output)
decoder_input = layers.Input(shape=(2,), name="decoder_input")
x = layers.Dense(np.prod(shape_before_flattening))(decoder_input)
x = layers.Reshape(shape_before_flattening)(x)
x = layers.Conv2DTranspose(128, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(64, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
decoder_output = layers.Conv2D(
1, (3, 3), strides=1, activation="sigmoid", padding="same", name="decoder_output"
)(x)
decoder = models.Model(decoder_input, decoder_output)
autoencoder = models.Model(encoder_input, decoder(encoder_output))
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(
x_train,
x_train,
epochs=5,
batch_size=100,
shuffle=True,
validation_data=(x_test, x_test),
)
Epoch 1/5
600/600 [==============================] - 66s 107ms/step - loss: 0.2954 - val_loss: 0.2632
Epoch 2/5
600/600 [==============================] - 63s 105ms/step - loss: 0.2577 - val_loss: 0.2571
Epoch 3/5
600/600 [==============================] - 58s 97ms/step - loss: 0.2538 - val_loss: 0.2536
Epoch 4/5
600/600 [==============================] - 58s 97ms/step - loss: 0.2517 - val_loss: 0.2529
Epoch 5/5
600/600 [==============================] - 66s 110ms/step - loss: 0.2506 - val_loss: 0.2518
example_images = x_test[:5000]
predictions = autoencoder.predict(example_images)
embeddings = encoder.predict(example_images)
plt.figure(figsize=(8, 8))
plt.scatter(embeddings[:, 0], embeddings[:, 1], c="white", alpha=0.5, s=3)
plt.show()
plt.imshow(x_test[0])
plt.imshow(predictions[0])
# Visualisation
- Each white point represents an image that has been embedded into the latent space.
- We can understand this better by using the labels to colour the plot (see the sketch below).
- Even though the model is never shown the labels, the autoencoder has naturally grouped items that look alike into the same parts of the latent space.
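A sketch of that coloured scatter, reusing the embeddings computed above (it assumes the first 5,000 test labels, matching example_images):
# Colour each embedded point by its Fashion-MNIST class label
example_labels = y_test[:5000]
plt.figure(figsize=(8, 8))
plt.scatter(
    embeddings[:, 0],
    embeddings[:, 1],
    c=example_labels,
    cmap="rainbow",
    alpha=0.8,
    s=3,
)
plt.colorbar()
plt.show()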
# Image generation
- We can generate novel images by sampling points in the latent space and passing them through the decoder.
- From the distribution of the actual embedded points we see that:
    - Some items are represented over a small area, others over a large one
    - The distribution is not symmetrical about the origin, nor bounded
    - There are large gaps between colours, containing few points
# Challenges
- We see that:
    - If we pick points uniformly in a bounded box around the embeddings, we are more likely to pick something bag-like than ankle-boot-like, because the bag region of the latent space is larger.
    - It is not obvious how to choose points, as the distribution over the latent space is not defined.
    - There are empty regions of the latent space where none of the original images are encoded. Even points that are central may not decode into well-formed images, since the autoencoder is never forced to make the space continuous.
- In 2D these issues are subtle, but they become much more apparent in higher dimensions.
- We solve them with a variational autoencoder.
mins, maxs = np.min(embeddings, axis=0), np.max(embeddings, axis=0)
sample = np.random.uniform(mins, maxs, size=(18, 2))
reconstructions = decoder.predict(sample)
print(sample[0])
plt.imshow(reconstructions[0], cmap="gray")
[-3.76511182 -3.05445118]
# Variational autoencoder
We mainly change the encoder and the loss function.
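The change to the loss function is the addition of a KL divergence term (implemented as kl_loss in the training code below), measuring how far the encoded distribution N(\mu, \sigma^2) is from the standard normal. For a diagonal Gaussian it has the closed form
- D_{KL}\left[N(\mu, \sigma^2)\,\|\,N(0, I)\right] = -\frac{1}{2}\sum_i\left(1 + \log\sigma_i^2 - \mu_i^2 - \sigma_i^2\right)
- This term pulls embeddings towards the origin and penalises variances far from 1, which encourages a continuous, well-covered latent space and addresses the challenges listed above.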
# Encoder
- In an autoencoder, each image is mapped to a single point in latent space. In a VAE, each image is instead mapped to a multivariate normal distribution around a point in latent space.
- The normal distribution is a probability distribution defined by its mean and variance
    - The standard/unit normal is the one with mean 0 and variance 1
    - f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
    - We sample a point z from the normal distribution via z = \mu + \sigma\epsilon, where \epsilon \sim N(0, 1)
- The multivariate normal extends this to k dimensions
    - f(x_1, \ldots, x_k) = \frac{\exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)}{\sqrt{(2\pi)^k |\Sigma|}}
    - We use the isotropic multivariate normal, where the covariance matrix \Sigma is diagonal, implying the distribution is independent in each dimension
    - The standard multivariate normal has a zero mean vector and the identity matrix as its covariance matrix
- The encoder maps each input to a mean vector and a log-variance vector
    - As the variance is always positive, predicting the log of the variance lets the output range over (-\infty, \infty)
- We sample a point from the resulting distribution by
    - z = z_{mean} + z_{sigma} \cdot \epsilon
    - z_{sigma} = \exp(z_{log\_var} / 2)
    - \epsilon \sim N(0, I)
- This is the reparameterization trick - it keeps the sampling step differentiable with respect to z_{mean} and z_{log\_var}, so gradients can flow through it during training
# A sampling layer that samples from the distribution defined by z_mean and z_log_var
class Sampling(layers.Layer):
    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        # Reparameterization trick: z = mean + sigma * epsilon, with epsilon ~ N(0, I)
        epsilon = backend.random_normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon
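A tiny usage check of the layer (a sketch; the zero tensors are arbitrary): with z_mean = 0 and z_log_var = 0 it simply returns draws from N(0, I).
demo = Sampling()([tf.zeros((4, 2)), tf.zeros((4, 2))])
print(demo.shape)  # (4, 2) - one 2D latent sample per row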
encoder_input = layers.Input( shape=(32, 32, 1), name="encoder_input" )
x = layers.Conv2D(32, (3, 3), strides=2, activation="relu", padding="same")( encoder_input )
x = layers.Conv2D(64, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2D(128, (3, 3), strides=2, activation="relu", padding="same")(x)
shape_before_flattening = backend.int_shape(x)[1:]
x = layers.Flatten()(x)
z_mean = layers.Dense(2, name="z_mean")(x)
z_log_var = layers.Dense(2, name="z_log_var")(x)
z = Sampling()([z_mean, z_log_var])
encoder = models.Model(encoder_input, [z_mean, z_log_var, z], name="encoder")
class VAE(models.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super(VAE, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.total_loss_tracker = metrics.Mean(name="total_loss")
        self.reconstruction_loss_tracker = metrics.Mean(name="reconstruction_loss")
        self.kl_loss_tracker = metrics.Mean(name="kl_loss")

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

    def call(self, inputs):
        z_mean, z_log_var, z = self.encoder(inputs)
        reconstruction = self.decoder(z)
        return z_mean, z_log_var, reconstruction

    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var, reconstruction = self(data)
            # Reconstruction loss: binary cross-entropy over pixels, scaled by a weighting factor of 500
            reconstruction_loss = tf.reduce_mean(
                500 * losses.binary_crossentropy(data, reconstruction, axis=(1, 2, 3))
            )
            # KL divergence between N(z_mean, exp(z_log_var)) and the standard normal
            kl_loss = tf.reduce_mean(
                tf.reduce_sum(
                    -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var)),
                    axis=1,
                )
            )
            total_loss = reconstruction_loss + kl_loss
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {m.name: m.result() for m in self.metrics}
vae = VAE(encoder, decoder)
vae.compile(optimizer="adam")
vae.fit(x_train, epochs=5, batch_size=100)
Epoch 1/5
600/600 [==============================] - 61s 99ms/step - total_loss: 141.5739 - reconstruction_loss: 136.5573 - kl_loss: 5.0166
Epoch 2/5
600/600 [==============================] - 61s 102ms/step - total_loss: 133.5628 - reconstruction_loss: 128.5448 - kl_loss: 5.0181
Epoch 3/5
600/600 [==============================] - 60s 99ms/step - total_loss: 132.3981 - reconstruction_loss: 127.2663 - kl_loss: 5.1318
Epoch 4/5
600/600 [==============================] - 57s 95ms/step - total_loss: 131.6895 - reconstruction_loss: 126.4955 - kl_loss: 5.1940
Epoch 5/5
600/600 [==============================] - 57s 96ms/step - total_loss: 131.1532 - reconstruction_loss: 125.8996 - kl_loss: 5.2536
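A quick follow-up sketch of generation with the trained VAE: because the KL term pulls the latent distribution towards N(0, I), we can sample latent points directly from the standard normal and decode them (this reuses the decoder object trained inside the VAE):
# Sample latent points from the standard normal and decode them into new images
z_sample = np.random.normal(size=(18, 2))
generated = decoder.predict(z_sample)
plt.imshow(generated[0, :, :, 0], cmap="gray")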