# C3 Variational Autoencoders
# Autoencoders
- An autoencoder is simply a neural network that is trained to encode and then decode an item, such that the output of this process is as close to the original item as possible.
- Because we can decode any point in the latent space into novel data, an autoencoder can be used as a generative model.
- The encoder maps each data observation to a point in latent space. This vector representation is known as an embedding, because the encoder tries to pack as much information as possible into it.
- The decoder maps an embedding back to the original data domain.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
keras = tf.keras
from keras import datasets, layers, optimizers, models, backend, metrics, losses
(x_train, y_train), (x_test, y_test) = datasets.fashion_mnist.load_data()
def preprocess(imgs):
    imgs = imgs.astype("float32") / 255.0
    # Pad images from 28x28 to 32x32 so the spatial dims halve cleanly through the strided conv layers
    imgs = np.pad(imgs, ((0, 0), (2, 2), (2, 2)), constant_values=0.0)
    imgs = np.expand_dims(imgs, -1)
    return imgs
x_train = preprocess(x_train)
x_test = preprocess(x_test)
# Autoencoder architecture
- Made of 2 parts:
    - Encoder - compresses a high-dimensional input into a lower-dimensional embedding vector
    - Decoder - decompresses the embedding back to the original domain
- The autoencoder is trained to reconstruct an image after it has passed through the encoder and decoder. The reconstruction itself is not the goal; we are interested in the embedding/latent space, which allows us to generate new data.
- The embedding (z) is a compression of the original data into a lower-dimensional latent space. Choosing any point in this space and passing it through the decoder creates new data.
# Encoder
- The encoder takes an input image and maps it to an embedding vector in latent space.
- We use a standard CNN architecture here.
# Decoder
- A mirror image of the encoder, used to decode embeddings back into images.
- We use Conv2DTranspose layers, which increase the spatial size of the image (rather than decrease it) when the stride is greater than 1 - see the shape check below.
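A quick, self-contained check of that shape behaviour (a sketch only; the tensor and layer sizes here are arbitrary and not part of the model):
# With padding="same" and strides=2, Conv2DTranspose doubles the spatial dimensions
t = tf.random.normal((1, 8, 8, 16))
u = layers.Conv2DTranspose(32, (3, 3), strides=2, padding="same")(t)
print(u.shape)  # (1, 16, 16, 32)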
# Combining the two
- The output of the autoencoder is simply the input image after it has been passed through the encoder and then the decoder.
- The loss chosen is usually one of:
    - RMSE
        - Output is symmetrically distributed around the average pixel values (overestimates and underestimates are penalised equally)
        - Leads to pixelized edges
    - Binary cross-entropy
        - Asymmetrical - penalises errors towards the extremes more heavily than errors towards the centre
        - Ex: if the true value is 0.7, predicting 0.8 is penalised more than predicting 0.6 (see the quick check after this list)
        - Produces blurrier images
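A quick numeric check of that asymmetry (a sketch; the pixel values are made up purely for illustration):
# BCE penalises the 0.8 overestimate (~0.639) more than the 0.6 underestimate (~0.633),
# while the RMSE of both predictions is identical (0.1)
y_true = np.array([[0.7]])
for y_pred in (np.array([[0.6]]), np.array([[0.8]])):
    bce = losses.binary_crossentropy(y_true, y_pred).numpy()[0]
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    print(f"pred={y_pred[0, 0]}  bce={bce:.4f}  rmse={rmse:.4f}")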
# Reconstructing images
- The reconstruction is not perfect and many details, such as text or logos, are lost, since the latent space is only 2D.
- We can visualise the embeddings by plotting the encoder output for a sample of test images.
encoder_input = layers.Input(shape=(32, 32, 1), name="encoder_input")
x = layers.Conv2D(32, (3, 3), strides=2, activation="relu", padding="same")(
encoder_input
)
x = layers.Conv2D(64, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2D(128, (3, 3), strides=2, activation="relu", padding="same")(x)
shape_before_flattening = backend.int_shape(x)[1:]
x = layers.Flatten()(x)
encoder_output = layers.Dense(2, name="encoder_output")(x)
encoder = models.Model(encoder_input, encoder_output)
decoder_input = layers.Input(shape=(2,), name="decoder_input")
x = layers.Dense(np.prod(shape_before_flattening))(decoder_input)
x = layers.Reshape(shape_before_flattening)(x)
x = layers.Conv2DTranspose(128, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(64, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2DTranspose(32, (3, 3), strides=2, activation="relu", padding="same")(x)
decoder_output = layers.Conv2D(
1, (3, 3), strides=1, activation="sigmoid", padding="same", name="decoder_output"
)(x)
decoder = models.Model(decoder_input, decoder_output)
autoencoder = models.Model(encoder_input, decoder(encoder_output))
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(
x_train,
x_train,
epochs=5,
batch_size=100,
shuffle=True,
validation_data=(x_test, x_test),
)
Epoch 1/5
600/600 [==============================] - 66s 107ms/step - loss: 0.2954 - val_loss: 0.2632
Epoch 2/5
600/600 [==============================] - 63s 105ms/step - loss: 0.2577 - val_loss: 0.2571
Epoch 3/5
600/600 [==============================] - 58s 97ms/step - loss: 0.2538 - val_loss: 0.2536
Epoch 4/5
600/600 [==============================] - 58s 97ms/step - loss: 0.2517 - val_loss: 0.2529
Epoch 5/5
600/600 [==============================] - 66s 110ms/step - loss: 0.2506 - val_loss: 0.2518
example_images = x_test[:5000]
predictions = autoencoder.predict(example_images)
embeddings = encoder.predict(example_images)
plt.figure(figsize=(8, 8))
plt.scatter(embeddings[:, 0], embeddings[:, 1], c="white", alpha=0.5, s=3)
plt.show()
plt.imshow(x_test[0])
plt.imshow(predictions[0])
# Visualisation
- Each white point represents an image that has been embedded into the latent space.
- We can understand this better by using the labels to colour the plot (see the sketch below).
- Even though the model is never shown the labels, the autoencoder has naturally grouped items that look alike into the same parts of the latent space.
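A sketch of that coloured scatter, reusing the embeddings computed above (it assumes the first 5,000 test labels, matching example_images):
# Colour each embedded point by its Fashion-MNIST class label
example_labels = y_test[:5000]
plt.figure(figsize=(8, 8))
plt.scatter(
    embeddings[:, 0],
    embeddings[:, 1],
    c=example_labels,
    cmap="rainbow",
    alpha=0.8,
    s=3,
)
plt.colorbar()
plt.show()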
# Image generation
- We can generate novel images by sampling points in the latent space and passing them through the decoder.
- From the distribution of the actual embedded points we see that:
    - Some items are represented over a small area, others over a large one
    - The distribution is not symmetrical about the origin, nor bounded
    - There are large gaps between colours, containing few points
# Challenges
- We see that:
    - If we pick points uniformly in a bounded box around the embeddings, we are more likely to pick something bag-like than ankle-boot-like, because the bag region of the latent space is larger.
    - It is not obvious how to choose points, as the distribution over the latent space is not defined.
    - There are empty regions of the latent space where none of the original images are encoded. Even points that are central may not decode into well-formed images, since the autoencoder is never forced to make the space continuous.
- In 2D these issues are subtle, but they become much more apparent in higher dimensions.
- We solve them with a variational autoencoder.
mins, maxs = np.min(embeddings, axis=0), np.max(embeddings, axis=0)
sample = np.random.uniform(mins, maxs, size=(18, 2))
reconstructions = decoder.predict(sample)
print(sample[0])
plt.imshow(reconstructions[0], cmap="gray")
[-3.76511182 -3.05445118]
# Variational autoencoder
We mainly change the encoder and the loss function.
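The change to the loss function is the addition of a KL divergence term (implemented as kl_loss in the training code below), measuring how far the encoded distribution N(\mu, \sigma^2) is from the standard normal. For a diagonal Gaussian it has the closed form
- D_{KL}\left[N(\mu, \sigma^2)\,\|\,N(0, I)\right] = -\frac{1}{2}\sum_i\left(1 + \log\sigma_i^2 - \mu_i^2 - \sigma_i^2\right)
- This term pulls embeddings towards the origin and penalises variances far from 1, which encourages a continuous, well-covered latent space and addresses the challenges listed above.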
# Encoder
- In an autoencoder, each image is mapped to a single point in latent space. In a VAE, each image is instead mapped to a multivariate normal distribution around a point in latent space.
- The normal distribution is a probability distribution defined by its mean and variance
    - The standard/unit normal is the one with mean 0 and variance 1
    - f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
    - We sample a point z from the normal distribution via z = \mu + \sigma\epsilon, where \epsilon \sim N(0, 1)
- The multivariate normal extends this to k dimensions
    - f(x_1, \ldots, x_k) = \frac{\exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)}{\sqrt{(2\pi)^k |\Sigma|}}
    - We use the isotropic multivariate normal, where the covariance matrix \Sigma is diagonal, implying the distribution is independent in each dimension
    - The standard multivariate normal has a zero mean vector and the identity matrix as its covariance matrix
- The encoder maps each input to a mean vector and a log-variance vector
    - As the variance is always positive, predicting the log of the variance lets the output range over (-\infty, \infty)
- We sample a point from the resulting distribution by
    - z = z_{mean} + z_{sigma} \cdot \epsilon
    - z_{sigma} = \exp(z_{log\_var} / 2)
    - \epsilon \sim N(0, I)
- This is the reparameterization trick - it keeps the sampling step differentiable with respect to z_{mean} and z_{log\_var}, so gradients can flow through it during training
# A sampling layer that samples from the distribution defined by z_mean and z_log_var
class Sampling(layers.Layer):
    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        # Reparameterization trick: z = mean + sigma * epsilon, with epsilon ~ N(0, I)
        epsilon = backend.random_normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon
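A tiny usage check of the layer (a sketch; the zero tensors are arbitrary): with z_mean = 0 and z_log_var = 0 it simply returns draws from N(0, I).
demo = Sampling()([tf.zeros((4, 2)), tf.zeros((4, 2))])
print(demo.shape)  # (4, 2) - one 2D latent sample per row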
encoder_input = layers.Input( shape=(32, 32, 1), name="encoder_input" )
x = layers.Conv2D(32, (3, 3), strides=2, activation="relu", padding="same")( encoder_input )
x = layers.Conv2D(64, (3, 3), strides=2, activation="relu", padding="same")(x)
x = layers.Conv2D(128, (3, 3), strides=2, activation="relu", padding="same")(x)
shape_before_flattening = backend.int_shape(x)[1:]
x = layers.Flatten()(x)
z_mean = layers.Dense(2, name="z_mean")(x)
z_log_var = layers.Dense(2, name="z_log_var")(x)
z = Sampling()([z_mean, z_log_var])
encoder = models.Model(encoder_input, [z_mean, z_log_var, z], name="encoder")
class VAE(models.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super(VAE, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.total_loss_tracker = metrics.Mean(name="total_loss")
        self.reconstruction_loss_tracker = metrics.Mean(name="reconstruction_loss")
        self.kl_loss_tracker = metrics.Mean(name="kl_loss")

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

    def call(self, inputs):
        z_mean, z_log_var, z = self.encoder(inputs)
        reconstruction = self.decoder(z)
        return z_mean, z_log_var, reconstruction

    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var, reconstruction = self(data)
            # Reconstruction loss: binary cross-entropy over pixels, scaled by a weighting factor of 500
            reconstruction_loss = tf.reduce_mean(
                500 * losses.binary_crossentropy(data, reconstruction, axis=(1, 2, 3))
            )
            # KL divergence between N(z_mean, exp(z_log_var)) and the standard normal
            kl_loss = tf.reduce_mean(
                tf.reduce_sum(
                    -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var)),
                    axis=1,
                )
            )
            total_loss = reconstruction_loss + kl_loss
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {m.name: m.result() for m in self.metrics}
vae = VAE(encoder, decoder)
vae.compile(optimizer="adam")
vae.fit(x_train, epochs=5, batch_size=100)
Epoch 1/5
600/600 [==============================] - 61s 99ms/step - total_loss: 141.5739 - reconstruction_loss: 136.5573 - kl_loss: 5.0166
Epoch 2/5
600/600 [==============================] - 61s 102ms/step - total_loss: 133.5628 - reconstruction_loss: 128.5448 - kl_loss: 5.0181
Epoch 3/5
600/600 [==============================] - 60s 99ms/step - total_loss: 132.3981 - reconstruction_loss: 127.2663 - kl_loss: 5.1318
Epoch 4/5
600/600 [==============================] - 57s 95ms/step - total_loss: 131.6895 - reconstruction_loss: 126.4955 - kl_loss: 5.1940
Epoch 5/5
600/600 [==============================] - 57s 96ms/step - total_loss: 131.1532 - reconstruction_loss: 125.8996 - kl_loss: 5.2536
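A quick follow-up sketch of generation with the trained VAE: because the KL term pulls the latent distribution towards N(0, I), we can sample latent points directly from the standard normal and decode them (this reuses the decoder object trained inside the VAE):
# Sample latent points from the standard normal and decode them into new images
z_sample = np.random.normal(size=(18, 2))
generated = decoder.predict(z_sample)
plt.imshow(generated[0, :, :, 0], cmap="gray")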