# U1: How Transformers work

```python
from transformers import pipeline
```
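
A minimal sketch of how this import is typically used; the task name, example sentences, and printed labels are illustrative, and the default checkpoint for the task is downloaded on first use.

```python
from transformers import pipeline

# Load a ready-made sentiment-analysis pipeline (uses a default
# fine-tuned checkpoint for this task).
classifier = pipeline("sentiment-analysis")

# Run inference on a couple of sentences.
results = classifier(["I love this course!", "This is confusing."])
print(results)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}, ...]
```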

# About transformers

The Transformer architecture was introduced in June 2017, in the paper "Attention Is All You Need".

This was followed by the release of several language models that showcase self-supervised learning, wherein the labels are automatically computed from the inputs:

  • GPT-like (also called auto-regressive Transformer models)
  • BERT-like (also called auto-encoding Transformer models)
  • BART/T5-like (also called sequence-to-sequence Transformer models)

A pretrained language model by itself is not very useful for a specific task, so transfer learning is applied: the model is fine-tuned in a supervised way for a given task.

The general strategy to achieve better performance is to increase the models’ sizes as well as the amount of data they are pretrained on.

Sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community.

  • Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.
  • Checkpoints: These are the weights that will be loaded in a given architecture.
  • Model: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.
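
A small sketch of the architecture-vs-checkpoint distinction using the `transformers` API; the checkpoint name is just a common public BERT checkpoint chosen for illustration.

```python
from transformers import BertConfig, BertModel

# Architecture only: the config defines the layers/operations, and the
# resulting model is randomly initialized (no knowledge yet).
config = BertConfig()
random_model = BertModel(config)

# Architecture + checkpoint: the same skeleton, but with trained weights
# loaded into it.
pretrained_model = BertModel.from_pretrained("bert-base-cased")
```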

# Transfer learning

Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.

Fine-tuning, on the other hand, is the training done after a model has been pretrained. The knowledge the pretrained model has acquired is “transferred,” hence the term transfer learning.
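
As a sketch of what the transfer looks like in code, the snippet below loads a pretrained body and attaches a fresh classification head; the checkpoint name and the two-label setup are illustrative assumptions.

```python
from transformers import AutoModelForSequenceClassification

# The pretrained body is reused as-is; only the new classification head
# (for a 2-label task here) starts from randomly initialized weights.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# Fine-tuning then trains this model on labeled data for the target task.
```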

# Transformer architecture


The model is composed of two blocks:

  • Encoder (left) - The encoder receives an input and builds a representation of it.
  • Decoder (right) - The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence.

Types of models

  • Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
  • Decoder-only models: Good for generative tasks such as text generation.
  • Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.

# Attention layers

Transformer models are built with special layers called attention layers. These layers tell the model to pay specific attention to certain words in the sentence it is given.

For example, when translating the English sentence “You like this course” to French, the model needs to look at the adjacent word "You" to translate "like" properly, because the French verb is conjugated differently depending on the subject; the rest of the sentence does not matter for that word.

A word by itself has a meaning, but that meaning is deeply affected by the context.
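
To make this concrete, here is a minimal scaled dot-product attention sketch in PyTorch (the core computation inside an attention layer, with multiple heads and learned projections left out); names and shapes are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d) tensors of queries, keys and values.
    d = q.size(-1)
    # How much each word (query) matches every other word (key).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    if mask is not None:
        # Hidden positions get -inf so softmax assigns them ~0 weight.
        scores = scores.masked_fill(~mask, float("-inf"))
    # Attention weights: how much attention each word pays to every other word.
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Toy self-attention: 1 sentence of 4 "words" with 8-dimensional vectors.
x = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 4, 8])
```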

# Original architecture

The Transformer architecture was originally designed for translation.

During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language.

In the encoder, the attention layers can use all the words in a sentence. The decoder, however, works sequentially and can only pay attention to the words it has already translated, that is, only the words before the word currently being generated.

For example, when we have predicted the first three words of the translated target, we give them to the decoder, which then uses the output of the encoder (representing the whole input sentence) to try to predict the fourth word.

To speed things up during training, the decoder is fed the whole target, but it is not allowed to use future words (if it could see the word it is trying to predict, the task would be trivial).

The first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder.

The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the special padding word used to make all the inputs the same length when batching together sentences.
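
A small sketch of the two masks just described, assuming a toy batch where token ID 0 is the padding token; the resulting boolean mask could be passed as the `mask` argument of the attention sketch shown earlier.

```python
import torch

seq_len = 5

# Causal mask: position i may only attend to positions j <= i (no future words).
causal_mask = torch.ones(seq_len, seq_len).tril().bool()

# Hypothetical batch of token IDs where 0 is the padding ID.
input_ids = torch.tensor([[11, 42, 7, 99, 0]])
padding_mask = input_ids != 0                      # (batch, seq_len)

# Combined decoder mask: a position is visible only if it is not in the
# future and is not padding.
mask = causal_mask.unsqueeze(0) & padding_mask.unsqueeze(1)
print(mask)
```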

# Encoder models (BERT)

Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.

For a given input sentence, the encoder outputs a feature vector/tensor (a sequence of numbers) per word, which is that word's numerical representation. The dimension of this vector is defined by the architecture (768 for base BERT).

The representation of a word also takes the surrounding context into account, and hence is called a contextualized representation. This is achieved by the self-attention mechanism, which relates different words in a single sentence to each other in order to compute a representation of the sentence.
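
A sketch of those per-word feature vectors; the checkpoint name and example sentence are illustrative, and the last dimension (768) comes from the base BERT architecture.

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "bert-base-uncased"  # illustrative encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer("Welcome to NYC", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextualized vector per token: (batch, sequence_length, 768).
print(outputs.last_hidden_state.shape)
```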

Pretraining usually consists of corrupting a part of the sentence (for example by masking random words) and tasking the model with reconstructing it.

They are useful for

  • Sentence classification (sentiment analysis)
  • Named entity recognition
  • Question answering
  • Masked language modelling
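
The masked language modelling objective can be tried directly with the `fill-mask` pipeline; the checkpoint and the `[MASK]` token shown below are those used by BERT-style models and are illustrative choices.

```python
from transformers import pipeline

# BERT-style encoders are pretrained to reconstruct masked words.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("Transformers are [MASK] models."):
    print(prediction["token_str"], round(prediction["score"], 3))
```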

# Decoder models (GPT, CTRL)

Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.

The architecture is similar to the encoder's, but it differs in using masked self-attention: the words to the left or to the right are masked, so only one side is used as context, and these models are hence called unidirectional.

Auto-regressive models reuse past outputs as inputs in following steps.

The pretraining of decoder models usually revolves around predicting the next word in the sentence; this objective, which drives text generation, is known as causal language modelling.
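
Causal language modelling can be seen in action with the `text-generation` pipeline; GPT-2 is used below only as a well-known public decoder checkpoint, and the prompt and generation length are illustrative.

```python
from transformers import pipeline

# A decoder-only model repeatedly predicts the next word to generate text.
generator = pipeline("text-generation", model="gpt2")
outputs = generator("In this course, we will teach you how to", max_new_tokens=20)
print(outputs[0]["generated_text"])
```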

# Sequence to sequence models (BART, T5)

Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.

As for the architecture, the output of the encoder is passed to the decoder along with additional decoder inputs (a start-of-sequence word). Once the decoder starts outputting words, the encoder does not need to be run again; its output is reused while the decoder works in an auto-regressive manner. The two parts do not necessarily share weights.

One example is the T5 model which is pretrained by replacing random spans of text with a single mask special word, and the objective is then to predict the text that this mask word replaces.

Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
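
As a closing sketch, the translation example from the attention section can be run with a sequence-to-sequence checkpoint; the small public T5 checkpoint below is an illustrative choice.

```python
from transformers import pipeline

# T5 is a sequence-to-sequence model: the encoder reads the English
# sentence and the decoder generates the French one auto-regressively.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("You like this course"))
# e.g. [{'translation_text': 'Vous aimez ce cours'}]
```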