Intro & Word Vectors
Introduction
- Language is not a formal system; rather, it is a system made up by the people who use it (xkcd: I Could Care Less)
- It is a means of communication, used to pass knowledge across time and space
- AI systems take the knowledge present in books written in a language and use it to understand that language - creating a virtuous cycle
How do we represent the meaning of a word?
- Defn
- Common usage - meaning is the idea that a word represents
- Linguistics - signifier (symbol) \iff signified (idea or thing) (a.k.a. denotational semantics; not very useful for a computer)
- For a computer - WordNet and other tools
- A thesaurus containing lists of synonym sets and hypernyms ("is a" relationships); see the NLTK sketch after this list
- Ex: Panda "is a" animal; good == beneficial
- Problems
- Misses context/nuance
- Misses new meanings/modern terms
- Subjective
- Requires humans to alter and adapt the database
- Cannot compute word similarity
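A minimal sketch of the kind of WordNet lookup described above, via NLTK's WordNet interface (NLTK is an assumption here, not something the notes mention, and it needs the WordNet corpus downloaded first):

```python
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") for "good" - each lemma in a synset is a synonym
for synset in wn.synsets("good")[:3]:
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])

# Hypernym ("is a") chain for "panda", walked transitively up the hierarchy
panda = wn.synset("panda.n.01")
hyper = lambda s: s.hypernyms()
print([s.name() for s in panda.closure(hyper)])
```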
- Problems with traditional NLP
- Words are regarded as discrete symbols (a localist representation, in DL terms)
- i.e. we one-hot encode words into numeric vectors
- Ex: motel = [0 0 0 0 1 0]
- Implies we need huge vectors to accommodate the vocabulary
- No word similarity/relationships, since any two distinct one-hot vectors are orthogonal (could use WordNet synonyms etc., but these fail due to incompleteness); see the sketch below
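A quick illustration of why one-hot vectors carry no similarity information: the dot product of any two distinct one-hot vectors is zero (the toy vocabulary below is made up):

```python
import numpy as np

# Tiny hypothetical vocabulary; real vocabularies have hundreds of thousands
# of words, so one-hot vectors become huge and sparse.
vocab = ["a", "hotel", "motel", "the", "walked", "zebra"]

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

hotel, motel = one_hot("hotel"), one_hot("motel")
# Distinct one-hot vectors are orthogonal: the representation says nothing
# about "hotel" and "motel" being related.
print(hotel @ motel)  # 0.0
```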
- Modern DL methods use distributional semantics
- A word's meaning is given by the words that frequently appear close to it
- We try to encode these similarities along with the symbols themselves
- When a word w appears in text, its context is a set of words that appear nearby
- Each occurrence of a word in running text is a token; the word considered across all the sentences/contexts it appears in is a type, and those contexts define its meaning
Word embeddings
- Build a dense vector (usually ~300 dimensions) for each word (unlike the sparse one-hot vectors above) such that it is similar to the vectors of words that appear in similar contexts
- Forms a distributed repr
- They are called embeddings because each word is embedded as a point in a high-dimensional vector space; words that are similar in meaning/context form clusters in that space (see the cosine-similarity sketch below)
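A toy sketch of the "similar context ⇒ similar vector" idea using cosine similarity; the words and numbers below are invented for illustration (real embeddings are learned and ~300-dimensional):

```python
import numpy as np

# Made-up 4-dimensional "embeddings" just to show the comparison
vecs = {
    "hotel": np.array([0.8, 0.1, 0.6, 0.2]),
    "motel": np.array([0.7, 0.2, 0.5, 0.1]),
    "zebra": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(u, v):
    # Cosine similarity: dot product normalised by vector lengths
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(vecs["hotel"], vecs["motel"]))  # high: similar contexts
print(cosine(vecs["hotel"], vecs["zebra"]))  # low: unrelated contexts
```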
Word2vec
- A framework for learning word vecs
- Given
- We have a large corpus wherein every word in a fixed vocabulary is repr by a random vector
- Calculate
- We go through each position t in a text which has a center word c and outside words/context o
- Use similarity of word vecs for c and o to calculate P(o|c)
- Adjust word vecs to maximise the P
- First iteration - ![[windows.png]]
- Further iterations move the center word forward one position at a time (see the window sketch below)
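A sketch of how the sliding window generates (center, outside) pairs as the center word moves along the text; the function name and example sentence are my own, not from the lecture:

```python
# Generate (center, outside) training pairs with a window of size m
def training_pairs(tokens, m=2):
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

sentence = "problems turning into banking crises as".split()
# e.g. center word "into" is paired with "problems", "turning", "banking", "crises"
print(training_pairs(sentence))
```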
- Calculating P(w_{t+j}|w_t; \theta)
- Use 2 vectors per word w
- v_w when w is used as a center word
- u_w when it is used as a context word
- P(o|c) = \frac{e^{u_o^T v_c}}{\sum_{w\in V}e^{u_w^T v_c}}
- Notice that this is a softmax function being applied
- ![[word2vec prediction fn.png]]
- Dot product is a natural measure of similarity between words (larger implies more similar)
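A numeric sketch of this prediction step, assuming toy shapes and random vectors: the rows of U are the outside vectors u_w, and v_c is the center word's vector.

```python
import numpy as np

V, d = 6, 4                        # vocab size, embedding dimension (toy values)
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))        # u_w for every word w in the vocabulary
v_c = rng.normal(size=d)           # v_c for the current center word

scores = U @ v_c                   # dot products u_w^T v_c (similarity scores)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

o = 3                              # index of the observed outside word
print(probs[o], probs.sum())       # P(o|c), and the probabilities sum to 1
```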
- Objective function
- \forall positions t = 1,\ldots, T, predict context words w_{t+j} within a window of size m given center word w_t
- Data likelihood (how good a job we do at predicting words in the context of other words) is $$L(\theta) = \prod_{t=1}^{T}\;\prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j}\mid w_t; \theta)$$ wherein \theta is all the variables to be optimized (we want to maximise this likelihood)
- Objective function is J(\theta) = -\frac{1}{T}\log L(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-m \le j \le m \\ j \ne 0}}\log P(w_{t+j}\mid w_t;\theta) (we convert products to sums with log as it's easier to deal with)
- Minimizing J \iff Maximizing accuracy
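A sketch of computing J(\theta) directly from its definition, reusing the same kind of toy setup (random vectors, tiny "corpus"); this is illustrative only, not the lecture's code:

```python
import numpy as np

def neg_log_likelihood(tokens, word2idx, U, Vmat, m=2):
    """Average negative log likelihood over all (center, outside) pairs."""
    loss, T = 0.0, len(tokens)
    for t in range(T):
        v_c = Vmat[word2idx[tokens[t]]]       # center word vector
        probs = np.exp(U @ v_c)
        probs = probs / probs.sum()           # softmax P(. | center)
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < T:
                loss -= np.log(probs[word2idx[tokens[t + j]]])
    return loss / T

tokens = "the quick brown fox jumps over the lazy dog".split()
word2idx = {w: i for i, w in enumerate(dict.fromkeys(tokens))}
rng = np.random.default_rng(0)
d = 4
U = rng.normal(size=(len(word2idx), d))       # outside vectors u_w
Vmat = rng.normal(size=(len(word2idx), d))    # center vectors v_w
print(neg_log_likelihood(tokens, word2idx, U, Vmat))
```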
- Training
- To train, like any other model, we adjust params to minimize loss
- Parameters - \theta = all word vectors stacked into one long vector (2 vectors per word, so \theta \in \mathbb{R}^{2dV} for d-dimensional vectors and vocabulary size V) - ![[Pasted image 20240704193407.png]]
- We compute all vector gradients
- Calculate \frac{\partial}{\partial v_c}\log P(o|c) and \frac{\partial}{\partial u_o}\log P(o|c) (the first is sketched below)
- The probabilities we end up with are fairly small in general, but we keep adjusting the word vecs to push them as high as possible
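Working out the center-word gradient (a sketch of the standard "observed minus expected" result):

$$
\frac{\partial}{\partial v_c}\log P(o\mid c)
= \frac{\partial}{\partial v_c}\Big(u_o^T v_c - \log\sum_{w\in V} e^{u_w^T v_c}\Big)
= u_o - \sum_{x\in V}\frac{e^{u_x^T v_c}}{\sum_{w\in V} e^{u_w^T v_c}}\,u_x
= u_o - \sum_{x\in V} P(x\mid c)\,u_x
$$

i.e. the observed outside vector u_o minus the model's current expectation of the outside vector, which is what gradient descent pushes towards zero.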
- Usecases
- Analogy (relationships)
- Word similarity
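A sketch of both use cases with pretrained vectors loaded through gensim's downloader; gensim and the model name below are assumptions (not part of the notes), and the download is large:

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pretrained word2vec vectors

# Word similarity: nearest neighbours in the embedding space
print(wv.most_similar("frog", topn=5))

# Analogy via vector arithmetic: king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```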
- Notes
- The two representations (u_w and v_w) for a word end up quite similar (though not identical); we average them to get one final vector per word.
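A one-line sketch of that averaging step, assuming the two learned matrices are stored with one row per word (toy shapes and random values here):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
U_outside = rng.normal(size=(vocab_size, dim))   # u_w vectors after training
V_center = rng.normal(size=(vocab_size, dim))    # v_w vectors after training
word_vectors = (U_outside + V_center) / 2.0      # one final vector per word
```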