Intro & Word Vectors
Introduction
- Language is not a formal system; rather, it is a system made up by the people who use it (xkcd: I Could Care Less)
- It is a means of communication, used to pass knowledge across time and space
- AI systems take the knowledge present in books written in a language and use it to understand that language - creating a virtuous cycle
How do we represent the meaning of a word?
- Defn
- Common usage - meaning is the idea that a word represents
- Linguistics - signifier (symbol) \iff signified (idea or thing) (a.k.a. denotational semantics; not very useful for a computer)
- For a computer - WordNet and other tools
- A thesaurus containing lists of synonym sets and hypernyms ("is a" relationships); see the NLTK sketch after this list
- Ex: Panda "is a" animal; good == beneficial
- Problems
- Misses context/nuance
- Misses new meanings/modern terms
- Subjective
- Requires humans to alter and adapt the database
- Cannot compute word similarity
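A minimal sketch of the kind of WordNet lookup described above, via NLTK's WordNet interface (NLTK is an assumption here, not something the notes mention, and it needs the WordNet corpus downloaded first):

```python
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") for "good" - each lemma in a synset is a synonym
for synset in wn.synsets("good")[:3]:
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])

# Hypernym ("is a") chain for "panda", walked transitively up the hierarchy
panda = wn.synset("panda.n.01")
hyper = lambda s: s.hypernyms()
print([s.name() for s in panda.closure(hyper)])
```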
- Problems with traditional NLP
- Words are regarded as discrete symbols (a localist representation, in DL terms)
- i.e. we one-hot encode words into numeric vectors
- Ex: motel = [0 0 0 0 1 0]
- Implies we need huge vectors to accommodate the vocabulary
- No word similarity/relationships, since any two distinct one-hot vectors are orthogonal (could use WordNet synonyms etc., but these fail due to incompleteness); see the sketch below
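A quick illustration of why one-hot vectors carry no similarity information: the dot product of any two distinct one-hot vectors is zero (the toy vocabulary below is made up):

```python
import numpy as np

# Tiny hypothetical vocabulary; real vocabularies have hundreds of thousands
# of words, so one-hot vectors become huge and sparse.
vocab = ["a", "hotel", "motel", "the", "walked", "zebra"]

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

hotel, motel = one_hot("hotel"), one_hot("motel")
# Distinct one-hot vectors are orthogonal: the representation says nothing
# about "hotel" and "motel" being related.
print(hotel @ motel)  # 0.0
```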
- Modern DL methods use distributional semantics
- A word's meaning is given by the words that frequently appear close to it
- We try to encode these similarities along with the symbols themselves
- When a word w appears in text, its context is a set of words that appear nearby
- Each occurrence of a word in running text is a token; the word considered across all the sentences/contexts it appears in is a type, and those contexts define its meaning
Word embeddings
- Build a dense vector (usually ~300 dimensions) for each word (unlike the sparse one-hot vectors above) such that it is similar to the vectors of words that appear in similar contexts
- Forms a distributed repr
- They are called embeddings because each word is embedded as a point in a high-dimensional vector space; words that are similar in meaning/context form clusters in that space (see the cosine-similarity sketch below)
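A toy sketch of the "similar context ⇒ similar vector" idea using cosine similarity; the words and numbers below are invented for illustration (real embeddings are learned and ~300-dimensional):

```python
import numpy as np

# Made-up 4-dimensional "embeddings" just to show the comparison
vecs = {
    "hotel": np.array([0.8, 0.1, 0.6, 0.2]),
    "motel": np.array([0.7, 0.2, 0.5, 0.1]),
    "zebra": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(u, v):
    # Cosine similarity: dot product normalised by vector lengths
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(vecs["hotel"], vecs["motel"]))  # high: similar contexts
print(cosine(vecs["hotel"], vecs["zebra"]))  # low: unrelated contexts
```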
Word2vec
- A framework for learning word vecs
- Given
- We have a large corpus wherein every word in a fixed vocabulary is repr by a random vector
- Calculate
- We go through each position t in a text which has a center word c and outside words/context o
- Use similarity of word vecs for c and o to calculate P(o|c)
- Adjust word vecs to maximise the P
- First iteration - ![[windows.png]]
- Further iterations move the center word forward one position at a time (see the window sketch below)
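A sketch of how the sliding window generates (center, outside) pairs as the center word moves along the text; the function name and example sentence are my own, not from the lecture:

```python
# Generate (center, outside) training pairs with a window of size m
def training_pairs(tokens, m=2):
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

sentence = "problems turning into banking crises as".split()
# e.g. center word "into" is paired with "problems", "turning", "banking", "crises"
print(training_pairs(sentence))
```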
- Calculating P(w_{t+j}|w_t; \theta)
- Use 2 vectors per word w
- v_w when w is used as a center word
- u_w when it is used as a context word
- P(o|c) = \frac{e^{u_o^T v_c}}{\sum_{w\in V}e^{u_w^T v_c}}
- Notice that this is a softmax function being applied
- ![[word2vec prediction fn.png]]
- Dot product is a natural measure of similarity between words (larger implies more similar)
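A numeric sketch of this prediction step, assuming toy shapes and random vectors: the rows of U are the outside vectors u_w, and v_c is the center word's vector.

```python
import numpy as np

V, d = 6, 4                        # vocab size, embedding dimension (toy values)
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))        # u_w for every word w in the vocabulary
v_c = rng.normal(size=d)           # v_c for the current center word

scores = U @ v_c                   # dot products u_w^T v_c (similarity scores)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

o = 3                              # index of the observed outside word
print(probs[o], probs.sum())       # P(o|c), and the probabilities sum to 1
```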
- Objective function
- \forall positions t = 1,\ldots, T, predict context words w_{t+j} within a window of size m given center word w_t
- Data likelihood (how good a job we do at predicting words in the context of other words) is $$L(\theta) = \prod_{t=1}^{T}\;\prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j}\mid w_t; \theta)$$ wherein \theta is all the variables to be optimized (we want to maximise this likelihood)
- Objective function is J(\theta) = -\frac{1}{T}\log L(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-m \le j \le m \\ j \ne 0}}\log P(w_{t+j}\mid w_t;\theta) (we convert products to sums with log as it's easier to deal with)
- Minimizing J \iff Maximizing accuracy
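A sketch of computing J(\theta) directly from its definition, reusing the same kind of toy setup (random vectors, tiny "corpus"); this is illustrative only, not the lecture's code:

```python
import numpy as np

def neg_log_likelihood(tokens, word2idx, U, Vmat, m=2):
    """Average negative log likelihood over all (center, outside) pairs."""
    loss, T = 0.0, len(tokens)
    for t in range(T):
        v_c = Vmat[word2idx[tokens[t]]]       # center word vector
        probs = np.exp(U @ v_c)
        probs = probs / probs.sum()           # softmax P(. | center)
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < T:
                loss -= np.log(probs[word2idx[tokens[t + j]]])
    return loss / T

tokens = "the quick brown fox jumps over the lazy dog".split()
word2idx = {w: i for i, w in enumerate(dict.fromkeys(tokens))}
rng = np.random.default_rng(0)
d = 4
U = rng.normal(size=(len(word2idx), d))       # outside vectors u_w
Vmat = rng.normal(size=(len(word2idx), d))    # center vectors v_w
print(neg_log_likelihood(tokens, word2idx, U, Vmat))
```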
- Training
- To train, like any other model, we adjust params to minimize loss
- Parameters - \theta = all word vectors stacked into one long vector (2 vectors per word, so \theta \in \mathbb{R}^{2dV} for d-dimensional vectors and vocabulary size V) - ![[Pasted image 20240704193407.png]]
- We compute all vector gradients
- Calculate \frac{\partial}{\partial v_c}\log P(o|c) and \frac{\partial}{\partial u_o}\log P(o|c) (the first is sketched below)
- The probabilities we end up with are fairly small in general, but we keep adjusting the word vecs to push them as high as possible
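Working out the center-word gradient (a sketch of the standard "observed minus expected" result):

$$
\frac{\partial}{\partial v_c}\log P(o\mid c)
= \frac{\partial}{\partial v_c}\Big(u_o^T v_c - \log\sum_{w\in V} e^{u_w^T v_c}\Big)
= u_o - \sum_{x\in V}\frac{e^{u_x^T v_c}}{\sum_{w\in V} e^{u_w^T v_c}}\,u_x
= u_o - \sum_{x\in V} P(x\mid c)\,u_x
$$

i.e. the observed outside vector u_o minus the model's current expectation of the outside vector, which is what gradient descent pushes towards zero.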
- Usecases
- Analogy (relationships)
- Word similarity
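A sketch of both use cases with pretrained vectors loaded through gensim's downloader; gensim and the model name below are assumptions (not part of the notes), and the download is large:

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pretrained word2vec vectors

# Word similarity: nearest neighbours in the embedding space
print(wv.most_similar("frog", topn=5))

# Analogy via vector arithmetic: king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```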
- Notes
- The two representations (u_w and v_w) for a word end up quite similar (though not identical); we average them to get one final vector per word.
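A one-line sketch of that averaging step, assuming the two learned matrices are stored with one row per word (toy shapes and random values here):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
U_outside = rng.normal(size=(vocab_size, dim))   # u_w vectors after training
V_center = rng.normal(size=(vocab_size, dim))    # v_w vectors after training
word_vectors = (U_outside + V_center) / 2.0      # one final vector per word
```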