# Intro & Word Vectors

# Introduction

  • Language is not a formal system; rather, it is a system made up by the people who use it (see xkcd: I Could Care Less)
  • It is a means of communication, used to pass knowledge across time and space
  • AI uses the knowledge present in books written in a language to understand that language - creating a virtuous cycle

# How do we represent the meaning of a word

  • Defn
    • Common usage - meaning is the idea that a word represents
    • Linguistics - signifier (symbol) \iff signified (idea or thing) (a.k.a. denotational semantics; not very useful for a computer)
    • For a computer - WordNet and other tools (see the NLTK sketch at the end of this section)
      • A thesaurus containing lists of synonym sets and hypernyms ("is a" relationship)
      • Ex: Panda "is a" animal; good == beneficial
      • Problems
        • Misses context/nuance
        • Misses new meanings/modern terms
        • Subjective
        • Requires humans to alter and adapt the database
        • Cannot compute word similarity
  • Problems with traditional NLP
    • Words are regarded as discrete symbols (a localist representation, in DL terms)
      • i.e. we one-hot encode words as numeric vectors
      • Ex: motel = [0 0 0 0 1 0]
      • Implies we need huge vectors to accommodate the vocabulary
      • No word similarity/relationships - any two one-hot vectors are orthogonal (we could use WordNet synonyms etc., but these fail due to incompleteness); see the one-hot sketch at the end of this section
  • Modern DL methods use distributional semantics
    • A word's meaning is given by the words that frequently appear close to it
    • We encode these similarities along with the symbols themselves
    • When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
    • A single occurrence of a word in text is a token; the word taken across its many occurrences (and hence many contexts) is a type, and those contexts define its meaning
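
The WordNet lookups mentioned above can be reproduced with NLTK's WordNet interface. A minimal sketch, assuming nltk is installed and the wordnet corpus has been downloaded; the words and the synset name `panda.n.01` are just illustrative choices:

```python
# Minimal WordNet sketch via NLTK (assumes: pip install nltk,
# then nltk.download("wordnet") has been run once).
from nltk.corpus import wordnet as wn

# Synonym sets for "good": each synset groups lemmas sharing one meaning.
for synset in wn.synsets("good"):
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])

# Hypernym ("is a") chain for "panda": walk the closure of hypernym links.
panda = wn.synset("panda.n.01")
hypernyms = lambda s: s.hypernyms()
print(list(panda.closure(hypernyms)))
```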

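To make the "no similarity" problem with one-hot encoding concrete, here is a small sketch over a made-up toy vocabulary: any two distinct one-hot vectors have dot product zero, so the representation carries no notion of relatedness.

```python
import numpy as np

# Toy vocabulary, assumed purely for illustration.
vocab = ["motel", "hotel", "panda", "good", "beneficial", "the"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Localist representation: a vector as long as the vocabulary."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

motel, hotel = one_hot("motel"), one_hot("hotel")
print(motel)          # e.g. [1. 0. 0. 0. 0. 0.]
print(motel @ hotel)  # 0.0 -> orthogonal: no similarity signal at all
```
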
# Word embeddings

  • Build a dense vector (typically ~300 dimensions) for each word, unlike the sparse one-hot representation, such that it is similar to the vectors of words that appear in similar contexts
  • This forms a distributed representation
  • They are called embeddings because each word is embedded as a point in a high-dimensional vector space; this space contains clusters of words that are similar in meaning/context (see the sketch below)
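
With dense embeddings, similarity becomes a simple vector operation. A sketch with made-up 4-dimensional vectors (real embeddings are ~300-dimensional and learned from data, not hand-written like these):

```python
import numpy as np

# Made-up dense vectors, purely illustrative; real embeddings are learned.
embeddings = {
    "hotel":  np.array([ 0.8,  0.1, -0.3,  0.5]),
    "motel":  np.array([ 0.7,  0.2, -0.2,  0.4]),
    "banana": np.array([-0.4,  0.9,  0.6, -0.1]),
}

def cosine(u, v):
    """Cosine similarity: dot product of the two vectors, normalised."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["hotel"], embeddings["motel"]))   # high: similar contexts
print(cosine(embeddings["hotel"], embeddings["banana"]))  # low: dissimilar
```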

# Word2vec

  • A framework for learning word vecs
  • Given
    • We have a large corpus of text wherein every word in a fixed vocabulary is represented by a vector, initialised randomly
  • Calculate
    • We go through each position t in the text, which has a center word c and outside (context) words o
    • Use the similarity of the word vecs for c and o to calculate P(o|c), the probability of o given c
    • Adjust the word vecs to maximise this probability
    • First iteration - ![[windows.png]]
    • Subsequent iterations move the center word through the text successively
  • Calculating P(w_{t+j}|w_t; \theta)
    • Use 2 vectors per word w
    • v_w when w is used as a center word
    • u_w when it is used as a context word
    • P(o|c) = \frac{e^{u_o^T v_c}}{\sum_{w\in V}e^{u_w^T v_c}}
    • Notice that this is a softmax function being applied
    • ![[word2vec prediction fn.png]]
      • Dot product is a natural measure of similarity between words (larger implies more similar)
  • Objective function
    • \forall positions t = 1,\ldots, T, predict context words w_{t+j} within a window of fixed size m given center word w_t
    • Data likelihood (how good a job we do at predicting words in the context of other words) is $$L(\theta) = \prod_{t=1}^{T}\;\prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j}\mid w_t; \theta)$$ wherein \theta is all variables to be optimized (we want to maximise this likelihood)
    • Objective function is J(\theta) = -\frac{1}{T}\log L(\theta), the average negative log likelihood (the log converts products into sums, which are easier to work with)
    • Minimizing J \iff maximizing predictive accuracy (the data likelihood)
  • Training
    • To train, like any other model, we adjust params to minimize loss
    • Parameters - \theta = all the word vectors (2 vectors per word, so \theta \in \mathbb{R}^{2dV} for embedding dimension d and vocabulary size V) = ![[Pasted image 20240704193407.png]]
    • We compute gradients with respect to all the vectors
      • Calculate \frac{\partial}{\partial v_c}\log P(o|c) and \frac{\partial}{\partial u_w}\log P(o|c); for the center vector this works out to \frac{\partial}{\partial v_c}\log P(o|c) = u_o - \sum_{w\in V} P(w|c)\,u_w (observed context vector minus expected context vector) - see the numpy sketch at the end of this section
    • The probabilities we end up with are usually small, but we adjust the word vecs to make them as large as possible
  • Usecases
    • Analogy (relationships)
    • Word similarity
  • Notes
    • The two representations of a word end up quite similar (though not identical); after training we average them to obtain one vector per word
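
Putting the pieces above together, a toy numpy sketch of skip-gram training with the naive full softmax (no negative sampling). The corpus, embedding dimension, window size, learning rate and epoch count are made-up illustrative choices, not values from the lecture.

```python
import numpy as np

# Toy corpus and hyperparameters, assumed purely for illustration.
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
V, d, m, lr = len(vocab), 10, 2, 0.05   # vocab size, embedding dim, window, learning rate
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
u_vecs = rng.normal(scale=0.1, size=(V, d))  # u_w: used when w is a context (outside) word
v_vecs = rng.normal(scale=0.1, size=(V, d))  # v_w: used when w is a center word

def prob_given_center(c):
    """Softmax over the vocabulary: P(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = u_vecs @ v_vecs[c]              # dot products u_w . v_c for every w
    scores -= scores.max()                   # for numerical stability
    e = np.exp(scores)
    return e / e.sum()

for epoch in range(51):
    neg_log_lik, pairs = 0.0, 0
    for t, word in enumerate(corpus):        # each position t has a center word c ...
        c = idx[word]
        for j in range(-m, m + 1):           # ... and outside words within the window
            if j == 0 or not (0 <= t + j < len(corpus)):
                continue
            o = idx[corpus[t + j]]
            p = prob_given_center(c)
            neg_log_lik += -np.log(p[o])
            pairs += 1
            # Gradients of -log P(o|c):
            #   w.r.t. v_c:  -(u_o - sum_w P(w|c) u_w)
            #   w.r.t. u_w:  (P(w|c) - 1[w == o]) v_c
            grad_v = -(u_vecs[o] - p @ u_vecs)
            grad_u = np.outer(p, v_vecs[c])
            grad_u[o] -= v_vecs[c]
            v_vecs[c] -= lr * grad_v
            u_vecs -= lr * grad_u
    if epoch % 10 == 0:
        print(f"epoch {epoch}: avg -log P(o|c) = {neg_log_lik / pairs:.3f}")

# As in the note above: average the two representations to get one vector per word.
word_vectors = (u_vecs + v_vecs) / 2.0
print(word_vectors[idx["fox"]])
```

The average negative log probability printed each tenth epoch should fall as training proceeds, which is the minimisation of J(\theta) described in the Objective function bullet (up to a constant factor from averaging over pairs rather than positions).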