# W4 Built-in Algorithms

# Why use built-in algorithms

  • Implementations are highly optimized and scalable, with support for GPUs and distributed systems
  • You can focus on domain-specific tasks rather than low-level code
  • Trained models can be downloaded and reused

# Usage timeline

  • If the task is simple and widely used, go for a built-in algorithm
  • If the task is more niche, use script mode, where you write the training script with a Python framework
  • For the highest level of customization, bring your own container

# Built-in examples

  • Classification - XGBoost, KNN
  • Regression - Linear, XGBoost
  • Time series forecasting - DeepAR forecasting (uses RNNs)
  • Dimensionality reduction - PCA
  • Anomaly detection - Random Cut Forest (RCF)
  • Clustering - KMeans
  • Topic modeling - Latent Dirichlet Allocation (LDA), Neural Topic Model (NTM)
  • Content moderation - Image classification
  • Object detection
  • Semantic segmentation
  • Machine translation
  • Text summarization
  • Speech to text
  • Text classification

# Text analysis

  • Word2Vec
    • Converts text into vectors (embeddings)
    • Architectures to create embeddings
      • Continuous bag of words (CBOW)
      • Continuous skip-gram
  • GloVe
  • FastText
    • Extension on Word2Vec
    • Breaks word into character n-grams
    • Embedding is aggregate of embedding of each n-gram within the word
  • Transformers
    • Uses self-attention
  • BlazingText (used here because it is the AWS built-in)
    • Scales Word2Vec to distributed compute
    • Extends FastText to use GPUs with CUDA
    • Saves money through early stopping
    • Optimized dataset I/O
  • ELMo
    • Bidirectional language model (biLM) built on LSTMs
  • GPT
  • BERT
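
To make the Word2Vec and FastText bullets above concrete, here is a minimal pure-Python sketch. It shows how CBOW and skip-gram frame their training examples, and how FastText-style character n-grams can be aggregated into a word vector. The function names, the window size, and the toy vectors are illustrative assumptions, not library code. Summing is just one possible aggregation.

```python
from typing import Dict, List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram framing: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW framing: predict the center word from its surrounding context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """FastText-style subwords: wrap the word in < > and slide windows of n chars."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def word_vector(word: str, ngram_vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Aggregate (here: sum) the vectors of the word's known n-grams."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total
```

Because the vector is built from subwords, a word never seen in training can still get an embedding as long as some of its n-grams were seen.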

# Training model

  • BlazingText takes hyperparameters, including:
    • epochs
    • learning_rate
    • vector_dim
    • word_ngrams
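
A sketch of how these hyperparameters might look as a Python dict. The four names from the list above are kept as-is. `mode`, `early_stopping`, `patience`, and `min_epochs` are additional BlazingText hyperparameters as far as I recall from the AWS docs, so verify them before relying on them. All values are illustrative, not recommendations, and `check_hyperparameters` is a hypothetical helper.

```python
from typing import Dict, List, Union

HyperValue = Union[str, int, float, bool]

# Illustrative values only; tune them for your own dataset.
blazingtext_hyperparameters: Dict[str, HyperValue] = {
    "mode": "supervised",     # text classification; Word2Vec modes are cbow / skipgram / batch_skipgram
    "epochs": 10,             # passes over the training data
    "learning_rate": 0.05,    # step size for parameter updates
    "vector_dim": 100,        # embedding dimension
    "word_ngrams": 2,         # number of word n-gram features to use
    "early_stopping": True,   # stop when validation accuracy stops improving
    "patience": 4,            # epochs to wait without improvement
    "min_epochs": 5,          # minimum epochs before early stopping may trigger
}


def check_hyperparameters(params: Dict[str, HyperValue]) -> List[str]:
    """Hypothetical sanity check: return the names of obviously invalid settings."""
    problems = []
    if params.get("mode") not in {"supervised", "cbow", "skipgram", "batch_skipgram"}:
        problems.append("mode")
    for name in ("epochs", "vector_dim", "word_ngrams"):
        value = params.get(name)
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            problems.append(name)
    lr = params.get("learning_rate")
    if not isinstance(lr, (int, float)) or isinstance(lr, bool) or lr <= 0:
        problems.append("learning_rate")
    return problems
```

A dict like this could be passed on as `estimator.set_hyperparameters(**blazingtext_hyperparameters)`.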

# Sentiment analysis

import nltk
import sagemaker

def tokenize(review):
    # Split a raw review string into word tokens
    return nltk.word_tokenize(review)

train = sagemaker.inputs.TrainingInput(...)
val = sagemaker.inputs.TrainingInput(...)

channels = {
    'train': train,
    'val': val,
}

# Docker image
image_uri = sagemaker.image_uris.retrieve(framework='blazingtext', region=...)

estimator = sagemaker.estimator.Estimator(image_uri=image_uri)
estimator.set_hyperparameters(...)
estimator.fit(...)

# Creates API endpoint on EC2
classifier = estimator.deploy(initial_instance_count=1, ...)

payload = {'instances': ['Nice']}
response = classifier.predict(...)
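
The snippet above elides the data format, so here is a rough sketch of the text handling around it. It assumes BlazingText's supervised mode expects one example per line, prefixed with `__label__<label>`, and that the endpoint returns a JSON list of `label` / `prob` entries. Both are my recollection of the AWS docs and should be checked. `simple_tokenize` is a regex stand-in for `nltk.word_tokenize` so the block runs without downloads. All helper names are hypothetical.

```python
import json
import re
from typing import Dict, List, Union


def simple_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (not NLTK): split into word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def to_training_line(label: int, review: str) -> str:
    """Format one example as '__label__<label> tok tok ...' (assumed training format)."""
    return f"__label__{label} " + " ".join(simple_tokenize(review))


def build_payload(reviews: List[str]) -> str:
    """Build a JSON request body with space-joined tokens under the 'instances' key."""
    instances = [" ".join(simple_tokenize(r)) for r in reviews]
    return json.dumps({"instances": instances})


def parse_response(body: str) -> List[Dict[str, Union[str, float]]]:
    """Extract the top label and its probability from each prediction (assumed shape)."""
    results = []
    for item in json.loads(body):
        results.append({
            "label": item["label"][0].replace("__label__", ""),
            "prob": item["prob"][0],
        })
    return results
```

For example, `to_training_line(1, "Nice shirt!")` yields `__label__1 Nice shirt !`, and `build_payload(["Nice"])` produces the JSON equivalent of the `payload` dict above.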