#
W4 Built-in Algorithms
#
Why use built-in algorithms
- Implementations are highly optimized and scalable, with support for GPUs and distributed training
- Lets you focus on domain-specific tasks rather than low-level code
- Trained models can be downloaded and reused
#
Choosing an approach
- If the task is common and well supported, use a built-in algorithm
- If the task is more niche, use script mode: write your own training script with a supported Python framework while SageMaker manages the infrastructure
- For the highest level of customization, bring your own container
#
Built-in examples
- Classification - XGBoost, KNN
- Regression - Linear, XGBoost
- Time series forecasting - DeepAR forecasting (uses RNNs)
- Dimensionality reduction - PCA
- Anomaly detection - Random Cut Forest (RCF)
- Clustering - KMeans
- Topic modeling - Latent Dirichlet Allocation (LDA), Neural Topic Model (NTM)
- Content moderation - Image classification
- Object detection
- Semantic segmentation
- Machine translation
- Text summarization
- Speech to text
- Text classification
#
Text analysis
- Word2Vec
- Converts text into vectors (embeddings)
- Architectures to create embeddings
- Continuous bag of words (CBOW)
- Continuous skip-gram
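The two architectures differ in the direction of prediction: CBOW predicts a center word from its context, skip-gram predicts each context word from the center word. A minimal sketch of the training pairs each one generates (the toy sentence and window size below are illustrative):

```python
def cbow_pairs(tokens, window=2):
    # CBOW: predict the center word from its surrounding context words
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    # Skip-gram: predict each context word from the center word
    pairs = []
    for i, target in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((target, ctx))
    return pairs

sentence = "the service was really great".split()
print(cbow_pairs(sentence)[2])       # (['the', 'service', 'really', 'great'], 'was')
print(skipgram_pairs(sentence)[:2])  # [('the', 'service'), ('the', 'was')]
```

Skip-gram produces more training pairs per sentence, which tends to help with rare words; CBOW trains faster.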
- GloVe
- FastText
- Extension on Word2Vec
- Breaks word into character n-grams
- Embedding is aggregate of embedding of each n-gram within the word
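For example, with n = 3 and boundary markers, a sketch of how FastText breaks a word into character n-grams (the helper below is illustrative, not FastText's actual code):

```python
def char_ngrams(word, n=3):
    # FastText wraps the word in boundary markers before slicing,
    # so prefixes/suffixes get distinct n-grams
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because the word vector is an aggregate of these n-gram vectors, FastText can produce embeddings even for words never seen during training.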
- Transformers
- Uses self-attention
- BlazingText (we use this since it's AWS's own implementation)
- Scales Word2Vec to distributed compute
- Extends FastText to use GPU with CUDA
- Saves money through early stopping
- Optimized dataset I/O
- ELMo
- Bidirectional language models
- GPT
- BERT
#
Training the model
- BlazingText accepts hyperparameters such as:
- epochs
- learning_rate
- vector_dim
- word_ngrams
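A hypothetical hyperparameter dict covering the settings above (the values are illustrative, not tuned recommendations, and `mode` is an additional assumption — BlazingText also supports `cbow` / `skipgram` / `batch_skipgram` modes for embeddings):

```python
# Illustrative BlazingText hyperparameters for supervised text classification
blazingtext_hyperparams = {
    "mode": "supervised",   # classification mode (assumed here)
    "epochs": 10,           # passes over the training data
    "learning_rate": 0.05,  # step size for gradient updates
    "vector_dim": 100,      # dimensionality of the word embeddings
    "word_ngrams": 2,       # use unigrams and bigrams as features
}
```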
# Sentiment analysis
import nltk
import sagemaker

def tokenize(review):
    return nltk.word_tokenize(review)

train = sagemaker.inputs.TrainingInput(...)
val = sagemaker.inputs.TrainingInput(...)
channels = {
    'train': train,
    'val': val,
}

# Docker image with the BlazingText algorithm
image_uri = sagemaker.image_uris.retrieve(framework='blazingtext', region=...)
estimator = sagemaker.estimator.Estimator(image_uri=image_uri, ...)
estimator.set_hyperparameters(...)
estimator.fit(inputs=channels)

# Deploys the model behind a real-time API endpoint
classifier = estimator.deploy(initial_instance_count=1, ...)
payload = {'instances': ['Nice']}
response = classifier.predict(...)
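A sketch of handling the prediction response, assuming the fastText-style `__label__` prefix that BlazingText uses for class labels (the sample response body below is made up for illustration):

```python
import json

# Hypothetical response body from a BlazingText classification endpoint
sample_body = json.dumps([{"label": ["__label__positive"], "prob": [0.97]}])

def parse_prediction(body):
    # Take the top prediction, strip the "__label__" prefix,
    # and pair the label with its probability
    result = json.loads(body)[0]
    label = result["label"][0].replace("__label__", "")
    return label, result["prob"][0]

print(parse_prediction(sample_body))  # ('positive', 0.97)
```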