# W1 Analyze & Visualize Dataset
# Definitions
- AI lets machines mimic human behaviour
- ML is a subset of AI that uses statistical methods and algorithms that learn from data without being explicitly programmed
- DL is a subset of ML that uses neural networks
- Practical data science means ingesting massive, real-time datasets in the cloud, cleaning them, extracting features, and gaining insights & knowledge from them
- The cloud provides elasticity and scalability compared to a local environment
  - Scaling out - adding more distributed instances/CPUs (horizontal scaling)
  - Scaling up - upgrading a single machine to more powerful resources (vertical scaling)
# ML Workflow
- Steps
  - Ingest & Analyze
  - Prepare & Transform
  - Train & Tune
  - Deploy & Manage
- Services used - S3, Athena & SageMaker
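
A minimal sketch (not from the course material) of how these stages can map onto the services above. The SageMaker session and default-bucket calls are real SDK calls; the stage mapping is only an illustrative outline and assumes AWS credentials and a configured region.

```python
import sagemaker

# Assumes AWS credentials and a default region are configured.
sess = sagemaker.Session()
bucket = sess.default_bucket()  # S3 bucket SageMaker uses by default

# 1. Ingest & Analyze    - raw data lands in S3 and is explored ad hoc with Athena
# 2. Prepare & Transform - features are engineered and written back to S3
# 3. Train & Tune        - SageMaker training jobs read the features from S3
# 4. Deploy & Manage     - the trained model is hosted behind a SageMaker endpoint
print(f"Artifacts for every stage can live in s3://{bucket}/")
```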
# Data ingestion & exploration
- Data lakes
  - Centralized and secure repository of data
  - Stores and shares data of any type and at any scale (structured, semi-structured, unstructured & streaming data)
  - Is governed, private and secure
- S3 (Simple Storage Service)
  - Object storage - each object is the data itself plus a unique identifier (key) and metadata
  - Provides additional tooling for development
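
A small sketch of the "data + unique key + metadata" idea using boto3; the bucket name, key and metadata values below are hypothetical placeholders, not from the course.

```python
import boto3

s3 = boto3.client("s3")

# An S3 object is the data itself plus a unique key and user-defined metadata.
s3.put_object(
    Bucket="my-data-lake-bucket",                # hypothetical bucket
    Key="raw/reviews/2023/reviews.csv",          # unique identifier within the bucket
    Body=b"review_id,star_rating\n1,5\n",        # the data
    Metadata={"source": "marketplace", "format": "csv"},  # custom metadata
)

# Retrieve only the metadata without downloading the object body
head = s3.head_object(Bucket="my-data-lake-bucket", Key="raw/reviews/2023/reviews.csv")
print(head["Metadata"])
```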
# AWS Tools
- Data Wrangler (the awswrangler package)
  - Open-source Python library
  - Connects pandas to AWS services (e.g. S3, Athena, Glue) to load/unload data
- Glue Data Catalog
  - Registers data stored in S3
  - The metadata/schema of the S3 data is stored in a Glue database
  - Glue Crawlers can infer the schema automatically and run on demand, on a schedule or triggered by events (see the crawler sketch after the sample code)
- Athena
  - Serverless, SQL-based data query tool
  - Based on Presto - an open-source distributed SQL engine
```python
%pip install awswrangler

# Sample code - angle-bracket names are placeholders to replace
import awswrangler as wr
import pandas as pd

# Read a CSV directly from S3 into a pandas DataFrame
df = wr.s3.read_csv(path="s3://<bucket>/<prefix>/data.csv")

# Register the data in the AWS Glue Data Catalog
wr.catalog.create_database(name="<database_name>")
wr.catalog.create_csv_table(database="<database_name>", table="<table_name>",
                            path="s3://<bucket>/<prefix>/", columns_types={"<column_name>": "string"})

# Create Athena's default S3 bucket for query results, then query the table with SQL
wr.athena.create_athena_bucket()
df = wr.athena.read_sql_query(sql="SELECT * FROM <table_name>", database="<database_name>")
```
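
As an alternative to registering the table by hand with `wr.catalog`, a Glue Crawler can infer the schema from S3. A hedged sketch with boto3 follows; the crawler name, IAM role ARN, database and S3 path are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix and writes the inferred schema
# into a Glue database (all names below are placeholders).
glue.create_crawler(
    Name="reviews-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/reviews/"}]},
)

# Run it on demand; it can also be scheduled or triggered by events.
glue.start_crawler(Name="reviews-crawler")
```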