# W1 Analyze & Visualize Dataset

# Definitions

  • AI lets machines mimic human behaviour
  • ML is a subset of AI that uses statistical methods and algorithms to learn from data without being explicitly programmed
  • DL is a subset of ML that uses neural networks (NNs)
  • Practical DS involves working with massive, real-time datasets in the cloud: cleaning the data, extracting features, and gaining insights & knowledge from it.
    • The cloud provides elasticity and scalability compared to a local env
    • Scaling out - using multiple distributed CPUs (more machines)
    • Scaling up - upgrading the resources of a single machine (e.g. more CPU or memory)
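
The scaling-out idea above can be sketched locally. This is only a toy illustration: the threads stand in for separate machines, and `split_into_shards` and `scaled_out_sum` are hypothetical helper names, not part of any AWS library.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(items, n_shards):
    """Split items into n_shards contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_shards)
    shards, start = [], 0
    for i in range(n_shards):
        # The first `extra` shards each take one additional item.
        end = start + size + (1 if i < extra else 0)
        shards.append(items[start:end])
        start = end
    return shards


def scaled_out_sum(items, n_workers=4):
    """Sum each shard on its own worker, then combine the partial results."""
    shards = split_into_shards(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, shards))
    return sum(partial_sums)


data = list(range(1_000))
total = scaled_out_sum(data, n_workers=4)
```

Scaling up, by contrast, would keep a single worker and give it more resources, so there is no sharding or combining step.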

# ML Workflow

  • Steps
    • Ingest & Analyze
    • Prepare & Transform
    • Train & Tune
    • Deploy & Manage
  • Software used - S3, Athena & SageMaker
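
As a rough mental model, the four steps can be read as stages that each consume the previous stage's output. The sketch below is a deliberately trivial stand-in: the function bodies (dropping missing values, min-max scaling, a mean "model") are assumptions made for illustration and have nothing to do with how S3, Athena, or SageMaker implement these stages.

```python
def ingest_and_analyze(raw):
    """Ingest raw records and drop missing values."""
    return [x for x in raw if x is not None]


def prepare_and_transform(values):
    """Min-max scale the values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def train_and_tune(features):
    """'Train' a trivial model that just remembers the feature mean."""
    return {"mean": sum(features) / len(features)}


def deploy_and_manage(model):
    """'Deploy' the model as a callable that serves predictions."""
    return lambda _x: model["mean"]


raw = [2.0, None, 4.0, 6.0]
clean = ingest_and_analyze(raw)
features = prepare_and_transform(clean)
model = train_and_tune(features)
predict = deploy_and_manage(model)
```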

# Data ingestion & exploration

  • Data lakes
    • Centralized and secure repo of data
    • Stores and shares data of any type and scale (structured, semi-structured, unstructured & streamed data)
    • Is governed, private and secure
  • S3 (Simple Storage Service)
    • Object storage - each object bundles the data, a unique ID (its key) and metadata
    • Provides extra tools for developers
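
To make the object model concrete, here is a minimal in-memory sketch of "data + unique ID + metadata". `ToyBucket` and `StoredObject` are hypothetical names for illustration only; real S3 access goes through an AWS SDK, and the key and metadata values below are made up.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    key: str          # unique ID of the object within its bucket
    data: bytes       # the payload itself
    metadata: dict = field(default_factory=dict)  # descriptive key/value pairs


class ToyBucket:
    """A flat key -> object mapping, mimicking an object store."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put_object(self, key, data, metadata=None):
        # Writing to an existing key replaces the whole object.
        self._objects[key] = StoredObject(key, data, dict(metadata or {}))

    def get_object(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # "Folders" are only key prefixes; the namespace itself is flat.
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = ToyBucket("hypothetical-bucket")
bucket.put_object("reviews/part-1.csv", b"id,text\n1,great\n", {"content-type": "text/csv"})
bucket.put_object("logs/app.log", b"started\n")
obj = bucket.get_object("reviews/part-1.csv")
```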

# AWS Tools

  • Data Wrangler (awswrangler)
    • OSS Python lib
    • Connects pandas to AWS to load/unload data
  • Glue Data Catalog
    • Registers data stored in S3
    • Metadata/schema of the S3 data is stored in a Glue database (the data itself stays in S3)
    • Has crawlers that are event driven and infer the schema of the data
  • Athena
    • Serverless SQL based data query tool
    • Based on Presto - an OSS distributed SQL engine
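
```python
%pip install awswrangler
# Sample code (names and paths are placeholders to fill in)
import awswrangler as wr
import pandas as pd

# Load CSV data from S3 into a pandas DataFrame; path must be an s3:// URI
df = wr.s3.read_csv(path="s3_bucket")

# Register a database in the Glue Data Catalog
wr.catalog.create_database(name="")

# Register a table over the CSV data; database and path are required too
wr.catalog.create_csv_table(database="", table="", path="s3_bucket", columns_types={})

# Create the default S3 bucket for Athena query results
wr.athena.create_athena_bucket()

# Run a SQL query through Athena and get the result as a DataFrame
df = wr.athena.read_sql_query(sql="", database="")
```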
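
The `columns_types` argument above maps column names to Athena types. The following is a minimal sketch of how such a mapping could be guessed from a CSV sample, assuming only three candidate types (`bigint`, `double`, `string`). `infer_athena_type` and `infer_columns_types` are hypothetical helpers, the sample data is invented, and this is not how Glue crawlers actually classify data.

```python
import csv
import io


def infer_athena_type(values):
    """Pick the narrowest of bigint, double, string that fits every value."""
    for athena_type, caster in (("bigint", int), ("double", float)):
        try:
            for v in values:
                caster(v)  # raises ValueError if the value does not fit
            return athena_type
        except ValueError:
            continue
    return "string"


def infer_columns_types(csv_text):
    """Build a {column: athena_type} dict from CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {col: infer_athena_type([r[col] for r in rows]) for col in rows[0]}


sample = (
    "review_id,star_rating,price,review_body\n"
    "1,5,19.99,Great fit\n"
    "2,3,5.50,Too small\n"
)
columns_types = infer_columns_types(sample)
```

A dict like this could then be passed as `columns_types` when registering the table, after checking the guessed types against the real data.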