# W1 Analyze & Visualize Dataset
# Definitions
- AI lets machines mimic human behaviour
- ML is a subset of AI that uses statistical methods and algorithms that learn from data without being explicitly programmed
- DL is a subset of ML that uses neural networks
- Practical data science means ingesting massive, real-time datasets in the cloud, cleaning them, extracting features, and gaining insights & knowledge from them
- The cloud provides elasticity and scalability compared to a local environment
  - Scaling out - adding more distributed instances/CPUs (horizontal scaling)
  - Scaling up - upgrading a single machine to more powerful resources (vertical scaling)
# ML Workflow
- Steps
  - Ingest & Analyze
  - Prepare & Transform
  - Train & Tune
  - Deploy & Manage
- Services used - S3, Athena & SageMaker
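
A minimal sketch (not from the course material) of how these stages can map onto the services above. The SageMaker session and default-bucket calls are real SDK calls; the stage mapping is only an illustrative outline and assumes AWS credentials and a configured region.

```python
import sagemaker

# Assumes AWS credentials and a default region are configured.
sess = sagemaker.Session()
bucket = sess.default_bucket()  # S3 bucket SageMaker uses by default

# 1. Ingest & Analyze    - raw data lands in S3 and is explored ad hoc with Athena
# 2. Prepare & Transform - features are engineered and written back to S3
# 3. Train & Tune        - SageMaker training jobs read the features from S3
# 4. Deploy & Manage     - the trained model is hosted behind a SageMaker endpoint
print(f"Artifacts for every stage can live in s3://{bucket}/")
```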
# Data ingestion & exploration
- Data lakes
  - Centralized and secure repository of data
  - Stores and shares data of any type and at any scale (structured, semi-structured, unstructured & streaming data)
  - Is governed, private and secure
- S3 (Simple Storage Service)
  - Object storage - each object is the data itself plus a unique identifier (key) and metadata
  - Provides additional tooling for development
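
A small sketch of the "data + unique key + metadata" idea using boto3; the bucket name, key and metadata values below are hypothetical placeholders, not from the course.

```python
import boto3

s3 = boto3.client("s3")

# An S3 object is the data itself plus a unique key and user-defined metadata.
s3.put_object(
    Bucket="my-data-lake-bucket",                # hypothetical bucket
    Key="raw/reviews/2023/reviews.csv",          # unique identifier within the bucket
    Body=b"review_id,star_rating\n1,5\n",        # the data
    Metadata={"source": "marketplace", "format": "csv"},  # custom metadata
)

# Retrieve only the metadata without downloading the object body
head = s3.head_object(Bucket="my-data-lake-bucket", Key="raw/reviews/2023/reviews.csv")
print(head["Metadata"])
```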
# AWS Tools
- Data Wrangler (the awswrangler package)
  - Open-source Python library
  - Connects pandas to AWS services (e.g. S3, Athena, Glue) to load/unload data
- Glue Data Catalog
  - Registers data stored in S3
  - The metadata/schema of the S3 data is stored in a Glue database
  - Glue Crawlers can infer the schema automatically and run on demand, on a schedule or triggered by events (see the crawler sketch after the sample code)
- Athena
  - Serverless, SQL-based data query tool
  - Based on Presto - an open-source distributed SQL engine
```python
%pip install awswrangler

# Sample code - angle-bracket names are placeholders to replace
import awswrangler as wr
import pandas as pd

# Read a CSV directly from S3 into a pandas DataFrame
df = wr.s3.read_csv(path="s3://<bucket>/<prefix>/data.csv")

# Register the data in the AWS Glue Data Catalog
wr.catalog.create_database(name="<database_name>")
wr.catalog.create_csv_table(database="<database_name>", table="<table_name>",
                            path="s3://<bucket>/<prefix>/", columns_types={"<column_name>": "string"})

# Create Athena's default S3 bucket for query results, then query the table with SQL
wr.athena.create_athena_bucket()
df = wr.athena.read_sql_query(sql="SELECT * FROM <table_name>", database="<database_name>")
```
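
As an alternative to registering the table by hand with `wr.catalog`, a Glue Crawler can infer the schema from S3. A hedged sketch with boto3 follows; the crawler name, IAM role ARN, database and S3 path are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix and writes the inferred schema
# into a Glue database (all names below are placeholders).
glue.create_crawler(
    Name="reviews-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/reviews/"}]},
)

# Run it on demand; it can also be scheduled or triggered by events.
glue.start_crawler(Name="reviews-crawler")
```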