# W2 Statistical Bias and Feature Importance

# Statistical Bias

  • Training data does not comprehensively represent the problem space
  • The statistic over- or under-estimates a parameter
  • Some elements in the dataset are more heavily weighted than others
    • Ex: a fraud detection dataset might consist mostly of legitimate transactions, so a model trained on it will struggle to detect the rare fraudulent ones
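For illustration, a quick check on a toy transactions DataFrame (the column name is hypothetical) makes this kind of skew visible:

```python
import pandas as pd

# Toy fraud dataset: 990 legitimate transactions, 10 fraudulent ones
df = pd.DataFrame({"is_fraud": [0] * 990 + [1] * 10})

# The label distribution is heavily skewed toward the majority class
print(df["is_fraud"].value_counts(normalize=True))
# 0    0.99
# 1    0.01
```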

# Causes of bias

  • Activity bias
    • Exists in human-generated content (e.g., social media)
    • Only a small slice of the population participates, so the data is not representative of the whole population
  • Societal bias
    • Preconceived notions prevalent in society
  • Selection bias
    • Generated within the ML system itself
    • Caused by a feedback loop (the model's own outputs influence the data it is later trained on)
  • Data drift
    • The live data distribution varies from that of the training set (a simple check for the covariate case is sketched below)
    • Types
      • Covariate drift - feature drift
      • Prior probability drift - target drift
      • Concept drift - the relationship between features and target drifts (e.g., the same feature values map to a different label depending on region or season)
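As a minimal sketch of spotting covariate drift, one can compare a feature's training-time and production distributions with a two-sample Kolmogorov-Smirnov test (toy data; the significance threshold is an assumption):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, size=1_000)  # feature distribution at training time
live_feature = rng.normal(loc=0.5, size=1_000)   # shifted distribution in production

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # threshold is a judgment call
    print(f"Covariate drift suspected (KS statistic = {stat:.3f})")
```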

# Measuring bias

  • Metrics are applied to a particular facet (the sensitive feature to analyze) of the dataset
  • Types (both are computed in the sketch below)
    • Class imbalance (CI) - measures the imbalance in the number of examples between different facet values
    • Difference in proportions of labels (DPL) - measures the imbalance of positive outcomes between different facet values
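A toy computation of both metrics (Clarify calculates these for you; the facet and label names here are made up). For an advantaged facet value a and a disadvantaged value d, CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where q is the proportion of positive labels within each facet value:

```python
import pandas as pd

# Toy dataset: facet = gender, label = loan approved (1) or rejected (0)
df = pd.DataFrame({
    "gender":   ["m"] * 70 + ["f"] * 30,
    "approved": [1] * 50 + [0] * 20 + [1] * 10 + [0] * 20,
})

n_a = (df["gender"] == "m").sum()  # examples with the advantaged facet value
n_d = (df["gender"] == "f").sum()  # examples with the disadvantaged facet value
ci = (n_a - n_d) / (n_a + n_d)     # class imbalance: 0.40

q_a = df.loc[df["gender"] == "m", "approved"].mean()  # positive-label rate, facet a
q_d = df.loc[df["gender"] == "f", "approved"].mean()  # positive-label rate, facet d
dpl = q_a - q_d                    # difference in proportions of labels: ~0.38

print(f"CI = {ci:.2f}, DPL = {dpl:.2f}")
```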

# Detect Bias

  • SageMaker Data Wrangler & Clarify - connect to data sources, visualize and transform data, and create bias reports
    • DW - a more visual, UI-driven experience (dropdowns, selections, etc.)
    • Clarify - an API-based approach
  • The Clarify processor lets you scale bias detection across a distributed cluster, with instance_count = number of nodes and instance_type = processing capacity per node (see the sketch below)
  • Under the hood, Clarify runs a SageMaker Processing Job that executes bias detection at scale: it reads data from S3, processes it across a cluster of containers, and writes the results back to S3
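A minimal sketch of configuring the processor for a distributed run (the instance values are placeholder assumptions; `role` is an existing SageMaker execution role):

```python
from sagemaker import clarify

# Distribute the bias analysis across a small cluster:
# instance_count = number of nodes, instance_type = capacity per node
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,                     # assumes a SageMaker execution role is defined
    instance_count=2,              # two processing nodes
    instance_type="ml.c5.xlarge",  # processing capacity per node
)
```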

# Feature Importance with SHAP

  • Feature importance
    • Explains the features that make up the dataset using a score (importance), i.e., how useful each feature is relative to the others when building a model
  • SHAP (SHapley Additive exPlanations)
    • Based on game theory
      • Each feature value of an instance is a player
      • The prediction is the reward/payout, split among the players
    • Local & global explanations
      • Local - the importance of each feature for a single prediction
      • Global - the aggregate importance of each feature across the whole dataset
  • SageMaker Clarify calculates the scores using a scalable implementation of the Kernel SHAP algorithm
  • Based on the scores, we can do feature engineering to improve the dataset and model.
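As a small illustration of local vs. global explanations (this uses the open-source shap package rather than Clarify; the toy data and model are assumptions):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy data: 100 instances, 3 features; feature 0 drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(size=100)

model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # each feature value acts as a "player"
shap_values = explainer.shap_values(X)  # payout split, shape (n_instances, n_features)

print(shap_values[0])                    # local: per-feature contributions to one prediction
print(np.abs(shap_values).mean(axis=0))  # global: mean |SHAP| per feature over the dataset
```

In SageMaker itself, the end-to-end bias detection flow looks roughly like the snippet below (the elided arguments come from your own data and S3 setup):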
```python
from sagemaker import clarify

# Processor that runs the analysis as a distributed SageMaker Processing Job
clarify_processor = clarify.SageMakerClarifyProcessor(role=role, ...)

bias_report_output = "s3://..."  # S3 path for the bias report

# Points Clarify at the input dataset in S3 and the report output location
bias_data_config = clarify.DataConfig(...)

# CI = class imbalance, DPL = difference in proportions of labels
clarify_processor.run_pre_training_bias(methods=["CI", "DPL"], ...)
```