# W2 Statistical Bias and Feature Importance

# Statistical Bias

  • Training data does not comprehensively represent the problem space
  • The statistic over- or under-estimates a parameter
  • Some elements in the dataset are more heavily weighted than others
    • Ex: a fraud detection dataset might consist mostly of legitimate transactions, so a model trained on it will struggle to detect the rare fraudulent ones
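For illustration, a quick check on a toy transactions DataFrame (the column name is hypothetical) makes this kind of skew visible:

```python
import pandas as pd

# Toy fraud dataset: 990 legitimate transactions, 10 fraudulent ones
df = pd.DataFrame({"is_fraud": [0] * 990 + [1] * 10})

# The label distribution is heavily skewed toward the majority class
print(df["is_fraud"].value_counts(normalize=True))
# 0    0.99
# 1    0.01
```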

# Causes of bias

  • Activity bias
    • Exists in human-generated content (e.g., social media)
    • Only a small slice of the population participates, so the data is not representative of the whole population
  • Societal bias
    • Preconceived notions prevalent in society
  • Selection bias
    • Generated within the ML system itself
    • Caused by a feedback loop (the model's own outputs influence the data it is later trained on)
  • Data drift
    • The live data distribution varies from that of the training set (a simple check for the covariate case is sketched below)
    • Types
      • Covariate drift - feature drift
      • Prior probability drift - target drift
      • Concept drift - the relationship between features and target drifts (e.g., the same feature values map to a different label depending on region or season)
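As a minimal sketch of spotting covariate drift, one can compare a feature's training-time and production distributions with a two-sample Kolmogorov-Smirnov test (toy data; the significance threshold is an assumption):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, size=1_000)  # feature distribution at training time
live_feature = rng.normal(loc=0.5, size=1_000)   # shifted distribution in production

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # threshold is a judgment call
    print(f"Covariate drift suspected (KS statistic = {stat:.3f})")
```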

# Measuring bias

  • Metrics are applied to a particular facet (the sensitive feature to analyze) of the dataset
  • Types (both are computed in the sketch below)
    • Class imbalance (CI) - measures the imbalance in the number of examples between different facet values
    • Difference in proportions of labels (DPL) - measures the imbalance of positive outcomes between different facet values
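A toy computation of both metrics (Clarify calculates these for you; the facet and label names here are made up). For an advantaged facet value a and a disadvantaged value d, CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where q is the proportion of positive labels within each facet value:

```python
import pandas as pd

# Toy dataset: facet = gender, label = loan approved (1) or rejected (0)
df = pd.DataFrame({
    "gender":   ["m"] * 70 + ["f"] * 30,
    "approved": [1] * 50 + [0] * 20 + [1] * 10 + [0] * 20,
})

n_a = (df["gender"] == "m").sum()  # examples with the advantaged facet value
n_d = (df["gender"] == "f").sum()  # examples with the disadvantaged facet value
ci = (n_a - n_d) / (n_a + n_d)     # class imbalance: 0.40

q_a = df.loc[df["gender"] == "m", "approved"].mean()  # positive-label rate, facet a
q_d = df.loc[df["gender"] == "f", "approved"].mean()  # positive-label rate, facet d
dpl = q_a - q_d                    # difference in proportions of labels: ~0.38

print(f"CI = {ci:.2f}, DPL = {dpl:.2f}")
```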

# Detect Bias

  • SageMaker Data Wrangler & Clarify - connect to data sources, visualize and transform data, and create bias reports
    • DW - a more visual, UI-driven experience (dropdowns, selections, etc.)
    • Clarify - an API-based approach
  • The Clarify processor lets you scale bias detection across a distributed cluster, with instance_count = number of nodes and instance_type = processing capacity per node (see the sketch below)
  • Under the hood, Clarify runs a SageMaker Processing Job that executes bias detection at scale: it reads data from S3, processes it across a cluster of containers, and writes the results back to S3
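A minimal sketch of configuring the processor for a distributed run (the instance values are placeholder assumptions; `role` is an existing SageMaker execution role):

```python
from sagemaker import clarify

# Distribute the bias analysis across a small cluster:
# instance_count = number of nodes, instance_type = capacity per node
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,                     # assumes a SageMaker execution role is defined
    instance_count=2,              # two processing nodes
    instance_type="ml.c5.xlarge",  # processing capacity per node
)
```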

# Feature Importance with SHAP

  • Feature importance
    • Explains the features that make up the dataset using a score (importance), i.e., how useful each feature is relative to the others when building a model
  • SHAP (SHapley Additive exPlanations)
    • Based on game theory
      • Each feature value of an instance is a player
      • The prediction is the reward/payout, split among the players
    • Local & global explanations
      • Local - the importance of each feature for a single prediction
      • Global - the aggregate importance of each feature across the whole dataset
  • SageMaker Clarify calculates the scores using a scalable implementation of the Kernel SHAP algorithm
  • Based on the scores, we can do feature engineering to improve the dataset and model.
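As a small illustration of local vs. global explanations (this uses the open-source shap package rather than Clarify; the toy data and model are assumptions):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy data: 100 instances, 3 features; feature 0 drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(size=100)

model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # each feature value acts as a "player"
shap_values = explainer.shap_values(X)  # payout split, shape (n_instances, n_features)

print(shap_values[0])                    # local: per-feature contributions to one prediction
print(np.abs(shap_values).mean(axis=0))  # global: mean |SHAP| per feature over the dataset
```

In SageMaker itself, the end-to-end bias detection flow looks roughly like the snippet below (the elided arguments come from your own data and S3 setup):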
```python
from sagemaker import clarify

# Processor that runs the analysis as a distributed SageMaker Processing Job
clarify_processor = clarify.SageMakerClarifyProcessor(role=role, ...)

bias_report_output = "s3://..."  # S3 path for the bias report

# Points Clarify at the input dataset in S3 and the report output location
bias_data_config = clarify.DataConfig(...)

# CI = class imbalance, DPL = difference in proportions of labels
clarify_processor.run_pre_training_bias(methods=["CI", "DPL"], ...)
```