Machine Learning Using Redshift ML - Data Analyst

In this lab you will create a model using Redshift ML Auto.

Contents

Before You Begin

This lab assumes you have launched a Redshift cluster. If you have not launched a Redshift cluster see LAB 1 - Creating Redshift Clusters. We will use the Redshift Query Editor in the Redshift console for this lab.

Data Preparation

Data Set Information: Bank Marketing data set The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).

Attribute Information /Input variables / bank client data:

  1. age (numeric)
  2. job
  3. marital
  4. education
  5. default
  6. housing
  7. loan
  8. contact
  9. month
  10. day_of_week
  11. duration
  12. campaign
  13. pdays
  14. previous
  15. poutcome
  16. emp.var.rate
  17. cons.price.idx
  18. cons.conf.idx
  19. euribor3m
  20. nr.employed

Output variable (desired target):

  1. y
    Definition: has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

Reference: https://archive.ics.uci.edu/ml/datasets/bank+marketing

  1. Execute the following statements to create and load the training and inference tables in Redshift. The training data is used to create the model and the inference data is used to make predictions.
CREATE TABLE bank_details_training(
   age numeric,
   job varchar,
   marital varchar,
   education varchar,
   "default" varchar,
   housing varchar,
   loan varchar,
   contact varchar, 
   month varchar,
   day_of_week varchar,
   duration numeric,
   campaign numeric,
   pdays numeric,
   previous numeric,
   poutcome varchar,
   emp_var_rate numeric,
   cons_price_idx numeric,     
   cons_conf_idx numeric,     
   euribor3m numeric,
   nr_employed numeric,
   y boolean ) ;

COPY bank_details_training from 's3://redshift-downloads/redshift-ml/workshop/bank-marketing-data/training_data/' REGION 'us-east-1' IAM_ROLE '<< REPLACE IAM_ROLE >> ' CSV IGNOREHEADER 1 delimiter ';';



CREATE TABLE bank_details_inference(
   age numeric,
   job varchar,
   marital varchar,
   education varchar,
   "default" varchar,
   housing varchar,
   loan varchar,
   contact varchar, 
   month varchar,
   day_of_week varchar,
   duration numeric,
   campaign numeric,
   pdays numeric,
   previous numeric,
   poutcome varchar,
   emp_var_rate numeric,
   cons_price_idx numeric,     
   cons_conf_idx numeric,     
   euribor3m numeric,
   nr_employed numeric,
   y boolean ) ;

COPY bank_details_inference from 's3://redshift-downloads/redshift-ml/workshop/bank-marketing-data/inference_data/' REGION 'us-east-1' IAM_ROLE '<< REPLACE IAM_ROLE >>' CSV IGNOREHEADER 1 delimiter ';';

Create Model

The create model will take ~ 60 minutes to run.

Complete Autopilot generated with minimal user inputs This will be a binary classification problem but auto pilot will choose the relevant algorithm based on the data and inputs

  1. Execute the following statement to create the model
DROP MODEL model_bank_marketing;
CREATE MODEL model_bank_marketing
FROM (
SELECT    
   age ,
   job ,
   marital ,
   education ,
   "default" ,
   housing ,
   loan ,
   contact , 
   month ,
   day_of_week ,
   duration ,
   campaign ,
   pdays ,
   previous ,
   poutcome ,
   emp_var_rate ,
   cons_price_idx ,     
   cons_conf_idx ,     
   euribor3m ,
   nr_employed ,
   y 
FROM
    bank_details_training )
    TARGET y
FUNCTION func_model_bank_marketing
IAM_ROLE '<< REPLACE IAM_ROLE >>'
SETTINGS (
  S3_BUCKET '<< REPLACE S3 bucket >>',
  MAX_RUNTIME 3600
  )
;

Check Accuracy and Run Inferance Query

  1. Next, run the following SQL Query to get some information about your model. Note the model state should say ‘Ready’. Pay attention to the validation:f1 score - it will be between 0 and 1, the closer to 1, the better the model.
show model model_bank_marketing;
  1. Check Inference/Accuracy of the model. Run the following queries - the first checks the accuracy of the model and the second will use the function created by the pre built model for the inference and against the data set in inference table bank_details_inference.
--Inference/Accuracy on inference dats 

WITH infer_data
 AS (
    SELECT  y as actual, func_model_bank_marketing(age,job,marital,education,"default",housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed) AS predicted,
     CASE WHEN actual = predicted THEN 1::INT
         ELSE 0::INT END AS correct
    FROM bank_details_inference
    ),
 aggr_data AS (
     SELECT SUM(correct) as num_correct, COUNT(*) as total FROM infer_data
 )
 SELECT (num_correct::float/total::float) AS accuracy FROM aggr_data;



--Predict how many will subscribe for term deposit vs not subscribe

WITH term_data AS ( SELECT func_model_bank_marketing( age,job,marital,education,"default",housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed) AS predicted 
FROM bank_details_inference )
SELECT 
CASE WHEN predicted = 'Y'  THEN 'Yes-will-do-a-term-deposit'
     WHEN predicted = 'N'  THEN 'No-term-deposit'
     ELSE 'Neither' END as deposit_prediction,
COUNT(1) AS count
from term_data GROUP BY 1;

Before You Leave

If you are done using your cluster, please think about decommissioning it to avoid having to pay for unused resources.

Feedback/Issue

We would love to hear from you

Please let us know if there's a way we can make this lab better, if there is a lab you would like added, or if you want help getting started with your Redshift deployment.

Submit