Machine Learning Using Redshift ML - Data Scientist

In this lab you will create a model using Redshift ML and provide the MODEL_TYPE , OBJECTIVE, PREPROCESSORS and HYPER PARAMETERS.

Contents

Before You Begin

This lab assumes you have launched a Redshift cluster. If you have not launched a Redshift cluster see LAB 1 - Creating Redshift Clusters. We will use the Redshift Query Editor in the Redshift console for this lab.

Data Preparation

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

From the original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).

Attribute Information:

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem.

Name Data Type Measurement Unit Description
Sex nominal M, F, and I (infant)
Length continuous mm Longest shell measurement
Diameter continuous mm perpendicular to length
Height continuous mm with meat in shell
Whole weight continuous grams whole abalone
Shucked weight continuous grams weight of meat
Viscera weight continuous grams gut weight (after bleeding)
Shell weight continuous grams after being dried
Rings integer +1.5 gives the age in years

Reference : https://archive.ics.uci.edu/ml/datasets/Abalone

  1. Execute the following statements to create and load the training and inference tables in Redshift. The training data is used to create the model and the inference data is used to make predictions.
DROP TABLE IF EXISTS abalone_xgb_train;

CREATE TABLE abalone_xgb_train (
length_val float, 
diameter float, 
height float,
whole_weight float, 
shucked_weight float, 
viscera_weight float,
shell_weight float, 
rings int
);

COPY abalone_xgb_train FROM 's3://redshift-downloads/redshift-ml/workshop/xgboost_abalone_data/train/' REGION 'us-east-1' IAM_ROLE '<< REPLACE IAM_ROLE >>' IGNOREHEADER 1 CSV;



DROP TABLE IF EXISTS abalone_xgb_test;

CREATE TABLE abalone_xgb_test (
length_val float, 
diameter float, 
height float,
whole_weight float, 
shucked_weight float, 
viscera_weight float,
shell_weight float, 
rings int
);

COPY abalone_xgb_test FROM 's3://redshift-downloads/redshift-ml/workshop/xgboost_abalone_data/test/' REGION 'us-east-1' IAM_ROLE '<< REPLACE IAM_ROLE >>' IGNOREHEADER 1 CSV;

Create Model

Create a new model with XGBOOST using abalone dataset

For this example, the user is considered advanced machine learning expert where the autopilot is not used and the user will directly provide advanced properties including preprocessors and hyper parameters .

For this example, we are going to provide the MODEL_TYPE , OBJECTIVE, PREPROCESSORS and HYPER PARAMETERS.

For all options supported - https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html#r_auto_off_create_model

  1. Execute the following statement to create the model
-- ~ 10 mins 
CREATE MODEL model_abalone_xgboost_regression 
FROM (SELECT
      length_val,
      diameter,
      height,
      whole_weight,
      shucked_weight,
      viscera_weight,
      shell_weight,
      rings
     FROM abalone_xgb_train)
TARGET Rings 
FUNCTION func_model_abalone_xgboost_regression 
IAM_ROLE '<< REPLACE IAM_ROLE >>' 
AUTO OFF 
MODEL_TYPE xgboost 
OBJECTIVE 'reg:squarederror' 
PREPROCESSORS 'none' 
HYPERPARAMETERS DEFAULT EXCEPT (NUM_ROUND '100') 
SETTINGS (S3_BUCKET '<< REPLACE S3 bucket >>');

Check Accuracy and Run Inferance Query

  1. Next, run the following SQL Query to get some information about your model. Note the model state should say ‘Ready’.
show model model_abalone_xgboost_regression;
  1. Check Inference/Accuracy of the model. Run the following queries - the first checks the accuracy of the model and the second will use the function created by the pre built model for the inference and against the data set in inference table.
--Check Model Accuracy

-- MSE/RMSE [The lower the better]: For regression problems, we compute Mean Squared Error / Root Mean Squared Error.

WITH infer_data AS (
    SELECT Rings AS label, func_model_abalone_xgboost_regression(
Length_val, Diameter, Height, Whole_weight, Shucked_weight, Viscera_weight,
Shell_weight
) AS predicted,
    CASE WHEN label is NULL THEN 0 ELSE label END AS actual
    FROM abalone_xgb_test
)
SELECT SQRT(AVG(POWER(actual - predicted, 2))) AS rmse FROM infer_data;


--Predict the age group of Abalone Species for harvesting, run on the test table
WITH age_data AS ( SELECT func_model_abalone_xgboost_regression( length_val, 
                                               diameter, 
                                               height, 
                                               whole_weight, 
                                               shucked_weight, 
                                               viscera_weight, 
                                               shell_weight ) + 1.5 AS age
FROM abalone_xgb_test )
SELECT 
CASE WHEN age  > 20 THEN 'age_over_20'
     WHEN age  > 10 THEN 'age_between_10_20'
     WHEN age  > 5  THEN 'age_between_5_10'
     ELSE 'age_5_and_under' END as age_group,
COUNT(1) AS count
from age_data GROUP BY 1;

Before You Leave

If you are done using your cluster, please think about decommissioning it to avoid having to pay for unused resources.

Feedback/Issue

We would love to hear from you

Please let us know if there's a way we can make this lab better, if there is a lab you would like added, or if you want help getting started with your Redshift deployment.

Submit