ETL/ELT Strategies

This lab demonstrates how you can modernize your ETL/ELT processes using Materialized Views, Stored Procedures, and Query Scheduling to transform data within Redshift.

Contents

Before You Begin

This lab assumes you have launched a Redshift cluster, loaded it with TPC Benchmark data and can gather the following information. If you have not launched a cluster, see LAB 1 - Creating Redshift Clusters. If you have not yet loaded it, see LAB 2 - Data Loading.

  • [Your-Redshift_Hostname]
  • [Your-Redshift_Port]
  • [Your-Redshift_Username]
  • [Your-Redshift_Password]

It also assumes you have access to a configured client tool. For more details on configuring SQL Workbench/J as your client tool, see Lab 1 - Creating Redshift Clusters : Configure Client Tool. As an alternative you can use the Redshift provided online Query Editor which does not require an installation.

https://console.aws.amazon.com/redshift/home?#query:

Prior to starting this lab, let’s delete some previously loaded data from the lineitem table.

delete from lineitem
using orders
where l_orderkey = o_orderkey
and datepart(year, o_orderdate) = 1998 and datepart(month, o_orderdate) = 8;

1 Materialized Views

In a data warehouse environment, applications often need to perform complex queries on large tables—for example, SELECT statements that perform multi-table joins and aggregations on the tables that contain billions of rows. Processing these queries can be expensive in terms of system resources and the time it takes to compute the results. Materialized views in Amazon Redshift provide a way to address these issues. A materialized view contains a precomputed result set, based on SQL query over one or more base tables. Here you will learn how to create, query and refresh a materialized view.

  • Let’s take an example where you want to generate a report of the top suppliers by shipped quantity. This will join large tables like and lineitem, and suppliers and scan a large quantity of data. You might write a query like the following:
select n_name, s_name, l_shipmode,
  SUM(L_QUANTITY) Total_Qty
from lineitem
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where datepart(year, L_SHIPDATE) > 1997
group by 1,2,3
order by 3 desc
limit 1000;

This query takes time to execute and because it is scanning a large amount of data will use a lot of I/O & CPU resources. Think of a situation, where multiple users in the organization need get supplier-level metrics like the above. Each may write similarly heavy queries which can be time consuming and expensive operations. Instead of that you can use a materialized view to store precomputed results for speeding up queries that are predictable and repeated.

Amazon Redshift provides a few methods to keep materialized views up-to-date. You can configure the automatic refresh option to refresh materialized views when base tables of mare updated. The auto refresh operation runs at a time when cluster resources are available to minimize disruptions to other workloads.

  • Execute below query to create materialized view which aggregates the lineitem data to the supplier level. Note, the AUTO REFRESH option is set to YES and we’ve included additional columns in our MV in case other users can take advantage of this aggregated data.
CREATE MATERIALIZED VIEW supplier_shipmode_agg
AUTO REFRESH YES AS
select l_suppkey, l_shipmode, datepart(year, L_SHIPDATE) l_shipyear,
  SUM(L_QUANTITY)	TOTAL_QTY,
  SUM(L_DISCOUNT) TOTAL_DISCOUNT,
  SUM(L_TAX) TOTAL_TAX,
  SUM(L_EXTENDEDPRICE) TOTAL_EXTENDEDPRICE  
from LINEITEM
group by 1,2,3;
  • Now execute the below query which has been re-written to use the materialized view. Note the difference in query execution time. You get the same results in few seconds.
select n_name, s_name, l_shipmode,
  SUM(TOTAL_QTY) Total_Qty
from supplier_shipmode_agg
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where l_shipyear > 1997
group by 1,2,3
order by 3 desc
limit 1000;

Another powerful feature of Materialized view is auto query rewrite. Amazon Redshift can automatically rewrite queries to use materialized views, even when the query doesn’t explicitly reference a materialized view.

  • Now, re-run your original query which references the lineitem table and see this query now executes faster because Redshift has re-written this query to leverage the materialized view instead of base table.
select n_name, s_name, l_shipmode, SUM(L_QUANTITY) Total_Qty
from lineitem
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where datepart(year, L_SHIPDATE) > 1997
group by 1,2,3
order by 3 desc
limit 1000;

You can verify that the query re-writer is using the MV by running an explain operation:

explain
select n_name, s_name, l_shipmode, SUM(L_QUANTITY) Total_Qty
from lineitem
join supplier on l_suppkey = s_suppkey
join nation on s_nationkey = n_nationkey
where datepart(year, L_SHIPDATE) > 1997
group by 1,2,3
order by 3 desc
limit 1000;

Below output shows a successful automatic rewriting and execution of query on the materialized view.

Write additional queries which can leverage your materialized view but which do not directly reference it. For example, Total Extendedprice by Region.

2 Stored Procedures

Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and business-specific logic. By combining multiple SQL steps into a stored procedure, you can reduce round trips between your applications and the database. A stored procedure can incorporate data definition language (DDL) and data manipulation language (DML) in addition to SELECT queries. A stored procedure doesn’t have to return a value. You can use the PL/pgSQL procedural language, including looping and conditional expressions, to control logical flow.

Let’s see how you can create and invoke stored procedure in Redshift. Here our goal is to incrementally refresh the lineitem data. Execute the following query to create lineitem staging table:

create table stage_lineitem (
  L_ORDERKEY bigint NOT NULL,
  L_PARTKEY bigint,
  L_SUPPKEY bigint,
  L_LINENUMBER integer NOT NULL,
  L_QUANTITY decimal(18,4),
  L_EXTENDEDPRICE decimal(18,4),
  L_DISCOUNT decimal(18,4),
  L_TAX decimal(18,4),
  L_RETURNFLAG varchar(1),
  L_LINESTATUS varchar(1),
  L_SHIPDATE date,
  L_COMMITDATE date,
  L_RECEIPTDATE date,
  L_SHIPINSTRUCT varchar(25),
  L_SHIPMODE varchar(10),
  L_COMMENT varchar(44));
  • Execute below script to create a stored procedure. This stored procedure performs following tasks:
  1. Truncate staging table to clean up old data
  2. Load data in the stage_lineitem table using the COPY command.
  3. Merge updated records in existing lineitem table.

Make sure to replace the [Your-AWS_Account_Id] and [Your-Redshift_Role] values in the script below.

CREATE OR REPLACE PROCEDURE lineitem_incremental()
AS $$
BEGIN

truncate stage_lineitem;  

copy stage_lineitem from 's3://redshift-immersionday-labs/data/lineitem-part/l_orderyear=1998/l_ordermonth=8/'
iam_role 'arn:aws:iam::[Your-AWS_Account_Id]:role/[Your-Redshift-Role]'
region 'us-west-2' gzip delimiter '|' COMPUPDATE PRESET;

delete from lineitem using stage_lineitem
where stage_lineitem.l_orderkey=lineitem.l_orderkey and stage_lineitem.l_linenumber = lineitem.l_linenumber;

insert into lineitem
select * from stage_lineitem;

END;
$$ LANGUAGE plpgsql;

Before you call this stored procedure, capture a metric using the materialized view. We’ll compare this value after the stored procedure loads new data to demonstrate the Materialized View auto refresh capability.

select SUM(TOTAL_QTY) Total_Qty from supplier_shipmode_agg;

Call this stored procedure using CALL statement. When executed it will perform an incremental load:

call lineitem_incremental();

3 Query scheduling

Amazon Redshift allows you to schedule your SQL queries for executions in recurring schedules. You can now schedule time sensitive or long running queries, loading or unloading your data, stored procedures or refreshing your materialized views on a regular schedule. You can use the Amazon Redshift Console or Amazon Redshift Data API to schedule your SQL queries.

To schedule queries, the following AWS Identity and Access Management (IAM) configurations will need to be made:

  1. An IAM role must be created which can be assumed by the Redshift Scheduler service, Event Bridge and your IAM principle (User or Role). This role will be attached to the schedule and must also be able to retrieve the Redshift credentials needed to execute the query.
  2. Your IAM principle (User or Role) must be modified to be able to assume the above role.

The steps we will complete include the following:

Create a Role

Navigate to the IAM console and create a new Role

https://console.aws.amazon.com/iam/home?region=us-west-2#/roles$new

Select the service “Redshift” and the option “Redshift Scheduler”. Click “Next Permissions”.

Notice that the AmazonRedshiftFullAccess permission is required. This policy will enable Redshift to get temporary credentials and execute your query using the Redshift Data API.

Click “Next Tag”, Click “Next Review” and then provide role name as RedshiftQueryScheduler.

Modify the RedshiftQueryScheduler role you just created, navigate to Trust Relationships tab and click Edit Trust Relationship. Add to the policy as follows:

Make sure to replace the [Your-AWS_Account_Id] and [Your-IAM_User] values in the policy below.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "scheduler.redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Sid": "AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::[Your-AWS_Account_Id]:user/[Your-IAM_User]"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Modify your user

Go to IAM user which you have given above as [Your-IAM_User]. Click Add Inline policy and replace with your account id.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AssumeIAMRole",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::[Your-AWS_Account_Id]:role/RedshiftQueryScheduler"
        }
    ]
}

Schedule your query

Navigate, back to Redshift query editor, ensure the query to call your stored procedure is in the editor and click on the Schedule button.

https://console.aws.amazon.com/redshiftv2/home?#query-editor:
call lineitem_incremental();

Select IAM Role that you created in previous step. Also, select the cluster, and provide a database name and database user.

Enter a query name as well as the query text.

Provide values for Repeat By, Repeat every, and Repeat time. When you select “Repeat at time (UTC)” enter a time that is little later than current time so you can observe the execution. Optionally, you can enable monitoring via Amazon SNS notifications. If you have not enabled SNS for your Redshift cluster, see Event Subscriptions. For this example, we can leave this Disabled.

Monitor your query

Navigate to the scheduled queries tab and you can see your query scheduler has been created as shown below:

Click on the schedule and after successful execution on scheduled time, you can see the status is “success”.

Re-write the stored procedure to pass in parameters for the year and month. Modify your schedule to dynamically get the current year and current month and use those as parameters to your stored procedure.

4 Bringing it together

Let’s see if Redshift has automatically refreshed materialized view after lineitem table got changed. Note, the materialized view refresh is asynchronous. For this lab, expect ~5min for the data to be refreshed after you called the lineitem_incremental procedure:

select db_name, userid, schema_name, mv_name, status, refresh_type from SVL_MV_REFRESH_STATUS
where mv_name = 'supplier_shipmode_agg';

The output shows that MV was refreshed automatically and incrementally.

Run the below queries on the MV and compare with the value you had noted previously. You will see SUM has changed which indicates that Redshift has identified changes that have taken place in the base table or tables, and then applied those changes to the materialized view.

select SUM(TOTAL_QTY) Total_Qty from supplier_shipmode_agg;

Before You Leave

If you are done using your cluster, please think about decommissioning it to avoid having to pay for unused resources.