Data Sharing

In a data warehouse environment, you often have mixed workloads from various user groups and analytics use cases. While Redshift natively provides machine-learning-based workload management, there may be situations where critical workloads need to be shielded from unpredictable user behavior. In addition, you may want to allocate analytics spend so that each consumer of the data is accountable for the data they produce as well as the compute they use. Finally, it can be cumbersome to move and manage multiple copies of your data. Redshift data sharing addresses all of these use cases. This lab demonstrates how you can isolate your workloads by sharing data between two Redshift clusters.

Before You Begin

This lab assumes you have launched a Redshift cluster, configured your client tool, and loaded the cluster with TPC Benchmark data. If you have not launched a cluster, see LAB 1 - Creating Redshift Clusters. If you have not yet loaded it, see LAB 2 - Data Loading. If you have not configured your client tool, see Lab 1 - Creating Redshift Clusters : Configure Client Tool. For this lab, you will need the connection details (hostname, port, username, and password) of the cluster from LAB 1 - Creating Redshift Clusters.

Redshift data sharing requires an RA3 node type.

It also assumes you have access to a configured client tool. For more details on configuring SQL Workbench/J as your client tool, see Lab 1 - Creating Redshift Clusters : Configure Client Tool. As an alternative, you can use the Redshift-provided online Query Editor, which does not require any installation.

https://console.aws.amazon.com/redshift/home?#query:

1 Create a consumer cluster

Launch a new cluster in the same account & region as the cluster you already have containing the TPC Benchmark data. You can use the instructions found in LAB 1 - Creating Redshift Clusters. You can use the same VPC, Subnet Group, Security Group, and IAM Role associated with your original cluster.

Redshift data sharing requires an RA3 node type.

Capture the following information about this new cluster and use it to configure your client tool. For more details on configuring your client tool, see Lab 1 - Creating Redshift Clusters : Configure Client Tool.

  • [Your-Redshift_Hostname]
  • [Your-Redshift_Port]
  • [Your-Redshift_Username]
  • [Your-Redshift_Password]

After the cluster has been created and you’ve successfully connected to it, execute the following command and capture the [Consumer-Namespace]; you will use it later.

select current_namespace;

2 Create a data share

  1. In a new window, connect to the producer cluster with the loaded TPC Benchmark data and create a data share containing all tables in the public schema, then grant usage on this data share to the consumer cluster you just created.

Replace the [Consumer-Namespace] with the value you retrieved previously.

CREATE DATASHARE tpc_share SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE tpc_share ADD SCHEMA public;
ALTER DATASHARE tpc_share ADD ALL TABLES IN SCHEMA public;
GRANT USAGE ON DATASHARE tpc_share TO NAMESPACE '[Consumer-Namespace]';

After the datashare has been created, you can validate which datashares exist on the cluster using the following command:

show datashares;

You can also use the following command to get more information on the datashares:

select * from SVV_DATASHARES;

The following command shows which objects the datashare contains:

select * from svv_datashare_objects;
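
You can also confirm which consumer namespaces have been granted usage on the share. The query below is a minimal sketch using the SVV_DATASHARE_CONSUMERS system view:

-- list the consumers that have been granted usage on tpc_share
select share_name, consumer_namespace
from svv_datashare_consumers
where share_name = 'tpc_share';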

What happens when new objects are added to or removed from the public schema in the producer cluster with the TPC data? Will the changes be automatically reflected in the consumer cluster?

Answer:
  • Add a new table to the public schema by executing the following statement in the producer cluster:
create table public.customer_new (
  C_CUSTKEY bigint NOT NULL,
  C_NAME varchar(25),
  C_ADDRESS varchar(40),
  C_NATIONKEY bigint,
  C_PHONE varchar(15),
  C_ACCTBAL decimal(18,4),
  C_MKTSEGMENT varchar(10),
  C_COMMENT varchar(117))
diststyle all;

-- insert a couple of sample rows; columns not listed default to NULL
INSERT INTO public.customer_new (c_custkey, c_name, c_address) VALUES (1, 'Customer 1', 'NorthEast'), (2, 'Customer 2', 'SouthEast');
  • Check to see if this table was automatically added to the datashare.
select * from svv_datashare_objects where object_name = 'public.customer_new';
  • You can explicitly add this new table to the datashare using the following command:
ALTER DATASHARE tpc_share ADD TABLE public.customer_new;
  2. Let’s say you don’t want to expose customer details to the consumer cluster. You can remove a table from the datashare by executing the following statement:
ALTER DATASHARE tpc_share REMOVE TABLE public.customer;
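
To confirm the removal, you can check that the table no longer appears in the datashare. This reuses the SVV_DATASHARE_OBJECTS view shown above:

-- should return no rows once public.customer has been removed from the share
select * from svv_datashare_objects
where share_name = 'tpc_share'
  and object_name = 'public.customer';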
  3. Execute the following command and capture the [TPC-Namespace]; you will use it later when mapping the datashare to your consumer cluster.
select current_namespace;

3 Map and query shared data

  1. Switch back to the consumer connection and execute the following to create a new database which is connected to the datashare:

Replace the [TPC-Namespace] with the value you retrieved previously.

CREATE DATABASE tpc FROM DATASHARE tpc_share OF NAMESPACE '[TPC-Namespace]';
  2. Test querying a table you should have access to and one you don’t have access to:
select * from tpc.public.region limit 100;
select * from tpc.public.customer limit 100;
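
The query against region should succeed, while the query against customer should fail, since that table was removed from the datashare earlier. To see exactly which shared tables the consumer can access, here is a minimal sketch using the SVV_REDSHIFT_TABLES system view (assuming the datashare database is named tpc as above):

-- list the tables visible to the consumer through the datashare database
select schema_name, table_name
from svv_redshift_tables
where database_name = 'tpc';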

How long does it take for new data to appear in the consumer cluster? Insert a new record in the TPC cluster region table and see if it’s available in the consumer cluster.

Answer:
  • In the producer cluster, execute the following statement to insert a row into the region table:
insert into public.region values (5, 'New_region','Test Region');
  • In the consumer cluster, execute the following statement to validate that the new data is available in the region table:
select * from tpc.public.region
where r_name = 'New_region';
  3. In addition to the [db].[schema].[table] notation, you can map a datashare schema to a local schema:
CREATE EXTERNAL SCHEMA tpc
FROM REDSHIFT DATABASE 'tpc' SCHEMA 'public';
select * from tpc.region limit 100;
select * from tpc.orders limit 100;
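
To verify how the new schema is mapped, here is a minimal sketch using the SVV_ALL_SCHEMAS system view; the source_database column referenced below should show which datashare database the shared schema points to:

-- show the local schema created from the datashare database
select database_name, schema_name, schema_type, source_database
from svv_all_schemas
where schema_name = 'tpc';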

What is the resource utilization on the producer cluster with the TPC benchmark data when querying from the consumer cluster?

Answer:
  • Navigate to the Amazon Redshift Console and note the CPU utilization of the producer and consumer clusters.
https://console.aws.amazon.com/redshiftv2/home?#dashboard
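
To generate some load to observe, you can run a heavier aggregate from the consumer cluster while watching both dashboards. This is a sketch that assumes the standard TPC-H column names on the orders table and the tpc schema mapping created above; the producer’s CPU utilization should remain largely unaffected because the query runs on the consumer’s compute:

-- run from the consumer cluster while monitoring CPU on both clusters
select o_orderpriority, count(*) as order_count
from tpc.orders
group by o_orderpriority
order by o_orderpriority;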

Before You Leave

If you are done using your clusters, please consider decommissioning them to avoid paying for unused resources.

Feedback/Issue

We would love to hear from you!

Please let us know if there's a way we can make this lab better, if there is a lab you would like added, or if you want help getting started with your Redshift deployment.
