Loading & querying semi-structured data

In 8. Querying nested JSON in S3 you learned how to query semi-structured data in S3. In this lab you will learn how to load semi-structured JSON data into Redshift, either into a single column using the SUPER data type or by using automatic shredding. You will also learn how this strategy can help with an evolving schema in your semi-structured data.

Before You Begin

This lab assumes you have launched a Redshift cluster and configured your client tool. If you have not launched a cluster, see LAB 1 - Creating Redshift Clusters. If you have not configured your client tool, see Lab 1 - Creating Redshift Clusters : Configure Client Tool. For this lab, you will need to gather the following information about your cluster from LAB 1 - Creating Redshift Clusters.

  • [Your-Redshift_Role_Arn]

Load JSON data

Assume you have a feed of transaction data landing at the following S3 location. Our goal is to load it in the simplest way possible:

https://s3.console.aws.amazon.com/s3/buckets/redshift-downloads?region=us-east-1&prefix=semistructured/tpch-nested/data/json/customer_orders_lineitem/&showversions=false

  1. Create a table to store this data without worrying about the schema or underlying structure:
drop table if exists transaction;
create table transaction (
	data_json super
);
  1. Now, load the data using the COPY statement. Notice the noshred option, which causes each document to be loaded into a single field. Later you will see how to load JSON data with shredding.

Replace the [Your-Redshift_Role_Arn] with the value you retrieved previously.

copy transaction
from 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/customer_order_lineitem.json'
iam_role '[Your-Redshift_Role_Arn]' region 'us-east-1'
json 'noshred';
  1. Verify the data has been loaded. Notice the JSON structure: at the top level you have customer-related attributes.
select * from transaction limit 1;

  1. Take note of the record count so we can compare counts between the nested and un-nested data sets.
select count(1) from transaction;

  1. Write a query that extracts the customer-related attributes. Since these are at the top level of the JSON structure, you should expect a row count that matches the previous row count:
select data_json.c_custkey, data_json.c_phone, data_json.c_acctbal from transaction;

  1. Notice that certain fields are wrapped in double quotes. This is because Redshift does not enforce types when querying a SUPER field, which helps with schema evolution: if a data type changes over time, the load process will not fail. See the Amazon Redshift documentation on dynamic typing to learn more. Type-cast the extracted fields so they display correctly.
select data_json.c_custkey::int, data_json.c_phone::varchar, data_json.c_acctbal::decimal(18,2) from transaction;
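SUPER navigation is also lax by default: referencing an attribute that does not exist returns NULL instead of failing, another property that helps when the schema evolves. As a sketch, assuming a hypothetical attribute c_nickname that is not present in the data:

```sql
-- c_nickname is hypothetical and does not exist in the loaded JSON;
-- with default (lax) SUPER navigation this returns NULL for every row
-- rather than raising an error.
select data_json.c_custkey::int, data_json.c_nickname
from transaction
limit 5;
```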

Unnest the data

In our use case, you may have noticed that the JSON document contains a c_orders field, which is another complex data structure. Furthermore, the c_orders structure contains another nested field, o_lineitems.

  1. Write a query that unnests the data and determines the count of orders and line items. Notice that to unnest the data, you leverage PartiQL syntax.
select count(1) from transaction t, t.data_json.c_orders o;
select count(1) from transaction t, t.data_json.c_orders o, o.o_lineitems l;

  1. To get a full data set of customer, order, and line item details, execute a query that extracts data from each level of the nested hierarchy.
select data_json.c_custkey::int, data_json.c_phone::varchar, data_json.c_acctbal::decimal(18,2), o.o_orderstatus::varchar, l.l_shipmode::varchar, l.l_extendedprice::decimal(18,2)
from transaction t, t.data_json.c_orders o, o.o_lineitems l;
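If you also need the position of each element within its parent array, PartiQL's AT keyword binds an array index during unnesting. A minimal sketch (the index alias o_idx is an arbitrary name):

```sql
-- AT binds a 0-based iteration index for each unnested array element,
-- so you can tell which position an order occupied in c_orders.
select t.data_json.c_custkey::int,
       o_idx,
       o.o_orderkey::int
from transaction t, t.data_json.c_orders o at o_idx;
```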

Load JSON data with shredding

In some use cases the top-level JSON data does not change often and/or your ETL process does not need to incorporate new fields added to the top level. In these cases, you can use the auto option with the ignorecase parameter to load your JSON data.

  1. Create a table that includes the fields from the first level of the JSON document:
drop table if exists transaction_shred;
create table transaction_shred (
  c_custkey bigint,
  c_phone varchar(20),
  c_acctbal decimal(18,2),
  c_orders super  
);
  1. Now, load the data once more using the COPY statement, this time with the auto option, which loads the data into columns that match the field names in the JSON objects. The ignorecase keyword makes the column mapping case-insensitive.
copy transaction_shred from 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/customer_order_lineitem.json'
iam_role '[Your-Redshift_Role_Arn]' region 'us-east-1'
json 'auto ignorecase';

What happens when the table columns do not match the JSON file? What if the table is missing a column, has an extra column, or the columns are out of order?

  1. Let’s get a full data set of customer, order, and line item details, but this time taking advantage of the fact that the first level of the hierarchy has been shredded.
select c_custkey, c_phone, c_acctbal, o.o_orderstatus::varchar, l.l_shipmode::varchar, l.l_extendedprice::decimal(18,2)
from transaction_shred t, t.c_orders o, o.o_lineitems l;
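One way to answer the questions above is to experiment. With json 'auto', COPY matches by name, so column order does not matter, JSON fields with no matching column are ignored, and table columns with no matching JSON field are loaded as NULL. A sketch using a hypothetical test table that reorders the columns and adds one that is absent from the file:

```sql
-- Hypothetical table: columns reordered, c_orders omitted, and an extra
-- column (c_comment) that does not appear in the JSON file.
drop table if exists transaction_shred_test;
create table transaction_shred_test (
  c_acctbal decimal(18,2),   -- out of order relative to the file: still matched by name
  c_custkey bigint,
  c_comment varchar(100)     -- no matching JSON field: loaded as NULL
);
copy transaction_shred_test
from 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/customer_order_lineitem.json'
iam_role '[Your-Redshift_Role_Arn]' region 'us-east-1'
json 'auto ignorecase';
```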

Parsing JSON documents to SUPER columns

You can insert or update JSON data in a SUPER column using the JSON_PARSE function. This function parses data in JSON format and converts it into the SUPER data type, which you can then use in INSERT or UPDATE statements. Parsing includes validating the JSON document.

  1. Insert a new customer order in the transaction_shred table.
INSERT INTO transaction_shred VALUES
  (1234,
   '800-867-5309',
   441989.88,
   JSON_PARSE(  
    '[{
      "o_orderstatus":"F",
      "o_clerk":"Clerk#0000001991",
      "o_lineitems":[
         {
            "l_returnflag":"R",
            "l_receiptdate":"2017-07-23",
            "l_tax":0.03,
            "l_shipmode":"TRUCK",
            "l_suppkey":4799,
            "l_shipdate":"2014-06-24",
            "l_commitdate":"2014-06-05",
            "l_partkey":54798,
            "l_quantity":4,
            "l_linestatus":"F",
            "l_comment":"Net new order for new customer",
            "l_extendedprice":28007.64,
            "l_linenumber":1,
            "l_discount":0.02,
            "l_shipinstruct":"TAKE BACK RETURN"
         }
      ],
      "o_orderdate":"2014-06-01",
      "o_shippriority":0,
      "o_totalprice":28308.25,
      "o_orderkey":1953697,
      "o_comment":"wing for 997.1 gt3",
      "o_orderpriority":"5-LOW"
    }],'));

The INSERT errors out because the JSON_PARSE function detects a malformed JSON document.

  1. Remove the extra comma at the end of the JSON and try inserting the data again.
INSERT INTO transaction_shred VALUES
  (1234,
   '800-867-5309',
   441989.88,
   JSON_PARSE(  '[{
      "o_orderstatus":"F",
      "o_clerk":"Clerk#0000001991",
      "o_lineitems":[
         {
            "l_returnflag":"R",
            "l_receiptdate":"2017-07-23",
            "l_tax":0.03,
            "l_shipmode":"TRUCK",
            "l_suppkey":4799,
            "l_shipdate":"2014-06-24",
            "l_commitdate":"2014-06-05",
            "l_partkey":54798,
            "l_quantity":4,
            "l_linestatus":"F",
            "l_comment":"Net new order for new customer",
            "l_extendedprice":28007.64,
            "l_linenumber":1,
            "l_discount":0.02,
            "l_shipinstruct":"TAKE BACK RETURN"
         }
      ],
      "o_orderdate":"2014-06-01",
      "o_shippriority":0,
      "o_totalprice":28308.25,
      "o_orderkey":1953697,
      "o_comment":"wing for 997.1 gt3",
      "o_orderpriority":"5-LOW"
   }]'));
  1. Next, update the row you just inserted to add a new line item for the order.

If you update a SUPER column, Amazon Redshift requires the complete document to be passed as the column value; partial updates of a SUPER column are not supported.

update transaction_shred 
set c_orders =  
   JSON_PARSE(  '[{
      "o_orderstatus":"F",
      "o_clerk":"Clerk#0000001991",
      "o_lineitems":[
         {
            "l_returnflag":"R",
            "l_receiptdate":"2017-07-23",
            "l_tax":0.03,
            "l_shipmode":"TRUCK",
            "l_suppkey":4799,
            "l_shipdate":"2014-06-24",
            "l_commitdate":"2014-06-05",
            "l_partkey":54798,
            "l_quantity":4,
            "l_linestatus":"F",
            "l_comment":"Net new order for new customer",
            "l_extendedprice":28007.64,
            "l_linenumber":1,
            "l_discount":0.02,
            "l_shipinstruct":"TAKE BACK RETURN"
         },
         {
            "l_returnflag":"R",
            "l_receiptdate":"2017-07-23",
            "l_tax":0.03,
            "l_shipmode":"TRUCK",
            "l_suppkey":4799,
            "l_shipdate":"2014-06-24",
            "l_commitdate":"2014-06-05",
            "l_partkey":54798,
            "l_quantity":4,
            "l_linestatus":"F",
            "l_comment":"Net new order2 for new customer",
            "l_extendedprice":28007.64,
            "l_linenumber":2,
            "l_discount":0.02,
            "l_shipinstruct":"TAKE BACK RETURN"
         }
      ],
      "o_orderdate":"2014-06-01",
      "o_shippriority":0,
      "o_totalprice":28308.25,
      "o_orderkey":1953697,
      "o_comment":"wing for 997.1 gt3",
      "o_orderpriority":"5-LOW"
   }]')
   where c_custkey = 1234;
  1. Finally, let’s get a full data set of customer, order and lineitem details.
select c_custkey, c_phone, c_acctbal, o.o_orderstatus::varchar, l.l_shipmode::varchar, l.l_extendedprice::decimal(18,2), l.l_linenumber::int
from transaction_shred t, t.c_orders o, o.o_lineitems l
order by c_custkey;

If the JSON_PARSE function is omitted, Amazon Redshift stores the value as a single string scalar instead of parsing it into a navigable JSON structure.
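To see the difference, you could store the same text once through JSON_PARSE and once as a plain string, then inspect each value with JSON_TYPEOF. A minimal sketch using a hypothetical demo table:

```sql
-- Hypothetical illustration: the same text stored with and without JSON_PARSE.
drop table if exists super_demo;
create table super_demo (label varchar(20), v super);
insert into super_demo values
  ('parsed',   json_parse('[{"a":1}]')),  -- stored as a navigable array
  ('unparsed', '[{"a":1}]');              -- implicitly cast: one string scalar
select label, json_typeof(v) from super_demo;
-- 'parsed' reports an array you can navigate (v[0].a works);
-- 'unparsed' reports a string, and v[0].a returns NULL under lax navigation.
```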

Before You Leave

If you are done using your cluster, please consider decommissioning it to avoid paying for unused resources. For more on ingesting and querying semi-structured data in Amazon Redshift, refer to the documentation here: https://docs.aws.amazon.com/redshift/latest/dg/super-overview.html

Feedback/Issue

We would love to hear from you

Please let us know if there's a way we can make this lab better, if there is a lab you would like added, or if you want help getting started with your Redshift deployment.