Image by Editor
Many of us have experienced the core power of speed and efficiency delivered by centralizing compute within the Cloud Data Warehouse. While this is true, many of us have also realized that, like with anything, this value comes with its own set of downsides.
One of the primary drawbacks of this approach is that you must learn and execute queries in different languages, specifically SQL. While writing SQL is faster and less expensive than standing up a secondary infrastructure to run python (on your laptop or in-office servers), it comes with many different complexities depending on what information the data analyst wants to extract from the cloud warehouse. The switch over to cloud data warehouses increases the utility of complex SQL versus python. Having been through this experience myself, I decided to record the specific transformations that are the most painful to learn and perform in SQL and provide the actual SQL needed to alleviate some of this pain for my readers.
To aid in your workflow, you’ll notice that I provide examples of the data structure before and after the transform is executed, so you can follow along and validate your work. I have also provided the actual SQL needed to perform each of the 5 hardest transformations. You’ll need new SQL to perform the transformation across multiple projects as your data changes. We’ve provided links to dynamic SQL for each transformation so you can continue to capture the SQL needed for your analysis on an as needed basis!
Date Spines
It is not clear where the term date spine originated, but even those who don’t know the term are probably familiar with what it is.
Imagine you are analyzing your daily sales data, and it looks like this:
sales_date | product | sales |
2022-04-14 | A | 46 |
2022-04-14 | B | 409 |
2022-04-15 | A | 17 |
2022-04-15 | B | 480 |
2022-04-18 | A | 65 |
2022-04-19 | A | 45 |
2022-04-19 | B | 411 |
No sales happened on the 16th and 17th, so the rows are completely missing. If we were trying to calculate average daily sales, or build a time series forecast model, this format would be a major problem. What we need to do is insert rows for the missing days.
Here is the basic concept:
- Generate or select unique dates
- Generate or select unique products
- Cross Join (cartesian product) all combinations of 1&2
- Outer Join #3 to your original data
Customizable SQL for Datespine
The end result will look like this:
sales_date | product | sales |
2022-04-14 | A | 46 |
2022-04-14 | B | 409 |
2022-04-15 | A | 17 |
2022-04-15 | B | 480 |
2022-04-16 | A | 0 |
2022-04-16 | B | 0 |
2022-04-17 | A | 0 |
2022-04-17 | B | 0 |
2022-04-18 | A | 65 |
2022-04-18 | B | 0 |
2022-04-19 | A | 45 |
2022-04-19 | B | 411 |
Pivot / Unpivot
Sometimes, when doing an analysis, you want to restructure the table. For instance, we might have a list of students, subjects, and grades, but we want to break out subjects into each column. We all know and love Excel because of its pivot tables. But have you ever tried to do it in SQL? Not only does every database have annoying differences in how PIVOT is supported, but the syntax is unintuitive and easily forgettable.
Before:
Student | Subject | Grade |
Jared | Mathematics | 61 |
Jared | Geography | 94 |
Jared | Phys Ed | 98 |
Patrick | Mathematics | 99 |
Patrick | Geography | 93 |
Patrick | Phys Ed | 4 |
Customizable SQL for Pivot
Result:
Student | Mathematics | Geography | Phys Ed |
Jared | 61 | 94 | 98 |
Patrick | 99 | 93 | 4 |
One-hot Encoding
This one isn’t necessarily difficult but is time-consuming. Most data scientists don’t consider doing one-hot-encoding in SQL. Although the syntax is simple, they would rather transfer the data out of the data warehouse than the tedious task of writing a 26-line CASE statement. We don’t blame them!
However, we recommend taking advantage of your data warehouse and its processing power. Here is an example using STATE as a column to one-hot-encode.
Before:
Babyname | State | Qty |
Alice | AL | 156 |
Alice | AK | 146 |
Alice | PA | 654 |
… | … | … |
Zelda | NY | 417 |
Zelda | AL | 261 |
Zelda | CO | 321 |
Customizable SQL for One-Hot Encode
Result:
Babyname | State | State_AL | State_AK | … | State_CO | Qty |
Alice | AL | 1 | 0 | … | 0 | 156 |
Alice | AK | 0 | 1 | … | 0 | 146 |
Alice | PA | 0 | 0 | … | 0 | 654 |
… | … | … | … | |||
Zelda | NY | 0 | 0 | … | 0 | 417 |
Zelda | AL | 1 | 0 | … | 0 | 261 |
Zelda | CO | 0 | 0 | … | 1 | 321 |
Market Basket Analysis
When doing a market basket analysis or mining for association rules, the first step is often formatting the data to aggregate each transaction into a single record. This can be challenging for your laptop, but your data warehouse is designed to crunch this data efficiently.
Typical transaction data:
SALESORDERNUMBER | CUSTOMERKEY | ENGLISHPRODUCTNAME | LISTPRICE | WEIGHT | ORDERDATE |
SO51247 | 11249 | Mountain-200 Black | 2294.99 | 23.77 | 1/1/2013 |
SO51247 | 11249 | Water Bottle – 30 oz. | 4.99 | 1/1/2013 | |
SO51247 | 11249 | Mountain Bottle Cage | 9.99 | 1/1/2013 | |
SO51246 | 25625 | Sport-100 Helmet | 34.99 | 12/31/2012 | |
SO51246 | 25625 | Water Bottle – 30 oz. | 4.99 | 12/31/2012 | |
SO51246 | 25625 | Road Bottle Cage | 8.99 | 12/31/2012 | |
SO51246 | 25625 | Touring-1000 Blue | 2384.07 | 25.42 | 12/31/2012 |
Customizable SQL for Market Basket
Result:
NUMTRANSACTIONS | ENGLISHPRODUCTNAME_LISTAGG |
207 | Mountain Bottle Cage, Water Bottle – 30 oz. |
200 | Mountain Tire Tube, Patch Kit/8 Patches |
142 | LL Road Tire, Patch Kit/8 Patches |
137 | Patch Kit/8 Patches, Road Tire Tube |
135 | Patch Kit/8 Patches, Touring Tire Tube |
132 | HL Mountain Tire, Mountain Tire Tube, Patch Kit/8 Patches |
Time-Series Aggregations
Time series aggregations are not only used by data scientists but they’re used for analytics as well. What makes them difficult is that window functions require the data to be formatted correctly.
For example, if you want to calculate the average sales amount in the past 14 days, window functions require you to have all sales data broken up into one row per day. Unfortunately, anyone who has worked with sales data before knows that it is usually stored at the transaction level. This is where time-series aggregation comes in handy. You can create aggregated, historical metrics without reformatting the entire dataset. It also comes in handy if we want to add multiple metrics at one time:
- Average sales in the past 14 days
- Biggest purchase in last 6 months
- Count Distinct product types in last 90 days
If you wanted to use window functions, each metric would need to be built independently with several steps.
A better way to handle this, is to use common table expressions (CTEs) to define each of the historical windows, pre-aggregated.
For example:
Transaction ID | Customer ID | Product Type | Purchase Amt | Transaction Date |
65432 | 101 | Grocery | 101.14 | 2022-03-01 |
65493 | 101 | Grocery | 98.45 | 2022-04-30 |
65494 | 101 | Automotive | 239.98 | 2022-05-01 |
66789 | 101 | Grocery | 86.55 | 2022-05-22 |
66981 | 101 | Pharmacy | 14 | 2022-06-15 |
67145 | 101 | Grocery | 93.12 | 2022-06-22 |
Customizable SQL for Time Series Aggregate SQL
Result:
Transaction ID | Customer ID | Product Type | Purchase Amt | Transaction Date | Avg Sales Past 14 Days | Max Purchase Past 6 months | Count Distinct Product Type last 90 days |
65432 | 101 | Grocery | 101.14 | 2022-03-01 | 101.14 | 101.14 | 1 |
65493 | 101 | Grocery | 98.45 | 2022-04-30 | 98.45 | 101.14 | 2 |
65494 | 101 | Automotive | 239.98 | 2022-05-01 | 169.21 | 239.98 | 2 |
66789 | 101 | Grocery | 86.55 | 2022-05-22 | 86.55 | 239.98 | 2 |
66981 | 101 | Pharmacy | 14 | 2022-06-15 | 14 | 239.98 | 3 |
67145 | 101 | Grocery | 93.12 | 2022-06-22 | 53.56 | 239.98 | 3 |
Conclusion
I hope this piece helps shed some light on the different troubles that a data practitioner will encounter when operating within the modern data stack. SQL is a double-edged sword when it comes to querying the cloud warehouse. While centralizing the compute in the cloud data warehouse increases speed, it sometimes requires some extra SQL skills. I hope that this piece has helped answer questions and provides the syntax and background needed to tackle these problems.
Josh Berry (@Twitter) leads Customer Facing Data Science at Rasgo and has been in the data and analytics profession since 2008. Josh spent 10 years at Comcast where he built the data science team and was a key owner of the internally developed Comcast feature store – one of the first feature stores to hit the market. Following Comcast, Josh was a critical leader in building out Customer Facing Data Science at DataRobot. In his spare time Josh performs complex analysis on interesting topics such as baseball, F1 racing, housing market predictions, and more.
- 5 Tricky SQL Queries Solved
- Every Complex DataFrame Manipulation, Explained & Visualized Intuitively
- 4 Useful Intermediate SQL Queries for Data Science
- The Definitive Guide to Solving the Phantom Read in MySQL
- How to Manage Your Complex IT Landscape with AIOps
- Python String Matching Without Complex RegEx Syntax