Study Notes #21

SQL GROUP BY

  • allows creating segments that will aggregate independent from one another
  • allows to take the sum of data limited to each account rather than across the entire dataset
  • goes in between the WHERE and ORDER clause
  • you should always see any columns in the SELECT statement that are not being aggregated on, in the GROUP BY statement

The key takeaways here:

  • GROUP BY can be used to aggregate data within subsets of the data. For example, grouping for different accounts, different regions, or different sales representatives.
  • Any column in the SELECT statement that is not within an aggregator must be in the GROUP BY clause.
  • The GROUP BY always goes between WHERE and ORDER BY.
  • ORDER BY works like SORT in spreadsheet software.

GROUP BY – Expert Tip

Before we dive deeper into aggregations using GROUP BY statements, it is worth noting that SQL evaluates the aggregations before the LIMIT clause. If you don’t group by any columns, you’ll get a 1-row result—no problem there. If you group by a column with enough unique values that it exceeds the LIMIT number, the aggregates will be calculated, and then some rows will simply be omitted from the results.

This is actually a nice way to do things because you know you’re going to get the correct aggregates. If SQL cuts the table down to 100 rows, then performed the aggregations, your results would be substantially different. The above query’s results exceed 100 rows, so it’s a perfect example. In the next concept, use the SQL environment to try removing the LIMIT and running it again to see what changes.

Quiz

  1. Which account (by name) placed the earliest order? Your solution should have the account name and the date of the order.
SELECT a.name, o.occurred_at
FROM accounts a
JOIN orders o
ON a.id = o.account_id
ORDER BY o.occurred_at DESC
LIMIT 1;

Answer Key:
SELECT a.name, o.occurred_at
FROM accounts a
JOIN orders o
ON a.id = o.account_id
ORDER BY occurred_at
LIMIT 1;

I learned here that "earliest" is the opposite of "most recent."
  1. Find the total sales in usd for each account. You should include two columns – the total sales for each company’s orders in usd and the company name.
SELECT a.name, SUM(o.total_amt_usd) AS TotalPerAccount
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name;
  1. Via what channel did the most recent (latest) web_event occur, which account was associated with this web_event? Your query should return only three values – the datechannel, and account name.
SELECT a.name AS AccountName, w.channel AS Channel, w.occurred_at AS Date
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
ORDER BY w.occurred_at DESC
LIMIT 1;

Molina Healthcare
  1. Find the total number of times each type of channel from the web_events was used. Your final table should have two columns – the channel and the number of times the channel was used.
SELECT w.channel AS Channel, COUNT(w.channel) AS count
FROM web_events w
GROUP BY Channel;

Answer Key:
SELECT w.channel, COUNT(*)
FROM web_events w
GROUP BY w.channel
  1. Who was the primary contact associated with the earliest web_event?
SELECT a.primary_poc AS PrimaryContact
FROM web_events w
JOIN accounts a
ON a.id = w.account_id
ORDER BY w.occurred_at DESC
LIMIT 1;

Hilde Klopfer

I learned again here that "earliest" is the opposite of "the most recent", hence should have been ASC and not DESC.
  1. What was the smallest order placed by each account in terms of total usd. Provide only two columns – the account name and the total usd. Order from smallest dollar amounts to largest.
SELECT a.name AS Name, SUM(o.total_amt_usd) AS TotalUSD
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY Name
ORDER BY TotalUSD;

Nike	390.25
This is in total per account.

If I get the minimum to get the smallest order placed by an account: 

SELECT a.name AS Name, min(o.total_amt_usd) AS TotalUSD
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY Name
ORDER BY TotalUSD;

My second answer is correct. The missing value is the amount of smallest order placed of which account, and NOT the account with smallest orders in total. 

Answer Key:
SELECT a.name, MIN(total_amt_usd) smallest_order
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY smallest_order;
  1. Find the number of sales reps in each region. Your final table should have two columns – the region and the number of sales_reps. Order from the fewest reps to most reps.
SELECT r.name AS Region, COUNT(s.name) AS SalesReps
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
GROUP BY Region
ORDER BY SalesReps;
regionsalesreps
Midwest9
Southeast10
West10
Northeast21

Key takeaways:

  • You can GROUP BY multiple columns at once, as we showed here. This is often useful to aggregate across a number of different segments.
  • The order of columns listed in the ORDER BY clause does make a difference. You are ordering the columns from left to right.

GROUP BY – Expert Tips

  • The order of column names in your GROUP BY clause doesn’t matter—the results will be the same regardless. If we run the same query and reverse the order in the GROUP BY clause, you can see we get the same results.
  • As with ORDER BY, you can substitute numbers for column names in the GROUP BY clause. It’s generally recommended to do this only when you’re grouping many columns, or if something else is causing the text in the GROUP BY clause to be excessively long.
  • A reminder here that any column that is not within an aggregation must show up in your GROUP BY statement. If you forget, you will likely get an error. However, in the off chance that your query does work, you might not like the results!
SELECT account_id,
       channel,
       COUNT(id) as events
FROM web_events
GROUP BY account_id, channel
ORDER BY account_id DESC
account_idchannelevents
4501direct8
4501organic1
4491organic5
4491direct26
4491banner3
4491adwords12
4491twitter3
4491facebook8
4481adwords1
4481organic1

Quiz

  1. For each account, determine the average amount of each type of paper they purchased across their orders. Your result should have four columns – one for the account name and one for the average quantity purchased for each of the paper types for each account.
SELECT a.name AS Name, AVG(o.standard_qty) AS Standard, AVG(o.gloss_qty) AS Gloss, AVG(o.poster_qty) AS Poster
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY Name;

Answer Key: 
SELECT a.name, AVG(o.standard_qty) avg_stand, AVG(o.gloss_qty) avg_gloss, AVG(o.poster_qty) avg_post
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name;
namestandardglossposter
Boeing360.2666666666666667222.3333333333333333107.9333333333333333
Western Digital192.7846153846153846239.8307692307692308170.2615384615384615
Sysco191.0588235294117647231.1764705882352941173.8676470588235294
Southern347.500000000000000027.750000000000000024.9000000000000000
Altria Group381.098039215686274522.490196078431372525.5686274509803922
Energy Transfer Equity222.1666666666666667158.5000000000000000118.3333333333333333
Aramark120.2500000000000000126.8750000000000000143.0000000000000000
Cognizant Technology Solutions253.8666666666666667260.8666666666666667116.5333333333333333
Time Warner254.3333333333333333246.888888888888888991.3333333333333333
  1. For each account, determine the average amount spent per order on each paper type. Your result should have four columns – one for the account name and one for the average amount spent on each paper type.
SELECT a.name AS Name, AVG(o.standard_amt_usd) AS Standard, AVG(o.gloss_amt_usd) AS Gloss, AVG(o.poster_amt_usd) AS Poster
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY Name;

Answer Key:
SELECT a.name, AVG(o.standard_amt_usd) avg_stand, AVG(o.gloss_amt_usd) avg_gloss, AVG(o.poster_amt_usd) avg_post
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name;
namestandardglossposter
Boeing1797.73066666666666671665.2766666666666667876.4186666666666667
Western Digital961.99523076923076921796.33246153846153851382.5236923076923077
Sysco953.38352941176470591731.51176470588235291411.8052941176470588
Southern1734.0250000000000000207.8475000000000000202.1880000000000000
Altria Group1901.6792156862745098168.4515686274509804207.6172549019607843
Energy Transfer Equity1108.61166666666666671187.1650000000000000960.8666666666666667
Aramark600.0475000000000000950.29375000000000001161.1600000000000000
Cognizant Technology Solutions1266.79466666666666671953.8913333333333333946.2506666666666667
Time Warner1269.12333333333333331849.1977777777777778741.6266666666666667
Marsh & McLennan868.26000000000000001896.84250000000000001299.2000000000000000
3M1563.83035714285714292095.3275000000000000910.3100000000000000
ConAgra Foods880.23600000000000001806.58800000000000001141.6720000000000000
National Oilwell Varco842.57074074074074071778.45888888888888892242.6237037037037037
Cummins1033.42900000000000001390.89300000000000001059.6600000000000000
Unum Group1696.6000000000000000586.7166666666666667242.6977777777777778
Twenty-First Century Fox476.2677777777777778408.0662962962962963673.6592592592592593
  1. Determine the number of times a particular channel was used in the web_events table for each sales rep. Your final table should have three columns – the name of the sales rep, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.
SELECT s.name AS SalesRepName, w.channel AS Channel, COUNT(w.channel) AS ChannelCount 
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN web_events w
ON a.id = w.account_id
GROUP BY SalesRepName, Channel
ORDER BY ChannelCount DESC;

Answer Key:
SELECT s.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name, w.channel
ORDER BY num_events DESC;
salesrepnamechannelchannelcount
Earlie Schleusnerdirect234
Vernita Plumpdirect232
Moon Toriandirect194
Georgianna Chisholmdirect188
Tia Amatodirect185
Maren Mustodirect184
Nelle Meauxdirect179
Maryanna Fiorentinodirect168
Dorotha Seawelldirect161
  1. Determine the number of times a particular channel was used in the web_events table for each region. Your final table should have three columns – the region name, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.
SELECT r.name AS RegionName, w.channel AS Channel, COUNT(w.channel) AS ChannelCount 
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN web_events w
ON a.id = w.account_id
GROUP BY RegionName, Channel
ORDER BY ChannelCount DESC;

Answer Key:
SELECT r.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name, w.channel
ORDER BY num_events DESC;
regionnamechannelchannelcount
Northeastdirect1800
Southeastdirect1548
Westdirect1254
Midwestdirect696
Northeastfacebook335
Northeastorganic317
Northeastadwords300

DISTINCT is always used in SELECT statements, and it provides the unique rows for all columns written in the SELECT statement. Therefore, you only use DISTINCT once in any particular SELECT statement.

You could write:

SELECT DISTINCT column1, column2, column3
FROM table1;

which would return the unique (or DISTINCT) rows across all three columns.

You would not write:

SELECT DISTINCT column1, DISTINCT column2, DISTINCT column3
FROM table1;

You can think of DISTINCT the same way you might think of the statement “unique”.

DISTINCT – Expert Tip

It’s worth noting that using DISTINCT, particularly in aggregations, can slow your queries down quite a bit.

Query 1:

SELECT account_id,
       channel,
       COUNT(id) as events
FROM web_events
GROUP BY account_id, channel
ORDER BY account_id, channel DESC

Query 2:

SELECT account_id,
       channel
FROM web_events
GROUP BY account_id, channel
ORDER BY account_id

Query 3:

SELECT DISTINCT account_id,
       channel
FROM web_events
ORDER BY account_id

Quiz

  1. Use DISTINCT to test if there are any accounts associated with more than one region.
SELECT DISTINCT r.name AS Region, a.name AS Accounts
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
JOIN accounts a
ON s.id = a.sales_rep_id
ORDER BY Accounts;

Answer: 
SELECT DISTINCT id, name
FROM accounts;

What happened here?
This is the simplest way to code this. There's no need to join different tables, as the number of rows will already return the value. Therefore, will only have to count the distinct account id and account name. 
Answer Key:

The below two queries have the same number of resulting rows (351), so we know that every account is associated with only one region. If each account was associated with more than one region, the first query should have returned more rows than the second query.

SELECT a.id as "account id", r.id as "region id", 
a.name as "account name", r.name as "region name"
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id;
and

SELECT DISTINCT id, name
FROM accounts;
  1. Have any sales reps worked on more than one account?
SELECT DISTINCT s.name AS SalesRep, a.name AS Accounts
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
ORDER BY SalesRep;
Answer Key:

Actually, all of the sales reps have worked on more than one account. The fewest number of accounts any sales rep works on is 3. There are 50 sales reps, and they all have more than one account. Using DISTINCT in the second query assures that all of the sales reps are accounted for in the first query.

SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.id, s.name
ORDER BY num_accounts;
and

SELECT DISTINCT id, name
FROM sales_reps;

SQL HAVING

HAVING is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery. Essentially, any time you want to perform a WHERE on an element of your query that was created by an aggregate, you need to use HAVING instead.

Query 1:

SELECT account_id,
       SUM(total_amt_usd) AS sum_total_amt_usd
FROM orders
GROUP BY 1
ORDER BY 2 DESC

Query 2: Results in an Error

SELECT account_id,
       SUM(total_amt_usd) AS sum_total_amt_usd
FROM orders
WHERE SUM(total_amt_usd) >= 250000
GROUP BY 1
ORDER BY 2 DESC

Query 3:

SELECT account_id,
       SUM(total_amt_usd) AS sum_total_amt_usd
FROM orders
GROUP BY 1
HAVING SUM(total_amt_usd) >= 250000
  • WHERE subsets the returned data based on a logical condition.
  • WHERE appears after the FROM, JOIN, and ON clauses, but before GROUP BY.
  • HAVING appears after the GROUP BY clause, but before the ORDER BY clause
  • HAVING is like WHERE, but it works on logical statements involving aggregations.

Quiz

  1. How many of the sales reps have more than 5 accounts that they manage?
SELECT sales_rep_id, COUNT(name)
FROM accounts
GROUP BY sales_rep_id
HAVING COUNT(name) > 5;
sales_rep_idcount
3217407
32198011
3219509
3215806
32184011
3215207
3216807

I am out here…

Answer Key:

SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.id, s.name
HAVING COUNT(*) > 5
ORDER BY num_accounts;

and technically, we can get this using a SUBQUERY as shown below. This same logic can be used for the other queries, but this will not be shown.

SELECT COUNT(*) num_reps_above5
FROM(SELECT s.id, s.name, COUNT(*) num_accounts
     FROM accounts a
     JOIN sales_reps s
     ON s.id = a.sales_rep_id
     GROUP BY s.id, s.name
     HAVING COUNT(*) > 5
     ORDER BY num_accounts) AS Table1;

My answer didn’t directly show the answer to the question, while this second query from the Answer Key does. What happened there?

On the second query, it seems like it first counted the number of sales_rep more then 5 in a parenthesis (subquery), then on the SELECT created one table to display the total. I have to try this to the second problem below:

  1. How many accounts have more than 20 orders?
SELECT COUNT(*) TotalOrders
FROM (SELECT account_id, COUNT(id)
FROM orders
GROUP BY account_id
HAVING COUNT(id) > 20) AS Table1;
totalorders
120

Answer Key

SELECT a.id, a.name, COUNT(*) num_orders
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING COUNT(*) > 20
ORDER BY num_orders;

In their answer here, they didn’t use a subquery. 😞 It returned a table, but have to look at the total to see and still matched with my answer – 120.

  1. Which account has the most orders?
SELECT account_id, COUNT(id)
FROM orders
GROUP BY account_id
ORDER BY COUNT(id) DESC
LIMIT 1;
SELECT a.name, COUNT(o.id)
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY COUNT(o.id) DESC
LIMIT 1;
namecount
Leucadia National71

Answer Key:

SELECT a.id, a.name, COUNT(*) num_orders
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY num_orders DESC
LIMIT 1;

I think my answer is still correct, except they added the account ID column.

  1. Which accounts spent more than 30,000 usd total across all orders?
SELECT a.name, SUM(o.total_amt_usd)
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
HAVING SUM(o.total_amt_usd) > '30000'
ORDER BY a.name;
namesum
3M127945.10
Abbott Laboratories96819.92
ADP163579.18
Aetna237781.30
Aflac117862.77
Allstate96779.18
Ally Financial37653.96

Answer Key:

SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING SUM(o.total_amt_usd) > 30000
ORDER BY total_spent;

My answer is still correct, except they sorted by total USD spent, and added Account ID column.

  1. Which accounts spent less than 1,000 usd total across all orders?
SELECT a.name, SUM(o.total_amt_usd)
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
HAVING SUM(o.total_amt_usd) < '1000'
ORDER BY a.name;
namesum
Delta Air Lines859.64
Level 3 Communications881.73
Nike390.25

Answer Key:

SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING SUM(o.total_amt_usd) < 1000
ORDER BY total_spent;

My answer is still correct, except they sorted by total usd spent and added account ID column.

  1. Which account has spent the most with us?
SELECT a.name, SUM(o.total_amt_usd)
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY SUM(o.total_amt_usd) DESC
LIMIT 1;
namesum
EOG Resources382873.30

Answer Key:

SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY total_spent DESC
LIMIT 1;

My answer is still correct, except they added Account ID column.

  1. Which account has spent the least with us?
SELECT a.name, SUM(o.total_amt_usd)
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY SUM(o.total_amt_usd) ASC
LIMIT 1;
namesum
Nike390.25

Answer Key:

SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY total_spent
LIMIT 1;

My answer is correct, except they added Account ID column and I forgot that I didn’t have to add ASC.

  1. Which accounts used facebook as a channel to contact customers more than 6 times?
SELECT a.name, w.channel, COUNT(w.channel)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE w.channel = 'facebook'
GROUP BY a.name, w.channel
HAVING COUNT(w.channel) > 6;
namechannelcount
Core-Mark Holdingfacebook8
Cameron Internationalfacebook9
ADPfacebook9
Massachusetts Mutual Life Insurancefacebook9
BlackRockfacebook10
PayPal Holdingsfacebook8
eBayfacebook7

Answer Key:

SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.id, a.name, w.channel
HAVING COUNT(*) > 6 AND w.channel = 'facebook'
ORDER BY use_of_channel;

My query returned the same results, but two things I learned here:

  1. I didn’t have to use the WHERE clause, as I could have used AND clause with HAVING clause to filter facebook.
  2. HAVING clause also works with categorical data.
  1. Which account used facebook most as a channel?
SELECT a.name, w.channel, COUNT(w.channel)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE w.channel = 'facebook'
GROUP BY a.name, w.channel
ORDER BY COUNT(w.channel) DESC
LIMIT 1;
namechannelcount
Gilead Sciencesfacebook16

Answer Key:

SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE w.channel = 'facebook'
GROUP BY a.id, a.name, w.channel
ORDER BY use_of_channel DESC
LIMIT 1;

Note: This query above only works if there are no ties for the account that used facebook the most. It is a best practice to use a larger limit number first such as 3 or 5 to see if there are ties before using LIMIT 1.

  1. Which channel was most frequently used by most accounts?
SELECT a.name, w.channel, COUNT(w.channel)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.name, w.channel
ORDER BY COUNT(w.channel) DESC
LIMIT 1;
namechannelcount
Leucadia Nationaldirect52

Answer Key:

SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.id, a.name, w.channel
ORDER BY use_of_channel DESC
LIMIT 10;

My answer is correct, “direct”, the instructions didn’t specify to display data for 10 accounts.

SQL DATE Functions

GROUPing BY a date column is not usually very useful in SQL, as these columns tend to have transaction data down to a second. Keeping date information at such granular levels is both a blessing and a curse, as it gives really precise information (a blessing), but it makes grouping information together directly difficult (a curse).

DATE_TRUNC

DATE_TRUNC allows you to truncate your date to a particular part of your date-time column. Common truncations are daymonth, and yearHere is a great blog post by Mode Analytics on the power of this function.

DATE_PART

DATE_PART can be useful for pulling a specific portion of a date, but notice pulling month or day of the week (dow) means that you are no longer keeping the years in order. Rather you are grouping for certain components regardless of which year they belonged in.

For additional functions you can use with dates, check out the documentation here, but the DATE_TRUNC and DATE_PART functions definitely give you a great start!

You can reference the columns in your select statement in GROUP BY and ORDER BY clauses with numbers that follow the order they appear in the select statement. For example

SELECT standard_qty, COUNT(*)

FROM orders

GROUP BY 1 (this 1 refers to standard_qty since it is the first of the columns included in the select statement)

ORDER BY 1 (this 1 refers to standard_qty since it is the first of the columns included in the select statement)

SELECT DATE_PART('dow',occurred_at) AS day_of_week,
       account_id,
       occurred_at,
       total
FROM orders
SELECT DATE_PART('dow',occurred_at) AS day_of_week,
       SUM(total) AS total_qty
FROM orders
GROUP BY 1
ORDER BY 2

dow – day of the week

Quiz

  1. Find the sales in terms of total dollars for all orders in each year, ordered from greatest to least. Do you notice any trends in the yearly sales totals?
SELECT DATE_PART('year', occurred_at) AS Year, SUM(total_amt_usd)
FROM orders
GROUP BY DATE_PART('year', occurred_at)
ORDER BY SUM(total_amt_usd) DESC;
yearsum
201612864917.92
20155752004.94
20144069106.54
2013377331.00
201778151.43

There’s a continued yearly growth from 2013-2016, but it suddenly fell down to it’s lowest in 2017.

Answer Key

SELECT DATE_PART('year', occurred_at) ord_year,  SUM(total_amt_usd) total_spent
 FROM orders
 GROUP BY 1
 ORDER BY 2 DESC;

When we look at the yearly totals, you might notice that 2013 and 2017 have much smaller totals than all other years. If we look further at the monthly data, we see that for 2013 and 2017 there is only one month of sales for each of these years (12 for 2013 and 1 for 2017). Therefore, neither of these is evenly represented. Sales have been increasing year over year, with 2016 being the largest sales to date. At this rate, we might expect 2017 to have the largest sales.

  1. Which month did Parch & Posey have the greatest sales in terms of total dollars? Are all months evenly represented by the dataset?
SELECT DATE_TRUNC('month', occurred_at) AS Month, SUM(total_amt_usd)
FROM orders
GROUP BY DATE_TRUNC('month', occurred_at)
ORDER BY SUM(total_amt_usd) DESC;

*Month specific to a year. 
Answer is December of 2016.
monthsum
2016-12-01T00:00:00.000Z1770282.62
SELECT DATE_PART('month', occurred_at) AS Month, SUM(total_amt_usd)
FROM orders
GROUP BY DATE_PART('month', occurred_at)
ORDER BY SUM(total_amt_usd) DESC;

*Month regardless of the year. 
monthsum
123129411.98

Answer Key:

SELECT DATE_PART('month', occurred_at) ord_month, SUM(total_amt_usd) total_spent
FROM orders
WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'
GROUP BY 1
ORDER BY 2 DESC; 

They had to remove the data from 2013 and 2017 to be fair, as there were only one order for each of those years. The data for the years were not evenly represented.

  1. Which year did Parch & Posey have the greatest sales in terms of the total number of orders? Are all years evenly represented by the dataset?
SELECT DATE_PART('year', occurred_at) AS Year, COUNT(id) AS TotalOrders
FROM orders
GROUP BY DATE_PART('year', occurred_at)
ORDER BY COUNT(id) DESC;
yeartotalorders
20163757

Same answer whether I use DATE_TRUNC or DATE_PART

Answer Key:

SELECT DATE_PART('year', occurred_at) ord_year,  COUNT(*) total_sales
FROM orders
GROUP BY 1
ORDER BY 2 DESC;
  1. Which month did Parch & Posey have the greatest sales in terms of the total number of orders? Are all months evenly represented by the dataset?
SELECT DATE_PART('month', occurred_at) AS Month, COUNT(id) AS TotalOrders
FROM orders
GROUP BY DATE_PART('month', occurred_at)
ORDER BY COUNT(id) DESC;

*Month regardless of Year
monthtotalorders
12882
SELECT DATE_TRUNC('month', occurred_at) AS Month, COUNT(id) AS TotalOrders
FROM orders
GROUP BY DATE_TRUNC('month', occurred_at)
ORDER BY COUNT(id) DESC;

*Month specific to the year, December of 2016. 
monthtotalorders
2016-12-01T00:00:00.000Z463
  1. In which month of which year did Walmart spend the most on gloss paper in terms of dollars?
SELECT DATE_TRUNC('month', occurred_at) AS Month, SUM(gloss_amt_usd) AS Total_Gloss_in_USD
FROM orders
GROUP BY DATE_TRUNC('month', occurred_at)
ORDER BY SUM(gloss_amt_usd) DESC;

*They specified that they need the month of which year.
monthtotal_gloss_in_usd
2016-12-01T00:00:00.000Z506825.83

Answer Key:

SELECT DATE_TRUNC('month', o.occurred_at) ord_date, SUM(o.gloss_amt_usd) tot_spent
FROM orders o 
JOIN accounts a
ON a.id = o.account_id
WHERE a.name = 'Walmart'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;

I am completely wrong here as I missed the condition – Walmart.

CASE Statements

Derive

  • take data from existing columns and modify them

CASE

  • handles “IF” “THEN” logic
  • must end with the word “END”

ELSE

  • captures values not specified in “WHEN” and “THEN” statements
SELECT account_id,
       occurred_at,
       total,
       CASE WHEN total > 500 THEN 'Over 500'
            WHEN total > 300 AND total <= 500 THEN '301 - 500'
            WHEN total > 100 AND total <=300 THEN '101 - 300'
            ELSE '100 or under' END AS total_group
FROM orders

THEN adds the value to enter when conditions are met on the column added via END AS.

account_idoccurred_attotaltotal_group
10012015-10-06T17:31:14.000Z169101 – 300
10012015-11-05T03:34:33.000Z288101 – 300
10012015-12-04T04:21:55.000Z132101 – 300
10012016-01-02T01:18:24.000Z176101 – 300
10012016-02-01T19:27:27.000Z165101 – 300
10012016-03-02T15:29:32.000Z173101 – 300

CASE – Expert Tip

  • The CASE statement always goes in the SELECT clause.
  • CASE must include the following components: WHEN, THEN, and END. ELSE is an optional component to catch cases that didn’t meet any of the other previous CASE conditions.
  • You can make any conditional statement using any conditional operator (like WHERE) between WHEN and THEN. This includes stringing together multiple conditional statements using AND and OR.
  • You can include multiple WHEN statements, as well as an ELSE statement again, to deal with any unaddressed conditions.

Example

In a quiz question in the previous Basic SQL lesson, you saw this question:

  1. Create a column that divides the standard_amt_usd by the standard_qty to find the unit price for standard paper for each order. Limit the results to the first 10 orders, and include the id and account_id fields. NOTE – you will be thrown an error with the correct solution to this question. This is for a division by zero. You will learn how to get a solution without an error to this query when you learn about CASE statements in a later section.

Let’s see how we can use the CASE statement to get around this error.

SELECT id, account_id, standard_amt_usd/standard_qty AS unit_price
FROM orders
LIMIT 10;

Now, let’s use a CASE statement. This way any time the standard_qty is zero, we will return 0, and otherwise, we will return the unit_price.

SELECT account_id, CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0
                        ELSE standard_amt_usd/standard_qty END AS unit_price
FROM orders
LIMIT 10;

Now the first part of the statement will catch any of those divisions by zero values that were causing the error, and the other components will compute the division as necessary. You will notice, we essentially charge all of our accounts 4.99 for standard paper. It makes sense this doesn’t fluctuate, and it is more accurate than adding 1 in the denominator like our quick fix might have been in the earlier lesson.

You can try it yourself using the environment below.

Query 1:

SELECT id,
       account_id,
       occurred_at,
       channel,
       CASE WHEN channel = 'facebook' THEN 'yes' END AS is_facebook
FROM web_events
ORDER BY occurred_at

Query 2:

SELECT id,
       account_id,
       occurred_at,
       channel,
       CASE WHEN channel = 'facebook' THEN 'yes' ELSE 'no' END AS is_facebook
FROM web_events
ORDER BY occurred_at

Query 3:

SELECT id,
       account_id,
       occurred_at,
       channel,
       CASE WHEN channel = 'facebook' OR channel = 'direct' THEN 'yes' 
       ELSE 'no' END AS is_facebook
FROM web_events
ORDER BY occurred_at

Query 4:

SELECT account_id,
       occurred_at,
       total,
       CASE WHEN total > 500 THEN 'Over 500'
            WHEN total > 300 THEN '301 - 500'
            WHEN total > 100 THEN '101 - 300'
            ELSE '100 or under' END AS total_group
FROM orders

CASE & Aggregations

SELECT CASE WHEN total > 500 THEN 'Over 500'
            ELSE '500 or under' END AS total_group,
            COUNT(*) AS order_count
FROM orders
GROUP BY 1
total_grouporder_count
500 or under3716
Over 5003196

This is the same as:

SELECT CASE WHEN total > 500 THEN 'Over 500'
            ELSE '500 or under' END AS total_group,
            COUNT(*) AS order_count
FROM orders
GROUP BY CASE WHEN total > 500 THEN 'Over 500'
            ELSE '500 or under' END;

Quiz

  1. Write a query to display for each order, the account ID, the total amount of the order, and the level of the order – ‘Large’ or ’Small’ – depending on if the order is $3000 or more, or smaller than $3000.
SELECT account_id, SUM(total_amt_usd), CASE WHEN SUM(total_amt_usd) >= 3000 THEN 'Large' ELSE 'Small' END AS Level
FROM orders
GROUP BY account_id
ORDER BY SUM(total_amt_usd) DESC;
account_idsumlevel
4211382873.30Large
4151345618.59Large
1301326819.48Large
1871300694.79Large
4111293861.14Large
3411291047.25Large

I think this is the correct answer:

SELECT account_id, total_amt_usd, CASE WHEN total_amt_usd >= 3000 THEN 'Large' ELSE 'Small' END AS Level
FROM orders
GROUP BY CASE WHEN total_amt_usd >= 3000 THEN 'Large' ELSE 'Small' END, total_amt_usd, account_id
ORDER BY total_amt_usd DESC

*It asks for for each order and not for orders in total per account. 

Answer Key:

SELECT account_id, total_amt_usd,
CASE WHEN total_amt_usd > 3000 THEN 'Large'
ELSE 'Small' END AS order_level
FROM orders;
  1. Write a query to display the number of orders in each of three categories, based on the total number of items in each order. The three categories are: ‘At Least 2000’, ‘Between 1000 and 2000’ and ‘Less than 1000’.
SELECT CASE WHEN total >= 2000 THEN 'At Least 2000'
WHEN total BETWEEN 1000 AND 2000 THEN 'Between 1000 and 2000'
WHEN total < 1000 THEN 'Less than 1000' END AS Category,
COUNT(*) Num_Orders
FROM orders
GROUP BY CASE WHEN total >= 2000 THEN 'At Least 2000'
WHEN total BETWEEN 1000 AND 2000 THEN 'Between 1000 and 2000'
WHEN total < 1000 THEN 'Less than 1000' END;
categorynum_orders
At Least 200070
Between 1000 and 2000511
Less than 10006331

Answer Key:

SELECT CASE WHEN total >= 2000 THEN 'At Least 2000'
WHEN total >= 1000 AND total < 2000 THEN 'Between 1000 and 2000'
ELSE 'Less than 1000' END AS order_category,
COUNT(*) AS order_count
FROM orders
GROUP BY 1;

This returns the same value.

  1. We would like to understand 3 different levels of customers based on the amount associated with their purchases. The top-level includes anyone with a Lifetime Value (total sales of all orders) greater than 200,000 usd. The second level is between 200,000 and 100,000 usd. The lowest level is anyone under 100,000 usd. Provide a table that includes the level associated with each account. You should provide the account name, the total sales of all orders for the customer, and the level. Order with the top spending customers listed first.
SELECT a.name AccountName, SUM(o.total_amt_usd), 
       CASE WHEN SUM(o.total_amt_usd) > 200000 THEN 'Top Level' WHEN SUM(o.total_amt_usd) <= 200000 AND SUM(o.total_amt_usd) >= 100000 THEN 'Second Level' WHEN SUM(o.total_amt_usd) < 100000 THEN 'Lowest Level' END AS Level
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY SUM(o.total_amt_usd) DESC;
accountnamesumlevel
EOG Resources382873.30Top Level
Mosaic345618.59Top Level
IBM326819.48Top Level
General Dynamics300694.79Top Level
Republic Services293861.14Top Level
Leucadia National291047.25Top Level
Arrow Electronics281018.36Top Level

Answer Key

SELECT a.name, SUM(total_amt_usd) total_spent, 
CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'
WHEN  SUM(total_amt_usd) > 100000 THEN 'middle'
ELSE 'low' END AS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id 
GROUP BY a.name
ORDER BY 2 DESC;

The same results.

  1. We would now like to perform a similar calculation to the first, but we want to obtain the total amount spent by customers only in 2016 and 2017. Keep the same levels as in the previous question. Order with the top spending customers listed first.
SELECT DATE_PART('year', o.occurred_at) AS Year, a.name AccountName, SUM(o.total_amt_usd), 
       CASE WHEN SUM(o.total_amt_usd) > 200000 THEN 'Top Level' WHEN SUM(o.total_amt_usd) <= 200000 AND SUM(o.total_amt_usd) >= 100000 THEN 'Second Level' WHEN SUM(o.total_amt_usd) < 100000 THEN 'Lowest Level' END AS Level
FROM accounts a
JOIN orders o
ON a.id = o.account_id
WHERE DATE_PART('year', o.occurred_at) = '2016' OR DATE_PART('year', o.occurred_at) = '2017'
GROUP BY a.name, DATE_PART('year', o.occurred_at)
ORDER BY SUM(o.total_amt_usd) DESC;
yearaccountnamesumlevel
2016Pacific Life255319.18Top Level
2016Mosaic172180.04Second Level
2016CHS163471.78Second Level
2016Core-Mark Holding148105.93Second Level
2016Disney129157.38Second Level
2016National Oilwell Varco121873.16Second Level
2016Sears Holdings114003.21Second Level

Answer Key

SELECT a.name, SUM(total_amt_usd) total_spent, 
CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'
WHEN  SUM(total_amt_usd) > 100000 THEN 'middle'
ELSE 'low' END AS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
WHERE occurred_at > '2015-12-31' 
GROUP BY 1
ORDER BY 2 DESC;

This returns same results, but simpler queries. I realize I would need to work on strategy on manipulating data. Here, instead of giving a condition with DATE_PART, it simply uses > DATE since 2016 and 2017 is beyond 2015-12-31 and data available is only till 2017.

  1. We would like to identify top-performing sales reps, which are sales reps associated with more than 200 orders. Create a table with the sales rep name, the total number of orders, and a column with top or not depending on if they have more than 200 orders. Place the top salespeople first in your final table.
SELECT s.name SalesRepName, COUNT(o.id) TotalOrder, CASE WHEN COUNT(o.id) > 200 THEN 'Top' WHEN COUNT(o.id) <= 200 THEN 'Not' END AS TopPerformer
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN orders o
ON a.id = o.account_id
GROUP BY s.name
ORDER BY COUNT(o.id) DESC;
salesrepnametotalordertopperformer
Earlie Schleusner335Top
Vernita Plump299Top
Tia Amato267Top
Georgianna Chisholm256Top
Moon Torian250Top
Nelle Meaux241Top
Maren Musto224Top

Answer Key

SELECT s.name, COUNT(*) num_ords,
  CASE WHEN COUNT(*) > 200 THEN 'top'
  ELSE 'not' END AS sales_rep_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id 
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name
ORDER BY 2 DESC;

This also returns the same results.

  1. The previous didn’t account for the middle, nor the dollar amount associated with the sales. Management decides they want to see these characteristics represented as well. We would like to identify top-performing sales reps, which are sales reps associated with more than 200 orders or more than 750000 in total sales. The middle group has any rep with more than 150 orders or 500000 in sales. Create a table with the sales rep name, the total number of orders, total sales across all orders, and a column with topmiddle, or low depending on these criteria. Place the top salespeople based on the dollar amount of sales first in your final table. You might see a few upset salespeople by this criteria!
SELECT s.name SalesRepName, COUNT(o.id) TotalOrder, SUM(o.total_amt_usd) TotalOrderUSD, CASE WHEN COUNT(o.id) > 200 OR SUM(o.total_amt_usd) > 750000 THEN 'Top' WHEN COUNT(o.id) = 200 OR COUNT(o.id) > 150 OR SUM(o.total_amt_usd) = 750000 OR SUM(o.total_amt_usd) > 500000 THEN 'Middle' ELSE 'Low' END AS TopPerformer
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN orders o
ON a.id = o.account_id
GROUP BY s.name
ORDER BY SUM(o.total_amt_usd) DESC;
salesrepnametotalordertotalorderusdtopperformer
Earlie Schleusner3351098137.72Top
Tia Amato2671010690.60Top
Vernita Plump299934212.93Top
Georgianna Chisholm256886244.12Top
Arica Stoltzfus186810353.34Top
Dorotha Seawell208766935.04Top
Nelle Meaux241749076.16Top
Sibyl Lauria193722084.27Middle
Maren Musto224702697.29Top
Brandie Riva167675917.64Middle
Charles Bidwell205675637.19Top
Elwood Shutt191662500.24Middle
Maryanna Fiorentino204655954.74Top

RECAP

Each of the sections has been labeled to assist if you need to revisit a particular topic. Intentionally, the solutions for a particular section are actually not in the labeled section, because my hope is this will force you to practice if you have a question about a particular topic we covered.

You have now gained a ton of useful skills associated with SQL. The combination of JOINs and Aggregations is one of the reasons SQL is such a powerful tool.

If there was a particular topic you struggled with, I suggest coming back and revisiting the questions with a fresh mind. The more you practice the better, but you also don’t want to get stuck on the same problem for an extended period of time!

In this lesson, we covered and you can now:

  • Deal with NULL values
  • Create aggregations in your SQL Queries including
    • COUNT
    • SUM
    • MIN & MAX
    • AVG
    • GROUP BY
    • DISTINCT
    • HAVING
  • Create DATE functions
  • Implement CASE statements

Up next are Subqueries and Temporary Tables. See you there!

KeyTermDefinition
DISTINCTAlways used in SELECT statements, and it provides the unique rows for all columns written in the SELECT statement.
GROUP BYUsed to aggregate data within subsets of the data. For example, grouping for different accounts, different regions, or different sales representatives.
HAVINGis the “clean” way to filter a query that has been aggregated
NULLsA datatype that specifies where no data exists in SQL
This entry was posted in Study Notes and tagged . Bookmark the permalink.

Leave a Reply