Study Notes #18

Aggregation

  • no data / different from zero or space
  • a datatype that specifies where no data exists in SQL. They are often ignored in our aggregation functions, which you will get a first look at in the next concept using COUNT.
  • is not a value, but a property of a data (use IS query, instead of =)
  • are different than a zero – they are cells where data does not exist.
  • When identifying NULLs in a WHERE clause, we write IS NULL or IS NOT NULL. We don’t use =, because NULL isn’t considered a value in SQL. Rather, it is a property of the data.

NULLs – Expert Tip

There are two common ways in which you are likely to encounter NULLs:

  • NULLs frequently occur when performing a LEFT or RIGHT JOIN. You saw in the last lesson – when some rows in the left table of a left join are not matched with rows in the right table, those rows will contain some NULL values in the result set.
  • NULLs can also occur from simply missing data in our database

Query 1

SELECT *
FROM accounts
WHERE id > 1500 and id < 1600

Query 2

SELECT *
FROM accounts
WHERE primary_poc = NULL

Query 3

SELECT *
FROM accounts
WHERE primary_poc IS NOT NULL

First Aggregation – COUNT

You can see that this returns a single row of output in a single column,PauseMute

Loaded: 100.00%

Remaining Time -0:30

1xPlayback RateCaptionsPicture-in-PictureFullscreen

COUNT the Number of Rows in a Table

Try your hand at finding the number of rows in each table. Here is an example of finding all the rows in the accounts table.

SELECT COUNT(*)
FROM accounts;

But we could have just as easily chosen a column to drop into the aggregation function:

SELECT COUNT(accounts.id)
FROM accounts;

These two statements are equivalent, but this isn’t always the case, which we will see in the next video.

Code from the Video

Query 1

SELECT *
FROM orders
WHERE occurred_at >= '2016-12-01'
AND occurred_at < '2017-01-01'

Query 2

SELECT COUNT(*)
FROM orders
WHERE occurred_at >= '2016-12-01'
AND occurred_at < '2017-01-01'

Query 3

SELECT COUNT(*) AS order_count
FROM orders
WHERE occurred_at >= '2016-12-01'
AND occurred_at < '2017-01-01'

First Aggregation – COUNT

So, the result produced by aPauseMute

0%

Loaded: 100.00%

Remaining Time -0:05

1xPlayback RateCaptionsPicture-in-PictureFullscreen

COUNT the Number of Rows in a Table

Try your hand at finding the number of rows in each table. Here is an example of finding all the rows in the accounts table.

SELECT COUNT(*)
FROM accounts;

But we could have just as easily chosen a column to drop into the aggregation function:

SELECT COUNT(accounts.id)
FROM accounts;

These two statements are equivalent, but this isn’t always the case, which we will see in the next video.

Code from the Video

Query 1

SELECT *
FROM orders
WHERE occurred_at >= '2016-12-01'
AND occurred_at < '2017-01-01'

Query 2

SELECT COUNT(*)
FROM orders
WHERE occurred_at >= '2016-12-01'
AND occurred_at < '2017-01-01'

Query 3

SELECT COUNT(*) AS order_count
FROM orders
WHERE occurred_at >= '2016-12-01'
AND occurred_at < '2017-01-01'

Count

  • can help us identify the number of null values in any particular column
  • if the count result of a column matches the number of rows in a table, there are no nulls in the column
  • If the count result of a column is less than the number of rows in the table, we know the difference is the number of nulls
  • we can use the COUNT function in any column

COUNT the Number of Rows in a Table

Try your hand at finding the number of rows in each table. Here is an example of finding all the rows in the accounts table.

SELECT COUNT(*)
FROM accounts;

But we could have just as easily chosen a column to drop into the aggregation function:

SELECT COUNT(accounts.id)
FROM accounts;

Query 1

SELECT *
FROM orders
WHERE occurred_at >= '2016-12-01'
AND occurred_at < '2017-01-01'

Query 2

SELECT COUNT(*)
FROM orders
WHERE occurred_at >= '2016-12-01'
AND occurred_at < '2017-01-01'

Query 3

SELECT COUNT(*) AS order_count
FROM orders
WHERE occurred_at >= '2016-12-01'
AND occurred_at < '2017-01-01'

Query 1

SELECT COUNT (*) AS account_count
FROM accounts

Counts all the number of rows, regardless of NULL

Query 2

SELECT COUNT (id) AS account_id_count
FROM accounts

Counts the number of records in one column, regardless of null. 

Query 3

SELECT COUNT(primary_poc) AS account_primary_poc_count
FROM accounts

Counts the number of records in one column, regardless of null. 

Query 4

Example: there will not be any NULL values in the workspace.

SELECT *
FROM accounts
WHERE primary_poc IS NULL

Counts the number of NULLs of specific column in a table. 

SUM

  • can only be used for columns that have quantitative data, or numerical values
  • SUM(id), ie SUMS(standard_qty)
  • SUM treats NULL as 0
  • aggregators only aggregate vertically – the values of a column. If you want to perform a calculation across rows, you would do this with simple arithmetic.
SELECT SUM(standard_qty) AS standard,
       SUM(gloss_qty) AS gloss,
       SUM(poster_qty) AS poster
FROM orders

This displays 3 columns - "standard", "gloss" and "poster". It returns the sum of standard_qty, gloss_qty and poster_qty from the orders table. 

Quiz

  1. Find the total amount of poster_qty paper ordered in the orders table.
SELECT SUM(poster_qty) AS TotalPoster
FROM orders;

Answer: 723646

Answer Key: 
SELECT SUM(poster_qty) AS total_poster_sales
FROM orders;

I am correct. 
  1. Find the total amount of standard_qty paper ordered in the orders table.
SELECT SUM(standard_qty) AS TotalStandard
FROM orders;

Answer: 1938346

Answer Key:
SELECT SUM(standard_qty) AS total_standard_sales
FROM orders;

I am correct. 
  1. Find the total dollar amount of sales using the total_amt_usd in the orders table.
SELECT SUM(total_amt_usd) AS TotalAmount
FROM orders;

Answer: 23141511.83

Answer Key: 
SELECT SUM(total_amt_usd) AS total_dollar_sales
FROM orders;

I am correct. 
  1. Find the total amount spent on standard_amt_usd and gloss_amt_usd paper for each order in the orders table. This should give a dollar amount for each order in the table.
SELECT orders.id, SUM(standard_amt_usd) AS TotalAmount, SUM(gloss_amt_usd) AS TotalGlossAmount
FROM orders
GROUP BY orders.id

Note: For this kind of scenario, the column (orders.id) we're looking for the total per (total amount per order) must be in GROUP BY query

Answer Key: 
SELECT standard_amt_usd + gloss_amt_usd AS total_standard_gloss
FROM orders;

I need to be careful in reading the problem. "the total amount spent on standard_amt_usd and gloss_amt_usd for each" I made a table with individual calculations, when it should have been the total for both columns. Still good learning on GROUP BY
  1. Find the standard_amt_usd per unit of standard_qty paper. Your solution should use both aggregation and a mathematical operator.
SELECT SUM(standard_amt_usd)/SUM(standard_qty) AS CostPerUnit
FROM orders

This is just my hard attempt. 
Answer: 4.9900000000000000

Answer Key: 
SELECT SUM(standard_amt_usd)/SUM(standard_qty) AS standard_price_per_unit
FROM orders;

I am correct. 

SQL MIN & MAX

  • similar with COUNT
  • ignore NULLs
  • can be used on non-numerical columns (i.e. MIN will return the lowest number, earliest date, or non-numerical value as early in the alphabet as possible)
SELECT MIN(standard_qty) AS standard_min,
       MIN(gloss_qty) AS gloss_min,
       MIN(poster_qty) AS poster_min,
       MAX(standard_qty) AS standard_max,
       MAX(gloss_qty) AS gloss_max,
       MAX(poster_qty) AS poster_max
FROM   orders

This returns a table with multiple columns, based on SELECT. Then displays either MIN or MAX per column. 

SQL AVG

  • can only be used on numerical values
  • ignores NULL completely
  • returns the mean of the data
  • SUM/COUNT (If you want to treat NULLs as 0)

MEDIAN – Expert Tip

One quick note that a median might be a more appropriate measure of center for this data, but finding the median happens to be a pretty difficult thing to get using SQL alone — so difficult that finding a median is occasionally asked as an interview question.

SELECT AVG(standard_qty) AS standard_avg,
       AVG(gloss_qty) AS gloss_avg,
       AVG(poster_qty) AS poster_avg
FROM orders

This returns a table with 3 columns, showing the AVERAGE for each

Quiz

  1. When was the earliest order ever placed? You only need to return the date.
SELECT MIN(orders.occurred_at)
FROM orders

Answer Key: 
SELECT MIN(occurred_at) 
FROM orders;

I am correct, though it seems like the table doesn't need to be added on the SELECT query.
  1. Try performing the same query as in question 1 without using an aggregation function.
SELECT orders.occurred_at
FROM orders
ORDER BY orders.occurred_at ASC

This is my hard attempt, it shows the entire column, but I could see the MIN. 

Answer Key: 
SELECT occurred_at 
FROM orders 
ORDER BY occurred_at
LIMIT 1;

I am almost correct, but I learned that I could use LIMIT query to only see one entry.
  1. When did the most recent (latest) web_event occur?
SELECT max(web_events.occurred_at)
FROM web_events

Answer: 2017-01-01T23:51:09.000Z

Answer Key: 
SELECT MAX(occurred_at)
FROM web_events;

I have to ask this - do I need to put ; in the end?
  1. Try to perform the result of the previous query without using an aggregation function.
SELECT web_events.occurred_at
FROM web_events
ORDER BY web_events.occurred_at DESC

Answer Key: 
SELECT occurred_at
FROM web_events
ORDER BY occurred_at DESC
LIMIT 1;

Almost correct again, if I knew about the LIMIT
  1. Find the mean (AVERAGE) amount spent per order on each paper type, as well as the mean amount of each paper type purchased per order. Your final answer should have 6 values – one for each paper type for the average number of sales, as well as the average amount.
SELECT AVG(orders.standard_amt_usd) AS AverageStandardUSD,
       AVG(orders.gloss_amt_usd) AS AverageGlossUSD,
       AVG(orders.poster_amt_usd) AS AveragePosterUSD,
       AVG(orders.standard_qty) AS AverageStandard,
       AVG(orders.gloss_qty) AS AverageGloss,
       AVG(orders.poster_qty) AS AveragePoster
FROM orders

Answer Key: 
SELECT AVG(standard_qty) mean_standard, AVG(gloss_qty) mean_gloss, 
        AVG(poster_qty) mean_poster, AVG(standard_amt_usd) mean_standard_usd, 
        AVG(gloss_amt_usd) mean_gloss_usd, AVG(poster_amt_usd) mean_poster_usd
FROM orders;

I am correct. 
  1. Via the video, you might be interested in how to calculate the MEDIAN. Though this is more advanced than what we have covered so far try finding – what is the MEDIAN total_usd spent on all orders
SELECT MAX(orders.total_amt_usd) + MIN(orders.total_amt_usd)/2 AS Median
FROM orders

Answer: 232207.07000000000000000000

This is a hard attempt. 

Answer Key: 

SELECT *
FROM (SELECT total_amt_usd
   FROM orders
   ORDER BY total_amt_usd
   LIMIT 3457) AS Table1
ORDER BY total_amt_usd DESC
LIMIT 2;

I am completely wrong here - what happened??? 

Since there are 6912 orders - we want the average of the 3457 and 3456 order amounts when ordered. This is the average of 2483.16 and 2482.55. This gives the median of 2482.855. This obviously isn't an ideal way to compute. If we obtain new orders, we would have to change the limit. SQL didn't even calculate the median for us. The above used a SUBQUERY, but you could use any method to find the two necessary values, and then you just need the average of them.

This seems to have the concept of Number Summary I learned in the previous lesson, except I would be dealing with high number of counts. 
This entry was posted in Study Notes and tagged . Bookmark the permalink.

Leave a Reply