# Study Notes #18

# Aggregation

- no data / different from zero or space
- a datatype that specifies where no data exists in SQL. They are often ignored in our aggregation functions, which you will get a first look at in the next concept using
**COUNT**. - is not a value, but a property of a data (use IS query, instead of =)
- are different than a zero – they are cells where data does not exist.
- When identifying
**NULL**s in a**WHERE**clause, we write**IS NULL**or**IS NOT NULL**. We don’t use`=`

, because**NULL**isn’t considered a value in SQL. Rather, it is a property of the data.

## NULLs – Expert Tip

There are two common ways in which you are likely to encounter **NULL**s:

**NULL**s frequently occur when performing a**LEFT**or**RIGHT JOIN**. You saw in the last lesson – when some rows in the left table of a left join are not matched with rows in the right table, those rows will contain some**NULL**values in the result set.**NULL**s can also occur from simply missing data in our database

### Query 1

**SELECT** *
**FROM** accounts
**WHERE** **id** > 1500 **and** **id** < 1600

### Query 2

**SELECT** *
**FROM** accounts
**WHERE** primary_poc = NULL

### Query 3

**SELECT** *
**FROM** accounts
**WHERE** primary_poc **IS** **NOT** NULL

## First Aggregation – COUNT

You can see that this returns a single row of output in a single column,PauseMute

Loaded: 100.00%

Remaining Time -0:30

1xPlayback RateCaptionsPicture-in-PictureFullscreen

### COUNT the Number of Rows in a Table

Try your hand at finding the number of rows in each table. Here is an example of finding all the rows in the **accounts** table.

**SELECT** **COUNT**(*)
**FROM** accounts;

But we could have just as easily chosen a column to drop into the aggregation function:

**SELECT** **COUNT**(accounts.**id**)
**FROM** accounts;

These two statements are equivalent, but this isn’t always the case, which we will see in the next video.

## Code from the Video

### Query 1

**SELECT** *
**FROM** orders
**WHERE** occurred_at >= '2016-12-01'
**AND** occurred_at < '2017-01-01'

### Query 2

**SELECT** **COUNT**(*)
**FROM** orders
**WHERE** occurred_at >= '2016-12-01'
**AND** occurred_at < '2017-01-01'

### Query 3

**SELECT** **COUNT**(*) **AS** order_count
**FROM** orders
**WHERE** occurred_at >= '2016-12-01'
**AND** occurred_at < '2017-01-01'

## First Aggregation – COUNT

So, the result produced by aPauseMute

0%

Loaded: 100.00%

Remaining Time -0:05

1xPlayback RateCaptionsPicture-in-PictureFullscreen

### COUNT the Number of Rows in a Table

Try your hand at finding the number of rows in each table. Here is an example of finding all the rows in the **accounts** table.

**SELECT** **COUNT**(*)
**FROM** accounts;

But we could have just as easily chosen a column to drop into the aggregation function:

**SELECT** **COUNT**(accounts.**id**)
**FROM** accounts;

These two statements are equivalent, but this isn’t always the case, which we will see in the next video.

## Code from the Video

### Query 1

**SELECT** *
**FROM** orders
**WHERE** occurred_at >= '2016-12-01'
**AND** occurred_at < '2017-01-01'

### Query 2

**SELECT** **COUNT**(*)
**FROM** orders
**WHERE** occurred_at >= '2016-12-01'
**AND** occurred_at < '2017-01-01'

### Query 3

**SELECT** **COUNT**(*) **AS** order_count
**FROM** orders
**WHERE** occurred_at >= '2016-12-01'
**AND** occurred_at < '2017-01-01'

## Count

- can help us identify the number of null values in any particular column
**if the count result of a column matches the number of rows in a table, there are no nulls in the column**- If the count result of a column is less than the number of rows in the table, we know the difference is the number of nulls
- we can use the COUNT function in any column

### COUNT the Number of Rows in a Table

Try your hand at finding the number of rows in each table. Here is an example of finding all the rows in the **accounts** table.

**SELECT** **COUNT**(*)
**FROM** accounts;

But we could have just as easily chosen a column to drop into the aggregation function:

**SELECT** **COUNT**(accounts.**id**)
**FROM** accounts;

### Query 1

**SELECT** *
**FROM** orders
**WHERE** occurred_at >= '2016-12-01'
**AND** occurred_at < '2017-01-01'

### Query 2

**SELECT** **COUNT**(*)
**FROM** orders
**WHERE** occurred_at >= '2016-12-01'
**AND** occurred_at < '2017-01-01'

### Query 3

**SELECT** **COUNT**(*) **AS** order_count
**FROM** orders
**WHERE** occurred_at >= '2016-12-01'
**AND** occurred_at < '2017-01-01'

### Query 1

**SELECT** **COUNT** (*) **AS** account_count
**FROM** accounts
Counts all the number of rows, regardless of NULL

### Query 2

**SELECT** **COUNT** (**id**) **AS** account_id_count
**FROM** accounts
Counts the number of records in one column, regardless of null.

### Query 3

**SELECT** **COUNT**(primary_poc) **AS** account_primary_poc_count
**FROM** accounts
Counts the number of records in one column, regardless of null.

### Query 4

Example: there will not be any NULL values in the workspace.

**SELECT** *
**FROM** accounts
**WHERE** primary_poc **IS** NULL
Counts the number of NULLs of specific column in a table.

## SUM

- can only be used for columns that have quantitative data, or numerical values
`SUM(id), ie SUMS(standard_qty)`

- SUM treats NULL as
`0`

**aggregators only aggregate vertically – the values of a column**. If you want to perform a calculation across rows, you would do this with simple arithmetic.

**SELECT** **SUM**(standard_qty) **AS** standard,
**SUM**(gloss_qty) **AS** gloss,
**SUM**(poster_qty) **AS** poster
**FROM** orders
This displays 3 columns - "standard", "gloss" and "poster". It returns the sum of standard_qty, gloss_qty and poster_qty from the orders table.

## Quiz

- Find the total amount of
**poster_qty**paper ordered in the**orders**table.

```
SELECT SUM(poster_qty) AS TotalPoster
FROM orders;
Answer: 723646
```**Answer Key:
SELECT SUM(poster_qty) AS total_poster_sales
FROM orders;**
*I am correct. *

- Find the total amount of
**standard_qty**paper ordered in the**orders**table.

```
SELECT SUM(standard_qty) AS TotalStandard
FROM orders;
Answer: 1938346
```**Answer Key:
SELECT SUM(standard_qty) AS total_standard_sales
FROM orders;**
*I am correct. *

- Find the total dollar amount of sales using the
**total_amt_usd**in the**orders**table.

```
SELECT SUM(total_amt_usd) AS TotalAmount
FROM orders;
Answer: 23141511.83
```**Answer Key:
SELECT SUM(total_amt_usd) AS total_dollar_sales
FROM orders;**
*I am correct. *

- Find the total amount spent on
**standard_amt_usd**and**gloss_amt_usd**paper for each order in the orders table. This should give a dollar amount for each order in the table.

```
SELECT orders.id, SUM(standard_amt_usd) AS TotalAmount, SUM(gloss_amt_usd) AS TotalGlossAmount
FROM orders
GROUP BY orders.id
Note: For this kind of scenario, the column (orders.id) we're looking for the total per (total amount per order) must be in GROUP BY query
```**Answer Key:
SELECT standard_amt_usd + gloss_amt_usd AS total_standard_gloss
FROM orders;**
*I need to be careful in reading the problem. "the total amount spent on standard_amt_usd and gloss_amt_usd for each" I made a table with individual calculations, when it should have been the total for both columns. Still good learning on GROUP BY*

- Find the
**standard_amt_usd**per unit of**standard_qty**paper. Your solution should use both aggregation and a mathematical operator.

```
SELECT SUM(standard_amt_usd)/SUM(standard_qty) AS CostPerUnit
FROM orders
This is just my hard attempt.
Answer: 4.9900000000000000
```**Answer Key:
SELECT SUM(standard_amt_usd)/SUM(standard_qty) AS standard_price_per_unit
FROM orders;
**
*I am correct. *

## SQL MIN & MAX

- similar with COUNT
- ignore NULLs
- can be used on non-numerical columns (i.e.
**MIN**will return the lowest number, earliest date, or non-numerical value as early in the alphabet as possible)

```
SELECT MIN(standard_qty) AS standard_min,
MIN(gloss_qty) AS gloss_min,
MIN(poster_qty) AS poster_min,
MAX(standard_qty) AS standard_max,
MAX(gloss_qty) AS gloss_max,
MAX(poster_qty) AS poster_max
FROM orders
This returns a table with multiple columns, based on SELECT. Then displays either MIN or MAX per column.
```

## SQL AVG

- can only be used on numerical values
- ignores NULL completely
- returns the mean of the data
- SUM/COUNT (If you want to treat NULLs as 0)

### MEDIAN – Expert Tip

One quick note that a median might be a more appropriate measure of center for this data, but finding the median happens to be a pretty difficult thing to get using SQL alone — so difficult that finding a median is occasionally asked as an interview question.

**SELECT** **AVG**(standard_qty) **AS** standard_avg,
**AVG**(gloss_qty) **AS** gloss_avg,
**AVG**(poster_qty) **AS** poster_avg
**FROM** orders
This returns a table with 3 columns, showing the AVERAGE for each

## Quiz

- When was the earliest order ever placed? You only need to return the date.

```
SELECT MIN(orders.occurred_at)
FROM orders
```**Answer Key:
SELECT MIN(occurred_at)
FROM orders;
I am correct, though it seems like the table doesn't need to be added on the SELECT query.**

- Try performing the same query as in question 1 without using an aggregation function.

```
SELECT orders.occurred_at
FROM orders
ORDER BY orders.occurred_at ASC
This is my hard attempt, it shows the entire column, but I could see the MIN.
```**Answer Key:
SELECT occurred_at
FROM orders
ORDER BY occurred_at
LIMIT 1;
I am almost correct, but I learned that I could use LIMIT query to only see one entry.**

- When did the most recent (latest)
**web_event**occur?

```
SELECT max(web_events.occurred_at)
FROM web_events
Answer: 2017-01-01T23:51:09.000Z
```**Answer Key:
SELECT MAX(occurred_at)
FROM web_events;**
I have to ask this - do I need to put ; in the end?

- Try to perform the result of the previous query without using an aggregation function.

```
SELECT web_events.occurred_at
FROM web_events
ORDER BY web_events.occurred_at DESC
```**Answer Key:
SELECT occurred_at
FROM web_events
ORDER BY occurred_at DESC
LIMIT 1;
Almost correct again, if I knew about the LIMIT**

- Find the mean (
**AVERAGE**) amount spent per order on each paper type, as well as the mean amount of each paper type purchased per order. Your final answer should have 6 values – one for each paper type for the average number of sales, as well as the average amount.

```
SELECT AVG(orders.standard_amt_usd) AS AverageStandardUSD,
AVG(orders.gloss_amt_usd) AS AverageGlossUSD,
AVG(orders.poster_amt_usd) AS AveragePosterUSD,
AVG(orders.standard_qty) AS AverageStandard,
AVG(orders.gloss_qty) AS AverageGloss,
AVG(orders.poster_qty) AS AveragePoster
FROM orders
```**Answer Key:
SELECT AVG(standard_qty) mean_standard, AVG(gloss_qty) mean_gloss,
AVG(poster_qty) mean_poster, AVG(standard_amt_usd) mean_standard_usd,
AVG(gloss_amt_usd) mean_gloss_usd, AVG(poster_amt_usd) mean_poster_usd
FROM orders;
I am correct. **

- Via the video, you might be interested in how to calculate the MEDIAN. Though this is more advanced than what we have covered so far try finding – what is the MEDIAN
**total_usd**spent on all**orders**

```
SELECT MAX(orders.total_amt_usd) + MIN(orders.total_amt_usd)/2 AS Median
FROM orders
Answer: 232207.07000000000000000000
This is a hard attempt.
```**Answer Key:
SELECT *
FROM (SELECT total_amt_usd
FROM orders
ORDER BY total_amt_usd
LIMIT 3457) AS Table1
ORDER BY total_amt_usd DESC
LIMIT 2;**
*I am completely wrong here - what happened???
*
Since there are 6912 orders - we want the average of the 3457 and 3456 order amounts when ordered. This is the average of 2483.16 and 2482.55. This gives the median of 2482.855. This obviously isn't an ideal way to compute. If we obtain new orders, we would have to change the limit. SQL didn't even calculate the median for us. The above used a SUBQUERY, but you could use any method to find the two necessary values, and then you just need the average of them.
*This seems to have the concept of Number Summary I learned in the previous lesson, except I would be dealing with high number of counts. *

## Leave a Reply