Study Notes #17

LEFT and RIGHT JOINs

INNER JOIN – only returns rows that appear in both tables.

SELECT a.id, a.name, o.total
FROM orders o
JOIN accounts a
ON o.account_id = a.id

I learned here that the result table will only show data of those rows that appear in all tables, those that don’t appear in all tables will not be included. i.e. for accounts that don’t have orders yet will not appear, because they don’t have total yet.

The previous JOIN I learn is not order-specific. If I am to use LEFT or RIGHT, the query becomes order specific. The FROM is the LEFT table, and the JOIN is the right table.

Query1

SELECT a.id, a.name, o.total
FROM orders o
LEFT JOIN accounts a
ON o.account_id = a.id

Query2

SELECT a.id, a.name, o.total
FROM orders o
RIGHT JOIN accounts a
ON o.account_id = a.id

INNER JOIN – we have always pulled rows only if they exist as a match across two tables.

A LEFT JOIN and RIGHT JOIN do the same thing if we change the tables that are in the FROM and JOIN statements.

A LEFT JOIN will at least return all the rows that are in an INNER JOIN.

JOIN and INNER JOIN are the same.

A LEFT OUTER JOIN is the same as LEFT JOIN.

Quick Note

You might see the SQL syntax of

LEFT OUTER JOIN

OR

RIGHT OUTER JOIN

These are the exact same commands as the LEFT JOIN and RIGHT JOIN we learned.

OUTER JOINS

The last type of join is a full outer join. This will return the inner join result set, as well as any unmatched rows from either of the two tables being joined.

Again this returns rows that do not match one another from the two tables. The use cases for a full outer join are very rare.

You can see examples of outer joins at the link here and a description of the rare use cases here. We will not spend time on these given the few instances you might need to use them.

Similar to the above, you might see the language FULL OUTER JOIN, which is the same as OUTER JOIN.

Quiz

SELECT c.countryid, c.countryName, s.stateName
FROM Country c
JOIN State s
ON c.countryid = s.countryid;

Since this is a JOIN (INNER JOIN technically), we only get rows that show up in both tables. Therefore our resulting table will essentially look like the right table with the countryName pulled in as a column. Since 1, 2, 3, and 4 are countryids in both tables, this information will be pulled together. The countryids of 5 and 6 only show up in the Country table. Therefore, these will be dropped.

SELECT c.countryid, c.countryName, s.stateName
FROM Country c
LEFT JOIN State s
ON c.countryid = s.countryid;

We have a column for each of the identified elements in our SELECT statement. We will have all of the same rows as in a JOIN statement, but we also will obtain the additional two rows in the Country table that are not in the State table for Sri Lanka and Brazil.

Quick Note – with LEFT or RIGHT query the opposite table determines which table would include all rows regardless of whether there are existing data on the other tables. In this Quiz, the LEFT is set to the STATE, so the opposite is country, hence it includes all rows from the country table.

LEFT and RIGHT JOIN Solutions

This section is a walkthrough of those final two problems in the previous concept. First, another look at the two tables we are working with:

Table of Country and State

Tables of Country & State

INNER JOIN Question

The questions are aimed to assure you have a conceptual idea of what is happening with LEFT and INNER JOINs before you need to use them for more difficult problems.

For an INNER JOIN like the one here:

SELECT c.countryid, c.countryName, s.stateName
FROM Country c
JOIN State s
ON c.countryid = s.countryid;

We are essentially JOINing the matching PKFK links from the two tables, as shown in the below image.

Diagram of cells selected with inner join

Cells Selected with Inner Join

The resulting table will look like:

countryidcountryNamestateName
1IndiaMaharashtra
1IndiaPunjab
2NepalKathmandu
3United StatesCalifornia
3United StatesTexas
4CanadaAlberta

LEFT JOIN Question

The questions are aimed to assure you have a conceptual idea of what is happening with LEFT and INNER JOINs before you need to use them for more difficult problems.

For a LEFT JOIN like the one here:

SELECT c.countryid, c.countryName, s.stateName
FROM Country c
LEFT JOIN State s
ON c.countryid = s.countryid;

We are essentially JOINing the matching PKFK links from the two tables, as we did before, but we are also pulling all the additional rows from the Country table even if they don’t have a match in the State table. Therefore, we obtain all the rows of the INNER JOIN, but we also get additional rows from the table in the FROM.

Diagram of cells selected with left join

Cells Selected with Left Join

The resulting table will look like:

countryidcountryNamestateName
1IndiaMaharashtra
1IndiaPunjab
2NepalKathmandu
3United StatesCalifornia
3United StatesTexas
4CanadaAlberta
5Sri LankaNULL
6BrazilNULL

FINAL LEFT JOIN Note

If we were to flip the tables, we would actually obtain the same exact result as the JOIN statement:

SELECT c.countryid, c.countryName, s.stateName
FROM State s
LEFT JOIN Country c
ON c.countryid = s.countryid;

This is because if State is on the LEFT table, all of the rows exist in the RIGHT table again.

Diagram of cells selected with left join

Cells Selected with Left Join

The resulting table will look like:

countryidcountryNamestateName
1IndiaMaharashtra
1IndiaPunjab
2NepalKathmandu
3United StatesCalifornia
3United StatesTexas
4CanadaAlberta

Quick Note – I just realized that the concept of the LEFT and RIGHT JOINs is similar with the FILTER tool of Excel.

Logic in the ON clause reduces the rows before combining the tables.

Logic in the WHERE clause occurs after the JOIN occurs.

A simple rule to remember is that, when the database executes this query, it executes the join and everything in the ON clause first. Think of this as building the new result set. That result set is then filtered using the WHERE clause.

The fact that this example is a left join is important. Because inner joins only return the rows for which the two tables match, moving this filter to the ON clause of an inner join will produce the same result as keeping it in the WHERE clause.

Code

Query1

SELECT orders.*, accounts.*
FROM orders
LEFT JOIN accounts
ON orders.account_id = accounts.id 
WHERE accounts.sales_rep_id = 321500

Query2

SELECT orders.*, accounts.*
FROM orders
LEFT JOIN accounts
ON orders.account_id = accounts.id 
AND accounts.sales_rep_id = 321500

Quiz

  1. Provide a table that provides the region for each sales_rep along with their associated accounts. This time only for the Midwest region. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to the account name.
SELECT s.name AS SalesRepName, a.name AS AccountsName, r.name AS RegionName
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id
WHERE r.name = 'Midwest'
ORDER BY a.name

Answer Key:

SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
WHERE r.name = 'Midwest'
ORDER BY a.name;
  1. Provide a table that provides the region for each sales_rep along with their associated accounts. This time only for accounts where the sales rep has a first name starting with S and in the Midwest region. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to the account name.
SELECT s.name AS SalesRepName, a.name AS AccountsName, r.name AS RegionName
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id
WHERE r.name = 'Midwest' AND s.name LIKE 'S%'
ORDER BY a.name;

Answer Key: 
SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
WHERE r.name = 'Midwest' AND s.name LIKE 'S%'
ORDER BY a.name;
  1. Provide a table that provides the region for each sales_rep along with their associated accounts. This time only for accounts where the sales rep has a last name starting with K and in the Midwest region. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to the account name.
SELECT s.name AS SalesRepName, a.name AS AccountsName, r.name AS RegionName
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id
WHERE r.name = 'Midwest' AND s.name LIKE '% K%'
ORDER BY a.name;

Note: First and Last Name on the same cell, to target the second word (Last Name), used the string '% K%' - this means the second word in the cell starts with K. 

Answer Key: 
SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
WHERE r.name = 'Midwest' AND s.name LIKE '% K%'
ORDER BY a.name;
  1. Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. However, you should only provide the results if the standard order quantity exceeds 100. Your final table should have 3 columns: region nameaccount name, and unit price. In order to avoid a division by zero error, adding .01 to the denominator here is helpful total_amt_usd/(total+0.01).
SELECT r.name AS RegionName, a.name AS AccountName, o.total_amt_usd/(o.total + 0.01) UnitPrice
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN orders o
ON a.id = o.account_id
WHERE o.standard_qty <100;

Answer Key: 

SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
WHERE o.standard_qty > 100;

What happened here? This is much more complicated… o.total_amt_usd/(o.total + 0.01) UnitPrice was added to SELECT because I was looking for the Unit Price, while it’s not available in the tables provided. This is like working on Excel, adding a column with a formula. Here the formula to get the unit price is order.total_amt_usd/order.total but had to add +0.01 in order to avoid division by 0. (But this will return an inaccurate data, isn’t it?) Then another query for WHERE I learned here, < > or maybe I can also use =.

  1. Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. However, you should only provide the results if the standard order quantity exceeds 100 and the poster order quantity exceeds 50. Your final table should have 3 columns: region nameaccount name, and unit price. Sort for the smallest unit price first. In order to avoid a division by zero error, adding .01 to the denominator here is helpful (total_amt_usd/(total+0.01).
SELECT r.name AS RegionName, a.name AS AccountName, o.total_amt_usd/(o.total + 0.01) UnitPrice
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN orders o
ON a.id = o.account_id
WHERE o.standard_qty <100 AND o.poster_qty <50
ORDER BY UnitPrice ASC;

Answer Key: 

SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
WHERE o.standard_qty > 100 AND o.poster_qty > 50
ORDER BY unit_price;
  1. Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. However, you should only provide the results if the standard order quantity exceeds 100 and the poster order quantity exceeds 50. Your final table should have 3 columns: region nameaccount name, and unit price. Sort for the largest unit price first. In order to avoid a division by zero error, adding .01 to the denominator here is helpful (total_amt_usd/(total+0.01).
SELECT r.name AS RegionName, a.name AS AccountName, o.total_amt_usd/(o.total + 0.01) UnitPrice
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN orders o
ON a.id = o.account_id
WHERE o.standard_qty <100 AND o.poster_qty <50
ORDER BY UnitPrice DESC;

Answer Key: 
SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
WHERE o.standard_qty > 100 AND o.poster_qty > 50
ORDER BY unit_price DESC;
  1. What are the different channels used by account id 1001? Your final table should have only 2 columns: account name and the different channels. You can try SELECT DISTINCT to narrow down the results to only the unique values.
SELECT DISTINCT a.name, w.channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE a.id = '1001';

Answer Key: 

SELECT DISTINCT a.name, w.channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE a.id = '1001';

New query I learned here – SELECT DISTINCT this will return unique results per row. In the quiz I worked on, if I don’t add DISTINCT the data will return all the rows that match my query resulting duplicates, while if I add DISTINCT the results will return unique rows, removing all the duplicates.

  1. Find all the orders that occurred in 2015. Your final table should have 4 columns: occurred_ataccount nameorder total, and order total_amt_usd.
SELECT o.occurred_at, a.name, o.total, o.total_amt_usd
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN orders o
ON a.id = o.account_id
WHERE o.occurred_at BETWEEN '01-01-2015' AND '01-01-2016'
ORDER BY o.occurred_at DESC; {I added this}

Answer Key: 

SELECT o.occurred_at, a.name, o.total, o.total_amt_usd
FROM accounts a
JOIN orders o
ON o.account_id = a.id
WHERE o.occurred_at BETWEEN '01-01-2015' AND '01-01-2016'
ORDER BY o.occurred_at DESC;

Quick note, when specifying dates BETWEEN start with the start date, and the date after the end date.

BETWEEN operator


Lesson Overview

In this lesson you learned to :

  • Create Joins Using Primary – Foreign Keys
  • Evaluate Various Types of Joins
  • Integrate Aliasing and Filters with Joins

Recap

You learned a key element for JOINing tables in a database has to do with primary and foreign keys. Choosing the set up of data in our database is very important, but not usually the job of a data analyst. This process is known as Database Normalization.

You learned how to combine data from multiple tables using JOINs. The three JOIN statements you are most likely to use are: JOIN, LEFT JOIN, RIGHT JOIN.

There are a few more advanced JOINs that we did not cover here, and they are used in very specific use cases. UNION and UNION ALLCROSS JOIN, and the tricky SELF JOIN. These are more advanced than this course will cover, but it is useful to be aware that they exist, as they are useful in special cases.

You also learned that you can alias tables and columns using AS or not using it. This allows you to be more efficient in the number of characters you need to write, while at the same time you can assure that your column headings are informative of the data in your table.

KeyTermDefinition
Foreign Key (FK)is a column in one table that is a primary key in a different table
JOINis an INNER JOIN that only pulls data that exists in both tables.
LEFT JOINis a JOIN that pulls all the data that exists in both tables, as well as all of the rows from the table in the FROM even if they do not exist in the JOIN statement.
Partition byA subclause of the OVER clause. Similar to GROUP BY.
Primary Key (PK)is a unique column in a particular table
RIGHT JOINis a JOIN pulls all the data that exists in both tables, as well as all of the rows from the table in the JOIN even if they do not exist in the FROM statement.

Looking Ahead

The next lesson is aimed at aggregating data. You have already learned a ton, but SQL might still feel a bit disconnected from statistics and using Excel-like platforms. Aggregations will allow you to write SQL code that will allow for more complex queries, which assist in answering questions like:

  • Which channel generated more revenue?
  • Which account had an order with the most items?
  • Which sales_rep had the most orders? or least orders? How many orders did they have?
This entry was posted in Study Notes and tagged . Bookmark the permalink.

Leave a Reply