3. Anatomy of a SQL Query - GROUP BY to Subqueries
When working with databases, SQL queries often need to aggregate, join, or filter large datasets, and this is where clauses like GROUP BY
, HAVING
, and JOIN
become essential. Additionally, more advanced techniques like subqueries allow us to create more complex and efficient data retrieval structures. In this document, we will explore these concepts in detail.
1. GROUP BY
The GROUP BY
clause is used to group rows that have the same values into summary rows, often used with aggregation functions such as COUNT()
, SUM()
, AVG()
, MIN()
, and MAX()
.
Example Use Case:
Imagine you want to find out how many orders each user has made, what their total spending is, the average price they paid, and the dates of their first and last orders. This can be achieved using aggregation functions:
SELECT user_id, COUNT(*) AS num_orders, SUM(price) AS total_spent,
AVG(price) AS avg_price, MIN(order_date) AS first_order,
MAX(order_date) AS last_order
FROM orders
GROUP BY user_id;
COUNT(*)
: Counts the number of orders.SUM(price)
: Adds up the total spent by each user.AVG(price)
: Calculates the average price of their orders.MIN(order_date)
: Finds the date of their first order.MAX(order_date)
: Finds the date of their most recent order.
Filtering Grouped Data: HAVING
The HAVING
clause is used to filter results after they have been grouped. For example, if you want to find only users who have made more than five orders, you can use:
SELECT user_id, COUNT(*) AS num_orders
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;
This filters out users who have fewer than five orders.
Exercises:
- Find how many users exist for each name in the database.
- For each order status, find how many orders exist in the system.
- What is the average revenue per order (expressed in a
request_revenue
column)?
2. JOIN
SQL JOIN
operations are used to combine data from multiple tables based on related columns. There are several types of joins:
a. INNER JOIN
The INNER JOIN
returns only the rows where there is a match between the tables. If a row in either table doesn't have a matching row in the other table, it will not be included in the result.
SELECT users.email, orders.order_id
FROM users
INNER JOIN orders ON users.user_id = orders.user_id;
This query returns only users who have placed at least one order.
b. LEFT JOIN
The LEFT JOIN
returns all rows from the left table and the matched rows from the right table. If there is no match, the result is NULL
on the side of the right table.
SELECT users.email, orders.order_id
FROM users
LEFT JOIN orders ON users.user_id = orders.user_id;
This query returns all users, whether they have placed an order or not. If a user has no orders, the order_id
field will be NULL
.
c. FULL JOIN
The FULL JOIN
returns all rows when there is a match in either table. Rows that do not have a match in one of the tables will still appear, but with NULL
values in the columns from the non-matching table.
Multiple Joins
You can combine multiple tables by chaining JOIN
clauses. For example, to get the number of orders per category for each user:
SELECT users.email, categories.name, COUNT(*) AS num_orders
FROM users
INNER JOIN orders ON users.user_id = orders.user_id
INNER JOIN categories ON orders.category_id = categories.category_id
GROUP BY users.email, categories.name;
This query returns the count of orders placed by each user for each category.
Exercises:
- How many orders have been made in each category?
- How many customers registered but never placed an order?
3. Subqueries
A subquery is a query nested inside another query. Subqueries can be placed in various parts of a SQL statement, such as in the SELECT
, FROM
, WHERE
, or HAVING
clauses. Subqueries are particularly useful when you need to perform complex filtering or when you're dealing with large datasets.
Example: Subquery in the FROM
Clause
SELECT user_id, name, email
FROM (SELECT * FROM users WHERE name = 'Bernardo') AS filtered_users;
This query first selects all users named "Bernardo" in the subquery and then retrieves the user_id
, name
, and email
from that subset of data.
Subquery in the WHERE
Clause
SELECT user_id, name
FROM users
WHERE user_id IN (SELECT user_id FROM orders WHERE order_date > '2023-01-01');
Here, the subquery retrieves all user_id
s from the orders
table where the order date is after January 1, 2023. The main query then retrieves the names and user IDs of users who match those IDs.
Subquery in the HAVING
Clause
SELECT category_id, COUNT(*)
FROM orders
GROUP BY category_id
HAVING COUNT(*) > (SELECT AVG(order_count) FROM (SELECT COUNT(*) AS order_count FROM orders GROUP BY user_id));
In this case, the subquery calculates the average number of orders per user, and the main query returns categories where the number of orders is higher than this average.
Conclusion
In this document, we've explored more advanced SQL topics such as GROUP BY
, HAVING
, different types of JOIN
s, and the power of subqueries. These techniques allow you to handle more complex data structures, filter aggregated data, and join multiple tables to extract valuable insights.
These skills are essential for making sense of large databases, improving data analysis, and ultimately making more informed decisions based on the data at your disposal.