3. Anatomy of a SQL Query - GROUP BY to Subqueries

When working with databases, SQL queries often need to aggregate, join, or filter large datasets, and this is where clauses like GROUP BY, HAVING, and JOIN become essential. Additionally, more advanced techniques like subqueries allow us to create more complex and efficient data retrieval structures. In this document, we will explore these concepts in detail.


1. GROUP BY

The GROUP BY clause is used to group rows that have the same values into summary rows, often used with aggregation functions such as COUNT(), SUM(), AVG(), MIN(), and MAX().

Example Use Case:

Imagine you want to find out how many orders each user has made, what their total spending is, the average price they paid, and the dates of their first and last orders. This can be achieved using aggregation functions:

SELECT user_id, COUNT(*) AS num_orders, SUM(price) AS total_spent,
       AVG(price) AS avg_price, MIN(order_date) AS first_order,
       MAX(order_date) AS last_order
FROM orders
GROUP BY user_id;

Filtering Grouped Data: HAVING

The HAVING clause is used to filter results after they have been grouped. For example, if you want to find only users who have made more than five orders, you can use:

SELECT user_id, COUNT(*) AS num_orders
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;

This filters out users who have fewer than five orders.

Exercises:


2. JOIN

SQL JOIN operations are used to combine data from multiple tables based on related columns. There are several types of joins:

a. INNER JOIN

The INNER JOIN returns only the rows where there is a match between the tables. If a row in either table doesn't have a matching row in the other table, it will not be included in the result.

SELECT users.email, orders.order_id
FROM users
INNER JOIN orders ON users.user_id = orders.user_id;

This query returns only users who have placed at least one order.

b. LEFT JOIN

The LEFT JOIN returns all rows from the left table and the matched rows from the right table. If there is no match, the result is NULL on the side of the right table.

SELECT users.email, orders.order_id
FROM users
LEFT JOIN orders ON users.user_id = orders.user_id;

This query returns all users, whether they have placed an order or not. If a user has no orders, the order_id field will be NULL.

c. FULL JOIN

The FULL JOIN returns all rows when there is a match in either table. Rows that do not have a match in one of the tables will still appear, but with NULL values in the columns from the non-matching table.

Multiple Joins

You can combine multiple tables by chaining JOIN clauses. For example, to get the number of orders per category for each user:

SELECT users.email, categories.name, COUNT(*) AS num_orders
FROM users
INNER JOIN orders ON users.user_id = orders.user_id
INNER JOIN categories ON orders.category_id = categories.category_id
GROUP BY users.email, categories.name;

This query returns the count of orders placed by each user for each category.

Exercises:


3. Subqueries

A subquery is a query nested inside another query. Subqueries can be placed in various parts of a SQL statement, such as in the SELECT, FROM, WHERE, or HAVING clauses. Subqueries are particularly useful when you need to perform complex filtering or when you're dealing with large datasets.

Example: Subquery in the FROM Clause

SELECT user_id, name, email
FROM (SELECT * FROM users WHERE name = 'Bernardo') AS filtered_users;

This query first selects all users named "Bernardo" in the subquery and then retrieves the user_id, name, and email from that subset of data.

Subquery in the WHERE Clause

SELECT user_id, name
FROM users
WHERE user_id IN (SELECT user_id FROM orders WHERE order_date > '2023-01-01');

Here, the subquery retrieves all user_ids from the orders table where the order date is after January 1, 2023. The main query then retrieves the names and user IDs of users who match those IDs.

Subquery in the HAVING Clause

SELECT category_id, COUNT(*)
FROM orders
GROUP BY category_id
HAVING COUNT(*) > (SELECT AVG(order_count) FROM (SELECT COUNT(*) AS order_count FROM orders GROUP BY user_id));

In this case, the subquery calculates the average number of orders per user, and the main query returns categories where the number of orders is higher than this average.


Conclusion

In this document, we've explored more advanced SQL topics such as GROUP BY, HAVING, different types of JOINs, and the power of subqueries. These techniques allow you to handle more complex data structures, filter aggregated data, and join multiple tables to extract valuable insights.

These skills are essential for making sense of large databases, improving data analysis, and ultimately making more informed decisions based on the data at your disposal.