Subqueries & CTEs

Subqueries are, as the name suggests, queries nested within other queries and they can have several uses when performing data investigations. Subqueries may or may not rely on data coming from the outer query in order to perform their job. Let’s look at a few usage patterns.

Subquery with IN operator

We can use a subquery in combination with the IN operator to restrict our inputs: for example, to only select those invoices that have been issued in the Netherlands:

SELECT *
FROM learnsql.invoices
WHERE country_id IN (
    SELECT id
    FROM learnsql.countries
    WHERE name = 'Netherlands'
);

When using it in this way, we have to be careful to return exactly one column in the subquery, containing a value that is appropriate for the column involved with the IN operator (country_id, in this case): if not, we will receive an error from the database.

Of course it’s also possible to negate the condition, using NOT IN, but this should be done sparingly since it typically leads to poor performance.

Subquery using EXISTS operator

Subqueries are also commonly used in combination with the EXISTS operator, which returns true if the subquery returns at least one row, or false otherwise. For example, this is how we can get only those invoices that have invoiced both at least one service (e.g. an online course) and one subscription:

SELECT *
FROM invoices i
WHERE EXISTS (
    SELECT 1
    FROM learnsql.sales_invoice_lines ail
    JOIN learnsql.subscription_invoice_lines uil ON ail.invoice_id = uil.invoice_id
    WHERE ail.invoice_id = i.id
);

You might be wondering what SELECT 1 means: well, given that EXISTS simply tests the existence of results in the subquery, we can put whatever value we want there and it will work just the same; 1 is just an arbitrary value, we could also use any other value such as 42, ’a’, true, or simply reference any of the columns belonging to the tables used in the subquery.

As you might have noticed, the above subquery relies on its outer query in its WHERE clause: this allows us to filter records returned by the subquery using the value of i.id.

Subquery as temporary table

Let’s assume that we would like to run a targeted marketing campaign for some of our customers, specifically those who have been issued at least 10 invoices so far (let’s call them our “main customers”): unless we use a subquery, we are able to identify such customers using an aggregation, but then we won’t have access to their data.

Using the result of a subquery as temporary table, we can do something like this:

SELECT
c.*,
main_customers.invoice_count
FROM learnsql.customers c
JOIN (
    SELECT c.id, count(*) AS invoice_count
    FROM learnsql.invoices i
    JOIN learnsql.customers c ON i.customer_id = c.id
    GROUP BY c.id
    HAVING count(*) > 10
) main_customers ON main_customers.id = c.id
ORDER BY invoice_count DESC;

Common Table Expressions

A Common Table Expression (CTE) is essentially a named temporary result set that we can use when selecting, inserting, updating, or even deleting data: to some extent, it can be considered as an alternative syntax to subqueries. Let’s rewrite the above subqueries using CTEs, starting from the first one:

WITH my_countries AS (
    SELECT id
    FROM learnsql.countries
    WHERE name = 'Netherlands'
)
SELECT *
FROM learnsql.invoices i
JOIN my_countries mc ON mc.id = i.country_id;

New syntax: WITH <name> AS ( <query> ) lets us assign a name to the result of a query, so that we can then use it in our main query.

Let’s now rewrite our third subquery, the one used as temporary table:

WITH main_customers AS (
    SELECT c.id, count(*) AS invoice_count
    FROM learnsql.invoices i
    JOIN learnsql.customers c ON i.customer_id = c.id
    GROUP BY c.id
    HAVING count(*) > 10
)
SELECT
c.*,
main_customers.invoice_count
FROM learnsql.customers c
JOIN main_customers ON main_customers.id = c.id
ORDER BY invoice_count DESC;

In this case, what we did was literally just move the subquery definition to appear before the main query, which (arguably) makes everything more readable, since there is now a clear separation between the two.

Note: it is possible, if necessary, to define multiple temporary result sets in a single WITH clause, writing them as a comma-separated list:

WITH
resultset_1 AS ( … ),
resultset_2 AS ( … ),
…,
resultset_N AS ( … )
SELECT …

Practice

Are you ready to practice what you’ve just learned? Head over to the Notebook and try to solve the exercises you find there!

Note: the solution to each exercise is in the collapsed block right below it: you can expand it by clicking the arrow icon that appears on the far left when you hover it.