Postgres SQL Lessons From Advent of Code Challenges

2022-02-17


Note: This post was originally published on heap’s blog and was co-written with Amanda Murphy


We did something odd for Advent of Code this year: We solved a few challenges in javascript and then in PostgreSQL. We learned a few interesting things about SQL that we’d like to share here.

Disclaimer: We did not complete all 25 days in SQL (judging by the links from this HN thread, it looks like pretty much no one did), but we still think the things we learned about SQL are useful and worth sharing, especially for non-experts. If you’re an expert, you may not learn anything here.

Window function ranges are awesome

First, a quick refresher on window functions: you probably know that window functions perform calculations across a set of table rows that are somehow related to the current row. The example from the Postgres docs is calculating the department-specific average salary for a particular employee. The SQL looks like this:

SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;

In the above example, the “window” of rows considered are all rows whose department matches the department of the current row, but while working on the challenges, we learned that it’s also possible to define windows using ranges.

For example, on day 1, we’re asked to count the number of times the sum of values in a sliding window of three elements increases. You can express a sum of a sliding window of three elements using window function ranges. The SQL looks like this:

SELECT row_number, sum(value) OVER (ORDER BY row_number RANGE BETWEEN CURRENT ROW AND 2 FOLLOWING) AS value
         FROM input_with_rows

And the overall solution looks like this:

WITH input_with_rows AS (
    SELECT row_number() over () AS row_number, value
    FROM day1_input
),
     windows AS (
         SELECT row_number, sum(value) OVER (ORDER BY row_number RANGE BETWEEN CURRENT ROW AND 2 FOLLOWING) AS value
         FROM input_with_rows
     ),
     lagged AS (
         SELECT row_number, lag(value) OVER () AS value
         FROM windows
     )
SELECT count(*)
FROM windows
         JOIN lagged ON (windows.row_number = lagged.row_number AND lagged.value < windows.value);

Ranges are just the tip of the iceberg. There are many other methods of defining windows. Check out this page of the Postgres docs for more.

Common Table Expressions for readability and iteration

Common Table Expressions or CTE’s are temporary, scoped, and aliased tables created via a simple query expression. Subqueries can achieve the same result in most cases, and although subqueries are more common in textbooks and database documentation than CTEs, we found that using CTE’s instead of subqueries eliminated a great deal of nesting.

There are plenty of resources that explain why deeply nested code is a code smell. Nesting can quickly become difficult to reason about. Following the flow of execution adds additional cognitive load when the reader has to jump around to get context. CTEs are sequential, and it’s easier for us humans to follow sequential steps.

Another advantage of using CTEs is that they can refer to their own output if you use the RECURSIVE keyword. The docs recommend using recursive CTEs to work with arbitrarily nested hierarchical data, but they can also be used for iteration.

We learned how to use CTEs for iteration while working on day 3. In the second part, we’re asked to:

  1. Start with a list of binary numbers and consider just the first bit of those numbers.
  2. Discard numbers whose first bit does not match the most common bit in the first-bit position of all remaining numbers
  3. If you only have one number left, stop; this is the value you’re looking for
  4. Otherwise, repeat the process, considering the next bit to the right.

The SQL for this is below. The interesting part where we use a recursive CTE is lines 14 - 40:

\copy day3_input FROM 'day3.txt';


WITH bits_with_row_number AS (
    SELECT row_number() over ()            AS row_number,
           regexp_split_to_array(bits, '') AS bits
    FROM day3_input
),
     bits_with_column_number AS (
         SELECT *, row_number() over (partition by row_number) AS col_number
         FROM bits_with_row_number
     ),
     recurse AS (
         WITH RECURSIVE filter AS (
             (WITH target_bit AS (SELECT bits,
                                         CASE
                                             WHEN (sum(bits[1]::integer) OVER ()) < ((count(*) OVER ())::float / 2) THEN 0
                                             ELSE 1 END AS o2,
                                         CASE
                                             WHEN (sum(bits[1]::integer) OVER ()) < ((count(*) OVER ())::float / 2) THEN 1
                                             ELSE 0 END AS co2
                                  FROM bits_with_column_number)
              SELECT *, 1 AS i
              FROM target_bit
              WHERE bits[1]::integer = co2
             )
             UNION ALL
             (WITH target_bit AS (SELECT bits,
                                         i,
                                         CASE
                                             WHEN sum(bits[i + 1]::integer) OVER () < ((count(*) OVER ())::float / 2) THEN 0
                                             ELSE 1 END AS o2,
                                         CASE
                                             WHEN sum(bits[i + 1]::integer) OVER () < ((count(*) OVER ())::float / 2) THEN 1
                                             ELSE 0 END AS co2
                                  FROM filter)
              SELECT bits, o2, co2, i + 1 AS i
              FROM target_bit
              WHERE bits[i + 1]::integer = target_bit.co2)
         )
         SELECT *
         FROM filter
     )
SELECT *
FROM recurse
WHERE i = (SELECT max(i) FROM recurse);

There are plenty of good explanations of how recursive CTEs work out there, so we won’t repeat those explanations here.

How to think relationally instead of iteratively

On day 4, we’re given a list of numbers and a list of bingo boards and we’re asked to find which bingo board wins first. When we started writing the SQL portion, we assumed that we would need to use a recursive CTE so that we could iterate over the boards, but the exercise of solving the problem in SQL helped us see the problem through a new lens. We learned that just because there’s an iterative aspect in the problem statement doesn’t mean we can’t abstract it away.

Breaking down the problem into smaller pieces allowed us to see another way. We needed to find the first winning board, but what does that mean? Well, first, we need to think about how a board wins. A board wins when all the numbers in a column or row have been called. So, we need to find out when each column and row won.

Breaking it down in this way allowed us to see that it was a sorting problem. We already knew the order the numbers were called in. You can use the number order to assign a “time” to each column and row that represents when that column or row won. Then we just find the min of that time for each board. Once you have that, you know when the board won. Then you sort the boards by when they won and you have your answer!

Translating this to SQL looks like this:

WITH boards_ordered_with_call_order AS (
         SELECT *
         FROM boards_ordered
                  JOIN called_numbers_with_row
                       ON called_numbers_with_row.called_number = boards_ordered.value
     ),
     col_win_orders AS (
         SELECT board_no, col_number, max(call_order) AS max_col_call_order
         FROM boards_ordered_with_call_order
         GROUP BY board_no, col_number
     ),
     row_win_orders AS (
         SELECT board_no, row_number, max(call_order) AS max_row_call_order
         FROM boards_ordered_with_call_order
         GROUP BY board_no, row_number
     ),
     row_and_col_win_orders AS (
         SELECT *
         FROM col_win_orders
                  JOIN row_win_orders
                       USING (board_no)
         ORDER BY board_no, row_number, col_number
     ),
     winning_boards AS (
         SELECT board_no,
                min(LEAST(max_col_call_order, max_row_call_order)) AS winning_call_number
         FROM row_and_col_win_orders
         GROUP BY board_no
         ORDER BY 2
         LIMIT 1
     ),
     last_called_number AS (
         SELECT min(called_numbers_with_row.called_number) AS number
         FROM winning_boards
                  JOIN called_numbers_with_row ON (called_numbers_with_row.call_order = winning_call_number)
     )
SELECT board_no, sum(value::integer * called_numbers_with_row.called_number::integer)
FROM boards_ordered_with_call_order
         JOIN winning_boards USING (board_no)
         JOIN called_numbers_with_row ON called_numbers_with_row.call_order = winning_call_number
WHERE boards_ordered_with_call_order.call_order > winning_call_number
GROUP BY 1

This exercise was a great way to add additional tools to our algorithmic toolboxes.

SQL is surprisingly compact

While rewriting our javascript solutions in SQL, we were struck by how compactly and safely we could express solutions in SQL. For example, one challenge requires that you calculate the position of a submarine after it executes a series of up-down-forward commands (it’s day 2, part 1 if you really want to know the details of the problem). Here’s a solution in javascript:

let depth = 0;
let horizontal = 0;
instructions.forEach(({ direction, distance }) => {
  switch (direction) {
    case "up":
      depth -= distance;
      break;
    case "down":
      depth += distance;
      break;
    case "forward":
      horizontal += distance;
      break;
  }
});

console.log(depth * horizontal);

And here’s the same solution in SQL:

SELECT sum(distance) FILTER ( WHERE direction = 'forward' ) *
       (- sum(distance) FILTER ( WHERE direction = 'up' ) +
        sum(distance) FILTER ( WHERE direction = 'down' )) AS position
FROM day2_input

SQL actually looks more declarative and compact here than the javascript solution. We expected nearly all of my SQL solutions to be awkward but was pleasantly surprised to find it very elegant in this problem and in other cases.

We had another goal in writing this post. We’re hiring and these learnings provide a nice glimpse into the engineering culture at Heap. We use a lot of Postgres. We participate in Advent of Code challenges together. We have a growth mindset and love learning new things; if that sounds like a neat place to work, check us out.

postgresjavascript

The code that ChatGPT can't write

Optimizing Postgres Queries at Scale