My Favorite SQL Interview Question: Analyzing Movie Viewing Sequences

Finding insightful SQL interview questions that truly test a candidate’s abilities can be challenging. One of my favorites involves a realistic scenario that assesses not just technical SQL skills, but also the ability to think analytically about data and its implications.

The Scenario

Imagine you’re working for a streaming service like Netflix or Hulu. You have a table storing user viewing data, and you want to understand user behavior by analyzing the sequences of movies they watch. Specifically, you’re tasked with finding the most popular sequence of the first two movies watched by users. This information can be incredibly valuable for improving recommendations and personalizing the user experience.

The Data

Your Snowflake database table, movie_watch, has the following structure:

user_id (INT): A unique identifier for each user.
movie_id (INT): A unique identifier for each movie.
watch_time (TIMESTAMP): The time the user started watching the movie.

The SQL Challenge

The goal is to write a SQL query that identifies the most popular pair of first and second movies watched by users. This involves several key steps:

Determine Viewing Order: Establish the order in which each user watched movies based on the watch_time.
Isolate First Two Movies: Extract the first and second movies watched by each user.
Aggregate and Rank: Count the occurrences of each movie pair and identify the most frequent combination.

Solution 1: Using Self-Join

A common approach to solve this problem is using a self-join:

WITH ranked_movies AS (
    SELECT
        user_id,
        movie_id,
        watch_time,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY watch_time) AS movie_rank
    FROM
        movie_watch
),
first_and_second_movies AS (
    SELECT
        a.user_id,
        a.movie_id AS first_movie_id,
        b.movie_id AS second_movie_id
    FROM
        ranked_movies a
    JOIN
        ranked_movies b
    ON
        a.user_id = b.user_id
        AND a.movie_rank = 1
        AND b.movie_rank = 2
)
SELECT
    first_movie_id,
    second_movie_id,
    COUNT(*) AS pair_count
FROM
    first_and_second_movies
GROUP BY
    first_movie_id, second_movie_id
ORDER BY
    pair_count DESC;

This query first ranks the movies watched by each user using the ROW_NUMBER() window function. Then, it self-joins the ranked_movies CTE to pair the first and second movies for each user. Finally, it aggregates the pairs and counts their occurrences.

Solution 2: Using QUALIFY (Snowflake and others)

A more concise and often more performant approach, especially in databases like Snowflake, is using the QUALIFY clause:

WITH ranked_movies AS (
    SELECT
        user_id,
        movie_id,
        watch_time,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY watch_time) AS movie_rank
    FROM
        movie_watch
)
SELECT
    FIRST_VALUE(movie_id) OVER (PARTITION BY user_id ORDER BY movie_rank) AS first_movie_id,
    NTH_VALUE(movie_id, 2) OVER (PARTITION BY user_id ORDER BY movie_rank) AS second_movie_id
FROM ranked_movies
WHERE movie_rank <=2
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY movie_rank DESC) = 1
GROUP BY first_movie_id, second_movie_id
ORDER BY COUNT(*) DESC;

This query uses the same ranked_movies CTE. However, instead of a self-join, it uses FIRST_VALUE and NTH_VALUE window functions to directly extract the first and second movie IDs. The QUALIFY clause then filters the results to retain only the rows representing the second movie watched by each user (and implicitly the corresponding first movie).

Advantages of Using `QUALIFY`

Conciseness and Readability: The QUALIFY solution is generally more compact and easier to understand than the self-join approach, especially as the complexity of the sequence analysis increases.
Performance: In databases that support it (like Snowflake), QUALIFY can offer performance benefits over self-joins. It avoids creating intermediate result sets, leading to more efficient query execution.
Flexibility: QUALIFY provides a more flexible way to filter results based on window function outputs, making it easier to adapt the query for different analytical needs.

Example Dataset and Expected Output

Let’s illustrate with a small example:

user_id	movie_id	watch_time
1	101	2024-07-01 10:00:00
1	102	2024-07-01 12:00:00
2	101	2024-07-02 15:00:00
2	103	2024-07-02 17:00:00
3	102	2024-07-03 09:00:00
3	101	2024-07-03 11:00:00

Running the query on this data would yield (showing only the top sequences):

first_movie_id	second_movie_id	count
101	102	1
102	101	1
101	103	1

Why This Question is Valuable

This question is excellent for several reasons:

Real-World Relevance: It directly addresses a common analytical task in the entertainment industry.
Technical Depth: It tests knowledge of window functions, CTEs, the QUALIFY clause, aggregation, and sorting.
Analytical Thinking: It requires candidates to think strategically about how to approach the problem and interpret the results.

Business Implications

The insights derived from this query can be used to:

Enhance Recommendations: Offer personalized “watch next” suggestions based on popular viewing patterns.
Improve User Experience: Create curated movie lists or playlists based on common sequences.
Inform Content Acquisition: Understand audience preferences and guide decisions about acquiring new content.

Challenge for Readers

How would you adapt this query to find the most popular sequence of the first three movies watched by users? Share your solutions in the comments!