Finding insightful SQL interview questions that truly test a candidate’s abilities can be challenging. One of my favorites involves a realistic scenario that assesses not just technical SQL skills, but also the ability to think analytically about data and its implications.
The Scenario
Imagine you’re working for a streaming service like Netflix or Hulu. You have a table storing user viewing data, and you want to understand user behavior by analyzing the sequences of movies they watch. Specifically, you’re tasked with finding the most popular sequence of the first two movies watched by users. This information can be incredibly valuable for improving recommendations and personalizing the user experience.
The Data
Your Snowflake database table, movie_watch
, has the following structure:
- user_id (INT): A unique identifier for each user.
- movie_id (INT): A unique identifier for each movie.
- watch_time (TIMESTAMP): The time the user started watching the movie.
The SQL Challenge
The goal is to write a SQL query that identifies the most popular pair of first and second movies watched by users. This involves several key steps:
- Determine Viewing Order: Establish the order in which each user watched movies based on the
watch_time
. - Isolate First Two Movies: Extract the first and second movies watched by each user.
- Aggregate and Rank: Count the occurrences of each movie pair and identify the most frequent combination.
Solution 1: Using Self-Join
A common approach to solve this problem is using a self-join:
WITH ranked_movies AS (
SELECT
user_id,
movie_id,
watch_time,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY watch_time) AS movie_rank
FROM
movie_watch
),
first_and_second_movies AS (
SELECT
a.user_id,
a.movie_id AS first_movie_id,
b.movie_id AS second_movie_id
FROM
ranked_movies a
JOIN
ranked_movies b
ON
a.user_id = b.user_id
AND a.movie_rank = 1
AND b.movie_rank = 2
)
SELECT
first_movie_id,
second_movie_id,
COUNT(*) AS pair_count
FROM
first_and_second_movies
GROUP BY
first_movie_id, second_movie_id
ORDER BY
pair_count DESC;
This query first ranks the movies watched by each user using the ROW_NUMBER()
window function. Then, it self-joins the ranked_movies
CTE to pair the first and second movies for each user. Finally, it aggregates the pairs and counts their occurrences.
Solution 2: Using QUALIFY (Snowflake and others)
A more concise and often more performant approach, especially in databases like Snowflake, is using the QUALIFY
clause:
WITH ranked_movies AS (
SELECT
user_id,
movie_id,
watch_time,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY watch_time) AS movie_rank
FROM
movie_watch
)
SELECT
FIRST_VALUE(movie_id) OVER (PARTITION BY user_id ORDER BY movie_rank) AS first_movie_id,
NTH_VALUE(movie_id, 2) OVER (PARTITION BY user_id ORDER BY movie_rank) AS second_movie_id
FROM ranked_movies
WHERE movie_rank <=2
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY movie_rank DESC) = 1
GROUP BY first_movie_id, second_movie_id
ORDER BY COUNT(*) DESC;
This query uses the same ranked_movies
CTE. However, instead of a self-join, it uses FIRST_VALUE
and NTH_VALUE
window functions to directly extract the first and second movie IDs. The QUALIFY
clause then filters the results to retain only the rows representing the second movie watched by each user (and implicitly the corresponding first movie).
Advantages of Using QUALIFY
-
Conciseness and Readability: The
QUALIFY
solution is generally more compact and easier to understand than the self-join approach, especially as the complexity of the sequence analysis increases. -
Performance: In databases that support it (like Snowflake),
QUALIFY
can offer performance benefits over self-joins. It avoids creating intermediate result sets, leading to more efficient query execution. -
Flexibility:
QUALIFY
provides a more flexible way to filter results based on window function outputs, making it easier to adapt the query for different analytical needs.
Example Dataset and Expected Output
Let’s illustrate with a small example:
user_id | movie_id | watch_time |
---|---|---|
1 | 101 | 2024-07-01 10:00:00 |
1 | 102 | 2024-07-01 12:00:00 |
2 | 101 | 2024-07-02 15:00:00 |
2 | 103 | 2024-07-02 17:00:00 |
3 | 102 | 2024-07-03 09:00:00 |
3 | 101 | 2024-07-03 11:00:00 |
Running the query on this data would yield (showing only the top sequences):
first_movie_id | second_movie_id | count |
---|---|---|
101 | 102 | 1 |
102 | 101 | 1 |
101 | 103 | 1 |
Why This Question is Valuable
This question is excellent for several reasons:
- Real-World Relevance: It directly addresses a common analytical task in the entertainment industry.
- Technical Depth: It tests knowledge of window functions, CTEs, the
QUALIFY
clause, aggregation, and sorting. - Analytical Thinking: It requires candidates to think strategically about how to approach the problem and interpret the results.
Business Implications
The insights derived from this query can be used to:
- Enhance Recommendations: Offer personalized “watch next” suggestions based on popular viewing patterns.
- Improve User Experience: Create curated movie lists or playlists based on common sequences.
- Inform Content Acquisition: Understand audience preferences and guide decisions about acquiring new content.
Challenge for Readers
How would you adapt this query to find the most popular sequence of the first three movies watched by users? Share your solutions in the comments!