Calculating Video Completion Rate Using SQL: A Practical Approach
The objective is to compute the completion rate for each video that had play activity in 2021, rounded to three decimal places, and order the results in descending order. The completion rate is defined as the proportion of plays where the viewing duration was greater than or equal to the video's length. The column for the rate should be named avg_comp_play_rate.
The data resides in two tables:
Interaction Log (user_video_log)
user_id: User identifiervideo_id: Video identifierstart_ts: Timestamp when playback began (second precision)end_ts: Timestamp when playback ended (second precision)is_follow: Boolean for follow acsionis_like: Boolean for like actionis_share: Boolean for share actioncomment_id: Identifier for a comment
Video Metadata (video_info)
video_id: Video identifiercreator_id: Creator identifiercategory_tag: Content categoryvideo_duration: Length of the video in secondspublish_time: Publication timestamp
Query Construction Strategy
A structured approach begins by outlining the core SQL skeleton based on the problem's clauses.
-
Identify the Target and Filters
- For videos in 2021: A filter on
start_tsusingWHERE YEAR(start_ts) = 2021. - For each video: A grouping clause
GROUP BY video_id. - Output specification: The
SELECTlist must includevideo_idand the calculated completion rate. - Ordering: Results must be sorted
ORDER BYthe completion rate inDESCorder.
- For videos in 2021: A filter on
-
Form the Initial Skeleton
SELECT video_id, [completion_rate] AS avg_comp_play_rate FROM [table_source] WHERE YEAR(start_ts) = 2021 GROUP BY video_id ORDER BY avg_comp_play_rate DESC -
Resolve the Placeholders
[table_source]: Data is required from both tables. AnINNER JOINon the commonvideo_idis necessary.[completion_rate]: This is the core calculation:Number of Complete Plays / Total Plays. A complete play occurs when(end_ts - start_ts) >= video_duration. The time difference can be calculated usingTIMESTAMPDIFF(SECOND, start_ts, end_ts).- Method A (Explicit Counts):
Alternative,ROUND( SUM(CASE WHEN TIMESTAMPDIFF(SECOND, start_ts, end_ts) >= video_duration THEN 1 ELSE 0 END) / COUNT(*), 3 )COUNT(CASE WHEN ... THEN 1 END)can be used for the numerator, asCOUNTignoresNULL. - Method B (Using AVG): Since the condition yields 1 for complete and 0 for incomplete, the average directly gives the ratio.
ROUND( AVG(CASE WHEN TIMESTAMPDIFF(SECOND, start_ts, end_ts) >= video_duration THEN 1.0 ELSE 0 END), 3 )
- Method A (Explicit Counts):
-
Assemble the Final Query Combining the components yields the executable SQL statement.
SELECT
vlog.video_id,
ROUND(
SUM(CASE WHEN TIMESTAMPDIFF(SECOND, start_ts, end_ts) >= info.video_duration THEN 1 ELSE 0 END) / COUNT(*),
3
) AS avg_comp_play_rate
FROM user_video_log AS vlog
INNER JOIN video_info AS info ON vlog.video_id = info.video_id
WHERE YEAR(start_ts) = 2021
GROUP BY vlog.video_id
ORDER BY avg_comp_play_rate DESC;
Key Technical Concepts
1. Conditional Aggregation
Using CASE expressions inside aggregate functions like SUM or AVG is a standard technique for counting rows that meet specific criteria.
2. Handling NULLs in Aggregates
SUM(column)andAVG(column)ignoreNULLvalues within the specified column.COUNT(column)counts non-NULL values in that column.COUNT(*)counts all rows in the group, irrespective of NULLs.
3. Date and Time Functions
YEAR(date): Extracts the year component from a date or datetime.TIMESTAMPDIFF(unit, datetime1, datetime2): Calculates the difefrence (datetime2 - datetime1) in the specified unit (e.g.,SECOND).
4. String and Date Extraction
While YEAR() is ideal for extracting the year, alternative string functions exist:
LEFT(string, n): Returns the firstncharacters.SUBSTRING(string, start, length): Extracts a substring from positionstart. Example:WHERE LEFT(start_ts, 4) = '2021'achieves the same filter.
5. Rounding Numeric Output
The ROUND(value, decimal_places) function is used to format the final result to the required precision.