I’m definitely not the only person who feels that YouTube sponsor segments have become longer and more frequent recently. Sometimes, I watch videos that seem to be trying to sell me something every couple of seconds.
On one hand, it’s great that both small and medium-sized YouTubers are able to make a living from their craft, but on the other hand, it sure is annoying to be bombarded by ads.
In this blog post, I will explore these sponsor segments, using data from a popular browser extension called SponsorBlock, to figure out if the perceived increase in ads actually did happen and also to quantify how many ads I’m watching.
I will walk you through my analysis, providing code snippets in SQL, DuckDB, and pandas. All the code is available on my GitHub, and since the dataset is open, I will also teach you how to download it, so that you can follow along and play with the data yourself.
These are the questions I will be trying to answer in this analysis:
- Have sponsor segments increased over the years?
- Which channels have the highest percentage of sponsor time per video?
- What is the density of sponsor segments throughout a video?
To get to these answers, we will have to cover much ground. This is the agenda for this post:
- How SponsorBlock Works
- Cleaning the Data
- Exploring and Enhancing the Data
- Answering Each of the Questions
- Conclusion
Let’s get this started!
How SponsorBlock Works
SponsorBlock is an extension that allows you to skip ad segments in videos, similar to how you skip Netflix intros. It’s incredibly accurate, as I don’t remember seeing one wrong segment since I started using it around a month ago, and I watch a lot of smaller non-English creators.
You might be asking yourself how the extension knows which parts of the video are sponsors, and, believe it or not, the answer is through crowdsourcing!
Users submit timestamps for the ad segments, and other users vote on whether they're accurate. For the average user, who isn't contributing at all, the only thing you have to do is press Enter to skip the ad.

Okay, now that you know what SponsorBlock is, let’s talk about the data.
Cleaning the Data
If you want to follow along, you can download a copy of the data using this SponsorBlock Mirror (it might take you quite a few minutes to download it all). The database schema can be seen here, although most of it won’t be useful for this project.
As one might expect, their database schema is made for the extension to work properly, and not for some guy to basically leech from a huge community effort to find what percentage of ads his favorite creator runs. For this, some work will need to be done to clean and model the data.
The only two tables that are important for this analysis are:
- sponsorTimes.csv: This is the most important table, containing the startTime and endTime of all crowdsourced sponsor segments. The CSV is around 5GB.
- videoInfo.csv: Contains the video title, publication date, and channel ID associated with each video.
Before we get into it, these are all the libraries I ended up using. I will explain the less obvious ones as we go.
pandas
duckdb
requests
requests-cache
python-dotenv
seaborn
matplotlib
numpy
The first step, then, is to load the data. Surprisingly, this was already a bit challenging, as I was getting a lot of errors parsing some rows of the CSV. These were the settings I found to work for the majority of the rows:
import duckdb
import os

# Connect to an in-memory DuckDB instance
con = duckdb.connect(database=':memory:')

sponsor_times = con.read_csv(
    "sb-mirror/sponsorTimes.csv",
    header=True,
    columns={
        "videoID": "VARCHAR",
        "startTime": "DOUBLE",
        "endTime": "DOUBLE",
        "votes": "INTEGER",
        "locked": "INTEGER",
        "incorrectVotes": "INTEGER",
        "UUID": "VARCHAR",
        "userID": "VARCHAR",
        "timeSubmitted": "DOUBLE",
        "views": "INTEGER",
        "category": "VARCHAR",
        "actionType": "VARCHAR",
        "service": "VARCHAR",
        "videoDuration": "DOUBLE",
        "hidden": "INTEGER",
        "reputation": "DOUBLE",
        "shadowHidden": "INTEGER",
        "hashedVideoID": "VARCHAR",
        "userAgent": "VARCHAR",
        "description": "VARCHAR",
    },
    ignore_errors=True,
    quotechar="",
)
video_info = con.read_csv(
    "sb-mirror/videoInfo.csv",
    header=True,
    columns={
        "videoID": "VARCHAR",
        "channelID": "VARCHAR",
        "title": "VARCHAR",
        "published": "DOUBLE",
    },
    ignore_errors=True,
    quotechar=None,
)

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
Here is what a sample of the data looks like:
con.sql("SELECT videoID, startTime, endTime, votes, locked, category FROM sponsor_times LIMIT 5")
con.sql("SELECT * FROM video_info LIMIT 5")


Understanding the data in the sponsorTimes table is ridiculously important; otherwise, the cleaning process won't make any sense.
Each row represents a user-submitted timestamp for a sponsored segment. Since multiple users can submit segments for the same video, the dataset contains duplicate and potentially incorrect entries, which will need to be dealt with during cleaning.
To find incorrect segments, I will use the votes and locked columns, as the latter marks segments that were confirmed to be correct.
Another important column is category. There are a bunch of categories, like Intro, Outro, and Filler, but for this analysis I will only work with Sponsor and Self-Promo.
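If you're curious how the raw submissions are distributed across categories before any filtering, a quick aggregation along these lines works (a sketch over the sponsor_times relation loaded above):

# Peek at how submissions are distributed across categories (sketch).
con.sql("""
    SELECT category, COUNT(*) AS submissions
    FROM sponsor_times
    GROUP BY category
    ORDER BY submissions DESC
""").show()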
I started by applying some filters:
CREATE TABLE filtered AS
SELECT
    *
FROM
    sponsor_times
WHERE
    category IN ('sponsor', 'selfpromo')
    AND (votes > 0 OR locked = 1)
Filtering for locked segments or segments with more than 0 votes was a big decision. This reduced the dataset by a huge percentage, but doing so made the data very reliable. For example, before doing this, all of the Top 50 channels with the highest percentage of ads were just spam, random channels that ran 99.9% of ads.
With this done, the next step is to get a dataset where each sponsor segment shows up only once. For example, a video with a sponsor segment at the beginning and another at the end should have only two rows of data.
This is very much not the case so far, since one video can have multiple user-submitted entries for each segment. To deduplicate them, I will use window functions to identify whether two or more rows of data represent the same segment.
The first window function compares the startTime of one row with the endTime of the previous. If these values don't overlap, it means they are entries for separate segments; otherwise, they are repeated entries for the same segment.
CREATE TABLE new_segments AS
SELECT
    -- Coalesce to TRUE to deal with the first row of every window,
    -- as the values are NULL, but it should count as a new segment.
    COALESCE(
        startTime > LAG(endTime) OVER (PARTITION BY videoID ORDER BY startTime),
        true
    ) AS new_ad_segment,
    *
FROM
    filtered

The new_ad_segment column is TRUE every time a row represents a new segment of a video. The first two rows, as their timestamps overlap, are properly marked as the same segment.
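To see the flag in action, you can inspect the submissions for a single video, ordered by start time (a sketch; 'some_video_id' is a placeholder, not a real ID from the dataset):

# Spot-check the flag for one video's submissions ('some_video_id' is a placeholder).
con.sql("""
    SELECT videoID, startTime, endTime, votes, new_ad_segment
    FROM new_segments
    WHERE videoID = 'some_video_id'
    ORDER BY startTime
""").show()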
Next up, the second window function will label each ad segment by number:
CREATE TABLE ad_segments AS
SELECT
    SUM(new_ad_segment) OVER (PARTITION BY videoID ORDER BY startTime) AS ad_segment,
    *
FROM
    new_segments

Finally, now that each segment is properly numbered, it’s easy to get the segment that is either locked or has the highest amount of votes.
CREATE TABLE unique_segments AS
SELECT DISTINCT ON (videoID, ad_segment)
    *
FROM
    ad_segments
ORDER BY
    videoID, ad_segment, locked DESC, votes DESC

That’s it! Now this table has one row for each unique ad segment, and I can start exploring the data.
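A quick way to confirm the deduplication did something sensible is to compare row counts before and after (a minimal sketch using the tables created above):

# Compare submitted rows vs. deduplicated segments (sanity check).
con.sql("""
    SELECT
        (SELECT COUNT(*) FROM filtered)        AS submitted_rows,
        (SELECT COUNT(*) FROM unique_segments) AS unique_segments
""").show()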
If these queries feel complicated, and you need a refresher on window functions, check out this blog post that will teach you all you need to know about them! The last example covered in the blog post is almost exactly the process I used here.
Exploring and Enhancing the Data
Finally, the dataset is good enough to start exploring. The first thing I did was to get a sense of the size of the data:
- 36.0k Unique Channels
- 552.6k Unique Videos
- 673.8k Unique Sponsor Segments, for an average of 1.22 segments per video
As mentioned earlier, filtering for segments that were either locked or had at least one upvote reduced the dataset massively, by around 80%. But this is the price I had to pay to have data I could work with.
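These summary numbers came from queries along these lines (a sketch; the exact handling of blank channel IDs is omitted here):

# Rough counts of channels, videos, and segments (sketch).
con.sql("""
    SELECT
        COUNT(DISTINCT video_info.channelID)                AS unique_channels,
        COUNT(DISTINCT unique_segments.videoID)             AS unique_videos,
        COUNT(*)                                            AS unique_sponsor_segments,
        COUNT(*) / COUNT(DISTINCT unique_segments.videoID)  AS segments_per_video
    FROM unique_segments
    LEFT JOIN video_info ON unique_segments.videoID = video_info.videoID
""").show()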
To check that nothing was immediately wrong with the data, I gathered the channels with the most videos:
CREATE TABLE top_5_channels AS
SELECT
    channelID,
    count(DISTINCT unique_segments.videoID) AS video_count
FROM
    unique_segments
    LEFT JOIN video_info ON unique_segments.videoID = video_info.videoID
WHERE
    channelID IS NOT NULL
    -- Some channel IDs are blank
    AND channelID != '""'
GROUP BY
    channelID
ORDER BY
    video_count DESC
LIMIT 5

The number of videos per channel looks realistic… But this is terrible to work with. I don't want to go to my browser and look up channel IDs every time I want to know the name of a channel.
To fix this, I created a small script with functions to get these values from the YouTube API in Python. I'm using the requests_cache library to make sure I won't be repeating API calls and depleting the API limits.
import requests
import requests_cache
from dotenv import load_dotenv
import os

load_dotenv()
API_KEY = os.getenv("YT_API_KEY")

# Cache responses indefinitely
requests_cache.install_cache("youtube_cache", expire_after=None)

def get_channel_name(channel_id: str) -> str:
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    response = requests.get(url)
    data = response.json()
    try:
        return data.get("items", [])[0].get("snippet", {}).get("title", "")
    except (IndexError, AttributeError):
        return ""
Besides this, I also created very similar functions to get the country and thumbnail of each channel, which will be useful later. If you’re interested in the code, check the GitHub repo.
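The real implementations live in the repo, but as a rough idea, the country lookup might look something like this (a sketch, assuming the same snippet part of the Channels API, which exposes an optional country field):

# Hedged sketch of the country lookup; the actual code is in the repo.
def get_channel_country(channel_id: str) -> str:
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    data = requests.get(url).json()
    try:
        # "country" is optional in the API response, so fall back to "".
        return data.get("items", [])[0].get("snippet", {}).get("country", "")
    except (IndexError, AttributeError):
        return ""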
In my DuckDB code, I'm now able to register these Python functions and call them within SQL! I just need to be very careful to always use them on aggregated and filtered data; otherwise, I can say bye-bye to my API quota.
# This is the script created above
from youtube_api import get_channel_name

# Try registering the function, ignore if it already exists
try:
    con.create_function('get_channel_name', get_channel_name, [str], str)
except Exception as e:
    print(f"Skipping function registration (possibly already exists): {e}")

# Get the channel names
channel_names = con.sql("""
    select
        channelID,
        get_channel_name(channelID) as channel_name,
        video_count
    from top_5_channels
""")

Much better! I looked up two channels that I’m familiar with on YouTube for a quick sanity check. Linus Tech Tips has a total of 7.2k videos uploaded, with 2.3k present in this dataset. Gamers Nexus has 3k videos, with 700 in the dataset. Looks good enough for me!
The last thing to do, before moving on to the questions I set out to answer, is to get an idea of the average duration of the videos.

This matches my expectations, for the most part. I'm still a bit surprised by the number of 20–40-minute videos, as for many years the "meta" was to make 10-minute videos to maximize revenue from YouTube's own ads.
Also, I thought those buckets of video durations used in the previous graph were quite representative of how I think about video lengths, so I will be sticking with them for the next sections.
For reference, this is the pandas code used to create those buckets.
import pandas as pd

video_lengths = con.sql("""
    SELECT DISTINCT ON (videoID)
        videoID,
        videoDuration
    FROM
        unique_segments
    WHERE
        videoID IS NOT NULL
        AND videoDuration > 0
""").df()

# Define custom bins, in minutes
bins = [0, 3, 7, 12, 20, 40, 90, 180, 600, 9999999]
labels = ["0-3", "3-7", "7-12", "12-20", "20-40", "40-90", "90-180", "180-600", "600+"]

# Assign each video to a bucket (transform duration to minutes)
video_lengths["duration_bucket"] = pd.cut(video_lengths["videoDuration"] / 60, bins=bins, labels=labels, right=False)
Have Sponsor Segments Increased Over the Years?
The big question. This will prove whether or not I'm being paranoid about everyone trying to sell me something at all times. I will start, though, with a simpler question: what percentage of a video's runtime is sponsored, across different video durations?
My expectation is that shorter videos dedicate a higher share of their runtime to sponsors than longer videos. Let's check if this is actually the case.
CREATE TABLE video_total_ads AS
SELECT
    videoID,
    MAX(videoDuration) AS videoDuration,
    SUM(endTime - startTime) AS total_ad_duration,
    SUM(endTime - startTime) / 60 AS ad_minutes,
    SUM(endTime - startTime) / MAX(videoDuration) AS ad_percentage,
    MAX(videoDuration) / 60 AS video_duration_minutes
FROM
    unique_segments
WHERE
    videoDuration > 0
    -- Keep videos up to 90 minutes (5,400 seconds)
    AND videoDuration < 5400
    AND videoID IS NOT NULL
GROUP BY
    videoID
To keep the visualization simple, I’m applying similar buckets, but only up to 90 minutes.
# Define duration buckets (in minutes, up to 90 min)
bins = [0, 3, 7, 12, 20, 30, 40, 60, 90]
labels = ["0-3", "3-7", "7-12", "12-20", "20-30", "30-40", "40-60", "60-90"]

# Materialize the DuckDB table as a pandas DataFrame
video_total_ads = con.table("video_total_ads").df()

# Apply the buckets again
video_total_ads["duration_bucket"] = pd.cut(video_total_ads["videoDuration"] / 60, bins=bins, labels=labels, right=False)

# Group by bucket and sum ad times and total durations
bucket_data = video_total_ads.groupby("duration_bucket")[["ad_minutes", "videoDuration"]].sum()

# Convert to percentage of total video time
bucket_data["ad_percentage"] = (bucket_data["ad_minutes"] / (bucket_data["videoDuration"] / 60)) * 100
bucket_data["video_percentage"] = 100 - bucket_data["ad_percentage"]

As expected, if you're watching shorter-form content on YouTube, then around 10% of it is sponsored! Videos of 12–20 minutes are about 6.5% sponsors, while 20–30-minute videos are only 4.8%.
To move forward to the year-by-year analysis, I need to join the sponsor times with the videoInfo table.
CREATE TABLE video_total_ads_joined AS
SELECT
    *
FROM
    video_total_ads
    LEFT JOIN video_info ON video_total_ads.videoID = video_info.videoID
Next, let’s just check how many videos we have per year:
SELECT
    *,
    to_timestamp(NULLIF(published, 0)) AS published_date,
    extract(year FROM to_timestamp(NULLIF(published, 0))) AS published_year
FROM
    video_total_ads_joined

Not good, not good at all. I'm not exactly sure why, but a lot of videos don't have their publication timestamp recorded. It seems that only in 2021 and 2022 were videos reliably stored with their published date.
I do have some ideas on how I can improve this dataset with other public data, but it’s a very time-consuming process and I will leave this for a future blog post. I don’t intend to settle for an answer based on limited data, but for now, I will have to make do with what I have.
I chose to keep the analysis between the years 2018 and 2023, given that those years had more data points.
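One practical note: the pandas snippet below filters and groups on a published_year column, so that derived year needs to live on the joined table. A minimal sketch of how that might be done (assuming DuckDB's CREATE OR REPLACE TABLE; the duration buckets from earlier are assumed to be reapplied to the joined data as well, as in the repo):

# Recreate the joined table with a published_year column (sketch).
con.sql("""
    CREATE OR REPLACE TABLE video_total_ads_joined AS
    SELECT
        video_total_ads.*,
        video_info.channelID,
        extract(year FROM to_timestamp(NULLIF(video_info.published, 0))) AS published_year
    FROM
        video_total_ads
        LEFT JOIN video_info ON video_total_ads.videoID = video_info.videoID
""")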
# Limiting the years, as these are the ones with a decent amount of data.
start_year = 2018
end_year = 2023

plot_df = (
    con.table("video_total_ads_joined").df()
    .query(f"published_year >= {start_year} and published_year <= {end_year}")
    .groupby(["published_year", "duration_bucket"], as_index=False)
    [["ad_minutes", "video_duration_minutes"]]
    .sum()
)

# Calculate ad_percentage & content_percentage
plot_df["ad_percentage"] = (
    plot_df["ad_minutes"] / plot_df["video_duration_minutes"] * 100
)
plot_df["content_percentage"] = 100 - plot_df["ad_percentage"]

There is a steep increase in ad percentage, especially from 2020 to 2021, but afterward it plateaus, especially for longer videos. This makes a lot of sense, since online advertising grew a lot during those years, as people spent more and more time at home.
For shorter videos, there does seem to be an increase from 2022 to 2023. But as the data is limited, and I don’t have data for 2024, I can’t get a conclusive answer to this.
Next up, let's move on to questions that don't depend on the publishing date; this way I can work with a larger portion of the dataset.
Which Channels Have the Highest Percentage of Sponsor Time Per Video?
This is a fun one for me, as I wonder if the channels I actively watch are the ones that run the most ads.
Continuing from the table created previously, I can easily aggregate ad and video minutes by channel:
CREATE TABLE ad_percentage_per_channel AS
SELECT
    channelID,
    sum(ad_minutes) AS channel_total_ad_minutes,
    sum(videoDuration) / 60 AS channel_total_video_minutes,
    -- Share of each channel's total runtime taken up by sponsors
    sum(ad_minutes) / (sum(videoDuration) / 60) AS channel_ad_percentage
FROM
    video_total_ads_joined
GROUP BY
    channelID
I decided to filter for channels that had at least 30 hours of video in the data, as a way of eliminating outliers.
SELECT
    channelID,
    channel_total_video_minutes,
    channel_total_ad_minutes,
    channel_ad_percentage
FROM
    ad_percentage_per_channel
WHERE
    -- At least 30 hours (1,800 minutes) of video
    channel_total_video_minutes > 1800
    AND channelID IS NOT NULL
ORDER BY
    channel_ad_percentage DESC
LIMIT 50
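To turn that list into something readable, the same UDF trick from before can be reused on just these 50 rows, which keeps API usage small (a sketch; it assumes get_channel_name is still registered on the connection):

# Enrich the top-50 list with human-readable channel names (sketch).
top_channels_named = con.sql("""
    SELECT
        get_channel_name(channelID) AS channel_name,
        channel_ad_percentage,
        channel_total_ad_minutes,
        channel_total_video_minutes
    FROM ad_percentage_per_channel
    WHERE channel_total_video_minutes > 1800
        AND channelID IS NOT NULL
    ORDER BY channel_ad_percentage DESC
    LIMIT 50
""").df()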
As briefly mentioned earlier, I also created functions to get the country and thumbnail of each channel. This allowed me to create this visualization.

I’m not sure if this surprised me or not. Some of the channels on this list I watch very frequently, especially Gaveta (#31), a Brazilian YouTuber who covers movies and film editing.
I also know that both he and Corridor Crew (#32) do a lot of self-promo, plugging their own content and products, so maybe this is also the case for other channels!
In any case, the data seems good, and the percentages seem to match my manual checks and personal experience.
I would love to know if channels that you watch were present in this list, and if it surprised you or not!
If you want to see the Top 150 Creators, subscribe to my free newsletter, as I will be publishing the full list as well as more information about this analysis in there!
What is the Density of Sponsor Segments Throughout a Video?
Have you ever thought about at which point of a video ads work best? People probably just skip sponsor segments placed at the beginning, and simply close the video when they're placed at the end.
From personal experience, I feel that I’m more likely to watch an ad if it plays around the middle of a video, but I don’t think this is what creators do in most cases.
My goal, then, is to create a heatmap that shows the density of ads during a video runtime. Doing this was surprisingly not obvious, and the solution that I found was so clever that it kinda blew my mind. Let me show you.
This is the data needed for this analysis. One row per ad, with the timestamp when each segment starts and ends:

The first step is to normalize the intervals: I don't care that an ad started at 63 seconds; what I want to know is whether it started at 1% or at 50% of the video's runtime.
CREATE TABLE ad_intervals AS
SELECT
    videoID,
    startTime,
    endTime,
    videoDuration,
    startTime / videoDuration AS start_fraction,
    endTime / videoDuration AS end_fraction
FROM
    unique_segments
WHERE
    -- Just to make sure we don't have bad data
    videoID IS NOT NULL
    AND startTime >= 0
    AND endTime <= videoDuration
    AND startTime < endTime
    -- Less than 40h
    AND videoDuration < 144000

Great, now all intervals are comparable, but the problem is far from solved.
I want you to think about how you would solve this. What if I asked you, "Across all videos, how many ads are running at 10% of the runtime?"
I do not believe this is an obvious problem to solve. My first instinct was to create a bunch of buckets and then, for each row, ask: "Is there an ad running at 1% of the runtime? What about at 2%? And so on…"
This seemed like a terrible idea, though. I wouldn't be able to do it in SQL, and the code to solve it would be incredibly messy. In the end, the solution I found was remarkably simple to implement, using the Sweep Line Algorithm, an approach often used in programming interviews and puzzles.
I will show you how I solved it but don’t worry if you don’t understand what is happening. I will share other resources for you to learn more about it later on.
The first thing to do is to transform each interval (startTime, endTime) into two events: one that counts as +1 when the ad starts, and another that counts as -1 when the ad ends. Afterward, just order the events by the fraction at which they occur.
CREATE TABLE ad_events AS
WITH unioned AS (
    -- This is the most important step.
    SELECT
        videoID,
        start_fraction AS fraction,
        1 AS delta
    FROM ad_intervals
    UNION ALL
    SELECT
        videoID,
        end_fraction AS fraction,
        -1 AS delta
    FROM ad_intervals
), ordered AS (
    SELECT
        videoID,
        fraction,
        delta
    FROM unioned
    ORDER BY fraction, delta
)
SELECT * FROM ordered

Now it’s already much easier to see the path forward! All I have to do is use a running sum on the delta column, and then, at any point of the dataset, I can know how many ads are running!
For example, if from 0s to 10s three ads started, but two of those also finished, I would have a delta of +3 and then -2, which means that there is only one ad currently running!
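To make the idea concrete, here is the same trick on a toy example in plain Python (the intervals are made up for illustration and already normalized to fractions of the runtime):

# Toy sweep line example with made-up, already-normalized intervals.
intervals = [(0.00, 0.05), (0.02, 0.08), (0.50, 0.60)]

# Turn each interval into a +1 event at its start and a -1 event at its end.
events = sorted(
    [(start, +1) for start, _ in intervals] + [(end, -1) for _, end in intervals]
)

# Running sum of deltas = number of ads active at each event point.
running = 0
for fraction, delta in events:
    running += delta
    print(f"at {fraction:.2f} of the runtime: {running} ad(s) active")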
Going forward, and to simplify the data a bit, I first round the fractions to 4 decimal places and aggregate them. This is not necessary, but having too many rows was a problem when trying to plot the data. Finally, I divide the number of running ads by the total number of videos, to express it as a percentage.
CREATE TABLE ad_counter AS
WITH rounded_and_grouped AS (
    SELECT
        ROUND(fraction, 4) AS fraction,
        SUM(delta) AS delta
    FROM ad_events
    GROUP BY ROUND(fraction, 4)
    ORDER BY fraction
), running_sum AS (
    SELECT
        fraction,
        SUM(delta) OVER (ORDER BY fraction) AS ad_counter
    FROM rounded_and_grouped
), density AS (
    SELECT
        fraction,
        ad_counter,
        -- Share of all filtered videos that have an ad running at this point
        ad_counter / (SELECT COUNT(DISTINCT videoID) FROM ad_intervals) AS density
    FROM running_sum
)
SELECT * FROM density

With this data, I know not only that at the very beginning of videos (0.0% of the runtime) there are 69,987 videos running ads, but also that this represents 17% of all videos in the dataset.
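Individual points on that curve can also be inspected directly; for example, around the middle of the runtime (a quick check, not part of the original analysis code):

# Look at ad density around the midpoint of the runtime.
con.sql("""
    SELECT fraction, ad_counter, density
    FROM ad_counter
    WHERE fraction BETWEEN 0.49 AND 0.51
    ORDER BY fraction
""").show()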
Now I can finally plot it as a heatmap:

As expected, the bumps at the extremities show that it’s way more common for channels to run ads at the beginning and end of the video. It’s also interesting that there is a plateau around the middle of the video, but then a drop, as the second half of the video is generally more ad-free.
What I found funny is that it’s apparently common for some videos to start straight away with an ad. I couldn’t picture this, so I manually checked 10 videos and it’s actually true… I’m not sure how representative it is, but most of the ones that I opened were gaming-related and in Russian, and they started directly with ads!
Before we move on to the conclusions, what did you think of the solution to this problem? I was surprised at how simple the Sweep Line trick made it. If you want to know more about it, I recently published a blog post covering some SQL patterns, and the last one is exactly this problem, just repackaged in the context of counting concurrent meetings.
Conclusion
I really enjoyed doing this analysis since the data feels very personal to me, especially because I’ve been addicted to YouTube lately. I also feel that the answers I found were quite satisfactory, at least for the most part. To finish it off, let’s do a last recap!
Have Sponsor Segments Increased Over the Years?
There was a clear increase from 2020 to 2021. This was an effect that happened throughout all digital media and it’s clearly shown in this data. In more recent years, I can’t say whether there was an increase or not, as I don’t have enough data to be confident.
Which Channels Have the Highest Percentage of Sponsor Time Per Video?
I got to create a very convincing list of the Top 50 channels that run the highest amount of ads. And I discovered that some of my favorite creators are the ones that spend the most amount of time trying to sell me something!
What is the density of sponsor segments throughout a video?
As expected, most people run ads at the beginning and the end of videos. Besides this, a lot of creators run ads around the middle of the video, making the second half slightly more ad-free.
Also, there are YouTubers who immediately start a video with ads, which I think is a crazy strategy.
Other Learnings and Next Steps
I liked how clearly the data showed the percentage of ads across different video lengths. Now I know that I'm probably spending 5–6% of my time on YouTube watching ads if I'm not skipping them, since I mostly watch videos that are 10–20 minutes long.
I’m still not fully happy though with the year-by-year analysis. I’ve already looked into other data and downloaded more than 100 GB of YouTube metadata datasets. I’m confident that I can use it, together with the YouTube API, to fill some gaps and get a more convincing answer to my question.
Visualization Code
You might have noticed that I didn’t provide snippets to plot the charts shown here. This was on purpose to make the blog post more readable, as matplotlib code occupies a lot of space.
You can find all the code in my GitHub repo, so you can copy my charts if you want to.
That’s it for this one! I really hope you enjoyed reading this blog post and learned something new!
If you’re curious about interesting topics that didn’t make it into this post, or enjoy learning about data, subscribe to my free newsletter on Substack. I publish whenever I have something genuinely interesting to share.
Want to connect directly or have questions? Reach out anytime at mtrentz.com.
All images and animations by the author unless stated otherwise.