I recently encountered the classical gaps and islands problem, an intriguing issue where the goal is to label clusters of consecutive sequences that contain gaps or missing values. Each cluster, or “island,” needs a distinct label so that aggregate functions can be applied to each one separately.
The “gaps and islands” problem in SQL is a common scenario where you need to identify contiguous sequences (islands) and gaps in a dataset.
In this post, we will explore two simple business cases for gaps and islands. It is important to note that there are multiple ways to implement a solution for gaps and islands, including subqueries, the rank function with CTE, and cursors. I will focus on the rank function with CTE as I find it neat and effective.
Inventory Status
Businesses maintaining an inventory often need to track periods when a product is continuously in stock or out of stock. This can help understand supply chain issues and the demand patterns for various products. Such insights can aid businesses in making informed purchasing decisions, thus avoiding overstocking or understocking.
I have already set up the InventoryStatus
table. You may run it to study the inventory status of two products for a six-day period from 25th June to 30th June 2024.
The SQL blocks are editable!!
select * from InventoryStatus;
Let’s proceed with the rank function with CTE approach. Here, we use two CTEs with the second one referencing the first. Run the query below to see it in action.
WITH StatusPeriods AS (
SELECT
ProductID,
StatusDate,
InStock,ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY StatusDate) -
ROW_NUMBER() OVER (PARTITION BY ProductID, InStock ORDER BY StatusDate) AS GroupID
FROM InventoryStatus
),AS (
Islands SELECT
ProductID,
InStock,MIN(StatusDate) AS StartPeriod,
MAX(StatusDate) AS EndPeriod,
MAX(StatusDate)) - JULIANDAY(MIN(StatusDate)) AS Days
JULIANDAY(FROM StatusPeriods
GROUP BY ProductID, InStock, GroupID
)SELECT * FROM Islands
ORDER BY ProductID, StartPeriod;
In the StatusPeriods
CTE, two row numbers are calculated:
One based on the order of
StatusDate
partitioned byProductID
.One based on the order of
StatusDate
partitioned byProductID
andInStock
.
The difference between these row numbers (GroupID
) helps identify contiguous periods by grouping rows where the difference is the same.
The Islands CTE groups by ProductID
, InStock
, and GroupID
to identify contiguous periods, selecting the minimum StatusDate
and maximum StatusDate
as the start and end of each period, with Days
representing the duration of each contiguous period.
Continuous Subscription Periods
Subscription models are prevalent across many industries, from music (Spotify) and movie streaming (Netflix) to online shopping (Amazon Prime) and food services (Zomato Gold). Predictable revenue, reducing churn, forecasting inventory, valuable customer data, and increased average order value are just a few of the many motivators for businesses adopting this model. The psychology behind consumer subscription behavior are also quite compelling for this wide adoption.
I have already set up the Subscriptions
table with dummy data for the year 2024. You may run it to examine the data.
A subscription period is considered continuous if the next subscription starts on the day following the end of the previous subscription.
select * from Subscriptions;
As before, we will implement the rank function with CTE approach. Run the following query:
WITH SubscriptionPeriods AS (
SELECT
CustomerID,
StartDate,
EndDate,LAG(EndDate, 1) OVER (PARTITION BY CustomerID ORDER BY StartDate) AS PrevEndDate
FROM Subscriptions
),AS (
Islands SELECT
CustomerID,
StartDate,
EndDate,SUM(CASE
WHEN PrevEndDate IS NULL
OR JULIANDAY(StartDate) - JULIANDAY(PrevEndDate) > 1 THEN 1
ELSE 0
END) OVER (PARTITION BY CustomerID ORDER BY StartDate) AS IslandGroup
FROM SubscriptionPeriods
)SELECT
CustomerID,MIN(StartDate) AS StartPeriod,
MAX(EndDate) AS EndPeriod
FROM Islands
GROUP BY CustomerID, IslandGroup
ORDER BY CustomerID, StartPeriod;
In the SubscriptionPeriods
CTE, the LAG
function is used to get the EndDate
of the previous subscription row for each customer.
In the Islands
CTE, a CASE
statement determines if the current row should start a new group. A new group starts if:
PrevEndDate
isNULL
(i.e., it’s the first row for the customer).The difference in days between
PrevEndDate
andStartDate
is more than one day.
A running total (SUM
with CASE
) is calculated to assign a unique IslandGroup
identifier for each contiguous subscription period.
The final query performs the grouping for islands. This approach can be similarly applied to other business cases, such as calculating the continuous work period for employees or the vacation periods of employees.
By understanding and implementing the gaps and islands problem, businesses can gain valuable insights into various operational aspects, enabling more informed decision-making.