Mastering SQL Gaps and Islands: Solutions for Inventory and Subscription Period Analysis

Author

Bevan Stanely

Published

July 6, 2024

I recently encountered the classical gaps and islands problem, an intriguing issue where the goal is to label clusters of consecutive sequences that contain gaps or missing values. Each cluster, or “island,” needs a distinct label so that aggregate functions can be applied to each one separately.

Note

The “gaps and islands” problem in SQL is a common scenario where you need to identify contiguous sequences (islands) and gaps in a dataset.

In this post, we will explore two simple business cases for gaps and islands. It is important to note that there are multiple ways to implement a solution for gaps and islands, including subqueries, the rank function with CTE, and cursors. I will focus on the rank function with CTE as I find it neat and effective.

Inventory Status

Businesses maintaining an inventory often need to track periods when a product is continuously in stock or out of stock. This can help understand supply chain issues and the demand patterns for various products. Such insights can aid businesses in making informed purchasing decisions, thus avoiding overstocking or understocking.

I have already set up the InventoryStatus table. You may run it to study the inventory status of two products for a six-day period from 25th June to 30th June 2024.

Tip

The SQL blocks are editable!!

select * from InventoryStatus;

Let’s proceed with the rank function with CTE approach. Here, we use two CTEs with the second one referencing the first. Run the query below to see it in action.

WITH StatusPeriods AS (
    SELECT
        ProductID,
        StatusDate,
        InStock,
        ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY StatusDate) - 
        ROW_NUMBER() OVER (PARTITION BY ProductID, InStock ORDER BY StatusDate) AS GroupID
    FROM InventoryStatus
),
Islands AS (
    SELECT
        ProductID,
        InStock,
        MIN(StatusDate) AS StartPeriod,
        MAX(StatusDate) AS EndPeriod,
        JULIANDAY(MAX(StatusDate)) - JULIANDAY(MIN(StatusDate)) AS Days
    FROM StatusPeriods
    GROUP BY ProductID, InStock, GroupID
)
SELECT * FROM Islands
ORDER BY ProductID, StartPeriod;

In the StatusPeriods CTE, two row numbers are calculated:

  • One based on the order of StatusDate partitioned by ProductID.

  • One based on the order of StatusDate partitioned by ProductID and InStock.

The difference between these row numbers (GroupID) helps identify contiguous periods by grouping rows where the difference is the same.

The Islands CTE groups by ProductID, InStock, and GroupID to identify contiguous periods, selecting the minimum StatusDate and maximum StatusDate as the start and end of each period, with Days representing the duration of each contiguous period.

Continuous Subscription Periods

Subscription models are prevalent across many industries, from music (Spotify) and movie streaming (Netflix) to online shopping (Amazon Prime) and food services (Zomato Gold). Predictable revenue, reducing churn, forecasting inventory, valuable customer data, and increased average order value are just a few of the many motivators for businesses adopting this model. The psychology behind consumer subscription behavior are also quite compelling for this wide adoption.

I have already set up the Subscriptions table with dummy data for the year 2024. You may run it to examine the data.

Note

A subscription period is considered continuous if the next subscription starts on the day following the end of the previous subscription.

select * from Subscriptions;

As before, we will implement the rank function with CTE approach. Run the following query:

WITH SubscriptionPeriods AS (
    SELECT 
        CustomerID,
        StartDate,
        EndDate,
        LAG(EndDate, 1) OVER (PARTITION BY CustomerID ORDER BY StartDate) AS PrevEndDate
    FROM Subscriptions
),
Islands AS (
    SELECT
        CustomerID,
        StartDate,
        EndDate,
        SUM(CASE 
                WHEN PrevEndDate IS NULL 
                  OR JULIANDAY(StartDate) - JULIANDAY(PrevEndDate) > 1 THEN 1
                ELSE 0
            END) OVER (PARTITION BY CustomerID ORDER BY StartDate) AS IslandGroup
    FROM SubscriptionPeriods
)
SELECT 
    CustomerID,
    MIN(StartDate) AS StartPeriod,
    MAX(EndDate) AS EndPeriod
FROM Islands
GROUP BY CustomerID, IslandGroup
ORDER BY CustomerID, StartPeriod;

In the SubscriptionPeriods CTE, the LAG function is used to get the EndDate of the previous subscription row for each customer.

In the Islands CTE, a CASE statement determines if the current row should start a new group. A new group starts if:

  • PrevEndDate is NULL (i.e., it’s the first row for the customer).

  • The difference in days between PrevEndDate and StartDate is more than one day.

A running total (SUM with CASE) is calculated to assign a unique IslandGroup identifier for each contiguous subscription period.

The final query performs the grouping for islands. This approach can be similarly applied to other business cases, such as calculating the continuous work period for employees or the vacation periods of employees.

By understanding and implementing the gaps and islands problem, businesses can gain valuable insights into various operational aspects, enabling more informed decision-making.