Wednesday, October 10, 2012

Home on the (Date) Range

Here's an interesting problem that came across my desk this week.  We have, to generalize the data a little, a table containing various product discounts, each of which has a date range during which it applies.

Here's what the table looks like:
CREATE TABLE productdiscounts (startdate DATE,enddate DATE,keyval CHAR(3),discount int) 
INSERT INTO productdiscounts VALUES
('1/1/12','1/31/12','abc',1)
,('1/17/12','2/15/12','abc',8)
,('2/7/12','3/7/12','abc',4)
,('1/15/12','5/13/12','def',2)
,('4/17/12','4/28/12','def',16)
,('2/12/12','4/20/12','def',32)
The challenge was, we need to derive a set that has a row for each date range for which a different set of discounts applies.  So, for product abc, we have a $1 discount from 1/1/12 through 1/16/12, and then on 1/17/12, an additional discount of $8 takes effect.  On 2/1/12, the $1 discount goes out of effect, and then on 2/7/12, a new $4 discount takes effect.  And so on.

Basically, the set we need to create should look like this:
keyval startdate         enddate         discount
abc         2012-01-01   2012-01-16 1
abc         2012-01-17 2012-01-31 9
abc         2012-02-01 2012-02-06 8
abc         2012-02-07   2012-02-15 12
abc         2012-02-16 2012-03-07 4
def         2012-01-15 2012-02-11 2
def         2012-02-12 2012-04-16 34
def         2012-04-17 2012-04-20 50
def         2012-04-21   2012-04-28 18
def         2012-04-29 2012-05-13 2
So how do you take a set of staggering, overlapping date ranges, and come up with one continuous set of ranges, aggregating the discounts as you go?

Our old friend, the numbers table

My first instinct was to use a numbers table and compute the aggregates against that set, like this:

SELECT pd.keyval,d.dayval,SUM(pd.discount) FROM
(SELECT DATEADD(d,number,'1/1/12') AS dayval FROM master..spt_values WHERE type='p') d
INNER JOIN dbo.productdiscounts pd ON pd.startdate<=d.dayval AND pd.enddate>=d.dayval
GROUP BY pd.keyval,d.dayval
The derived table d gives us a set of days from 1/1/12 through sometime in 2017 (2048 days), and by joining to the product discounts, we can add up the discount in effect on each day.  The resultant set looks like this (I omitted the records for every single day):

keyval dayval totaldiscount
abc 2012-01-01 1
...
abc 2012-01-16 1
abc 2012-01-17 9
...
abc 2012-01-31 9
abc 2012-02-01 8
...
abc 2012-02-06 8
abc 2012-02-07 12
 ...
abc 2012-02-15 12
Now the question becomes, how do we find the starting and ending dates for each unique range?  It's not enough to just look for MIN(dayval) and MAX(dayval) for a given aggregate, since the total discount might go from 1 to 9 and then back to 1 again.  SQL 2012 has some window functions that would help with this, but our shop is barely on 2008.

We were pretty close to (ugh) walking this table with a cursor.  But then we did the math.  For two years of discounts, we'd have over 700 rows for each product, multiplied by the number of products.  Using a cursor on millions or rows wasn't a pleasant prospect.

A fresh approach

Looking at the data, we realized that the number of dates involved was fairly small.  Discounts tended to start and end at month end, so in a two year range, we had maybe 40 dates on which a discount started or ended.  And so that got me thinking.

Our desired date ranges would always start on either the start date of a particular discount, or on the day after the end date of a discount.  If a discount wasn't starting or ending, there'd be no change in the total discount that day.

With this in mind, we can create a CTE of "interesting dates":

WITH interestingdates AS (
SELECT DISTINCT keyval,startdate FROM dbo.productdiscounts
UNION SELECT DISTINCT keyval,DATEADD(d,1,enddate) FROM dbo.productdiscounts)
Note that the DISTINCT will eliminate any duplicate dates in the list.  Note also that the keyval (i.e. the product id) is also included, since a date is interesting only for the product affected by that discount.

Armed with our list of interesting dates by product, we can calculate the end date of each range, which is just the day before the next start date:

,interestingranges as
(SELECT a.keyval,a.startdate,
      (SELECT DATEADD(d,-1,MIN(startdate)) FROM interestingdates b WHERE b.keyval=a.keyval AND b.startdate>a.startdate) AS enddate
FROM interestingdates a)
Which leaves us in a pretty good spot.  Here's what the interestingranges CTE looks like:
keyval startdate         enddate
abc        2012-01-01 2012-01-16
abc        2012-01-17 2012-01-31
abc        2012-02-01 2012-02-06
abc        2012-02-07 2012-02-15
abc        2012-02-16 2012-03-07
abc        2012-03-08 NULL
def        2012-01-15 2012-02-11
def        2012-02-12 2012-04-16
def        2012-04-17 2012-04-20
def        2012-04-21 2012-04-28
def        2012-04-29 2012-05-13
def        2012-05-14 NULL
Notice that the last date range for each product is open-ended, since our calculation of the end date is based on the next start date.  These ranges might be handy; if not, it's easy to filter them out.

Having identified the interesting ranges, we just need to aggregate the discount amounts:

SELECT ir.keyval, ir.startdate, ir.enddate, SUM(pd.discount) AS discount
FROM interestingranges ir
INNER JOIN dbo.productdiscounts pd ON pd.startdate<= ir.startdate AND pd.enddate>=ir.enddate AND pd.keyval=ir.keyval
GROUP BY ir.keyval, ir.startdate, ir.enddate
Here's our completed aggregate:
keyval startdate         enddate         discount
abc         2012-01-01   2012-01-16 1
abc         2012-01-17 2012-01-31 9
abc         2012-02-01 2012-02-06 8
abc         2012-02-07   2012-02-15 12
abc         2012-02-16 2012-03-07 4
def         2012-01-15 2012-02-11 2
def         2012-02-12 2012-04-16 34
def         2012-04-17 2012-04-20 50
def         2012-04-21   2012-04-28 18
def         2012-04-29 2012-05-13 2
I  purposely chose discount amounts that were powers of 2, so it's easy to check which discounts are in effect on any day against the aggregate.  A $50 discount on def from 4/17 through 7/20, for instance has to be $32 + $16 + $2.   In reality, it's possible that two (or more) contiguous date ranges to result in the same aggregate-- for example, a $5 discount could expire, and two new discounts for $3 and $2 could take effect on the same day.  In our case, the aggregation was a little more involved, in such a way that the unique set of individual discounts still made each range unique, even if the total discount ended up the same.

The total script ends up looking like this:

--an interesting day is one where a promotion starts or the day after a promotion ends
;WITH interestingdates AS (
SELECT DISTINCT keyval,startdate FROM dbo.productdiscounts
UNION SELECT DISTINCT keyval,DATEADD(d,1,enddate) FROM dbo.productdiscounts
--an interesting range is just each interesting date and the next interesting date, by key
,interestingranges as
(SELECT a.keyval,a.startdate,
      (SELECT DATEADD(d,-1,MIN(startdate)) FROM interestingdates b WHERE b.keyval=a.keyval AND b.startdate>a.startdate) AS enddate
FROM interestingdates a
SELECT ir.keyval, ir.startdate, ir.enddate, SUM(pd.discount) AS discount
FROM interestingranges ir
INNER JOIN dbo.productdiscounts pd ON pd.startdate<= ir.startdate AND pd.enddate>=ir.enddate AND pd.keyval=ir.keyval
GROUP BY ir.keyval, ir.startdate, ir.enddate