Monday, August 2, 2010

CTEs and readability

I've been working lately on keeping my queries more readable. Here's a pattern I use a lot when I'm not thinking about readability.

Say I want to know the average number of unique items contained in a set of orders. This is a classic aggregate-of-an-aggregate problem. First you need to find how many unique items are in each order:

select count(distinct prodcode) from orderdetail group by orderid

Then you can take that set and get an average:

select avg(prodcount) from (select count(distinct prodcode) prodcount from orderdetail group by orderid) as x

I'm always tempted to do this because it just involves wrapping my first query in parentheses, adding a column alias, and giving the whole thing a table alias. Here's two problems with this:
  1. I used "x" as a table alias. Syntax requires me to give the CTE a name, but I don't really care what it is. Until I add another layer to the query. I've seen queries that are multiple pages long and full of "as x", "as x1", "as x2" etc.
  2. It's already starting to get a little difficult to tell which from clause applies to which tables, and which columns came from where. Scope is also difficult to infer-- if I use this query as part of something else and want to add a date range, where do I add it?
Consider this syntax instead:

with items_per_order as
(select count(distinct prodcode) as prodcount from orderdetail group by orderid)

select avg(prodcount) from items_per_order ipo

The query benefits from a better alias for the CTE, but also we've simplified the intent: the last line is a very simple query getting the average count from a set. Items_per_order encapsulates the logic of getting the item count, and yet it's right there in the query if I want to delve into it. Scope is also much clearer: orderdetail and its columns can be referenced from within the items_per_order CTE, but not outside of it. It's clear where items_per_order starts and stops.

If I continue down the road of building more logic into the query, I can add more CTEs:
with orders_denormalized as
(select o.orderid, s.name as salespersonname, c.name as customername, sc.code from orderheader o left join salesperson s on o.sid=s.id
left join customer c on o.custid=c.id
inner join statuscodes sc on o.statusid=sc.id),

items_per_order as
(select orderid, count(distinct prodcode) as prodcount from orderdetail group by orderid),

orders_with_itemcount as
(select o.orderid,o.salesersonname,o.customername,o.code,i.prodcount
from orders_denormalized o inner join items_per_order i on o.orderid=i.orderid )

select avg(prodcount) from orders_with_itemcount owi where customername='steve smith'

I have some distinct encapsulations here that help keep track of where things came from. The definition of each CTE is obvious, and the end query is simple and easy to read.

Performance should definitely be considered here; I'd want to see the query plans to see how well this all performs. But as a way to keep ad hoc encapsulations in order (as often happens in SSRS as you continue to hang logic on your query to build complex segment logic), this has made maintenance a whole lot easier for me.

No comments:

Post a Comment