Thursday, August 19, 2010

Sorting Text Fields Numerically

Just a little tidbit I came across this morning, involving sorting numeric data that's stored in a text field.

In this instance, I'm using a cursor to loop through my tables and run sp_spaceused on them, and then put the results into a temporary table:
insert into #TableSizes exec sp_spaceused @objname = @name;
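For context, that insert sits inside a loop that looks roughly like this (just a sketch-- the cursor and temp table definitions here are illustrative, with the temp table simply mirroring sp_spaceused's six output columns):

create table #TableSizes (
    [Table Name]     sysname,
    [Number of Rows] varchar(20),
    [Reserved Space] varchar(20),
    [Data Space]     varchar(20),
    [Index Size]     varchar(20),
    [Unused Space]   varchar(20)
);

declare @name sysname;
declare table_cursor cursor for select name from sys.tables;

open table_cursor;
fetch next from table_cursor into @name;
while @@fetch_status = 0
begin
    -- sp_spaceused returns one row of size information per object
    insert into #TableSizes exec sp_spaceused @objname = @name;
    fetch next from table_cursor into @name;
end
close table_cursor;
deallocate table_cursor;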

The problem is, when I select from the temp table:

select [Table Name]
, [Number of Rows]
, [Reserved Space]
, [Data Space]
, [Index Size]
, [Unused Space]
from [#TableSizes]
order by [Reserved Space] desc

the Reserved Space column isn't sorted correctly. I end up with something like this:

Table Name   Number of Rows   Reserved Space
TableD       60               96 KB
TableB       587              952 KB
TableA       11115            18728 KB
TableC       66529087         17752 KB

Because Reserved Space is a varchar column, it sorts as text, character by character from the left. So values starting with 9 sort to the top, even small numbers like 96.

The correct solution is probably to trim off the " KB" from the end, cast to an integer, and sort by that:

order by cast(left([Reserved Space],len([Reserved Space])-3) as int) desc

which then also allows you to use SUM, etc. on the column.
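For example, totaling reserved space across all the tables would look something like this (same temp table as above):

select sum(cast(left([Reserved Space], len([Reserved Space]) - 3) as bigint)) as [Total Reserved KB]
from [#TableSizes];
-- bigint instead of int here, just in case the totals get large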

I was in a hurry and didn't want to mess around with all the parentheses, so I came up with a quick and dirty solution:

order by len([Reserved Space]) desc,[Reserved Space] desc

This sorts the longest strings (hence the biggest numbers) to the top of the list, and then, within each length, sorts the strings in descending character order, 9 down to 0-- which for equal-length numeric strings is the same as numeric order.

Not something you'll need to do every day, but I thought it was a neat solution.

Thursday, August 12, 2010

It's the Color of the Sky, See?


"I want to know where my data is, physically!" This was the repeated demand of the gentleman sitting next to me at last night's SQL SIG talk on SQL Azure-- Microsoft's new cloud-based RDBMS. (Of equal concern was, once his data was inevitably lost or compromised, "Who do I sue???") The pitch was so perfect, he could have been a plant-- those of us who were skeptical about data in the cloud, having just heard our prejudice given an angry voice, were suddenly defending it.

Still, it's easy to understand his frustration. In the cloud-heavy world of Netflix, Rhapsody, and Amazon Kindles, an increasing share of the media and information in our private lives isn't on our bookshelves-- it's out there somewhere. I used to buy CDs and DVDs obsessively, but now spend $9 a month to access a vastly larger collection online. Like the old chestnut about beer, you don't buy media anymore, you rent it.

Microsoft wants to turn your SQL Server into such a commodity, and they have a pretty good start on it. SQL Azure is part of the Windows Azure platform, which lets you host web applications in the cloud. What's appealing is that the interfaces for both managing and programming SQL Azure are the ones you're used to: you can connect SQL Management Studio and BIDS (at least the latest versions) to your Azure server, and in most cases they behave exactly as they would with your real SQL Server. Connecting to an Azure database with ADO.NET is transparent --but for the connection string-- and you use the same Connection and Command objects.

In short, once you've set up your Azure server and database, you can almost forget that the data's in the cloud and not on your local server. Almost. Context switching ("USE Database2") and three-part naming ("select * from database2.dbo.mytable") don't work-- you have to stay in your own database. Connections have some new behaviors and error codes that reflect connectivity issues and other throttles-- Azure kills transactions that run longer than 5 minutes, for example. But for smaller, uncomplicated apps, it's pretty smooth.

So where is your data, physically? When you sign up for an Azure account, you pick a geographic location (currently pretty large regions) and Microsoft holds your database on a farm of servers across multiple locations, with replication handled automatically. Essentially your server becomes a black box to you-- your traffic will be routed through load balancers to the right place, and server resources can be added instantly as activity grows and shrinks.

How about performance? Here's where your application architecture may need to change. If you have a stateless web app that's already designed to live in a web farm nicely, you're probably set-- host your web application on Windows Azure servers, and set some affinity settings to make sure your app and database servers in the cloud have good connectivity. This is called Near Computing, because your business logic is close to the database.

If you have more of a two-tier app, you'll need to adjust to the increased latency of accessing data in the cloud, and to some of the new behaviors mentioned above. Depending on how chatty your app is with the database server, this isn't trivial. If you're already optimized for poor network latency (say, if your database server is across a WAN from the application), it may not be that much different.

There are also some hybrid methods to sync your local SQL database with an Azure database asynchronously, so you can migrate parts of your app to the cloud and leave others local where that makes sense. This sounds like a bit of a nightmare, but it could be a good choice for broad applications that would benefit from moving certain functions into the cloud.

Certainly, if your application is entirely local, connected, and speed-sensitive (say, an inventory and routing app where the users are inside the warehouse and connected directly to the data at all times), moving to the cloud might not be a fit. But for bringing small apps online quickly, and particularly when your users are already geographically scattered, it's a tempting architecture.

Licensing plans are still fairly immature, and more complicated features like Service Broker haven't been implemented yet, but Microsoft is pretty excited about this, and they're certainly not the only ones with their head in the Cloud.

If I were building a new app from scratch today, I'd give Azure a good solid look-- and I might also have my lawyer figure out who to sue.

Wednesday, August 4, 2010

Summer Reading

Es Cue El is going on vacation to beautiful Toronto, Ontario for the weekend, so no blog posts unless Toronto is really boring.

Your weekend reading assignment: Aaron Bertrand's classic "Bad Habits To Kick" series. There are lots of good gems in there, and most of them are things (like always terminating statements with semicolons) that you can easily add to your finger memory with a couple days' effort.

See you Monday!

Oracle XE

If you're looking to get your feet wet in the world of Oracle, then Oracle XE is for you.

Somewhat comparable to SQL Server Express Edition, Oracle 10g XE is the free version of Oracle-- free to develop, free to deploy, free to distribute. It's limited to one database per machine, 4 GB per database, 1 GB of memory, and one processor (you can install it on multi-processor machines but it will only use one).

Combined with Oracle Application Express (APEX) -- a free toolkit for building applications, including web-based user interfaces, reporting, security, and dashboards -- you can actually build little web-based database applications pretty quickly without much learning.

I think the hope is that smaller organizations will consider using it for internal applications and then grow into the more expensive versions. I'm not sure how likely that is, but I am working with a friend to build a small inventory/billing app for her small business, and I was going to use MS Access just because it's fast and familiar. After installing Oracle XE, I'm considering using it, although admittedly I'd be doing so just as an excuse to learn a new skillset.

Oracle has a reputation for being somewhat of a monster to install, and if you've ever read anything about the Universal Installer, you may think you need a PhD just to get it installed. Oracle XE and APEX are quick installs that ask you a handful of questions (install path, etc.), and for the most part you can sail right through with the defaults. I had to change the default port and still don't quite know why, but even that was easy-- the installer told me that 8080 was in use, so I tried 8081, then 8082, until I found one that was open. That was the biggest glitch I ran into.

The other great thing about XE and APEX is the learning material available on both web sites. Oracle has created a "2 Day Developer Guide" that walks you through the tools, basic PL/SQL, and everything you need to know to get a quick and dirty app up. There are also some good tutorials.

I don't expect this will make anyone an expert at Oracle, but for SQL Server types like me who want to teach themselves PL/SQL and maybe create a little sample app, it's an easy way to get some exposure to the world of Oracle without a lot of hassle.

Tuesday, August 3, 2010

Can a LEFT JOIN restrict your resultset?

A colleague came to me the other day with this question: "Isn't it true that a left join shouldn't restrict the number of records returned?"

This was one of those crisis-of-understanding moments: we're all taught that if you query a table and get 100 rows back, adding on a left join may add rows to your resultset, but should never remove any.

His query was something like this:
select od.*, p.ISBN
from OrderDetails od
left join Products p on od.ProductUid=p.UID
where od.OrderDate='3/11/10' and p.Type in ('B','T')
He told me that when he ran this query:
select od.* from OrderDetails od where od.OrderDate='3/11/10'
he got 100 records back. But when we added on the left join, suddenly he only got 87 records. Since when, he wanted to know, could a left join do that? Not all the OrderDetail records had a ProductUid (some were for taxes, miscellaneous charges, etc), which was why he used the left join to begin with. And sure enough, those records were the ones that were missing from the results. Shouldn't the left join have returned those records, with no p.ISBN value?
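If you want to reproduce the puzzle before reading on, here's a stripped-down sketch (the tables and rows are invented to mimic the scenario, not his actual schema):

declare @Products table (UID int primary key, ISBN varchar(20), Type char(1) not null);
declare @OrderDetails table (OrderDate datetime, ProductUid int null, Amount money);

insert into @Products values (1, '978-0000000001', 'B');
insert into @Products values (2, null, 'S');

insert into @OrderDetails values ('3/11/10', 1, 25.00);    -- a book
insert into @OrderDetails values ('3/11/10', 2, 5.00);     -- a non-book product
insert into @OrderDetails values ('3/11/10', null, 2.10);  -- a tax line with no product

-- only the book comes back; the tax line disappears, even though this is a left join
select od.*, p.ISBN
from @OrderDetails od
left join @Products p on od.ProductUid=p.UID
where od.OrderDate='3/11/10' and p.Type in ('B','T');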

The problem here is actually our old friend, Mr. NULL. The left join did return the OrderDetail records without a ProductUid. But the where clause filtered them out-- specifically, the phrase "p.Type in ('B','T')". When I told my friend that this was the problem, he protested. "But Product.Type is a required field!" It was defined as NOT NULL in the table.

So, how could p.Type be the problem, and when would it ever be null?

Remember, when a match isn't found in a left join, the fields in the right table (in this case, "p") are all null. So, after evaluating the left join, there still were 100 records. However, now when we come to evaluating "p.Type in ('B','T')", we run into the situation where p.Type is null (because no record was found in p).

Since "null = anything" always returns false (technically, "unknown"), and because "in" evaluates to a series of "or"s in the query engine:

p.Type = 'B' or p.Type = 'T'

we end up filtering out the very records that the left join was supposed to leave in.
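You can see this in isolation with a quick test, completely separate from the real tables:

declare @Type char(1);
set @Type = null;  -- this is what p.Type looks like for an unmatched row
select case when @Type in ('B','T') then 'kept' else 'filtered out' end as result;
-- returns 'filtered out', because null in ('B','T') evaluates to unknown, not true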

There are three quick solutions to this:
select od.*, p.ISBN
from OrderDetails od
left join Products p on od.ProductUid=p.UID and p.Type in ('B','T')
where od.OrderDate='3/11/10'
or
select od.*, p.ISBN
from OrderDetails od
left join Products p on od.ProductUid=p.UID
where od.OrderDate='3/11/10' and (p.Type in ('B','T') or p.Type is null)
or
select od.*, p.ISBN
from OrderDetails od
left join Products p on od.ProductUid=p.UID
where od.OrderDate='3/11/10' and isnull(p.Type,'') in ('B','T')
The first solution is probably preferable-- move the p.Type condition into the join expression. What we're really trying to join to is "products with a type of B or T"-- if none are found, then p.ISBN should be null in our resultset.

It's equivalent to:
select od.*, p.ISBN
from OrderDetails od
left join (select * from Products where Type in ('B','T')) p on od.ProductUid=p.UID
where od.OrderDate='3/11/10'
Testing for "p.Type is null" in the where clause should accomplish the same thing, but consider the condition where Product.Type is a nullable field, and there are records where p.Type is null and p.ISBN is not. By testing for "p.Type is null" in the where clause, we'd be including records with no match from the left join, but we'd also be including ISBNs for products with a null Type, which isn't what we want.

This is particularly troublesome when the data you're working with almost always has a match in the right table. In this case, 99% of the OrderDetails records had a ProductUid, so the developer never had to worry about p.Type. When he found a rare data condition, it seemed like the rules of SQL had changed on him.

The bottom line is this: be careful when putting where-clause conditions on columns from the right table of a left join-- even when those columns are defined as NOT NULL in the database. For that matter, get in the habit of always asking yourself "what if this column were null?" when testing for conditions in a SQL statement.

Monday, August 2, 2010

CTEs and readability

I've been working lately on keeping my queries more readable. Here's a pattern I use a lot when I'm not thinking about readability.

Say I want to know the average number of unique items contained in a set of orders. This is a classic aggregate-of-an-aggregate problem. First you need to find how many unique items are in each order:

select count(distinct prodcode) from orderdetail group by orderid

Then you can take that set and get an average:

select avg(prodcount) from (select count(distinct prodcode) prodcount from orderdetail group by orderid) as x

I'm always tempted to do this because it just involves wrapping my first query in parentheses, adding a column alias, and giving the whole thing a table alias. But here are two problems with it:
  1. I used "x" as a table alias. The syntax requires me to name the derived table, but I don't really care what it is-- until I add another layer to the query. I've seen queries that are multiple pages long and full of "as x", "as x1", "as x2", etc.
  2. It's already starting to get a little difficult to tell which from clause applies to which tables, and which columns came from where. Scope is also difficult to infer-- if I use this query as part of something else and want to add a date range, where do I add it?
Consider this syntax instead:

with items_per_order as
(select count(distinct prodcode) as prodcount from orderdetail group by orderid)

select avg(prodcount) from items_per_order ipo

The query benefits from giving that subquery a meaningful name, but we've also simplified the intent: the last line is a very simple query getting the average count from a set. Items_per_order encapsulates the logic of getting the item count, and yet it's right there in the query if I want to delve into it. Scope is also much clearer: orderdetail and its columns can be referenced from within the items_per_order CTE, but not outside of it. It's clear where items_per_order starts and stops.
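If you want to kick the tires, here's a throwaway setup (the orderdetail rows are made up); the CTE version and the nested-subquery version return the same answer:

declare @orderdetail table (orderid int, prodcode varchar(10));
insert into @orderdetail values (1, 'A');
insert into @orderdetail values (1, 'B');
insert into @orderdetail values (1, 'B');  -- order 1: 2 unique items
insert into @orderdetail values (2, 'C');  -- order 2: 1 unique item
insert into @orderdetail values (3, 'A');
insert into @orderdetail values (3, 'C');
insert into @orderdetail values (3, 'D');  -- order 3: 3 unique items

with items_per_order as
(select count(distinct prodcode) as prodcount from @orderdetail group by orderid)

select avg(prodcount) from items_per_order ipo;  -- (2 + 1 + 3) / 3 = 2 with integer math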

If I continue down the road of building more logic into the query, I can add more CTEs:
with orders_denormalized as
(select o.orderid, s.name as salespersonname, c.name as customername, sc.code from orderheader o left join salesperson s on o.sid=s.id
left join customer c on o.custid=c.id
inner join statuscodes sc on o.statusid=sc.id),

items_per_order as
(select orderid, count(distinct prodcode) as prodcount from orderdetail group by orderid),

orders_with_itemcount as
(select o.orderid,o.salespersonname,o.customername,o.code,i.prodcount
from orders_denormalized o inner join items_per_order i on o.orderid=i.orderid)

select avg(prodcount) from orders_with_itemcount owi where customername='steve smith'

I have some distinct encapsulations here that help keep track of where things came from. The definition of each CTE is obvious, and the end query is simple and easy to read.

Performance should definitely be considered here; I'd want to compare the query plans to see how well this all holds up. But as a way to keep ad hoc encapsulations in order (as often happens in SSRS, where you keep hanging more logic on a query to build up complex segments), this has made maintenance a whole lot easier for me.
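One easy way to do that comparison (standard Management Studio stuff, nothing specific to CTEs) is to turn on the actual execution plan, or flip on the statistics switches and run both versions of the query back to back:

set statistics io on;
set statistics time on;

-- run the nested-subquery version, then the CTE version,
-- and compare logical reads and CPU time in the Messages tab

set statistics io off;
set statistics time off;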

Sunday, August 1, 2010

A quick gripe


My Eee netbook has a tiny little keyboard, and keys like page-up, page-down, and home are reached by pressing two keys: function-left arrow is the combination for "home".

On my ThinkPad, the ctrl and function keys are and always have been oddly reversed from other keyboards.

Consequently, the keystrokes for alt-home (which I use all the time while browsing) on the Eee PC are function-alt-left arrow. On the ThinkPad, doing the same thing with my fingers ends up with them pressing ctrl-alt-left arrow. I'll let you figure out what that does on your own computer. Let's just say it's annoying, and I wish my fingers had two sets of presets like some fancy car seats, so I could program the same muscles to do different things on different keyboards.

Sunday Reading: SQL Server MVP Deep Dives


Today's recommendation is "SQL Server MVP Deep Dives," a collection of excellent articles by SQL Server MVPs on a variety of topics.

I just picked this up this weekend, but so far I've learned about ten things I can use right away. Here's what makes the book such a gem: the target audience for each chapter seems to be experienced DBAs who haven't worked in that particular area before, but who need a good quick start with examples they can actually use. Each chapter is about 10 pages long but has enough detail to give you a solid understanding of the concepts, and usually some good examples you can use immediately.

There are some more conceptual chapters like "What does it mean to be a DBA?" and "Louis and Paul's 10 key relational database design ideas" that are good summaries of some things you probably already know (or should know) but make great touchstones. Others like "BI for the relational guy" solve that classic "I wish someone had just told me X before I started" scenario.

As an added bonus, the authors are donating all their royalties to War Child, an organization that helps war-affected children worldwide.

The ebook version is only $35 and is worth every penny.