Monday, 10 June 2013

SQL: Recursive Common Table Expressions (CTE)

This post follows up the post on Common Table Expressions to introduce Recursive CTE's. You shouldn't overly worry about this. The word recursion sends shivers down most developers! And to be honest the use of recursive CTE's will only be in very particular occasions.

A recursive CTE will reference itself. As we should know in recursion we should have a "exit" clause. This is achieved in a recursive CTE with an anchor member. The anchor member doesn't reference the CTE - whereas the recursive member of the CTE will do.

Our CTE is thus made up of two elements - the anchor member and the recursive member. These two parts are combined using a SET operator - normally UNION or UNION ALL.
WITH cte_name ( column_names ... )
AS (
   CTE_query_for_anchor_member
 UNION [ALL]
   CTE_query_for_recursive_member
)
SELECT * FROM cte_name -- Or other statement to use CTE
I will use the Northwind database Employees table.
SELECT 
   EmployeeID, LastName, Title, ReportsTo 
FROM Employees
Giving us the output
EmployeeID  LastName             Title                          ReportsTo
----------- -------------------- ------------------------------ -----------
1           Davolio              Sales Representative           2
2           Fuller               Vice President, Sales          NULL
3           Leverling            Sales Representative           2
4           Peacock              Sales Representative           2
5           Buchanan             Sales Manager                  2
6           Suyama               Sales Representative           5
7           King                 Sales Representative           5
8           Callahan             Inside Sales Coordinator       2
9           Dodsworth            Sales Representative           5
That's the raw data - let's try and put this into a "hierarchy" of the employees and there managers. Each employee's manager is defined by the ReportsTo column, so we can join on this.
SELECT 
   E.EmployeeID, E.LastName, E.Title, E.ReportsTo, M.LastName--, M.Title
FROM Employees M
JOIN Employees E ON (E.ReportsTo = M.EmployeeID)
Now showing the manager's name (through the JOIN)
EmployeeID  LastName             Title                          ReportsTo   LastName
----------- -------------------- ------------------------------ ----------- --------------------
1           Davolio              Sales Representative           2           Fuller
3           Leverling            Sales Representative           2           Fuller
4           Peacock              Sales Representative           2           Fuller
5           Buchanan             Sales Manager                  2           Fuller
6           Suyama               Sales Representative           5           Buchanan
7           King                 Sales Representative           5           Buchanan
8           Callahan             Inside Sales Coordinator       2           Fuller
9           Dodsworth            Sales Representative           5           Buchanan
We need to tweak this a little as the top manager (Fuller - Vice President of Sales) is missing, we need an Outer Join.
SELECT 
   E.EmployeeID, E.LastName, E.Title, E.ReportsTo, M.LastName
FROM Employees M
RIGHT OUTER JOIN Employees E ON (E.ReportsTo = M.EmployeeID)
Now showing Fuller - with Nulls for who he reports to.
EmployeeID  LastName             Title                          ReportsTo   LastName
----------- -------------------- ------------------------------ ----------- --------------------
1           Davolio              Sales Representative           2           Fuller
2           Fuller               Vice President, Sales          NULL        NULL
3           Leverling            Sales Representative           2           Fuller
4           Peacock              Sales Representative           2           Fuller
5           Buchanan             Sales Manager                  2           Fuller
6           Suyama               Sales Representative           5           Buchanan
7           King                 Sales Representative           5           Buchanan
8           Callahan             Inside Sales Coordinator       2           Fuller
9           Dodsworth            Sales Representative           5           Buchanan
There we go. Let's do this with a recursive CTE. The anchor condition will be when the ReportsTo column is NULL - i.e. when we have an employee who doesn't report to anyone else.
SELECT ReportsTo, EmployeeID, LastName, FirstName, Title
FROM Employees 
WHERE ReportsTo IS NULL
Doesn't seem like much - but let's think of this as a separate query. Imagine our CTE is called CTEEmployees and this SQL refers to it
SELECT e.ReportsTo, e.EmployeeID, e.LastName, e.FirstName, e.Title
FROM Employees AS e
    INNER JOIN CTEEmployees AS d
    ON e.ReportsTo = d.EmployeeID 
Putting this together into a CTE and a query to use the CTE we would have
WITH CTEEmployees(ReportsTo, EmployeeID, LastName, FirstName, Title) AS 
(
    SELECT ReportsTo, EmployeeID, LastName, FirstName, Title
    FROM Employees 
    WHERE ReportsTo IS NULL
    UNION ALL
    SELECT e.ReportsTo, e.EmployeeID, e.LastName, e.FirstName, e.Title
    FROM Employees AS e
        INNER JOIN CTEEmployees AS d
        ON e.ReportsTo = d.EmployeeID 
)
SELECT EmployeeID, LastName, FirstName, Title, ReportsTo
FROM CTEEmployees
ORDER BY EmployeeID
Giving the results
EmployeeID  LastName             FirstName  Title                          ReportsTo
----------- -------------------- ---------- ------------------------------ -----------
1           Davolio              Nancy      Sales Representative           2
2           Fuller               Andrew     Vice President, Sales          NULL
3           Leverling            Janet      Sales Representative           2
4           Peacock              Margaret   Sales Representative           2
5           Buchanan             Steven     Sales Manager                  2
6           Suyama               Michael    Sales Representative           5
7           King                 Robert     Sales Representative           5
8           Callahan             Laura      Inside Sales Coordinator       2
9           Dodsworth            Anne       Sales Representative           5
I'm not sure it was worth the hassle to get there when an outer join did this for me. But let's add another column to our output
WITH CTEEmployees(ReportsTo, EmployeeID, LastName, FirstName, Title, EmployeeLevel) AS 
(
    SELECT ReportsTo, EmployeeID, LastName, FirstName, Title, 0
    FROM Employees 
    WHERE ReportsTo IS NULL
    UNION ALL
    SELECT e.ReportsTo, e.EmployeeID, e.LastName, e.FirstName, e.Title, d.EmployeeLevel + 1
    FROM Employees AS e
        INNER JOIN CTEEmployees AS d
        ON e.ReportsTo = d.EmployeeID 
)
SELECT EmployeeID, LastName, FirstName, Title, ReportsTo, EmployeeLevel
FROM CTEEmployees
ORDER BY EmployeeID
Giving us the same output with a column showing the level
EmployeeID  LastName             FirstName  Title                          ReportsTo   EmployeeLevel
----------- -------------------- ---------- ------------------------------ ----------- -------------
1           Davolio              Nancy      Sales Representative           2           1
2           Fuller               Andrew     Vice President, Sales          NULL        0
3           Leverling            Janet      Sales Representative           2           1
4           Peacock              Margaret   Sales Representative           2           1
5           Buchanan             Steven     Sales Manager                  2           1
6           Suyama               Michael    Sales Representative           5           2
7           King                 Robert     Sales Representative           5           2
8           Callahan             Laura      Inside Sales Coordinator       2           1
9           Dodsworth            Anne       Sales Representative           5           2
This extra column would be useful when displaying the output.

The actual use of recursive of CTE is going to be limited to data that is stored in a hierarchical structure - i.e. with self joins.

Thursday, 6 June 2013

SQL: Common Table Expressions (CTE)

This posting is the first around Common Table Expressions (CTE) that were introduced in SQL Server 2005. They can be thought of as temporary views or results that can be used within a query.

My first introduction to this to look at sub-queries and inline views and show how a Common Table Expression can be used instead.

We are going to use the Northwind database for this example. The problem that I want to solve is to calculate is the average number of orders that each customer will make (using only customers who have made orders before - just for simplicity).
SELECT COUNT(*) AS NumberOfOrders FROM Orders
SELECT COUNT(DISTINCT CustomerID) AS NumberOfCustomersWhoHaveMadeOrders 
    FROM Orders
with the results
NumberOfOrders
--------------
830

NumberOfCustomersWhoHaveMadeOrders
----------------------------------
89
The answer is 830/89 - which is 9.3258. The following SQL will do this for you
SELECT 
 COUNT(DISTINCT O.OrderID) / CAST(COUNT(DISTINCT O.CustomerID) AS float)
      AS AverageNumberOfOrdersPerCustomer
 FROM Orders O
with the results
AverageNumberOfOrdersPerCustomer
---------------------------------------
9.32584269662921
One small downside of this - if there are no orders you will get a divide by zero error. In this simple example you won't see this, but taking the problem further we might want to retrieve the average number of orders by month - and we might not have any orders in one month. But the point of this is how to do this with a Common Table Expression.
WITH CTEOrders
AS
(SELECT CustomerID, COUNT(*) AS NumberOfOrdersPerCustomer FROM Orders
GROUP BY CustomerID)
SELECT AVG(CAST(NumberOfOrdersPerCustomer AS float)) 
     AS AverageNumberOfOrdersPerCustomer
FROM CTEOrders
with the results
AverageNumberOfOrdersPerCustomer
--------------------------------
9.32584269662921
Let's look at the syntax of the CTE. We start with the WITH clause followed by the name of the Common Table Expression. Following this we can specify (in brackets) the names of the columns. This is optional if the columns are named within the SELECT statement within the CTE. The SELECT statement is enclosed in brackets.
WITH CTEName (column1, column2, ...)
AS (SELECT statement)
Immediately following the Common Table Expression definition you must use it. After the next statement the result of the CTE is gone.
Which is the best approach? I have always had problems reading Execution Plans, but we can have a go. The execution plan for the first query is this

Execution plan for standard query - click to enlarge
The query is performing a scan of the CustomersOrders Index (which isn't clustered) of 890 rows representing 89% of the overall query. The execution plan for the Common Table Expression is very much identical. It has the same scan of the same index - but this represents more of the query (implying that the rest of the query is very slightly less costly). But overall they are the same in cost.

Execution plan for Common Table Expression - click to enlarge
We can also look at the Client Statistics for each and the CTE comes out slightly on top.

Statistics for standard query
Statistics for Common Table Expression
So very similar in timing and the same results. But lets look at the name - it is called a Common Table Expression. One of the uses is to reuse the results from the CTE. So imagine that as well as the average I want the maximum number of orders per customer and minimum number of orders per customer. To do this with a CTE the query is
WITH CTEOrders
AS
(SELECT CustomerID, COUNT(*) AS NumberOfOrdersPerCustomer FROM Orders
GROUP BY CustomerID)
SELECT MAX(NumberOfOrdersPerCustomer) AS MaxOrders,
       MIN(NumberOfOrdersPerCustomer) AS MinOrders,
       AVG(CAST(NumberOfOrdersPerCustomer AS float))  
     AS AverageNumberOfOrdersPerCustomer
FROM CTEOrders
Giving the results
MaxOrders   MinOrders   AverageNumberOfOrdersPerCustomer
----------- ----------- --------------------------------
31          1           9.32584269662921
Without much additional cost - the extra aggregates will be working with the data returned by the CTE. Doing this with the standard query will require quite a bit more work.