Pages

Monday, May 25, 2009

Tree Queries in SQL Server with Common Table Expressions

My Friend and Fellow MVP in SQL Server, Jacob Sebastian is conducting a very good contest on TSQL. Every week he and his team are posting a very good challenge to write a SQL query to solve a problem.

You can also find his interview @ 60 seconds with Jacob.

This week I got an email from him about TSQL challenge. If you want to find more details about TSQL challenge then visit http://beyondrelational.com/

The challenge is “about identifying all the employees directly or indirectly reporting to a given manager.” This looks so simple by the statement but not at all. You can write a query with Recurssive Common Table Expressions (CTE) to resolve this, but you should not pass or hard code any values inside the CTE. This is what makes it a bit complex. That is you can’t control the output that is coming from CTEs.

Copied table structure and sample data from TSQL Challenge 8

CREATE TABLE Employees (EmpID INT, EmpName VARCHAR(20), ReportsTo INT)
INSERT INTO Employees(EmpID, EmpName, ReportsTo)
SELECT 1, 'Jacob', NULL UNION ALL
SELECT 2, 'Rui', NULL UNION ALL
SELECT 3, 'Jacobson', NULL UNION ALL
SELECT 4, 'Jess', 1 UNION ALL
SELECT 5, 'Steve', 1 UNION ALL
SELECT 6, 'Bob', 1 UNION ALL
SELECT 7, 'Smith', 2 UNION ALL
SELECT 8, 'Bobbey', 2 UNION ALL
SELECT 9, 'Steffi', 3 UNION ALL
SELECT 10, 'Bracha', 3 UNION ALL
SELECT 11, 'John', 5 UNION ALL
SELECT 12, 'Michael', 6 UNION ALL
SELECT 13, 'Paul', 6 UNION ALL
SELECT 14, 'Lana', 7 UNION ALL
SELECT 15, 'Johnson', 7 UNION ALL
SELECT 16, 'Mic', 8 UNION ALL
SELECT 17, 'Stev', 8 UNION ALL
SELECT 18, 'Paulson', 9 UNION ALL
SELECT 19, 'Jessica', 10

Lets see how we can solve this with CTEs.
DECLARE @manager VARCHAR(20)
SELECT @manager = 'Jacob';
WITH EmployeeCTE(empid, empname, ReportsTo, depth, sortcol)
AS(
SELECT empid, empname, ReportsTo, 0, CAST(empid AS VARBINARY(500))
FROM employees
WHERE Empname = @manager
UNION ALL
SELECT E.empid, E.empname, E.ReportsTo, M.depth+1, CAST(sortcol + CAST(E.empid AS BINARY(4)) AS VARBINARY(500))
FROM Employees AS E JOIN EmployeeCTE AS M
ON E.ReportsTo = M.empid
)
SELECT REPLICATE('----', depth) + empname AS empname
FROM EmployeeCTE CTE
ORDER BY sortcol


Output of the above query is:



The logic which I built is based on recursive CTE. The CTE would give me all the employees who are reporting to Jacob directly or indirectly. If you look at the definition of CTE, I am using @manager variable to restrict the result set of CTE.

The real challenge is not to hard code or use any variables inside the CTE which restricts the output.

After thinking a lot, the one idea which came to my mind is to find out the ultimate parent ID for each and every node. To find this I used Row_Number which is SQL Server Analytical function.

DECLARE @manager VARCHAR(20);
SELECT @manager = 'Jacob';
WITH EmployeesCTE
AS (
SELECT EmpName,EmpID, ReportsTo, 1 AS LEVEL,
CAST(row_number() OVER(ORDER BY @@LangID) AS varchar(max)) AS path
FROM dbo.Employees
UNION ALL
SELECT e.EmpName,e.EmpID, e.ReportsTo, EmployeesCTE.LEVEL + 1 AS LEVEL,
path + ',' + CAST(row_number() OVER(ORDER BY @@LangID) AS varchar(max)) AS path FROM dbo.Employees e, EmployeesCTE
WHERE ((e.ReportsTo = EmployeesCTE.EmpID))
)
SELECT REPLICATE(' ', c1.Level) + c1.EmpName as EmpName,
c1.ReportsTo, c1.LEVEL, c1.path
FROM EmployeesCTE c1 ,
(Select c2.EmpName,c2.Path from
EmployeesCTE c2
WHERE LTrim(c2.EmpName) = @manager And c2.Path Not Like '%,%' ) c2
where substring(c1.path+',',1,charindex(',',c1.path+',')-1) = c2.path
ORDER BY c1.path

The way to find out this is constructing the complete path from Node to its ultimate parent node. The code which does this part is
path + ',' + CAST(row_number() OVER(ORDER BY @@LangID) AS varchar(max)) AS path

@@LangID is dummy order by parameter; I had to use some variable to order the results.

Look at the below screen shot for the sample output.



So in this path I am taking the ultimate parent ID by applying CHARINDEX and SUBSTRING functions.


NOTE: The numbers which you are seeing in the Path column are not EmpIDs. These are based on SQL Server Analytical functions.


So now I have the node and its ultimate parent node. Next step is to limit the results based on the input parameter.
To limit the results I am using below query, where I am passing @manager and also placed a condition on the output of CTE where Path shouldn’t contain any special characters. What this means is if this column i.e. path contains any special characters then it has children. The below query would give me the ID of the @manager value.

(Select c2.EmpName,c2.Path from EmployeesCTE c2 WHERE LTrim(c2.EmpName) = @manager And c2.Path Not Like '%,%' ) c2
Using this c2.path column, I am limiting the result set of the outer query.

This solution works in SQL Server 2005. In SQL Server 2008 there is new feature called HierarchyID with this you can easily solve this problem. I haven’t yet explored this feature but very soon I am going to explore and will put something on my blog.

2 comments:

Amit Agrawal said...

I did not understood the second query ......

you could have very well just say
is null instead of putting the variable inside the CTE ...


DECLARE @manager VARCHAR(20)
SELECT @manager = 'Jacob';
WITH EmployeeCTE(empid, empname, ReportsTo, depth, sortcol)
AS(
SELECT empid, empname, ReportsTo, 0, CAST(empid AS VARBINARY(500))
FROM employees
WHERE Empname is null
UNION ALL
SELECT E.empid, E.empname, E.ReportsTo, M.depth+1, CAST(sortcol + CAST(E.empid AS BINARY(4)) AS VARBINARY(500))
FROM Employees AS E JOIN EmployeeCTE AS M
ON E.ReportsTo = M.empid
)
SELECT REPLICATE('----', depth) + empname AS empname
FROM EmployeeCTE CTE
ORDER BY sortcol

Amit Agrawal said...

why you need second quesry to avoid the variable ..you can simplay say is null instead of @manager in first query