Pages

Sunday, December 30, 2007

clustered index in SQL Server

What is a clustered index

First off, we'll go through what a clustered index is. SQL Server has two types of indexes, clustered indexes and non-clustered indexes. Both types are organized in the same way with a b-tree structure. The difference between them lies in what the leaf-level nodes -- the lowest level of the tree -- contains. In a clustered index the leaf-level is the data, while the leaves of a non-clustered index contains bookmarks to the actual data. This mean that for a table that has a clustered index, the data is actually stored in the order of the index. What the bookmarks of the non-clustered index point to depends on if the table also has a clustered index or not. If it does have a clustered index then the leaves of non-clustered indexes will contain the clustering key -- the specific value(s) of the column(s) that make up the clustered index -- for each row. If the table does not have a clustered index it is known as a heap table and the bookmarks in non-clustered indexes are in RID format (File#:Page#:Slot#), i.e. direct pointers to the physical location the row is stored in. Later in this article we will see why this difference is important. To make sure that everyone understands the difference between a clustered index and a non-clustered index I have visualized them in these two images (clustered | non-clustered). The indexes correspond to those of this table:

CREATE TABLE EMPLOYEES
(
empid int NOT NULL CONSTRAINT ix_pkEMPLOYEES PRIMARY KEY NONCLUSTERED
, name varchar(25) NOT NULL
, age tinyint NOT NULL
)


CREATE CLUSTERED INDEX ixcEMPLOYEES ON EMPLOYEES (name)


INSERT INTO EMPLOYEES (empid, name, age) VALUES (1, 'David', 42)
INSERT INTO EMPLOYEES (empid, name, age) VALUES (2, 'Tom', 31)
INSERT INTO EMPLOYEES (empid, name, age) VALUES (3, 'Adam', 27)
INSERT INTO EMPLOYEES (empid, name, age) VALUES (4, 'John', 22)


SELECT * FROM EMPLOYEES WHERE name = 'John'
SELECT * FROM EMPLOYEES WHERE empid = 1

In the real indexes these four rows would fit on the same page, but for this discussion I've just put one row on each page. So, to return results for the first query containing WHERE name = 'John' SQL Server will traverse the clustered index from the root down through the intermediate node levels until it finds the leaf page containing John, and it would have all the data available to return for the query. But to return results for the second query, it will traverse the non-clustered index until it finds the leaf page containing empid 1, then use the clustering key found there for empid 1 (David) for a lookup in the clustered index to find the remaining data (in this case just the column age is missing). You can see this for yourself by viewing the execution plan for the queries in Query Analyzer (press Ctrl-K too see the plan).

No comments: