A common question asked during an interview is “What is the difference between a heap and a table with a clustered index?”. The most commonly given answer is that the data is sorted in the key order of the index. Sometimes this will be elaborated upon, and other things will be mentioned, such as:
- All of the data in the table is moved into the leaf level of the index.
- A heap has a record identifier (RID) – an eight-byte value that identifies the file / page / slot that the record is in.
- This RID is then used in non-clustered indexes to identify the particular record.
- On rare occasions, you might even hear about forwarded record pointers.
On an index (including a clustered index), there is at least a root level and a leaf level to the index – the leaf level contains the data, and the root level has the starting value that is on each of the data pages. Assuming that a heap has no forwarded records, then the clustered index leaf level should be the same size as the heap. Depending on the number of pages required for the data in the index, there may also be intermediate leaf levels to support this index tree – in this case, the root level will point to the intermediate leaf levels, which will in turn continue pointing down the tree until they are pointing to the leaf level. All of these index intermediate leaf level pages and the root page would be additional pages that aren’t in the heap, and the clustered index would have more pages that the corresponding heap.
On a table with only insert activity, a clustered index would frequently have more pages than a heap. Furthermore, if the clustered index key is not in an ascending order, then page splits can occur when a record has to go onto a specific page due to the value in that record – if the page does not have enough free space to hold that row, then some of the records in that page are split off into a new page, and then the record is inserted into the appropriate page. This creates even more pages in the clustered index.
Heaps, on the other hand, insert new records into the last page until that page is full, then a new page is created for additional records (again, this is for tables with only insert activity).
Or do they?
I recently came across another difference that I hadn’t previously known. Depending upon the way that records are being inserted into a heap, pages may not be filled to capacity before new pages are allocated. This can result in a heap being larger than a corresponding table with a clustered index.
Demo time.
Let’s start off by creating a table. We’ll create this table in the tempdb database, so that it won’t affect any other databases, and when the instance is restarted, it will be automagically cleaned up.
1 2 3 4 5 6 7 |
USE tempdb; -- safe database to do this in, and it will clean up when restarted GO IF OBJECT_ID('dbo.HeapTest') IS NOT NULL DROP TABLE dbo.HeapTest; GO -- look ma! No clustered indexes! CREATE TABLE dbo.HeapTest (TextData VARCHAR(1000)); GO |
First off, let’s verify that it works as expected, and then compare this to a clustered index. Let’s add 700 rows to this table, and put 960 characters into this column. Since a page has a row size limit of 8060 bytes, and each row is just under 1000 bytes, we expect to see 8 rows per page. This should make the table with 87 full pages, and one page with 4 rows. I’ll add the 700 rows by using an inline tally / numbers table, and just get the first 700 rows:
1 2 3 4 5 6 7 8 9 |
WITH Tens (N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1), Hundreds(N) AS (SELECT 1 FROM Tens t1, Tens t2), Millions(N) AS (SELECT 1 FROM Hundreds t1, Hundreds t2, Hundreds t3), Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM Millions) INSERT INTO dbo.HeapTest (TextData) SELECT TOP (700) REPLICATE('Wayne''s World Rocks!', 48) FROM Tally; |
Now that the table has been populated, I’ll query a system DMV and return how many pages / rows are in this table:
1 2 3 4 5 6 7 |
SELECT CASE WHEN OBJECTPROPERTY(OBJECT_ID('dbo.HeapTest'), 'TableHasClustIndex') = 1 THEN 'CLUSTERED' ELSE 'HEAP' END AS TableType, used_page_count, row_count, in_row_data_page_count FROM sys.dm_db_partition_stats st WHERE object_id = OBJECT_ID('dbo.HeapTest'); |
This returns:
1 2 3 |
TableType used_page_count row_count in_row_data_page_count --------- --------------- --------- ---------------------- HEAP 89 700 88 |
Let’s next look at how many rows are in each page. To accomplish this, this next query uses two undocumented functions. The first, %%physloc%%, returns the file/page/slot that a row is in, and the second, sys.fn_PhysLocFormatter, puts this into a File:Page:Slot format. The query will get this value for each row, then break this apart and get the page that it is on, and finally get a count for each page by grouping on the page:
1 2 3 4 5 6 7 |
SELECT ca4.Page#, COUNT(*) AS PageCounter FROM dbo.HeapTest CROSS APPLY (SELECT sys.fn_PhysLocFormatter(%%physloc%%)) ca([File:Page:Slot]) CROSS APPLY (SELECT CHARINDEX(':', ca.[File:Page:Slot])) ca2(PageStart) CROSS APPLY (SELECT CHARINDEX(':', ca.[File:Page:Slot], ca2.PageStart + 1)) ca3(SlotStart) CROSS APPLY (SELECT CONVERT(INTEGER, SUBSTRING(ca.[File:Page:Slot], ca2.PageStart+1, ca3.SlotStart - ca2.pageStart - 1))) ca4([Page#]) GROUP BY ca4.Page#; |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
Page# PageCounter ----------- ----------- 535 8 537 8 552 8 553 8 554 8 604 8 606 8 607 8 864 8 ... 942 8 943 4 |
And when I check the results, there are 87 pages with 8 rows on each page, and one page with 4 rows, for 88 total data pages. This agrees with the system DMV previously queried, and with what was calculated.
We now have our starting point – both the heap and the clustered index should have 88 data pages. Let’s add an identity column to this table, and build a clustered index on this table against that column. Since doing this will replace the RID in the heap, which is 8 bytes, let’s use a BIGINT (which is also 8 bytes) in order to make the row size stay the same.
1 2 |
ALTER TABLE dbo.HeapTest ADD RowID BIGINT IDENTITY(1,1); CREATE CLUSTERED INDEX IX_RowID ON dbo.HeapTest (RowID); |
When we rerun the DMV query, we get these results:
1 2 3 |
TableType used_page_count row_count in_row_data_page_count --------- --------------- --------- ---------------------- CLUSTERED 90 700 88 |
As we can see, the total number of data pages remained the same, and the total number of pages increased by one. This is expected since the index adds a root level index node, and there isn’t so much data that intermediate leaf levels are required.
Now, let’s try a different insert pattern. Let’s add these rows one by one, by replacing the above insert statement with this one:
1 2 3 4 5 6 7 8 9 10 |
WITH Tens (N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1), Hundreds(N) AS (SELECT 1 FROM Tens t1, Tens t2), Millions(N) AS (SELECT 1 FROM Hundreds t1, Hundreds t2, Hundreds t3), Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM Millions) INSERT INTO dbo.HeapTest (TextData) SELECT TOP (1) REPLICATE('Wayne''s World Rocks!', 48) FROM Tally; GO 700 |
Rerun all of the statements, and when we run the DMV query, we get these results:
1 2 3 |
TableType used_page_count row_count in_row_data_page_count --------- --------------- --------- ---------------------- HEAP 108 700 107 |
Here we can see that the table now has an additional 19 data pages being used. So, let’s run the query to see how many rows are on each of the pages. The partial results are:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
Page PageCounter ----------- ----------- 535 7 537 7 552 7 553 7 554 7 604 7 606 7 607 7 752 1 753 1 754 1 755 1 756 1 757 1 758 1 759 1 760 7 761 7 ... |
None of these pages are filled to capacity (8 rows), and some have as few as 1 row on the page. If I add the identity column and clustered index, it goes back to 88 data pages / 90 total pages. If the table is initially created with this column and clustered index, the results are the same as adding them afterwards.
Does this happen all the time?
So far, we’ve seen a difference between a single row insert repeated 700 times, and a single 700 row insert. Are there other data insert patterns that we could investigate? How about for every combination where (rows being inserted) * (loops) = 700? The following script gets those combinations, and then runs all of the above code in a loop (cursor) for each combination, finally returning the results for each:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 |
DECLARE @RowsToInsert INTEGER; SET @RowsToInsert = 700; IF OBJECT_ID('tempdb.dbo.#Temp') IS NOT NULL DROP TABLE #Temp; WITH Tens (N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1), Hundreds(N) AS (SELECT 1 FROM Tens t1, Tens t2), Millions(N) AS (SELECT 1 FROM Hundreds t1, Hundreds t2, Hundreds t3), Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM Millions) SELECT TOP (@RowsToInsert) N AS Loops, RowsPerLoop INTO #Temp FROM Tally CROSS APPLY (SELECT (@RowsToInsert/N)) ca(RowsPerLoop) WHERE @RowsToInsert % N = 0; /* declare variables */ DECLARE @Loops INTEGER, @RowsPerLoop INTEGER, @CurrentLoop INTEGER; IF OBJECT_ID('tempdb.dbo.#Temp2') IS NOT NULL DROP TABLE #Temp2; SELECT IDENTITY(INT,1,1) AS RowID, 'CLUSTERED' AS TableType, @Loops AS Loops, @RowsPerLoop AS RowsPerLoop, used_page_count, row_count, st.in_row_data_page_count INTO #Temp2 FROM sys.dm_db_partition_stats st WHERE 1=2; -- Creates data structure, but no rows. DECLARE cLoops CURSOR FAST_FORWARD READ_ONLY FOR SELECT * FROM #Temp ORDER BY Loops; OPEN cLoops; FETCH NEXT FROM cLoops INTO @Loops, @RowsPerLoop; WHILE @@FETCH_STATUS = 0 BEGIN IF OBJECT_ID('dbo.HeapTest') IS NOT NULL DROP TABLE dbo.HeapTest; -- look ma! No clustered indexes! CREATE TABLE dbo.HeapTest ( TextData VARCHAR(1000) -- remove comment to check inserts against a clustered table --,RowID INTEGER IDENTITY PRIMARY KEY CLUSTERED ); SET @CurrentLoop = 0; WHILE @CurrentLoop < @Loops BEGIN SET @CurrentLoop = @CurrentLoop + 1; WITH Tens (N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1), Hundreds(N) AS (SELECT 1 FROM Tens t1, Tens t2), Millions(N) AS (SELECT 1 FROM Hundreds t1, Hundreds t2, Hundreds t3), Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM Millions) INSERT INTO dbo.HeapTest (TextData) SELECT TOP (@RowsPerLoop) REPLICATE('Wayne''s World Rocks!', 48) FROM Tally; END; INSERT INTO #Temp2 (TableType, Loops, RowsPerLoop, used_page_count,row_count, in_row_data_page_count) SELECT CASE WHEN OBJECTPROPERTY(OBJECT_ID('dbo.HeapTest'), 'TableHasClustIndex') = 1 THEN 'CLUSTERED' ELSE 'HEAP' END, @Loops AS Loops, @RowsPerLoop AS RowsPerLoop, used_page_count, row_count, st.in_row_data_page_count FROM sys.dm_db_partition_stats st WHERE object_id = OBJECT_ID('dbo.HeapTest'); -- comment out the following line (starting with "\") to add a clustered index and test that. /* ALTER TABLE dbo.HeapTest ADD RowID INTEGER IDENTITY(1,1); CREATE CLUSTERED INDEX IX_RowID ON dbo.HeapTest (RowID); INSERT INTO #Temp2 (TableType, Loops, RowsPerLoop, used_page_count,row_count, in_row_data_page_count) SELECT 'CLUSTERED', @Loops AS Loops, @RowsPerLoop AS RowsPerLoop, used_page_count, row_count, st.in_row_data_page_count FROM sys.dm_db_partition_stats st WHERE object_id = OBJECT_ID('dbo.HeapTest'); --*/ FETCH NEXT FROM cLoops INTO @Loops, @RowsPerLoop; END CLOSE cLoops; DEALLOCATE cLoops; IF OBJECT_ID('dbo.HeapTest') IS NOT NULL DROP TABLE dbo.HeapTest; SELECT TableType, Loops, RowsPerLoop, used_page_count, row_count, in_row_data_page_count FROM #Temp2; |
Results:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
TableType Loops RowsPerLoop used_page_count row_count in_row_data_page_count --------- ----- ----------- --------------- --------- ---------------------- HEAP 1 700 89 700 88 HEAP 2 350 89 700 88 HEAP 4 175 89 700 88 HEAP 5 140 89 700 88 HEAP 7 100 89 700 88 HEAP 10 70 89 700 88 HEAP 14 50 89 700 88 HEAP 20 35 89 700 88 HEAP 25 28 89 700 88 HEAP 28 25 89 700 88 HEAP 35 20 89 700 88 HEAP 50 14 89 700 88 HEAP 70 10 89 700 88 HEAP 100 7 101 700 100 HEAP 140 5 97 700 96 HEAP 175 4 89 700 88 HEAP 350 2 89 700 88 HEAP 700 1 108 700 107 |
In this last script, a variable sets the number of rows to insert at a time. You can test this against whatever values that you desire. The script also allows for creating the clustered index, to verify that it is working correctly (it did so in all of the values that I tested).
Summary
We can see that based upon the number of rows being inserted into the heap, we can have a greater number of pages than would be expected. Over time, this could lead to a heap being considerably larger than a corresponding clustered table, which translates into more space that is needed on disk, in memory, and for backups. This is particularly true when you consider activity that can cause a heap to create forwarded records.
What concerns me the most about this finding is that what is expected to be the most frequent insert pattern in an OLTP system (single record inserts) has the biggest potential for creating the most number of new pages prematurely. This is, in my opinion, just another reason to have a clustered index on your table.
This was tested on SQL Server versions 2005, 2008, 2008R2, 2012, 2014 and 2016 (CTP 2.2).