Wayne Sheffield

My blog about SQL Server


In my last few blog posts, I’ve shared several methods of getting internal information from a database by using the DBCC PAGE command with the “WITH TABLERESULTS” option, which allows the process to be automated for further processing. This post will do the same, but this time we’ll use it to bust a common myth: that data in a clustered index is physically stored on disk in the order of the clustered index.

Busting this myth

To bust this myth, we’ll create a database, put a table with a clustered index into this database, and then we’ll add some rows in random order. Next, we will show that the rows are stored on the pages in logical order, and then we’ll take a deeper look at the page internals to see that the rows are not stored in physical order.

To start off with, let’s create a database and a table, and add a few rows to this table:
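The setup script isn’t reproduced here; a minimal sketch might look like this (the database, table, and column names are placeholders of my own, not necessarily those from the original demo):

```sql
USE master;
GO
CREATE DATABASE MythBusting;
GO
USE MythBusting;
GO
CREATE TABLE dbo.MythBusters (
    RowID    INT     NOT NULL CONSTRAINT PK_MythBusters PRIMARY KEY CLUSTERED,
    RowValue CHAR(5) NOT NULL
);
GO
-- Insert the odd rows first, then the even rows.
INSERT INTO dbo.MythBusters (RowID, RowValue) VALUES (1, 'Row01');
INSERT INTO dbo.MythBusters (RowID, RowValue) VALUES (3, 'Row03');
INSERT INTO dbo.MythBusters (RowID, RowValue) VALUES (5, 'Row05');
INSERT INTO dbo.MythBusters (RowID, RowValue) VALUES (2, 'Row02');
INSERT INTO dbo.MythBusters (RowID, RowValue) VALUES (4, 'Row04');
GO
```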

What we have is a table with 5 rows. The table’s clustered key is the RowID integer column. The rows are inserted so that the odd rows are inserted first, followed by the even rows.

Determine row’s physical location

At this point, let’s look at where the system reports these rows to be. To do this, we’ll utilize two undocumented features. The first is the virtual column %%physloc%%, which returns the physical location of each row in a hexadecimal format. The second is the function sys.fn_PhysLocFormatter, which converts this into the format FileID:PageID:SlotID. These are simply added to the select clause, so the query is:
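A sketch of that query, using the placeholder table name dbo.MythBusters (both %%physloc%% and sys.fn_PhysLocFormatter are undocumented and subject to change):

```sql
SELECT  RowID,
        RowValue,
        sys.fn_PhysLocFormatter(%%physloc%%) AS [File:Page:Slot]
FROM    dbo.MythBusters
ORDER BY RowID;
```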

This query produces these results:


From these results, you can see that the rows are all on the same page, and that the slot numbers in the page’s slot array increment along with the RowID values. Remember also that the slot array is zero-based: the first slot is slot #0. This is the logical order.

How inserts change the slot the row is in

To see this change as the rows are inserted, just put the select statement (which I just introduced) after each of the prior insert commands and run the script to create the table and add the rows again. You will initially see RowID=1 in slot #0, RowID=3 in slot #1, and RowID=5 in slot #2. When you then add RowID=2, it needs to be between RowIDs 1 and 3, so RowID=2 is now in slot #1, RowID=3 moves to slot #2, and RowID=5 moves to slot #3. When you add RowID=4, it is inserted into slot #3 and RowID=5 is again pushed down, to slot #4. The logical order follows what we are expecting:


Check the physical storage

In order to show that these are not physically stored in this order, we will need to crack this page and look internally at where the data is actually stored. We will accomplish this with this code:
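The page-cracking code isn’t shown here; a hedged sketch of the approach, using the INSERT INTO / EXECUTE pattern from my earlier posts (the FileID and PageID values of 1 and 312 are placeholders that must come from the fn_PhysLocFormatter output above):

```sql
CREATE TABLE #PageResults (
    ParentObject VARCHAR(255),
    [Object]     VARCHAR(255),
    Field        VARCHAR(255),
    VALUE        VARCHAR(8000)
);

-- Dump style 1 lists each slot along with the physical offset where its row starts.
INSERT INTO #PageResults
EXECUTE ('DBCC PAGE (MythBusting, 1, 312, 1) WITH TABLERESULTS;');

-- Each slot's entry carries its offset on the page, e.g. "Slot 0 Offset 0x60 ...".
SELECT DISTINCT [Object]
FROM   #PageResults
WHERE  [Object] LIKE 'Slot%Offset%';

DROP TABLE #PageResults;
```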

Which produces these results:


The row’s offset is where the row starts at on the page

As we look at these results, pay close attention to the Offset column. This is where the row physically starts on the page. You can see that for RowID #2, this offset is higher than the offset for RowID #3… and even RowID #5. When the row was added, its data was placed after the existing rows on the page, and the offset where this data starts was entered into the slot array, after the remaining slot array entries were pushed down in order to maintain the correct logical order. We can see this happen once again when RowID #4 is inserted.

And, finally, let’s use DBCC PAGE to crack open this page and look at the raw data. For this, we want to use dump style 2:
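A sketch of that statement, again with placeholder FileID/PageID values:

```sql
-- Dump style 2 returns the raw bytes of the page.
DBCC PAGE (MythBusting, 1, 312, 2) WITH TABLERESULTS;
```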

If you expand out the VALUE column, you can see it go from Row01 to Row03 to Row05, then to Row02 and Row04.

So there you go… the data is physically stored in the next available space on the page; however, the page’s slot array is in the clustered key order. Furthermore, we now know that we can look at the offset and determine the order on the page. And we also have yet another way to use DBCC PAGE in an automated manner to show this.

Previous related posts:

Persisting DBCC Output
Automating DBCC Page
Automating DBCC Page, Part 2

This post is re-published from my original post on SQL Solutions Group.

Today is yet another T-SQL Tuesday – the monthly blogging party started by Adam Machanic (b|t) to get everyone in the SQL Server blogosphere together and to blog about a topic. Today’s party is hosted by none other than Matt Gordon (b|t), and the topic that he has chosen for us to blog about is “Fixing old problems with Shiny New Toys”. Specifically, he says:

What I’d like to see from the blog responses for this T-SQL Tuesday is how you’ve used a “new” Microsoft data platform toy to fix an old problem.

I’ve decided that I’m going to turn this around. Sometimes, those Shiny New Toys cause problems. Let’s talk about one such case.

Recently, I was helping a client with an issue where when a job would run, the CPU on all processors would max out. This in turn would cause other performance issues on the server:

After a bit of investigation, I was able to determine that the statement causing this was a MERGE statement. The MERGE statement was a Shiny New Toy introduced in SQL 2008. This MERGE statement had both a WHEN MATCHED and a WHEN NOT MATCHED BY TARGET clause inside it.

What does this mean?

BOL defines the MERGE statement as:

Performs insert, update, or delete operations on a target table based on the results of a join with a source table.

This is performed with just one pass of the target table.

There are three possible operations that can be used within the MERGE statement. These are:

WHEN MATCHED – this allows you to perform an UPDATE or DELETE against the target table using the source table. There is a row existing in both tables for this match. If you were performing just the individual UPDATE or DELETE statement, you would use an INNER JOIN between the source and target table.

WHEN NOT MATCHED BY TARGET – this allows you to perform an INSERT of rows into the target table from rows in the source table that are not in the target table. If you were performing just the INSERT statement, you would drive it with a SELECT statement that performs a LEFT OUTER JOIN between the source and target tables, checking for a (not-null) column in the target table being NULL (signifying that there wasn’t a matching row).

WHEN NOT MATCHED BY SOURCE – this allows you to perform an UPDATE or DELETE against the target table when the target table has rows that the source table does not. If you were performing just the UPDATE or DELETE statement, you would use a RIGHT OUTER JOIN between the source and target tables… or how most people would write it, a LEFT OUTER JOIN between the target and source tables.

Keep in mind that my examples above are what you would do if you were separating the commands. Individually, these different conditions perform just fine with the MERGE statement. However, when the MERGE statement combines multiple match conditions, then think about what SQL has to do. See the chart at this link for when the different types of join conditions are used in the MERGE statement.

In this case, it’s doing both a WHEN MATCHED and a WHEN NOT MATCHED BY TARGET. In order to process the target table in just one pass, SQL Server frequently ends up with a plan that performs worse, and it can use more resources in the process. I frequently find it better to just use the individual statements.

By separating this out into the respective statements, each can be optimized to perform better. Which is what I did… an UPDATE statement followed by an INSERT statement. Not only did the job run faster overall, but this change also stopped the CPU from maxing out:
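The client’s actual statements aren’t shown; as a hedged illustration of the pattern (dbo.Target, dbo.Source, and their columns are hypothetical stand-ins), splitting a MERGE that has both clauses into its component statements looks like this:

```sql
-- Instead of:
--   MERGE dbo.Target AS t
--   USING dbo.Source AS s ON t.ID = s.ID
--   WHEN MATCHED THEN UPDATE SET t.Col1 = s.Col1
--   WHEN NOT MATCHED BY TARGET THEN INSERT (ID, Col1) VALUES (s.ID, s.Col1);

-- WHEN MATCHED becomes an UPDATE with an INNER JOIN:
UPDATE  t
SET     t.Col1 = s.Col1
FROM    dbo.Target AS t
        INNER JOIN dbo.Source AS s ON t.ID = s.ID;

-- WHEN NOT MATCHED BY TARGET becomes an INSERT driven by a LEFT OUTER JOIN:
INSERT INTO dbo.Target (ID, Col1)
SELECT  s.ID, s.Col1
FROM    dbo.Source AS s
        LEFT OUTER JOIN dbo.Target AS t ON t.ID = s.ID
WHERE   t.ID IS NULL;
```

Each statement can now get its own optimized plan, rather than one plan that has to satisfy both match conditions in a single pass.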

Ah, much better.

There are other issues with the MERGE statement. It has bugs… some of which can cause database corruption.

Here we have a Shiny New Toy (feature), supposed to make life easier, yet it causes problems. Until it can perform better (and the bugs are eliminated), I just don’t use it.

Beware the Shiny New Toys.

In a prior blog post, I demonstrated how using DBCC PAGE can be automated by using the “WITH TABLERESULTS” option. In this post, we will continue with another look at how this can be done.

On a nice wintry day, your city ended up being covered in several feet of snow. During the course of the night, your own house had several power outages. Being concerned about your databases, you shoveled your way into your office, so that you could check on things. (Okay… actually you would just VPN in, but this is my story after all…)

Once you get into your server, you check the jobs and find that your job that runs DBCC CHECKDB has failed. Let’s assume that a power glitch has caused corruption in your database. In order to find out what all is affected, you run DBCC CHECKDB WITH NO_INFOMSGS, ALL_ERRORMSGS. But, this happens to be on your 2TB database, and it will take a while for CHECKDB to finish so that you can find the scope of corruption.

You’d really like to know what tables are affected without having to wait. Luckily(?), this corruption was recorded in msdb.dbo.suspect_pages, and having just recently read Paul Randal’s post here, we know we can use DBCC PAGE to determine this information. And, after having read my prior blog post, you know that we can automate DBCC PAGE, so we can use our new friend “WITH TABLERESULTS” to find out what objects have been corrupted.

The suspect_pages table, documented here, has three particular columns of interest: database_id, file_id and page_id. These correspond nicely to the first three parameters of DBCC PAGE. To automate this, we need to know what information we need to return off of the page – and from Paul’s post, we know that this is the field “Metadata: ObjectId”. For this code example, let’s assume that this corruption is on page 11 of the master database (just change “master” to the name of your 2TB database).
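A sketch of that code (assuming the corrupt page is page 11 of file 1; the temp table mirrors the four columns DBCC PAGE returns with TABLERESULTS):

```sql
CREATE TABLE #PageResults (
    ParentObject VARCHAR(255),
    [Object]     VARCHAR(255),
    Field        VARCHAR(255),
    VALUE        VARCHAR(8000)
);

INSERT INTO #PageResults
EXECUTE ('DBCC PAGE (master, 1, 11, 0) WITH TABLERESULTS;');

-- Resolve the ObjectId from the page header metadata to an object name.
SELECT OBJECT_NAME(CONVERT(INT, VALUE), DB_ID('master')) AS ObjectName
FROM   #PageResults
WHERE  Field = 'Metadata: ObjectId';

DROP TABLE #PageResults;
```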

And there you go… you now know which object it is that has the corruption. In the same fashion, another interesting field that is returned is the IndexId – the Field value is “Metadata: IndexId”. It would be a similar exercise to grab that from this page also, an exercise that I’ll leave to you.

An automated method for getting the object from all suspect pages would entail encapsulating this logic into a cursor to spin through each row in the suspect_pages table (and I’ll even throw in getting the index_id also):
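A hedged sketch of that cursor (the variable and temp table names are my own):

```sql
DECLARE @database_id INT,
        @file_id     INT,
        @page_id     BIGINT,
        @sql         NVARCHAR(500);

CREATE TABLE #PageResults (
    ParentObject VARCHAR(255),
    [Object]     VARCHAR(255),
    Field        VARCHAR(255),
    VALUE        VARCHAR(8000)
);

DECLARE SuspectPages CURSOR LOCAL FAST_FORWARD FOR
    SELECT database_id, file_id, page_id
    FROM   msdb.dbo.suspect_pages;

OPEN SuspectPages;
FETCH NEXT FROM SuspectPages INTO @database_id, @file_id, @page_id;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build and run the DBCC PAGE statement for this suspect page.
    SET @sql = N'DBCC PAGE (' + CONVERT(NVARCHAR(10), @database_id) + N', '
             + CONVERT(NVARCHAR(10), @file_id) + N', '
             + CONVERT(NVARCHAR(20), @page_id) + N', 0) WITH TABLERESULTS;';

    INSERT INTO #PageResults
    EXECUTE (@sql);

    SELECT  DB_NAME(@database_id) AS DatabaseName,
            MAX(CASE WHEN Field = 'Metadata: ObjectId' THEN VALUE END) AS ObjectId,
            MAX(CASE WHEN Field = 'Metadata: IndexId'  THEN VALUE END) AS IndexId
    FROM    #PageResults;

    TRUNCATE TABLE #PageResults;
    FETCH NEXT FROM SuspectPages INTO @database_id, @file_id, @page_id;
END;
CLOSE SuspectPages;
DEALLOCATE SuspectPages;
DROP TABLE #PageResults;
```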

If you happen to be on SQL 2012 or higher, this can be greatly simplified by using the new (undocumented) DMO function sys.dm_db_database_page_allocations (and it also takes away the need to crack the page using DBCC PAGE).
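A sketch of that simplification (since the DMF is undocumented, both its parameters and its column names are assumptions that may change between versions):

```sql
SELECT  DB_NAME(sp.database_id) AS DatabaseName,
        dpa.object_id,
        dpa.index_id
FROM    msdb.dbo.suspect_pages AS sp
CROSS APPLY sys.dm_db_database_page_allocations
            (sp.database_id, NULL, NULL, NULL, 'DETAILED') AS dpa
WHERE   dpa.allocated_page_file_id = sp.file_id
  AND   dpa.allocated_page_page_id = sp.page_id;
```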

And there we go – yet another time when you might want to automate using DBCC PAGE. By now you should be able to see other uses for it – as long as you can get the database_id, file_id and page_id, you can automate the usage of it to retrieve the information that you are looking for.

Previous related posts:

Persisting DBCC Output
Automating DBCC Page

This post is re-published from my original post on SQL Solutions Group.

On the second Tuesday of each month, the SQL Server universe all comes together to blog about a topic. Known as T-SQL Tuesday and started by Adam Machanic (Blog | @AdamMachanic), this month’s T-SQL Tuesday topic is hosted by friend and fellow MCM Brent Ozar. The theme that Brent has chosen for this month is “SQL Server Bugs & Enhancement Requests”. Specifically, Brent wants us to visit connect.microsoft.com, find an interesting request, and blog about it. Now Brent, this is a fantastic topic for a T-SQL Tuesday. I’m really looking forward to seeing what others blog about. I plan on upvoting those I like to get the attention of the SQL Server team at Microsoft. For what it’s worth, I think that this should be a yearly event to do with T-SQL Tuesday!

My favorite connect item

The connect item that I want to blog about is one that I submitted a few months back. Now, I’ve submitted a few enhancement requests over time (see this link), and there are two current ones that I’m really keen on. However, Brent only wants us to talk about one, so let’s explore what my favorite one is. It can be found at this link.

Track object last used data

At almost every place that I’ve worked (and many that I’ve consulted at), I’ve been asked to figure out what objects in the database are no longer being used. I’ll bet that you’ve been asked this also. While we can tell when objects were created, it’s not so easy to determine when they were last used.

Now, there are a couple of DMVs that can help us out some. Specifically, sys.dm_exec_procedure_stats will show us information about the procedures that are in the plan cache. In BOL there is some information about what this DMV does:

The view returns one row for each cached stored procedure plan, and the lifetime of the row is as long as the stored procedure remains cached.

Another DMV that can be helpful is sys.dm_db_index_usage_stats, which shows us various types of index operations and when they were last performed. However, this data is only available while the database is online – when the database is shut down or detached (or when the SQL Server service is shut down), this data is lost.

In both of these DMVs, the data is not persisted. Now, we can persist this data by running a job every minute or so to write this information out to a table. However, this only tells us a little bit of the picture: when procedures and indexes are used. If an index is being used, then the table (or indexed view) that the index belongs to has been utilized.

What is missing is all of the other objects: triggers, functions, non-indexed views, user-defined data types, etc. In order to look for these, you would have to set up an Extended Events session to capture all of the SQL commands being run, and then parse the code to find out what objects are being used. Yuck.

My proposal

What I have proposed is to add two columns to sys.objects: last_used_time_user and last_used_time_system. Two columns are proposed to be consistent with how the sys.dm_db_index_usage_stats DMV tracks usage by user and system processes. I also suggest adding these columns to sys.indexes. For sys.indexes, these could just be the max of the appropriate columns from the sys.dm_db_index_usage_stats DMV.

I would imagine that tracking all of this information would cause a performance hit, so this should not be on all the time. I suggest an ALTER DATABASE command (ALTER DATABASE TRACK OBJECT USAGE = ON/OFF) to control whether this information is being collected (now if they can figure out a way to do this where the overhead is minimal, just ignore this command).

Taking this further?

Would it be advantageous to know whether a column is being used in a table? Do you have a column in a table that you believe is obsolete? Have you left it in the table because you’re afraid of breaking the application somewhere? So how about adding these same columns to sys.columns? I’d suggest another ALTER DATABASE command to control whether column usage is being tracked (ALTER DATABASE TRACK COLUMN USAGE = ON/OFF).


SQL Server can make it easier for us to know whether objects (or even columns) are being used. Having this information will make it easier for us to eliminate obsolete stuff hanging around. This stuff takes up space (including backups) and increases time for maintenance.

How SQL Server uses Windows Virtual Accounts and local groups

Managed Service Accounts (MSAs) and Virtual Accounts were introduced in Windows 7 / Windows Server 2008 R2 (link). Installations of SQL Server 2012 and newer utilize Virtual Accounts. SQL Server 2008 / 2008 R2 installations use a combination of local groups and Virtual Accounts, and SQL Server 2005 installations use just local groups.

On my Windows 7 laptop, I have SQL Server versions 2005-2014 installed. Looking at Computer Management for the groups, we can see the groups that were created by the various SQL Server installations:


If we drill in to one of these groups, we can see that the SQL Server service account for the database engine is assigned to this group:


Just to prove that this is the SQL Server service account:


SQL Server uses these groups in many places so that permissions are granted to the group instead of to the actual service account. This simplifies things greatly if you ever change the service account: SQL Server Configuration Manager just changes the member of this group, instead of having to hunt down and change everywhere that it knows permissions are needed for the service account. Using these groups for the permissions that you grant on local resources (paths, registry, etc.) will simplify your life as well; if those permissions were granted directly to the service account, they would all have to be changed after a service account change, whereas the group retains the same permissions regardless of which account is its member.

When installing on Windows 7 / Windows Server 2008 R2 or later, Virtual Accounts are used instead. The virtual account used is “NT SERVICE\<service name>”. For a default SQL Server instance, the name will be “NT SERVICE\MSSQLSERVER”, and for a named SQL Server instance it will be “NT SERVICE\MSSQL$<InstanceName>”. Essentially, this Virtual Account takes the place of a group. One of the places where this is used is in the Local Security Policies. Here is a screen shot from one of those:


Notice that this policy has both Virtual Accounts and local groups assigned to it, depending on the version of SQL.

Does this work?

I have had people insist that this only works when using a local Windows account, and that it doesn’t work with a domain account. So, let’s test it out.

Previously, I had created a lab environment with a network for testing out Availability Groups. I installed an instance of SQL Server 2016 using a domain account for the service account:


The SQL Server 2016 installation specified to set the “Perform Volume Maintenance Tasks” policy. The following screen shot shows that is set to use “NT SERVICE\MSSQLSERVER”:


When the SQL Server error log is examined, we can see that Instant File Initialization (IFI) is enabled. This can only be enabled if the service is running with the “Perform Volume Maintenance Tasks” security privilege assigned to it:


Just to prove that IFI is actually working, let’s build a database using the example shown at https://blogs.msdn.microsoft.com/sql_pfe_blog/2009/12/22/how-and-why-to-enable-instant-file-initialization/:

Running these statements produces the following results from the error log:


Here we can see that only the log file was zeroed out – the data file wasn’t. This proves that IFI is actually working, and that the “Perform Volume Maintenance Tasks” privilege granted to the virtual account is properly applied to the domain service account that SQL Server is running under.


In summary, the Virtual Account is essentially an automatic group for the service. It can be used on the local machine, even if the service account is a domain account. It can simplify things if you ever need to change the service account. You should use it.

Note that if you need to grant the service account permissions outside of the local server (network file share, another server, etc.) then you will still need to use the domain service account.

Way back in 2006, Paul Randal documented DBCC PAGE on his Microsoft blog at http://blogs.msdn.com/b/sqlserverstorageengine/archive/2006/06/10/625659.aspx. In his post, you will notice that in order to return the output from DBCC PAGE to the screen, you need to enable trace flag 3604 first. The above blog post shows a few examples of the results and utilizing them for further actions. Unfortunately, this method requires manual intervention to get the necessary data from the page in order to work with it further.

You know, it sure would be nice if the manual intervention could be removed and to completely automate the task that you are looking for. This blog post is going to show you how this can be done, and it will give a few examples of doing so.

In a prior blog post, I introduced an optional setting to several DBCC commands: WITH TABLERESULTS. This returns the output in tabular format, in a manner that allows the output to be consumed. As it turns out, this optional setting also works with DBCC PAGE. You can see this by examining the database information page (page 9) of the master database:
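A sketch of such a statement (page 9 of file 1 is the database boot page; dump style 3 shows the header fields):

```sql
DBCC PAGE (master, 1, 9, 3) WITH TABLERESULTS;
```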

This statement, which can be run without the trace flag, returns the data on the page in 4 columns: ParentObject, Object, Field and VALUE. As it turns out, this particular page of every database has a lot of interesting information on it. Let’s automate grabbing a specific piece of information out of this page:
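A sketch of that snippet; the field holding the last successful CHECKDB date on the boot page is dbi_dbccLastKnownGood:

```sql
CREATE TABLE #PageResults (
    ParentObject VARCHAR(255),
    [Object]     VARCHAR(255),
    Field        VARCHAR(255),
    VALUE        VARCHAR(8000)
);

INSERT INTO #PageResults
EXECUTE ('DBCC PAGE (master, 1, 9, 3) WITH TABLERESULTS;');

SELECT VALUE AS LastKnownGoodCheckDb
FROM   #PageResults
WHERE  Field = 'dbi_dbccLastKnownGood';

DROP TABLE #PageResults;
```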

This wonderful snippet of code returns the last time a successful DBCC CHECKDB was run against the master database. Whoa… how sweet is that? When you discover a server where the CHECKDB job has been failing to the point that the job only has failed entries in the job history, you can now find out just how long it’s been since it was last run successfully.

The “trick” to making this work is to encapsulate the DBCC command in a string and run it with the EXECUTE() statement. This is used as part of an INSERT INTO / EXECUTE statement, so that the results from DBCC PAGE are inserted into a table (in this case a temporary table is used, although a table variable or permanent table could also be used). There are three simple steps to this process:

  1. Create a table (permanent / temporary) or table variable to hold the output.
  2. Insert into this table the results of the DBCC PAGE statement by using INSERT INTO / EXECUTE.
  3. Select the data that you are looking for from the table.

By utilizing a cursor, you can easily spin through all of the databases on your instance to get when they last had a successful DBCC CHECKDB run against them. The following utilizes sp_MSforeachdb (which itself uses a cursor to spin through all of the databases) to do just this:
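A hedged sketch of that approach (the temp table layout and the use of sp_MSforeachdb’s “?” token are my assumptions):

```sql
CREATE TABLE #PageResults (
    DatabaseName SYSNAME NULL,
    ParentObject VARCHAR(255),
    [Object]     VARCHAR(255),
    Field        VARCHAR(255),
    VALUE        VARCHAR(8000)
);

-- "?" is replaced with each database name by sp_MSforeachdb.
EXECUTE sp_MSforeachdb '
    INSERT INTO #PageResults (ParentObject, [Object], Field, VALUE)
    EXECUTE (''DBCC PAGE ([?], 1, 9, 3) WITH TABLERESULTS;'');
    UPDATE #PageResults SET DatabaseName = ''?'' WHERE DatabaseName IS NULL;';

SELECT DatabaseName,
       VALUE AS LastKnownGoodCheckDb
FROM   #PageResults
WHERE  Field = 'dbi_dbccLastKnownGood';

DROP TABLE #PageResults;
```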

And here we have every database on your instance, and the date that DBCC CHECKDB was last run successfully against it – well, at least for the databases that sp_MSforeachdb didn’t skip (see http://www.mssqltips.com/sqlservertip/2201/making-a-more-reliable-and-flexible-spmsforeachdb/ for more information about this).

In summary, by utilizing the “WITH TABLERESULTS” option of DBCC, we can automate processes that use DBCC PAGE, instead of needing manual intervention to work your way through this.

In the next few installments, we’ll look at other uses of using “WITH TABLERESULTS” to automate DBCC output.

Previous related posts:

Persisting DBCC Output

This post is re-published from my original post on SQL Solutions Group.

It’s that time of the month… the time when all of the T-SQL bloggers have a party and blog about a specific topic. Unfortunately, it’s been a few months since I’ve written a T-SQL Tuesday post. I’m hopeful that I can use this post to get back in the groove and participate in this monthly party on a, well, monthly basis.

Kenneth Fisher is our host this month, and the topic that he has selected is about “Backup and Recovery”. Kenneth states:

Backups are one of the most common things DBAs discuss, and they are at once one of the simplest and most complicated parts of our whole job. So let’s hear it for backup and recovery!

This is so true. Now, I want to get back to regular blogging – but this seems like a topic that is going to be pretty popular. I would venture that almost everyone will be talking about some aspect of the BACKUP or RESTORE command and working with the databases directly, so I would like to approach this from a different perspective.

Availability Groups and default XE Sessions

In a prior T-SQL Tuesday post, we were challenged to “Sharpen Something”, and I blogged about my deficiency in Availability Groups. Well, I’ve been working with them for a bit now. However, I’ve noticed that when things go wrong, there are lots of places that you can look at for finding out where the problem lies. And if the problem lies in the cluster, it becomes even trickier. However, the cluster writes out some .xel files – XE log files – by default to the SQL Server’s log directory. You won’t find this XE session running – it’s one of the hidden default sessions that Jason Brimhall blogged about.

In addition to these XE sessions being hidden, they only keep a small amount of data in just a few files. If you are looking back in time to see what had happened, it’s quite possible that you will find out that these files have rolled over on you, and what you’re looking for just isn’t in there anymore. I found that files only 8 hours old were being deleted because of this. If you’re trying to investigate what happened a few days later, well, you’re just out of luck.

Backups – more than just databases!

So how does this talk of AGs pertain to this T-SQL Tuesday topic? It should be pretty obvious – we need to periodically grab all of the .xel files generated by the cluster, and move them to a different directory, with a different retention policy. Yup… we need to back up these files. Sometimes, we need to be backing up things other than the databases themselves.

I created a PowerShell script that takes a few parameters, then moves the files from the source directory to the destination directory. And then it deletes files from the destination directory that are over x days old.

The next step is to schedule this script to run. I created a job, with a job step that executes a powershell script, and I call it with the desired parameters:

When I ran this job, it completed successfully. However, in looking at the source directory, all the files were still there. What’s up?

After a lot of investigation, it turns out that these particular files are created by the system, and all other users need administrator rights to work with the files. Even if the user has “Full Control” permission on the directory and all files in it. Since running the script from a SQL job with administrator privileges is a pretty difficult thing to do, what I ended up doing was to create a scheduled task at the OS level, where I can set the task to “Run with highest privileges” – aka administrator. Now the files are being copied out to a different location, and they are being retained with a different retention policy.

Just ensure that you have the disk space that you need. You don’t want to be causing further problems…


Most of the DBCC commands return their results as textual output, even if you have SSMS configured to return result sets to a grid. This makes examining the output a manual process that is prone to errors. It sure would be nice if there was a way to return the output to a grid.

If we were to examine Books Online (BOL) for DBCC in 2000 (http://technet.microsoft.com/en-us/library/aa258281%28v=sql.80%29.aspx) and 2005 (http://msdn.microsoft.com/en-us/library/ms188796%28v=sql.90%29.aspx), we would notice a section titled “Using DBCC Result Set Output”, with the phrase: “Many DBCC commands can produce output in tabular form by using the WITH TABLERESULTS option. This information can be loaded into a table for additional use.” This section has been removed from more recent versions, most likely because the results returned are not documented for the individual DBCC commands and are thus subject to change without notice.

Okay, let’s try this out. At the above BOL links, there are two DBCC commands that are documented to use the TABLERESULTS option: OPENTRAN and SHOWCONTIG. Testing all of the other DBCC commands shows that this option can also be used with the CHECKALLOC, CHECKDB, CHECKFILEGROUP and CHECKTABLE commands. I’m going to continue this post using DBCC CHECKDB, even though its BOL entry does not mention the TABLERESULTS option. Let’s first see the results we get without the TABLERESULTS option by running:

For the consistency check that is performed on the master database, I end up with over 300 lines of output. The key lines in the output are:

And the final two lines:

(Notice that when we run CHECKDB against the master database, the hidden database mssqlsystemresource is also checked.)

If we were to modify the above statement to use the TABLERESULTS option:

we would actually get two result sets – one for the master database, and one for the hidden mssqlsystemresource database. Notice that the “DBCC execution completed…” line is not in either of the result sets – it is still displayed on the Messages tab.

Now, most people usually modify this command to suppress informational messages and to show all error messages (it defaults to “only” the first 200 error messages per object), so the command that is normally used would be:

And, hopefully, this only returns the message “Command completed successfully”. Let’s modify this command to use the TABLERESULTS option:

If there is no corruption in the database, it still returns only the message “Command completed successfully”. However, if there is corruption, you will get a result set back. So, I’m now going to run this against a database (Lab) where I have engineered some corruption. First off, let’s run CHECKDB without the TABLERESULTS option to check the initial output:

Here we can see that we indeed have corruption. Quickly now… is any of this corruption in a non-clustered index?

Running this statement with the TABLERESULTS option, we get a grid of the results:


(This screen shot is only showing some of the columns. Note that the message of the previous output for the minimum repair level is still on the Messages tab.)

As you can see, this already makes examining your errors somewhat easier. For instance, you can easily scroll through this output to see if the corruption is in non-clustered indexes (IndexId > 1), where the corruption could easily be fixed by scripting out the index definition, dropping and then re-creating the non-clustered index. Suppose you had over 100 errors… you can see how much faster this would be.

If you recall, the BOL description says that this data can be loaded into a table for further processing. Furthermore, this post is about persisting DBCC output data, which implies storing it into a table. So, let’s make a table, put these results into the table, and then run a query against it. First off, let’s make a local temporary table to hold the results (you could put this into a permanent table if you so desire):

Since the output from using TABLERESULTS isn’t documented, I’ve had to make some assumptions about the data types for these columns. To actually insert the output, we need to use the INSERT INTO … EXECUTE statement:
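A sketch of both steps together; since the TABLERESULTS output is undocumented, this column list (roughly matching SQL 2008-era output) and the data types are guesses that may need adjusting for your version:

```sql
CREATE TABLE #CheckDbResults (
    Error        INT,
    [Level]      INT,
    [State]      INT,
    MessageText  VARCHAR(7000),
    RepairLevel  VARCHAR(250),
    [Status]     INT,
    DbId         INT,
    ObjectId     INT,
    IndexId      INT,
    PartitionId  BIGINT,
    AllocUnitId  BIGINT,
    [File]       INT,
    Page         INT,
    Slot         INT,
    RefFile      INT,
    RefPage      INT,
    RefSlot      INT,
    Allocation   INT
);

-- Capture the CHECKDB output into the table via INSERT INTO / EXECUTE.
INSERT INTO #CheckDbResults
EXECUTE ('DBCC CHECKDB (Lab) WITH NO_INFOMSGS, ALL_ERRORMSGS, TABLERESULTS;');
```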

Now that we have the data stored in a temporary table, let’s run a query against the table to return some aggregated data.

— Get the objects, # of errors, and repair level
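A sketch of such a query, assuming the results were captured into a #CheckDbResults temp table as described above:

```sql
SELECT  DbId,
        ObjectId,
        IndexId,
        RepairLevel,
        COUNT(*) AS ErrorCount
FROM    #CheckDbResults
GROUP BY DbId, ObjectId, IndexId, RepairLevel
ORDER BY DbId, ObjectId, IndexId;
```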

With which I get the following results:


So, there we go. Easy-peasy. By utilizing the TABLERESULTS option, the output of the DBCC CHECKDB command has been persisted into a table, and we are now able to run our own queries against that data. In the event that there is corruption in multiple indexes in a table, this query could easily be extended to get the number of errors in each index.

This post is re-published from my original post on SQL Solutions Group.


October… the month of Halloween. Spooky events. SQL Server has all kinds of Halloween-ish themed activities like Split Brains, Zombies, Ghosts, Phantoms, and of course Halloween itself. In this post, we’re going to mill about with Phantoms a bit.

Phantoms in SQL Server are actually called “Phantom Reads”. This ectoplasmic phenomenon manifests itself when an identical query, run multiple times within a single transaction, returns a different result set each time it is run. A Phantom Read is one of the transaction isolation level concurrency events. The read uncommitted, read committed and repeatable read transaction isolation levels may exhibit Phantom Reads; the serializable and snapshot transaction isolation levels are not susceptible to this phenomenon.

Let’s set up a demo to see this in action. First, let’s create our sample data:
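A minimal sketch of such sample data (the table and column names here are my assumptions, not the original script):

```sql
-- Sample table for the phantom read demo.
CREATE TABLE dbo.PhantomTest (
    RowID    INT         NOT NULL CONSTRAINT PK_PhantomTest PRIMARY KEY CLUSTERED,
    RowValue VARCHAR(20) NOT NULL
);

INSERT INTO dbo.PhantomTest (RowID, RowValue)
VALUES (1, 'Ghost'), (2, 'Goblin'), (3, 'Zombie');
```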

Now, let’s test phantom reads in the repeatable read transaction isolation level. Open the following 2 queries in separate query windows. Run Code Script 1, and within 10 seconds run Code Script 2.
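The two scripts might look something like this sketch (assuming a sample table dbo.PhantomTest with RowID and RowValue columns):

```sql
-- Code Script 1: read the table twice inside a repeatable read transaction,
-- then once more after committing.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
    SELECT RowID, RowValue FROM dbo.PhantomTest;  -- first result set
    WAITFOR DELAY '00:00:10';
    SELECT RowID, RowValue FROM dbo.PhantomTest;  -- second result set
COMMIT TRANSACTION;
SELECT RowID, RowValue FROM dbo.PhantomTest;      -- third result set

-- Code Script 2: run in a second query window within 10 seconds.
UPDATE dbo.PhantomTest
SET    RowValue = 'Werewolf'
WHERE  RowID = 2;
```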


The repeatable read transaction isolation level guarantees that rows already read within the transaction will not be updated or deleted by other transactions, so the update in Code Script 2 is blocked until the transaction in Code Script 1 has finished. You can see that the second result set is identical to the first one. Once the transaction completes, the update in Code Script 2 can run, and the third result set from Code Script 1 shows that the result set has now changed. However, the repeatable read transaction isolation level still allows inserts, which means that you can still have a phantom read. Let’s see this by changing Code Script 2 to:
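The revised Code Script 2 might be as simple as this sketch (same assumed table):

```sql
-- An insert is not blocked by repeatable read, so it completes immediately.
INSERT INTO dbo.PhantomTest (RowID, RowValue)
VALUES (4, 'Phantom');
```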

Run both code scripts again (Code Script 1 first, and Code Script 2 within 10 seconds). This time, you will see that Code Script 2 completes immediately without being blocked, and when Code Script 1 finishes, it has spawned additional data in its result set. A phantom read in action.

If you want to hide from phantom reads completely, then you’ll need to use either the serializable or snapshot transaction isolation levels. Both of these have the same concurrency effects: no dirty reads, non-repeatable reads, or phantom reads. The difference is in how they are implemented: the serializable transaction isolation level will block all other transactions affecting this data, while the snapshot isolation level utilizes row versions to create transaction-specific versions of the table – all of these row versions will cause increased activity in the tempdb database. Let’s take a look at how the snapshot isolation level will eradicate the phantom reads. First off, we need to modify the database to allow this isolation level.
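Assuming the demo database is named SandBox (substitute your own database name), the change is:

```sql
-- Enable snapshot isolation for the demo database.
ALTER DATABASE SandBox SET ALLOW_SNAPSHOT_ISOLATION ON;
```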

To see this in action, we will once again have two queries. Run the code from Code Script 3 in one query window, and within 10 seconds run the code from Code Script 4 in another query window.
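Sketched out (again assuming the dbo.PhantomTest table), the two scripts could be:

```sql
-- Code Script 3: read the table twice inside a snapshot transaction.
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
    SELECT RowID, RowValue FROM dbo.PhantomTest;  -- first result set
    WAITFOR DELAY '00:00:10';
    SELECT RowID, RowValue FROM dbo.PhantomTest;  -- second result set: unchanged
COMMIT TRANSACTION;

-- Code Script 4: run in a second window within 10 seconds. It completes
-- immediately, but its row is invisible to Code Script 3's transaction.
INSERT INTO dbo.PhantomTest (RowID, RowValue)
VALUES (5, 'Banshee');
```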

Notice that Code Script 4 was executed immediately; however, the row that it inserted was not returned in either result set from Code Script 3.

And now you know how to dodge phantoms.

This post is re-published from my original post on SQL Solutions Group.

Page Splits 101

In SQL Server, a page split occurs on index pages when a row is required to be on a certain page (because of the index key) and there isn’t enough room on the page for it. This can happen during either an insert operation or an update operation. When this occurs, that page is split into two pages, with roughly half of the rows of the original page on each of the two pages. The row is then put into the proper page. It is possible for a page split to cascade, causing pages at the higher, intermediate levels of the index to undergo page splits also. Furthermore, all of the page allocations and the data movement are logged in the transaction log.

Paul Randal has defined two different types of page splits. Paul calls the first type of page split a “good” page split, where the storage engine has to add a new page on the right hand side of the index’s leaf level. If you think of a table with an identity column, where the last page for the index for the identity column is too full to hold a new row, then adding a new row will allocate a new page so that the row can be inserted. Paul calls the second type of page split a “nasty” page split, which is when a row expands and the page doesn’t have enough space to hold the changed data, or if a new row needs to go on the page and there isn’t room for it.

In my book, the “good” page split isn’t really a page split; it’s just a new page allocation. However, it is deemed a page split in SQL Server, and it is the type of page split that we want to have happening.

Identity Columns

A recent SQLSkills newsletter has a discussion about a table that uses an integer identity column, and running out of values. Under normal usage (and the default unless otherwise specified) an identity column starts with the value of 1, and increments by 1. If you have enough rows where you exhaust the positive values, you need to do something so that your application will continue to work. Obviously, the best thing to do is to change this integer column into a bigint column. However, with over 2 billion rows in this table, a long maintenance window is needed to perform this conversion. What if you need to do something now, before this maintenance window? There are another 2+ billion negative values available for use in the integer data type, so we will use those.

In order to do this, the identity column needs to be changed. It can either be changed to start at the most negative value and increment by one, or start at -1 and be decremented by one. In other words, it can either be set to (-2147483648 , 1), or be set to (-1, -1).

Page Splits on Identity Columns

In considering which of these methods is preferred, we need to consider whether page splits impact these methods – especially nasty page splits. Furthermore, how will index maintenance affect each choice? So let’s think this through.

When there are negative values in this column and the index is rebuilt, there will be a page with both negative and positive values in it. If the identity column is set to (-1, -1), there won’t be a gap (excluding the 0) in the values, and newly added rows will get a new page allocated – a good page split. If the identity column is set to (-2147483648, 1), then that full page will contain both the records for the most recently used (negative) identity values and the values starting with 1 – a rather large gap.

When a new row is added, it will need to be added into the gap on this page, which will need to be split so that the new row can fit in before the value of 1. As more rows are added, the page with the value of 1 will again be used, and then need to be split again. As long as there are rows to be added, this cycle will continue. Therefore, considering the page splits, this choice seems to be a bad choice.

Well, that is my reasoning for how things will work. However, it’s best to test out your theory – especially if you are disagreeing with Paul. So, I will create a table with an identity column (set to 1, 1), and insert some rows. I will then change the identity column to the values to be tested, and add some more rows. I’ll throw in an index rebuild during this mix so that there will be a page with both the positive and negative values, and then insert more rows and see how this affects the page splits. With this in mind, the code to set up this test environment is as follows:

Reset the Identity Column

Next, the identity column needs to be reset to use the new seed / increment value – and this is where I ran into a problem with the test. While I can use DBCC CHECKIDENT to change the seed value, there is no way to change the increment value. Therefore, I can’t test my preferred method by changing the identity column to (-1, -1). However, I can simulate it by creating a table with the initial identity value at (-1, -1) and then using SET IDENTITY_INSERT to put in the positive values. Furthermore, I’ll need a second table to test the positive increment, so I’ll just create it with the most negative seed value and insert the positive values into it. The new code to setup the environment is:
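A sketch of that revised setup follows. The table names LotsaSplits1 and LotsaSplits2 come from the narrative, but the filler column size (chosen so roughly 7 rows fit per page) and the row counts are my assumptions:

```sql
-- LotsaSplits1 simulates the (-1, -1) option; LotsaSplits2 tests (-2147483648, 1).
CREATE TABLE dbo.LotsaSplits1 (
    RowID  INT IDENTITY(-1, -1) NOT NULL
        CONSTRAINT PK_LotsaSplits1 PRIMARY KEY CLUSTERED,
    Filler CHAR(1000) NOT NULL
);
CREATE TABLE dbo.LotsaSplits2 (
    RowID  INT IDENTITY(-2147483648, 1) NOT NULL
        CONSTRAINT PK_LotsaSplits2 PRIMARY KEY CLUSTERED,
    Filler CHAR(1000) NOT NULL
);

-- Manually insert the pre-existing positive identity values into both tables.
SET IDENTITY_INSERT dbo.LotsaSplits1 ON;
INSERT INTO dbo.LotsaSplits1 (RowID, Filler)
SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), REPLICATE('X', 1000)
FROM   sys.all_columns;
SET IDENTITY_INSERT dbo.LotsaSplits1 OFF;

SET IDENTITY_INSERT dbo.LotsaSplits2 ON;
INSERT INTO dbo.LotsaSplits2 (RowID, Filler)
SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), REPLICATE('X', 1000)
FROM   sys.all_columns;
SET IDENTITY_INSERT dbo.LotsaSplits2 OFF;

-- Inserting higher explicit values advances LotsaSplits2's current identity,
-- so reseed it back to the most negative value.
DBCC CHECKIDENT ('dbo.LotsaSplits2', RESEED, -2147483648);
```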

The next part of this test is to create a page that has both positive and negative values in it. Since this test is to test out the second option, I’ll do this just for the second table. This code will insert a few rows, and then rebuild the index on this table. Finally it will show how the data on the page looks.
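A sketch of that step (it assumes the LotsaSplits2 table and the PK_LotsaSplits2 index name from the setup, and uses the undocumented %%physloc%% column and sys.fn_PhysLocCracker function):

```sql
-- Add a few rows at the negative end, then rebuild so that one page ends up
-- holding both negative and positive values.
INSERT INTO dbo.LotsaSplits2 (Filler)
SELECT TOP (3) REPLICATE('X', 1000) FROM sys.all_columns;

ALTER INDEX PK_LotsaSplits2 ON dbo.LotsaSplits2 REBUILD;

-- Report the number of rows per page with the starting/ending RowID values.
SELECT  PageID     = loc.page_id,
        RowsOnPage = COUNT(*),
        MinRowID   = MIN(RowID),
        MaxRowID   = MAX(RowID)
FROM    dbo.LotsaSplits2
CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%) AS loc
GROUP BY loc.page_id
ORDER BY MinRowID;
```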


In this code, I use the undocumented virtual system column %%physloc%% to get the physical file/page/slot that a row is on, and then this binary value is cracked to return the actual file/page/slot. I then get the number of rows on this page, and the starting/ending values, and report by page. It can be seen that a page was created that will need to be split when more rows are inserted.

Checking for a page split

Continuing on, let’s add more rows to each table, and see how many page splits occurred. This will be performed for each table one at a time. This code and the results it produces is:
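One way to count the page splits is to read the transaction log with the undocumented fn_dblog function, since each page split logs an LOP_DELETE_SPLIT operation. A sketch, shown here for LotsaSplits2 (it assumes the database is in the simple recovery model, so that CHECKPOINT clears the log first):

```sql
CHECKPOINT;  -- clear the log (simple recovery model assumed)

INSERT INTO dbo.LotsaSplits2 (Filler)
SELECT TOP (50) REPLICATE('X', 1000) FROM sys.all_columns;

-- Each page split logs an LOP_DELETE_SPLIT record.
SELECT  PageSplits = COUNT(*)
FROM    sys.fn_dblog(NULL, NULL)
WHERE   [Operation] = N'LOP_DELETE_SPLIT';
```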


Now this is unexpected – there is only one page split on the LotsaSplits2 table, where I was expecting many more. Let’s look at what data is on these pages for these tables:


The second result set (for the identity column set at (-2147483648, 1)) shows that the page split occurred at the value of 1. If we can get the page to split where it starts with 1, then inserted rows won’t go onto this page, thus meaning that additional page splits won’t occur. It seems like I really got lucky here. Since the page can hold 7 rows, I repeated this test several times with between 1 and 6 filler rows, and it always split at the value of 1. The page split algorithm seems to be smart enough to realize that this gap exists, and that there will be more page splits occurring unless there is a page that starts with that value of 1. When I talked with Paul about this observed behavior, he replied:

You’ll get at most two nasty page splits. The Access Methods is smart enough on the second split to split at the +ve/-ve boundary and no more nasty page splits occur. Yes, it seems somewhat counter-intuitive until the observation about the smart page split.

So there we go – the page split algorithm is smart enough to look at the data and to try to prevent future page splits if possible. Very nice!

Did you notice earlier that LotsaSplits1 also showed a page split? My assumption is that since there is still a gap to hold a value of zero, that the page was used and was then split. I tested this by changing the setup code to use “N-1” instead, so that it will start with the value of zero. However, this still incurs a page split. Looking closely at Paul’s definition of a good page split, notice that he states that a good split is when adding a new page on the right hand side of the index’s leaf level. It so happens that we are adding to the left hand side. I still think that all that should be occurring is a new page allocation, yet these tests prove that internally SQL Server is performing a page split.