SQL Server 2012 Performance Test: Gap Detection

2012-04-18 Wayne Sheffield 2 comments

Next up on the SQL Server 2012 Performance testing circuit: Gap Detection. You know, where you find missing values from a table of sequential numbers.

SQL Server 2012 introduces the LEAD function, which I’ve previously blogged about here. In that blog post, I covered how to do Gap Detection using the LEAD function. Now, it’s time to put it to the performance test.

The Test

For this test, I want a large table, with at least a million rows. Obviously, gaps are needed. We need to start the table off with a gap. There needs to be gaps throughout the table. Some as small as one row, and others need to encompass many rows. Each method to test needs to identify the starting point and ending point of a gap.

The table was created with the following code:

IF OBJECT_ID('dbo.GapTest','U') IS NOT NULL DROP TABLE dbo.GapTest;
CREATE TABLE dbo.GapTest (N BIGINT CONSTRAINT PK__GapTest PRIMARY KEY CLUSTERED);

-- start with a number to make a large gap at the start of the table
INSERT INTO dbo.GapTest (N) VALUES (11000000);  

-- insert rows to make 2 million
WITH
TENS      (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
                  SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),
THOUSANDS (N) AS (SELECT 1 FROM TENS t1 CROSS JOIN TENS t2 CROSS JOIN TENS t3),
MILLIONS  (N) AS (SELECT 1 FROM THOUSANDS t1 CROSS JOIN THOUSANDS t2),
TenMillion(N) AS (SELECT 1 FROM MILLIONS CROSS JOIN TENS),
TALLY     (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM TenMillion)
INSERT INTO dbo.GapTest (N)
SELECT TOP (1999999) (82011000000000) + N
  FROM TALLY;

-- delete some records to make gaps
DELETE dbo.GapTest
 WHERE (N BETWEEN 82011000402000 AND 82011000500000
        AND N % 2000 = 0)  -- delete one row every 2000 rows
    OR (N BETWEEN 82011000600001 AND 82011000700000); -- delete this larger gap of 100,000 rows

IF OBJECT_ID('dbo.GapTest','U') IS NOT NULL DROP TABLE dbo.GapTest;

CREATE TABLE dbo.GapTest (N BIGINT CONSTRAINT PK__GapTest PRIMARY KEY CLUSTERED);

-- start with a number to make a large gap at the start of the table

INSERT INTO dbo.GapTest (N) VALUES (11000000);

-- insert rows to make 2 million

WITH

TENS (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),

THOUSANDS (N) AS (SELECT 1 FROM TENS t1 CROSS JOIN TENS t2 CROSS JOIN TENS t3),

MILLIONS (N) AS (SELECT 1 FROM THOUSANDS t1 CROSS JOIN THOUSANDS t2),

TenMillion(N) AS (SELECT 1 FROM MILLIONS CROSS JOIN TENS),

TALLY (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM TenMillion)

INSERT INTO dbo.GapTest (N)

SELECT TOP (1999999) (82011000000000) + N

FROM TALLY;

-- delete some records to make gaps

DELETE dbo.GapTest

WHERE (N BETWEEN 82011000402000 AND 82011000500000

AND N % 2000 = 0) -- delete one row every 2000 rows

OR (N BETWEEN 82011000600001 AND 82011000700000); -- delete this larger gap of 100,000 rows

This code starts off with a starting gap of eleven million. Next, more records are added to this to bring the table up to two million rows. Finally, many records are deleted, giving many gaps of one row, and an additional large gap of 100,000 rows.

The Contenders

The first method to test is the LEAD function. The code for doing this is very simple, and hopefully, extremely efficient and fast:

WITH cte AS
(
SELECT curr = N,
       nxt  = LEAD(N, 1) OVER (ORDER BY N)
  FROM dbo.GapTest
)
SELECT [Start Of Gap] = curr+1,
       [End Of Gap] = nxt-1
  FROM cte
 WHERE nxt-curr > 1;
GO

WITH cte AS

(

SELECT curr = N,

nxt = LEAD(N, 1) OVER (ORDER BY N)

FROM dbo.GapTest

)

SELECT [Start Of Gap] = curr+1,

[End Of Gap] = nxt-1

FROM cte

WHERE nxt-curr > 1;

The second contender is the method that Itzik Ben-Gan published in SQL Server MVP Deep Dives:

SELECT StartRange = N+1,
       EndRange = (SELECT min(lt2.N)
                     FROM dbo.GapTest lt2
                    WHERE lt2.N > lt1.N) - 1
  FROM dbo.GapTest lt1
 WHERE NOT EXISTS
       (SELECT N
          FROM dbo.GapTest lt3
         WHERE N = lt1.N+1)
   AND N < (SELECT max(N) FROM dbo.GapTest lt4);

SELECT StartRange = N+1,

EndRange = (SELECT min(lt2.N)

FROM dbo.GapTest lt2

WHERE lt2.N > lt1.N) - 1

FROM dbo.GapTest lt1

WHERE NOT EXISTS

(SELECT N

FROM dbo.GapTest lt3

WHERE N = lt1.N+1)

AND N < (SELECT max(N) FROM dbo.GapTest lt4);

The third contender is the method that I learned from Jeff Moden on SQLServerCentral.com:

SELECT GapStart = ISNULL((SELECT MAX(lo.N+1)
                      FROM dbo.GapTest lo
                     WHERE lo.N < hi.N),1),
        GapEnd = hi.N - 1
   FROM dbo.GapTest hi
  WHERE hi.N NOT IN (SELECT N + 1 FROM dbo.GapTest)
    AND hi.N > (SELECT MIN(N) FROM dbo.GapTest);

SELECT GapStart = ISNULL((SELECT MAX(lo.N+1)

FROM dbo.GapTest lo

WHERE lo.N < hi.N),1),

GapEnd = hi.N - 1

FROM dbo.GapTest hi

WHERE hi.N NOT IN (SELECT N + 1 FROM dbo.GapTest)

AND hi.N > (SELECT MIN(N) FROM dbo.GapTest);

Looking at the code from Jeff and Itzik, you can see that they are very similar, though they do have their differences. Notably, they each have three sub-queries against the table, so it looks like the Reads will be higher than what the LEAD function will have. So, let the battle of the performance gurus vs. SQL Server 2012 begin!

Running the Test

Each of the contenders code is put into a stored procedure, and modified to dump the results to a temporary table instead of returning the output to the screen. The following code is used to generate IO and Time statistics, and to run each method ten times:

SET STATISTICS IO,TIME ON;
GO
EXECUTE dbo.GapTest_LEAD;
EXECUTE dbo.GapTest_Itzik;
EXECUTE dbo.GapTest_Jeff;
GO 10
SET STATISTICS IO,TIME OFF;
GO

SET STATISTICS IO,TIME ON;

EXECUTE dbo.GapTest_LEAD;

EXECUTE dbo.GapTest_Itzik;

EXECUTE dbo.GapTest_Jeff;

GO 10

SET STATISTICS IO,TIME OFF;

The Results

The statistics returned were made into a table and copied into Excel where a line chart was made showing how well each performed. (Units are milli-seconds.)

Well, right off I can see that LEAD does not perform as fast and efficient as I had hoped for. The methods from Itzik and Jeff are extremely close – a virtual tie. So, I took the average of the ten runs and charted that as a bar graph:

Here you can see that they are indeed extremely close – a difference of only 16 milli-seconds. The code from either one of these performance gurus runs 3.5 times faster than what the LEAD function from SQL Server 2012 gives us.

The other statistics are interesting: The LEAD function has 1 scan with 4014 logical reads; the gurus both have 55 scans with 8189 logical reads. So, my guess about more reads was dead on. Unfortunately, this did not lend itself to a performance improvement.

The Wrap Up

Well, this is pretty obvious. The code from either Itzik or Jeff will run astonishingly fast. They are so close, that I won’t even declare one of them a winner. Use either one.

The Invitation

If you have some code that you think is a contender for these, please post a reply with it. I’ll add it to the test suite, and update this post to show how it fares if it’s close or better than what these methods from Itzik or Jeff do.

Performance, SQL 2012 none

#1 | Written by Jeff Moden about 14 years ago. Reply

Nice testing, Wayne. This is the second time (that I know of) that you’ve demonstrated that some SQL “Black Arts” code beats the new functionality offered in 2012. Still (and trying to be realistic), the LEAD code is much simpler than either Itzik’s or mine which will help even neophytes to SQL to avoid the RBAR that many would use to solve such a problem. To top it off, LEAD only loses by ~2.5 seconds over nearly 2 million rows making it a viable alternative (although I’ll personally still use the “Black Arts” code 😉 ).

Loading...
- #2 | Written by Wayne Sheffield about 14 years ago. Reply
  
  You’re absolutely correct Jeff. The difference is that the last post was attempting to use a function to truncate a date, which that function wasn’t designed to do. This time around, the LEAD function is designed to get values from the next row(s), so I expected this to be very fast. I’m surprised that the “Black Arts” code beat this by as much as it did – I expected it to be competitive. However, as you say, I did process 2 million rows in under 4 seconds, so it is better than many other solutions out there.
  
  Loading...

Wayne Sheffield

Recent Posts

Top Posts

Archives

Categories

Blogroll

SQL Server 2012 Performance Test: Gap Detection

The Test

The Contenders

Running the Test

The Results

The Wrap Up

The Invitation

Like this:

Related

Leave a Reply to let me know how you liked this postCancel reply

Wayne Sheffield

Recent Posts

Top Posts

Archives

Categories

Tags

Blogroll

SQL Server 2012 Performance Test: Gap Detection

The Test

The Contenders

Running the Test

The Results

The Wrap Up

The Invitation

Share this:

Like this:

Related

Leave a Reply to let me know how you liked this postCancel reply