This post is part of the series discussing the new Analytic functions in SQL Server “Denali”, CTP3.
Analytic Functions in SQL Server “Denali”
All of the new Analytic functions require the use of the OVER clause. To see what changes to the OVER clause that “Denali” brings, please view my article The OVER Clause enhancements in SQL Server code named “Denali”, CTP3 on SQLServerCentral.com.
Introducing the PERCENTILE_CONT and PERCENTILE_DISC functions
Rounding out the Analytic Functions series are the PERCENTILE_CONT and PERCENTILE_DISC functions. The PERCENTILE_CONT function calculates a percentile based on a continuous distribution of the column’s value. The PERCENTILE_DISC function returns the column’s value for the smallest CUME_DIST value that is greater than or equal to the specified percentile value. For the PERCENTILE_CONT function, the value returned may or may not exist in the column being analyzed; for the PERCENTILE_DISC function the value returned will belong to one of the rows in the partition. Thus, they may or may not return the same value. The syntax for these functions is:
1 2 3 |
PERCENTILE_CONT ( numeric_literal ) WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] ) OVER ( [ <partition_by_clause> ] ) |
The first thing that you immediately notice (well, at least the first thing that I immediately noticed) is the WITHIN GROUP clause. This clause specifies a list of numeric values to sort and compute the percentile over. Note that only one order_by_expression is allowed. When using the PERCENTILE_CONT function, it must evaluate to one of the exact or approximate numeric data types.
Within the OVER clause, the PARTITION BY is optional, and the ORDER BY and ROW|RANGE clauses are not supported.
The numeric_literal is the percentile to compute, and it must be within the range of 0.0 to 1.0.
For both functions, NULL values are ignored.
Okay, let’s make an employee table, with employees assigned to a department and what each employee’s salary is. We’ll return each employee, along with the median salary in the department and the median salary in the entire table. The medians will be calculated by using both functions, and we’ll pass in a percentile of 0.5 (50%). For good measure, I’m going to include the AVG and CUME_DIST values also.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
DECLARE @test TABLE ( EmplId INT PRIMARY KEY CLUSTERED, DeptId INT, Salary NUMERIC(8,2)); INSERT INTO @test VALUES (1, 1, 10000), (2, 1, 11000), (3, 1, 12000), (4, 2, 25000), (5, 2, 35000), (6, 2, 100000), (7, 2, 100000); SELECT EmplId, DeptId, Salary, PctC1 = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Salary ASC) OVER (PARTITION BY DeptId), PctD1 = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Salary ASC) OVER (PARTITION BY DeptId), CD1 = CUME_DIST() OVER (PARTITION BY DeptId ORDER BY Salary), AVG1 = AVG(Salary) OVER (PARTITION BY DeptId), PctC2 = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Salary ASC) OVER (), PctD2 = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Salary ASC) OVER (), CD2 = CUME_DIST() OVER (ORDER BY Salary ASC), AVG2 = AVG(Salary) OVER (ORDER BY Salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM @test ORDER BY DeptId, EmplId; |
And this returns this result set:
1 2 3 4 5 6 7 8 9 |
EmplId DeptId Salary PctC1 PctD1 CD1 AVG1 PctC2 PctD2 CD2 AVG2 ------ ------ --------- ----- -------- ---- ----- ----- -------- ---- ----- 1 1 10000.00 11000 11000.00 0.33 11000 25000 25000.00 0.14 41857 2 1 11000.00 11000 11000.00 0.66 11000 25000 25000.00 0.28 41857 3 1 12000.00 11000 11000.00 1 11000 25000 25000.00 0.42 41857 4 2 25000.00 67500 35000.00 0.25 65000 25000 25000.00 0.57 41857 5 2 35000.00 67500 35000.00 0.5 65000 25000 25000.00 0.71 41857 6 2 100000.00 67500 35000.00 1 65000 25000 25000.00 1 41857 7 2 100000.00 67500 35000.00 1 65000 25000 25000.00 1 41857 |
The PctC# columns utilize the PERCENTILE_CONT function, and the PctD# columns utilize the PERCENTILE_DISC function. The columns ending with “1” are partitioned by the Department, and the columns ending in “2” are across the entire table.
Notice that for Dept 1, that the PctC1 and PctD1 columns both return the same value, and that this value is one of the Salary values for this department. However, when we take a look at Dept 2, we see that the PctC1 column has a value that is the median of the four salaries, and it is a value that is not one of the salaries. The PctD1 column returns the value from the first row where the CUME_DIST is >= 0.5, and in this case it is the second row. But, where is that value in the PctC1 column coming from? After a bit of searching on the internet, I found this article that explains how this function is implement in Oracle. In quoting from it:
“The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+ (P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).”
Okay, so now I can see that for Department 2, that it is getting the Salary values for the two middle rows (35,000 and 100,000), adding them together and then dividing by 2, resulting in 67,500.
I don’t know if I’ll ever use these functions, but at least now I understand what they do. If you have a real-world use for these, please post a comment explaining it.
This concludes this series on the new Analytic Functions presented by SQL Server “Denali”.
References:
PERCENTILE_CONT
PERCENTILE_DISC
PERCENTILE_CONT (Oracle)
Analytic Function Series:
Overview
LAG/LEAD
FIRST_VALUE/LAST_VALUE
CUME_DIST/PERCENT_RANK