What were you thinking?
Have you ever been working on some code from another person, and you end up scratching your head wondering “What were you thinking”? Maybe even thinking this with some “#$%^&*” thrown in? I recently ran into a hidden relationship that made me do exactly this.
Hidden relationships
I was recently working with a client to reverse-engineer a report from a third-party’s software so that it could be exported to an Excel spreadsheet. This report has parent and child data in it, and I was able to find all the data – and it was all in one table. It took me a while to figure out the relationship between the parent row and the child rows. I finally figured it out … the third-party has a column with obfuscated data that has the ID numbers of the related child records. This post will show you why this is not a good idea, and what a better solution is.
For rows of a certain type, a column would be filled in with something like “0300030000000000040000000000C6440F000000000000000000000000000000000000000000000000000000000000000000”. The first four characters represent the number of “rows” in this relationship. The rest of this string is 8 sets of 12 characters:
030000000000
040000000000
C6440F000000
000000000000
000000000000
000000000000
000000000000
000000000000
The “rows” with all zero’s are not used. From the report generated by the application, I could see that for this parent row it has child rows with ID #’s 3, 4 & 1000646 (determined by looking at the data in the table).
Cracking the hidden relationship
The first two rows have the appearance of being for ID #’s 3 and 4. The third row looks like a hexadecimal representation of a number. Entering this into calculator.exe and converting hex to decimal did not give the proper ID #, so I needed to figure out how this would work.
For the next attempt at cracking this value, I simply reversed the string. Nope, that didn’t work.
Failing that, I decided to reverse this string two bytes at a time. That is, the reversed string becomes “0000000F44C6”. When entered into calculator.exe and converted from hex to decimal, this did generate the correct value. Tests of several other records shows that this process seems to work for all the parent rows.
The next step is to generate a query to do all of this for me. Using a series of CROSS APPLY operators, I was able to extract the 8 sets of characters using the SUBSTRING function. Next, each of those are byte-reversed, converted to varbinary, then to an integer. Joining this number back to the ID number produces the necessary results. This looks something like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
SELECT <columns> FROM <main table> -- 8 substrings to get the 8 12 character sets CROSS APPLY (VALUES (SUBSTRING(<column>, 5, 12)), (SUBSTRING(<column>), 17, 12)), (SUBSTRING(<column>), 29, 12)), (SUBSTRING(<column>), 41, 12)), (SUBSTRING(<column>), 53, 12)), (SUBSTRING(<column>), 65, 12)), (SUBSTRING(<column>), 77, 12)), (SUBSTRING(<column>), 89, 12))) ca1(ID) -- byte-reverse the string CROSS APPLY (VALUES (SUBSTRING(ca1.ID, 11, 2) + SUBSTRING(ca1.ID, 9, 2) + SUBSTRING(ca1.ID, 7, 2) + SUBSTRING(ca1.ID, 5, 2) + SUBSTRING(ca1.ID, 3, 2) + SUBSTRING(ca1.ID, 1, 2))) ca2(rID) -- convert to varbinary CROSS APPLY (VALUES (CONVERT(VARBINARY, '0x' + ca2.rID, 1))) ca3(vbID) -- then convert integer CROSS APPLY (VALUES (CONVERT(INTEGER, ca3.vbID))) ca4(iD) -- finally, join back to the table to get the child records JOIN <main table> b2 ON b2.ID = ca4.iID |
Why this isn’t such a good method
Earlier, I promised to talk about why this isn’t a good idea. In fact, this is a horrible way to do this, for several reasons:
Firstly, use a foreign-key relationship to enforce the parent-child relationship. That is impossible to do in this method.
Secondly, because there is not a foreign key relationship, a child record could be deleted without updating the parent record to reflect it.
Thirdly, this method requires the application maintains the referential integrity of the parent-child relationship, which involves updating the parent row with what the child rows are for it.
Finally, optimization of the query is not possible. As implemented, the application needs to retrieve the parent rows. Then it needs to crack this obfuscated record ID, and get the child records. The application retrieves the data in a row-by-agonizing-row method from the database.
A better, normalized method
A proper implementation is to create a table with a many-to-many relationship (called M-M table for this post). This M-M table would have just two columns; one holds the ID number for the parent table, and the other holding the ID number for each child record.
Next, we take a look at referential integrity. This M-M table should have a PRIMARY KEY constraint on these two columns, which will both prevent duplicate child records for each parent record, and give an index into this table to optimize the query. A foreign key relationship from the primary table to this table (on the parent column), then another foreign key from this table (on the child column) back to the primary table, enforces the referential integrity.
Finally, a simple INNER JOIN between the main table and the M-M table on the parent row ID number retrieves the data. Then the query performs another INNER JOIN from the M-M table back to the main table on the child record ID numbers. This would look like:
1 2 3 4 |
SELECT <columns> FROM <main table> t1 JOIN <M-M table> m1 ON t1.ID = m1.ParentID JOIN <main table> t2 ON t2.ID = m1.ChildID; |
Which would you rather see and work with???
Why wasn’t this method used? I can think of a couple of reasons:
- There is a misguided belief among many developers that foreign key relationships in SQL Server are slow and cause performance issues. This just isn’t true.
- Processes prevented the developer from creating tables, and this was an alternative to creating the table.
The wrap up
Just because you can do something doesn’t mean that you should do it. I’ve already gone over the reasons for why this isn’t a good idea. Just keep it simple.