Transaction count during DML statements

 

Recently I was troubleshooting a blocking problem where a number of sessions were blocked, waiting to acquire a page latch. While blocking was occurring, there were several rows in the output from sysprocesses that looked like this (only relevant columns are shown):

spid   status     blocked open_tran waitresource  cmd     lastwaittype
------ ---------- ------- --------- ------------- ------- -------------
1001   suspended  1294    2         8:6:792624    INSERT  PAGELATCH_UP

In this case, session 1001, executing an INSERT statement, is waiting to acquire a latch on page 792624, which happens to be a PFS page (792624 / 8088 = 98, a whole number of PFS intervals). While this may provide a clue as to the cause of blocking, this is not the main topic of this post.
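
As a quick check, a page is a PFS page when its page number is an exact multiple of the PFS interval of 8,088 pages (page 1 is the first PFS page in each data file). A minimal T-SQL sketch that verifies this for the page from the wait resource above:

DECLARE @PageID int = 792624;  -- page number from waitresource 8:6:792624

SELECT CASE WHEN @PageID % 8088 = 0
            THEN 'PFS page'
            ELSE 'not a PFS page'
       END AS PageType;  -- returns 'PFS page'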

Note that the value in the open_tran column is 2. The open_tran column is described in documentation as the “number of open transactions for the process.” The intuitive conclusion from this is that session 1001 has two explicit transactions open, one nested in the other. However, this system uses stored procedures exclusively, and a review of all stored procedures that insert rows did not find any code that used explicit nested transactions.

After some research, I found that explicit nested transactions are not the only reason why the transaction count can be greater than 1 during execution of a DML statement. Consider the following code fragment:

CREATE TABLE T1
(
    Col1 int
);
GO

INSERT INTO T1
SELECT @@TRANCOUNT;

SELECT Col1 FROM T1;

UPDATE T1 SET
    Col1 = @@TRANCOUNT
WHERE Col1 = 2;

SELECT Col1 FROM T1;

DELETE
FROM T1
WHERE Col1 = @@TRANCOUNT;

SELECT Col1 FROM T1;

Here are the results, with comments added after the output from each statement:

(1 row(s) affected)
^^ INSERT statement ^^

Col1
-----------
2

(1 row(s) affected)
^^ first SELECT statement ^^

(1 row(s) affected)
^^ UPDATE statement ^^

Col1
-----------
2

(1 row(s) affected)
^^ second SELECT statement ^^

(1 row(s) affected)
^^ DELETE statement ^^

Col1
-----------

(0 row(s) affected)
^^ third SELECT statement ^^

This shows that during execution of a DML statement that is not within any explicit transaction, there are actually two open transactions reported. The results are the same if, instead of @@TRANCOUNT, we use the open_tran column from sysprocesses, or the open_transaction_count column from the sys.dm_exec_requests DMV. Effectively, in addition to the one transaction always associated with any DML statement, there is another nested transaction opened internally by SQL Server, lasting for the duration of the statement’s execution. This behavior occurs on all recent versions of SQL Server, starting with SQL Server 2000 (I did not test on older versions).

To be clear, the second transaction is open only while a DML statement is executing. The @@TRANCOUNT function (as well as sysprocesses and sys.dm_exec_requests) behaves as expected if used in a non-DML statement in procedural T-SQL code, which is the typical use case.
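
For instance, the following snippet (a minimal sketch run outside of any DML statement) reports the values one would intuitively expect:

SELECT @@TRANCOUNT;  -- 0: no transaction is open between statements

BEGIN TRANSACTION;
SELECT @@TRANCOUNT;  -- 1: one explicit transaction is open
COMMIT TRANSACTION;

SELECT @@TRANCOUNT;  -- 0 again after the commit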

So, as a practical matter, if you see a reported open-transaction count that is greater than expected, consider the context in which that number was obtained before concluding that it must be due to explicit nested transactions being used.


Disjoint subtyping in SQL

 

Disjoint subtyping is a scenario that is often encountered in data modeling. In one frequently used modeling approach, an entity of a certain type is represented by a database table, and each subtype of this entity is represented by another table. Subtyping is disjoint if an instance of a type corresponds to at most one instance of a subtype. For example, we may have a table named Animal, and three other tables named Extinct, Living, and Mythical (perhaps some would argue that these are not really disjoint, but let’s ignore this for now). In this example, these three tables represent entities that are disjoint subtypes of the Animal type.

When implementing disjoint subtyping in SQL, it is necessary to enforce the rule that for each row in the type table, there is one related row in at most one subtype table. It is possible to implement this rule declaratively by creative use of foreign key and check constraints, as described by David Portas in his blog.

The other day it occurred to me that this implementation can be simplified a little. Rather than adding a single-valued subtype column to each subtype table, each with its own check constraint and default constraint, one could use SQL Server computed columns. This way there is no need to declare the check and default constraints, and the column can be ignored for all intents and purposes, while the disjointness data integrity rule is still enforced.

With this modification, David’s schema becomes as follows (the change is the computed ProductType column in the Books table):

CREATE TABLE Products
 (SKU INT NOT NULL PRIMARY KEY,
  ProductType CHAR(1) NOT NULL
  CHECK (ProductType IN ('B','C','D' /* Book, CD or DVD */)),
  Title VARCHAR(50) NOT NULL,
  UNIQUE (SKU,ProductType));

CREATE TABLE Books
 (SKU INT NOT NULL PRIMARY KEY,
  ProductType AS ISNULL(CAST('B' AS CHAR(1)), 'B') PERSISTED,
  Pages SMALLINT NOT NULL,
  FOREIGN KEY (SKU,ProductType) REFERENCES Products (SKU,ProductType));

(I’m omitting the CDs and DVDs tables for brevity.)

Note that the computed column still needs to be persisted – SQL Server will refuse to create the foreign key otherwise – so this approach can only be used with SQL Server 2005 and later. I also had to explicitly cast the column to match the data type of the referenced column. The ISNULL() function makes the column not nullable in the table’s metadata. Since the column actually cannot have NULLs, this might avoid some confusion.
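
To see the constraint at work, here is a small test with hypothetical sample values (not from David's post): a row can be recorded in Books only if the corresponding Products row has ProductType 'B', so the same SKU cannot be registered under a different subtype.

INSERT INTO Products (SKU, ProductType, Title) VALUES (1, 'B', 'Some Book');
INSERT INTO Books (SKU, Pages) VALUES (1, 250);  -- succeeds: Products row 1 is a Book

INSERT INTO Products (SKU, ProductType, Title) VALUES (2, 'C', 'Some CD');
INSERT INTO Books (SKU, Pages) VALUES (2, 100);  -- fails: the computed ProductType 'B'
                                                 -- does not match Products row 2 ('C'),
                                                 -- so the foreign key is violated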

Update 2009-10-19: Here is a Connect suggestion to add that support in T-SQL.


Surrogate keys in distributed databases

 

In this post, the term “distributed database” refers to a set of SQL Server databases, each managed by a SQL Server instance running on a separate computer. All databases have identical schemas, and data that originates in one database is replicated to all other databases, or nodes. A common example would be a system with a central database server and a number of remote machines used by field personnel, each with a local database replica. Data replication in such a system can be implemented using a number of techniques, for example, merge replication, peer-to-peer replication, or Sync Services for ADO.NET.

A common problem arising during the design of such distributed databases is surrogate key generation. Keys must be unique across the entire distributed database, rather than unique just within a particular database node. A common approach used in practice is to use GUID columns as keys. While straightforward and simple from a developer’s perspective, this approach has a big disadvantage: the random nature of GUIDs quickly leads to extensive index fragmentation in the database. Additionally, GUID keys are four times larger than integer keys, leading to a corresponding increase in index size.

An alternative approach is to use compound two-column keys: one column identifies the database node, while the other column identifies a row within a table on that node. The combination of the two columns creates a key that is unique across the distributed database. This works reasonably well; however, compound keys can be somewhat awkward in practice: for example, a table with multiple foreign keys that reference such compound keys will have twice the number of foreign key columns. Storage issues aside, the likelihood of developer confusion and errors is also higher with this approach.
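
For comparison, a compound-key design might look like the following sketch (table and column names are illustrative): the node identifier is part of every key, and every foreign key has to repeat both columns.

CREATE TABLE dbo.Customers
(
    DBNodeID int NOT NULL,
    CustomerID int NOT NULL,
    Name varchar(50) NOT NULL,
    CONSTRAINT pkCustomers PRIMARY KEY (DBNodeID, CustomerID)
);

CREATE TABLE dbo.Orders
(
    DBNodeID int NOT NULL,
    OrderID int NOT NULL,
    CustomerDBNodeID int NOT NULL,  -- two foreign key columns per referenced table
    CustomerID int NOT NULL,
    CONSTRAINT pkOrders PRIMARY KEY (DBNodeID, OrderID),
    CONSTRAINT fkOrdersCustomers FOREIGN KEY (CustomerDBNodeID, CustomerID)
        REFERENCES dbo.Customers (DBNodeID, CustomerID)
);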

In this post, I will present a method of generating single column keys that avoids these problems. The method is based on combining multiple values of smaller numeric data types into a single value of a larger numeric data type.

We have two values to be combined: the database node identifier (DBNodeID), and the row identifier for a particular table on a particular database node (RowID). Let’s assume that both are integer values, which would not be unreasonable in practice. Each integer uses four bytes of storage, so to combine the two values without loss of information, we need eight bytes. We will use bigint as the data type for the combined value, which does require eight bytes of storage. To combine two integer values into one bigint value we will use a technique called bit shifting.

Here’s an example. Let’s say we need to pack DBNodeID 2 and RowID 14 into a single bigint value. In bit representation, these two values appear as follows:

DBNodeID (2): 00000000 00000000 00000000 00000010

RowID   (14): 00000000 00000000 00000000 00001110

Using bit shifting, we shift the bits of the first integer to the left, into the two high words of the bigint, and use the bits of the second integer for the two low words of the bigint. Here’s the result:

DistributedRowID: 00000000 00000000 00000000 00000010 00000000 00000000 00000000 00001110

In decimal, this corresponds to 8589934606 – a single bigint value that can be used as a key value for a row. This method will generate values that are derived from both DBNodeID and RowID values, and are guaranteed to be unique across the distributed database. In a sense, this method is similar to the two-column compound key approach mentioned earlier, with the advantage that only one key column is needed.

So how can we implement this bit shifting operation in SQL Server? Left-shifting a value by N bits is equivalent to multiplying that value by 2^N. This means that in order to pack the DBNodeID integer into the two high words of a bigint, we need to multiply it by 2^32 (there are 32 bits to represent an integer). Once the DBNodeID integer is packed in the two high words of a bigint, we add the second integer (RowID) to the result to obtain the key value.  Here’s a T-SQL example (assuming SQL Server 2008 from here on):

DECLARE @DBNodeID int = 2;
DECLARE @RowID int = 14;

SELECT @DBNodeID * POWER(CAST(2 AS bigint), 32) + @RowID AS DistributedRowID;
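
If needed, the two original values can be recovered from the combined key by reversing the operation: integer division by 2^32 gives back the node identifier, and the remainder gives back the row identifier (a sketch, assuming non-negative identifiers):

DECLARE @DistributedRowID bigint = 8589934606;

SELECT @DistributedRowID / POWER(CAST(2 AS bigint), 32) AS DBNodeID,  -- 2
       @DistributedRowID % POWER(CAST(2 AS bigint), 32) AS RowID;     -- 14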

There are multiple ways to implement this approach in a particular database design. One is to have a table, replicated to each database node, to be used as a key generator for all tables that have distributed surrogate keys. Switching to T-SQL again:

CREATE TABLE dbo.KeyGenerator
(
    DBNodeID int NOT NULL,
    TableName sysname NOT NULL,
    RowID int NOT NULL,
    DistributedRowID AS ISNULL((DBNodeID * POWER(CAST(2 AS bigint), 32) + RowID), 0),
    CONSTRAINT pkKeyGenerator PRIMARY KEY (DBNodeID, TableName)
);

We can populate this table with data for a hypothetical three-node two-table distributed database:

INSERT INTO dbo.KeyGenerator (DBNodeID, TableName, RowID)
VALUES
(1, 'Table1', 1),
(2, 'Table1', 1),
(3, 'Table1', 1),
(1, 'Table2', 1),
(2, 'Table2', 1),
(3, 'Table2', 1);

SELECT * FROM dbo.KeyGenerator produces the following (note the computed key values in the last column):

DBNodeID    TableName  RowID       DistributedRowID
----------- ---------- ----------- --------------------
1           Table1     1           4294967297
1           Table2     1           4294967297
2           Table1     1           8589934593
2           Table2     1           8589934593
3           Table1     1           12884901889
3           Table2     1           12884901889

If an application needs to insert a row into Table2 on database node 1, it can run the following UPDATE query to obtain the key value for the new row, and increment the corresponding RowID value in a single statement:

DECLARE @NewDistributedKey bigint;

UPDATE dbo.KeyGenerator SET
    RowID += 1,
    @NewDistributedKey = DistributedRowID
WHERE DBNodeID = 1 AND TableName = 'Table2';

SELECT @NewDistributedKey;

The selected value is 4294967297. Executing SELECT * FROM dbo.KeyGenerator one more time produces this result:

DBNodeID    TableName  RowID       DistributedRowID
----------- ---------- ----------- --------------------
1           Table1     1           4294967297
1           Table2     2           4294967298
2           Table1     1           8589934593
2           Table2     1           8589934593
3           Table1     1           12884901889
3           Table2     1           12884901889

Note that the RowID and DistributedRowID values for the second row have been incremented by 1, so the next time the UPDATE query is executed, it will obtain 4294967298 as the next key value for the Table2 table.
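
In a real application, this UPDATE would typically be wrapped in a stored procedure so that callers obtain a new key in one call. Here is a minimal sketch (the procedure name and parameters are illustrative, not part of the original design):

CREATE PROCEDURE dbo.GetNextDistributedKey
    @DBNodeID int,
    @TableName sysname,
    @NewDistributedKey bigint OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    -- Increment the row counter and capture the generated key in one statement
    UPDATE dbo.KeyGenerator SET
        RowID += 1,
        @NewDistributedKey = DistributedRowID
    WHERE DBNodeID = @DBNodeID AND TableName = @TableName;
END;
GO

-- Example call on database node 1:
DECLARE @Key bigint;
EXEC dbo.GetNextDistributedKey @DBNodeID = 1, @TableName = N'Table2',
    @NewDistributedKey = @Key OUTPUT;
SELECT @Key AS NewKey;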

In summary, using the approach presented in this post, you can implement a distributed database system that uses single-column numeric surrogate keys, instead of the widely used but problematic GUIDs or the more cumbersome compound keys.

© 2019 Microsoft. All rights reserved.