Wednesday, May 4, 2011

Quirks of Distributed Transactions (MSDTC) and SQL Server

By Alejandro Villarreal

Distributed transactions can be of great help when dealing with complex operations that must be atomic across servers, but then again for their very nature –distributed– they can be hard to debug when something fails. A good example of this is an error I encountered recently. Here’s a bit of background:

We have a Web Application hosted in Server01 sending messages to a WCF Service through an MSMQ endpoint (we’re using transactional queues to leverage their reliability), and this Service saves the content of the messages it receives in a SQL Server database hosted in a remote server (Server02). The queue in Server01 plus the database in Server02 make this a distributed transaction, and that’s why we need MSDTC in the first place.

The whole setup worked perfectly… until suddenly it just didn’t anymore. The WCF Service started throwing the following error:

System.Data.SqlClient.SqlException: Warning: Fatal error 8510 occurred at May 21 2010 9:50AM. Note the error and time, and contact your system administrator.

A severe error occurred on the current command. The results, if any, should be discarded.

Fatal error with almost no description? Out of nowhere? We didn’t deploy a new version (nor redeployed the same one, for that matter); we didn’t change anything in the configuration files; we didn’t install OS updates. Then why could it possibly start to fail out of nothing? Both servers had gigs of free disk space, and plenty of RAM, so that couldn’t be the problem. I checked the SQL Server logs, and found this:

1

The full message is “Enlist operation failed: 0x8004d01c(XACT_E_CONNECTION_DOWN). SQL Server could not register with Microsoft Distributed Transaction Coordinator MS DTC) as a resource manager for this transaction. The transaction may have been stopped by the client or the resource manager.”

Accompanied by a second message: “Error: 8510, Severity: 20, State: 3.”

Ok, that gave us some clue about what was happening, but didn’t shed any insight on why. After quite some time looking for the cause of this issue, I finally reached this article, which doesn’t clearly state why did their solution solve the problem but references this other article, which again, doesn’t quite pose a solution to the problem, but is very close. The key information I obtained from that article is

“As MSTC [sic] is the core Transaction Service, its restarting will cause other Transaction depended [sic] services stop working normally, such as SQL server”.

The core idea is pretty clear. And combining this with a comment from the first article I mentioned (“Restarting services on the sql server that could not initiate distributed transactions resolved the problem”), I imagined that the MSDTC and the SQL Server windows services must be started in that order for everything to work.

Back to our setup, I recalled that I did restart the MSDTC service at some point, so I obviously tried restarting SQL Server, and…

Another error! It seemed the user we had been using all along to connect to the database didn’t have permissions anymore. Oh, but wait! There isn’t only 1 database server, there are 2, configured for mirroring, and a third one that acts as witness to allow for automatic failover. Restarting SQL Server in the first database server triggered the failover, so when it came back up, it wasn’t the primary anymore, but the mirror. And since you cannot connect to a mirror, the error makes sense. Whew! I started a manual failover back to this server, and surely enough, distributed transactions started working again!

So, even though it is not explicitly documented anywhere (that I could find), I think it is safe to say the following, which generalizes the title of this post a bit: always make sure that any server hosting resources that make use of MSDTC, sets up its services and resources in the correct order, and be careful with restarting the MSDTC service.

I hope this saves somebody the time I spent debugging this issue.

1 comment:

  1. Get Blazing Fast Los Angeles Dedicated Server (Asia Optimized) For Affordable Price Here - THESTACK.NET

    ReplyDelete