This past week, I was part of a fire drill involving three state agencies over a failed web service.
The web service failed at 12:16 PM on Tuesday, 8/25/2009. I know this because I got one of the red-colored emails from State of Colorado Judicial. Our side is the client; Judicial is the server.
I tried, with little success, to get the interest of my manager, his manager, the network admin, anyone who would talk to me. At the state today, there is little downtime. Everyone's resources stretch a little further every day, just when you think you can't stretch anymore. All of the people I tried to talk to had other priorities. At the end of the day, I didn't have an answer about what was happening, and the Judicial IT guy was not happy.
He complained up the line, and at the very very end of the day, we had top management from three agencies asking for answers.
The next day, after some venting had happened, we began to investigate. I defended, quite strongly I might add, that the code was not to blame, that the last production move we made had been successful and was not related to the present problem.
Judicial began to send us screenshots from their side, the firewall logs, indicating that they were not the problem.
Just as we were about to do some real-time monitoring of the web service transaction, just as I had been talked into adding even more error handling to the application, just at that moment, the system started to work again. It was 10:47 AM on Wednesday, 8/26/2009.
Now, I would have been looking to place the blame, except for one fact. During the outage, about 75 records were sent. Of that number, 4 records got through. The problem was intermittent. It was neither fully broken nor fully functioning. This fact got in the way of my placing responsibility on the party who caused the problem.
In the process of writing the textbook for Transport Layer Security this weekend, I came across an article that may have some relevance to this problem. The article is on Microsoft's MSDN site, http://msdn.microsoft.com/en-us/library/aa480583.aspx. It says that "One liability ... is that firewall boundaries may not allow Kerberos authentication traffic between the calling application and the Kerberos Key Distribution Center (KDC) or between Web service and the Kerberos KDC." This describes a possible scenario that could have caused our outage this week. I still don't know whose firewall is not configured correctly, but I am quite sure that it is not my code.
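If a firewall really is dropping Kerberos traffic some of the time, one rough way to see it is to test repeatedly whether the KDC's port can be reached from the client machine. The sketch below is only an illustration, not something we ran during the outage. It assumes the KDC listens on the standard Kerberos port 88 over TCP, and the function names `can_reach` and `probe` are made up for this example.

```python
import socket

KERBEROS_PORT = 88  # standard Kerberos KDC port (assumed; used over both TCP and UDP)


def can_reach(host, port=KERBEROS_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe(host, port=KERBEROS_PORT, attempts=5, timeout=3.0):
    """Try several connections in a row and return how many succeeded."""
    return sum(can_reach(host, port, timeout) for _ in range(attempts))
```

Calling something like `probe("kdc.example.gov", attempts=20)` (a hypothetical host name) and getting back a number well below 20 would point toward an intermittent network or firewall problem rather than application code. A clean result does not prove Kerberos authentication works, since this only checks that a TCP connection can be opened and says nothing about UDP traffic or the tickets themselves.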
That was a very informative post. It is curious to note that Kerberos has a lot of authentication errors that I have run into in my time in IT security.