What Happened?

Users of TruckingOffice were unable to enter dispatches for new routes on the morning of July 5th. Users received a variety of errors, and our systems staff started getting alerts from the application that something was wrong.

Why Did It Happen?

TruckingOffice uses PC*Miler to do all of the mileage calculations in the system. PC*Miler had scheduled a maintenance window from 5am-7am CST on July 5th to upgrade from version 24 to version 25. Near the end of the window we started getting alerts from our error monitoring system that requests to PC*Miler were timing out. The errors persisted even after the window ended. We contacted PC*Miler to determine whether we had bad code or whether there was a problem on their end. It turned out they had some difficulty with the upgrade and were forced to downgrade the system back to version 24. They will most likely schedule another window to attempt the upgrade again once they have determined the source of the problem. We thought that we had isolated PC*Miler enough that the system would continue to work if they were down, but this outage showed that we missed several use cases that still depended on PC*Miler. This is the first time in 16 months that PC*Miler has had any downtime, which is why this issue was not discovered sooner.

What Are You Doing About It?

We are going to do several things:

  • We are now going to use the blog to announce upcoming maintenance events. We will also use it to announce any issues related to downtime. This blog is hosted at a separate facility from the application itself, so even if the application goes down this blog should still be operational.
  • We are working on some new features to allow you to override the mileage as calculated by PC*Miler. This will allow the system to continue to operate when we are unable to communicate with PC*Miler.
  • We are reviewing the code to find any other points where we depend on external systems to make sure that the system stays up even if the external system goes down.
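
To give a rough idea of what the mileage override and the isolation from PC*Miler could look like, here is a minimal Python sketch. It is not our production code. The names resolve_mileage, fetch_miles, and MileageUnavailable are made up for illustration, and fetch_miles simply stands in for whatever call actually talks to PC*Miler. The sketch also assumes that a mileage entered by the user always takes priority over the calculated one.

```python
from typing import Callable, Optional, Tuple


class MileageUnavailable(Exception):
    """Raised when neither the mileage service nor a manual value is available."""


def resolve_mileage(
    origin: str,
    destination: str,
    fetch_miles: Callable[[str, str], float],  # hypothetical stand-in for the PC*Miler call
    manual_override: Optional[float] = None,
) -> Tuple[float, str]:
    """Return (miles, source), where source is "manual" or "service"."""
    # Assumption: a user-entered mileage always wins, so the external
    # service is not contacted at all when an override is present.
    if manual_override is not None:
        return manual_override, "manual"

    try:
        return fetch_miles(origin, destination), "service"
    except (TimeoutError, ConnectionError) as exc:
        # Turn a low-level network failure into one application-level error
        # so the caller can ask the user for a manual mileage instead of crashing.
        raise MileageUnavailable(
            f"Could not calculate mileage from {origin} to {destination}"
        ) from exc
```

With something like this in place, a dispatch screen could catch MileageUnavailable and prompt for a mileage instead of showing an error, and the dispatch could still be saved while the external service is down.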