HELIX FEED

Explaining the network-wide technology outage

Written by DSST Public Schools | 05/31/18

As you all are aware, DSST’s IT systems suffered a catastrophic event that caused a network-wide outage affecting email, network applications, and phones. After the IT team worked throughout the holiday weekend, all services were restored late Tuesday morning. Below is an update from Director of Technology Shaun Bryant on what caused the outage, how the IT team fixed the issues, and the steps being taken to prevent similar events in the future.

In the realm of IT data systems, “disaster recovery” is something you plan for but may never see in your entire career. It’s like planning for a 500-year flood; you make sure you have the tools you need to recover, but you hope you never really need them.

Unfortunately, what we have experienced is a 1,000-year flood. For our most critical applications, we run the gold standard in private cloud platforms, VMware, both to run the applications we all count on and to provide the storage those applications depend on. It is designed to be a Highly Available system -- a “never goes down” platform that can survive the failure of any two critical components and keep running. To its credit, it has done this for us for many years without a single issue that impacted our students or staff.
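For readers who like to see that idea in concrete terms, here is a small illustrative sketch in Python. The numbers are purely hypothetical placeholders, not our actual configuration; the point is simply that a Highly Available cluster carries spare capacity so workloads keep running even when a couple of critical components fail at once.

    # Illustrative sketch only -- hypothetical numbers, not DSST's actual setup.
    # A Highly Available cluster keeps spare capacity so that workloads keep
    # running even if a couple of critical components fail at the same time.

    COMPONENTS_NEEDED = 4    # hypothetical: components required to run every workload
    COMPONENTS_DEPLOYED = 6  # hypothetical: components actually deployed (needed + 2 spares)

    def cluster_keeps_running(failed_components: int) -> bool:
        """Return True if enough components remain to carry every workload."""
        return (COMPONENTS_DEPLOYED - failed_components) >= COMPONENTS_NEEDED

    for failures in range(4):
        status = "keeps running" if cluster_keeps_running(failures) else "goes down"
        print(f"{failures} failure(s): cluster {status}")

With those hypothetical numbers, the cluster rides out one or two failures and only goes down when a third component is lost.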

That all changed 10 days ago, when we had the first issue. For everyone outside of my team, what you saw was email go down and then come back up. On our side, we saw two critical parts of the system drop away a few minutes apart for seemingly no reason. At the time, this looked like a bad outage, but the repair was straightforward and fairly routine, so once it was done we moved back to our everyday work. What we did not know then was that the water was starting to rise: a bug in a critical component had kept it from behaving the way it was supposed to after that first failure. Then came this last Thursday. A week after the first event, the system that is never supposed to be able to go offline stopped dead in its tracks. That bug had been slowly and quietly degrading the system, filling up the resources our systems needed to work.

At this point, none of the metrics or tools we use to understand what is going on with the system were making much sense, so we brought in VMware experts to take over troubleshooting. They determined that the issue began with a bug in the firmware that allows VMware to talk to the storage, and that this bug had created a flood of extra data that was consuming all of our resources and not allowing anything else to happen. After what turned into 100 straight hours of work with teams in Broomfield, India, and Ireland, the firmware issue was fixed.

With that solved, we were left with a choice: wait eight-plus days for everything to come back on its own (the amount of time it would take the system to work through the mess the bug had created), or cautiously and carefully transfer everything off the platform to an alternative location. The second option would also take time (it is a lot of data), but far less than the eight days the system needed to fix itself. This is where we are living today: we have moved nearly all of our systems out of the unhealthy location to an alternative one in order to bring them back online.

During the next few weeks, we will start completely from scratch with the current cluster. Much like a car that has been through a serious accident, we will never trust that it is “quite right” again, so it must be fully rebuilt. We will also build a second cluster so that our critical applications are split between two separate homes. Finally, we are taking steps to move our phone system to a completely independent set of servers, so that no single event, short of a complete loss of the data center we use, can take down all means of communication at the same time.

Please understand, we feel the impact this had on everyone who counts on us every day to keep these systems flowing, and we are doing everything we can to keep future issues from ever getting in the way of the important work you do. In a very real way, my team takes a tremendous amount of pride in providing you, the users we are responsible for, with a network you can count on every day. To that end, I am personally very sorry for the impact this had on all of our students and staff, as well as on the members of my team. As engineers, we are all the sum of the problems we have solved and the issues we have had to fix. With each new problem solved, we grow, and we can make things a little better or fix things a little faster. You have my commitment that we will improve based on this experience and make things better than they were before.

Shaun Bryant