We appear to have suffered a power cut at our data centre.
We are currently investigating and working to restore service.
Our apologies for any inconvenience caused.
UPDATE: 21:00 BST (GMT +1)
All services are now up and running, although customers may experience some latency when accessing their sites.
We apologise for any inconvenience caused and we continue to investigate the cause of this outage.
UPDATE: 10:12 BST (GMT +1) - 24th Oct.
Because the power failure caused our machines to shut down unexpectedly, our disk array, which contains uploaded files, restarted in 'read only' mode as a precaution. This means we have to run full disk integrity checks before we can go back to 'write' mode. Only 50% of sites have been affected by this, and we will repair them 10% at a time.
In any event, all content on these disks is backed up both on and off site, so in the event of any errors on the disks, we can fully restore the files.
We will let you know when this is complete.
UPDATE: 14:00 BST (GMT +1) - 24th Oct.
All services have been restored to normal operation. We continue to investigate the causes of this outage and will provide a full report as soon as it is available.
Incident Report
We host our servers at a secure data centre managed by Telstra, a reputable colocation provider. As part of their colocation service, they provide dual uninterruptible power supplies. Each server on site thus has two independent power feeds to ensure that in the event of either power supply failing, the servers will continue running without interruption.
At 19:10 (18:10 GMT) on 23rd October, Telstra engineers started planned maintenance work on one of their uninterruptible power supplies (UPS). During this work, the servers were to be moved temporarily onto mains power. This should have happened transparently, but the additional load on the mains tripped a circuit breaker. At approximately 19:47 (18:47 GMT) the breaker was reset and power was restored; the servers were then brought back online one by one. The servers have subsequently been restored to redundant UPS and will continue to be fully protected in the future.
A full report on the outage and the remedial actions from Telstra is attached below.
In the hours that followed, the file server detected problems with some of the filesystems. As each problem was detected, the affected filesystem was remounted in 'read-only' mode to ensure data integrity. This ultimately resulted in 50% of the filesystems going into a read-only state, meaning that for half of our sites, file uploads and other 'write' activities (such as form submissions) failed. This could not be rectified until the following morning.
In order to do a thorough filesystem consistency check, each affected filesystem needed to be taken offline. To cause minimum disruption, we took the affected filesystems offline one at a time to perform these checks. This meant that affected customers may have experienced 'Internal Server Errors' for a 20-minute period while their filesystem was checked.
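The one-at-a-time approach described above can be illustrated with a minimal Python sketch. It is an illustration under assumptions, not the tooling used during the incident: the function name `rolling_check` and the three callables (`take_offline`, `check`, `bring_online`) are hypothetical placeholders for whatever unmount, consistency-check and remount steps a given file server provides. The sketch also assumes that a filesystem which fails its check stays offline (for example, to be restored from backup) rather than being returned to service.

```python
from typing import Callable, Iterable, List, Tuple


def rolling_check(
    filesystems: Iterable[str],
    take_offline: Callable[[str], None],
    check: Callable[[str], bool],
    bring_online: Callable[[str], None],
) -> List[Tuple[str, bool]]:
    """Check filesystems sequentially so each one is only offline for its own check.

    All three callables are hypothetical hooks supplied by the caller.
    Returns a list of (filesystem, passed) pairs in the order processed.
    """
    results: List[Tuple[str, bool]] = []
    for fs in filesystems:
        take_offline(fs)
        try:
            passed = check(fs)
        except Exception:
            # Treat a crashed check the same as a failed one.
            passed = False
        if passed:
            # Only a filesystem that passed its check goes back into service.
            bring_online(fs)
        # Assumption: a failed filesystem is left offline for manual repair.
        results.append((fs, passed))
    return results
```

Because each filesystem is brought back before the next one is taken offline, sites on a filesystem that passes are only unavailable for the duration of their own check.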
We have taken multiple measures to ensure that our services remain available in the event of power failures. The failure of Telstra to guarantee a permanent supply was an unforeseen and extraordinary event. Again, we apologise for any inconvenience you may have experienced.
==========
Telstra Outage Description
At 18:45:00 on 23rd October, a maintenance window was used to make enhancements to the current UPS equipment on the 3rd floor of Telstra’s London Hosting Center. This work resulted in the total loss of power on the 3rd floor co-location facility.
A thorough Method Statement and Risk assessment had been carried out to make sure all switching was correct and the load would be smoothly transferred through the static switch to raw mains for the duration of the maintenance. Telstra are unable to access this particular breaker and so could not predict that it would fail under a load well below its capacity. The Thermal and Ultrasonic survey of the electrical infrastructure carried out recently showed normal operation.
The activity being undertaken in the maintenance window was to temporarily move the load from one leg to another. This would allow Telstra to undertake works to enhance the current UPS equipment. The load in the 3rd floor co-location facility is currently under 1250 amps. This load is fed by two feeders rated at 1250 amps each. The schematic below shows the basic configuration of the power layout.
In order to conduct the maintenance on the UPS, the load needed to be swapped on one leg, thus allowing the Static Bypass Switch to throw and, in turn, allowing Telstra to move the load back with the UPS isolated, ready for safe working.
When the load was moved, a breaker rated at 1250 amps failed. This is the root cause of the issue. As already mentioned, the total load was under 1250 amps and should have been easily supported by this breaker.
The power outage started at 18:45:10 and power was restored at 18:47:05. As a direct result all services within the 3rd floor co-location facility powered down.
As soon as the breaker tripped, we immediately isolated the UPSs and static switch, reset the breaker and put level 3 into wrap round bypass supply. This keeps the UPS and static switch out of the circuit and supplies the load on raw mains.
As the load is well below the capacity of the breaker, it was assumed the breaker had tripped early. At the time Telstra had no way of safely checking this breaker without switching the power off. Telstra engineers took the decision to remain on raw mains until further investigations and preparations could be made.
On Friday 24th October, Telstra arranged for a Metropolitan Electrical Tegg service team to attend site with all the necessary H&S equipment to remove the cover and check the breaker whilst it is live. Telstra will be better placed to make a decision on the best way forward once these findings have been collected and assessed.
Telstra Remedial Action
The most appropriate actions will be determined as soon as Telstra are given the findings of the specialist report.
Clearly, Telstra needs to migrate back on to UPS supply as soon as possible. Based on the outcome of the analysis of the breaker, Telstra will be looking to undertake an emergency planned outage at 00:00 Sunday morning.
This document provides a high level report of the events, highlights areas of concern and lays forth remedial steps, which are being put in place to prevent the same type of episode arising in the future. Telstra would like to offer its sincere apologies for the unforeseen disruption caused to our customers. All teams involved are working together to prevent this situation arising again.
===========
NB. Admin: Tidy up of post title and inclusion of Incident report from later post for consistency 16:00 GMT 18 Mar 2009