SiteMaker Server Status


Thursday, July 2 2009

RESOLVED: Service Disruption 09:11 BST (GMT +1) 02 Jul 2009

DETAIL: We are experiencing a distributed denial of service (DDoS) attack on the services, which is currently overloading our firewalls.

RESPONSE: Our technical team are working hard to get this attack under control and return full service to all customers. We hope to have this completely resolved shortly.

UPDATE: We've managed to block what appears to have been three different types of DDoS attack. Combating the three attacks took longer than expected, but we had them under control by 11:35. Minor issues continued until 12:32 BST (GMT +1).

Thursday, June 25 2009

RESOLVED: Service Disruption 06:10 BST (GMT +1) 25 Jun 2009

DETAIL: We are experiencing a large DDoS on one of our hosted sites, which is currently overloading our firewalls.

RESPONSE: Our technical team are working to resolve this issue as quickly as possible and should have it resolved shortly.

UPDATE: One of our sites is under attack: 75% of all incoming traffic is coming from over 2,300 IPs, which is congesting all traffic and slowing page load times. We are still investigating how to block this traffic dynamically.
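As an illustration only (this is not SiteMaker's actual firewall logic), one common first step towards dynamic blocking is to count requests per source IP within a time window and flag the addresses that exceed a limit. The function name, the limit and the sample addresses below are all assumptions made for this sketch.

```python
from collections import Counter


def ips_over_limit(source_ips, limit):
    """Return the source IPs seen more than `limit` times in one time window.

    `source_ips` is an iterable of IP address strings, one per request.
    The limit is a hypothetical tuning value, not a figure from this post.
    """
    counts = Counter(source_ips)
    return sorted(ip for ip, seen in counts.items() if seen > limit)


# Example window using documentation-only addresses (192.0.2.0/24).
window = ["192.0.2.1"] * 5 + ["192.0.2.2"] * 2 + ["192.0.2.3"] * 6
candidates = ips_over_limit(window, limit=4)
```

A per-IP limit on its own may not be enough when an attack is spread thinly across thousands of addresses, which is part of what makes this kind of attack hard to block.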

Monday, May 11 2009

COMPLETED: Scheduled Downtime 09:30 - 10:30 GMT 18 May 2009

REASON: Database upgrade and updates

PLANNED DURATION: Up to 60 minutes

We will be upgrading the database software to the latest version, and performing a number of data updates in preparation for the release of the next version of SiteMaker.

Thank you for your patience and apologies for any inconvenience caused.

Thursday, April 9 2009

COMPLETED: Essential Maintenance 11:20 BST (GMT +1) 09 Apr 2009

REASON: Our investigations into the recent service disruptions indicated that by running consistency checks on the affected data, the file system was stabilised. To be certain that the data is secure we will be running consistency checks on the remaining data in a controlled manner.

PLANNED DURATION: 40 minutes per site

NOTES: Those communities not already affected by the recent disruptions will go Read-Only for about 40 minutes. This will only affect file uploads, new site building, forum posts and saving of data. All sites will remain on view to visitors and owners. There will be six periods of 40 minutes during which 10% of sites will be affected.

UPDATE: This has now been completed. The consistency checks have been successful.
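The staged approach described in the notes above (six periods of 40 minutes, each covering a share of the sites) can be sketched as a simple batch schedule. This is a minimal illustration under stated assumptions: the group names and the start time are hypothetical, and the real schedule was not published in this post.

```python
from datetime import datetime, timedelta


def schedule_batches(items, batch_count, start, window):
    """Split `items` into `batch_count` near-equal batches, one per window.

    Returns a list of (window_start, window_end, batch) tuples with
    consecutive, non-overlapping windows.
    """
    size = -(-len(items) // batch_count)  # ceiling division
    schedule = []
    for i in range(batch_count):
        batch = items[i * size:(i + 1) * size]
        begin = start + i * window
        schedule.append((begin, begin + window, batch))
    return schedule


# Hypothetical site groups; only one batch is read-only at any time.
groups = ["group-%02d" % n for n in range(12)]
plan = schedule_batches(
    groups,
    batch_count=6,
    start=datetime(2009, 4, 9, 11, 20),  # assumed start time
    window=timedelta(minutes=40),
)
```

Keeping the windows consecutive means any one group of sites is read-only for a single 40-minute period only.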

Tuesday, April 7 2009

RESOLVED: Service Disruption 16:40 BST (GMT +1) 07 Apr 2009

DETAIL: On April 7 a section of our system went read-only. This currently affects only 10% of our customers and has an impact on file uploads, new site building, form submissions and forum posts. The other 90% of our customers are completely unaffected by this event.

RESPONSE: Our technical team are working to resolve this issue as quickly as possible and should have it resolved shortly.

UPDATE: 01:45 BST - All affected systems have now been fully restored. We will continue to monitor the situation and post more details later in the morning.

Monday, April 6 2009

RESOLVED: Service Disruption 14:34 BST (GMT +1) 04 Apr 2009

DETAIL: On April 4 a section of our system went read-only. This currently affects 20% of our customers and has an impact on file uploads, new site building, form submissions and forum posts. The other 80% of our customers are completely unaffected.

RESPONSE: Our technical team are working to resolve this issue as quickly as possible.

UPDATE: We are working on repairing the file system for the affected customers. Current estimate for completion is 2 - 4 hours.

UPDATE: 13:40 BST - We will be taking the service off-line for 10 minutes at approximately 13:55 to reconfigure the file system as part of our recovery actions.

UPDATE: 14:04 BST - The file system reconfiguration was successful and enabled us to restore full service to half of the affected customers. We expect to restore the service fully to the remaining 10% of customers (the remaining half affected by this issue) within the next couple of hours.


RESOLVED: 15:02 BST

DETAIL: Full service has now been restored to all customers.

RECOVERY: We failed over to our backup file server and performed consistency checks to ensure that the data was not corrupted.

FOLLOW UP: We will run background consistency checks on all our data over the next few days to ensure that no further problems occur. Visitors may possibly experience slower page loads while these checks take place. We will also conduct an investigation into the cause of this event and take steps to mitigate a repeat of this incident.

Tuesday, March 10 2009

RESOLVED: Service Disruption 09:12 GMT 10 Mar 2009

DETAIL: This morning we suffered a distributed denial of service attack on one of our sites. This affected the speed at which other sites could be accessed, which in some cases resulted in time-outs for users. The attack was caught early; however, time was needed to establish how it was being conducted in order to block it. The service restriction lasted 49 minutes.

RECOVERY: Some users may have experienced login issues following the resolution, although clearing the cache would have easily resolved this problem. Other than this, no other sites were impacted by this event.

FOLLOW UP: We are now reviewing what steps we can take to minimise the risks of this occurring again.

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Monday, January 26 2009

RESOLVED: Service Outage 07:24 GMT 26 Jan 2009

The database was subject to a short-lived deadlock. These do happen sporadically, and normally get resolved automatically. However, under certain conditions the situation can snowball and cause a temporary 'hang', which is resolved by stopping the webservers and allowing the database some time to recover. We continue to investigate the causes of deadlocks to reduce the probability of them occurring; however, they are not totally avoidable. We apologise for any inconvenience and thank you for choosing SiteMaker.
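For readers unfamiliar with how applications typically cope with sporadic deadlocks, a minimal sketch follows. It is a general technique (retrying the failed transaction with a randomised, growing delay), not a description of how SiteMaker handles them. `DeadlockError` is a hypothetical stand-in for whatever error a given database driver raises, and the retry counts and delays are assumptions.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a database driver's deadlock error."""


def run_with_retry(transaction, attempts=3, base_delay=0.05, sleep=time.sleep):
    """Run `transaction`, retrying if it is chosen as a deadlock victim.

    The delay doubles on each attempt and is randomised so that competing
    requests are less likely to collide again at the same moment.
    """
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == attempts:
                raise  # give up and let the caller report the failure
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay * random.uniform(0.5, 1.5))
```

Spreading the retries out in time is intended to stop a brief deadlock from turning into the kind of request backlog described above.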

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Tuesday, December 23 2008

RESOLVED: Service Outage 12:38 GMT 23 Dec 2008

At 12:38 GMT we experienced an unscheduled outage caused by a number of rogue processes taking up resources on the web-servers. This resulted in slow access times and dropouts for all websites.

All services were resumed by 12:48 GMT. We apologize for any inconvenience caused.

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Tuesday, December 2 2008

COMPLETED: Scheduled Downtime 22:30 - 22:35 GMT 6th Dec 2008

REASON: Software Upgrade to Network Equipment

PLANNED DURATION: 5 minutes




Between the hours of 22:30 to 06:00 on the 6th December, 2008 our data centre provider will be performing a software upgrade to parts of the network infrastructure. The upgrade should take no more than 5 minutes, but the exact time of the upgrade cannot be specified in advance.




Thank you for your patience.

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Tuesday, November 25 2008

RESOLVED: Service Outage 12:52 GMT 25 Nov 2008

DETAIL: 12:52 - 13:05 and 15:32 - 17:12 GMT

We have suffered two incidents today, one starting at 12:52 GMT lasting for 13 minutes resulting in very slow response times, and another one starting at 15:32 GMT lasting for 1hr 40 minutes, which included some periods of full loss of service.

Service was restored to normal at 17:12 GMT and all sites appear to be up and running with normal response times.

RECOVERY:

We are still investigating the causes, but it appears that the machine resources on each of the servers in the webserver layer were exhausted one by one. The lack of free resources meant that our system administrators were locked out or severely impaired in trying to diagnose the problem remotely. We managed to restore service to half of our webservers after the first incident and dispatched a member of the system administrator team to our dedicated data centre to diagnose the problems on site. The remaining webservers were rebooted and brought back up shortly after 14:15 GMT, but as the engineer left the data centre the service started deteriorating again and the engineer returned.

To speed the return to normal service, the affected webservers were power cycled, but this unfortunately left connections open on our database. Once the webservers were back up they were unable to connect, because the maximum number of database connections had been reached. Clearing those connections required a database restart, but after the restart the query optimiser appears to have started returning inefficient query plans, resulting in very slow response times. Our database administrator was brought in and, after several attempts, cleared the inefficient query plans from the cache, and normal service returned.

FOLLOW UP:

We suspect the initial cause of the incident may have been a user-uploaded file (or files) that resulted in a denial of service condition by causing our image conversion software to consume excessive resources while processing. The file may have been uploaded multiple times, and this repeated action exacerbated the problem. Our image processing software has processed millions of files in the past without such issues; this is an extraordinary occurrence and we are taking immediate steps to identify the cause. We have already released a patch which we hope will prevent this happening again.
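To illustrate why repeated uploads of the same file can make matters worse, and one general way to guard against it, here is a minimal sketch. It does not describe the patch mentioned above, whose details are not given here. It assumes a hypothetical `convert` function that raises `TimeoutError` when a file exceeds its processing budget, and it remembers such files by content hash so that they are not processed again.

```python
import hashlib


class ConversionGuard:
    """Refuse to re-process files that previously exceeded conversion limits."""

    def __init__(self, convert):
        # `convert` is a hypothetical callable: bytes in, bytes out,
        # raising TimeoutError when the file takes too long to process.
        self._convert = convert
        self._quarantined = set()  # SHA-256 digests of known-bad files

    def process(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._quarantined:
            raise ValueError("file previously exceeded conversion limits")
        try:
            return self._convert(data)
        except TimeoutError:
            # Remember the content, not the file name, so re-uploads
            # under a different name are caught as well.
            self._quarantined.add(digest)
            raise
```

With a guard like this, a second upload of the same bytes is rejected immediately instead of consuming resources again.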

Thank you for your patience,

Walter Rothon
Product Manager
SiteMaker Software Ltd

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Wednesday, October 29 2008

RESOLVED: Service Outage 16:42 GMT 29 Oct 2008

Service Outage: 29th October 16:42 GMT

We have had a very unexpected failure with the Disc Array. We are currently trying to bring up the Standby system to restore services to all customers.

You will be notified immediately of any changes to this condition and the expected ETA when we have one.

We appreciate this is not great timing after our last outage, but this is completely unrelated and is being worked on by our own technical team, who are doing everything in their power to bring up the standby service as quickly as possible.

Thanks

================

UPDATE: 17:37 GMT

All services have been restored to normal operation. We will investigate the cause of the incident. Once we have this, and an explanation from our Data Centre of the cause of last week's downtime, we will inform all customers. We will also notify you of steps to avoid these situations occurring again.

If you encounter any difficulties with viewing your site we recommend clearing your cache (temporary Internet files) and reloading your browser. This will fix problems with pages loading and file uploads.

Thanks again for your patience.


Incident Report

At 16:42 on the 29th October we were alerted to the fact our main file server appeared to be offline. This meant that customers' uploaded files were unavailable. Once it was confirmed that the file server was not accessible, we immediately instigated our recovery plan and brought the standby file server online. We then attached the network storage to it, thus restoring access to customer uploaded files. Services were brought back up by 17:32.

On further investigation, we noticed that the server detected that it had entered an inconsistent state and, as a result, halted itself as a safety measure. This is an extremely rare occurrence and is the first time we have encountered this behaviour since SiteMaker began. However, this is the reason we have redundant hardware, enabling us to quickly fail over to our standby system and recover services rapidly.

As a precaution, we have updated the operating systems of the file servers and, although we have no reason to suspect any damage, we are running full diagnostics on the hardware.

NB. Admin: Tidy up of post title and inclusion of Incident report from later post for consistency 16:00 GMT 18 Mar 2009


Thursday, October 23 2008

RESOLVED: Service Outage 19:10 BST (GMT +1) 23 Oct 2008

We appear to have suffered a power cut at our data center.

We are currently investigating and working to restore service.

Our apologies for any inconvenience caused.

UPDATE: 21:00 BST (GMT +1)

All services are now up and running. Customers may experience a little latency in accessing their sites.

We apologise for any inconvenience caused and we continue to investigate the cause of this outage.

UPDATE: 10:12 BST (GMT +1) - 24th Oct.

Because the power failure caused our machines to shut down unexpectedly, our disk array which contains uploaded files restarted in 'read only' mode as a precaution. This means we have to run full disk integrity checks before we can go back to 'write' mode. Only 50% of sites have been affected by this, and we will repair them 10% at a time.

In any event, all content on these disks is backed up both on and off site, so in the event of any errors on the disks, we can fully restore the files.

We will let you know when this is complete.

UPDATE: 14:00 BST (GMT +1) - 24th Oct.

All services have been restored to normal operation. We continue to investigate the causes of this outage and will provide a full report as soon as it is available.


Incident Report

We host our servers at a secure data centre managed by Telstra, a reputable colocation provider. As part of their colocation service, they provide dual uninterruptible power supplies. Each server on site thus has two independent power feeds to ensure that in the event of either power supply failing, the servers will continue running without interruption.

At 19:10 (18:10 GMT) on 23rd October, Telstra engineers started planned maintenance work on one of their UPSs (uninterruptible power supplies). During this work, the servers were to be moved temporarily onto mains power. This should have happened transparently, but the additional load on the mains tripped a circuit breaker. At approximately 19:47 (18:47 GMT) the breaker was reset and power was restored; the servers were then brought back online one by one. The servers have subsequently been restored to redundant UPS and will continue to be fully protected in the future.

A full report on the outage and the remedial actions from Telstra is attached below.

In the hours that followed, the file server noticed certain problems with the filesystems. As each problem was detected, the affected file system was remounted in 'read-only' mode to ensure data integrity. This ultimately resulted in 50% of the file systems going into a read-only state, meaning that for half of our sites, file uploads and other 'write' activities (such as form submissions) failed. This could not be rectified until the following morning.

In order to do a thorough filesystem consistency check, each affected filesystem needed to be taken offline. We therefore took each individual affected filesystem offline one-at-a-time to perform these checks in order to cause minimum disruption. This meant that affected customers may have experienced 'Internal Server Errors' for a 20 minute period while their file system was checked.

We have taken multiple measures to ensure that our services remain available in the event of power failures. The failure of Telstra to guarantee a permanent supply was an unforeseen and extraordinary event. Again, we apologise for any inconvenience you may have experienced.

==========

Telstra Outage Description

At 18:45:00 on 23rd October, a maintenance window was used to make enhancements to the current UPS equipment on the 3rd floor of Telstra’s London Hosting Center. This work resulted in the total loss of power on the 3rd floor co-location facility.

A thorough Method Statement and Risk assessment had been carried out to make sure all switching was correct and the load would be smoothly transferred through the static switch to raw mains for the duration of the maintenance. Telstra are unable to access this particular breaker and so could not predict that it would fail under a load well below its capacity. The Thermal and Ultrasonic survey of the electrical infrastructure carried out recently showed normal operation.

The activity being undertaken in the maintenance window was to temporarily move the load from one leg to another. This would allow Telstra to undertake works to enhance the current UPS equipment. The load in the 3rd floor co-location facility is currently under 1250 amps. This load is fed by two feeders rated at 1250 amps each. The schematic below shows the basic configuration of the power layout.

In order to conduct the maintenance on the UPS the load needed to be swapped on one leg thus allowing the Static Bypass Switch to throw and in turn allowing Telstra to move the load back but with the UPS isolated ready for safe working.

When the load was moved, a breaker rated at 1250 amps failed. This is the root cause of the issue: as already mentioned, the total load was under 1250 amps and should have been easily supported by this breaker.

The power outage started at 18:45:10 and power was restored at 18:47:05. As a direct result all services within the 3rd floor co-location facility powered down.

As soon as the breaker tripped, we immediately isolated the UPSs and static switch, reset the breaker and put level 3 into wrap round bypass supply. This keeps the UPS and static switch out of the circuit and supplies the load on raw mains.

As the load is well below the capacity of the breaker, it was assumed the breaker had tripped early. At the time Telstra had no way of safely checking this breaker without switching the power off. Telstra engineers took the decision to remain on raw mains until further investigations and preparations could be made.

On Friday 24th October, Telstra arranged for a Metropolitan Electrical Tegg service team to attend site with all the necessary H&S equipment for them to remove the cover and check the breaker whilst it is live. Telstra will be better placed to make a decision on the best way forward once these findings have been collected and assessed.

Telstra Remedial Action

The most appropriate actions will be determined as soon as Telstra are given the findings of the specialist report.

Clearly, Telstra needs to migrate back on to UPS supply as soon as possible. Based on the outcome of the analysis of the breaker Telstra will be looking to undertake an emergency planned outage at 00:00 Sunday morning.

This document provides a high level report of the events, highlights areas of concern and lays forth remedial steps, which are being put in place to prevent the same type of episode arising in the future. Telstra would like to offer its sincere apologies for the unforeseen disruption caused to our customers. All teams involved are working together to prevent this situation arising again.

===========

NB. Admin: Tidy up of post title and inclusion of Incident report from later post for consistency 16:00 GMT 18 Mar 2009

Tuesday, October 7 2008

COMPLETED: Scheduled Downtime 15:00 - 15:15 BST (GMT +1) 13 Oct 2008

A number of database changes need to be run to enable new features in SiteMaker. Unfortunately this means a service outage for no more than 15 minutes on Thursday morning, 9 October 2008 10:30 BST (GMT+1).

During this time, visitors to SiteMaker sites will be presented with a holding page informing them that the site should be back soon.

We are sorry for any inconvenience this may cause.

Regards,

Hiren Joshi

Systems Manager

Update:

This has now been rescheduled to 15:00 BST (GMT +1) 13 Oct 2008. We thank you for your understanding.

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Monday, August 18 2008

RESOLVED: Service Outage 19:12 BST (GMT +1) 18 Aug 2008

At 19:12 BST (GMT +1) we experienced an unscheduled outage caused by a number of rogue processes taking up resources on the web-servers. This resulted in slow access times and dropouts for all websites.

The process was killed and steps have been taken to prevent this from happening again.

All services were resumed by 20:20 BST. We apologize for any inconvenience caused.

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Monday, July 21 2008

COMPLETED: Scheduled Downtime 09:30 - 10:00 BST (GMT +1) 28 Jul 2008

REASON: General database maintenance

PLANNED DURATION: 30 minutes

NOTES: During this time, customers will not have access to their sites and will be presented with a holding page with a short message informing them when normal service will resume.

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Thursday, June 5 2008

COMPLETED: Scheduled Downtime 08:00 - 09:00 BST (GMT +1) 12 Jun 2008

REASON: Upgrade of our gateway firewall

PLANNED DURATION: 1 hour

NOTES: Once the upgrade is complete, normal service will resume. We thank you in advance for your co-operation.

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Friday, May 2 2008

RESOLVED: Service Outage 11:40 BST (GMT+1) 02 May 2008

We experienced a service outage between 11:40 and 11:45 BST.

We were forced to apply an emergency database patch which could not be done while our servers were running.

We apologise for any inconvenience caused.

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Thursday, May 1 2008

RESOLVED: Service Outage 14:24 BST (GMT+1) 01 May 2008

We experienced a service outage between 14:24 and 14:55 BST.

A database deadlock situation caused a backlog for the requests coming in. We were forced to take our webservers off-line to ease the load on the database long enough to resolve the deadlock.

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009

Tuesday, April 22 2008

COMPLETED: Scheduled Downtime 03:00 - 04:00 BST (GMT +1) 30 Apr 2008

REASON: Upgrade by our hosting company

DURATION: 10 minutes, up to 1 hour

This work is unavoidable and all services should have resumed by 04:00 if not before.

NB. Admin: Tidy up of post title for consistency 16:00 GMT 18 Mar 2009
