Archive for the ‘Outages’ Category
Our hosts are having to do some scheduled maintenance this coming Sunday. It looks like a minor thing and the interruption should only be for a few minutes. This is the message they sent us:
Dear Valued Customer,
RE: Coreix :: Scheduled Network Maintenance – Sunday 30th November 2008
The Coreix engineering team will be conducting network maintenance beginning at 00:00AM GMT on Sunday 30th November 2008, lasting until 06:00AM GMT of the same morning.
Summary of downtime and customers affected:
Part 1 – None anticipated
Part 2 – Customers who take communal load balancing solutions will be affected by 30 seconds to 1 minute of downtime.
Part 3 – All customers, there will be an increase in latency and some packet loss. There will be a customer edge outage for approximately 2 to 5 minutes per rack whilst we migrate customers from one chassis to another.
Part 4 – None anticipated – Minor route changes are expected.
Part 1 – The operating system upgrade of core1.sta.lon2.coreix.net
Scope of work: The upgrade of the JunOS on core1.sta.lon2.coreix. The hardware will be taken offline cleanly and should have no impact on networking operations.
Outage: We do not anticipate any downtime during this maintenance section due to the Coreix network resiliency.
Rationale: Coreix performs periodic Operating System upgrades on its equipment to improve security and features on its equipment.
Part 2 – Upgrade the operating system on lb1.sta.lon2.coreix.net
Scope of work: The upgrade of the WebOS on lb1.sta.lon2.coreix. The hardware will be taken offline cleanly, upgraded and rebooted.
Outage: All customers who take communal load balancing services may see 3 or 4 packets lost during this maintenance (approximately 30 seconds to 1 minute of downtime), while the load balancing equipment fails over. Customers with dedicated load balancer solutions will not be affected.
Rationale: Coreix performs periodic Operating System upgrades on its equipment to improve security and features on its equipment.
Part 3 – The upgrade of dist7b.sta.lon2.coreix.net
Scope of work: As part of our expansion at the Stratford site (sta.lon2), we will be migrating customers from one distribution 7a and replacing the chassis of distribution 7b. The new chassis will still maintain the redundancy you expect from Coreix (dual supervisor, dual power supplies, multiple copper and optical cards for dual edge switching ports and core uplinks).
Outage: As we swap over to the new chassis customers will be migrated rack by rack. Since all Coreix edge switches have 2 uplinks to the distribution the downtime will be minimal since we will run the two in parallel. I have estimated 2 to 5 minutes per rack, however in certain instances the swap over will be significantly lower.
Rationale: We are increasing the port density of our distribution layer in response to an unanticipated rapid expansion of colocation, enterprise services and dedicated servers.
Part 4 – The addition of core3.sta.lon2.coreix.net
Scope of work: The Coreix network expands its peering arrangements on a daily basis and as we grow our routing requirements increase. As part of our expansion we are installing a third core router at the Stratford site. The main task of the router will be to mix our primary peering traffic (private peers plus public exchanges), and will act as another resilient node on our network.
Outage: This is a node addition therefore we do not expect any downtime during this period. Customers may see minor route changes for direct peers during this period as the traffic migrates from one core to another (however no traffic should be lost).
Rationale: This upgrade is in response to an increase in traffic and peers on our core routers.
Thank you for your understanding as we continue to improve our services.
If you have any further questions or would like any clarification please do not reply to this email but contact the support department via https://support.coreix.net/ or email@example.com
-Coreix Engineering Team
8.45 am: Looks like we are having a problem again. We are working on it.
Thanks to the people who alerted us via the blog.
10.14 am: This seems to be resolved now. It was an issue with our firewall crashing. We are hoping to be able to go through the Coreix firewall in future so as to take ours out of the loop.
Sorry once again for the disruption
15.02: We are making our final move to the Coreix racks now. Service will be interrupted for about 30mins.
15.39h: Looks like we are back but we are working on an error when logging in. Hopefully fixed in a minute.
16.22h: Just had this from Coreix:
Just wanted to let you know that the first part of the migration went fine, however we had a small issue with power cabling to the firewall due to its size, we need to quickly restart the machine to move some power cables (we didn’t want to keep the machine down to long) to resolve this can you confirm if 5-10 minutes to reboot is acceptable in about 30-60 minutes ?
I gave them the Ok.
16.28h: The login issue is fixed now. We are waiting for one last restart of the server at about 17.00-17.30h, then we’ll finally be settling down with our new hosts.
18.36h: Finally! We are moved away from ServerCity for good. Things should settle down now.
11.55h: We are trying to lay the groundwork for the move tomorrow and are having a bit of a hiccup with the DNS.
Back soon hopefully.
12.19h: Looks like we got caught in a network snag at Coreix somehow. They assured me everything would be seamless…
12.49h: The issue is with one of our network cards. It’ll be over soon.
13.04h: Nearly there. Still sorting a login issue.
13.39h: We are basically back, apart from a couple of tweaks. Sorry about that blip.
Our final move to Coreix will happen on Sunday 5th Oct at 3pm. This should take no more than 30 mins. We are migrating our DNS ahead of time so in all other respects the transition to the Coreix rack space should be seamless. I chose 3pm hoping it will be a quiet time in terms of people using ippimail.
19.49h: The guys at Coreix are preparing the move of our server from the ServerCity racks to their own racks. To do this they are having to reboot the server occasionally. This will cause some minor interruptions to our service.
Once they are done, the actual move will cause an interruption of only about 30 minutes on Sunday afternoon.
We are in the middle of a planned outage due to our hosts (ServerCity.co.uk) relocating their entire facility. Hopefully we will be back in a few hours.
Edit at 17.11h: This is taking much longer than I imagined it would. I have called our hosts for news and I’m waiting for them to get back to me.
19.05h: Looks as if we’re not going to be back fully for another 12-24 hours I’m afraid. This is due to our DNS records (which translate ‘www.ippimail.com’ into the string of numbers that is our server’s IP address) not propagating as rapidly as we would have hoped. We laid the groundwork for this two days ago but it was handled in an odd way by the people who hold our DNS records.
I’m expecting the site to appear on our new IP address soon but no mail will be getting through until the DNS records propagate the new IP address.
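For the technically curious, here is a rough sketch of how you could check from your own machine whether a name has started pointing at a new address yet. It is only an illustration: the function name `resolves_to` and the example address are made up for this post, and the answer you get depends on which DNS servers your own connection uses, so two people can see different results for a while.

```python
import socket


def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname):
    """Return True if `hostname` currently resolves to `expected_ip` as seen from here.

    `resolver` is any function taking a hostname and returning an IPv4
    address string. It defaults to the standard library lookup and can be
    swapped for a stand-in when no network is available.
    """
    try:
        return resolver(hostname) == expected_ip
    except OSError:
        # The lookup failed (no record yet, or no network). Treat that as
        # "not propagated yet" rather than crashing.
        return False


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only placeholder, not our real address.
    print(resolves_to("www.ippimail.com", "192.0.2.10"))
```

Running it every so often gives a crude picture of when the change has reached you, which is all "propagation" really means from a user's point of view.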
21.00h: Our hosts have obviously broken something pretty seriously. More news in the morning. We are powerless this end. Sorry it hasn’t gone as planned.
9.18am. 26th Sep. I have emailed our hosts again asking for news as soon as they know what’s going on. I’ll keep updating you.
10.04 am: The site is back up so we are getting somewhere! It’s not possible to view mailboxes just yet and it will probably be some time until email gets processed as normal but we are on the home straight.
12.44pm: It never rains… Our hosts have now had ‘a local network failure’ which must have taken a number of sites off the web, ours included. This is totally out of our hands again. As soon as we get word we’ll be able to get in to our server to get things up and running again.
16.11h: It seems our hosts are still battling with this network issue. I have left messages and emails.
17.09h: Just called again. Had to leave a message with an operator. It’s now ‘out of hours’ staff dealing with things…
18.20h: Still no return call. A new host is called for… I can see this going on all weekend at this rate.
19.11h: The site has reappeared although no one from our hosts has bothered to call me to let me know… Hopefully we can now get back in to get things started again.
Logins are now possible but I wouldn’t use ippimail for anything mission-critical until I give the all-clear here… Some aspects of our service may not be firing on all cylinders yet…
Sat 27th 9.37h: Site gone again. I have logged yet another support call. Thanks for the kind comments one or two are leaving. I’m not posting them as this isn’t the place but they are appreciated.
I’m very sorry for all this disruption. It’s typical that it should happen over a weekend. Being the rather jaded type (I am now at any rate! :-), I figured things probably wouldn’t go entirely to plan which is why I asked for our server to be moved on a Thursday rather than a Friday, so we wouldn’t face the ‘Skeleton weekend staff’ issue. Never did I imagine that a job scheduled for ‘a couple of hours’ would be drawn out to three or more days…
11.02h: Another call to our hosts made just now… I am assured that there really are engineers on duty over the weekend. What they do exactly is another issue entirely…
11.17h: Just out of interest I tried to reach our host’s own website, it’s not available either. Probably a pretty serious issue going on.
15.15h: Still no news… I’m going to give up with the calls to the Help Desk as it doesn’t seem to do any good and I’m hoping the engineers are so busy working on a fix for our problems that they can’t get in touch with anyone. More ‘Good luck’ comments posted. Thanks. You know who you are 🙂 Appreciated.
17.58h: Not a squeak from our hosts and still no site… I can only apologise on their behalf…
Sun 28th 8.03am: As above. It would be a miracle if we see the site back before Monday lunchtime. I have Googled for ‘Servercity outage’ etc but haven’t found anything…
12.52h: Still nothing…
16.34h. Our host’s site is back. Perhaps we will be soon as well!
16.47h: I’ve logged another call at their support desk, to no avail most probably…
17.19h: Finally got a call back! They need to reconfigure our firewall due to their network issue. Hopefully the site will be accessible soon. Please don’t get too excited as I still need my guy to get in there and clean up the inevitable mess and it is Sunday night…
17.51h: One website. I’ll try to get my man to get in there to have a look at the state of things.
18.24h: I can’t reach my tech guy at the moment. Mail may start to flow anyway but it will definitely take time for the backlog to clear. Perhaps just as well it’s Sunday night…
(AR: Don’t know if we are indirectly using Telecity but I doubt it as they got things sorted more quickly than we did 😉)
18.30h: My mail is beginning to pour in. A good sign!
18.37h: Although you may be getting mail through now, it’s possible that, because we have been down so long, mail to ippimail may have been bounced by some mailservers. It depends how they have been set up. Some give up trying to deliver mail after 24h, others after a week. If a message you know has been sent doesn’t arrive, you may need to ask the sender to send it again, sorry.
19.57h: Site seems to be down again! It’s possible the server is just working so hard delivering mail that it’s not serving our front page but I doubt it. I have logged another call at the Help Desk…
20.04h: Blimey they called me back again… They are having a look at things once more but ‘I’m just finishing my shift so I have to pass it to a colleague’. I wonder what it’s like to have a steady job…
20.47h: Not a word and no site. Where do I get some Prozac this time on a Sunday night…?
22.01h: Still no site. Can’t imagine why we are coming and going. Just had a message saying ‘Where are the messages I should have had between Thursday and now?’. The answer is that they may have bounced back to the sender or they may still appear once we stay online for any length of time. The chances are they will appear eventually.
Once this is over I need to know if you feel it’s still worth pursuing ippimail…
Mon 29th 8.15h. No progress overnight. I’ll call them a bit later on. I’ve had one comment to say ippimail is well worth pursuing. Thank you for that but it’ll take more than just the two of us 🙂 (Don’t mean that to sound dismissive… need all the encouragement I can get at the moment…)
9.38h: Just spoke to one of the managers, after half an hour of trying. He’s going to call me back in five minutes.
10.34h: Is my watch fast…? Not a squeak. I’m guessing it’s serious. Although their own site is up…
11.51h: Left another message for the manager. Thanks for more messages of encouragement.
12.14h: Just had a message from an ippimail user saying ‘Let’s keep things going! How can we make ippimail better?’ Thanks for that. We all need to have a discussion in our forums as to a way forward once this is over. We all deserve better than this but I’m afraid it all boils down to me not being able to fund ippimail in the way it deserves. The project is too big for me now but it’s not even close to covering its costs and I can’t afford to fund it on my own in these credit-crunch days. We need to talk through our options. It may be that there simply aren’t enough people who want ippimail and I need to face that.
12.51h: Another conversation with the helpdesk. “There’s definitely a guy looking at it now. I thought it was done actually… Want to just leave him to get on with it? I’ll let you know as soon as it’s done.” That makes me feel way better… Not.
14.54h: Still not a word from ServerCity. I have emailed the manager making him aware of this blog but perhaps the implications of the negative publicity are passing him by. If you by any chance run a blog which isn’t hosted with ippimail, please link to this page to get it up the Google rankings. This is getting truly ridiculous…
15.50h: Another conversation with the “Help Desk”. The manager who promised at 9.38h this morning that he would call me back “in five minutes” and hasn’t, “is on site trying to deal with the issue.” There seem to be other sites impacted as well, not just ippimail. Which doesn’t make me feel any better at all I have to say…
16.45h: No news. Sorry.
17.48h: Still nothing.
18.47h: Still nothing but thanks for the kind comments people are leaving. Please don’t worry, if we continue with ippimail we will be getting a new host! It will mean more disruption as we move most probably but as I said to a friend just now, “If we don’t move hosts and there’s another outage, I’ll get lynched!”. From ServerCity’s end, this really is a textbook example of how not to handle an outage. I’m just grateful the idea of this blog came to me…
20.21h: Still no word. Sorry.
21.40h: I am utterly distraught. It feels as if years of work is being destroyed… Never in my wildest nightmares did I think things could go this wrong. Thanks for more messages of support though. The search for a new host starts tomorrow but it’s not as easy as it sounds. We don’t want to go from the frying pan into the fire…
Tuesday 30th Sep. 8.12h: Still no site. Energy will go in to finding a new host today rather than chasing ServerCity.co.uk.
9.13h: More messages of encouragement. Thank you. Still no news on our site.
11.14h: So sorry, still no news.
12.00h: I have contacted a hosting company which was recommended to me. They deal by email so we’ll see what they say when they get back in touch.
13.11h: Another call to the ServerCity helpdesk. They aren’t under the mistaken impression that our server is Ok. Will call back in 20 mins. “No, really…”
13.44h: One website again! Mail should start to flow again now. Just hope we stay up this time.
14.01h: I can definitely see ippimail from where I am and my mail is coming through again. If for some reason you can’t see our front page, try refreshing your browser, might help.
14.06h: If you aren’t seeing mail which ought to be there, it may have bounced because of us being unavailable for so long but, as I understand it, mailservers will keep trying to deliver mail but they’ll try less and less frequently before they finally bounce the message. In other words, more mail may arrive in 24h or something.
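To give a feel for that "less and less frequently" behaviour, here is a small sketch of a retry timetable. Every number in it is an assumption picked for illustration (first retry after an hour, doubling up to a four-hour gap, giving up after 24 hours); real mailservers are configured in all sorts of ways, and `retry_schedule` is just a name made up for this post.

```python
from datetime import timedelta


def retry_schedule(first_delay=timedelta(hours=1),
                   factor=2.0,
                   max_delay=timedelta(hours=4),
                   give_up_after=timedelta(hours=24)):
    """Return the elapsed times at which a sending server would retry delivery.

    The gap between attempts grows by `factor` each time until it reaches
    `max_delay`. No retry is scheduled later than `give_up_after`; at that
    point the message would be bounced back to the sender.
    """
    attempts = []
    elapsed = timedelta(0)
    delay = first_delay
    while elapsed + delay <= give_up_after:
        elapsed += delay
        attempts.append(elapsed)
        # Widen the gap for the next attempt, but never beyond the cap.
        delay = min(delay * factor, max_delay)
    return attempts


if __name__ == "__main__":
    for when in retry_schedule():
        print("retry after", when)
```

With a give-up time of a week instead of 24 hours the same message would still be queued and retried long after our outage ended, which is why some mail turns up late while other mail bounces.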
14.58h: Looks like we are having trouble sending mail? I’m trying to get my tech in to the server but ServerCity look to have configured the Firewall so he can’t reach the server to have a look…
15.02h: ServerCity predictably aren’t answering the phone…
15.06h: I can’t see ippimail either… I have sent emails. They are not answering the phone.
15.40h: Just spoke to one of the ServerCity directors. There’s still a network issue apparently. Unbelievable…
15.46h: Now I’m being told that it’s only our server giving them problems. Previously there were other victims as well. To top it all I get “I’m going to have to give you thirty days notice. If our network isn’t good enough for your server you will have to go elsewhere.” Absolutely astonishing.
15.57h: Please feel free to send a link to this blog to anyone you know in the IT industry. The Register? Slashdot? Who else might like to know? Pass it round…
The Director guy said he is going to move our server once again, to “A partner company’s facility.” I asked how long this would take and I’m not holding my breath for it happening today. They’ll go home at 17.00 sharp I’ll bet…
16.02h: I’m just thinking aloud… The Director guy said “It must be a particular type of traffic you are receiving which is breaking our network.” We were on the network at their Brick Lane (London) facility for years before this. How can it be our server breaking their network? How on earth can a server break a network? Does anyone know?
16.36: I just missed a call from them but they are going to call back in ten minutes. If they move our server they are bound to give us new IP addresses which will take upwards of 24h to change… Which will mean that no mail will get through during that time.
16.40h: If this page does get seen by anyone who runs a site which has also been affected by the same outage, please let me know. Leave a comment below.
17.03h: Right… Finally got my call, after reminding them… One of the Directors who owns the data centre in which ServerCity.co.uk have our server just spoke to me to say that he will take our server and place it out of reach of the Servercity.co.uk outfit. Simultaneously ServerCity solved the problem they were having with our server.
The upshot is that we are leaving our server where it is until this coming weekend. (I wanted to give you some time to get your email) Then it will be removed from ServerCity and hosted with the main company. “Coreix” they are called. They are an entirely separate company one tier above ServerCity. We should be safe there, at least in the short term.
I have been promised that the move at the weekend should give minimal disruption but be prepared for it please.
17.35h: Still having issues but it’s most likely something we can fix ourselves. There was bound to be a bit of a mess to clear up. Please would you check here before hammering the server 🙂
17.47h: Sending is an issue. We are working on it. Nearly there hopefully. Trying to get my guy in there.
17.57h: We seem to be able to get in there. Won’t be long I’m sure.
18.18h: We are locked out again. I’ll call ServerCity.
18.25h: They just called back. They will take a look at the problem. I’m shaking…
18.37h: Server back up. It’s under a heavy load trying to clear the backlog of mail. This may be causing problems.
18.53h: Just had a chat with one of the Coreix guys who handle ServerCity’s support out of hours. He says Servercity has placed two massive servers above ours in the rack. They are putting out so much heat that it’s causing our server to reboot constantly. It’s already working at its limit trying to clear our backed up mail. This is why we keep coming and going. Each reboot takes a few minutes. If you can possibly avoid using the server at this point, I would be grateful. The mail will have cleared in a few hours if the load stays down.
19.03h: The above seems to be true. Server just came back up. Sending is fixed now but if you can possibly wait until the morning it would help us enormously. Lovely message just left by a user. May just go and have a weep now. Ehrrr, a few pints down the boozer that is of course.. (Don’t worry. I’ll sit here keeping an eye on things) 😉
20.00h: One final thought for tonight… One or two people have commented saying “Hey, where’s the mail I was expecting?” It may already be on our server but it has to be scanned for viruses and spam and then popped into your mailbox. All the while our server is having hundreds of external mail servers saying “Hey buddy! Been somewhere nice? I’ve got a ton of mail waiting for you! Grab this lot!” Then that lot has to be scanned for spam and viruses and delivered to your mailbox having gone through your own filters or forwarded to another address etc, etc. It will all take time to get through the system. Some mail will have bounced for sure though… Sorry.
20.30h: You guys have gone quiet and the ippimail site seems to be responding well so I’m hoping we are through this nightmare. At least until the weekend 🙂 I’ll start a new page on this blog for the weekend shall I…
21.42h: All quiet… Hope everyone is getting their mail. My tech has said that some mail may be stuck on our server because of the change of IP addresses. It will be resolved in the morning hopefully.
Wed 1st Oct 07.37h: The site is struggling for some reason. The front page loaded but I couldn’t log in. It seems to be there but running very slowly. Hopefully something my tech can fix once he comes on line. It may still be working through all our mail but it seems unlikely after all this time.
07.46h: This one looks down to us… Eventually I get an error relating to the database and it seems my guy was working on it very late last night. Perhaps a bit too late… Will be fixed very shortly I’m sure.
9.02h: Some connectivity issue. We can’t get in there. I’m calling ServerCity again… They are not answering the phone yet.
9.08h: Just spoke to one of the guys at Coreix. We aren’t really their problem yet but they seem willing to help. At least they answer the phone…
9.20h: Coreix guy definitely on our case now…
10.11h: It sounds as if the constant reboots have damaged the filesystem on the server. The guys seem confident that they can repair it.
10.57h: The Coreix guy is communicating well thank goodness and is still confident that repairs to the filesystem will bring the server back, it’s just taking time to complete.
13.30h: I just spoke to the Coreix guy and he sent me this message after checking our server again:
I have a gut feeling that the FSCK is going to run the majority of the day as FSCK is forcing the deletion of unused INODES, i.e all the INODES that were previously occupied by mail accounts on the server. I am checking your server every 15 minutes to see if there is any progress, however it does appear to be a pretty lengthy process at this stage.
So, Not great news. Sorry.
16.27h: No news I’m afraid. I feel much happier with these new guys so I trust them to keep us informed as soon as the repairs complete. There’s just no shortcut I’m sorry to say.
19.35h: I’m giving up hope for tonight. The disks involved are big, I’m not surprised the repairs are taking time. I’m sure you are tired of my apologies…
Thursday 2nd October. 08.00h: Site is back up so repairs were at least partly successful it seems. I get an error when logging in, a scary one to my untrained eye. Be assured that we are aware of it but my guy won’t be around for another hour I don’t suppose. I’m sure it’s fixable.
09.03h: People are leaving kind comments, thank you. I’ve decided to publish them even though I did say “This isn’t the place”. I want this blog to be for information during an outage only in the long term, should there be one, but your comments have kept me going this week and perhaps this is a time when we need to keep each other going… I need to ask permission from the various people first.
9.37h: Just spoke to my tech. He’s on to things. Holding my breath…
I have found that a lot of people who left comments, quite logically, left their ippimail address attached, so it’s not possible to ask permission from everyone at this point. I’m going to grab a handful and post them as a comment anonymously. If I can work out how to do it… 🙂 If you want your comment taken down for any reason, just let me know.
9.44h: I have posted the comments below. They may all be from my Mum for all I know 😉 but they mean a lot. Many tears have been shed this end over the last few days, of sadness, anger and frustration but also of gratitude that there really are people out there who care. Thank you.
10.00h: Just had an email from my tech. Didn’t sound like good news. It’s at least going to take a long time to fix.
10.14h: It sounds as if we can restore things from backups. Still not breathing…
11.00h: Just had an email from my tech. I can’t pretend to understand it fully but it looks as if the backup will recover things as they were a couple of days ago. It’s not clear what the implications of this will be in terms of mail being delivered. We’ll have to see.
11.25h: We seem partly recovered but I’m still seeing an error when logged in. We are working on it.
11.45h: Looks like my inbox is damaged. This is the “worst case scenario”. There are bound to be others damaged as well.
12.08h: The server is rebuilding our mailboxes at the moment. I can see mine now but get an error when sending.
12.11h: Looks like we can actually send but get an error when doing so. My message did arrive. I’m guessing the error will go away when the mailbox rebuild is complete. It would probably be a good idea to leave the server alone for a bit. Avoid logging in if you can.
13.01h: Server gone again. We are doing what we can.
13.35h: ServerCity is experiencing a DoS attack. This happens on a regular basis to hosting companies but it’s why we lost the server for a while. Certainly hope it’s nothing to do with this blog… It’s out of order.
You couldn’t make it up…
15.00h: As far as I know, the rebuild is still ongoing.
15.57h: I’m aware of an error message when sending. The message does get sent but it’s not added to your sent box. Can’t imagine it’s a huge issue but I’m still waiting for the fix.
17.50h: Is anyone else having trouble sending (getting an error when sending) or just me?
18.22h: I think we are back to normal service. Let me know in our forums if not.
We are just recovering from a lengthy outage.
Apparently our hosts had a network cable fail and ippimail fell off the web. It took a long time before we realised that we had a problem, hence this blog!