Uptime, outages, and networking issues

Discussion in 'Open Discussion' started by extorted, Mar 21, 2019.

  1. Hello all reading,

    I am sorry to have to write this post, but it has been a long time coming. I generally have good things to say about Winhost and am not on the verge of moving out of their ecosystem, but I believe some proper responses from management are due.

    Prologue: I understand that outages happen. I also understand these can occur due to different parties: Winhost, their third party suppliers, upstream providers, clients themselves via bad code / practice.

    A 99.9% uptime, if taken annually, is circa 9 hours of downtime. Is that how I am expected to look at the figure? And are we talking about a January to December range?

    A single outage between 20 and 21 March 2019 caused 5 hours of downtime, possibly more. If uptime is an annual figure, measured over a consecutive January to December period, that leaves a mere 4 hours of "allowable downtime" for the remaining 9 months of 2019.
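    The arithmetic above can be checked with a short sketch. It assumes a 365-day year and takes the outage as exactly 5 hours (the post says "possibly more"); the function name is illustrative only and is not taken from any Winhost document.

```python
# Downtime budget implied by an uptime percentage.
# Assumption: the measurement period is a non-leap calendar year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted by an uptime percentage."""
    return period_hours * (100.0 - uptime_percent) / 100.0


budget = allowed_downtime_hours(99.9)  # annual budget at 99.9% uptime
outage_hours = 5.0                     # 20-21 March 2019 outage, taken as 5 hours
remaining = budget - outage_hours      # what is left for the rest of the year

print(f"Annual budget: {budget:.2f} h, remaining after outage: {remaining:.2f} h")
```

    Under these assumptions the annual budget is about 8.76 hours, which leaves under 4 hours for the rest of the year.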

    On 12 November 2018, a networking issue on the hypergrid servers led to downtime.

    On 3 December 2018, a networking issue led to downtime.

    I do not keep an hourly count of annual downtime, but following yesterday's incident, I am going to start doing that.

    Can we get compensation, in any form, if we, as clients, experience annual downtime that is shown to be greater than 0.1% as specified in the service level agreement?

    Can Winhost start indicating the start and end times of major network disruptions? This could be done in the same forum posts labelled network-connectivity (which, unfortunately for us and conveniently for Winhost, get removed once each crisis is resolved).

    I hope this post gets some traction and responses, from both clients, and Winhost management.

    Respectfully,

    A client
     
  2. I am surprised nobody has replied yet, neither management nor clients, considering one of my sites is down again. It was reported by a client 45 minutes ago.
     
  3. curtis, Winhost Staff

    If your client's site is down, please contact our support team and they can look into it. I suspect your client was on W26. We posted about some issues with that server this morning in the outages forum.

    We understand that you are upset over the recent outage on Wednesday March 20, 2019 and that your business was impacted. As posted in our Outage/Maintenance forum, the networking issue was caused by bad fiber and router hardware failure upstream from us.

    We are an online business and rely on our connectivity for our livelihood, so the disruption affected us as well. We are equally frustrated and are trying to get more detailed answers from our upstream provider.

    As for start/end times of outages, we do our best to post them for any outage. However, note that the start and end times are not clear most of the time. Often things don’t turn off or on at a single moment. There is usually a slow build to the start of an outage, and its resolution typically rolls out in waves, or things are just slow for a while and then clear up over time. A lot of times, the sites are slow for many customers while for others they are non-responsive. So it’s hard to state exactly when the outage began.

    You are correct, old outage posts are removed from the forum after some time (not right after resolution). We used to keep all the outages posted, but we decided to start removing the older posts – not because we are trying to hide something, but because of the problems the old posts were causing for both customers and our staff. For example, during outages, some customers would reply to old posts that had nothing to do with the current issue, resulting in multiple places of discussion – which often leads to confusion. We tried closing replies to older posts, but then we received customer complaints, or customers linked to old posts that had nothing to do with the current incident – which also contributes to confusion. We found that by removing the older posts, we can focus attention on the current relevant outage post, allowing our staff to concentrate on getting the problem resolved and to communicate with customers in one place.

    You can keep track of cumulative downtime if you want to. To help, there are some monitoring services out there that you can subscribe to, and there are some hosting resource sites that try to provide such information. But note that the information these sites give you is only a general sense of uptime, as they check a particular site or IP. It could be that the server they check had some hardware issues in a particular month and experienced abnormal stats while the rest of the host’s infrastructure was fine. Also note that some hosts figure out which sites/IPs are being tested and game the results.
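    As a rough illustration of tracking cumulative downtime from periodic checks, here is a minimal sketch. The check data is entirely hypothetical, and charging each failed check the full interval until the next check is an assumption; as noted above, real outages rarely have such clean start and end times.

```python
from datetime import datetime, timedelta


def downtime_from_checks(checks):
    """Sum downtime from (timestamp, is_up) samples sorted by time.

    Assumption: a failed check is charged the whole interval until the
    next check, so the result is only as precise as the check interval.
    """
    total = timedelta()
    for (t0, is_up), (t1, _) in zip(checks, checks[1:]):
        if not is_up:
            total += t1 - t0
    return total


# Hypothetical samples taken every 5 minutes (not real Winhost data).
start = datetime(2019, 1, 1, 0, 0)
results = [True, False, False, True, True]
checks = [(start + timedelta(minutes=5 * i), ok) for i, ok in enumerate(results)]

total_downtime = downtime_from_checks(checks)
print(f"Estimated downtime: {total_downtime}")
```

    A single monitored site or IP only reflects the server it sits on, so a figure like this is an estimate for that site rather than for the host as a whole.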

    As for hosting SLAs, they are confusing. Your calculation is correct: a 99.9% uptime is roughly 9 hours of downtime per year. However, hosting SLAs do contain exclusions for things that are generally outside the host’s reasonable control.

    For example, hosting providers have to do both regular and emergency maintenance on their servers, and these often require one or multiple reboots. So many SLAs don’t start counting time until there are 30 minutes of consecutive downtime. Other things like DDoS attacks, customer faults/abuse, hacking, software bugs, hardware failure, natural disasters, and acts of terrorism are similarly excluded. Most SLAs also exclude issues encountered upstream, like Internet backbone outages, fiber cuts, and upstream provider issues.

    In the incident on Wednesday, March 20, 2019, the networking issue was related to bad fiber and hardware failure at our upstream provider. Our hosting platform was working fine during the incident. So under typical SLAs, this incident would not count toward the downtime allowance.
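    The exclusions described above can be sketched as a simple filter. This is an illustration under stated assumptions, not Winhost's actual SLA logic: the cause labels are invented, only the March 20 entry comes from this thread (taken as 5 hours, upstream), the other two entries are hypothetical, and an outage that reaches 30 minutes is assumed to count in full.

```python
from dataclasses import dataclass

# Assumptions for illustration only; a real SLA defines its own terms.
MIN_COUNTED_MINUTES = 30  # shorter outages are not counted
EXCLUDED_CAUSES = {"upstream", "maintenance", "ddos", "customer_fault"}


@dataclass
class Outage:
    label: str
    minutes: int
    cause: str


def sla_countable_minutes(outages):
    """Sum only outages that are long enough and not caused by an exclusion."""
    return sum(
        o.minutes
        for o in outages
        if o.minutes >= MIN_COUNTED_MINUTES and o.cause not in EXCLUDED_CAUSES
    )


outages = [
    Outage("2019-03-20 network outage", 300, "upstream"),  # from this thread
    Outage("hypothetical maintenance reboot", 10, "maintenance"),
    Outage("hypothetical platform fault", 45, "platform"),
]

raw_minutes = sum(o.minutes for o in outages)
countable_minutes = sla_countable_minutes(outages)
print(f"Raw downtime: {raw_minutes} min, SLA-countable: {countable_minutes} min")
```

    With this sample data the 300-minute upstream outage drops out entirely, which is why downtime as a client experiences it and downtime as an SLA counts it can differ so widely.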
     
  4. Hi Curtis, thanks for the reply; it is well written and answers many of my questions. It is not my wish to count downtime; all I am really after (and, I would say, all clients are) is uninterrupted service. I would rather be eating pizza than counting downtime :)

    I do have one suggestion, however, which may not be as difficult for Winhost to implement and in my opinion, ups the ante for hosting providers; I hope this can be considered. Here goes:

    Whether many servers are affected (as in the Wednesday, March 20, 2019 incident) or a single server (as today), can Winhost send an email to account holders informing them of the issue, possibly with a link to the forum post detailing the latest news? Winhost would look more professional, and the same applies to clients who manage services on behalf of their own clientele. I would prefer to learn from you, rather than from my own clients, that their website is offline.

    I would also set a threshold for this so you do not issue a ton of alerts for something that could potentially be sorted out in 15 minutes. Say, 1 hour? This approach has many advantages for both Winhost and Winhost clients. For starters, you would decrease the surge of tickets during downtime. Clients may not necessarily be monitoring the forums, but I am sure 99% of us monitor our email.

    Have a great day and thanks again. Keep up the good work.
     
  5. curtis, Winhost Staff

    Thanks for the suggestion. Can't make any promises but we'll discuss your suggestion with the managers and support team. Also note that you can "watch" the outages forums.
     
  6. Thank you!
     
  7. Hi Curtis,

    Look, for instance, at how professional this looks (see image below). I was informed, and was able to inform the website owner in advance! I appreciate that this is a planned activity rather than an ad-hoc interruption, but the concept is similar and can be applied to network / upstream interruptions :)

    upload_2019-3-27_12-31-35.png
     
  8. curtis, Winhost Staff

    Yes, for planned outages that are going to be long or fall outside our Monthly Maintenance Window, we typically email customers.
     
