Thursday, December 7, 2017

Multi Service Incident Update

We are pleased to announce that we have added the ability to create incidents that affect multiple services. This is an extremely important and long-awaited enhancement, and it forms part of a large rework of our internal systems in StatusHub.

To see how it now works, please read this article in our help centre: Managing Incidents Across Multiple Services

If you have any questions, please contact us.

For those of you who are interested or curious, the following is a detailed description of how we planned and crafted this big update and how the system now works.

As background: due to a legacy issue, there was a serious inconsistency for users. Maintenance events could be tied to multiple services, but incidents could not. Now both event types work in the same way.

As mentioned, this was part of a larger project to rework key parts of the system. For multi-service incidents (MSI) we did not want to just add a simple drop-down and move on. We also wanted to ensure that adding this option would not introduce unexpected or, worse, unpredictable behaviour.

So the first step was to clearly define what MSI means. The answer was unexpected at first but logical. MSI means supporting multiple overlapping incidents for a service.

Without the overlapping scenario, we would have to restrict the list of available services to only those that have no events at this time. To create an incident affecting multiple services, one would first have to clear the existing events on those services.

Another legacy and counterintuitive element, which we took the opportunity to remove, was the option to set the status on a service without creating an incident or maintenance event. Without removing it, making MSI work would have been extremely difficult, if not impossible.

Why counterintuitive? StatusHub is not a tool to check whether something is up or down. If your end users experience your website as down and then check your status page on StatusHub, the bare information that your website is down brings them zero value.
StatusHub is about communication: being transparent with your end users about, first, when something will be resolved and, second, what happened. Therefore, allowing our users to change the status of a service without an event is no longer valid.

This view was also shared by some of our customers, who over the last several months asked us to block the ability to set a service status without an event.
This choice introduced a key task: preserving the historical data, as many StatusHub users have used this feature to date. Some used it for its ease of use, since it was a simple drop-down on our control panel home page rather than the incident creation form.

Others set it via the API. We have chosen to keep the Service Statuses API instead of removing it entirely. However, its logic has changed: it now automatically adds an incident with a generic name for every status change, so those who rely on this API can continue to use it and, we hope, transition in their own time to creating only events.
The same approach was taken for past data. All 'naked' service statuses have been converted to generic incidents.
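
To make the conversion of 'naked' statuses more concrete, here is a minimal Python sketch. It is not StatusHub's actual code: the `StatusChange` and `Incident` classes, the `GENERIC_TITLE` value and the `to_generic_incidents` function are all hypothetical names. It also assumes that a change to any status other than 'up' opens a generic incident and that the next status change for the same service closes it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# Hypothetical generic name; the real wording used by StatusHub is not known.
GENERIC_TITLE = "Service status change"


@dataclass
class StatusChange:
    """A 'naked' status set on a service without any event (assumed shape)."""
    service: str
    status: str
    at: datetime


@dataclass
class Incident:
    """A generic incident created from a naked status change (assumed shape)."""
    title: str
    service: str
    status: str
    start: datetime
    end: Optional[datetime] = None  # None means the incident is still open


def to_generic_incidents(changes: List[StatusChange]) -> List[Incident]:
    """Convert the status changes of ONE service into generic incidents."""
    incidents: List[Incident] = []
    open_incident: Optional[Incident] = None
    for change in sorted(changes, key=lambda c: c.at):
        # Any new status change closes the incident opened by the previous one.
        if open_incident is not None:
            open_incident.end = change.at
            open_incident = None
        # Only a non-'up' status needs an incident to explain it.
        if change.status != "up":
            open_incident = Incident(
                GENERIC_TITLE, change.service, change.status, change.at
            )
            incidents.append(open_incident)
    return incidents
```

Under these assumptions, a service that went 'yellow', then 'red', then 'up' ends up with two back-to-back generic incidents and no status that exists outside an event.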


With these two key decisions made, we were able to start our project, which involved reworking our internal logic and data querying almost from scratch.
Working with events only resulted in much cleaner and less complex code, a very important goal for ensuring maintainability and quality.

The system no longer needs to operate in terms of what is happening now and what happened before; it just needs to work with what is happening at this very moment. However, with multiple overlapping incidents, the question "What's the status of this service when I'm closing my incident?" becomes non-trivial.

As an example, take a web application. One team notices that, due to problems with the database, the web app is responding very slowly. They set the web app status to 'yellow' and start working on the problem. A pretty simple case.

But as problems like to appear in pairs (or worse), a third-party service has gone dark. Unfortunately, this service is vital to the same web app, and the result is a complete outage.
A second team, responsible for the third-party integrations, creates another incident that also affects this web app and sets the service status to 'red'. So far, everything is simple.
But now let's assume that the first team has finished their work and the DB is operating fine (and not as a result of lower load due to the web app being down; their fix addressed a problem with the underlying DB storage performance).

They want to close their incident. But which status should they use? 'green', because their problem is solved and in this respect the app should be working fine? Or 'red', because in the end the app is not working, which they can clearly see from their checks?
Or let's assume instead that the second team finishes first. Is the service 'up', because they have fixed their problem and the web app should work fine, or 'yellow', because the DB team may not have finished yet?

For this reason, we have decided not to set the service status explicitly when closing an incident. And because we have decided to operate with events only, we can tell StatusHub not to care about individual service statuses and to care only about "Is there an event at this time?".

So, returning to our example: in the first case, when the DB team finishes first and closes their incident, the web app service will remain 'red' because the other event is dictating the status.
Only when the second incident is closed will the service be set to 'up' again.

In other words, service status at any point in time is the result of much simpler logic: "Is there any event at that time? If so, use the status of the worst event; if not, the service is 'up'". The same applies to aggregated historical views on your StatusHub page: "Was there any event on that day? If so, use the worst status from the event or events. If not, the service was 'up'".
Now users who update incidents don't have to check with other users whether the service should be up or not. They can focus on their part only.
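
The "worst event wins" rule can be sketched in a few lines of Python. This is only an illustration, not StatusHub's implementation: the `Event` class, the `SEVERITY` ordering and the function names are assumptions, and each event is simplified to a single status for its whole duration.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta
from typing import Iterable, List, Optional, Set

# Assumed ordering of statuses from best to worst.
SEVERITY = {"up": 0, "yellow": 1, "red": 2}


@dataclass
class Event:
    """An incident or maintenance affecting one or more services (assumed shape)."""
    services: Set[str]
    status: str
    start: datetime
    end: Optional[datetime] = None  # None means the event is still open


def worst(statuses: Iterable[str]) -> str:
    """Return the worst status, or 'up' when there is no event at all."""
    return max(statuses, key=SEVERITY.__getitem__, default="up")


def status_at(service: str, events: List[Event], moment: datetime) -> str:
    """Status of a service at a moment: worst status among its active events."""
    active = (
        e.status
        for e in events
        if service in e.services
        and e.start <= moment
        and (e.end is None or moment < e.end)
    )
    return worst(active)


def status_on_day(service: str, events: List[Event], day: date) -> str:
    """Aggregated daily status: worst status among events touching that day."""
    day_start = datetime.combine(day, time.min)
    day_end = day_start + timedelta(days=1)
    touching = (
        e.status
        for e in events
        if service in e.services
        and e.start < day_end
        and (e.end is None or e.end > day_start)
    )
    return worst(touching)
```

In the web app example, closing the database incident changes nothing in these functions. The 'red' third-party incident is still active, so it keeps dictating the status until it is closed as well.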

One more thing!

One final change: prior to this update, once services had been added to a maintenance event, there was no way to remove them. If a user made a mistake, they had to recreate the maintenance event without that particular service. A very poor user experience. With this MSI release we have addressed this problem too.

Now a user can remove services from maintenance events and from incidents at any point.
With incidents it is more complex, but we have put a solution in place. When creating an incident, you don't know when it will be resolved or how many updates will be posted.
To remove a service from an incident, one has to do this from the hub history view, the same view that was always used to edit already created incidents.

A service has to be removed from all of an incident's updates to disappear from that incident entirely. You may ask: "What's the point of an incident where one of the services will not be updated while others will be?"
This is a hint towards a feature that we want to complete next year: the ability to skip notifications for incident updates that do not change the status of a service to which an end user is subscribed.
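
The rule that a service only leaves an incident once it is gone from every update can be illustrated with a small Python sketch. The data model here is an assumption made for the example (an incident holding a list of updates, each with its own map of service to status), and none of the names come from StatusHub itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class IncidentUpdate:
    """One posted update, with the status it sets per service (assumed shape)."""
    message: str
    service_statuses: Dict[str, str]


@dataclass
class Incident:
    """An incident made up of a series of updates (assumed shape)."""
    title: str
    updates: List[IncidentUpdate] = field(default_factory=list)

    def services(self) -> Set[str]:
        """A service belongs to the incident if ANY update mentions it."""
        return {s for u in self.updates for s in u.service_statuses}


def remove_service_from_update(update: IncidentUpdate, service: str) -> None:
    """Remove a service from a single update; other updates are untouched."""
    update.service_statuses.pop(service, None)


def remove_service_everywhere(incident: Incident, service: str) -> None:
    """Remove a service from every update so it leaves the incident entirely."""
    for update in incident.updates:
        remove_service_from_update(update, service)
```

Removing a service from one update only means that later or earlier updates may still cover it, which is why the service stays part of the incident until it has been removed from all of them.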

As with many elements of software, what looks like a simple change on the surface can hide a lot of complex work and many systems underneath.

Again, we want to express our thanks and appreciation to all our existing customers and users who have been so patient in waiting for this update. As you can see, we needed time and care to do it right, and these underlying changes now give us a stepping stone and the flexibility to introduce more great enhancements to StatusHub going forward.

As before, if you have any questions or feedback on this, please do contact us.