Monitoring service that has a human component?

Hey all, I was curious if anyone knows of a website monitoring service with the option to incorporate a human component into the decision and escalation tree. I’m trying to help a customer find a way around false positives bogging down their NOC staff, by having a human determine the difference between a real error, desired (but different) content, or something in between, like “Hey, it’s 3am and we’ve taken our website offline for maintenance, we’ll be back up by 6am.” Automated systems tend to only know “if test A (or steps A through C) fails, the site is ‘down’, so do my preconfigured thing,” which ends up needlessly taking NOC time if the customer is performing work on their own site, or has simply changed it so that whatever content was being watched is now gone. So the goal would be to have the end user be the first point of contact if it looks like more of a customer-side issue. If they can’t be reached to confirm, THEN contact the NOC, and unlike email alerts, keep contacting until a human acknowledges receipt of the alert.
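Roughly, the escalation I’m picturing looks like the sketch below (`page()` and `acknowledged()` are hypothetical stand-ins for whatever paging service and acknowledgement store would actually be used):

```python
import time

ACK_TIMEOUT = 300  # seconds to give each contact before moving on

def page(contact, alert):
    """Hypothetical helper: send the alert to one contact (SMS/voice/push)."""
    ...

def acknowledged(alert):
    """Hypothetical helper: True once a human has confirmed receipt."""
    ...

def escalate(alert, customer_contacts, noc_contacts):
    # End user first, since it may be their own maintenance or content change.
    while True:  # unlike a one-shot email, never go silent until someone acks
        for contact in customer_contacts + noc_contacts:
            page(contact, alert)
            deadline = time.time() + ACK_TIMEOUT
            while time.time() < deadline:
                if acknowledged(alert):
                    return contact  # a human now owns the alert
                time.sleep(10)
```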

Thanks

I know there are outsourced NOC services you can hire. I’m not sure how long it would take, but I wonder if you could do an API call to something like Mechanical Turk as well?
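Something like this boto3 sketch, maybe (the sandbox endpoint is real, but the question URL is a placeholder; you’d need your own requester account and a small page where the worker classifies the alert):

```python
import boto3

# Sandbox endpoint so experiments don't pay real workers; assumes AWS
# credentials for an MTurk requester account are already configured.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# ExternalQuestion pointing at a (hypothetical) page that shows the worker
# the site and asks: real outage, maintenance page, or intentional change?
question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/mturk/classify-alert?alert_id=1234</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

response = mturk.create_hit(
    Title="Is this website down, in maintenance, or intentionally changed?",
    Description="Look at the page and pick one of three options.",
    Reward="0.10",                    # USD per assignment, as a string
    MaxAssignments=3,                 # ask three workers, take a majority
    LifetimeInSeconds=600,            # expire fast; alerts are time-sensitive
    AssignmentDurationInSeconds=300,
    Question=question,
)
print(response["HIT"]["HITId"])
```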

- Jared

What’s your budget?

The outsourced NOC firms tend to be expensive (I’ve looked at them for a project), and they’re also not that fast. Don’t expect someone to determine whether an alarm is valid within a few minutes; instead it goes into their queue and waits for a tech to pick it up, so it could be 30-60 minutes.

In a perfect scenario, using freelancer/gig-economy people should get this done quickly, but it needs to be sizeable to start and will involve a lot of logistics, which means money.

To be honest, the best option may be to hire a developer to write really good custom logic that eliminates most of the false positives, so only a handful make it through.

For my 9-5 we use a company with a 24/7 NOC that watches all of our boxes. The CEO is in the US and the NOC guys are overseas. They are generally very responsive, and it’s very affordable (about $200.00 per box per month). These guys work really well, but they sort of work inside the box: we need to give them guidelines.

For example, we do telecom and clients get hit with fraud all the time. There are times when we know it’s 90% fraud and we want them to shut it down, and times when there’s a 50/50 chance that it’s fraud. If the system thinks there is a 90% chance, it emails them “Fraud call from X to Y”; they then have a procedure for figuring out from the call record who the client is, and they shut them down. If it’s the 50/50 chance, they get an alert “POSSIBLE Fraudulent call from X to Y”, and for such a call they have to go through a series of checks before they shut the client down. What I am getting at is that they aren’t good at figuring out what is fraud, but they are very good at following rules and doing exactly what you ask of them. Whatever technology we use, they learn it (be it OpenSIPS, Asterisk, etc.). We do need to tell them exactly what to monitor and what to do for specific alarms.
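In rough code terms, the handoff looks like this sketch (thresholds and wording are just illustrative of our setup):

```python
def fraud_alert(score, src, dst):
    """Map the system's fraud-confidence score onto the alert the NOC gets."""
    if score >= 0.9:
        # High confidence: their procedure is to identify the client from
        # the call record and shut them down immediately.
        return f"Fraud call from {src} to {dst}"
    if score >= 0.5:
        # Ambiguous: their procedure is a series of checks before shutdown.
        return f"POSSIBLE Fraudulent call from {src} to {dst}"
    return None  # below threshold: no human involved
```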

If you want an intro let me know.

Hi David - Just a bit of insight from my own experience:

Common issues I’ve seen when monitoring (and the associated escalation processes) doesn’t work, with symptoms like the ones you describe:

  • Inconsistent HTTP response codes across services and service layers (nginx vs the backend Tomcat), which means you can’t use them reliably.

  • Monitoring on arbitrary metrics (“90% of something”) as opposed to metrics linked to an actual outcome (response times, for example).

  • No runbook in place (e.g. an engineer changing a setting to switch maintenance mode on or off).

  • No central view of which engineer is doing what to which systems.

Some fairly simple examples of where I’ve seen things work pretty well:

  • Organisation uses HTTP code monitoring, alerting on 5xx but not 503.

  • Services configured (and tested!) to return other, specific 5xx errors, but keeping 503 as a ‘known and expected maintenance’ mode (see the sketch after this list).

  • Runbook in place to let other engineers know what’s happening (a Slack message, for example), then a maintenance page on the reverse proxy.

  • Monitor and report on the common 90% metrics (disk space, memory), but with no alerts.

  • Don’t fill up the disk with logs, only to delete them and let it fill up again… :)

  • Remove all non-actionable alerts.
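For the 503-as-maintenance convention above, the monitoring-side check can stay very small. A minimal sketch using plain `requests` (the return labels are illustrative, not from any particular tool):

```python
import requests

def classify(url):
    """Return 'ok', 'maintenance', or 'alert' for a monitored URL."""
    try:
        status = requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return "alert"        # unreachable: page someone
    if status == 503:
        return "maintenance"  # known and expected, don't page
    if 500 <= status < 600:
        return "alert"        # any other 5xx is a real problem
    return "ok"
```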

Of course, a good solution could be to implement a rolling-upgrade / HA maintenance strategy, but in reality (depending on how ancient the app is) this can be quite hard.

P.S. This is a really good read: https://landing.google.com/sre/sre-book/toc/index.html

Cheers
Heath

Isn't this merely a matter of escalation? Either way something alerts someone; it's just a matter of who, when, and how often. The usual way of putting a human in the loop is for some events to create tickets to be triaged as staff has time, or for all events to get tickets, with some created in a lower-priority queue without escalation and others in a high-priority queue with escalation. As a service, though, sorry, no, I've not seen one.
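In sketch form (all names illustrative), that triage is just:

```python
LOW, HIGH = "low-priority", "high-priority"

def file_ticket(event):
    """Route an event into a queue; only the high queue escalates."""
    urgent = event.get("severity", 0) >= 3  # illustrative cut-off
    return {
        "queue": HIGH if urgent else LOW,
        "escalate": urgent,  # low queue waits for staff; high queue pages
        "event": event,
    }
```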

/mark

Hi,

You could let them insert a custom string into the maintenance page (I hope they are not writing it on demand), so the monitoring would be OK on a 200-399 status code or when the custom string is found. You could also use a different escalation chain when "maintenance" is found on a 503 error. Other than that, it sounds like a nice AI training field.
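A minimal sketch of that check (the marker string and the chain names are placeholders):

```python
import requests

MAINT_MARKER = "PLANNED-MAINTENANCE"  # agreed string baked into the page

def check(url):
    """Return which escalation chain, if any, should fire."""
    try:
        r = requests.get(url, timeout=10)
    except requests.RequestException:
        return "noc-chain"                 # hard failure: normal chain
    if 200 <= r.status_code < 400:
        return "ok"                        # no escalation
    if r.status_code == 503 and MAINT_MARKER in r.text:
        return "maintenance-chain"         # gentler escalation chain
    return "noc-chain"
```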

Karsten