Why Monitoring Services?
Mostly I don't much care if my own services fail, because I assume that if they've failed through my own fault I deserve to suffer.
But at work, for friends, and for commercial setups I want to know when something has failed, and ideally I want to know urgently. (I've got an account at intellisms so I can easily generate/respond to SMS messages. For example, I've got a setup where I can SMS a random code to a magic email address and the server reboots. Cool.)
The problem with a lot of monitoring systems is that they are external, so they replicate what random internet users would see.
This means that if there are any minor routing oddities, or link failures, the monitoring will trigger. On the one hand this is good - if the service is down, no matter how briefly, then I should know. On the other hand chances are the failure was transitory so by the time I look it will be resolved.
Moving the monitoring closer to the machines it tests is a simple way of solving that, but introduces the problem that you're not testing naturally - your testing might say everything is fine while your ISP has cut you off from the rest of the internet.
So the solution I considered was a distributed system: multiple service checkers each testing the same services, and only alerting if/when a given percentage of those monitors see a failure.
That seems to solve the problem of a trigger-happy monitor, or one that has itself been cut off, raising false alarms: an alert only fires for a genuine failure seen from, say, half the nodes.
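A minimal sketch of that "given percentage" rule, in Python. The function name, the dictionary of per-monitor results, and the default threshold of one half are all assumptions for illustration, not part of any existing tool:

```python
def should_alert(results, threshold=0.5):
    """Return True when at least `threshold` of the monitors report "FAIL".

    `results` maps a monitor name to the string "OK" or "FAIL".
    Both the mapping and the 0.5 default are illustrative assumptions.
    """
    if not results:
        # No monitors reporting means there is nothing to agree on.
        return False
    failures = sum(1 for state in results.values() if state == "FAIL")
    return failures / len(results) >= threshold
```

With three monitors, a single "FAIL" (one third) stays quiet, while two "FAIL" results (two thirds) cross the one-half threshold and trigger the alert.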
Monitoring Prototype
As a proof of concept I wrote a simple prototype monitor. This is a system which has two conceptual parts:
- A service monitor
This tests services (ssh, ping, smtp, etc), and records "OK" or "FAIL" locally depending on the result.
If the service tester sees "FAIL" it talks to other nodes and sees what they say.
- A status report
This is invoked as a CGI script and allows you to view the current state of all tests.
This is available specifically so that the tester component can query the status remotely.
So in a simple three-node setup what happens is:
- Node 1 sees a failure.
- Node 1 repeats the test and still sees a failure.
- Node 1 does a "wget .." on the 2nd node's status page.
- Node 1 does a "wget .." on the 3rd node's status page.
- If both nodes 2 and 3 also report a failure, node 1 alerts.
- Otherwise it records "FAIL".
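The steps above can be sketched roughly as follows. This is not the prototype's actual code: `run_test` and the peer fetchers are hypothetical callables standing in for the real service test and for the "wget" of each peer's status page.

```python
def check_service(run_test, peer_status_fetchers):
    """Return "OK", "FAIL" or "ALERT" for a single host/service pair.

    run_test: hypothetical callable, returns True when the service responds.
    peer_status_fetchers: hypothetical callables, one per peer node, each
        returning the "OK"/"FAIL" state that peer has recorded (standing in
        for fetching the peer's status page).
    """
    if run_test():
        return "OK"
    # Repeat the test so a one-off blip is not treated as a failure.
    if run_test():
        return "OK"
    # Still failing: see what the other nodes have recorded.
    # Note: with an empty peer list all() is True, so a lone node alerts.
    if all(fetch() == "FAIL" for fetch in peer_status_fetchers):
        return "ALERT"
    return "FAIL"
```

Passing the test and the fetchers in as callables keeps the decision logic separate from the networking, which makes it easy to exercise without real hosts.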
On the whole this works, but there are delays. Assuming that a full run over every known host + service takes 2 minutes to complete, there is a potential delay of up to 6 minutes before an alert is generated:
- Node 1 must run all tests and see the failure, then repeat the test to check it wasn't just transient, before recording "FAIL".
- Node 2 must run all tests and see the failure, then repeat the test to check it wasn't just transient, before recording "FAIL".
- Node 3 must run all tests and see the failure, then repeat the test to check it wasn't just transient, before recording "FAIL".
In short, for N nodes you must wait up to 2xN minutes for all of them to record and display the failure, and that is just too long. (You could cut that down to four minutes if you were to alert when only 2/3 nodes see a failure, rather than waiting for them all to see it. Yes, I appreciate nodes 1 + 3 might be in phase, but generally with splay times, etc, you cannot rely on that.)
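The arithmetic is simple enough to write down. This sketch assumes the worst case described above, where the nodes are completely out of phase and each one needs its own full run before it records the failure; the function and parameter names are made up for illustration.

```python
def worst_case_alert_delay(run_minutes, nodes_required):
    """Worst-case minutes before `nodes_required` nodes have recorded "FAIL".

    Assumes each node needs one full run of `run_minutes` to notice the
    failure and that the runs do not overlap at all (fully out of phase).
    """
    return run_minutes * nodes_required


all_three = worst_case_alert_delay(2, 3)   # wait for every node of three
two_of_three = worst_case_alert_delay(2, 2)  # alert once two nodes agree
```

Here `all_three` is 6 and `two_of_three` is 4, matching the figures in the text.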
So a solution?
There are two simple ways to avoid waiting for the failure cascade before an alert triggers:
- Only run some of the tests on each node, rather than all of them.
- Allow a more direct "Do you see this failure too?" message to be sent between nodes.
The first solution means that you don't need to wait long periods between each run - for example if a single node is testing "host1.example.com" through "host99.example.com" in sequence, it has to wait for 98 host tests to complete before it sees a failure on the last host.
Given enough nodes you should be able to divide the tests up amongst them - meaning a complete run of all tests happens in parallel and the total test time is reduced.
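One way to divide the tests up is a plain round-robin split. This is only a sketch of the idea: the node names and the `assign_tests` helper are hypothetical, and a real setup might prefer to weight the split by how long each test takes.

```python
def assign_tests(tests, nodes):
    """Deal `tests` out to `nodes` in round-robin order.

    Returns a dict mapping each node name to the list of tests it runs.
    """
    if not nodes:
        raise ValueError("need at least one node")
    assignment = {node: [] for node in nodes}
    for index, test in enumerate(tests):
        assignment[nodes[index % len(nodes)]].append(test)
    return assignment


hosts = [f"host{i}.example.com" for i in range(1, 100)]
plan = assign_tests(hosts, ["node1", "node2", "node3"])
```

With 99 hosts and three nodes each node ends up with 33 hosts, so a full pass takes roughly a third of the time. The cost is that each host is now watched by only one node during a normal run, which is where the second solution comes in.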
The second solution means you don't need to passively check what your peers think of the current situation; instead you can ask them directly: "Test this host/service for me? KTHXBYE".
So that's my new design:
- A standalone service checker.
- A dynamic server, allowing remote testing-nodes to invoke "test(Host,Service)" and get immediate feedback.
The service checker will test all hosts+services, recording results. But if it detects a service is down it will immediately query all other testing nodes and ask them to test the failing host/service - reporting the result back instantly.
In fact all the service testing logic will be in the dynamic server, and the service checker will literally just say "Test this service on this host". If all is well it will return immediately; if it isn't it will trigger the remote query.
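A rough in-process sketch of that design. Everything here is an assumption for illustration: the class name, the `probe` callable standing in for the real ssh/ping/smtp tests, the use of plain method calls where the real thing would go over the network, and the choice to alert only when every peer agrees (borrowed from the prototype).

```python
class CheckServer:
    """Hypothetical dynamic server that holds all the testing logic."""

    def __init__(self, probe, peers=()):
        self.probe = probe        # callable(host, service) -> bool
        self.peers = list(peers)  # other CheckServer-like objects

    def test(self, host, service):
        """Answer "do you see this failure too?" by testing right now."""
        return "OK" if self.probe(host, service) else "FAIL"

    def check(self, host, service):
        """What the service checker calls: test locally, escalate on failure."""
        if self.test(host, service) == "OK":
            return "OK"
        # Local failure: ask every peer to test the same host/service now,
        # rather than waiting for their next scheduled run.
        peer_results = [peer.test(host, service) for peer in self.peers]
        # Assumed policy: alert only if every peer also sees the failure.
        if all(result == "FAIL" for result in peer_results):
            return "ALERT"
        return "FAIL"
```

Because the peers test on demand, the delay before an alert is one local test plus one round of peer tests, not a full run on every node.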