posted by [personal profile] skx at 10:55pm on 15/05/2009
Mailing List Software

Recently I was setting up some new domains and I wanted to configure some mailing lists.

Historically I've tried all the mailing list managers out there and think that 99.99% of them suck. The canonical solution is mailman, which I detest both as an administrator and as a user.

The other solutions seem to fail in various ways (character sets, options, complexity, the ability to use the same local-part on multiple domains, even having support for multiple domains - that list is just off the top of my head).

Until now I've always chosen ecartis as the best of a bad bunch, but today I thought "how hard can it be?" (You know this is going to end badly!)

I figure there are three operations I need to support:

Subscribing To A List

This should mean you mail a magic address, receive some instructions back, and then follow them.

Basically so long as you have to confirm the action you're fine. Otherwise people can abuse your server and sign up strangers to your list.

Unsubscribing From A List

Same as above.
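As a rough sketch of the confirmation step, one option is to mail back a token derived from the action, the list, and the address, and only act once that token comes back. Everything named here (`SECRET`, `make_token`, `check_token`) is hypothetical and only illustrates the idea:

```python
import hashlib
import hmac

# Hypothetical secret; a real setup would load this from a protected file.
SECRET = b"change-me"


def make_token(action: str, list_name: str, address: str,
               secret: bytes = SECRET) -> str:
    """Return a token tying an action (subscribe/unsubscribe) to a list and address."""
    message = "\n".join([action, list_name, address.lower()]).encode("utf-8")
    # Truncated to 20 hex characters so it fits comfortably in a reply address.
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:20]


def check_token(token: str, action: str, list_name: str, address: str,
                secret: bytes = SECRET) -> bool:
    """Return True only if the token matches this exact action, list and address."""
    expected = make_token(action, list_name, address, secret)
    return hmac.compare_digest(token, expected)
```

Because the token can be recomputed from the secret, nothing needs storing until the confirmation actually arrives.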

Posting To A List

The lists I run only allow subscribers to post, but there are alternatives:

  • Anybody can post.
  • Only subscribers can post.
  • Only admins can post.
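Those three policies boil down to a small check. This is a sketch with made-up names (`Policy`, `may_post`), and it assumes addresses are stored lowercased:

```python
from enum import Enum


class Policy(Enum):
    ANYBODY = "anybody"
    SUBSCRIBERS = "subscribers"
    ADMINS = "admins"


def may_post(sender: str, policy: Policy, subscribers: set, admins: set) -> bool:
    """Decide whether a sender may post under the given list policy."""
    sender = sender.strip().lower()
    if policy is Policy.ANYBODY:
        return True
    if policy is Policy.SUBSCRIBERS:
        return sender in subscribers
    # Remaining case: only admins can post.
    return sender in admins
```

Bear in mind that a From: header is trivially forged, so a check like this keeps out accidents rather than determined abusers.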

Now most mailservers support the ability to run scripts via "pipes" in some fashion. So /etc/aliases can have:

list-foo-subscribe:    "|/usr/bin/subscribe-list   --list=foo"
list-foo-unsubscribe:  "|/usr/bin/unsubscribe-list --list=foo"
list-foo:              "|/usr/bin/post-to-list     --list=foo"

Surely the mailing list can be pretty simple? Just create a small database to hold subscribers, and when a message comes in run /usr/lib/sendmail to pass it on to each subscriber - munging Reply-To, etc.
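To make that concrete, here is a minimal sketch of the posting side. It is not the hack mentioned in this post, just an illustration: `munge` and `deliver` are hypothetical names, the subscriber list is passed in rather than read from a database, and the `-i` and `-f` flags are the usual sendmail-compatible ones, so check them against your own MTA.

```python
import subprocess
from email import message_from_string
from email.message import Message

SENDMAIL = "/usr/lib/sendmail"  # path as mentioned in the post


def munge(raw: str, list_address: str) -> Message:
    """Parse an incoming message and point Reply-To at the list."""
    msg = message_from_string(raw)
    del msg["Reply-To"]  # deleting a missing header is not an error
    msg["Reply-To"] = list_address
    return msg


def deliver(msg: Message, subscribers, envelope_sender: str) -> None:
    """Hand one copy of the message to sendmail for each subscriber."""
    text = msg.as_string()
    for address in subscribers:
        # -i: do not treat a lone "." line as end of input
        # -f: set the envelope sender, so bounces go to the list owner
        subprocess.run(
            [SENDMAIL, "-i", "-f", envelope_sender, address],
            input=text, text=True, check=True,
        )
```

A pipe script would read the message with `sys.stdin.read()`, call `munge`, and then `deliver` to whoever is in the subscriber database.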

I did write a brief hack which works. But I'm unsure if it is useful, or too basic.

posted by [personal profile] skx at 05:29pm on 07/05/2009
Why Monitoring Services?

Mostly I don't much care if my own services fail, because I assume that if they've failed through my own fault I deserve to suffer.

But at work, for friends, and for commercial setups I want to know when something has failed, and ideally I want to know urgently. (I've got an account at intellisms so I can easily generate/respond to SMS messages. e.g. I've got a setup where I can SMS a random code to a magic email address and the server reboots. Cool.)

The problem with a lot of monitoring systems is that they are external, so they replicate what random internet users would see.

This means that if there are any minor routing oddities, or link failures, the monitoring will trigger. On the one hand this is good - if the service is down, no matter how briefly, then I should know. On the other hand chances are the failure was transitory so by the time I look it will be resolved.

Moving the monitoring closer to the machines it tests is a simple way of solving that, but introduces the problem that you're not testing naturally - your testing might be saying everything is fine while your ISP has cut you off from the rest of the internet.

So the solution I considered was using a distributed system: multiple service checkers each testing the same services, and only alerting if/when a given percentage of those monitors see a failure.

That seems to solve the problem of a trigger-happy monitoring system, or of one that gets cut off itself, by only alerting in the case of a genuine failure seen from, say, half the nodes.
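The "given percentage" rule is simple enough to write down. A sketch, where `should_alert` and the shape of `reports` are assumptions made for illustration:

```python
def should_alert(reports: dict, threshold: float = 0.5) -> bool:
    """Alert only if at least `threshold` of the monitors saw a failure.

    `reports` maps a monitor name to True if that monitor saw the failure.
    """
    if not reports:
        return False  # nobody reported, so there is nothing to alert on
    failures = sum(1 for failed in reports.values() if failed)
    return failures / len(reports) >= threshold
```

With three monitors and the default threshold, one monitor seeing a failure is ignored and two are enough to alert.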

Monitoring Prototype

As a proof of concept I wrote a simple prototype monitor. This is a system which has two conceptual parts:

A service monitor

This tests services (ssh, ping, smtp, etc), and records "OK" or "FAIL" locally depending on the result.

If the service tester sees "FAIL" it talks to other nodes and sees what they say.

A status report

This is invoked as a CGI script and allows you to view the current state of all tests.

This is available specifically so that the tester component can query the status remotely.

So in a simple three-node setup what happens is:

  • Node 1 sees a failure.
  • Node 1 repeats the test and still sees a failure.
    • Node 1 does a "wget .." on the 2nd node's status page.
    • Node 1 does a "wget .." on the 3rd node's status page.
    • If both node 2 & 3 see a failure it alerts.
    • Otherwise it records "FAIL".
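The steps above can be sketched as follows. The status page format is an assumption (one "host service OK|FAIL" line per test), as are all the function names; `retest` is any callable returning True when the service is up.

```python
from urllib.request import urlopen


def parse_status(text: str) -> dict:
    """Parse a hypothetical status page into {(host, service): result}."""
    status = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # skip anything that isn't "host service result"
            host, service, result = parts
            status[(host, service)] = result
    return status


def fetch_status(url: str) -> dict:
    """Fetch a peer's status page, the equivalent of the "wget .." step."""
    with urlopen(url, timeout=10) as response:
        return parse_status(response.read().decode("utf-8", errors="replace"))


def handle_failure(host, service, retest, peer_urls, fetch=fetch_status) -> str:
    """Return "OK", "FAIL" or "ALERT" for a test that has just failed once."""
    if retest(host, service):
        return "OK"  # the failure was transient
    for url in peer_urls:
        if fetch(url).get((host, service)) != "FAIL":
            return "FAIL"  # this peer does not (yet) see the failure
    return "ALERT"  # every peer agrees
```

An unreachable peer raises an exception here rather than being counted either way; a real version would need to decide what that case means.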

On the whole this works. But there are delays. Assuming that the test of each known host + service takes 2 minutes to complete then there is a potential delay of up to 6 minutes before an alert is generated:

  • Node 1 must run all tests and see the failure, then repeat the test to check it wasn't merely a transient failure, before recording "FAIL".
  • Node 2 must run all tests and see the failure, then repeat the test to check it wasn't merely a transient failure, before recording "FAIL".
  • Node 3 must run all tests and see the failure, then repeat the test to check it wasn't merely a transient failure, before recording "FAIL".

In short, for N nodes you must wait 2xN minutes for all of them to record and display the failure, and that is just too long. (You could cut that down to four minutes if you were to alert when only 2/3 nodes see a failure, rather than waiting for them all to see it. Yes, I appreciate nodes 1 + 3 might be in phase, but generally with splay times, etc, you cannot rely on that.)
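The arithmetic, spelled out as a tiny sketch (the function name is made up, and it assumes the pessimistic case where the nodes' runs do not overlap at all):

```python
def worst_case_delay(run_minutes: int, nodes_required: int) -> int:
    """Minutes until enough nodes have recorded "FAIL", one full run after another."""
    return run_minutes * nodes_required


all_three = worst_case_delay(2, 3)     # 6 minutes: every node must agree
two_of_three = worst_case_delay(2, 2)  # 4 minutes: a 2/3 majority is enough
```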

So a solution?

There are two simplistic solutions to solve the problem of having to wait for the failure cascade to trigger an alert:

  • Only run some tests on each node, rather than all of them.
  • Allow a more direct "Do you see this failure too?" message to be sent between nodes.

The first solution means that you don't need to wait long periods between each run. For example, if a single node is testing "host1.example.com" -> "host99.example.com" it has to wait for 98 host tests to complete before it sees a failure on the last host.

Given enough nodes you should be able to divide the tests up amongst them - meaning a complete run of all tests happens in parallel and the total test time is reduced.
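Dividing the tests up could be as simple as dealing them out round-robin. A sketch, with `assign_tests` being a hypothetical helper and the node names invented:

```python
def assign_tests(tests: list, nodes: list) -> dict:
    """Deal tests out to nodes in turn so each node runs roughly the same number."""
    if not nodes:
        raise ValueError("at least one node is required")
    assignment = {node: [] for node in nodes}
    for index, test in enumerate(tests):
        assignment[nodes[index % len(nodes)]].append(test)
    return assignment


hosts = [f"host{n}.example.com" for n in range(1, 100)]
plan = assign_tests(hosts, ["node1", "node2", "node3"])
# Each of the three nodes now tests 33 hosts instead of 99.
```

The catch is that each host is then only watched from one place, so a failure still needs confirming from somewhere else before it is believed.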

The second solution means you don't need to passively check to see what your peers think of the current situation; instead you can ask them directly: "Test this host/service for me? KTHXBYE".

So that's my new design:

  • A standalone service checker.
  • A dynamic server:
    • Allowing remote testing-nodes to invoke: "test(Host,Service)" and get immediate feedback.

The service checker will test all hosts+services, recording results. But if it detects a service is down it will immediately query all other testing nodes and ask them to test the failing host/service - reporting the result back instantly.

In fact all the service testing logic will be in the dynamic server and the status checker will literally just say "Test this service on this host". If all is well it will return immediately; if it isn't it will trigger the remote query.
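A sketch of how that design might hang together. The transport between nodes is left out on purpose: each peer is modelled as a callable `test(host, service)` returning True when the service is up, and the names `ask_peers`, `check` and the `quorum` parameter are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def ask_peers(host: str, service: str, peers: list) -> list:
    """Ask every peer to test host/service right now, in parallel."""
    if not peers:
        return []
    with ThreadPoolExecutor(max_workers=len(peers)) as pool:
        return list(pool.map(lambda peer: peer(host, service), peers))


def check(host, service, local_test, peers, quorum: float = 0.5) -> str:
    """Return "OK", "FAIL" or "ALERT" for a single host/service."""
    if local_test(host, service):
        return "OK"  # all is well, return immediately
    answers = ask_peers(host, service, peers)
    # Count this node's own failure alongside the peers' answers.
    failures = 1 + sum(1 for up in answers if not up)
    total = 1 + len(answers)
    return "ALERT" if failures / total >= quorum else "FAIL"
```

Asking in parallel means the wait is roughly one test rather than one test per peer, which is the whole point of moving away from polling status pages.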

posted by [personal profile] skx at 07:06pm on 06/05/2009

Since this is essentially a new journal I'm not sure what to do with it - I am very tempted to import all my content and migrate fully. But at the same time my friends are not all here, so I don't want to do that.

Instead I think for the next while I'll restrict myself to posting only technical things. I can go on a fair amount on that topic without too much risk of running out of things to say.

So, what have I been doing recently?

Fighting Spam

My spam filtering business has been growing a lot recently, to the extent that I'm rejecting on average three emails a second. (That might not sound like a lot, but that is three emails a second every hour of every day for the past month.)

This success has led to some interesting issues of its own, but I will post about that tomorrow.

It does bring me on to the next bane of my life...

MySQL

Or, as I prefer to call it, fucking mysql.

Over the past few months I've been doing some work with another group, a group who have a very popular website built around PHP & MySQL. Unfortunately they seem to have usage patterns that are capable of triggering and discovering bugs in MySQL at an alarming rate.

At one point they were paying for the use (+support) of MySQL Enterprise - these days I'm not so sure - but that was largely a non-productive expense.

Anyway suffice it to say that after MySQL gave up and blamed all the problems on hardware we moved them to a second box recently. Previously this was a MySQL slave, and had been working in a reasonably trouble-free manner.

Once it was promoted to being master: immediate system reboot. Sigh.

One hour later the table checks were complete and the application good to go, but I'm almost certain it is only a matter of hours/days until it dies again. At least we've successfully demonstrated the problem is software-specific.

The odds of their master dying, and then their slave dying, as soon as each becomes "production" are too long for coincidence to be a useful explanation.

So today I've been rescuing, repairing, and configuring replication in the opposite direction. (There is a significant amount of data, so the LVM snapshot + rsync took a few hours, the "replication catchup" will probably take most of the night. le sigh)

Distributed Monitoring

Although I've not done much I've been very interested in the subject for a while. I use a few monitoring solutions at different locations (e.g. at work, and for monitoring my antispam setup) and most of them suck.

Nagios, the canonical monitoring solution, is a nasty piece of software and it seems to fail whenever it should be alerting.

Still I'll post some musings over the next few entries. Unless the topic is dull?

posted by [personal profile] skx at 07:25pm on 04/05/2009

I guess this is where I should make my first post, introducing myself to the world.

Only that feels too weird. I've been posting to livejournal (username: skx) for many years and I expect that initially at least I'll only be read by existing or prior friends from there.

Still that might not be entirely correct, since I have been encouraged by [personal profile] denny to post to the dreamwidth lists about the spam filtering interest I have.

Early days there, but who knows, we might have an interesting time in the future.
