From time to time, this question comes up from customers and partners alike.
What’s the process behind our update procedure for on-premise spam filtering and how do we feed the machine that keeps it detecting spam accurately over time?
I’m quite certain that other managed spam filtering businesses operate in a similar fashion, so this article will probably resonate with other companies in the same line of business.
Here’s the process illustrated:
Vircom receives multiple data streams coming from different types of sources:
- Servers in the field that receive spam
- End-user submissions (via our directQuarantine plugin)
- Administrative submissions (coming from system administrators using the modusConsole)
Messages arrive in a very large spam database, which feeds an analytical engine that can produce spam updates autonomously. User submissions and administrative submissions tend to carry more weight, which is why feedback from clients in the field is so important to us.
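To make the idea of weighting concrete, here is a minimal sketch of how reports from the three sources could be combined into a per-message score. The weight values, the `SOURCE_WEIGHTS` table, and the `score_reports` function are illustrative assumptions, not a description of the actual engine; the only thing taken from the process above is that user and administrative submissions count for more.

```python
from collections import defaultdict

# Hypothetical weights: the only grounded fact is that end-user and
# administrative submissions weigh more than raw field-server traffic.
SOURCE_WEIGHTS = {
    "field_server": 1.0,
    "end_user": 3.0,
    "admin": 5.0,
}

def score_reports(reports):
    """Sum the source weights for each message fingerprint.

    `reports` is an iterable of (fingerprint, source) pairs.
    """
    scores = defaultdict(float)
    for fingerprint, source in reports:
        scores[fingerprint] += SOURCE_WEIGHTS[source]
    return dict(scores)

reports = [
    ("msg-a", "field_server"),
    ("msg-a", "field_server"),
    ("msg-b", "admin"),
    ("msg-b", "end_user"),
]
scores = score_reports(reports)
```

In this toy example, a message reported once by an administrator and once by an end user outranks one seen twice by field servers.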
Once the automated systems have done their job, the resulting spam updates and signatures are pushed out to our client servers in the field. These updates go out as frequently as every five minutes. At this point, we already catch 99%+ of the spam.
The spam team then deals with the leftovers: messages that the automated systems miss for some reason. This is a two-step process. The first step is to produce spam updates based on our Sieve engine. These scripts are meant as a temporary fix to stop a particular spam wave. The next step is to use the samples in our AI engine corpus and launch a retraining session for the artificial intelligence engine we use. This process can take a whole day. An AI update is then pushed out and the scripts we wrote earlier are manually removed.
In a nutshell:
- Automated updates from the analytical system happen every 5 minutes.
- Manual scripted updates are done every 15 to 30 minutes.
- AI updates are done 2 to 3 times per week.
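As a rough illustration of what these cadences mean over a day, the sketch below converts the intervals listed above into an upper and lower bound on pushes per day. The dictionary layout and the `pushes_per_day` helper are hypothetical; only the intervals come from the list.

```python
from datetime import timedelta

DAY = timedelta(days=1)

# (shortest interval, longest interval) per update type, from the list above.
UPDATE_INTERVALS = {
    "automated": (timedelta(minutes=5), timedelta(minutes=5)),
    "manual_script": (timedelta(minutes=15), timedelta(minutes=30)),
}

def pushes_per_day(shortest, longest):
    """Return (fewest, most) pushes that fit in one day for an interval range."""
    return DAY // longest, DAY // shortest

automated = pushes_per_day(*UPDATE_INTERVALS["automated"])
manual = pushes_per_day(*UPDATE_INTERVALS["manual_script"])

print(f"Automated updates per day: up to {automated[1]}")
print(f"Manual scripted updates per day: {manual[0]} to {manual[1]}")
```

These are ceilings rather than guarantees: the automated updates go out "as frequently as" every five minutes, and manual updates depend on the spam team's working hours.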
Automation alone lets us catch 99%+ of the spam that is out there (the percentage fluctuates depending on the types of spam waves that we get). The human element is what makes a big difference.
The advantage of keeping humans in the loop is that it lets us keep the quality of the data high and it lets us do manual tweaking on a client-by-client basis. For instance, in the case of Graymail, we can push out custom scripts to clients who are more or less tolerant of Graymail than others.
A similar process happens for false-positives. False-positives are received from:
- Administrative submissions from the modusConsole
- Quarantine report “release & report” actions
- WebQuarantine “release & report” actions
- directQuarantine “release & report” actions
Since false-positives tend to be costlier to end-users than false-negatives, the weight given to them is very high compared to spam submissions. FPs are processed similarly to spam. If an FP is still caught after going through the automation as a test, we make manual exception scripts, and the FPs are then accumulated for AI retraining. The messages eventually wind up in the clean mail training set.
Modus really shines here: because of the various mechanisms given to both admins and end-users, we have a closed feedback system that lets us tweak the filtering on a minute-by-minute basis.
We get tens of thousands of submissions per day from end users for spam and false-positives, and we act on almost all of them. We do filter out errors (for instance, we’ve seen people submit our quarantine report emails as spam). End users make a lot of mistakes, so we have to watch out for those.
But overall, this feedback does help keep the quality of the updates very high.
Modus protects the email of millions of business end-users worldwide. Those millions of users receive several hundred million messages daily, and that is not counting the spam and malware messages that are already blocked by modus. Some of the spam gets through: on average less than 0.3%, thankfully. Regardless, 0.3% of such a large volume still represents over 1 million bad messages getting through daily. With the excellent mechanisms we have for reporting such messages, including directQuarantine for Outlook, we have made it much easier for our customers to report these messages. As a result, we receive anywhere from 50,000 to 150,000 of those messages daily, depending on volume and spam waves. This is good: it helps us improve things, and we appreciate it.
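The arithmetic behind those figures can be checked with a few lines. Note that the daily volume of 400 million is an assumed placeholder for "several hundred million", chosen only to make the calculation concrete; the 0.3% rate and the 50,000 to 150,000 range are the figures quoted above.

```python
# Assumed figure: "several hundred million" messages daily is not an exact
# number, so 400 million is used here purely as an illustration.
DAILY_MESSAGES = 400_000_000
MISS_RATE_PER_THOUSAND = 3  # 0.3%, the stated upper bound on missed spam

# Integer arithmetic avoids floating-point rounding in the headline number.
missed_per_day = DAILY_MESSAGES * MISS_RATE_PER_THOUSAND // 1000

# Daily submission range quoted in the paragraph above.
reported_low, reported_high = 50_000, 150_000
share_low = reported_low / missed_per_day
share_high = reported_high / missed_per_day

print(f"Missed per day: {missed_per_day:,}")
print(f"Share reported: {share_low:.1%} to {share_high:.1%}")
```

Under that assumed volume, about 1.2 million messages slip through daily, and the submissions received cover roughly 4% to 12.5% of them, which is why every additional report matters.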
By feeding us with good data, end-users help us keep the accuracy high, hence the importance of using the various feedback mechanisms we provide.