Spam Filters and How They Work

Author:      unknown
Submitted:      28-Nov-2004 08:04:51
Imported From:      zZine (original author: Salendor)


Every year, the amount of unsolicited email received by the average user increases. Spam accounts for around 40% of North America's daily email, an increase of 28% since 2002.

Signature Matching

Anti-spam software companies maintain large numbers of email accounts on free email providers such as Yahoo or Hotmail. These accounts are monitored closely for incoming spam. When a spam message inevitably arrives, the vendor creates a signature for it: a string of between 32 and 126 alphanumeric characters derived from the content of the message. The signature is added to a database of other spam signatures, which can be used by any site with a copy of the database installed. The site generates a signature for each incoming message in exactly the same way as the vendor does, and if the generated signature matches anything in the database, the message is deleted.
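
The lookup described above can be sketched roughly as follows. This is only an illustration: it assumes the signature is a SHA-256 hash of the lowercased, whitespace-normalized message body (64 hexadecimal characters, which happens to fall inside the 32 to 126 character range mentioned above). Real vendors use their own signature schemes, and the names make_signature, spam_signatures and is_known_spam are hypothetical.

```python
import hashlib


def make_signature(body: str) -> str:
    """Derive a signature from message content.

    Assumed scheme for illustration: SHA-256 of the lowercased,
    whitespace-normalized body. Not any particular vendor's method.
    """
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical signature database, built by the vendor from spam
# caught in its monitored accounts and copied to each site.
spam_signatures = {make_signature("Buy cheap pills now!!!")}


def is_known_spam(body: str) -> bool:
    """The site computes the signature the same way and looks it up."""
    return make_signature(body) in spam_signatures
```

Because an exact hash changes completely when a single word changes, a sketch like this only catches copies that are identical after normalization.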

Heuristics

Bulk messages tend to share the same characteristics. For example, companies advertising mortgages will often use the phrase "low interest rate" in the text. In a heuristic system, each such phrase is assigned a value, and to determine whether a message is spam, the system adds up the values of the phrases it finds. If the total exceeds a threshold set by the site administrator, the message is treated as spam. Heuristic filtering is one of the fastest and most accurate filtering methods available. It works straight out of the box, has no "learning period", and needs no constant updates from the Net. However, if the rules are poorly written, the filter can have a very high false positive rate.
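
A minimal sketch of this scoring scheme is shown below. The phrases, their values and the threshold are made-up examples, not rules taken from any real filter, and the function names are hypothetical.

```python
# Hypothetical rule set: phrase -> value. Real filters ship many more rules.
RULES = {
    "low interest rate": 2.5,
    "act now": 1.5,
    "click here": 1.0,
}

# Hypothetical threshold; in practice the site administrator sets this.
THRESHOLD = 3.0


def heuristic_score(body: str) -> float:
    """Add up the values of every rule phrase found in the message."""
    text = body.lower()
    return sum(value for phrase, value in RULES.items() if phrase in text)


def is_spam(body: str, threshold: float = THRESHOLD) -> bool:
    """Treat the message as spam when the total exceeds the threshold."""
    return heuristic_score(body) > threshold
```

For instance, a message containing both "low interest rate" and "act now" scores 4.0 and is flagged, while a message that only says "click here" scores 1.0 and passes. Lowering the threshold or inflating the values is exactly how poorly written rules produce false positives.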

Bayesian Filtering

A Bayesian filter "learns" the difference between spam and genuine messages by examining two large collections of mail and finding the characteristics that distinguish them. One collection contains spam collected from a site, and the other contains non-spam messages received by the same site. When a Bayesian filter receives a message, it splits it into separate words, which are then scanned for any of the "interesting words" that the filter found during the learning period, that is, words that might indicate whether the message is spam. The filter weighs the spam and non-spam characteristics of the message against each other and then makes its decision. For example, if a message contains the word 'viagra' or '####' but also contains many non-spam words, the filter may still let it through. One disadvantage is that the filter requires a long learning period before it can differentiate between spam and non-spam, although the vendor of the program can carry out this training before the program is shipped.
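
The weighing described above can be sketched as a small naive Bayes classifier. This is an illustration under stated assumptions rather than the algorithm of any particular product: it splits on whitespace, uses add-one smoothing, sums log-probability ratios, and the class and method names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split a message into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()


class BayesFilter:
    def __init__(self) -> None:
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Learning period: count words in a message labelled 'spam' or 'ham'."""
        self.words[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_log_odds(self, text: str) -> float:
        """Positive means the evidence leans towards spam, negative towards ham."""
        # Vocabulary size plus one slot reserved for words never seen in training.
        vocab = len(set(self.words["spam"]) | set(self.words["ham"])) + 1
        totals = {label: sum(c.values()) for label, c in self.words.items()}
        # Prior odds from the number of training messages (add-one smoothed).
        odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for word in tokenize(text):
            p_spam = (self.words["spam"][word] + 1) / (totals["spam"] + vocab)
            p_ham = (self.words["ham"][word] + 1) / (totals["ham"] + vocab)
            odds += math.log(p_spam / p_ham)
        return odds

    def is_spam(self, text: str) -> bool:
        return self.spam_log_odds(text) > 0
```

Trained on a handful of messages, a filter like this flags a message made up only of spam-associated words, but a message that mentions 'viagra' once among many words seen in genuine mail can still come out with negative log-odds and be let through.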

Challenge/Response

Only a minority of spam messages are actually sent by real people; most are sent by automated programs. Challenge/Response systems take advantage of this fact by making the sender pass a test to show that they are human. When a message is received, the system sends a reply to the sender's address explaining that the purpose of the reply is to cut down on spam sent to the site. The reply also contains a challenge, usually reached through a link to the site, which typically takes the form of a distorted image of an alphanumeric code. The sender is asked to enter the code. If the code is correct, the original message is allowed through. One disadvantage of this system is that automated emails sent by legitimate companies would be deleted, as the company would generally not answer the challenge.
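
The hold-and-release logic behind such a system might look roughly like the sketch below. It leaves out the parts that matter most in practice (sending the reply and rendering the code as a distorted image) and only models the bookkeeping. The ChallengeGate class and its methods are hypothetical names chosen for illustration.

```python
import secrets


class ChallengeGate:
    def __init__(self) -> None:
        self.held = {}    # challenge code -> (sender, message) awaiting an answer
        self.inbox = []   # messages that have been allowed through

    def receive(self, sender: str, message: str) -> str:
        """Hold the message and return the code to present to the sender.

        A real system would email the sender a link and show this code
        as a distorted image; here it is simply returned.
        """
        code = secrets.token_hex(4)  # 8 random hexadecimal characters
        self.held[code] = (sender, message)
        return code

    def answer(self, code: str) -> bool:
        """Release the held message if the submitted code is correct."""
        entry = self.held.pop(code, None)
        if entry is None:
            return False
        self.inbox.append(entry)
        return True
```

A message whose challenge is never answered simply stays in held, which is also what happens to automated mail from legitimate senders.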

Conclusion

Many anti-spam programs are freely available today, but no single method will solve the problem completely. The best solution is to run several programs in parallel, overlapping each other to provide extra security. At the same time, it is important not to run too many filters, because beyond a certain point the gain in accuracy is minimal compared with the large increase in server load.

This article was originally published by CyberArmy.net in the CyberArmy Library.
