For quite some time, SpamBayes, which is based on a naive Bayesian classifier, filtered my email very accurately, with very few false positives.

Recently, however, I have noticed a few alarming trends in spamming:

  • Database poisoning: Using otherwise innocuous words (ham words) in spam. As such messages get classified as spam, those ham words pick up spam weight, which poisons the database in the long run.
  • Junk Tags: Hiding spam words by inserting invalid HTML tags inside them. An HTML parser ignores tags it doesn’t understand, so the message still displays properly while the filter sees only fragments.
  • Invalid Words: Spam words such as “mortgage” are masked by inserting special or junk characters between their letters.
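
To make the last two tricks concrete, here is a small sketch of how they slip past a simple word-based tokenizer. This is not how SpamBayes tokenizes mail; the tokenizer, the `SPAM_WORDS` set and the sample strings are all made up for illustration.

```python
import re

# Hypothetical spam vocabulary, for illustration only.
SPAM_WORDS = {"mortgage", "refinance"}


def naive_tokens(text):
    """Split on anything that is not a letter and lowercase the pieces."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def spam_hits(text):
    """Return the tokens that exactly match a known spam word."""
    return [t for t in naive_tokens(text) if t in SPAM_WORDS]


plain = "Lowest mortgage rates"
junk_tags = "Lowest mort<xyz>gage rates"      # invalid tag inside the word
junk_chars = "Lowest m.o.r.t.g.a.g.e rates"   # junk characters between letters

for sample in (plain, junk_tags, junk_chars):
    print(sample, "->", spam_hits(sample))
```

Only the plain version produces a hit. The other two break the word into pieces the filter has never been trained on, even though a human reader still sees "mortgage".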

Solutions I could think of:

  • Database poisoning: Most database-poisoning emails tend to be classified in the Not Sure category. I suggest deleting them instead of classifying them as spam. However, that still means spending some time on them, which is what I don’t like.
  • Junk Tags: Add a filter in front of the Bayesian classifier to eliminate junk tags.
  • Invalid Words: Inexact (fuzzy) matching algorithms, such as those in Lucene, should help.
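
The last two ideas could be sketched as a preprocessing step that runs before the classifier sees the message. This is only a rough sketch under several assumptions: Python's standard `difflib` stands in for Lucene-style fuzzy matching, and `SPAM_WORDS`, `BLOCK_TAGS`, the `0.8` cutoff and all function names are hypothetical choices of mine, not part of SpamBayes or Lucene.

```python
import re
from difflib import get_close_matches

# Hypothetical values, chosen only for this illustration.
SPAM_WORDS = ["mortgage", "refinance"]
BLOCK_TAGS = {"br", "p", "div", "td", "tr", "li"}  # tags that separate words

TAG_RE = re.compile(r"</?\s*([A-Za-z0-9]*)[^>]*>")


def strip_tags(text):
    """Drop tags the way a renderer would: block tags become a space,
    everything else (including junk tags) disappears entirely."""
    def replace(match):
        return " " if match.group(1).lower() in BLOCK_TAGS else ""
    return TAG_RE.sub(replace, text)


def normalize_token(token, cutoff=0.8):
    """Remove non-letters, then map near misses onto a known spam word."""
    letters = re.sub(r"[^a-z]", "", token.lower())
    match = get_close_matches(letters, SPAM_WORDS, n=1, cutoff=cutoff)
    return match[0] if match else letters


def preprocess(text):
    """Return cleaned tokens that could be handed to the Bayesian classifier."""
    tokens = (normalize_token(t) for t in strip_tags(text).split())
    return [t for t in tokens if t]


print(preprocess("Lowest mort<xyz>gage rates"))
print(preprocess("Lowest m.o.r.t.g.a.g.e rates"))
```

Both samples come out as the tokens "lowest", "mortgage" and "rates", so the classifier gets to score the word the spammer tried to hide. The cutoff is a trade-off: set it too low and legitimate words that merely resemble a spam word get rewritten, which would cause exactly the false positives I want to avoid.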

I have recently noticed a significant increase in mortgage spam. It should be easy to tackle it by legal means.

Overall, the game is getting tougher for spam prevention. A combination of existing techniques is required for any spam filter to remain effective.

Looking forward to hearing your thoughts.