For quite some time, SpamBayes, which is based on a naive Bayesian classifier, filtered my email very accurately, with very few false positives.

Recently, however, I have noticed a few alarming trends in spamming.

  • Database poisoning: Using otherwise innocuous words (ham words) in a spam message, thereby effectively poisoning the database in the long run.
  • Junk Tags: Hiding spam words by inserting invalid HTML tags in the middle of them. HTML parsers ignore tags they don't understand, so the message still renders properly for the reader.
  • Invalid Words: Spam words such as "mortgage" are masked by inserting special or junk characters between their letters.
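
To make the first trend concrete, here is a rough sketch of how ham-word padding drags a message's score toward the middle. It uses the plain naive Bayes combination of per-token spam probabilities; SpamBayes combines token evidence differently, so treat this as an illustration only. The function name and all the probabilities are made up for the example.

```python
from math import prod  # Python 3.8+


def combined_spam_score(token_probs):
    """Plain naive Bayes combination of per-token spam probabilities."""
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


# Made-up per-token spam probabilities, for illustration only.
spam_tokens = [0.99, 0.95]              # strongly spammy words
ham_padding = [0.10, 0.20, 0.15, 0.10]  # innocuous words pasted into the spam

plain = combined_spam_score(spam_tokens)
padded = combined_spam_score(spam_tokens + ham_padding)

print(f"spam words only:  {plain:.3f}")
print(f"with ham padding: {padded:.3f}")
```

With these numbers the padded message drops from near-certain spam to roughly the middle of the scale. If such messages are then trained as spam, the padding words themselves start to look spammy, which is the long-run poisoning.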

Solutions I could think of:

  • Database poisoning: Most database-poisoning emails tend to be classified in the Not Sure category. I suggest that you delete them instead of classifying them as spam. However, that still requires us to spend some time on them, which I don't like.
  • Junk Tags: Add a filter in front of the Bayesian classifier to eliminate junk tags.
  • Invalid Words: Inexact (fuzzy) matching algorithms, such as those in Lucene, should help.
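
The last two ideas can be sketched as a small preprocessing step that runs before the classifier sees the message. This is only a minimal sketch under several assumptions: it strips everything that looks like a tag rather than only the invalid ones, it uses Python's standard `difflib` as a stand-in for Lucene-style fuzzy matching, and `SPAM_WORDS`, `strip_tags`, `closest_spam_word`, and `normalize` are hypothetical names, not part of SpamBayes.

```python
import re
from difflib import get_close_matches

# Hypothetical watch list; a real filter would derive it from training data.
SPAM_WORDS = ["mortgage", "viagra", "refinance"]

TAG_RE = re.compile(r"<[^>]*>")
NON_ALNUM_RE = re.compile(r"[^a-z0-9]")


def strip_tags(text):
    """Remove anything that looks like an HTML tag, valid or not."""
    return TAG_RE.sub("", text)


def closest_spam_word(token, threshold=0.8):
    """Return the watch-list word that token most resembles, or None."""
    # Drop inserted punctuation first, so "m.o.r.t.g.a.g.e" becomes "mortgage".
    cleaned = NON_ALNUM_RE.sub("", token.lower())
    matches = get_close_matches(cleaned, SPAM_WORDS, n=1, cutoff=threshold)
    return matches[0] if matches else None


def normalize(text):
    """Strip tags, then map disguised tokens back to their watch-list word."""
    tokens = strip_tags(text).split()
    return [closest_spam_word(t) or t.lower() for t in tokens]


print(normalize("Low mort<zz>gage ra<q>tes"))
print(closest_spam_word("m0rtgage"))
```

The similarity threshold is a trade-off: set it too low and ordinary words start matching the watch list, which would bring back the false positives the filter has so far avoided.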

I have recently noticed a significant increase in mortgage spam. It should be easy to tackle it by legal means.

Overall, the game is becoming tougher for spam prevention. A combination of existing techniques is required for any spam filter to remain effective.

Looking forward to hearing your thoughts.