For quite some time, SpamBayes, which is based on a naive Bayesian classifier, filtered my email very accurately, with very few false positives.

Recently, however, I have noticed a few alarming trends in spamming.

  • Database poisoning: Using otherwise innocuous words (ham words) in a spam message, thereby effectively poisoning the database in the long run.
  • Junk tags: Hiding spam words by inserting invalid HTML tags in between words. An HTML parser ignores tags it doesn't understand, so the document still displays properly.
  • Invalid words: Spam words such as "mortgage" are masked by inserting special characters or junk characters in between the letters.

Solutions I could think of:

  • Database poisoning: Most database-poisoning emails tend to end up in the Not Sure category. I suggest that you delete them instead of classifying them as spam, so that the ham words they contain are not learned as spam indicators. However, this still means spending some time on them, which is what I don't like.
  • Junk tags: Add a filter in front of the Bayesian classifier to eliminate junk tags.
  • Invalid words: Inexact (fuzzy) matching algorithms, such as those in Lucene, should help.
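
The last two ideas could be sketched roughly as a preprocessing step that runs before the classifier sees the message. This is only an illustration under a few assumptions: the `WATCH_WORDS` list and the function names are made up for this example, Python's standard `difflib` stands in for Lucene's fuzzy matching, and nothing here is part of SpamBayes itself.

```python
import re
from difflib import get_close_matches

# Hypothetical watch list; a real filter would derive it from its training data.
WATCH_WORDS = ["mortgage", "refinance"]

# Anything between angle brackets is treated as a tag, valid HTML or not.
TAG_RE = re.compile(r"<[^<>]*>")


def strip_tags(text):
    """Remove tag-like spans so words split by junk tags are joined again."""
    return TAG_RE.sub("", text)


def normalize_token(token, cutoff=0.8):
    """Map an obfuscated token to a watch word when it is similar enough."""
    # Drop everything except letters, e.g. "m.o.r.t.g.a.g.e" -> "mortgage".
    letters = re.sub(r"[^a-z]", "", token.lower())
    match = get_close_matches(letters, WATCH_WORDS, n=1, cutoff=cutoff)
    return match[0] if match else token


def preprocess(text):
    """Strip junk tags, then normalize each whitespace-separated token."""
    return [normalize_token(t) for t in strip_tags(text).split()]


if __name__ == "__main__":
    print(preprocess("Low mort<xyz>gage rates, m.o.r.t.g.a.g.e today"))
```

The token list returned by `preprocess` is what would then be handed to the Bayesian classifier. Two caveats: removing a tag outright also glues together two separate words that had only a tag between them, and the `cutoff` of 0.8 is a guess that would need tuning to avoid rewriting legitimate words.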

I have recently noticed a significant increase in mortgage spam. It should be easy to tackle it by legal means.

Overall, the game is becoming tougher for spam prevention. A combination of existing techniques is required for any spam filter to remain effective.

Looking forward to hearing your thoughts.