Originally posted by Delgarde
Seriously, though, we really need smarter anti-spam. For example, something that's better at taking into account the kinds of links that a post contains.
For one of my own sites, I'm planning a system which files messages directly into the "ham" (jargon for non-spam) corpus, with the option to manually re-file them later, unless they meet a certain standard for suspiciousness. That standard can include things like "has URLs" and "contains keywords like SEO", but it ignores URLs on a whitelist of "sufficiently moderated URL types" (e.g. Wikipedia article URLs, IMDB entries, etc.).
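A rough sketch of how that routing rule could look, using only the Rust standard library. The names here (`WHITELIST`, `KEYWORDS`, `Verdict`, `classify`) and the specific URL prefixes are assumptions made up for illustration, not part of any existing system, and the URL extraction is deliberately naive.

```rust
/// Hypothetical whitelist of "sufficiently moderated" URL prefixes.
const WHITELIST: &[&str] = &[
    "https://en.wikipedia.org/wiki/",
    "https://www.imdb.com/title/",
];

/// Hypothetical keyword list, matched case-insensitively as whole words.
const KEYWORDS: &[&str] = &["seo"];

#[derive(Debug, PartialEq)]
enum Verdict {
    /// File straight into the ham corpus (can be re-filed by hand later).
    Ham,
    /// Suspicious enough to hand to the real spam analysis.
    NeedsAnalysis,
}

/// Naive URL extraction: whitespace-delimited tokens starting with http(s).
fn extract_urls(message: &str) -> Vec<&str> {
    message
        .split_whitespace()
        .map(|tok| tok.trim_matches(|c: char| matches!(c, '(' | ')' | '<' | '>' | '"' | ',' | '.')))
        .filter(|tok| tok.starts_with("http://") || tok.starts_with("https://"))
        .collect()
}

fn is_whitelisted(url: &str) -> bool {
    WHITELIST.iter().any(|prefix| url.starts_with(*prefix))
}

fn has_keyword(message: &str) -> bool {
    message
        .split(|c: char| !c.is_alphanumeric())
        .any(|word| KEYWORDS.iter().any(|k| word.eq_ignore_ascii_case(k)))
}

/// A message is suspicious if it has a non-whitelisted URL or a keyword hit.
fn classify(message: &str) -> Verdict {
    let suspicious_url = extract_urls(message).iter().any(|u| !is_whitelisted(u));
    if suspicious_url || has_keyword(message) {
        Verdict::NeedsAnalysis
    } else {
        Verdict::Ham
    }
}
```

Matching on a full prefix that ends in a slash, rather than on a bare substring, keeps something like `https://en.wikipedia.org.example.com/` from slipping through the whitelist.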
To ease adding things to the whitelist, I'll have it tally up the domains that are getting linked to and produce a monthly report of un-whitelisted URLs showing up in non-spam messages.
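The tallying step might be sketched like this, again with std only. `domain_of` and `tally_domains` are hypothetical names, the whitelist is simplified to bare domains for the example, and the caller is assumed to pass in one month's worth of non-spam messages.

```rust
use std::collections::HashMap;

/// Extract the lowercased host of an http(s) URL, or None if it isn't one.
fn domain_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The host ends at the first path, query, fragment, or port delimiter.
    let host = rest
        .split(|c: char| matches!(c, '/' | '?' | '#' | ':'))
        .next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Count un-whitelisted domains linked from ham messages,
/// most frequent first (ties broken alphabetically).
fn tally_domains<'a>(
    ham_messages: impl IntoIterator<Item = &'a str>,
    whitelist: &[&str],
) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for message in ham_messages {
        for token in message.split_whitespace() {
            // Drop sentence punctuation stuck to the end of a URL.
            let token = token.trim_end_matches(|c: char| matches!(c, '.' | ',' | ')'));
            if let Some(domain) = domain_of(token) {
                if !whitelist.contains(&domain.as_str()) {
                    *counts.entry(domain).or_insert(0) += 1;
                }
            }
        }
    }
    let mut report: Vec<(String, usize)> = counts.into_iter().collect();
    report.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    report
}
```

Sorting by count puts the best whitelist candidates at the top of the report, so the domains that legitimate posters link to most often get reviewed first.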
To deal with the kinds of obfuscation Disqus spammers seem to like, I'll use some Rust code (for high-performance text processing) and the Unicode tables to catch messages which use characters from an abnormally broad range of scripts, then run those messages through the TR39 skeleton algorithm before feeding them into the spam analysis.
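As a very rough illustration of the script-counting half of that idea, here is a std-only sketch. It is an approximation under loud assumptions: the hard-coded code point ranges are coarse Unicode blocks standing in for the real Script property tables, the names (`coarse_script`, `scripts_used`, `looks_obfuscated`) are hypothetical, and the threshold is a guess. The TR39 skeleton step is not shown, since it needs the Unicode confusables data.

```rust
use std::collections::BTreeSet;

/// Coarse stand-in for the Unicode Script property, based on block ranges.
/// Non-letters return None so punctuation and digits don't count.
fn coarse_script(c: char) -> Option<&'static str> {
    if !c.is_alphabetic() {
        return None;
    }
    let name = match c as u32 {
        0x0041..=0x024F => "Latin",
        0x0370..=0x03FF => "Greek",
        0x0400..=0x052F => "Cyrillic",
        0x0530..=0x058F => "Armenian",
        0x0590..=0x05FF => "Hebrew",
        0x0600..=0x06FF => "Arabic",
        0x13A0..=0x13FF => "Cherokee",
        // Lumped together so ordinary Japanese text counts as one bucket.
        0x3040..=0x30FF | 0x4E00..=0x9FFF => "CJK",
        0xAC00..=0xD7AF => "Hangul",
        0xFF00..=0xFFEF => "Fullwidth",
        0x1D400..=0x1D7FF => "MathAlnum",
        _ => "Other",
    };
    Some(name)
}

/// The set of coarse script buckets that a message's letters fall into.
fn scripts_used(message: &str) -> BTreeSet<&'static str> {
    message.chars().filter_map(coarse_script).collect()
}

/// Flag messages that mix more buckets than `max_scripts` allows.
fn looks_obfuscated(message: &str, max_scripts: usize) -> bool {
    scripts_used(message).len() > max_scripts
}
```

For instance, `looks_obfuscated("\u{0421}heap v\u{0456}agra fr\u{03BF}m us", 2)` flags the message, because its letters span Latin, Cyrillic, and Greek, while plain English text stays under the threshold. A per-word check would likely be a stronger signal than a per-message count, since legitimate multilingual posts rarely mix scripts inside a single word.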
Hell, if abusive language becomes a problem, I might have some fun trolling the trolls with a "glory days of 4chan"-style phrase-substitution filter which replaces profanity with formal antonyms like "fine gentleman" even as it pings me to review the message.