Why can't on-line news sites filter spam comments?

I have a question of fact and, since I’m posting it on a subsidiary of the Chicago Tribune newspaper, I thought I’d invite a response from the Editor-in-Chief, or whoever is delegated the role of on-line Comments Editor or On-Line Technologies Programmer, or whoever is in the decision-making capacity most closely related to this matter.

A news site puts up (publishes?) an article for viewing. It’s not controversial, momentous, salacious, heartwarming, or groundbreaking, maybe not even a current-events issue; not much of a big deal – something like “Unpublished manuscript from Edgar Allan Poe discovered in Baltimore attic.”

There’s really nothing much to say. Nevertheless, the first – and often the only – comment posted is someone using the space to say “My step-father’s mistress’s third cousin’s aunt-in-law makes $16 billion a week on-line. Follow this link to learn how: www.iamasleazyspammer.biz”

They’re not even contributing to a discussion of the article. I mean, they’re not all that easy to detect when they’re buried within the discussion of a hot topic, but they should be easy enough to weed out of a mundane news item. In fact, those mundane items where they show up as the sole faux comment could really be the harvesting grounds for name-gathering software to identify the spam-commenters. Why, oh why, aren’t these guys put on a shared blacklist? Why aren’t these posts detected and deleted within minutes (or at least within hours) of being posted?

The bottom of this news article includes a perfect example of the kind of garbage I’m talking about: http://www.rr.com/articles/2014/02/20/g/gov-t-looking-into-atf-operations-in-4-cities

–G!

I doubt there is a hard-and-fast factual answer. We know the technology exists, but we also know the filtering software is far from perfect. The best solution right now is to use humans, as the SDMB does in the guise of moderators, to filter and police. I can only assume that the paper doesn’t think the labor cost would justify the rewards; papers aren’t exactly flush with cash right now.

I police my own YouTube comment areas and wipe out the worst spam as soon as I can, which takes time. Some YT posters have disabled comments entirely for just this reason, but the paper may think that’s too harsh and user-unfriendly.

It looks like your news article does have such a filter, because there don’t seem to be any comments at all.

Which, for all we know, could be due to a human filter rather than an automated one.

Having an automated system try to filter things is much tougher than having a human look things over and spot the “obvious” spam. I mean, I certainly know that breast cancer awareness websites are not porn, but we constantly hear about automated porn filters that have trouble telling the difference.
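To make that concrete, here’s a toy sketch (made-up word list, not any real filter’s logic) of how a naive keyword filter produces exactly that false positive:

```python
# Toy illustration of keyword-based filtering and its false positives.
# The word list and examples are invented for this sketch.

BLOCKED_WORDS = {"breast", "xxx", "nude"}

def naive_filter(text: str) -> bool:
    """Return True if the text should be blocked."""
    words = set(text.lower().split())
    return bool(words & BLOCKED_WORDS)

print(naive_filter("Join our breast cancer awareness walk"))  # True: false positive
print(naive_filter("Hot singles want to meet you"))           # False: spam slips through
```

A human reviewer resolves in a glance what the keyword match cannot: context.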

Spammers are not stupid. They know exactly what automated detection methods are currently being used, and construct their spam carefully to avoid them.

For example, back in the old Usenet days, anti-spam filters started by blocking floods of posts with identical content. So the spammers started posting messages whose first couple of lines carried the ad, followed by a block of random characters. The filters adjusted to catch that, but then the spammers changed tactics so the padding was just text pulled from other messages, or novels, or whatever. Some filters tried to adjust by ignoring everything but the first couple of lines, but that caught legitimate posts (for example, people posting short stories that always started with an identical copyright notice).
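Here’s a rough, hypothetical sketch of that arms race; the fingerprinting idea and both failure modes are as described above, but the code itself is only an illustration:

```python
import hashlib

def fingerprint_whole(body: str) -> str:
    # Generation 1: hash the entire message to catch identical mass postings.
    return hashlib.sha256(body.encode()).hexdigest()

def fingerprint_prefix(body: str, lines: int = 2) -> str:
    # Generation 2: hash only the first couple of lines, to catch spam
    # that appends random padding after the ad.
    prefix = "\n".join(body.splitlines()[:lines])
    return hashlib.sha256(prefix.encode()).hexdigest()

spam_a = "BUY NOW at example.biz\nAmazing deal!\nxkq93 zzv 18hqn"
spam_b = "BUY NOW at example.biz\nAmazing deal!\npelv 02jw qqa7"

# Whole-message hashes differ, so generation 1 misses the padded copies:
print(fingerprint_whole(spam_a) == fingerprint_whole(spam_b))    # False
# Prefix hashes match, so generation 2 catches them:
print(fingerprint_prefix(spam_a) == fingerprint_prefix(spam_b))  # True

# ...but two legitimate stories opening with the same copyright notice
# collide too, which is the false positive described above.
story_a = "Copyright 1994. All rights reserved.\nChapter 1\nOnce upon a time..."
story_b = "Copyright 1994. All rights reserved.\nChapter 1\nIt was a dark night..."
print(fingerprint_prefix(story_a) == fingerprint_prefix(story_b))  # True: false positive
```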

Which is not to say that all is lost; it’s just that it’s harder than it looks.

Human moderation is better at identifying spam, but it generally requires a report system, and the spam sits there until a moderator can take action, so users will see it for a while even if it does eventually get caught and deleted.
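A minimal sketch of that workflow (a hypothetical structure, not any site’s actual system) shows why reported spam stays visible until a human gets to it:

```python
from collections import deque

published = {}          # comment_id -> text; visible to readers immediately
report_queue = deque()  # comment ids waiting for a human moderator

def post_comment(comment_id: int, text: str) -> None:
    published[comment_id] = text  # no pre-screening: visible right away

def report(comment_id: int) -> None:
    report_queue.append(comment_id)  # still visible until a mod acts

def moderate_next() -> None:
    if report_queue:
        comment_id = report_queue.popleft()
        # A human decides; for the sketch we just delete it.
        published.pop(comment_id, None)

post_comment(1, "Make $16 billion a week on-line! Follow this link...")
report(1)
print(1 in published)  # True: readers still see the spam
moderate_next()
print(1 in published)  # False: gone only after a moderator acts
```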

They do. They successfully filter lots of spam, and not just the obvious stuff.

Spammers know this, and they get feedback whenever they see their content failing to get through. So they change until it does. The spam that you see is, by definition, the stuff that they managed to slip past.

It looks obvious to you when that’s all you see. But in the context of all of the candidate comments that come through, most of which you don’t see, it’s much harder than you’d think to find those few needles in the haystack.
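Some made-up numbers illustrate the base-rate problem: even a filter with a very high catch rate leaves a visible residue, and that residue is all you ever see.

```python
# Illustrative volumes only; the real numbers aren't public.
spam_submitted = 20_000  # spam comments attempted per day
catch_rate = 0.995       # fraction the automated filter blocks

visible_spam = spam_submitted * (1 - catch_rate)
print(round(visible_spam))  # 100 spam comments a day still get through
```

Blocking 19,900 comments a day is invisible work; the 100 that slip through are the only evidence readers ever see.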

The particular spammer you’re referring to is among the worst of the worst. They employ (or more likely deceive) humans into posting the spam for them. Those humans use a variety of tools to multiply and alter their comments in ways that are specifically designed to make filtering (both automated and manual) difficult. And those ways change frequently.

In my experience, it’s because a lot of them don’t have a way to report a spam post. So any spam that gets through any possible spam filters just sits there. It’s not like moderators want to have to go back and visit the articles over and over.

The sites with the least spam actually close comments after a certain amount of time, so the moderators can move on.

Moderators don’t usually examine comments one article at a time. They see the stream of comments coming in, regardless of what articles those comments were posted on.

“Report as spam” buttons and upvote systems have their own problems. People spam those buttons on comments they disagree with. Spammers upvote their own comments. And it’s harder to filter and moderate button clicks than it is to filter comments.
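One common mitigation, sketched below with made-up weights rather than any particular site’s scheme, is to weight each report by the reporter’s track record, so bad-faith clicks count for little:

```python
# Hypothetical report-weighting scheme: a user's reports count in
# proportion to how often their past reports were upheld by moderators.

reporter_accuracy = {
    "longtime_user": 0.9,   # most past reports were valid
    "grudge_holder": 0.1,   # mostly reports comments they disagree with
    "spammer_sock": 0.0,    # the spammer's own accounts
}

def weighted_report_score(reporters: list[str]) -> float:
    # Unknown users get a neutral default weight of 0.5.
    return sum(reporter_accuracy.get(user, 0.5) for user in reporters)

# Ten reports from a bad-faith account barely move the needle...
print(round(weighted_report_score(["grudge_holder"] * 10), 2))  # 1.0
# ...while two reports from trusted users carry nearly double the weight.
print(round(weighted_report_score(["longtime_user"] * 2), 2))   # 1.8
```

Of course, the weighting itself now has to be maintained and defended against gaming, which is part of why filtering button clicks is its own moderation problem.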

One useful fact would be that the Chicago Tribune does not own the Chicago Sun-Times.