What possible purpose is there to web spam that is just nonsense content?
I have a web site for my jazz trio with a Contact Us page. It takes the user’s message and emails it to me. I manage to filter out most of the spam (I have not set up CAPTCHA yet) but sometimes something gets through and it’s usually just some random characters. Here’s the latest. The actual message is about 20 times the size of this sample but it just goes on like this. The user gave a Gmail address as their address. I’m assuming it was generated by a bot. The ones I filter out are those that have hyperlinks in them but at least those were intelligible.
I’m guessing that that’s some non-Latin script (possibly Chinese) that got garbled somewhere (most likely by the bytes being offset by one). The old ASCII text encoding needed only one byte per letter, but it could only represent the basic Latin alphabet (plus a handful of other characters). Most text nowadays is Unicode, usually stored as UTF-8, which still uses one byte for ASCII characters but two or more for everything else, and so if those bytes get mismatched or are read as a one-byte encoding, you get garbled nonsense that doesn’t make sense in any language.
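To make that concrete, here is a minimal Python sketch of how this kind of garbling can happen. It assumes the text was sent as UTF-8 and then read as Latin-1; which wrong encoding is actually involved on the OP's site is unknown, and the sample string and the function names `garble` and `ungarble` are made up for illustration.

```python
def garble(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode text as UTF-8, then misread the bytes as a one-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def ungarble(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Undo the misreading, assuming we guessed the wrong encoding correctly."""
    return garbled.encode(wrong_encoding).decode("utf-8")


# Hypothetical sample: a few Cyrillic words.
sample = "Цены на монтаж"

# Each Cyrillic letter takes two bytes in UTF-8; the ASCII spaces take one.
print(len(sample), "characters,", len(sample.encode("utf-8")), "bytes")

# Every byte becomes its own (wrong) character, so the result is nonsense.
print(repr(garble(sample)))

# Nothing is lost as long as the bytes themselves survive intact.
print(ungarble(garble(sample)))
```

The last step is the useful part: if the bytes arrive intact, the text is recoverable, and only the label saying how to read them was wrong.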
If anyone wants the technological details this is a good intro:
I would bet that the OP’s website, or at least the contact page and the server-side code that creates and sends the email, is ancient and US English centric.
The info being posted by the spammer is probably well-formed modern Unicode Chinese or Russian. It’s the OP’s website that’s mangling it.
Well, certainly it’s US centric. It’s not ancient but I am not a professional web developer, and this is intended for a local audience, not international. And even if I supported languages in other character sets the content would be useless to me.
The server-side is PHP and I am just using an HTML form to collect whatever they enter into the textbox and copying the content to the body of an email. Offhand I don’t know which of those steps are hostile to other character sets.
It just occurred to me to View Source and this is what I get:
Joshuazew has sent you the following message:
Цены на монтаж водонагревателя Стоимость услуг VENCON оправдана качеством и высоким уровнем обслуживания наших Клиентов. Стандартный монтаж водонагревателя Вид работ Цена услуги Цена при покупке товара у нас Установка бойлера от 5 до 35 литров 1 999 грн 1 399 грн Установка бойлера от 40 до 80 литров 1 999 грн 1 399 грн Установка бойлера от 100 до 160 литров 1 999 грн 1 399 грн Установка бойлера от 200 до 300 литров 1 999 грн 1 399 грн Замена бойлера 2 999 грн 2 099 грн Демонтаж бойлера 999 грн 999 грн Скачать полный прайс материалов и работ Стандартная установка включает следующие работы Согласование даты и времени монтажа с менеджером. Выезд нашего специалиста на
and so on. Google Translate tells me it’s a Russian promo for water heater installation.
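Since View Source shows intact Russian, one plausible reading is that the bytes reach the mailbox fine and only the charset label on the email is missing, so the mail client has to guess. The OP's code is PHP and isn't shown, so this is only a Python sketch of the general idea: declare UTF-8 explicitly when building the message. The function name `build_contact_email` and the subject line are hypothetical.

```python
from email.message import EmailMessage


def build_contact_email(sender_name: str, body: str) -> EmailMessage:
    """Build a plain-text email whose Content-Type header declares UTF-8."""
    msg = EmailMessage()
    msg["Subject"] = f"Contact form message from {sender_name}"
    # charset="utf-8" is already the default for str content in EmailMessage;
    # it is spelled out here because it is the point of the example.
    msg.set_content(body, subtype="plain", charset="utf-8")
    return msg


msg = build_contact_email("Joshuazew", "Цены на монтаж водонагревателя")

# The header a mail client uses to decide how to decode the body.
print(msg["Content-Type"])
```

The same idea applies on the web side: the page containing the form should declare its encoding too, so the browser submits the text in the encoding the server expects.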
Super simple captchas like that will work if you’re the only one using it, and your site isn’t very popular. As soon as it’s worth it for anyone to make a bot for it, though, it becomes completely ineffective.
That is true as far as it goes. However, there is a foolproof technique: if you add a simple checkbox certifying that “I am human”, no bot will ever be able to click it. It would violate the Laws of Robotics, or something.
You could easily identify fellow SDMB members by their nitpicking about how ambiguous this string is, and by their pointing out that, prima facie, at least “3” and “4” both seem to be reasonably good answers.
There’s a lot more behind that checkbox than you think. It probably measures things like precisely how the mouse moves to the checkbox, and apparently it’s difficult for a bot to emulate a human in that way. And it’s never the only line of defense: It only gives the checkbox if it’s already pretty sure that you’re a human (from lots of prior human interactions from that computer), and if it’s unsure, it gives you one of the “click every box that contains a traffic light” ones.
On several of my websites I use a function that counts the non-Latin characters in a message and flags it as spam if they make up more than about a third of it. I think you can post up to 1024 characters; anything longer isn’t even stored in the DB, it’s just marked as spam.
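A filter along those lines might look something like this in Python. This is a guess at the approach, not the poster's actual function: the names are made up, "Latin" is decided from the Unicode character name, and the ratio is taken over letters only (ignoring digits, spaces and punctuation), all of which are assumptions.

```python
import unicodedata

MAX_LENGTH = 1024            # assumed limit, taken from the post above
MAX_NON_LATIN_RATIO = 1 / 3  # "about a third"


def is_latin_letter(ch: str) -> bool:
    """True if the character's Unicode name marks it as a Latin letter."""
    return "LATIN" in unicodedata.name(ch, "")


def non_latin_ratio(message: str) -> float:
    """Fraction of the letters in the message that are not Latin letters."""
    letters = [ch for ch in message if ch.isalpha()]
    if not letters:
        return 0.0
    non_latin = sum(1 for ch in letters if not is_latin_letter(ch))
    return non_latin / len(letters)


def looks_like_spam(message: str) -> bool:
    """Flag over-long messages and messages that are mostly non-Latin."""
    if len(message) > MAX_LENGTH:
        return True
    return non_latin_ratio(message) > MAX_NON_LATIN_RATIO


print(looks_like_spam("Can your trio play at our wedding in June?"))
print(looks_like_spam("Цены на монтаж водонагревателя"))
```

Accented Latin letters such as “é” still count as Latin here, so a message that mentions a café shouldn't trip the filter. A check like this only makes sense for a site that expects no legitimate non-Latin messages at all, which is the OP's situation.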
Reading the question and knowing how to answer it are two different things…I am investigating CAPTCHA currently. It’s a free service but there are multiple ways to implement it and I’m trying to digest the documentation.
First it has to know there is a question, then divine what the question is, and lastly know the answer. IMO, all that is happening here is some crawling of the web searching for the <form> tag, then auto-submitting with their crap inserted. There’s no analysing the content, just the usual scattergun approach. That is easily defeated; in my experience the question has been 100% effective since I put it in place.