Now I am reading about the Deep Web, how it is accessed, and whether the Surface Web has places with some characteristics of the Deep Web. The author of the blog characterizes some sites as borderline surface web, that is, Google can index them only with difficulty and anonymity is prevalent. He names reddit and 4chan as examples. Are there any sites that are officially called borderline websites? Can we compare some places on the surface web to the deep web? I know reddit is a well-known website. Is it considered deep?
What blog are you reading?
Deep, surface, and borderline mean whatever the blog author wants them to mean. They’re buzzwords, not internet engineering terms.
reddit and 4chan are more analogous to gateway drugs. They are where you would gain the right knowledge, meet the right people, and get into the wrong crowd, all of which could end in an invitation to try some site you’d never find on your own.
Or you could just use a Deep Web or Dark Web search engine . . .
You already use parts of the Deep Web, every day. When you check your bank account balance, or read your e-mail, or edit your Google Docs, you’re viewing information over the Web which is not available to the general public. And yet, your bank makes no secret of the fact that they have accounts, and has information prominently available on their public web page about how to get one of your own.
“Deep” does not in any way mean that there’s anything shady going on. There will be some shady stuff in the Deep Web, of course, but then, there will be some in the public Web, too. It might even be safer on the public web, because not requiring login credentials might make it easier to maintain anonymity.
Yes, according to a recent survey much of the content on the Deep Web is quite innocuous.
Maybe we’re thinking of the Dark Web … bitcoins, hackers, those Guy Fawkes people … all those nasty places on the web.
Deep Web is stuff that’s just buried in layer after layer of directories … think of a specific latitude and longitude on Mars and we’d have to be digging through NASA’s web site to find a photo of that specific place … it’s there … but it’s deep in the site … many government agencies post the scientific data they’ve collected, it’s just difficult to find in some cases.
This would never have happened with Gopher
Ok, I conflated deep web and dark web, as usual. I specifically meant the Tor network, and whether some surface sites are somewhat similar to that. The specific blog article I read is below:
Over 14,000 words and not a single paragraph break.
I got as far as “1EarthUnited” and dismissed the entire thing as rubbish.
Right, I confused the Deep Web and the Dark Web in my post above. Here’s a definition from PC Advisor:
The survey I linked covered the Dark Web.
Cmon man, no tables, frames or inline image support? Mosaic kicked Gopher’s ass fair and square.
I don’t know if this is a helpful thought, but the SDMB is a perfect example of surface and deep content side by side. This post is now public information and searchable by the likes of Google. But if I sent you a private message, that is part of the deep web, only accessible with your username and password. And if I use the Tor browser to read the SDMB… well, I haven’t changed the SDMB at all, but I am using dark web technology to encrypt data and mask IPs.
And I read it whole!
Characterizing Reddit and even 4Chan as “borderline” speaks of a misunderstanding of the nature of those forums. If the primary characteristic of borderline is that Google can’t index them, then the SDMB would have been borderline before Google was allowed to index us. It’s really easy to block Google and other legitimate webcrawlers, as that just requires a couple of lines in the robots.txt file.
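For reference, the “couple of lines” in question would look something like this — a minimal robots.txt asking every compliant crawler to stay out of the whole site:

```
User-agent: *
Disallow: /
```

Of course, as noted below, nothing forces a crawler to honor this; it’s a request, not an access control.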
[Bracketing] inserted by me to clarify.
Agree with all you’ve said. Which makes me think of a question …
Ref the snippet above, compliance with robots.txt is 100% voluntary. An interesting question is whether there are any publicly available search engines that advertise that they don’t abide by robots.txt.
Sure, any given webmaster could try to IP-block such an unfriendly search engine. But that’s a futile game of whack-a-mole versus any good-sized crawler infrastructure.
Eh, you can serve HTML over Gopher just as easily as anything else. Images, too.
Wow … Mosaic … seeing that word makes me feel very very old … do you Alta Vista ???
No. That would be not only blatantly antisocial, but monumentally stupid from the perspective of the robot’s operator.
A lot of what robots.txt does these days is protect website backends from robots and, therefore, robots from themselves. It notifies robots about dynamically-generated content which can be effectively infinite, generated programmatically from whatever internal database the website draws from. Unless the robot’s owner wants to be on the wrong end of a combinatorial explosion, it programs the robot to respect robots.txt and avoid the infinite tarpits that machines really cannot navigate.
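As a sketch of what a well-behaved robot does, Python’s standard library ships a robots.txt parser. Here it’s fed a hypothetical robots.txt (the site and paths are made up) fencing off a dynamically-generated search path — exactly the kind of effectively-infinite tarpit described above:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt warning crawlers away from a
# dynamically-generated (effectively infinite) search path.
robots_txt = """\
User-agent: *
Disallow: /search/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler checks each URL before fetching it.
print(rp.can_fetch("MyBot", "https://example.com/search/?q=deep+web"))  # False
print(rp.can_fetch("MyBot", "https://example.com/articles/deep-web"))   # True
```

The crawler simply skips any URL for which `can_fetch` returns False, and never falls into the tarpit.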
Well, here you get into the difference between what semi-legitimate but assholish people do and what spammers with hordes of zombies do. Sure, a good-sized search engine company might own a lot of different computers sitting behind a lot of different IP addresses. But since it will have leased those computers legally from one or two other companies, or will own them itself, all of those IP addresses will be in a few specific netblocks, owned by the relevant companies, as recorded in the information associated with the Autonomous Systems which advertise those netblocks as their own. In short, all of those IP addresses will be coming from “the same place”, in a networking sense, and it will be easy to block all of them with a few commands.
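To make the “same place” point concrete, here’s a sketch using Python’s `ipaddress` module. One CIDR prefix covers every address in a crawler operator’s netblock, so one rule blocks them all. (The netblock here is TEST-NET-2, a documentation range standing in for a real operator’s allocation.)

```python
import ipaddress

# Hypothetical netblock advertised by a single crawler operator.
crawler_netblock = ipaddress.ip_network("198.51.100.0/24")

def is_blocked(addr: str) -> bool:
    """One CIDR membership check covers every address the operator owns."""
    return ipaddress.ip_address(addr) in crawler_netblock

print(is_blocked("198.51.100.37"))  # True: inside the operator's netblock
print(is_blocked("203.0.113.9"))    # False: some unrelated network
```

A firewall rule works the same way: deny the /24 and you’ve dealt with the operator’s entire fleet in one line.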
The spammers don’t do that. They own zombies, created through foul magicks involving unpatched Windows XP machines sitting behind cable modems, and therefore their IP addresses could come from anywhere on Earth. Blocking them is more of a game of whack-a-mole, but programming your server software to rate-limit any specific IP address which tries to go too fast, or tries to grab the wrong things, is a lot easier.