I’ve heard the offhand jokes, 98% porn, 2% crap… does anyone know what percentage of sites actually are porn? You can’t Google anything with the word porn in it to research this, because you get porn, or anti-porn crusaders. :dubious: However, it seems that no matter what I enter, I get at least a few porn hits. The reason I ask is that I’m trying to find a good site about Japan (hopefully with a message board not composed entirely of Asian freaks), and take a WAG at what comes up when you search for anything with “Japan” in it. Not that I have anything against porn, mind you; just a thought.
Well, just as much as you could possibly want… what’s your point?
labmonkey, you do know, don’t you, that if you go to Preferences in your browser’s menu, you can set your browser to ignore porn sites?
What kind of sites about Japan are you looking for?
If these aren’t enough, be specific, and I’ll see what I can link you to.
Googling on “and” in the text of web pages yields 3,520,000,000 hits.
Googling on “porn” in the text of web pages yields 94,700,000 hits.
Googling on “and” and “porn” in the text of web pages yields 94,700,000 hits.
So every web page that contains the word porn also contains the word and.
The pages that contain the word porn make up 2.7 percent of all the pages that contain the word and.
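If you want to check that arithmetic yourself, it only takes a few lines of Python; the hit counts below are just the ones quoted above:

```python
# Hit counts quoted above (Google, at the time of the search).
hits_and  = 3_520_000_000   # pages containing "and"
hits_porn = 94_700_000      # pages containing "porn"
hits_both = 94_700_000      # pages containing both "and" and "porn"

# The combined count equals the "porn" count, so every page with "porn" also has "and".
assert hits_both == hits_porn

print(f"porn pages as a share of 'and' pages: {hits_porn / hits_and:.1%}")  # -> 2.7%
```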
I wouldn’t know, but I’ve been told that the vast majority of porn is on pay sites and/or sites that require an explicit age agreement click-through. Google won’t index those sites, so the raw count statistics are off. On the other hand, there is an enormous amount of non-porn material hosted the same way. I don’t think this question can be answered by spiders.
Sorry for the schizophrenic post; looking at it now, it’s difficult to say whether this is a factual question or a very mild rant born of frustration. gluteus maximus, I’m looking specifically for a message board for international couples like the one my wife frequents, you know, where people discuss the good, the bad and the ugly. (BTW, I can’t use hers; I don’t read JP.) What I’ve found so far has been quite lacking, or maybe I’m just ruined for all other boards by the SDMB. Seriously.
Sure, Google doesn’t probe deep content, but nothing else does either. The OP’s question related directly to Google, so the data I provided related to the percentage of porn in readily available content.
If you try to go beyond that, you get into trouble defining exactly how private a database can be, and how tenuously it can be linked into the web and still count as a part of the internet. I figure it’s better to try to answer the question, than figure out why the question can’t be answered.
Instead of gauging the percentage with web pages, it’s more accurate to measure the total bytes of information that are pornography versus all other data: pictures, movies, audio and erotic text sitting on web servers, FTP servers, shared data on peer-to-peer programs, Usenet, etc. Damn, there must be a lot of porn out there.
I’d say porn was the single biggest chunk of the internet pie in total bytes back in the ancient ’90s, but I’d wager that pirated (non-porn) music and movies are perhaps bigger now.
I always wonder why the porn industry isn’t as vocal as the RIAA and the MPAA about combating piracy. It’s gotta be hurting them far more than the big guys.
Yeah, well, you have a good point there. I’ve looked at some of the English forums based over here, and there’s a lot of Kurdt Kobainishness. Whine, whine, whine.
Don’t give up yet, though. There is one association for non-Japanese women married to Japanese men,
AFWJ, which might be able to point you toward something for husbands.
If I see anything else, I’ll post it here.
You can also set Google to ignore porn sites.
Depends on whether you mean actual domain names, or terabytes of content, or discrete pages, or what. Personally, I’d think adult sites would win on all these counts, but I have never seen any research on it. I’d like to know.
I think it’s still one of the few industries that is making even vaguely legitimate money directly from the internet (as opposed to the scams like “we’ll list your business in 20,000 search engines”, viagra, etc).
I don’t think this is true - I run a porn site, and Google indexes many of my pages. Got a cite?
Yah, pirating does hurt. Most porn companies pursue it moderately aggressively - we know we’re not going to get any support from the media, as we’re all so obviously evil, unlike the RIAA, which is as pure as new-driven snow - but they use their own resources.
DMCA notices (Digital Millennium Copyright Act notices, a way of telling someone they appear to be infringing your trademark or copyright, listing the details of the breach) are easy to write and send to offenders and their hosting company (I send a few a week).
Most hosting companies know they are liable for what their clients put on their sites, and don’t waste any time issuing a cease-and-desist notice to the offender. Indeed, most offenders know it’s not worth the hassle either - they’ll just go steal someone else’s content, and wait for them to notice.
Headcoat, good point on the pirated movies and music - I did not think of them. It seems to be a massive “business” with those P2P people. It’s interesting that no money changes hands - kinda like open-source code, music and videos are released for the “greater good”, and no one pays to access the content. I guess the only people making money out of this are the ISPs and hosting companies that charge per GB of traffic transferred.
abby
That’s interesting, because as of today, Google is currently “Searching 3,307,998,701 web pages.”
So 106.4% of the indexed web pages have the word ‘and’ on them?
That’s impossible. No word can appear on more than 100% of the pages.
I’ll bet that the number in the notice is a few months out of date.
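For anyone puzzling over where the 106.4% figure comes from, it’s simply the “and” hit count quoted two days ago divided by the index size Google’s front page reports today:

```python
# The "and" hit count quoted earlier vs. the index size Google reports today.
hits_and = 3_520_000_000
reported_index_size = 3_307_998_701

print(f"{hits_and / reported_index_size:.1%}")  # prints 106.4%
```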
I respectfully disagree. Your estimates are useful as long as they are only part of the answer, but the statistics are wrong and it is very useful to understand why.
I don’t have a cite, but I have some experience. I don’t run any porn sites, but I run or have developed quite a few content sites that are either restricted access (subscriber or authorized user only) or simply block spiders. In most cases, dynamic sites block spiders to prevent unnecessary load on the server, but in many cases they’re protecting their content too. I know one person in the porn industry who used to work with me, and he’s instituted the same kinds of access controls on his sites. A site can block spiders in a number of ways including (1) requiring a login, (2) requiring a license/age agreement click-through, (3) using server-level redirects to block spiders. I know of porn sites and non-porn sites which use all three.
I have no way of estimating how much content is blocked, either porn or non-porn. I’m not surprised to hear you don’t block indexing on your site, but I know of many others which do. I know of some enormous databases (e.g. genetic info) which are fully browsable by any person without a login but which block all known spiders. There is simply no way to estimate the ratio of Googled content to non-Googled content.
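If anyone’s curious what method (3) looks like in practice, here’s a toy sketch using nothing but Python’s standard library. It’s illustrative only, not taken from any real site mentioned in this thread; the crawler names in the blocked list are just examples, and real sites generally do this in the web server configuration before a request ever reaches application code.

```python
# A minimal sketch of refusing known spiders at the server level by User-Agent.
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOCKED_AGENTS = ("googlebot", "slurp", "msnbot")  # example crawler names only

class SpiderBlockingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "").lower()
        if any(bot in agent for bot in BLOCKED_AGENTS):
            # Spiders get a 403, so the content never shows up in an index.
            self.send_error(403, "Crawling not permitted")
            return
        # Ordinary visitors get the page as usual.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"Members-only content would be served here.\n")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), SpiderBlockingHandler).serve_forever()
```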
Why no, the statistics aren’t wrong; they give a perfectly good estimate of what percentage of Google-searchable pages relate to porn, and page counts are one of several perfectly good metrics for gauging the relative proportion of the internet devoted to different types of content. You’ll get no argument from me that page counts present only a partial description of content, but that’s not the same as page counts being wrong. If you prefer a different metric, use it, and we can argue about whether online bank records, part codes, or student grade lists constitute a part of the internet. I rather doubt that the OP was in search of a pedantic discussion of the relative weights that should be applied to text, images, MP3s and the like in an assessment of what proportion of the internet they take up, but if that’s all you’ll accept, go for it.
Easy there Sqink. It wasn’t a personal attack. The OP mentions Google, but his question regards the content of the Internet. If you want to discuss the Internet, Google page counts are, in fact, wrong. In addition, it would require massive unjustifiable assumptions to say that they are even proportional to the “right” answer. The Google counts are a useful data point, but it’s important that they be taken in context and the OP is better served by an understanding of why those counts are wrong than by simply citing counts.
Have you got a cite to back up that opinion?
Sorry, all I have is common sense and a basic understanding of the infrastructure. The sky is blue, water is wet, and there are huge sections of the Internet Google doesn’t index.
I just did my own search for “and,” which came back with 3,600,000,000 results, or 80 million more than the one done two days ago.
This is just a vague memory, but I read somewhere that Google uses several servers to process searches and their databases aren’t always synced up - hence, someone searching from one part of the world would get different results than someone from another, depending on which server they get assigned to.
Sure, JRR, the internet and Google are both dynamic systems, but I seriously doubt that short-term fluctuations in content are enough to shift the ratio of word frequencies by more than a few tenths of a percent. The raw numbers are in the millions and billions, after all; ratios can’t change very fast up there.
Another way of looking at the problem would be to grab a sample of packets and try to sort their contents into categories like MP3s, spam, porn, requests for Cecil’s columns, and the like. If you think of the internet as an information superhighway, that’d probably give you the “most correct” answer.
I did some digging for studies of that type, but came up dry on anything newer than the mid-nineties. Anyone know of a recent packet content study?
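For what it’s worth, the byte-counting half of that idea is simple enough to sketch; the hard part is the capture itself. Below is a rough illustration in Python that assumes the packet payloads have already been exported from a capture tool as raw bytes. The signatures and sample data are made up, and actually telling porn from non-porn would take far more than byte signatures; this only shows the “tally bytes per category” step.

```python
# Rough sketch: tally bytes per content category over a sample of payloads.
from collections import Counter

SIGNATURES = {
    "jpeg image": b"\xff\xd8\xff",   # JPEG file header
    "mp3 audio": b"ID3",             # ID3 tag that starts many MP3 files
    "html page": b"<html",           # crude marker for ordinary web pages
}

def categorize(payload: bytes) -> str:
    """Very rough classification: look for a known signature near the start."""
    head = payload[:64]
    for label, magic in SIGNATURES.items():
        if magic in head:
            return label
    return "other"

def byte_share(payloads):
    """Return each category's share of the total bytes in the sample."""
    totals = Counter()
    for p in payloads:
        totals[categorize(p)] += len(p)
    grand_total = sum(totals.values()) or 1
    return {label: n / grand_total for label, n in totals.items()}

# Made-up payloads standing in for a real capture:
sample = [
    b"\xff\xd8\xff\xe0" + b"\x00" * 1000,           # looks like a JPEG
    b"ID3" + b"\x00" * 5000,                        # looks like an MP3
    b"<html><body>Cecil's column</body></html>",    # looks like a web page
]
print(byte_share(sample))
```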
Interesting little thing.
Searching on Google for “the”, you get 5.4 billion results.
The same search with strict porn filtering turned on gets 27 million results.