On searches and Internet traffic

In Cecil’s recent column on Internet porn, I think he might be looking at the wrong statistic. He compares number of searches for various topics, but not amount of traffic for them. It seems quite likely, though, that once a person has found a good source for internet porn, that person will just keep that source bookmarked, and not have to search for it again. On the other hand, many other common uses for the Internet, such as maps, are typically handled through search engines. Whenever I want a map, I’ll do a new search to get it, so travel information might be overrepresented relative to pornography, if one just uses searches as the measure.

Furthermore, there is a distinction between visits or hits and bandwidth usage.

Any topic that emphasizes images or video is likely to use a fair amount of bandwidth, relative to text-based sites. Of course, music downloads will be substantial as well.

Hal Varian and others have attempted to measure How much information is created each year. Table 1.13 claims that in 2002, email was substantially bigger than the world wide web, p2p, and instant messaging combined.

A pdf file at the site suggests that, within the “surface web”, images and movies comprised a little over a quarter of file types[sup]1[/sup] (unknown file types clocked in at 20%. Hm.).

28% of the sites in their sample were porn.

Spam was 55% of all email. They report that 20% of their measured spam was porn (May 2003).

PDF file:
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/printable_internet.pdf

[sup]1[/sup]The bandwidth figures would be a lot different, since image files are substantially larger than text files.
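
As a rough back-of-the-envelope check, the two spam percentages quoted above can be multiplied to get porn spam as a share of all email. This is only a sketch: the variable and function names are mine, and the result counts messages, not bytes, so it says nothing about bandwidth.

```python
# Figures as quoted from the Berkeley study above; names are illustrative only.
SPAM_SHARE_OF_EMAIL = 0.55  # spam as a fraction of all email
PORN_SHARE_OF_SPAM = 0.20   # porn as a fraction of measured spam (May 2003)


def porn_spam_share_of_email(spam_share, porn_share):
    """Return the fraction of all email messages that are porn spam."""
    return spam_share * porn_share


share = porn_spam_share_of_email(SPAM_SHARE_OF_EMAIL, PORN_SHARE_OF_SPAM)
print(f"Porn spam is roughly {share:.0%} of all email messages")
```

By this count, porn spam is about a ninth of all email messages, assuming the two percentages were measured on comparable samples.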

I’m well aware of the distinction between search requests and bandwidth. I focused on the former since trying to compile statistics on the latter seemed hopeless in light of P2P file sharing, etc. Perhaps I’m being overly pessimistic, but at this point I’m not persuaded that anyone has sampling techniques powerful enough to tell us with any confidence how much of the 150 (or whatever) petabytes/mo is porn or anything else.

Porn is, for obvious reasons, largely made up of high-resolution images and video. Video is by quite a long shot the largest average per-file heavyweight on the internet. Just looking at video, it’s hard to imagine that the amount of streamed video that is non-porn even comes close to the amount that is porn. IFilm posts all of their stats, showing which videos are top 10 in requests, and porn/porn-like videos tend to be at the top. UsenetBinaries dot com is a website that collects video files directly from Usenet and displays them by subject, alt. tag, etc. Just from a quick glance, porn (adult) obviously takes up the lion’s share of the videos hosted, apparently by an order of magnitude.

Just looking at video, my estimate is that porn probably accounts for more than 50% of traffic, and if video constitutes a significant share of internet traffic (actual bits from computer to computer), then we could be on to something.

Next of course would be looking at the illegal filesharing that goes on. Huge amounts of pirated movies, software, and music are constantly going from peer to peer, some files as large as several Gigabytes, which adds up quickly when you have thousands of downloads.

If you want to throw porn and “shady downloads” into the same mix, you could definitely make the case that a much higher percentage of internet traffic is “illicit”.

Now you can start to talk about emails. If 55 percent of email is unsolicited spam, 20 percent of that is porn, and the other 30 percent is likely “illicit”, adding that in makes a pretty solid case for the vast majority of internet traffic being “illicit”, “illegal”, and probably largely carried out by folks with less than pure intentions, or no intention at all in the case of us poor mass spam recipients.

As for legit, non-porn-related media, filesharing, and internet-based applications for today’s business, such as internet banking, normal email, peer to peer, etc.: even though they take up a relatively small percentage, they are paying for the infrastructure that supports it all… all the illicit crapola that only funds crime, drug dealers, and… dare I say… terrorists??

Well, yes, there is an issue of definitions, isn’t there? If on a particular online session I download 3 articles off the New York Times and one off Slate, send two draft contracts to lawyers, a .pdf of my manuscript for “Who Left Behind my DaVinci Code” to my publisher, buy a rare record, read the Arcata Eye and receive and reply to 11 flame-mails from people annoyed at my last article, and at the same time download 10 high-res pictures of Hiromi Oshima wearing only her shoes and a hair ribbon, the count of bytes going thru my DSL will likely show that my communications with the Playboy site outbanded the previously mentioned 20 mundane transactions. Yet 2/3 of the time I spent online was, subjectively, spent doing non-porn-related stuff.
I dunno if SPAM is a valid indicator for the net/web in general, since the marketing strategy for SPAM is precisely one of extreme margins – the spammer can make profit off of ridiculously low percentages of follow-through all the way to sending money. So if 30% of the SPAM is offers of iffy or illicit transactions, it may be that it represents not even 3% of the transaction volume generated by that enterprise model. Probably is greater, though, as there are a lot of fools out there who will go for something that comes via the 'net that they’d run from if it were offered in person. (BTW, MY spam is overwhelmingly composed of offers for bootleg watches, prescription medications, and home loans, with a strong showing by unauthorized “OEM” software back in the field.)

I’d be interested to find out exactly what flaws people pointed out in Martin Rimm’s study? Anyone have a cite?

Adam

No cite, but off the top of my head it seems to me that extrapolating the amount of porn on the whole net from the amount in usenet binaries groups is like extrapolating how much porn is in the magazine industry from the amount in a sex shop. I’m surprised that he was able to find 16.5% of binaries that weren’t porn.

Quoth Cecil Adams:

Oh, I agree that it’d be difficult, but I’m not convinced that it’s as hopeless as all that. I seem to recall, for instance, that you and Ed have in the past polled the membership of the SDMB for information for previous columns (Is the south side always the baddest part of town? comes to mind). Dopers in general seem to be fairly frank about such things, so the sampling problems inherent in surveys should be relatively minor, and it’d surely be a better measure than the one used in the column.

Cecil, I hate to say this, but you blew it. :frowning: The original question was “How much of all Internet traffic is pornography?” Traffic = bandwidth. I have no clue why you even bothered with search engines. Search engines mostly index web pages, not binaries. Almost all Internet traffic (in terms of GBs) is binaries (porn videos, other videos, CD and DVD images, MP3s, etc.).

However, I also can’t say with any confidence how much bandwidth is porn. Lots of folks are downloading pirate copies of Microsoft Office, non-porn DVDs, computer games, etc. My suspicion is that the daily bandwidth used by downloads of pirate blockbuster Hollywood films is significantly greater than XXX porn videos.

Well over 90% of all Internet traffic is copyright violations. How many public domain videos and how much public domain software do you see available for download on the Internet?

If you insist that traffic = bandwidth, then you are right to say that the rest of my column is unintelligible.

My earlier reference to table 1.13 was misleading: it refers to the stock of file content (what’s on the hard drive), not the flow of information delivery (what’s being uploaded or downloaded).

Wired magazine had a brief discussion of 2004 internet traffic.

[ul]
[li]According to TeleGeography, a telecommunications research firm, international demand for bandwidth grew 42 percent in 2004, with the largest upswing in usage coming from Asian nations.[/li]
[li]P2P is the largest source of the increase in traffic.[/li]
[li]“Today, CacheLogic estimates that P2P applications consume between 60 percent and 80 percent of capacity on consumer ISP networks. The fastest growth in P2P usage is coming in Asian nations with high broadband penetration rates, Parker said.”[/li]
[li]Within P2P, movies are replacing audio as the predominant bandwidth eater.[/li]
[li]The average P2P file size exceeds 100MB. After the release of one movie, a single 600MB file accounted for 30% of all traffic.[/li]
[li]VoIP took up 5-10% of capacity.[/li]
[/ul]

Since P2P is so huge, perhaps we could study that. Those who run BitTorrent search engines, for example, may be able to provide a breakdown of .torrent file downloads by category.

Here’s a link with footnotes to papers pointing out flaws in the Rimm study. To summarize, it was not peer reviewed, it was methodologically flawed, its ethics were challenged, and in addition to Time’s apologetic retraction, Carnegie Mellon University distanced itself from the study. (It didn’t help that Rimm had also written a book on how to be a successful Internet pornographer.)

The only reason we heard about the study was that it was a political hot ticket – Senator Grassley used it in Congress as a prop to flog his Communications Decency Act, which aimed to censor the Internet. Like the Swift Boat veterans’ lies, it was an effective lie even though it was debunked – it did help secure passage of the CDA.

Did you have a source for that 90% figure? I would tend to doubt that 90% of all internet traffic is entirely concerned with violating some copyright. In fact, I would guess that a significant percentage of internet CONTENT (as opposed to traffic) is actually created specifically for the medium, and that number is likely growing.

If you move from internet to network traffic, then there are a lot more mundane things that consume far more bandwidth. Scientists routinely transfer multi-terabyte datasets across the continent, banks send financial information flying every which way, movie studios are sending movies between production studios, telecommunication companies are bundling their voice traffic over the wires.

The actual consumer portion of the internet might actually be very small indeed. Does anyone have some figures?

Yeah, there’d be a lot of that.

This could very well be. I wonder how much information linked to my identity–in one form or another–gets sent around, compared to my IP address. In any case, it’s surely almost all plain data. Then again, I rarely deal with binaries as a consumer.

Yes and no. We do send datasets that large around sometimes, but when we do, it’s generally by UPS, not the Internet. It’d take an awful lot of bandwidth to send that much data, and it generally ends up being both quicker and cheaper to put it on disks or tapes and physically move them.

It depends on how time-sensitive the data is. A couple of colleagues recently came back from APAC, where a guy from a Singapore data centre gave a speech. He said it was routine for them to allocate up to 20 gigabit links to some applications. That’s a few terabytes of data per hour at peak utilisation.
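
To put a number on that, here is a minimal sketch of the conversion from link speed to data moved per hour. It rests on assumptions of mine: that “20 gigabit links” means 20 Gbit/s of aggregate capacity, that units are decimal (1 TB = 10^12 bytes), and that protocol overhead is ignored. The function name is made up for illustration.

```python
def terabytes_per_hour(link_gbit_per_s, utilisation=1.0):
    """Convert a link speed in Gbit/s into decimal terabytes moved per hour.

    Assumes 1 Gbit = 1e9 bits and 1 TB = 1e12 bytes, and ignores protocol
    overhead, so the result is an upper bound.
    """
    bits_per_hour = link_gbit_per_s * 1e9 * utilisation * 3600
    bytes_per_hour = bits_per_hour / 8
    return bytes_per_hour / 1e12


# Assumed reading of the post: 20 Gbit/s in total, fully utilised.
print(f"{terabytes_per_hour(20):.1f} TB per hour")
```

Under those assumptions the ceiling is 9 TB per hour, so “a few terabytes” is the right order of magnitude. At that rate, a multi-terabyte dataset would move in well under a day.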

Please note the same comment Cecil made: I consider traffic = bandwidth. Not much content made for the Internet comes in GB-size files like DVD and CD images, etc. The stats may well be different if you define traffic differently.

I also just noted another flaw in Cecil’s analysis. Most of what is on porn sites is behind password protection. Google can’t access that. Thus search engines will make it seem like there is a lot less porn than there really is.