How can Wikipedia save all that data?

Do they have server farms all over the world? How do they do it?

Yes, they do have a bunch of servers all over the world:

Cluster locations:
https://wikitech.wikimedia.org/wiki/Clusters

As of 20 April 2021, the size of the current version of all articles, compressed, is about 19.52 GB.

(From here.)

That isn’t including all the photos, video, and audio clips. I don’t know how much they add, but it is probably safe to say that even with all of that, it is a relatively trivial amount of data compared to the truly large sites (Google, Facebook, Netflix, etc.). I’d be surprised if it didn’t all fit on a single modern hard drive.

Apparently, if you want the entire history of every edit, it does add up to “terabytes” uncompressed, though the source doesn’t specify how many:

(The off-line, text/tables-only version of Wikipedia that I downloaded several years ago was only 10 GB compressed, 20 GB expanded and indexed. I read it with this.)

It is definitely not “big data”. And an off-line version, text only, is really nothing; you could save it to your smartphone. Even if you want all the languages, all the articles, plus user/talk pages, plus the entire history of all of it, including all the images, videos, and other media, you could keep a copy at home, though you might consider budgeting 100 TB to be on the safe side.
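
For a rough sense of scale, here is a back-of-envelope sketch of that “keep a copy at home” claim. The current-text figure comes from the dump size quoted above; the history and media figures are my own loose assumptions, not official Wikimedia numbers:

```python
# Back-of-envelope estimate for keeping a complete Wikipedia copy at home.
# Every figure below is a rough illustrative assumption, not an official Wikimedia number.
current_text_compressed_gb = 20      # current article text, compressed (see the ~19.52 GB figure above)
full_edit_history_tb = 10            # every revision ever made, compressed (order-of-magnitude guess)
commons_media_tb = 50                # images, video, and audio on Wikimedia Commons (guess)

total_tb = current_text_compressed_gb / 1000 + full_edit_history_tb + commons_media_tb
print(f"Rough total: ~{total_tb:.0f} TB")   # comfortably under the 100 TB "safe side" budget above
```

Even if my guesses are off by a factor of two, the total still lands well under that 100 TB budget.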

I wanted one of these when they were around:

I actually have a Kindle 2, the one Randall Munroe wrote about as a permanent Wikipedia device with free data. Unfortunately, the push to encrypt every single site, combined with the fact that encryption protocols change over time, ruined that part of it. But, last I checked, it still gets free data over cellular.

Oh, it definitely is “big data”. Hundreds, maybe even thousands, of scientific articles have been written that explicitly use this term to describe data from Wikipedia.

You can surely draw big data from Wikipedia, as, for instance, this article did (making predictions of the success of an upcoming film based on user activity on Wikipedia). But that doesn’t mean Wikipedia’s own content is big data itself.

I assume there’s a similar answer for Archive.org.

It’s based in San Francisco but simply says it has many servers. The old set was replaced in 2010, which makes me wonder just how much maintenance a massive server farm requires.

Same thing for Google, Facebook, Amazon data servers, etc.

Exactly.

Slightly large data? There is a lot of it, in human terms. You may need to employ non-trivial algorithms to extract information from it. But the data set fits into your pocket.

Wikipedia, though, is not big enough to require maintaining a server farm. They pay to rent facilities, or receive such services as a donation. The ones maintaining the massive server farms are Equinix, EvoSwitch, CyrusOne, and UnitedLayer.

The Internet Archive is bigger data (~100 PB), and I believe they hosted it themselves, at least at one point. They somehow manage as a non-profit with modest revenue.

The concept of “big data” must necessarily change over time, just like the concept of a “supercomputer”. Back when Wikipedia started, a dataset of 2 or 3 or 5 TB definitely counted as big data. Now it is a couple hundred bucks’ worth of storage that you can (probably) buy at your local Wal-Mart.

Reminds me of when I started working on electronic road maps for car navigation systems, back in the mid 90s. We had to split California into two CD-ROMs (remember them?) because it was too big to fit on one. Still, I was amazed that all of Northern Cal — every single street — could fit on one disk.

Nowadays, of course, Google and Apple hold pretty much every street in the world, and far more besides, in their map databases. And it’s not considered truly big data compared to what else they have.

I used to have Street Atlas USA on one CD. Whole country. Roads only, no gas stations, etc.

Here’s some xkcd wisdom about the size of Wikipedia: Updating a Printed Wikipedia

As I see it, the real point of the thought experiment in the article I cited is to help us bridge the monumental gap between what constitutes “a lot of data” in a physical, human-readable, human-touchable format versus “a lot of data” in modern computer-accessible formats.

Storing a printed complete Wikipedia would require a large warehouse. Storing a computerized complete Wikipedia requires a large pocket. That’s hard to wrap our minds around.

Fascinating. Thank you all. I had no idea how compact the storage system has become.

I will go back to running my programs on my punch cards now.

And hundreds, maybe even thousands, of scientific articles (and marketers, more importantly) can’t even agree on what is meant by “big data”.

Back to the OP.

As @DPRK and @Darren_Garrison noted, the volume of data isn’t all that big. The primary reason they need hundreds of servers is the number of people looking at that data. A few years ago they were already serving over 600 million page views per day. I think the primary backend DB is still MariaDB, which is just a fork of MySQL, and it’s going to have trouble keeping up with tens of thousands of simultaneous query hits on a single server.
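
For a sense of why traffic rather than data volume drives the server count, here is a rough load sketch. The 600-million-views figure is from the paragraph above; the peak factor and cache-hit rate are my own illustrative assumptions, not Wikimedia’s published numbers:

```python
# Why serving Wikipedia takes many servers even though the dataset itself is small.
# 600 million page views/day is from the post above; the peak factor and cache-hit
# rate below are illustrative assumptions, not Wikimedia's actual figures.
page_views_per_day = 600_000_000
seconds_per_day = 24 * 60 * 60

avg_views_per_sec = page_views_per_day / seconds_per_day   # ~7,000 views/s on average
peak_factor = 2.5                                          # assumed peak-to-average ratio
cache_hit_rate = 0.95                                      # assumed share answered by front-end caches

peak_views_per_sec = avg_views_per_sec * peak_factor
db_hits_per_sec = peak_views_per_sec * (1 - cache_hit_rate)

print(f"Average: ~{avg_views_per_sec:,.0f} views/s, peak: ~{peak_views_per_sec:,.0f} views/s")
print(f"Even with heavy caching, ~{db_hits_per_sec:,.0f} queries/s still reach the database tier,")
print("so the load gets spread across caches and read replicas rather than one MariaDB box.")
```

The exact numbers don’t matter much; the point is that even a small dataset needs a fleet of machines once it’s being read thousands of times per second.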