As an example, I’ll use http://www.loc.gov
1.) Is there any command or program that will tell me the amount of data stored on that domain?
and
2.) Is there a way, assuming I had the storage, to download the entire website, as is?
The answer to both of your questions could well be wget, but for the first task you’ll likely need to call wget from a program of your own. On Linux or Mac OS X this means writing a shell script that parses wget’s output. On Windows, I don’t know how simple, one-off scripts usually get written, other than by installing Cygwin to get an environment similar to what you’d have on Linux.
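To illustrate the "call wget and parse its output" idea, here is a minimal Python sketch rather than a shell script. It assumes wget is installed, that each request in its log starts with a timestamped `--YYYY-MM-DD HH:MM:SS--  URL` line, and that `--server-response` prints the headers beneath it. The function names `sum_content_lengths` and `spider_log` are made up for this example. Treat the result as a rough estimate only: responses sent without a Content-Length header are not counted at all.

```python
import re
import subprocess

# Assumed log format: wget announces each request with a line such as
#   --2024-01-01 12:00:00--  http://example.com/page
# and, with --server-response, prints the response headers beneath it.
REQUEST_LINE = re.compile(r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)")
LENGTH_LINE = re.compile(r"^\s*Content-Length:\s*(\d+)", re.IGNORECASE)


def sum_content_lengths(log_text):
    """Return (total_bytes, url_count) from wget --server-response output."""
    sizes = {}
    current_url = None
    for line in log_text.splitlines():
        request = REQUEST_LINE.match(line)
        if request:
            current_url = request.group(1)
            continue
        length = LENGTH_LINE.match(line)
        if length and current_url is not None:
            # Keep one value per URL so a page requested twice is counted once.
            sizes[current_url] = int(length.group(1))
    return sum(sizes.values()), len(sizes)


def spider_log(url, depth=2):
    """Run wget in spider mode (nothing is saved) and return its log text."""
    command = [
        "wget", "--spider", "--recursive", f"--level={depth}",
        "--no-parent", "--server-response", "--wait=1", url,
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stderr  # wget writes its log to stderr
```

Something like `sum_content_lengths(spider_log("http://www.loc.gov"))` would then give a byte total and a URL count for the first two levels of links. Raising `depth` on a site that size means a very long crawl.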
It’s possible to do it with other software. cURL comes to mind, and many languages have a native API to deal with the Web. wget, however, is pretty close to being the standard tool for these things.
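As one example of the native-API route, this sketch uses Python's standard `urllib.request` to ask a server how large a single resource is, without downloading it. The function name `reported_size` and the `opener` parameter are my own inventions for illustration (the parameter only exists so the function can be tried without a network). Not every server answers HEAD requests or sends a Content-Length, so `None` is a normal result.

```python
import urllib.request


def reported_size(url, opener=urllib.request.urlopen):
    """Return the Content-Length the server reports for url, or None."""
    # A HEAD request asks for the headers only, not the body.
    request = urllib.request.Request(url, method="HEAD")
    with opener(request) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None
```

This covers one URL only. Getting a figure for a whole site still means discovering every URL first, which is the part wget does for you.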
There is also an open source program called HTTrack, which has a GUI. A Google search turns up a lot of other choices too, both free and nonfree.
Note that a crawler such as wget will only fetch pages which are reachable, directly or indirectly, by following links from the site’s main page. (curl does not follow links on its own, so with curl you would have to script that part yourself.) If some sub-section of the site can only be reached by entering the URL directly, wget won’t find it. Also, if some portions of the site require you to enter search terms into a text box, wget won’t do that for you.
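To make the reachability point concrete, here is a rough sketch of the link-discovery step a crawler performs, using only Python's standard library. The names `LinkCollector` and `same_site_links` are hypothetical, and a real tool such as wget looks at more than `<a href>` (images, stylesheets and so on). The point is that only URLs written into the page get found. A search form contributes nothing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the targets of <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def same_site_links(base_url, html):
    """Return the links in html that stay on the same host as base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [link for link in collector.links if urlparse(link).netloc == host]
```

Given a page that contains a link to `/about`, a link to another host, and a search form, this returns only the `/about` URL. Whatever sits behind the form stays invisible to the crawler.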
Furthermore, some sites which are (partially) dynamically generated can be effectively infinite in size, as they are based on database contents which can be displayed in a myriad of ways. For such sites, the question of “how large is it” is not really meaningful. Trying to download every possible generated page from such a site can not only fill the largest hard drive, but can also easily overload the server, and may be treated by the site’s admin as a denial-of-service attack.
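If you do write your own crawler, the usual defence against this is to cap the number of pages and pause between requests. Below is a minimal sketch of that idea. `bounded_crawl` is a made-up name, and `get_links` is a placeholder for whatever function you use to fetch a page and return the URLs it links to. The default limits are arbitrary.

```python
import time
from collections import deque


def bounded_crawl(start_url, get_links, max_pages=100, delay=1.0):
    """Breadth-first crawl that stops after max_pages, pausing between pages."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
        if delay:
            time.sleep(delay)  # be polite to the server
    return visited
```

Even on a site that generates a fresh "next page" link forever, this stops after `max_pages` requests. wget has comparable knobs in `--level`, `--wait` and `--quota`.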