Why Is The White House Hiding All Search References To Iraq on Its Web Site?

Mods: This thread is not intended as a GD. However, feel free to move it if the Doper responses are more attuned to debate than answering the question.

Background:

Source: http://www.searchengineworld.com/robots/robots_tutorial.htm

There is nothing nefarious in the use of a robots.txt file on a web site. On the contrary, a robots.txt file is used to assist search engines so that they do not have to collect information that is irrelevant to users.
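For example, a site owner who didn't want crawlers wasting time on a scripts directory might publish something like this (the paths here are made up for illustration):

User-agent: *
Disallow: /cgi-bin/
Disallow: /scratch/

Any crawler that honors the convention reads this file first and skips the listed paths.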

However, in viewing the White House robots.txt file, one notes that all references to Iraq, and only references to Iraq, are unavailable to search engines.

Why is this?

I don’t know how the robots.txt file works, but a search on the White House site just now for “Iraq” turned up 1,923 results.

No, no. The robots.txt file keeps external search engines from crawling and indexing pages. An internal search engine is run by the site owner, who can set it up to return only what the owner wants you to see.

I don’t think it’s disallowing all references to Iraq; it looks more like a list of specific files and/or subdirectories.

Google found 19,800

A Google search confined to the whitehouse.gov domain found about 19,900 hits for “Iraq”

http://www.google.com/search?as_q=iraq&num=10&hl=en&ie=UTF-8&oe=UTF-8&btnG=Google+Search&as_epq=&as_oq=&as_eq=&lr=&as_ft=i&as_filetype=&as_qdr=all&as_occt=any&as_dt=i&as_sitesearch=whitehouse.gov&safe=off

Just a WAG, but www.whitehouse.gov is a popular site for people looking for information on Iraq. It might make sense to keep Iraq searches away from pages that don’t actually contain the term. Even Mr. Bush’s hamsters have their breaking point.

E.g., the page http://www.whitehouse.gov/firstlady/recipes can safely be excluded from an Iraq-based search.

This became a topic of public discussion in October of last year.

robots.txt disallows crawling by URL path, not by page content. The robots.txt change excludes many file paths that obviously don’t exist:

Disallow: /infocus/everglades/iraq

Disallow: /infocus/rx-medicare/iraq

Disallow: /infocus/teacherquality/iraq

What the person who ordered this intended to do is anyone’s guess. My bet is managerial stupidity.
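For anyone who wants to poke at the matching themselves, here is a rough Python 3 sketch using the standard urllib.robotparser module; the results depend on whatever robots.txt whitehouse.gov is serving when you run it, and the paths are the ones quoted above plus the recipes page mentioned earlier:

# Ask the live robots.txt whether a crawler may fetch a handful of paths.
# Matching is done purely against the URL path, never against page content.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.whitehouse.gov/robots.txt")
rp.read()  # download and parse the file

for path in ("/infocus/everglades/iraq",
             "/infocus/everglades",
             "/firstlady/recipes"):
    allowed = rp.can_fetch("*", "http://www.whitehouse.gov" + path)
    print(path, "allowed" if allowed else "disallowed")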

The question is why the White House site would let you search with the internal search engine but not with an external one.

Because the external ones cache pages.

Whoever is managing the whitehouse.gov web page could be doing a better job. For instance:

http://www.whitehouse.gov/index2.html

is an old page from June 2003. If I were running their web page I would clean this stuff up fairly regularly.

The external ones also obey robots.txt as a matter of convention. The file doesn’t “force” any search engine to do anything.
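To make that concrete, here is a minimal Python 3 sketch: the can_fetch() check below is something the client chooses to perform, and a crawler that skips it and calls urlopen() directly will be handed the page like any other request (the path is purely illustrative):

# Honoring robots.txt is the crawler's decision, not the server's.
import urllib.request
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("http://www.whitehouse.gov/robots.txt")
rp.read()

url = "http://www.whitehouse.gov/infocus/iraq"  # illustrative path only
if rp.can_fetch("PoliteBot", url):
    page = urllib.request.urlopen(url).read()   # a polite crawler fetches only what is allowed
else:
    print("Skipping", url, "- disallowed, but only because we chose to ask.")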

I did a bit of looking at the robots.txt files of various government organizations.
The CIA, FBI, Senate, DOE, Air Force, NASA, Secret Service, Supreme Court, Federal Election Commission, Federal Reserve, Homeland Security, and FirstGov sites have no robots.txt files at all.
The House, FDA, NSA, DOJ, USDA, Army, Joint Chiefs, FDIC, and OSHA sites have small restriction files, from a few lines to roughly 25 lines long.
The only other site that approaches whitehouse.gov in the size of its robots.txt file is the EPA.
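If anyone wants to repeat that survey, a rough Python 3 script along these lines would do it; the host list is just an illustrative handful, and sites with no robots.txt simply show up as errors:

# Fetch robots.txt from a few .gov hosts and count their Disallow lines.
import urllib.request

hosts = ["www.whitehouse.gov", "www.epa.gov", "www.fbi.gov",
         "www.nasa.gov", "www.osha.gov", "www.senate.gov"]

for host in hosts:
    try:
        raw = urllib.request.urlopen(f"http://{host}/robots.txt", timeout=10).read()
        rules = [line for line in raw.decode("utf-8", "replace").splitlines()
                 if line.lower().startswith("disallow")]
        print(f"{host}: {len(rules)} Disallow lines")
    except Exception as err:  # no file, timeout, redirect trouble, etc.
        print(f"{host}: could not retrieve robots.txt ({err})")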