Why can we view source on a web page?

This has always struck me as a bit strange: anyone can view the source of a web page and gain info about how a web site is created and such. I can understand there being a way for developers to do this, but it seems that there would be a way to turn this feature off once a site goes live. For all I know there is, btw. But what are the historical reasons for this? Has it always been this way on the web?

You can’t see as much as you think you can. That is, the source you are looking at is (usually) the generated output of a program that is run on the server side, and that program you can’t see. The output is in the form of special text that your browser interprets (typically HTML, CSS, and JavaScript) - this is what “view source” shows you. But it doesn’t show you anything about the code that looked up all the posts for a certain thread, for example.
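To make that concrete, here’s a minimal sketch of the kind of server-side program that might generate a thread page. Everything in it (the Node.js setup, the port, the hard-coded posts) is invented for illustration; a real board would pull the posts from a database.

// Illustrative only: a tiny Node.js server that generates HTML from data
// the visitor never sees directly. A real forum would query a database here.
const http = require('http');

const posts = ['First post!', 'Nice thread.', 'Bump.']; // stand-in for a database lookup

http.createServer((request, response) => {
  const items = posts.map((text) => '<li>' + text + '</li>').join('');
  response.writeHead(200, { 'Content-Type': 'text/html' });
  // Only this generated HTML ever leaves the server; "view source" shows it,
  // but nothing about the lookup code or data above.
  response.end('<html><body><ul>' + items + '</ul></body></html>');
}).listen(8080);

Point a browser at localhost:8080 and view source: you see the generated <li> items, not the posts array or the code that built them.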

It’s available for you to see because it has to be: the browser receives exactly that data and uses it to paint the screen with what you finally see.

Yes, it’s always been this way. And it always will be, and must be so.

In order for your browser to display a webpage to you, it has to download the source from whatever webserver hosts that page. It then figures out how to display (or render) that source to you in the browser window. Viewing source is just leaving out the rendering step.

I like to think it also helped spread HTML as a language. I can definitely attribute a good chunk of my HTML knowledge - maybe 80% of it - to being able to copy/tweak what others have done. It’s the best way to learn.

HTML, CSS, and JavaScript are free for anyone to write and run, and so they are also free to view and learn from.

The only reason it seems strange is because HTML is (kinda) human-readable in a way that most served data isn’t.

You could imagine an alternate-history web composed of linked Flash media files: it would still use the HTTP protocol, still serve data to your computer the same way, and you could still look at that data, but in this alternate world the data would just be binary gibberish to human eyes.

Some of it is historical. The first web browser (Tim Berners-Lee’s WorldWideWeb) was also the first web page editor. It was an all-in-one kind of thing for both users and developers.

Netscape Navigator was made by a lot of the same folks who made NCSA Mosaic, the early browser that popularized the web (although they intentionally rewrote Navigator from scratch so that it wouldn’t share any actual code with Mosaic). Firefox evolved out of Navigator.

Folks who made other browsers (Microsoft, Google, etc.) included a source viewer by default just because everyone else did, and you didn’t want to be the browser that was missing a feature.

You also have to remember that web pages are conceptually just text files with some fancy formatting tags. HTML stands for Hypertext Markup Language. Web pages have become much more complicated these days, but originally they were little more than formatted text and images.

ETA: I’m old enough to remember viewing one of the first web pages. It was a single page that listed every other web page on the internet. I also remember the day that they announced that HTTP data transfers had finally overtaken FTP transfers as the number one type of data transfer on the internet. I feel old now.

However, I have noticed that there are some screen elements that can be protected against copying (à la right-click Copy), such as some images. You can still capture images with a screen shot, but somehow the code requests that the browser disallow a copy (or maybe they are displayed with Flash or something, taking the browser out of the equation). It is technically possible to create a convention where the page code could include a “do not reveal source” flag that mainstream browsers would honor (like Google honors flags not to cache sites), but there’s no point, because it would be trivial to defeat such a mechanism.

It’s usually done with scripting. Basically you run a JavaScript function to intercept the copy and paste commands, which prevents them from doing anything useful. It is easily circumvented by using a script blocker with your browser.
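For the curious, a minimal sketch of the kind of script involved; the exact events a given site hooks vary, and this is an illustration rather than any particular site’s code:

// Illustrative only: suppress the right-click menu and blank out copies.
document.addEventListener('contextmenu', function (event) {
  event.preventDefault(); // no context menu, so no easy "Copy Image"
});
document.addEventListener('copy', function (event) {
  event.preventDefault(); // cancel the normal copy
  event.clipboardData.setData('text/plain', ''); // leave nothing useful on the clipboard
});

As noted, turning off JavaScript, using a script blocker, or just opening the browser’s developer tools walks right past all of it.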

To delve a bit into the history part, the early WWW was developed mostly from work in academia rather than commercial R&D. There was no profit motive to obscure the source code, like there is for Flash or Microsoft Word. Basically, computer code usually starts out human-readable unless there is a specific reason to make it otherwise. Compiled programming languages have to be translated into machine code before a CPU can run them, so it isn’t necessary to distribute the human-readable source code along with the compiled programs. Markup and scripting languages like HTML and JavaScript, however, were designed so that the source code itself is transmitted, not a compiled binary, and it’s up to the browser to interpret (or compile on the fly) that source after download.

It certainly did not HAVE to develop this way, and we are lucky that it did. Before the WWW, the world was a filthy mess of BBS portals, Gopher sites, and random fingerings of old white people with beards. It was the dark ages of computer networking, and people who wanted information would either have to call up random computer networks and use their special BBS software, or else hope to be connected to the nascent academic Internet or something like CompuServe or AOL.

But a renaissance would soon arrive: a knight called Tim appeared and sought to unify the unwashed masses into a glorious kingdom of interconnected info-peasants. So he looked at things like early sci-fi and Apple HyperCard, previous attempts to mark up documents with extra information like images and cross-references so they could be easily browsed in a non-linear format. He proposed something called HTTP and HTML, in which documents could be easily written by anybody, cross-linked, and distributed across the growing Internet. He didn’t give a flying fuck about piracy and obscuring source code. He wanted people to write information and make it available for everyone else.

Fast forward a few years: the Netscape company decided this was a pretty cool idea and made their Navigator software available to everyone. Then a company called GeoCities offered free, ad-supported hosting to everybody, so kids everywhere made their own HTML pages using this ancient, open technology, and it all just kind of blew up and became tradition. CompuServe, AOL, and to some lesser degree Java and Flash all tried to make their own networks and formats that could not be so easily written and read by untrained humans, but in the end the widespread public adoption of open HTML and HTTP won out from sheer momentum.

But anyway, as ed points out, the HTML is just the last step of the process – displaying the information. A lot of the hard work happens server-side, between databases and hypertext preprocessors (computer programs that format information into HTML for transmission), like PHP or ASP.

It’s fairly common that “open standard” file formats are human-readable with very limited processing. There’s no real reason to make them not so, and it’s easier for the developers and the users if they are. It’s only proprietary formats where obscurity is really applied. HTML was designed to be open by people who don’t benefit at all from it being non-human-readable, and hence it is.

Program executables are the main exception, since they are usually compiled from higher-level programming languages and it is hard to decompile them back. Even then, though, you can disassemble them easily; it’s just that most people don’t have a taste for reading long programs in assembly code.

I remember in the late 90’s, a bunch of my friends and I played a really fun text-based game called “Earth 2025” in which teaming up was encouraged and common. There were websites that catered to running “clan” sites, but some of the smaller (or larger holdout) clans still ran their own websites. We quickly learned how to view the source and find the clan’s single password (given only to members) embedded in the HTML and “hack into” their sites to glean useful information for wars.

This, obviously, has taken a backseat to modern security protocols on even the simplest of websites that require password protection.
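A minimal sketch of the sort of naive client-side gate those old clan pages relied on; the password, function name, and page name here are all invented for illustration:

// Illustrative only: the "protection" sits right there in the source.
function enterClanSite() {
  var guess = prompt('Enter the clan password:');
  if (guess === 'swordfish') { // the secret, visible to anyone who views source
    window.location.href = 'members.html'; // and the "protected" page is just another public file
  } else {
    alert('Wrong password.');
  }
}

Anyone who viewed the source could read the password (or just the name of the members page) straight off the screen, which is exactly the trick described above.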

As a small aside, I also remember that, in the late 90’s and early 2000’s, it was somewhat popular to disable right-clicking to stop people from viewing one’s source, as well as to stop people from stealing images from a site. So there’s that. But it was fairly easy to get around that type of “security”.

It’s also worth noting that source code is still under copyright, regardless of whether it’s publicly visible or not (just like a person’s poem is still his copyright, even if he posts copies on all the telephone poles in town). Just because you can see all the clever JavaScript that someone wrote to make their application doesn’t mean that any of it is useful to you, in any way.

Is it that clear-cut? Algorithms and program logic aren’t copyrightable, although actual source code written with them can be. If you read someone else’s JavaScript and figure out how they did it and then re-write it yourself, I think they’d have to take you to court to prove some sort of copyright violation, and even then it’s not a surefire bet that you’d lose. That means the ability to read JS source is very helpful in that you don’t have to reverse-engineer their code; you may just have to deal with some obfuscated variable and function names.

They might be able to patent a specific technique, but a lot of the JavaScript on the web is so trivial (when the user clicks this button, make this happen) that it gets reused everywhere in much the same form, whether by copy-and-paste or because someone else just wrote something very similar.

FWIW, an old thread on this topic.

Some companies even today expect at least some people to view their webpage source code. View the source of a Flickr webpage: underneath a big ASCII “Flickr” logo, there are the lines:

“You’re reading. We’re hiring.
Jobs | Flickr”

Clever: advertising targeted at the computer-savvy user.

I can’t remember the specifics, but other companies in the past (maybe it was Google?) have made multi-level puzzles that wouldn’t offer you a recruitment email until you solved the whole chain. I think there was one that started with a highway billboard and turned out to be a secret jobs page, and maybe another one that started with source code somewhere and ended up being a three-letter agency hiring for a codebreaker or some such.

Anyone remember?

It is trivial to view the source of a website. All you need to do is open a socket to the website host and make a request of the form:

GET /page-I-want.html HTTP/1.1
Host: hostname

[you’ll have to add a bit more, but that’s the essential part]

The website will then give you the page in plain text. It is trivial for a computer programmer to create a program to get the source of a website. There would be no point for the browser to hide it. Any hacker could just bypass the browser to get the source. A novice wouldn’t really be able to do anything by reading the source. So the browser might as well make it easy for all of us to view the source.
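A minimal sketch of that idea in Node.js; the hostname and path are placeholders, and this assumes a plain HTTP site on port 80 rather than HTTPS:

// Illustrative only: fetch a page's raw source over a bare TCP socket.
const net = require('net');

const socket = net.createConnection(80, 'example.com', function () {
  socket.write('GET /page-I-want.html HTTP/1.1\r\n' +
               'Host: example.com\r\n' +
               'Connection: close\r\n\r\n');
});

let response = '';
socket.on('data', function (chunk) { response += chunk; });
socket.on('end', function () {
  // Headers come first, then a blank line, then the HTML source itself.
  console.log(response);
});

curl, or the browser’s own view-source: prefix, does the same job with less typing.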

In case anyone’s interested, the NSA is currently hiring.

How would a site be able to use a password embedded in the HTML? I am not a web developer, but I have dabbled, and I can’t figure out how that would work. Did it use JavaScript to validate the password on the client side before requesting the protected page from the server?