URL parsing, cookies

This is a General Question, even though I am using the SDMB as the particular example. Since the SDMB came back up, I have noticed that the links to individual threads contain (between the newthread.php? and the forumid parts) a string of meaningless characters. If I paste the URL omitting these characters I still get the correct page and thread. What information is contained in those characters and why is it needed?

On other sites I have noticed characters added to the end of the URL, and when I send the URL to someone else I always remove them, as I do not know whether that string is specific to me and I may be giving information away… you cannot be too careful these days…

And while I have your attention, how do the cookies get sent? Suppose I “dial” www.straightdope.com . My computer sees a cookie from that site and sends it, but how? Is it appended to the URL, sent separately, or what? Is this related to question #1?

Cookies are sent as an HTTP “Cookie” header, which contains the cookie name-value pairs separated by semicolons. Conversely, the site setting them does so via “Set-Cookie” headers in the response (multiples are allowed - 1 header per cookie set). A number of fields apply to Set-Cookie headers, the important ones being the name-value pair, the path and the expiration date. A cookie with no expiration is to be interpreted as a “session” cookie, which is maintained only during the browser session, and backdating the cookie is the accepted means of deleting it. Some info here, which also points to the relevant RFCs:

http://www.netscape.com/newsref/std/cookie_spec.html

This is the original Netscape spec, which still pretty much holds, though some enhanced syntax has been added since.
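To make the two header shapes concrete, here is a minimal sketch using Python's standard http.cookies module. The cookie names are borrowed from this board, but the values, path and expiration date are made up for illustration; this is not what the SDMB actually sends.

```python
from http.cookies import SimpleCookie

# What a browser might put in its "Cookie" header:
# name=value pairs separated by semicolons (values are invented).
request_header = "bbuserid=12345; sessionhash=0123456789abcdef"

jar = SimpleCookie()
jar.load(request_header)
parsed = {name: morsel.value for name, morsel in jar.items()}
print(parsed)

# What a server might send back: one Set-Cookie header per cookie.
response = SimpleCookie()
response["sessionhash"] = "0123456789abcdef"
response["sessionhash"]["path"] = "/"   # no expires field: a "session" cookie
response["bbuserid"] = "12345"
response["bbuserid"]["path"] = "/"
response["bbuserid"]["expires"] = "Wed, 01 Jan 2031 00:00:00 GMT"  # assumed date

set_cookie_lines = response.output().splitlines()
for line in set_cookie_lines:
    print(line)
```

Note that the request side carries only names and values; the path and expiration travel only in the Set-Cookie direction.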

Judging by the question, the answer I gave might be a bit terse, so I suppose I should amplify on “header” a bit.

The HTML you know and love in your browser is “content” which is sent via a particular protocol, such as (usually) http. The protocol defines the form of messages and responses between the communicating machines, both of which can have “headers” attached.

Sticking with http:

Without going into the gory details, your browser sends a message which says something like “get <url> http/1.0”. The lines following this in the message buffer it sends may contain “headers” which take the form of a keyword followed by a colon and information.

The site responding will say something like “HTTP/1.0 200 OK” in the first line of its response message, followed by ITS headers, which are terminated by an empty line, after which comes the content, like “<HTML> …”.

There are some headers like Content-Type and Content-Length which will usually be present in a response, along with a number of others. Content-Length lets the receiver know how many bytes of content to expect. (The order of the headers doesn’t matter.)

The cookies are in the headers, as I previously noted.

So the message sent by your browser looks something like:


get www.straightdope.com/ http/1.0
User-Agent: ****
Host: ****
...
Cookie: ****
Content-Length: 0

Your message doesn’t have any content. The straight dope server responds:


HTTP/1.0 200 OK
...
Set-Cookie: ****
Content-Type: text/html
Content-Length: ****

<HTML>
....
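The layout described above (a first line, then headers, then an empty line, then the content) can be pulled apart in a few lines of code. This is only a sketch: parse_http_message is a hypothetical helper, the sample response is invented, and a real client would use an HTTP library rather than splitting strings by hand.

```python
def parse_http_message(raw: str):
    """Split a raw HTTP message into (start_line, headers, body)."""
    # The first empty line marks the end of the headers.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = []  # a list of pairs, because Set-Cookie may appear more than once
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return start_line, headers, body


# Invented response in the shape shown in the example above.
raw_response = (
    "HTTP/1.0 200 OK\r\n"
    "Set-Cookie: sessionhash=0123456789abcdef; path=/\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<HTML></HTML>"
)

start_line, headers, body = parse_http_message(raw_response)
content_length = int(dict(headers)["Content-Length"])
print(start_line)
print(headers)
print(len(body) == content_length)
```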

A lot of people confuse the distinction between HTTP and HTML. HTTP is the transfer protocol, which is used to transfer content. HTML is a specific type of content, namely markup language understood by the browser. HTTP can be used to transfer other types of content, and will be, for instance to transmit the graphics you are viewing.

Oh, and the meaningless characters you are referring to are the things like “s=<string of hex digits>”, I suppose. I don’t know, but I would strongly suspect it’s a session identifier.

If so, it is related to cookies in that session identifiers are often transmitted as cookies. The session identifier is what identifies your particular session with the server, as opposed to say, mine, since our requests are interleaving with each other.

A lot of application servers allow configuration which controls whether the session id is sent via URLs or via cookies. The cookies look cleaner, but mean you lose the session if the user chooses to refuse cookies. I’m not familiar with how PHP operates, but that argument looks like a session ID to me.

If this is the case, I suspect that nothing terribly crucial to our interaction is being maintained as session data for this BB application, so you can get away with zapping it without noticing anything.
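For anyone who wants to zap that argument before passing a link along, here is a small sketch using Python's urllib.parse. The function name strip_param, the host boards.example.com and the hex value are all made up; only the “s” parameter name and the newthread.php / forumid pieces come from the thread.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_param(url: str, param: str = "s") -> str:
    """Return url with every occurrence of the given query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical link of the kind described in the question.
url = "http://boards.example.com/newthread.php?s=0123456789abcdef&forumid=7"
clean = strip_param(url)
print(clean)
```

The rest of the query string is left alone, so the link still points at the same forum or thread.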

I find those characters annoying because when I want to send the web page to a friend, I have to remove them. And if they are replacing the cookies, they seem to not work so well. Since the SDMB is using them, when I am going to post a reply it often asks me to log in and use my password which is a nuisance and which it did not do before.

It is in fact a session ID. I would have posted sooner, but I thought you had answered the question very well yourself in your first post. Since you are wondering here, I’ll just back you up and say yes, it is your session ID.

[ATMB Moment]

To remove the session ID from the URLs for this board, click the User CP button at the top of any of the message board pages and then go to Edit Options. Click Yes for both of the following:

Automatically login when you return to the site? (uses cookies)
Browse board with cookies?

See the Sticky thread in ATMB:
PLEASE READ - for those who are being logged out unexpectedly

[/ATMB Moment]

As noted by DrMatrix, they have an option “Browse Board with Cookies”, which does indeed seem to turn off sending the session id through URLs, and which is defaulted to “no” (I also note that setting it still results in an empty “s=” argument on the URLs).

Note that that is distinct from maintaining your user id in cookies, which is the gist of the “automatic login” option. If you turn on cookie warnings in your browser, you will discover at least 4 cookies being maintained for this site:

bbuserid
bbpassword (it’s encrypted)
bblastvisit
sessionhash - which contains the id that was in the “s” argument.

The sessionhash, very reasonably, is being maintained as a “session” cookie which will go away when you exit your browser. The others have 2 year expiration dates.

(Some app/web servers, notably iPlanet Web Server, have an option for maintaining the session id in expiring cookies instead of session cookies, set to expire in, say, half an hour. Trouble is, a client whose clock is far enough ahead of the server’s can’t get a session - remember what I said about how you DELETE a cookie? Some customers want it done this way, though, so there’s an explicit timeout on the cookie, rather than having it hang around as long as the browser’s up. This matters in particular if they are using MS Active Desktop, which causes the cookie to be maintained by the desktop, so you can take down IE, bring it back up, and still have the session.)
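The clock problem can be shown with a little arithmetic. This is a sketch under stated assumptions: the half-hour timeout is the figure from the post, the dates are invented, and cookie_is_live is a hypothetical stand-in for the check a browser makes against its own clock.

```python
from datetime import datetime, timedelta, timezone


def cookie_is_live(expires: datetime, client_now: datetime) -> bool:
    """A browser keeps a cookie only while its own clock is before the expiry."""
    return client_now < expires


# The server stamps the cookie using ITS clock (invented date).
server_now = datetime(2002, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = server_now + timedelta(minutes=30)

accurate_client = server_now                    # clock agrees with the server
fast_client = server_now + timedelta(hours=1)   # clock runs an hour ahead

print(cookie_is_live(expires, accurate_client))
print(cookie_is_live(expires, fast_client))

# Deleting a cookie uses the same rule on purpose: backdate the expiry.
backdated = datetime(1970, 1, 1, tzinfo=timezone.utc)
print(cookie_is_live(backdated, accurate_client))
```

To the client with the fast clock, the fresh session cookie looks exactly like one the server has just backdated, so it never gets stored.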

DrMatrix, thanks for the info. I have changed my settings and hope I will not be logged out, but the main reason I was asking is that I have seen other sites which do the same thing, and I have to remove the session part when I send the links to my friends. I wish they would not do that, or would provide a way of removing the session part like here.

Excuse me - 1 year expiration dates.

Sorry to beat this into the ground, but if anybody read my long message, a slight correction - the http “get” will be something like “get <path> http/1.0”, and my example should read simply “get / http/1.0”. The host part of the URL is stripped off when you connect to the http server at that site and make the request.
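The split between host and path in that correction can be sketched as follows. build_get_request is a hypothetical helper and the cookie value is invented; the sketch writes the method and protocol in upper case, which is how they appear on the wire, and carries the host in a Host header instead of the request line.

```python
from urllib.parse import urlsplit


def build_get_request(url: str, cookie: str = "") -> str:
    """Build the text of a minimal HTTP/1.0 GET request for url."""
    parts = urlsplit(url)
    path = parts.path or "/"          # an empty path means the site root
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.0",       # request line: path only, no host
        f"Host: {parts.netloc}",      # the host travels as a header
    ]
    if cookie:
        lines.append(f"Cookie: {cookie}")
    # An empty line ends the headers; a GET has no content after it.
    return "\r\n".join(lines) + "\r\n\r\n"


request = build_get_request(
    "http://www.straightdope.com/",
    cookie="sessionhash=0123456789abcdef",  # invented value
)
print(request)
```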