Can "Information" be destroyed?

Or, in other words, a scrambled book has a higher entropy than a great novel, since there are a great many different microstates which would be described by the macro description “a scrambled book”, but far fewer which would be described by the macro description “a great novel”.

You didn’t really address the primary point.

How much information is in the book? On one planet it’s a “bunch” and on another planet it’s just one name.

How do you quantify “information” given that it is relative? Is it possible?

You have to understand that this whole concept is embodied in QM. Any time you try to extrapolate from the micro world to the macro, things get confusing.

When a book burns the information still exists, and could be theoretically reconstructed, but, of course, in reality there is no way to do that.

To illustrate the meaning of the term ‘information’ in a physical/information theoretic context, I think it’s useful to appeal to the notion of compressibility. Take two strings of n symbols each, one consisting of the letter ‘a’ repeated n times, the other of a random jumble of n letters. The first string carries very little information, while the second carries quite a lot of it – the maximum possible for a string of this length, in fact. The reason for that is essentially that the first string is very predictable, while with the second, you ‘learn something new’ with each letter you read – using all the letters that have come before, you can’t predict what letter you will read next.

In other words, the first string has a very short, compressed description (for instance, something like ‘n times the letter a’), while the second is incompressible – in order to retain all the information, the full string (or something equivalent to it) must be written out. Compressibility thus measures the predictability – the redundancy – in a string of symbols – in a sense, if it follows a certain law, it can be written down concisely, while if it is lawless, if it is random, such a description is not possible.
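
To see this with actual bytes, here is a rough Python sketch (zlib just stands in for an idealised compressor, so the exact byte counts are not the point):
[code]
import random
import zlib

n = 10_000
uniform = b"a" * n                                       # 'aaaa...a'
jumble = bytes(random.choice(b"abcdefghijklmnopqrstuvwxyz")
               for _ in range(n))                        # random letters

print(len(zlib.compress(uniform)))   # a few dozen bytes: almost pure redundancy
print(len(zlib.compress(jumble)))    # several thousand bytes: each letter still
                                     # costs close to log2(26), about 4.7 bits
[/code]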

This already entails one departure from the everyday notion of meaningful information, as to us, a perfectly random string of symbols seems to carry no meaning at all. Sometimes, people talk about ‘potential information’, or information that can be ‘won’, in this case: given a certain code, you can ‘give meaning’ to a random string, and under that code, a random, i.e. completely non-redundant, string can carry far more meaning than a highly non-random one (at least on average). (Incidentally, the quote for which there still seems to be a citation needed in the wiki article is, I believe, from Claude Shannon, the ‘founding father’ of information theory, who said IIRC ‘Information is any difference that makes a difference’, meaning that whenever you have two things that you can tell apart somehow – say, 0 and 1 --, you can use this difference to store (at least) one bit of information.)

This defines the information-theoretic notion of entropy: a highly random string of symbols has a high entropy, a non-random one’s entropy is low. Entropy, in this sense, is the potential to carry meaningful information.

In physics, now, entropy is tied, roughly, to the question of how many microstates there are to a given macrostate (as These are my own pants and Chronos, I believe, have already pointed out). In the given example, one might think of the short description – ‘n times the letter a’ or ‘a random string of n symbols’ – as the macrostate, and the actual string of symbols as the microstate. In the first case, there is only one microstate that fits the macrostate description, while in the second, very many strings fit.

In the physical sense, one might compare this to the number of molecular arrangements that amount to, say, a car versus a pile of stuff – there are obviously very many states that macroscopically lead to the same car – you can exchange a molecule here with a molecule over there without anybody noticing in quite a lot of ways --, but it’s also clear that there are yet enormously more states that amount to ‘a pile of stuff’ (think cars of a different make or model, or neatly stacked raw materials used to make a car, or random arrangements thereof, etc.).

Now we get to what Indistinguishable said:

This is pretty much exactly right. The time evolution of a physical system is effected by applying some transformation to the state of the system at a given moment, such that, if we have a state describable by some string x at a given point in time, the state at a later point in time will be described by a certain function f(x) of that state. If now there is a function f[sup]-1[/sup] we can apply to that evolved state to again get out the state x, i.e. if f[sup]-1[/sup](f(x)) = x, no information has been lost.

If that was too abstract, consider again the string consisting of n times the symbol a, written out, for low n: aaaaaaaaaaaaaaaaaaaa. I can now compress it, i.e. 20a – this amounts to applying a transformation. It’s clear that I can decompress it again to get the original string back. I can also apply a different transformation, for instance creating the string zzzzzzzzzzzzzzzzzzzz. Again, it’s easy to reverse that transformation to get the original string back.

Now take a random string: qosndfkwmeha. I can apply a transformation to it, and get out the string wpdmfgle,rjs. Now, this transformation – which was just ‘replace every symbol with the one to the right of it on the keyboard’ – can be inverted (‘replace every symbol with the one to the left of it on the keyboard’), to yield again the original string. Using another suitable transformation, I might also have converted that string into ‘I like beans’. However, I might also have used a transformation that, say, takes the first 13 letters of the alphabet to the symbol 1, and the second 13 letters to the symbol 0, yielding 000011101111 (I think…). From this string, even with knowledge of the transformation, the original can’t be restored – there are multiple possibilities of ‘going backwards’, one for instance would be zzzzaaazaaaa. In this case, information has been lost.
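
To make the invertible/non-invertible distinction concrete, here is a little Python sketch; the cyclic shift along a keyboard row is only a stand-in for the exact keyboard map described above, but the strings are the ones from this post:
[code]
# A one-to-one map can be undone; a many-to-one map cannot.
row = "qwertyuiopasdfghjklzxcvbnm"
shift_right = {a: b for a, b in zip(row, row[1:] + row[0])}
shift_left  = {b: a for a, b in shift_right.items()}    # the exact inverse

s = "qosndfkwmeha"
t = "".join(shift_right[c] for c in s)
assert "".join(shift_left[c] for c in t) == s           # fully recoverable

# The 'first 13 letters -> 1, last 13 letters -> 0' map is many-to-one:
bits = "".join("1" if c <= "m" else "0" for c in s)
print(bits)   # 000011101111; but 'zzzzaaazaaaa' maps to the same string,
              # so there is no way back to the original
[/code]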

Now, in physics, the microstate of a system is ultimately given by its quantum mechanical description, i.e. something like a wave function, or a state vector (or, more accurately, an equivalence class thereof) in a Hilbert space. A nice feature of quantum mechanics then is that time evolution is unitary, i.e. the transformation of a state at some time into the state at a later time is effected by some quantity U, for which there always exists a quantity U[sup]†[/sup] such that U[sup]†[/sup]U = 1, where 1 is the identity transformation, i.e. the transformation that ‘does nothing’. So that if the state of the system is described by ψ, at a later time it will be described by φ = Uψ, and applying U[sup]†[/sup] gives U[sup]†[/sup]φ = U[sup]†[/sup]Uψ = 1ψ = ψ, i.e. the original state can always be recovered.
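
As a toy numerical analogue (nothing physical is being modelled here; the unitary is just a random 2×2 matrix, and numpy is assumed):
[code]
import numpy as np

rng = np.random.default_rng(0)

# Any unitary will do; here we take one from the QR decomposition of a
# random complex matrix.
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
U, _ = np.linalg.qr(A)

psi = np.array([1.0, 0.0], dtype=complex)   # some initial state
phi = U @ psi                               # the state at a later time

# Because U†U = 1, applying the adjoint undoes the evolution exactly.
assert np.allclose(U.conj().T @ phi, psi)
[/code]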

In this sense, then, quantum mechanics says that information can never be destroyed.

The complication that black holes bring to this picture is roughly the following: By the already mentioned ‘no-hair theorem’, a black hole is described by very few quantities: its mass, its charge(s), and its angular momentum. Thus, if we throw a (quantum mechanical) system into a black hole, it seems that all the information stored in the system does get destroyed, as there is no way to reconstruct, only from knowing those quantities, the quantum system that was thrown in (as there was no way to reconstruct the random string from the 0s and 1s).

However, by a very clever argument, it can be shown that nevertheless, black holes have quite a high entropy – the highest possible for a system occupying a certain volume of space, in fact. Somewhat surprisingly, this entropy turns out to be proportional to the surface area of the black hole (or rather, of its event horizon), rather than to the volume as one might have expected. This led to the holographic conjecture: that all the information about systems thrown into the black hole is in some way encoded on its surface, as the information about 3D images is encoded in 2D holograms, and is thus, in principle, not lost.

There are still some arguments about this, but by now, it is mostly accepted that even black holes are incapable of destroying information.

Half Man Half Wit, thanks.

The holographic principle is not a subject I’ve read much about. I can see why information about infalling matter can still exist around the event horizon, but how can the ‘hairiness’ of the matter that is within the Schwarzschild radius at the ‘moment’ of gravitational collapse be recorded at the event horizon? As this is a fairly obvious objection to the principle I’m sure Susskind must’ve covered it; I’m guessing there must be some kind of appeal to string theory?

Well, yes, but that is not my point. I intended one particular scrambled book, scrambled in a particular (but arbitrary) way. This could carry meaningful information - it might actually be War and Peace (or some other story) in Martian - but most likely it does not mean anything to anyone, and never will. Nevertheless, we know that those letters have (at least) the informational capacity needed to encode War and Peace.

My point is that information in the information-theoretic sense does not have to carry meaningful information (and usually does not). What it does do, however, is provide a measure of the maximum amount of meaningful information that a system might be able to encode. Also, although it may not be possible to destroy *information-theoretic* information, it certainly is possible to destroy meaningful information (or, if you prefer, to render information meaningless).

And, yeah, it is rather unfortunate (though very understandable) that Shannon chose to use the word “information” as the name for the property that he defined and showed how to quantify. It does not mean the same thing as “information” does in common parlance (although it does mean something closely related), and the subtle but significant differences between the two concepts have led to all sorts of confusions.

Bravo Half Man Half Wit. That was worth my $8.00 for the year right there.
Going back to folks’ objections to the term “information” having a non-everyday meaning …

In high-school level physics we have the terms work, momentum, energy, and power. Almost no layman uses them correctly.

Engineers speak of stress & strain. Again very few laymen use them correctly.

Science is like that.

Or more precisely, laymen are like that. Woolly-headed vague shorthand is how most people think & talk most of the time. Why? Because that works well enough for most people’s purposes most of the time as they try to get through their 21st Century day using a (Neanderthal+1) version brain. Or are we version 2.0 chimps? I can never remember.

Expecting the two realms to line up seamlessly is silly. Demanding a technical coined term for each non-everyday notion would render all of science even less approachable than it is.

The simple answer to that is that in a system undergoing gravitational collapse, the Bekenstein bound does not actually hold, exactly for the reason you mention – before being destroyed at the singularity, the system’s surface area becomes arbitrarily small. Bekenstein’s original derivation is limited to weakly self-gravitating systems, which a collapsing star emphatically is not.

However, one can appeal to a more general entropy bound, Bousso’s so-called covariant entropy bound, for which holography can be saved; but the details of that are a little beyond a forum post (plus, I’d have to look 'em up…). If you’re interested, the original paper by Bousso can be found here, and I think Bekenstein’s Scientific American article is of general interest, and also mentions Bousso’s bound briefly.

Heh, thanks. :slight_smile:

Bravo to Half Man Half Wit. Seriously good.

Scrambling a book and talking about the information only gets you half of the problem. Implicit in the book’s coding is a decoder, and there is information in that too. The question of compression comes up again.

Worrying about the nature of the being at the other end of the channel doesn’t actually help work out the information content of a book.

I can encode my entire CD collection into a set of 10 bit numbers. The trick is that the decoder requires my CD collection to decode the number. I simply index the CDs. So I can encode Karajan conducting Beethoven’s 9th into nothing more than this: 1010011011. Which is a good trick. Also silly.
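
Spelled out as deliberately silly Python, with made-up titles:
[code]
# The whole 'codec' is nothing but the collection itself, in a fixed order.
# (Any list of up to 2**10 discs works the same way.)
collection = ["Karajan / Beethoven 9", "Kind of Blue", "Abbey Road"]

def encode(title):
    return format(collection.index(title), "010b")   # a 10-bit index

def decode(bits):
    return collection[int(bits, 2)]

code = encode("Karajan / Beethoven 9")
print(code)           # '0000000000' here; the exact bits don't matter
print(decode(code))   # decoding only works if you already own the collection
[/code]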

The point is that estimating the information content of a channel is only part of the question. As has been observed, the trick is to see whether the stream is compressible. This is hard. A stream is compressible if you have a better than random chance of predicting the value of any token in the stream from all the other values in the stream. This directly leads to the oft-quoted point that a perfectly compressed channel is indistinguishable from noise.

However, the total information content also requires that you include the information in the compressor. Which is usually done by measuring the size of a program needed to implement the compressor. When I compressed my CD collection into 10 bit tokens, I needed the entire CD collection worth of data to express the codec.

For instance, 20a may well encode aaaaaaaaaaaaaaaaaaaa nicely. But unless you include something that can interpret the string 20a, as part of the totality, you don’t have a full description. Obviously for trivial streams this makes things go rather the wrong way, but as soon as we have some reasonable amounts of data we get traction. A simple Huffman encoding of War and Peace will claw back vastly more data space than is needed to express the codec.
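
Here is roughly what ‘include the decoder in the total’ looks like in practice (the run-length scheme, and measuring the decoder by the size of its source text, are just illustrative conventions; run it as a script so inspect can find the source):
[code]
import inspect

# Total description = compressed data + the program needed to expand it.
def rle_decode(s):
    """Expand strings like '20a' into 'aaaa...a' (count, then one symbol)."""
    return s[-1] * int(s[:-1])

compressed = "20a"
assert rle_decode(compressed) == "a" * 20

decoder_size = len(inspect.getsource(rle_decode))
total = len(compressed) + decoder_size
print(total)   # for a 20-character string this 'total' is a net loss,
               # but for 'a' * 10**6 the same decoder pays for itself
[/code]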

Eventually you can simply send a program that, when run, emits the encoded data. The duality between manifest data and program becomes complete. Clearly if the encoded data stream is already the equivalent of noise, the program you send may well just be a trivial wrapper around a data block of the original data. Random number generators are fun like this. If you know the algorithm for the generator you can send a tiny amount of data for what was a massive amount of source data. But if you don’t, the data is essentially incompressible. Yet the information content is still actually almost zero. This leads to the rather interesting result that streams at both the maximum and the minimum information density are the hardest to compress, as both appear incompressible.
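
Concretely, with Python’s built-in Mersenne Twister as the generator (randbytes needs Python 3.9 or later):
[code]
import random
import zlib

# A million bytes of pseudo-random output...
stream = random.Random(42).randbytes(1_000_000)

# ...is essentially incompressible if all you have is the bytes themselves:
print(len(zlib.compress(stream)))   # slightly larger than the input, in fact

# ...yet its real information content is tiny: the seed plus the knowledge
# of which generator was used reproduces every byte exactly.
assert random.Random(42).randbytes(1_000_000) == stream
[/code]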

So a scrambled book could have a whole range of information contents depending upon what we try to do with it. If it was a purely stochastic scrambling we have the strange situation where the ability to compress the stream will drop. A typical compressor looks for series of common sub-streams. In the simplest form this will get you a table of all the words and phrases that appear more than once. Then the compressed book is nothing more than a stream of indexes into the table. The scrambled book will retain the letters and thus the letter frequency, and so a Huffman encoding will still gain traction, but not so much as the better codec did on the unscrambled book. So the information content is different. But how and why it differs is not something you can tell from the stream by itself.
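
This is easy to try on any large text file you have lying around; the filename below is just a placeholder, and the exact numbers depend on the text and the compressor:
[code]
import random
import zlib

text = open("war_and_peace.txt", "rb").read()   # any large text file will do

scrambled = bytearray(text)
random.shuffle(scrambled)          # letter frequencies survive the shuffle,
scrambled = bytes(scrambled)       # repeated words and phrases do not

print(len(zlib.compress(text)))        # repeated sub-strings compress well
print(len(zlib.compress(scrambled)))   # only the skewed symbol statistics
                                       # help now, so this comes out larger
[/code]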

Imagine that instead of a stochastic process to scramble the letters we used an algorithm that is fed another data stream, and the contents of that data stream is used to control the scrambling. Further, the scrambling is done such that if we know the algorithm and have a copy of War and Peace we can reconstruct the initial data stream. What is the information content of the stream now? Is it more or less than the stochastically scrambled version or more or less than the non-scrambled version of War and Peace? Does use of War and Peace as the descrambling key make for a greater or lesser information content in the channel? The obvious relationship to cryptography is pretty clear. Indeed the relative metrics of information and deviation from pure stochastic noise of a channel are part of how codes get broken.
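
A minimal sketch of such a keyed scrambling, with a simple modular shift over lowercase letters standing in for whatever the real algorithm might be:
[code]
# cipher[i] = (book[i] + key[i]) mod 26, over lowercase letters only.
def scramble(book, key):
    return "".join(chr((ord(b) + ord(k) - 2 * 97) % 26 + 97)
                   for b, k in zip(book, key))

def recover_key(book, cipher):
    return "".join(chr((ord(c) - ord(b)) % 26 + 97)
                   for b, c in zip(book, cipher))

book = "warandpeace"       # stands in for the full, unscrambled text
key  = "theotherstream"    # the data stream driving the scrambling
cipher = scramble(book, key)

# Knowing the algorithm and owning the book hands you back the key stream:
assert recover_key(book, cipher) == key[:len(book)]
[/code]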

To be fair to the layman, all of those terms had their lay definition long before their scientific definition. Saying that laymen use them “incorrectly” isn’t really accurate.

Does anyone else ever start reading a thread and then wish they hadn’t?

I might lie down and try and pretend this thread isn’t scrambling my brain.

Granted. What I really meant was this …

Folks complained “But these darn scientists are using the term ‘information’ in a non-lay sense. And I’m confused by that. Mommy make them stop! Waah!!!”

My comment was to point out that this happens a lot, and not just with high-falutin’ quantum physics. So while the folks of that opinion are strictly correct, the Waah!!! attitude is unhelpful.

In any conversation about complex things you can safely assume most terms have specific meanings above & beyond the day-to-day meanings. Refusing to accept that is deciding *a priori* to fail to understand the conversation.

There’s nothing wrong with someone saying, “Wow, I never knew ‘information’ had such a specialized meaning”. But if they go on to say “…And it shouldn’t either.”, then I object to their POV.

Thanks, all, for the responses so far. Special thanks to Half Man Half Wit. Exceptional!
You’ve given me a lot to think about…

This same argument, even more simply, also implies that information can never be created; one can determine the entire state of the universe at time T2 from a description of the entire state of the universe at earlier time T1 (simply apply the deterministic transformation U as needed).

But it is in this realm of quantum mechanics that many feel the clearest, most undeniable need arises to consider physical systems as evolving nondeterministically forward in time (via the “true randomness” of measurement collapse or some such).

What does physics actually say about this (or, since I imagine this is tangled up with unsettled interpretational issues, what would the physicists in this thread say about this?)? Is there a temporal-asymmetry in quantum mechanics with respect to information-destruction vs. information-creation or not, and, more importantly, how should I think about it?

And this intuition is really the whole reason we use entropy as a measure of information. Consider an experiment where we draw a random letter from an alphabet many times. Intuitively, the less able we are to predict the next letter, the more information it conveys. We want to be able to measure that notion of information, and there are three reasonable properties our measure should have:
[ol]
[li]It should be additive. If you draw two letters, the amount of information they contain should be the amount of information the first letter contains plus the amount of information the second letter contains.[/li]
[li]It should be nonnegative. Drawing another letter shouldn’t leave us knowing less than we started off with.[/li]
[li]It should be maximized when all of the letters are equally likely. If some letters are more common than others, we can make better predictions than we can in the uniform case.[/li]
[/ol]
As it turns out, information theoretic entropy is the only measure that satisfies all three of these properties, and that’s why we use it.
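
A quick numerical sanity check of those three properties, using the usual H(p) = -Σ p log2 p (plain Python, arbitrary example distribution):
[code]
import math
from itertools import product

def H(p):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

p = [0.5, 0.25, 0.25]   # an arbitrary three-letter alphabet

# 1. Additive: two independent draws carry H(p) + H(p) bits.
joint = [a * b for a, b in product(p, p)]
assert math.isclose(H(joint), 2 * H(p))

# 2. Nonnegative: each term -q*log2(q) is >= 0 for 0 < q <= 1.
assert H(p) >= 0

# 3. Maximal for the uniform distribution over the same alphabet.
assert H(p) <= H([1/3, 1/3, 1/3])   # the latter is log2(3), about 1.58 bits
[/code]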

Indeed, information is conserved in this case. This might seem at odds with what we observe – certainly, there appears to be more and more information as time goes by; I’m adding to the total right now (as has been said, information needn’t be meaningful…) --, but it’s not any different than patterns of great (apparent) complexity emerging from very simple descriptions, programs or what have you.

Think about a computer producing, and subsequently executing, all possible programs, i.e. it systematically produces all possible bit-strings, then interprets them as programs, and executes them in a dove-tailing fashion, step by step. This is something one could implement in a rather simple manner, i.e. write a comparatively short ‘master program’ containing relatively little information.
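
Just to make the dove-tailing schedule itself concrete: the ‘programs’ below are trivial stand-ins that merely echo their own index bits, whereas a real master program would hand each bit string to a universal machine.
[code]
from itertools import count, islice

def toy_program(n):
    # Stand-in for 'the program with index n': it just emits its own binary
    # digits over and over.
    bits = format(n, "b")
    i = 0
    while True:
        yield bits[i % len(bits)]
        i += 1

def dovetail():
    # Stage k: start program k, then advance every program started so far
    # by one step. Every program eventually gets unboundedly many steps.
    programs = []
    for stage in count(1):
        programs.append(toy_program(stage))
        for prog in programs:
            yield next(prog)

print(list(islice(dovetail(), 15)))   # the interleaved output of all 'programs'
[/code]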

Now if our universe is computable, sooner or later, this master program will produce, and then execute, a program that is equivalent to our universe. This program will in general be longer and more complex than the master program (unless our universe is quite a bit simpler than we think it is). So it would appear as if information had been created – and certainly, to the inhabitants of the universe (that’s us), it might well appear as if information continues to be created. Whatever the master program produces can again be viewed as simply a binary string, and that may well be incompressible, so that knowledge of the evolution of the universe up to a certain point alone does not determine the future of that universe – which to a certain extent seems indeed to be the situation we find ourselves in, wrt the unpredictability of quantum mechanics.

But on the fundamental level, of course, nothing like that persists. All that is needed to reproduce this behaviour is the finite, and quite small, information to specify the master program – information, in total, is conserved. It is only our being constrained to one specific computational history, one specific program created and run by the master program, that makes it appear otherwise.

This is illustrated in the fact that knowing only the master program is perfectly useless, if you want to have a ‘theory of everything’ (and do anything with it). Rather, one would also have to specify the precise computational history that contains our universe – and in general, that specification would be as extensive on average as the information our universe contains.

Think about Borges’ Library of Babel – in it, every possible book of a certain length, i.e. every possible permutation of a certain amount of letters, is stored. The complete description of the library is quite short, much shorter than any given book it contains, similarly to the master program. The trick is using the library, i.e. finding a book – imagine a catalogue, in which the books are indexed in lexicographic order. Since for every book it contains, and for every n, there is another book identical to it up to the nth letter, the average catalogue entry for a book is as long as the book itself! So the catalogue won’t help you – you need as much information as is contained in a book to find a book.

In a sense, in order to extract the information you need, you must ‘pay’ first an equal amount of information in order to satisfy ‘conservation of information’ – analogous to how a system with zero total momentum can split up into two systems, each carrying an equal but opposite amount of momentum.

This is what I think (and I must emphasize that I speak only for myself here) essentially happens at wave function collapse – the system ‘splits up’, and the total amount of information remains the same, even though in each branch, it doesn’t appear to. If out of multiple possible options, just a single one is realised, information would have been created; but if each possibility occurs, the total information content of the ensemble stays the same (because knowledge of the system at point T1 at least always implies knowledge of all possibilities for the system’s state at point T2, so if they all happen, nothing new is learned). (And yes, this is essentially the many-worlds interpretation, though I’m not sure if I subscribe to the view that all those many worlds are ontologically ‘real’.)

In this way, the total information content of the universe (or multiverse, if you will) may be very low, being essentially the information content of the ‘master program’ – taking the reasoning above to the extreme, its information content may even be zero, as in this sense, the set of all bitstrings has zero information, being the complement of the empty set.

This has gotten rather rambling, but I don’t have time right now to go over it again and tighten it, so apologies for any lack of clarity…

I don’t understand much of this thread, but that won’t prevent me from asking the following about such a definition:

If at T2 there is a high energy photon passing through a point in space-time, and having a certain wavelength, etc., how is it possible to determine if said photon had just been created (and, if so, through what process) or had been present already before T1?

My apologies for dumbing down this discussion.

Yeah, the combination of those three conditions is basically the uniqueness argument Shannon explicitly gave for the definition of entropy in the appendix of his 1948 paper, but I always thought invoking all three was rather beside the point for justifying the definition of entropy, as condition 1 in itself already essentially says it all; talking about entropy is just another way of talking about probability, the way you would do it if for whatever reason you preferred to use the language of addition instead of multiplication for conditioning upon new information.

Illustrating what I mean, suppose one has some fixed probability distribution, and one samples from it independently a large number of times N; then one predicts each possible result X[sub]i[/sub] to occur about p(X[sub]i[/sub])N many times, so that the overall sequence of results is one with an a priori probability equal to the product of all the p(X[sub]i[/sub])[sup]p(X[sub]i[/sub])N[/sup]; that is, the Nth power of the product of all the p(X[sub]i[/sub])[sup]p(X[sub]i[/sub])[/sup]. Amortizing this, learning the information from each individual new sample multiplies the a priori probability of what has been learnt so far by the product of all the p(X[sub]i[/sub])[sup]p(X[sub]i[/sub])[/sup].

All of which is standard probability expressed in a standard probability way. But, as we see, measuring probabilities is measuring something which multiplies in a uniform way with each new sampling, rather than adding in a uniform way. If, for whatever reason, one preferred to think of this as adding instead of multiplying, then the thing to do would be to, well, rename multiplication into addition; i.e., look at it logarithmically. Which gives us log(the product of all the p(X[sub]i[/sub])[sup]p(X[sub]i[/sub])[/sup]) – or rather, its negative – which is the Shannon entropy.
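
A small numerical check of that identity (note the sign: the log of a product of probabilities is negative, and the entropy is its negation):
[code]
import math

p = [0.5, 0.25, 0.25]   # any distribution will do

# The (negated) log of the product of the p_i ** p_i ...
product_form = -math.log2(math.prod(q ** q for q in p))

# ... is exactly the usual -sum p_i log2 p_i.
sum_form = -sum(q * math.log2(q) for q in p)

assert math.isclose(product_form, sum_form)   # both come to 1.5 bits here
[/code]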

Under this view, a single physical configuration’s future may consist of multiple separate branches; one can deterministically evolve the initial physical configuration into the set of future branches, but there is no distinguished selection of any particular such branch.

Should we also consider a single physical configuration’s past to consist of multiple separate branches in the same fashion? Why or why not?

But don’t predictability and compressibility rely on the interpreter? It seems the sequence of a’s can carry much information if the interpreter has a position-dependent meaning. Meaning that an “a” in position 2 is as different from an “a” in position 1 as it is from the alternative in which position 2 is a “b”.

I have seen this explanation before and I didn’t understand how it could be said in an absolute sense that the sequence of the same characters had less information than the sequence of different characters: with the proper interpretation or mapping - the first could have the most information and the second could have the least - at least it seems that way.