In *general*, how do encryption algorithms prevent 0s mid-string?

Here’s an idea: have the acceptable range of the encrypted output be 0 to 254, then add 1 to each encrypted byte.
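Something like this toy sketch, say (not a real cipher, just the mod-255 shape of the idea; it assumes the plaintext bytes are already in 0..254, e.g. ASCII text, and that the keystream comes from somewhere real):

```c
#include <stdint.h>
#include <stddef.h>

/* Toy additive "cipher" working mod 255 instead of mod 256.
   NOT real encryption -- it only illustrates the "0..254, then
   add 1" trick.  Assumes every plaintext byte is < 255 (true
   for ASCII text) and that keystream[] comes from a real cipher. */
void toy_encrypt(const uint8_t *plain, uint8_t *out, size_t n,
                 const uint8_t *keystream)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t c = (uint8_t)((plain[i] + (keystream[i] % 255)) % 255);
        out[i] = (uint8_t)(c + 1);  /* shift 0..254 up to 1..255: no zero bytes */
    }
}

void toy_decrypt(const uint8_t *cipher, uint8_t *out, size_t n,
                 const uint8_t *keystream)
{
    for (size_t i = 0; i < n; i++) {
        int c = cipher[i] - 1;      /* back to 0..254 */
        out[i] = (uint8_t)((c + 255 - (keystream[i] % 255)) % 255);
    }
}
```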

Because long sequences of 0s can, by definition, be compressed.

That doesn’t mean the compressed result will have no zeros in it.

It is a common need, e.g. in whole-file encodings like JPEG. Those boasting of systems that avoid the problem entirely may not have worked with many real-world high-performance encodings.

A simple solution (and IIRC something along these lines is used for JPEG, though it’s solving a slightly different problem):

Replace FF or 00 with FF FF or FF 01 respectively. (Codes FF 02 through FF FE are freed up for use as markers.) 1/128 of the time, 16 bits instead of 8 bits are used to represent a byte, for an average loss of 1/16 bit per byte.

With this approach, your file will grow by 0.78% on average (if 00 and FF each occur with 1/256 probability). If this is too much waste, other, more complicated approaches are possible. (Note that the “waste” does build in a crisp, possibly useful, way to insert markers.)
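For the curious, the two passes might look roughly like this in C (just a sketch; the caller owns the buffers, and the worst-case output is 2x the input):

```c
#include <stdint.h>
#include <stddef.h>

/* Escape: 00 -> FF 01, FF -> FF FF.  Output buffer must hold up
   to 2*n bytes in the worst case.  Returns bytes written. */
size_t escape(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0x00)      { out[j++] = 0xFF; out[j++] = 0x01; }
        else if (in[i] == 0xFF) { out[j++] = 0xFF; out[j++] = 0xFF; }
        else                      out[j++] = in[i];
    }
    return j;
}

/* Unescape: inverse of the above.  Returns bytes written, or
   (size_t)-1 if the input ends mid-escape or uses a marker code. */
size_t unescape(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] != 0xFF) { out[j++] = in[i]; continue; }
        if (++i == n) return (size_t)-1;        /* truncated escape */
        if (in[i] == 0x01)      out[j++] = 0x00;
        else if (in[i] == 0xFF) out[j++] = 0xFF;
        else return (size_t)-1;                 /* FF 02..FF FE: marker */
    }
    return j;
}
```

Note the escaped stream can still contain 0x01 and 0xFF, just never a bare 0x00, which is the whole point.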

Exactly. It’s no longer a string; it’s an octet sequence with a specific length.

Agreed. The algorithm can be ‘unbiased’ within any range of possible values.

In the same sort of way that a six-sided die is no more or less ‘random’ than a 12-sided die.

That’s exactly what base64 is for. Unfortunately it takes more space, but it’s pretty trivial and fast to convert.

Well, nobody’s “forced” into a lot of things. It becomes a matter of relative convenience. For some specific problem environments, something like OP’s proposal may often be best. I’ve done something like this at least a dozen times (though I think I’ve also written more code than most Dopers).

Yes, but the simplicity of modular-256 arithmetic over mod-255 will often be too good to pass up.

No, AdamF is.

Another advantage of base64 is that, since data is passed as strings, you may not know what kind of processing might be applied to “text strings”. If you stick to base64, every byte will contain a well-defined printable ASCII character, with the least likelihood of getting mangled.

With base64, your text gets 33% bigger (it takes 4 bytes to carry 3 bytes of data). That’s the biggest disadvantage.
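You can see where the 33% comes from in the core of an encoder; here’s a bare-bones C sketch (in real code you’d grab a library, but the 3-in/4-out shape is the whole story):

```c
#include <stdint.h>
#include <stddef.h>

static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode n bytes; out must hold 4*((n+2)/3) chars plus a NUL.
   Every 3 input bytes (24 bits) become 4 output chars (6 bits
   each), hence the 33% growth. */
void b64_encode(const uint8_t *in, size_t n, char *out)
{
    size_t i = 0, j = 0;
    while (i + 3 <= n) {                      /* whole 3-byte groups */
        uint32_t v = (uint32_t)in[i] << 16 | (uint32_t)in[i+1] << 8 | in[i+2];
        out[j++] = B64[v >> 18 & 63];
        out[j++] = B64[v >> 12 & 63];
        out[j++] = B64[v >> 6  & 63];
        out[j++] = B64[v       & 63];
        i += 3;
    }
    if (n - i == 1) {                         /* 1 leftover byte */
        uint32_t v = (uint32_t)in[i] << 16;
        out[j++] = B64[v >> 18 & 63];
        out[j++] = B64[v >> 12 & 63];
        out[j++] = '=';
        out[j++] = '=';
    } else if (n - i == 2) {                  /* 2 leftover bytes */
        uint32_t v = (uint32_t)in[i] << 16 | (uint32_t)in[i+1] << 8;
        out[j++] = B64[v >> 18 & 63];
        out[j++] = B64[v >> 12 & 63];
        out[j++] = B64[v >> 6  & 63];
        out[j++] = '=';
    }
    out[j] = '\0';
}
```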

There’s no way to predict how many zeros you’ll get, so with a “zero-replacement” protocol, you have no idea how long your messages might be. If your protocol is efficient enough that the max buffer size can still handle your max ciphertext plus your zero-hiding trick, then sure, go for that if you prefer. But base64 will be recognized and understood by others: less to explain.

PGP uses this, or something like it (its “ASCII armor” is Radix-64, essentially base64 plus a checksum), when encrypting emails, to ensure that the encrypted email is nice ASCII text with nothing weird in it that would get corrupted by some program along the way.

What’s the objection to just using Base64? Is there some need to minimize the length or something?

Base64 is a fine solution. But sometimes reducing the average length will have merit, e.g. compression schemes, or large tables.

Pffft, these kids today with their fancy base64. In my day we used UUEncode and were glad to have it.

Reminder: the issue isn’t a string of 0s, it’s having a zero in a string.

I also don’t see why the length variable should be constrained to a byte.

Note: the OP’s question really has nothing to do with encryption. It’s just about dealing with nulls in strings. How they got there doesn’t really matter.

You can get really creative if you want to. Suppose the string has 53 null bytes. First say: “I’m sending you 54 strings.” Then send the string. At the other end, keep reading until you’ve got 54 strings, concatenating them together including the null byte. (And remembering that 2+ nulls in a row results in empty strings.)
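A sketch of the sending side in C (send_string() is hypothetical, standing in for whatever your transport does; it assumes the caller keeps a sentinel NUL at buf[len] so the last piece is properly terminated):

```c
#include <stddef.h>

/* Count how many "pieces" a buffer of len bytes splits into:
   one more than the number of embedded NULs. */
size_t count_pieces(const char *buf, size_t len)
{
    size_t nuls = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\0')
            nuls++;
    return nuls + 1;
}

/* Send each piece as an ordinary C string; the receiver rejoins
   them with NULs.  Assumes the caller keeps a sentinel NUL at
   buf[len] so the final piece is terminated. */
void send_pieces(const char *buf, size_t len,
                 void (*send_string)(const char *))
{
    size_t start = 0;
    for (size_t i = 0; i <= len; i++) {
        if (i == len || buf[i] == '\0') {
            send_string(buf + start);  /* empty string for 2+ NULs in a row */
            start = i + 1;
        }
    }
}
```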

And if you are using something like “getstring” on the other end, may God have mercy on your code.

Ah - I get it! (I think - at least, I can now do what I want, and understand how string to string works).

By taking the 4-byte integers that XTEA gives me, and dividing them out into 6 bytes (each one holding a number between 0 and 63), I now have a few extra bits. I can map the values 0-63 to any characters I want, thus I can make it a string with no zeroes.

And I can make it encoding-independent by doing it with math, instead of depending on what bits are where. Take it mod 64, then divide by 64, then take the mod 64 again…
I get a 6-digit base-64 number.
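In C, that digit extraction for one 32-bit XTEA word might look like the sketch below (the 64-character alphabet is an arbitrary choice of mine; any 64 non-NUL characters will do):

```c
#include <stdint.h>

/* 64 printable characters, none of them '\0', to map digits onto. */
static const char ALPHABET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Turn one 32-bit cipher word into 6 base-64 digits, pure
   arithmetic, no assumptions about bit layout.  Digits come out
   least-significant first; out must hold 6 chars. */
void word_to_digits(uint32_t w, char out[6])
{
    for (int i = 0; i < 6; i++) {
        out[i] = ALPHABET[w % 64];   /* take mod 64...      */
        w /= 64;                     /* ...then divide by 64 */
    }
}

/* Inverse: rebuild the 32-bit word from the 6 characters. */
uint32_t digits_to_word(const char in[6])
{
    uint32_t w = 0;
    for (int i = 5; i >= 0; i--) {
        uint32_t d = 0;
        while (d < 64 && ALPHABET[d] != in[i])
            d++;                     /* linear search; assumes valid input */
        w = w * 64 + d;
    }
    return w;
}
```

Six digits give you 36 bits of room for 32 bits of data, which is where the spare bits come from.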

Thanks for all the input, everyone! Anyone I didn’t understand - I’m sorry.

I think I became a little bit better of a code jockey today!

The other big advantage to using base64 is that it’s a standard technique, so there are standard tools for it. You probably could write your own implementation of it… but you don’t need to, because you can probably find a library for your language of choice that’s already implemented it.

Olde-tyme C programmers were routinely derided for their reluctance to use C Standard Library functions, preferring to re-invent all their own wheels every time.

This discussion clarifies exactly why.

How do C programmers hunt elephants?

How Programmers Hunt Elephants

You are correct, my mistake.

Comments like this confuse me. How many programmer man-minutes do you think are needed to build base64 routines from scratch? Or the simple escape mechanism I described?

Oh, it might only take a few minutes to implement a straightforward base64 routine that works with some nice clean input. Dealing with the edge cases, error conditions, input that does not conform to expected parameters - that takes the time. And if you don’t take the time when you write it originally, you will take huge amounts of time trying to figure out the unexpected behaviour of your wider system.

This is why using a standard library routine that has all the edge cases/error conditions/every possible input covered does actually save time and make the code more reliable.

And, as always, I quote one of my Computing lecturers on implementation estimation: “take your original guess, double it, and shift to the next order of magnitude.”

Experience has shown me that he was pretty accurate.