I’ve written more code than you might guess. I once bid on writing a complete operating system for a Silicon Valley company. I estimated the project would take me ten weeks. It turned out to take … ten weeks.
I’m not sure what “edge cases” you’re worried about in something as trivial as base64. An input length that isn’t a multiple of three introduces a slight difficulty, and the base64 standard already handles it with a 65th character, “=”, used as padding.
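To make it concrete, here’s roughly the whole thing (a quick C++ sketch, dashed off in the spirit of man-seconds; treat it as an illustration, not a vetted library):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// The 64-character alphabet; '=' is the 65th character, used only as padding.
static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

std::string base64_encode(const unsigned char* data, size_t len) {
    std::string out;
    size_t i = 0;
    for (; i + 3 <= len; i += 3) {   // whole 3-byte groups -> 4 characters
        uint32_t n = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2];
        out += B64[(n >> 18) & 63];
        out += B64[(n >> 12) & 63];
        out += B64[(n >> 6) & 63];
        out += B64[n & 63];
    }
    size_t rem = len - i;            // the "slight difficulty": 1 or 2 bytes left
    if (rem > 0) {
        uint32_t n = data[i] << 16;
        if (rem == 2) n |= data[i + 1] << 8;
        out += B64[(n >> 18) & 63];
        out += B64[(n >> 12) & 63];
        out += (rem == 2) ? B64[(n >> 6) & 63] : '=';
        out += '=';
    }
    return out;
}
```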
If I were to estimate the man-power required to implement base64, I’d be tempted to use man-seconds rather than man-minutes as the basis. Admittedly, I write code faster than most.
BTW, one reason you might prefer not to use a library base64 implementation is that, IMO, some silly choices were made for the pad character and the two non-alphanumeric characters in the standard implementation. “+”, “=” and “/” need to be escaped in too many contexts. For instance, if you are transmitting the base64 string in a URL parameter, the “+” needs to be escaped to avoid being interpreted as a space, and the other two are supposed to be escaped whether or not you can actually get away with them in a particular query string implementation.
This is one reason there are a whole gaggle of variant implementations out there, changing those characters and using or omitting the trailing pad character. Of course, it becomes a toss-up whether you just implement it yourself with alternate characters, because it’s simple, or wrap the library version and remap the offending characters. And in a URL, you probably have a URL encoding routine anyway, which you can apply to the base64 output; that’s probably the better approach if URLs are all you’re worried about.
("_" which is an id character in many languages would have been a better choice. “.” already is used in numeric strings and will pass anywhere floating point numbers pass unescaped. Perhaps “-” for the other, although it looks too much like underscore when inspecting the string manually. The “+” would have been OK, except for the conflict with URL encoding, a primary place base64 was going to be used.)
I know this is GQ but here’s my opinion anyway: please, no one write their own implementation of a base64 encoder. Someone else may have to maintain your code some day. Or even just write new code that reads in the messages you produce. If you need to escape +, =, and /, use another standard library that does URL encoding and decoding.
These are not just theoretical concerns. I once had to debug code that would incorrectly truncate strings on semicolons and only semicolons. Every other special character worked fine.
I’ve also had to write converters to encode and decode the military’s VMF format, in which different messages use different numbers of bits (anywhere from 23 to 26) to represent latitude and longitude.
I know it’s tempting to be clever and come up with the most perfectly efficient solution to your problem, but program maintenance is orders of magnitude more expensive than hardware these days. Please use a standard library that implements a standard algorithm.
I actually chose ‘&’ as one of the two non-alphanumerics, even though some of this has to travel through XML :smack:.
But I fixed that in 9 seconds.
As I said, I looked for an OS-independent string-to-string C (or at least Objective-C and C#) solution and couldn’t find one. I took a well-tested encryption routine and wrapped enough around it to turn strings into ints and back into strings.
And I learned a lot about my original academic question.
Null-terminated strings are a C thing, not a C++ thing. std::string is a length-counted string. You can, of course, use the C standard library functions in C++, but only degenerates and masochists do this.
But even in C++, a literal string written into the source code is an array of const char with a \0 at the end (decaying to a const char * at the slightest provocation), and sometimes that matters, even when you’re using std::string for everything else.
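A two-line demonstration that the terminator is baked into the literal itself:

```cpp
#include <cstdio>

int main() {
    // "abc" is a const char[4]: three characters plus the trailing \0.
    static_assert(sizeof "abc" == 4, "the \\0 is counted");
    std::printf("%zu\n", sizeof "abc");   // prints 4
}
```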
In Java, string literals are of type java.lang.String, just like any other string, which avoids all of these problems.
That’s true, and it’s one of C++'s (many) failings. As it is, you can’t even initialize a string from a literal containing a \0 without giving its length explicitly:
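```cpp
#include <cassert>
#include <string>

int main() {
    std::string a("foo\0bar");     // const char* constructor: stops at the \0
    std::string b("foo\0bar", 7);  // counted constructor: keeps all 7 chars
    assert(a.size() == 3);
    assert(b.size() == 7);
}
```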
String literals in source should be expressions of type std::string const& (or probably just std::string with no qualifiers). There’s no excuse for it to be otherwise, and the only explanation is that Bjarne Stroustrup made the initial boneheaded decision to imitate (and even worse, seek compatibility with!) the most brain-dead language of his era, and that and countless other mistakes have now ossified into a 1500-page international standard.
…Despite all that, I actually do like C++; I’m not bitter or anything, I swear!
Encoding into and out of base-64 is about as straightforward as it gets in C. You’re much better off using it than any other method of trying to prevent 0 bytes.
Ultimately, if you need your output to be a legal ASCII string, then you’ll need to convert it to base-64 either way. If it doesn’t need to be a legal ASCII string, then a 0 byte isn’t any stranger than a value greater than 127. The only reason to make something legal ASCII is if you expect your string to be relayed through a variety of transport mechanisms, like e-mail, or dumped into a webpage; those might end up chomping characters, or helpfully converting them to something else due to encoding errors, unless you stick within the safe ASCII range.
But if you’re just having two software programs talk to one another, a zero byte is fine. C, C++, and C# can all deal with a zero byte in an array. You just avoid using string functions on the input until you have decoded the data.
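For instance (a minimal sketch; the point is just that the length travels alongside the buffer, so a string function is the only thing the zero byte can confuse):

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // A payload with an embedded zero byte, carried with an explicit length.
    const unsigned char payload[] = { 'h', 'i', 0, 'y', 'o' };
    std::vector<unsigned char> buf(payload, payload + sizeof payload);

    // strlen() stops at the zero; the vector knows the real length.
    std::printf("strlen sees %zu bytes, the buffer holds %zu\n",
                std::strlen(reinterpret_cast<const char*>(payload)),
                buf.size());   // prints: strlen sees 2 bytes, the buffer holds 5
    return 0;
}
```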
Often true, but sometimes something is easy enough to code up that it’s more effort to find a library with the right license and integrate it into your build system than to spend the five minutes it takes to write a base-64 encoder.
But, if you’re using Java or PHP or something else which already supports this in the standard library, you should use the standard library.
I think you might have gotten this backwards. Standardized base 64 using ‘+’ as an encoding character (RFC 1421) predates standardized URLs using ‘+’ as a query string word separator (RFC 1630). Despite its shortcomings, the RFC 1630 specification actually recommends using base 64 to encode binary data.
OK, but then I’ve always thought that the “+” to encode spaces in URLs was a silly choice, too, on similar grounds, and especially in light of that recommendation. Either way, it’s an unfortunate conflict.
Note, BTW, that Java SE didn’t have a Base64 implementation in the standard library until quite recently (1.8). An awful lot of Java coders wound up rolling their own, and including it in various jars of their own software. Me, too. Some of them, who were using it in an enclosed fashion that never surfaced the encoded string to a third party, probably succumbed to the temptation to twiddle with the special characters, or to drop the pad character and determine the number of bytes from the length of the encoded string, allowing a “short” block at the end. Java has had a URLEncoder since 1.0, though, so it’s easy enough to run the base64-encoded string through that.
So did a lot of people outside the Java world. See Wikipedia’s listing of variants.
BTW, that lists RFC 1421 as “deprecated”, replaced by several others for various contexts.
You know, you could actually put the resulting uints into base 62 instead of base 64, because a 6-digit base-62 number is still bigger than the maximum unsigned 32-bit integer (62^6 = 56,800,235,584, comfortably more than 2^32 = 4,294,967,296).
Then one could avoid the two extra non-alphanumerics altogether.
But I guess a lot of people stick with base 64 because one can do it by bit-shifting, rather than having to mod and divide by the base.
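A sketch of the base-62 version, for comparison (a hypothetical base62_encode for a single 32-bit value; six mod-and-divide steps where base 64 would use shifts and masks):

```cpp
#include <cstdint>
#include <string>

// Sketch: encode one 32-bit unsigned value as exactly six base-62 digits.
// 62^6 = 56,800,235,584 > 2^32, so six digits always suffice; the price is
// a mod and a divide per digit instead of a shift and a mask.
static const char B62[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

std::string base62_encode(uint32_t n) {
    std::string out(6, '0');
    for (int i = 5; i >= 0; --i) {   // fill least-significant digit last
        out[i] = B62[n % 62];
        n /= 62;
    }
    return out;
}
```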