I’ve written more code than you might guess. I once bid on writing a complete operating system for a Silicon Valley company. I estimated the project would take me ten weeks. It turned out to take … ten weeks.
I’m not sure what “edge cases” you’re worried about in something as trivial as base64. An input length that isn’t a multiple of three introduces a slight difficulty, and the base64 standard already handles it with a 65th character, “=”, used as padding.
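To make it concrete, here’s roughly the whole thing (a quick C++ sketch, dashed off in the spirit of man-seconds; treat it as an illustration, not a vetted library):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// The 64-character alphabet; '=' is the 65th character, used only as padding.
static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

std::string base64_encode(const unsigned char* data, size_t len) {
    std::string out;
    size_t i = 0;
    for (; i + 3 <= len; i += 3) {   // whole 3-byte groups -> 4 characters
        uint32_t n = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2];
        out += B64[(n >> 18) & 63];
        out += B64[(n >> 12) & 63];
        out += B64[(n >> 6) & 63];
        out += B64[n & 63];
    }
    size_t rem = len - i;            // the "slight difficulty": 1 or 2 bytes left
    if (rem > 0) {
        uint32_t n = data[i] << 16;
        if (rem == 2) n |= data[i + 1] << 8;
        out += B64[(n >> 18) & 63];
        out += B64[(n >> 12) & 63];
        out += (rem == 2) ? B64[(n >> 6) & 63] : '=';
        out += '=';
    }
    return out;
}
```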
If I were to estimate the man-power required to implement base64, I’d be tempted to use man-seconds rather than man-minutes as the basis. Admittedly, I write code faster than most.
BTW, one reason you might prefer not to use a library base64 implementation is that, IMO, some silly choices were made for the pad character and the two non-alphanumeric characters in the standard implementation. “+”, “=” and “/” need to be escaped in too many contexts. For instance, if you are transmitting the base64 string in a URL parameter, the “+” needs to be escaped to avoid being interpreted as a space, and the other two are supposed to be escaped whether or not you can actually get away with them in a particular query string implementation.
This is one reason there are a whole gaggle of variant implementations out there, changing those characters and using or omitting the trailing pad character. Of course, it becomes a toss-up whether you just implement it yourself with alternate characters, because it’s simple, or wrap the library version and remap the offending characters. And in a URL, you probably have a URL encoding routine anyway, which you can apply to the base64 output; that’s probably the better approach if URLs are all you’re worried about.
("_" which is an id character in many languages would have been a better choice. “.” already is used in numeric strings and will pass anywhere floating point numbers pass unescaped. Perhaps “-” for the other, although it looks too much like underscore when inspecting the string manually. The “+” would have been OK, except for the conflict with URL encoding, a primary place base64 was going to be used.)
I know this is GQ but here’s my opinion anyway: please, no one write their own implementation of a base64 encoder. Someone else may have to maintain your code some day. Or even just write new code that reads in the messages you produce. If you need to escape +, =, and /, use another standard library that does URL encoding and decoding.
These are not just theoretical concerns. I once had to debug code that would incorrectly truncate strings on semicolons and only semicolons. Every other special character worked fine.
I’ve also had to write converters to encode and decode the military’s VMF format, in which different messages use different numbers of bits (anywhere from 23 to 26) to represent latitude and longitude.
I know it’s tempting to be clever and come up with the most perfectly efficient solution to your problem, but program maintenance is orders of magnitude more expensive than hardware these days. Please use a standard library that implements a standard algorithm.
I actually chose ‘&’ as one of the two non-alphanumerics, even though some of this has to travel through XML :smack:.
But I fixed that in 9 seconds.
As I said, I looked for an OS-independent string-to-string C (or at least Objective-C and C#) solution and couldn’t find one. I took a well-tested encryption routine and wrapped enough around it to turn strings into ints and back into strings.
And I learned a lot about my original academic question.
Null-terminated strings are a C thing, not a C++ thing. std::string is a length-counted string. You can, of course, use the C standard library functions in C++, but only degenerates and masochists do this.
But even in C++, a literal string written into the source code is an array of const char with a \0 at the end (decaying to a const char * at the slightest provocation), and sometimes that matters, even when you’re using std::string for everything else.
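A two-line demonstration that the terminator is baked into the literal itself:

```cpp
#include <cstdio>

int main() {
    // "abc" is a const char[4]: three characters plus the trailing \0.
    static_assert(sizeof "abc" == 4, "the \\0 is counted");
    std::printf("%zu\n", sizeof "abc");   // prints 4
}
```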
In Java, string literals are of type java.lang.String, just like any other string, which avoids all of these problems.
That’s true, and it’s one of C++'s (many) failings. As it is, you can’t even initialize a string from a literal containing a \0 without giving its length explicitly:
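```cpp
#include <cassert>
#include <string>

int main() {
    std::string a("foo\0bar");     // const char* constructor: stops at the \0
    std::string b("foo\0bar", 7);  // counted constructor: keeps all 7 chars
    assert(a.size() == 3);
    assert(b.size() == 7);
}
```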
String literals in source should be expressions of type std::string const& (or probably just std::string with no qualifiers). There’s no excuse for it to be otherwise, and the only explanation is that Bjarne Stroustrup made the initial boneheaded decision to imitate (and even worse, seek compatibility with!) the most brain-dead language of his era, and that and countless other mistakes have now ossified into a 1500-page international standard.
…Despite all that, I actually do like C++; I’m not bitter or anything, I swear!
Encoding into and out of base-64 is about as straightforward as it gets in C. You’re much better off using it than any other method of trying to prevent 0 bytes.
Ultimately, if you need your output to be a legal ASCII string, then you’ll need to convert it to base-64 either way. If it doesn’t need to be a legal ASCII string, then a 0 byte isn’t any stranger than a value greater than 127. The only reason to make something legal ASCII is if you expect your string to be relayed through a variety of transport mechanisms, like e-mail, or dumped into a webpage; those might end up chomping characters, or helpfully converting them to something else due to encoding errors, unless you stick within the safe ASCII range.
But if you’re just having two software programs talk to one another, a zero byte is fine. C, C++, and C# can all deal with a zero byte in an array. You just avoid using string functions on the input until you have decoded the data.
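For instance (a minimal sketch; the point is just that the length travels alongside the buffer, so a string function is the only thing the zero byte can confuse):

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // A payload with an embedded zero byte, carried with an explicit length.
    const unsigned char payload[] = { 'h', 'i', 0, 'y', 'o' };
    std::vector<unsigned char> buf(payload, payload + sizeof payload);

    // strlen() stops at the zero; the vector knows the real length.
    std::printf("strlen sees %zu bytes, the buffer holds %zu\n",
                std::strlen(reinterpret_cast<const char*>(payload)),
                buf.size());   // prints: strlen sees 2 bytes, the buffer holds 5
    return 0;
}
```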
Often true, but sometimes something is easy enough to code up that it’s more effort to find a library with the right license and integrate it into your build system than to spend the five minutes it takes to write a base-64 encoder.
But, if you’re using Java or PHP or something else which already supports this in the standard library, you should use the standard library.
I think you might have gotten this backwards. Standardized base 64 using ‘+’ as an encoding character (RFC 1421) predates standardized URLs using ‘+’ as a query string word separator (RFC 1630). Despite its shortcomings, the RFC 1630 specification actually recommends using base 64 to encode binary data.
OK, but then I’ve always thought that the “+” to encode spaces in URLs was a silly choice, too, on similar grounds, and especially in light of that recommendation. Either way, it’s an unfortunate conflict.
Note, BTW, that Java SE didn’t have a Base64 implementation in the standard library until quite recently (1.8). An awful lot of Java coders wound up rolling their own, and including it in various jars of their own software. Me, too. Some of them, who were using it in an enclosed fashion that never surfaced the encoded string to a third party, probably succumbed to the temptation to twiddle with the special characters, or to drop the pad character and determine the number of bytes from the length of the encoded string, allowing a “short” block at the end. Java has had a URLEncoder since 1.0, though, so it’s easy enough to run the base64-encoded string through that.
So did a lot of people outside the Java world. See Wikipedia’s listing of variants.
BTW, that lists RFC 1421 as “deprecated”, replaced by several others for various contexts.
You know, you could actually put the resulting uints into base 62 instead of base 64, because a 6-digit base-62 number is still bigger than the maximum unsigned 32-bit integer (62^6 = 56,800,235,584, comfortably more than 2^32 = 4,294,967,296).
Then one could avoid the two extra non-alphanumerics altogether.
But I guess a lot of people stick with base 64 because one can do it by bit-shifting, rather than having to mod and divide by the base.
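A sketch of the base-62 version, for comparison (a hypothetical base62_encode for a single 32-bit value; six mod-and-divide steps where base 64 would use shifts and masks):

```cpp
#include <cstdint>
#include <string>

// Sketch: encode one 32-bit unsigned value as exactly six base-62 digits.
// 62^6 = 56,800,235,584 > 2^32, so six digits always suffice; the price is
// a mod and a divide per digit instead of a shift and a mask.
static const char B62[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

std::string base62_encode(uint32_t n) {
    std::string out(6, '0');
    for (int i = 5; i >= 0; --i) {   // fill least-significant digit last
        out[i] = B62[n % 62];
        n /= 62;
    }
    return out;
}
```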