Computer question: What is a binary vs ASCII file?

If we’re talking about ancient DOS or CP/M PIP (remember PIP?) copying commands, whether a file was treated as text or binary could be important. Copying can give you LOTS of data that isn’t SIGNIFICANT to an application but is still stored in the file. That is typically the case for a text file (we’re talking old-school here), where data after the EOF marker (hex zero, Control-Z, or whatever) may be written to disk and read back from disk, but was never intended to be used by the text-editing or displaying application.

Early OSes did not store a file’s true size, only the number of whole blocks it occupied. Text apps usually read the data sequentially and needed to know exactly, to the byte, where to stop, and the EOF marker is what told them.

Why do you think the DOS COPY command had /a and /b parameters?

Let me illustrate…

Here’s a sample text string:

“Now is the time for all good men to come to.^Z”

That’s 45 characters, including spaces and the EOF marker. Let’s assume the minimum data block on our theoretical hard drive is 64 characters (ignoring headers, checksums, etc.). This means the copy function will copy a block of data starting at “Now is the time…”, continue through “…come to.^Z”, and then keep going in memory for 19 more characters.

What are those 19 characters? Not significant to the text editor. Probably junk in memory. The actual 64 character memory block might look like this, using an ASCII display:

“Now is the time for all good men to come to.^Z4kd @$%sle 76kek 10”

And that is what will be copied to the disk.

Now let’s read from the disk. Here’s what will be copied into working RAM:

“Now is the time for all good men to come to.^Z4kd @$%sle 76kek 10”

Note the junk data that isn’t part of the original text file. Copying this doesn’t mean the OS is broken; it’s just the way things work.

The text editor will use the ^Z EOF to determine where the significant data ends. As far as the copy program is concerned, it doesn’t end there, which is why we make the distinction between binary and ASCII copying.
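
To make the two views concrete, here’s a minimal C sketch of that 64-character block (the junk bytes are just the made-up ones from the example): a text-oriented reader stops at the ^Z (hex 1A) marker, while a block/binary copy takes all 64 bytes at face value.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* A 64-byte "disk block": 45 bytes of real text (including the ^Z
       marker, written here as \x1A) followed by 19 bytes of leftover
       junk, as in the example above. */
    const char block[64 + 1] =
        "Now is the time for all good men to come to.\x1A"
        "4kd @$%sle 76kek 10";

    /* "ASCII"/text view: significant data ends at the ^Z marker. */
    const char *mark = memchr(block, 0x1A, 64);
    size_t text_len = mark ? (size_t)(mark - block) : 64;

    /* "Binary" view: all 64 bytes of the block are taken at face value. */
    size_t binary_len = 64;

    printf("text view  : %zu bytes of significant data (stops at ^Z)\n",
           text_len);
    printf("binary view: %zu bytes (the whole block)\n", binary_len);
    return 0;
}
```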

Musicat: OK, we apparently weren’t talking about quite the same thing, so I’ll summarize to make sure we’re on the same page.

These days, ASCII vs Binary is about end-of-line characters. That’s it. Modern OSes always know where a file ends and won’t give you extra data past the end of a file. By ‘modern’ I mean ‘MS-DOS 2.0 or newer’, since even most versions of MS-DOS were smart enough to know how big files were.

Specifically, the difference is about converting between lines ending with just a line feed character and lines ending with some other character or pair of characters. Programs written in many programming languages assume that lines end with just a line feed; on the OSes where this is incorrect, opening a file in text mode as opposed to binary mode tells the library implementation to convert between line-feed-only style and the actual local style.

Obviously, subjecting a file which isn’t composed of lines of text to this conversion will likely damage it greatly, which is why you need a ‘binary’ mode to open picture or audio files, for example; ‘binary’ just means “don’t do the end-of-line conversions”. The OS isn’t involved in this most of the time. Most OSes are pleased to treat every file as a binary file and not subject file contents to any sort of interpretation. And, of course, on the OSes where line-feed-only is the dominant faith, there is no difference between ASCII and binary. Wikipedia has exhaustively more information on line ending conventions and conversions, as usual.
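
For the standard C library this is literally the difference between fopen modes "r" and "rb". Here’s a small sketch (the file name is a placeholder); on Windows the text-mode byte count comes out smaller for a CRLF file because each "\r\n" is folded into a single '\n', while on a LF-only system the two counts match.

```c
#include <stdio.h>

/* Count the bytes delivered to the program when a file is opened in
   the given mode ("r" = text, "rb" = binary). */
static long count_bytes(const char *path, const char *mode) {
    FILE *fp = fopen(path, mode);
    if (!fp) return -1;
    long n = 0;
    while (fgetc(fp) != EOF)
        n++;
    fclose(fp);
    return n;
}

int main(void) {
    /* "notes.txt" is just a placeholder name for some CRLF text file. */
    printf("text mode  : %ld bytes\n", count_bytes("notes.txt", "r"));
    printf("binary mode: %ld bytes\n", count_bytes("notes.txt", "rb"));
    return 0;
}
```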

CP/M was a fairly nice OS and better than most of the alternatives on home computers at the time. OK, not entirely germane, but it needs to be said. It was primitive because the hardware was primitive; it didn’t have directories because hard drives were rare and you can manage floppies without them; it was replaced by MS-DOS because Kildall wasn’t as good a businessman as Bill Gates.

^Z

It should be noted that ‘ASCII’ as we’re using it here even applies to non-ASCII text files. For example, a Unicode text file encoded in UTF-8 will still need line-ending conversion done on certain operating systems; UTF-8 is based on ASCII, but it contains many tens of thousands of characters which aren’t in ASCII.
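
That works because UTF-8’s multi-byte sequences only use bytes with the top bit set, so the ASCII line-ending bytes can never appear inside a non-ASCII character. A tiny illustration (the sample string is arbitrary):

```c
#include <stdio.h>

int main(void) {
    /* One UTF-8 line: "café" followed by a CRLF line ending.
       The é is the two bytes 0xC3 0xA9; every byte of a multi-byte
       UTF-8 sequence has its top bit set, so the plain ASCII bytes
       0x0D ('\r') and 0x0A ('\n') can only ever mean a line ending.
       That's why the same text-mode conversion still works on UTF-8. */
    const unsigned char line[] = { 'c', 'a', 'f', 0xC3, 0xA9, '\r', '\n', 0 };

    for (const unsigned char *p = line; *p; p++)
        printf("%02X ", *p);
    printf("\n");
    return 0;
}
```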

1 - The reason for ASCII vs binary for DOS is different from the reason for FTP.

2 - The reason for the modes for FTP has to do with translating character sets and record structures between Unix, mainframes, etc.

3 - The reason for the modes for DOS was to be able to combine/append text files by stripping the EOF and then adding one back once the operation was complete (sketched below). Binary was the default, and in that case it just copied the bytes of the file based on its actual size.
Having extra chars in a data block is just a fact of life that OSes have to deal with, and they all deal with it for all I/O. The user doesn’t need to tell the OS to get it right.
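
Here’s a rough C sketch of what that /a-style append boils down to (file names are invented, and the only special handling shown is the ^Z marker): copy each input up to its ^Z, if any, then write a single ^Z at the end of the combined file.

```c
#include <stdio.h>

/* Append the text contents of `src` to the already-open `out`,
   stopping at an embedded ^Z (0x1A) marker if there is one --
   roughly what COPY's /a mode does for each input file. */
static void append_text(FILE *out, const char *src) {
    FILE *in = fopen(src, "rb");
    if (!in) return;
    int c;
    while ((c = fgetc(in)) != EOF && c != 0x1A)
        fputc(c, out);
    fclose(in);
}

int main(void) {
    /* File names here are placeholders. */
    FILE *out = fopen("combined.txt", "wb");
    if (!out) return 1;
    append_text(out, "part1.txt");
    append_text(out, "part2.txt");
    fputc(0x1A, out);   /* add a single EOF marker back at the end */
    fclose(out);
    return 0;
}
```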

This is essentially correct: sort of an oddball special case of the line-ending handling, for files that had embedded EOF marks. For anybody who’s curious about this, here’s the documentation for the COPY command. But it’s not likely to be of much interest unless you’re playing “pre-1984 OS trivia”, and even then typical use was not to have files extend beyond the EOF mark (or even have an EOF mark at all). It still has nothing to do with the FTP use of the distinction, the modern use of the distinction, or even most pre-modern files.

I’m having trouble finding DOS 1.0 file structure documentation, but certainly by DOS 3.0 in 1984 the file catalog knew the number of bytes in a file, not just the number of blocks, and did not require an embedded EOF mark.

The “32” in “32-bit” refers to the so-called “word size” of the system. At the physical processor/CPU level, the word size represents the size of the processor’s registers and thus the largest amount of data that the processor can treat as a single unit during the execution of a single machine language operation. E.g. the fundamental “ADD” operation in the CPU can only add numbers that fit within the word size - anything larger and the computer must perform multiple operations and juggle intermediate results around in memory, or else the result gets truncated or wraps around. Another ramification of this is that bigger word sizes allow one to build computers with more memory, as computers generally track memory locations (e.g. in RAM, or on a disk) as a single word - a 32-bit system can track 2^32 memory locations, while a 64-bit system can track 2^64 of them, which is about four billion times as many. Word size is one of the fundamental reasons behind the IP address shortage on the Internet that has received media exposure in the past few years.
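
To put actual numbers on that comparison (this is just the arithmetic, printed out):

```c
#include <stdio.h>

int main(void) {
    /* Addressable locations for 32-bit vs 64-bit words.  long double is
       used so that 2^64, which does not fit in a 64-bit integer, can be
       printed too; both values are exact powers of two. */
    long double addrs32 = 4294967296.0L;            /* 2^32 */
    long double addrs64 = 18446744073709551616.0L;  /* 2^64 */

    printf("32-bit word: %.0Lf addressable locations\n", addrs32);
    printf("64-bit word: %.0Lf addressable locations\n", addrs64);
    printf("ratio      : %.0Lf times as many (i.e. 2^32, about 4.3 billion)\n",
           addrs64 / addrs32);
    return 0;
}
```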

This is the best answer so far. One is to handle differences between text files on different OSes; the other is to handle differences between different types of files on some OSes (I believe only on obsolete ones that don’t keep accurate file size information for binary files).

robert_columbia answered this above. However, there are other cases. I worked on some machines that had 12-bit words, and others that had 36-bit words. The bottom line with these is that ASCII text files could be copied to other machines, whereas binary ones could not, in general – at least, not without some special-purpose format on the byte-oriented machines, that special purpose being to carry 12- or 36-bit binary files.

Likewise, IIRC, the 36-bit machines had special formats for carrying byte-oriented binary data from other machines. I don’t know whether that happened back in the 12-bit days.

In addition to the end-of-line and other transformations for ASCII files, there are other possible issues. For example, ASCII was originally a 7-bit code; the top bit was either a parity bit or was ignored (and was usually stored as zero). An ASCII transfer protocol was allowed to not preserve this top bit. This is probably no longer true, in order to support extended ASCII, Unicode, etc.
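
The “top bit may not survive” behaviour amounts to masking each byte down to 7 bits, something like this (an illustrative sketch, not any particular protocol’s code):

```c
#include <stdio.h>

int main(void) {
    /* An ASCII transfer that ignores the 8th bit effectively does this
       to every byte: keep the low 7 bits, discard the parity/top bit. */
    unsigned char received = 0xC1;            /* 'A' (0x41) with top bit set */
    unsigned char stripped = received & 0x7F; /* back to plain 7-bit ASCII   */

    printf("received 0x%02X -> stripped 0x%02X ('%c')\n",
           received, stripped, stripped);
    return 0;
}
```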

ASCII files were stored oddly on the 36-bit machines, packing five 7-bit ASCII characters per 36-bit word (and making byte-addressing of those characters a pain in the behind!)
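
For the curious, here’s a sketch of that packing scheme, using a 64-bit integer to stand in for the 36-bit word; the layout (five 7-bit characters left-justified, one spare low bit) follows the usual PDP-10 convention, but treat the details as illustrative:

```c
#include <stdio.h>
#include <stdint.h>

/* Pack five 7-bit ASCII characters into one 36-bit word, left-justified,
   with the lowest bit left unused (the usual PDP-10 layout).  A uint64_t
   stands in for the 36-bit word; only its low 36 bits are meaningful. */
static uint64_t pack5(const char *c) {
    uint64_t word = 0;
    for (int i = 0; i < 5; i++)
        word |= (uint64_t)(c[i] & 0x7F) << (29 - 7 * i);
    return word;
}

/* Pull character i (0..4) back out of a packed word -- note there is no
   simple byte address for it, which is the "pain in the behind" part. */
static char unpack(uint64_t word, int i) {
    return (char)((word >> (29 - 7 * i)) & 0x7F);
}

int main(void) {
    uint64_t w = pack5("HELLO");
    printf("packed 36-bit word (octal): %012llo\n", (unsigned long long)w);
    for (int i = 0; i < 5; i++)
        putchar(unpack(w, i));
    putchar('\n');
    return 0;
}
```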

I don’t miss the bad old days of strange word lengths one bit.

This might be understating it a tiny bit :slight_smile:

LOL – yeah, understatement of the day! (Do I mean Robert’s understatement or yours? :wink: )