Converting ASCII .txt to something I can work with on a Mac.

I have to work with some transcripts prepared by a court reporter. They are .txt, and he tells me they are somehow “ASCII” related. All I know is that they are a NIGHTMARE to try and work with, change, convert… he does everything in all caps, horrible font, weird line breaks. I would KILL to find a relatively simple way to just suck the words (which are on numbered lines that I do need as well) out and pour them into a reasonable document that I can clean up, starting with changing from all caps to normal case!

I searched Versiontracker, the only things that came up were about eleventy years out of date.

Help!

um, well, the standard way to deal with strange text formats is to first convert them to plain ASCII text files. Basically, an ASCII text file is a generic format that should contain nothing but the text itself.

You mention weird fonts - if it’s truly ASCII, there shouldn’t be any font info in the file. Is it just a plain Courier font? Is there more than one font? If there is, it’s probably not really an ASCII file.

Weird line breaks and all caps can be fixed easily with a good text editor. I like UltraEdit on the PC; unfortunately, I’m not sure what’s out there for Macs.

My favorite text editor on the Mac is BBEdit or its free little brother TextWrangler. Either one will do the job for you. Both offer all kinds of text formatting options as well as GREP for search and replace. On a Mac, you could probably do everything you need to do from the command line with standard Unix tools.
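
For instance (just a sketch, and “transcript.txt” is a made-up name - substitute your real file), the all-caps problem can be knocked flat in one shot from Terminal:

   tr '[:upper:]' '[:lower:]' < transcript.txt > transcript-lower.txt

That leaves everything lowercase with the line numbers untouched; you’d still use the editor’s case commands (or the sentence-case trick mentioned further down) to re-capitalize sentence starts.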

Another neat and inexpensive little text editor for the Macintosh is Tex-Edit Plus ($15.00).

I second Athena’s implication that something is weird here, at least with regards to the fonts and the all-caps. The built-in application TextEdit should have no problem whatsoever with a plain ASCII .txt file. You shouldn’t see any caps that weren’t there in the original, and there shouldn’t be any font at all except for the default TextEdit font (which you can adjust yourself in the Format menu).

Line breaks are another matter. Different OSes use different invisible characters to indicate line breaks. I’ve never had a problem even here using TextEdit, so I assumed that it took care of that in the background. But, if it doesn’t, download the free TextWrangler. It lets you switch between line endings for the different OSes. Just save a new copy of the file and click the Options… button in the Save dialogue box. There’ll be a drop-down menu to select Mac, Unix, or DOS line breaks.

I think the main difference between Mac and PC when dealing with ASCII text is that, IIRC, *nix-based systems represented line breaks with the CR/LF pair (carriage return + line feed, ASCII values 013 and 010) while other platforms were content with just a single CR character. Or maybe that’s the other way round, and possibly with one it was a single LF instead of a single CR; it’s been ages since I’ve had to deliberately deal with this stuff and I’m having trouble knocking a proper recollection loose from my brain’s dusty annals.
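
If you’d rather not trust anyone’s memory (mine included), Terminal can simply tell you which convention a file uses. Assuming the file is called transcript.txt (substitute the real name):

   file transcript.txt

should report something like “ASCII text, with CRLF line terminators” (or “with CR line terminators”, or just plain “ASCII text” for ordinary Unix LF endings).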

But as has been said, plain ASCII text files contain no information other than the text itself: no font information, no markup, nothing else of the sort, just straight-up plain text. If there are fonts and stuff involved here, you’re dealing with something else, possibly a rich-text (RTF) document or something.

Do you know what app he’s using to create the documents? That’ll help sort things out.

Maybe run it through Word on a PC first. There’s an open file option to strip just the text out of a file. Don’t know if that’s an option on the Mac version.

I’d also be interested in the application creating the file - something a court reporter uses, perhaps? It might not really be an ASCII file, or it could be an ASCII file with some non-printing but legitimate ASCII characters stuck in. For instance, code 007 (BEL) was put in to ring the bell on a Teletype - not too useful any more. Some of these might be causing odd behavior in whatever app you’re using to look at the file.
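
If junk like that does turn up, here’s a rough guess at a one-liner that would strip every non-printing character except tabs and line breaks (the file names are invented; test on a copy first):

   tr -cd '[:print:]\t\n\r' < original.txt > stripped.txt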

I know UNIX commands to see what is really in a file, but I don’t know what a Mac would have.

Macs are running BSD Unix these days, so the odds are high that Macs have the command you are thinking of.

Saying that ASCII files do not have formatting in them is not strictly correct. ASCII is just a way of encoding characters into byte values. A file can have formatting directives that are themselves just ASCII characters. See nroff for example.

That said, a good text editor will just allow you to select the text and change the case en masse. Look for one that supports “sentence case”; that will capitalize the first letter of each sentence.

TextWrangler does. I just checked. It will capitalize sentences, words or lines.
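
If you’d rather do it outside an editor, here’s a rough Terminal sketch (untested against your actual transcripts, and the file name is invented) that lowercases everything and then re-capitalizes the first letter after a period, question mark, or exclamation point, leaving the line numbers alone:

   perl -pe '$_ = lc $_; s/(^\s*|[.!?]\s+)([a-z])/$1\u$2/g' transcript.txt > sentence-case.txt

It won’t restore capitals on proper names, so you’d still do a cleanup pass, but it gets you out of ALL CAPS in one go.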

Unix-based systems use LF. PC systems use CR/LF. Old Macs used CR, but now that Macs are Unix-based, they use LF. The built-in Mac OS X text (i.e., ASCII) editor is TextEdit, which understands all three conventions.

The fonts tell me the files aren’t really ASCII at all. ASCII contains no font information, so the file should display in only one font, with no bolding or underlining or italics or size changes.

Aside from that, though, some files are just badly formatted. Emacs contains code to reformat files but learning Emacs just for this task seems a bit much. I’m sure other, lesser, editors have similar functionality as well, but I don’t know any of them well enough to comment.

I find this needlessly insulting. Everyone knows Emacs is a whole damn operating system, not an “editor.” Sheesh.

If it’s just ASCII text, it should open in TextEdit, the system text editor on OS X, without a problem. I also have TextMate, which I paid for. Send one of the files to me, and I’ll see whether it opens oddly there.

I feel your pain, but alas! despite what others have said here, I doubt anyone could help you much without at least part of the file to look at. I think, from your description, that you do not have enough of a background in computers to fix the problem yourself, and the court reporter probably doesn’t either. In this situation, it’s rather like trying to both diagnose and fix your car by looking at a picture of the car and listening to a recording of it running!

In other words, not easy.

Here’s my educated guess:

The court reporter recorded the original notes in a computer system with its own file format. He then output these files naming them “.txt”. They may or may not contain plain text without formatting; it is hard to know without looking at the original file.

You then opened these files on a Macintosh using an editor that you do not mention. This editor may expect a particular file format, or may not. Again, without knowing which editor you used, I cannot say.

This editor may be interpreting characters in the file as “formatting” characters, and thus display the file weirdly, with capitalization, fonts, and other info that is really not there.

So, you may have files that are perfectly fine, but you aren’t looking at them the right way.

Let me clarify.

Microsoft Word is a well-known document editor for both PC and Mac. You can format with fonts, put in line breaks and tables, and so forth. This formatting is stored in a Word file (.doc) as special characters that do not show up on your screen and do not print, yet they are there in the file itself.

If I were to edit a .doc file in another editor that knew nothing of Word, I would see strange behavior. The editor might treat the special characters as formatting, or might assume that they are text. I don’t think this second editor would give you the same result that Word would.

Conversely, with a bit of thinking I could make a file out of characters plus other “stuff” that, if I named it “.doc”, would display strangely in Word.

Remember that to a computer, everything in a file is, in the end, 0s and 1s. By convention, each group of 8 0s and 1s is called a byte. An editor program assumes that each byte represents a character in a “code” (this is a simplification that really only holds for languages written in a Roman alphabet). A byte can represent 256 different values, so you have a limit of 256 characters.

Almost all programs across all operating systems understand the ASCII coding scheme (strictly speaking, ASCII defines only 128 codes; the other 128 values in a byte vary between “extended” character sets). Most of the codes have a printable character associated with them, although the first 32 are old-style control codes that don’t print at all. An editor program should be able to read these codes out of the file and display them. The issue, though, is whether or not the program has the right fonts to do this.

Sigh. Even if a program understands all 256 codes, it has to be able to map them to something you understand. It has to map the ASCII code for lowercase “a” to something in a font file that will put lowercase “a” on the screen and also on a printer.

All the “.txt” ending on the file says is “interpret this file as a set of ASCII characters”. Put all 0s into a .txt file and the editor will open it and display weird characters, because the ASCII code made of eight 0s (NUL) maps to a character that usually does not display in any font.
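
You can see this byte-level view for yourself in Terminal; the text here is just an illustration:

   printf 'ASCII\0demo\n' | od -An -td1

prints something like 65 83 67 73 73 0 100 101 109 111 10 - the codes for the letters, a zero (NUL) byte in the middle that no ordinary font displays, and the line-feed character (10) at the end.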

This only scrapes the surface of what is going on. A good SW engineer or tech writer could look at the file and figure out what’s wrong, I am sure. He or she might even be able to fix it automatically. I’m a tech writer and I’ve done stuff like this before.

Alas, I have no way of contacting you, or you me.

It’s octal dump:
old -ta <file> should show if there are any problems, especially if you only look at a small chunk. However, I fear our OP will be freaked out by the output of this command, probably having no experience in reading even small dumps.

I opened a console prompt and tried the ‘old’ command and didn’t get anything. However, googling revealed a hexdump command that seems to be quite flexible.
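
For what it’s worth, the invocation I ended up using (your file name will differ) was:

   hexdump -C transcript.txt | head -20

The -C flag prints the hex byte values on the left and the matching printable characters on the right, so any stray control codes stand out as dots.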

Hexdump is for those uppity youngins what got them new-fangled “Inter-Net” connections and some zippity-do-dah bitmap graphics device, not a 110 baud teletype display connected via RS-232 to a cantankerous PDP-8.

Real hackers use ‘od’.

Which is a hardlink to the same executable, but you save five whole keystrokes!

Unix uses LF (AKA newline or NL). Old Macs use CR. PCs use CR/LF. Aren’t standards wonderful?
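
And if you ever need to beat a file from one convention into another without opening an editor, a rough Terminal sketch (file names invented):

   tr -d '\r' < pc-file.txt > unix-file.txt        # CR/LF (PC) to LF (Unix)
   tr '\r' '\n' < oldmac-file.txt > unix-file.txt  # CR (old Mac) to LF (Unix)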