RTF file conversion to plain text question

This one has me stumped at the moment. Maybe someone here knows the answer.

I have a large document in RTF format. I want it in plain text, but I want the text formatting preserved in some way: e.g. italic text becomes /italic text/ and bold text becomes *bold text*.

The convertor programs I have tried don’t preserve the formatting information at all; it is simply stripped out entirely.

Find and replace in Word allows me to search for the formatted text and replace the formatting, but I need to translate the formatting into new characters.

Anyone know how to do this?

If you open an RTF file directly into a plain text editor such as Notepad, you will be able to see the control codes. I just tried it with a file containing

This is a test RTF file – containing items in **bold** and *italic* - also ***bold italic*** – that is all.

Opened in plain text, it looks like this:


{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\froman\fprq2\fcharset0 Times New Roman;}{\f1\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs20 This is a test RTF file \endash  containing items in \b bold\b0  and \i italic\i0  - also \b\i bold italic\b0\i0  \endash  that is all.\f1\par
}

It looks like a fairly simple set of search-and-replace operations could achieve the effect you desire.
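As a rough illustration of that search-and-replace idea, here is a minimal Python sketch. It assumes you have already cut the body text out of the RTF header, and that formatting is switched with flat toggles (`\b` ... `\b0`) as in the dump above. The function name and the choice of `*`, `/` and `_` as markers are my own, not anything standard.

```python
import re

# Toggles we care about: the same marker is written for "on" and "off".
MARKERS = {"b": "*", "i": "/", "ul": "_", "ulnone": "_"}
# Control words that stand for actual text.
TEXT = {"endash": "-", "emdash": "--", "par": "\n"}

# A control word is a backslash, letters, an optional number,
# and at most one trailing space that belongs to the control word.
CONTROL = re.compile(r"\\([a-z]+)(-?\d+)? ?")

def replace_control(match):
    word = match.group(1)
    if word in MARKERS:
        return MARKERS[word]
    # Anything unrecognised (\f1, \fs20, ...) is simply dropped.
    return TEXT.get(word, "")

def flat_rtf_to_ascii(body):
    """Convert a header-free RTF body that uses flat toggles."""
    return CONTROL.sub(replace_control, body)

sample = (r"This is a test RTF file \endash  containing items in \b bold\b0 "
          r" and \i italic\i0  - also \b\i bold italic\b0\i0  \endash "
          r" that is all.\f1\par")
print(flat_rtf_to_ascii(sample))
# prints: This is a test RTF file - containing items in *bold* and /italic/ - also */bold italic*/ - that is all.
```

Note that the closing markers for bold italic come out as `*/` rather than `/*`, because each control word is replaced independently of the others.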
However, it may depend on the program used to create the RTF file: the same content, saved as RTF from Microsoft [del]Bloat[/del] Word, looks like this (in plain text):


{\rtf1\ansi\ansicpg1252\uc1 \deff0\deflang1033\deflangfe1033{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}{\f29\froman\fcharset238\fprq2 Times New Roman CE;}{\f30\froman\fcharset204\fprq2 Times New Roman Cyr;}
{\f32\froman\fcharset161\fprq2 Times New Roman Greek;}{\f33\froman\fcharset162\fprq2 Times New Roman Tur;}{\f34\froman\fcharset177\fprq2 Times New Roman (Hebrew);}{\f35\froman\fcharset178\fprq2 Times New Roman (Arabic);}
{\f36\froman\fcharset186\fprq2 Times New Roman Baltic;}}{\colortbl;\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blue255;\red0\green255\blue0;\red255\green0\blue255;\red255\green0\blue0;\red255\green255\blue0;\red255\green255\blue255;
\red0\green0\blue128;\red0\green128\blue128;\red0\green128\blue0;\red128\green0\blue128;\red128\green0\blue0;\red128\green128\blue0;\red128\green128\blue128;\red192\green192\blue192;}{\stylesheet{
\ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 \fs24\lang2057\langfe1033\cgrid\langnp2057\langfenp1033 \snext0 Normal;}{\*\cs10 \additive Default Paragraph Font;}}{\info
{\title This is a test RTF file \'96 containing items in bold and italic - also bold italic \'96 that is all}{\author Mike}{\operator Mike}{\creatim\yr2008\mo7\dy14\hr15\min13}{\revtim\yr2008\mo7\dy14\hr15\min13}{\version1}{\edmins0}{\nofpages1}
{\nofwords0}{\nofchars0}{\nofcharsws0}{\vern8243}}\paperw5710\paperh3056\margl567\margr567\margt567\margb567 \widowctrl\ftnbj\aenddoc\noxlattoyen\expshrtn\noultrlspc\dntblnsbdb\nospaceforul\formshade\horzdoc\dgmargin\dghspace120\dgvspace180\dghorigin1701
\dgvorigin1984\dghshow2\dgvshow2\jexpand\viewkind4\viewscale100\pgbrdrhead\pgbrdrfoot\splytwnine\ftnlytwnine\htmautsp\nolnhtadjtbl\useltbaln\alntblind\lytcalctblwd\lyttblrtgr\lnbrkrule \fet0\sectd 
\lndscpsxn\psz136\linex0\headery709\footery709\colsx708\endnhere\sectlinegrid360\sectdefaultcl {\*\pnseclvl1\pnucrm\pnstart1\pnindent720\pnhang{\pntxta .}}{\*\pnseclvl2\pnucltr\pnstart1\pnindent720\pnhang{\pntxta .}}{\*\pnseclvl3
\pndec\pnstart1\pnindent720\pnhang{\pntxta .}}{\*\pnseclvl4\pnlcltr\pnstart1\pnindent720\pnhang{\pntxta )}}{\*\pnseclvl5\pndec\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl6\pnlcltr\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}
{\*\pnseclvl7\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl8\pnlcltr\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl9\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}\pard\plain 
\ql \li0\ri0\widctlpar\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 \fs24\lang2057\langfe1033\cgrid\langnp2057\langfenp1033 {\fs20\lang1033\langfe1033\langnp1033 This is a test RTF file \endash  containing items in }{
\b\fs20\lang1033\langfe1033\langnp1033 bold}{\fs20\lang1033\langfe1033\langnp1033  and }{\i\fs20\lang1033\langfe1033\langnp1033 italic}{\fs20\lang1033\langfe1033\langnp1033  - also }{\b\i\fs20\lang1033\langfe1033\langnp1033 bold italic}{
\fs20\lang1033\langfe1033\langnp1033  \endash  that is all.}{
\par }}
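Word's output scopes the formatting inside `{...}` groups instead of using `\b` ... `\b0` toggles, and it carries a large header, so a flat replace would not be enough here. Below is a hedged Python sketch of a group-aware pass: it keeps a stack of formatting states, skips the header groups, and writes a marker whenever the state changes. The names, the `*` and `/` markers, and the cp1252 decoding of `\'xx` escapes are assumptions for illustration; it ignores `\u` Unicode escapes, tables, pictures and much else in real RTF.

```python
import re

MARKS = {"b": "*", "i": "/"}                      # assumed ASCII markers
SKIP_GROUPS = {"fonttbl", "colortbl", "stylesheet", "info"}
SYMBOLS = {"endash": "-", "emdash": "--", "par": "\n", "tab": "\t"}

# control word | \'xx hex escape | control symbol | brace | plain text
TOKEN = re.compile(
    r"\\([a-z]+)(-?\d+)? ?|\\'([0-9a-fA-F]{2})|\\(.)|([{}])|([^\\{}\r\n]+)"
)

def sync_markers(out, shown, wanted):
    """Close markers that switched off (innermost first), then open new ones."""
    for key in reversed(list(MARKS)):
        if shown[key] and not wanted[key]:
            out.append(MARKS[key])
            shown[key] = False
    for key in MARKS:
        if wanted[key] and not shown[key]:
            out.append(MARKS[key])
            shown[key] = True

def rtf_to_ascii(rtf):
    out, stack = [], []
    state = {"b": False, "i": False, "skip": False}
    shown = {"b": False, "i": False}              # markers open in the output
    for m in TOKEN.finditer(rtf):
        word, param, hexbyte, symbol, brace, text = m.groups()
        if brace == "{":
            stack.append(dict(state))             # formatting is scoped to the group
        elif brace == "}":
            if stack:
                state = stack.pop()
        elif word in MARKS:
            state[word] = param != "0"            # \b turns on, \b0 turns off
        elif word == "plain":
            state["b"] = state["i"] = False
        elif word in SKIP_GROUPS or symbol == "*":
            state["skip"] = True                  # header or ignorable destination
        else:
            if word in SYMBOLS:
                text = SYMBOLS[word]
            elif hexbyte:
                text = bytes.fromhex(hexbyte).decode("cp1252", "replace")
            elif symbol in ("{", "}", "\\"):
                text = symbol                     # escaped literal character
            if text and not state["skip"]:
                sync_markers(out, shown, state)
                out.append(text)
    sync_markers(out, shown, {"b": False, "i": False})
    return "".join(out)

sample = (r"{\rtf1\ansi{\fonttbl{\f0\froman Times New Roman;}}{\info{\title Test}}"
          r"\pard\plain {\fs20 items in }{\b\fs20 bold}{\fs20  and }{\i\fs20 italic}"
          r"{\fs20  - also }{\b\i\fs20 bold italic}{\fs20  \endash  that is all.}{\par }}")
print(rtf_to_ascii(sample))
# prints: items in *bold* and /italic/ - also */bold italic/* - that is all.
```

Because markers are only written when text is about to be emitted, a group that sets bold but contains no text leaves no stray `*` behind.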

I think what you want is an RTF to “Tagged Text” converter. There are lots of RTF to HTML converters out there, and HTML is tagged, so these might be a step in the right direction. It would help if you described what you want to do with the tagged text.

To me, “plain text” and “preserve the formatting” are mutually exclusive. A pure ASCII text file contains no characters for bold or italic, just the chars in the ASCII character set.

And to make matters worse, the conversion may or may not insert or preserve CRLF (Carriage Return and/or Line Feed chars). Which may or may not matter in the long run.

I think you need to specify what your eventual end use is: HTML or whatever.

Along the same lines as beowulff’s idea, I was dropping in to suggest a two-step translation, though I was thinking of LaTeX for the intermediate format. HTML is probably better though, being more familiar to most everyone.
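For the second step of that two-step route, here is a small Python sketch using only the standard library's `html.parser`. It assumes some RTF to HTML converter has already produced simple HTML with `<b>`, `<i>` and `<u>` style tags; the class name, function name and marker characters are made up for the example.

```python
from html.parser import HTMLParser

# Assumed mapping from HTML tags to ASCII markers.
TAG_MARKS = {"b": "*", "strong": "*", "i": "/", "em": "/", "u": "_"}

class AsciiMarkup(HTMLParser):
    """Collect text, replacing formatting tags with ASCII markers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_MARKS:
            self.parts.append(TAG_MARKS[tag])
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in TAG_MARKS:
            self.parts.append(TAG_MARKS[tag])
        elif tag == "p":
            self.parts.append("\n\n")   # blank line between paragraphs

    def handle_data(self, data):
        self.parts.append(data)

def html_to_ascii(html):
    parser = AsciiMarkup()
    parser.feed(html)
    parser.close()                      # flush any buffered text
    return "".join(parser.parts)

print(html_to_ascii(
    "<p>in <b>bold</b> and <i>italic</i>, also <b><i>both</i></b>.</p>"
))
```

All other tags are dropped and their text is kept, so anything the converter wraps in `<span>` or `<font>` passes through as plain text.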

Do you want only certain tags (e.g., bold and italic, but not others) preserved? If so, it’s likely you’ll have to write a custom translator…which isn’t too hard (just incredibly tedious) if you’re familiar with regular expressions. A custom translator is a necessity if you’re making up your own tags…

Thanks, Mangetout. Doing a search and replace like that would mean using regular expressions, and then I’d have two problems (as the saying goes).

I will look into tagged text convertors first. Thanks for the tip, beowulff.

The purpose of all this is to read eBooks on my PDA. There are ways I can read RTF and HTML using something like Plucker, but I frankly hate that program. I’d rather use Palmfiction, having gone through all the pain to get it working how I like it.

Palmfiction reads RTF but doesn’t display the formatting. I want something that will put the formatting into the text in ASCII; /italic/, *bold* and _underline_ would be enough really.

Thanks for all the help so far.

A quick google led me to this blog entry about writing your own RTF to HTML program (code included there). Not that I’m suggesting you do that, but rather posting it because it looks like there are some good informational links there.

I am not familiar with PDA programming or capabilities. From your description, it seems like it reads simple text files. The problem is that if you want to use ANY formatting, even just italics or bolding, you can’t use a true, simple ASCII text file without some proprietary embedded codes.

My suggestion would be to investigate what kinds of files your PDA will read and try to work with that framework before trying to do something fancy or write new routines.