Computer question. Converting rtf files to text

GSV_Consolation_of_Dreams · February 6, 2010, 3:44pm

I hope someone here can help me out with this one. I’ve searched high and low for a way to do this myself, but so far Google is not turning up anything.

I need to convert a .rtf file to standard plain text, but I need to preserve some of the formatting, i.e. italicized words in the rtf should come out looking _like this_ or /like this/.

Any ideas?

RealityChuck · February 6, 2010, 4:50pm

Assuming you’re using Word for the RTF, you can use the advanced search function to find italicized text and then manually make the change. I have to do this often when submitting stories. It only takes a few minutes.

To find the feature, open the “Find” function. There should be a button that says “More.” Click on it and you’ll see a “Format” button. Click on that and select the format.

CookingWithGas · February 6, 2010, 4:57pm

What applications do you have access to for editing rtf files? If you have Word, you can do this with find & replace. For example, go to Find & Replace. Click on Format and select Font, Italic. Then go back and check the Use Wildcards box. Then for Find use

(*)b

where b is a single blank space.

For replace use

\1_

It will replace a space with an underscore for all italicized text.

This may not be precisely what you want but seems to fit as much as you’ve described. With a little noodling you can create more sophisticated replacements.

GSV_Consolation_of_Dreams · February 6, 2010, 7:28pm

Thanks RealityChuck and CookingWithGas.

The first suggestion will help me find all the italic text, but I don’t want to have to replace it all manually; there’s just too much of it.

The second idea works OK, but it puts the underscores after EVERY_ italic_ word_ like_ this_
I’d like to get it so the underscores or slashes are only at the beginning and end of a block of italic text.

panamajack · February 6, 2010, 7:31pm

If you need to do this for multiple documents, you might be able to get by with an rtf->html converter, and then convert from html to something like Textile or Markdown.
That might end up introducing formatting stuff you don’t want, though, depending on the html conversion. But it could be a step in the right direction if it’s too tedious to do a file at a time.
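
For the html-to-text half of that route, here is a rough Python sketch using only the standard library. It assumes the rtf->html step has already been done by some other tool, and the marker choices (slashes for italics, asterisks for bold, underscores for underline) and the names `MarkerText` and `html_to_marked_text` are made up for illustration. It skips anything inside head, style, and script, but makes no attempt to handle tables, lists, or inline CSS styling.

```python
from html.parser import HTMLParser

# Plain-text marker for each inline tag (the choice of markers is an assumption).
MARKERS = {"i": "/", "em": "/", "b": "*", "strong": "*", "u": "_"}
# Elements whose contents should not end up in the output text.
SKIPPED = {"head", "style", "script"}


class MarkerText(HTMLParser):
    """Collects document text, wrapping emphasized spans in markers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip = 0  # nesting depth inside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip += 1
        elif tag in MARKERS:
            self.parts.append(MARKERS[tag])
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip = max(0, self.skip - 1)
        elif tag in MARKERS:
            self.parts.append(MARKERS[tag])
        elif tag == "p":
            self.parts.append("\n\n")  # blank line between paragraphs

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)


def html_to_marked_text(html):
    """Return the text of an HTML string with /italic/ and *bold* markers."""
    parser = MarkerText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Calling `html_to_marked_text("<p>A <i>very</i> good <b>day</b>.</p>")` gives `A /very/ good *day*.`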

Pedro · February 6, 2010, 9:20pm

It’s a simple programming job if you know regular expressions. I don’t know anything about rtf, but going by its Wikipedia entry this simple sed command would do what you want for bold formatting:


$ sed -r 's/\{\\b ([^}]+)\}/_\1_/g' ./path_to_your_file

Then you’d convert the transformed rtf file to plain text of course.
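
The same idea can be sketched in Python, extended to italics. This is only a sketch under the same assumption as the sed command: the formatting appears as a simple, un-nested group such as {\b text} or {\i text}. The function name `mark_groups` and the markers (asterisks for bold, slashes for italics, rather than the underscores used above) are made up for illustration.

```python
import re

# Marker to wrap around each kind of group (the choice of markers is an assumption).
GROUP_MARKERS = {"b": "*", "i": "/"}


def mark_groups(rtf):
    """Rewrite simple {\\b text} groups as *text* and {\\i text} groups as /text/."""

    def repl(match):
        marker = GROUP_MARKERS[match.group(1)]
        return marker + match.group(2) + marker

    # [^{}]+ keeps the match inside a single group, so nested groups are left alone.
    return re.sub(r"\{\\([bi]) ([^{}]+)\}", repl, rtf)
```

For example, `mark_groups(r"A {\i short} word")` gives `A /short/ word`. Groups that carry other control words next to the \i or \b are not matched and stay as they were.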

si_blakely · February 6, 2010, 11:53pm

It depends on your source rtf - some are pretty simple, but others can be pretty complex.

I would probably use some perl - there is a RTF::Parse module. You use this to create the parse tree, delete the bold/italic tags and replace them with the appropriate markers, then strip out the other tags to leave just text. Using sed/awk/perl with regular expressions would work, but finding all the stuff to strip out would take some time, whereas RTF::Parse has done all the ground work already.

Si

Reply · February 7, 2010, 11:28am

I haven’t tested this extensively, but this might work if your RTFs are relatively simple.

Modifying their suggestions a bit:
1. In the Find box, don’t type anything but do the Format -> Italics thing as RealityChuck described.
2. In the Replace box, type:

/^&/

Optionally, you can also change the Replace box formatting to “No italics”, but that doesn’t really matter once you save the file as .TXT anyway. The crucial thing here is the ^&, which tells Word to replace whatever you Find with itself, only this time you’re adding slashes on either side.
3. Click Replace All.
4. Repeat the procedure, modifying the Find box with the proper formatting and the Replace box with the appropriate substitutions. Instead of the slashes, you might use asterisks for bold and underscores for underlining.

You can also use the following macro to do all three at once:



Sub Txtformatter()
'
' Txtformatter Macro
' Wraps italic text in /slashes/, bold text in *asterisks*, and underlined
' text in _underscores_, clearing each format as it is replaced.
'
    ' Pass 1: italic -> /text/
    Selection.Find.ClearFormatting
    Selection.Find.Font.Italic = True
    Selection.Find.Replacement.ClearFormatting
    Selection.Find.Replacement.Font.Italic = False
    With Selection.Find
        .Text = ""                      ' match on formatting only
        .Replacement.Text = "/^&/"      ' ^& is the matched text itself
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    ' Pass 2: bold -> *text*
    Selection.Find.ClearFormatting
    Selection.Find.Font.Bold = True
    Selection.Find.Replacement.ClearFormatting
    Selection.Find.Replacement.Font.Bold = False
    With Selection.Find
        .Text = ""
        .Replacement.Text = "*^&*"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    ' Pass 3: single underline -> _text_
    Selection.Find.ClearFormatting
    Selection.Find.Font.Underline = wdUnderlineSingle
    Selection.Find.Replacement.ClearFormatting
    Selection.Find.Replacement.Font.Underline = wdUnderlineNone
    With Selection.Find
        .Text = ""
        .Replacement.Text = "_^&_"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

End Sub

Reply · February 7, 2010, 11:38am

And if you have a good text editor, you can also just find and replace the codes directly in the RTF file; RTF is a markup language like HTML and bolding is done through things like:


\b This is bold text.\b0 This is not bold text.

Or, sometimes:


{\b This is bold text.} This is not bold text.

(It depends on the word processor that made the RTF file.)

You can replace those codes with simple /s in the source. You might even be able to do it in Notepad if all else fails.
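
If doing that by hand gets tedious, the on/off form (\b ... \b0, \i ... \i0) can be handled with a short script. The following Python is a sketch only, assuming the RTF is of the simple kind shown above with no nesting of the same code; the function name `mark_toggles` and the markers are made up for illustration. It does not touch the brace form or strip any other RTF codes.

```python
import re

# Marker to wrap around each kind of run (the choice of markers is an assumption).
TOGGLE_MARKERS = {"i": "/", "b": "*"}

# \i or \b as a whole control word (so \insrsid and \i0 are not openers),
# its optional delimiter space, the run of text, then the matching \i0 or \b0
# and its optional delimiter space.
TOGGLE_RE = re.compile(
    r"\\(?P<kind>[ib])(?![A-Za-z0-9]) ?(?P<text>.*?)\\(?P=kind)0 ?",
    re.DOTALL,
)


def mark_toggles(rtf):
    """Rewrite '\\i text\\i0' as '/text/' and '\\b text\\b0' as '*text*'."""

    def repl(match):
        marker = TOGGLE_MARKERS[match.group("kind")]
        return marker + match.group("text") + marker

    return TOGGLE_RE.sub(repl, rtf)
```

For example, `mark_toggles(r"\i Italicized\i0")` gives `/Italicized/`. The single space right after a control word is consumed along with it, since in RTF that space is a delimiter rather than part of the text.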

BigT · February 8, 2010, 5:18am

Maybe you could use your word processor to convert to HTML, and then use Firefox to save as text. Firefox preserves formatting by default. (italics = /italics/, bold = *bold*)

GSV_Consolation_of_Dreams · February 8, 2010, 11:58am

Thanks for the suggestions everyone. Much appreciated.

ZenBeam · February 8, 2010, 1:53pm

I was able to do this using a combination of Wordpad and Notepad. It sounds like GSV Consolation of Dreams may be OK, but one important thing he might run into is that MS Word 7 makes a very complicated RTF file. Wordpad can read this and save it as a much simpler file.

For example, I typed <ctl-i>italic<ctl-i> then added “ized” later to get italicized. This one word became

{\rtlch\fcs1 \af31507
\ltrch\fcs0 \i\insrsid1711389 Italic}{\rtlch\fcs1 \af31507 \ltrch\fcs0 \i\insrsid4725405 ized}

(with a similar mess for a word like bolded) and the whole one-line file was 31 kB. I read it into Wordpad and saved it (Wordpad became unresponsive for a minute or two while it was doing this) to get a 1 kB file where that one word was just

\i Italicized\i0

At that point, it became manageable to do what the OP wanted.

CookingWithGas · February 8, 2010, 2:58pm

Saving a Word file as HTML will result in a bloated mess if you actually want to edit the results. Haven’t done this with other editors.

BigT · February 9, 2010, 4:12am

Yes, but saving the HTML as text in Firefox should remove the mess. But I haven’t tried it with actual Microsoft Word. If anyone has both programs, you could try it and see how well it works.

I found a bunch of stuff online that I converted to text for my eReader (a Game Boy of all things), and it worked fine.

ZenBeam · February 9, 2010, 1:30pm

That worked for me, starting with the file from above made with Word 7. When Firefox was converting it to text, it seemed to freeze up just like Wordpad did, but eventually finished. Took a minute or two.
