I hope someone here can help me out with this one. I’ve searched high and low for a way to do this myself, but so far Google is not turning up anything.
I need to convert a .rtf file to standard plain text, but I need to preserve some of the formatting, i.e. words that are italic in the RTF should come out looking _like this_ or /like this/.
Assuming you’re using Word for the RTF, you can use the advanced search function to find italicized text and then manually make the change. I have to do this often when submitting stories. It only takes a few minutes.
To find the feature, open the “Find” function. There should be a button that says “More.” Click on it and you’ll see a “Format” button. Click on that and select the format.
What applications do you have access to for editing rtf files? If you have Word, you can do this with find & replace. For example, go to Find & Replace. Click on Format and select Font, Italic. Then go back and check the Use Wildcards box. Then for Find use
(*)b
where b is a single blank space.
For replace use
\1_
It will replace a space with an underscore for all italicized text.
This may not be precisely what you want but seems to fit as much as you’ve described. With a little noodling you can create more sophisticated replacements.
The first suggestion will help me find all the italic text, but I don’t want to have to replace it all manually; there’s just too much of it.
The second idea works OK, but it puts the underscores after EVERY_ italic_ word_ like_ this_
I’d like to get it so the underscores or slashes are only at the beginning and end of a block of italic text.
If you need to do this for multiple documents, you might be able to get by with an RTF-to-HTML converter, and then convert from HTML to something like Textile or Markdown.
That might end up introducing formatting stuff you don’t want, though, depending on the html conversion. But it could be a step in the right direction if it’s too tedious to do a file at a time.
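If you go the HTML route, the second half could look something like the sketch below. It uses only Python's standard html.parser, and it assumes the converter emits plain <i>/<em>/<b>/<strong> tags. A converter that expresses italics through CSS on <span> elements would slip straight past it, and it makes no attempt to skip <style> or <script> content. The names MarkerText and html_to_marked_text are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical mapping: which plain-text marker stands in for which tag.
MARKERS = {"i": "/", "em": "/", "b": "*", "strong": "*"}


class MarkerText(HTMLParser):
    """Collects text content, swapping italic/bold tags for markers."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags contribute nothing; known ones contribute a marker.
        self.parts.append(MARKERS.get(tag, ""))

    def handle_endtag(self, tag):
        self.parts.append(MARKERS.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data)


def html_to_marked_text(html):
    parser = MarkerText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(html_to_marked_text("<p>A <i>quiet</i> and <b>loud</b> word</p>"))
# A /quiet/ and *loud* word
```

Because the marker is emitted once per opening and closing tag, a whole italic run gets one slash at each end rather than one per word.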
It’s a simple programming job if you know regular expressions. I don’t know anything about RTF, but going by its Wikipedia entry this simple sed command would do what you want for bold formatting (swap the \\b for \\i to catch italics instead):
$ sed -r 's/\{\\b ([^}]+)\}/_\1_/g' ./path_to_your_file
Then you’d convert the transformed RTF file to plain text, of course.
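For anyone without sed handy, roughly the same substitution can be sketched in Python. This version targets italics rather than bold and, like the sed command, only understands the group form {\i ...} with no nested braces inside the run. The function name mark_italic_groups and the sample string are invented for illustration.

```python
import re

# Matches the group form {\i some text}. Assumes the italic run itself
# contains no braces, the same limitation as the sed one-liner.
ITALIC_GROUP = re.compile(r"\{\\i ([^{}]+)\}")


def mark_italic_groups(rtf, marker="_"):
    # A function is used for the replacement so the marker is never
    # interpreted as a regex escape sequence.
    return ITALIC_GROUP.sub(lambda m: marker + m.group(1) + marker, rtf)


sample = r"Plain {\i two words} plain {\b bold} end"
print(mark_italic_groups(sample))
# Plain _two words_ plain {\b bold} end
```

The whole run is wrapped once, so the markers land only at the start and end of each italic block.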
It depends on your source rtf - some are pretty simple, but others can be pretty complex.
I would probably use some Perl - there is a RTF::Parse module. You use this to create the parse tree, delete the bold/italic tags and replace them with the appropriate markers, then strip out the other tags to leave just text. Using sed/awk/perl with regular expressions would work, but finding all the stuff to strip out would take some time, whereas RTF::Parse has done all the groundwork already.
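To give a feel for the parse-then-strip idea without the Perl module, here is a small hand-rolled tokenizer in Python. It is a sketch, not a real RTF parser: it tracks only italic state, assumes Windows-1252 for \'hh escapes, ignores \u Unicode escapes, and skips a short hard-coded list of header groups. The names rtf_to_marked_text and SKIPPED are made up for this example.

```python
import re

# One token per match: a \'hh hex escape, a control word with optional
# numeric argument (and its single delimiter space), a control symbol,
# a brace, or a run of plain text.
TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})|\\([a-z]+)(-?\d+)? ?|\\(.)|([{}])|([^\\{}]+)",
    re.DOTALL,
)
# Header groups that carry no body text (an assumed, incomplete list).
SKIPPED = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}


def rtf_to_marked_text(rtf, marker="/"):
    out = []
    stack = []        # saved (italic, skipping) state per open group
    italic = False
    skipping = False
    for hexbyte, word, arg, symbol, brace, text in TOKEN.findall(rtf):
        if brace == "{":
            stack.append((italic, skipping))
        elif brace == "}":
            was_italic = italic
            if stack:
                italic, skipping = stack.pop()
            if was_italic != italic:
                out.append(marker)   # italic ended with the group
        elif skipping:
            continue
        elif word in SKIPPED or symbol == "*":
            skipping = True          # ignore the rest of this group
        elif word in ("i", "plain"):
            turn_on = word == "i" and arg != "0"
            if turn_on != italic:
                out.append(marker)
            italic = turn_on
        elif word in ("par", "line"):
            out.append("\n")
        elif word == "tab":
            out.append("\t")
        elif hexbyte:
            out.append(bytes([int(hexbyte, 16)]).decode("cp1252", "replace"))
        elif symbol in ("{", "}", "\\"):
            out.append(symbol)       # escaped literal character
        elif text:
            # Raw line breaks in an RTF file are not part of the text.
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out)


doc = r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0 Plain \i Italicized\i0  text\par}"
print(repr(rtf_to_marked_text(doc)))
# 'Plain /Italicized/ text\n'
```

Since the marker is written only when the italic state actually flips, both the \i ... \i0 form and the {\i ...} form come out with one marker at each end of the block.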
I haven’t tested this extensively, but this might work if your RTFs are relatively simple.
Modifying their suggestions a bit:
1. In the Find box, don’t type anything but do the Format -> Italics thing as RealityChuck described.
2. In the Replace box, type:
/^&/
Optionally, you can also change the Replace box formatting to “No italics”, but that doesn’t really matter once you save the file as .TXT anyway. The crucial thing here is the ^&, which tells Word to replace whatever you Find with itself – only this time you’re adding slashes on either side.
3. Click Replace All.
4. Repeat the procedure, modifying the Find box with the proper formatting and the Replace box with the appropriate substitutions. Instead of the slashes, you might use asterisks for bold and underscores for underlining.
You can also use the following macro to do all three at once:
And if you have a good text editor, you can also just find and replace the codes directly in the RTF file; RTF is a markup language like HTML and bolding is done through things like:
\b This is bold text.\b0 This is not bold text.
Or, sometimes:
{\b This is bold text.} This is not bold text.
(It depends on the word processor that made the RTF file.)
You can replace those codes with simple /s in the source. You might even be able to do it in Notepad if all else fails.
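If Notepad's find and replace gets tedious, the toggle form (\i ... \i0 or \b ... \b0) can be handled with a short Python sketch like the one below. It assumes the on and off codes appear as a matched pair with nothing nested in between, and it follows the RTF convention that one space after a control word is a delimiter rather than text. The function name mark_toggles is invented for this example.

```python
import re


def mark_toggles(rtf, word="i", marker="/"):
    # \word<space> switches the style on, \word0 switches it off.
    # The optional space after \word0 is the control-word delimiter.
    pattern = re.compile(rf"\\{word} (.*?)\\{word}0 ?", re.DOTALL)

    def repl(match):
        text = match.group(1)
        body = text.rstrip()
        # Keep any trailing whitespace outside the closing marker.
        return marker + body + marker + text[len(body):]

    return pattern.sub(repl, rtf)


print(mark_toggles(r"\i Italicized\i0"))
# /Italicized/
print(mark_toggles(r"\b bold \b0 plain", word="b", marker="*"))
# *bold* plain
```

The remaining RTF codes still have to be stripped afterwards, for instance by opening the edited file in a word processor and saving it as plain text.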
Maybe you could use your word processor to convert to HTML, and then use Firefox to save as text. Firefox preserves formatting by default (italics = /italics/, bold = *bold*).
I was able to do this using a combination of Wordpad and Notepad. It sounds like GSV Consolation of Dreams’s approach may be OK, but one important thing he might run into is that MS Word 7 makes a very complicated RTF file. Wordpad can read this, though, and save it as a much simpler file.
For example, I typed <ctl-i>italic<ctl-i> then added “ized” later to get italicized. This one word became
(with a similar mess for a word like bolded), and the whole one-line file was 31 kB. I read it into Wordpad and saved it (Wordpad became unresponsive for a minute or two while doing this) to get a 1 kB file where that one word was just
\i Italicized\i0
At that point, it became manageable to do what the OP wanted.
Yes, but saving the HTML as text in Firefox should remove the mess. But I haven’t tried it with actual Microsoft Word. If anyone has both programs, you could try it and see how well it works.
I found a bunch of stuff online that I converted to text for my eReader (a Game Boy of all things), and it worked fine.
That worked for me, starting with the file from above made with Word 7. When Firefox was converting it to text, it seemed to freeze up just like Wordpad did, but eventually finished. Took a minute or two.