Can you get clean HTML from MS Word?

The HTML generated by MS Word is full of extraneous guff. Does anyone know how to export HTML from an MS Word doc in a cleaner format? Or a tool that can sweep all the dross out of the HTML to leave you with something slightly more raw? Thanks.

This might help:

Thanks. Looks perfect.

Did you know about that already or did you search for it on the MS website? If the latter, sorry for making work for you.

Already knew about it. Had the link in my bookmarks.

Hardcore geeks may insist on Notepad for editing HTML, but I always use WordPad. It offers several word processor functionalities, but still gives you clean HTML and lets you save as a text file with .html extension.

I simply never let Word get anywhere near my HTML. My advice is to stick with WordPad: it gives you honest code, nothing more than what you choose to input yourself.

Hmmm… have just installed it and can see no discernible difference between “Save As…” HTML, and “Export compact” HTML. Any other ideas?

I myself use Allaire/Macromedia Homesite. The problem here is I have to convert a huge Word doc that someone else produced into decent HTML. It’s driving me up the wall.

In your Start > Programs menu, you should see Microsoft HTML Filter. Did you open that?

OneChance, I was glad to see your link, as I have a need to convert DOC files to HTML as simply as possible, too. So I downloaded the “Compact HTML” program from MS, installed it, and followed instructions to Export a DOC file.

The result had 44 lines of header prepended to the body! Far fewer than the 109 lines the “Save As HTML” function makes, but hardly compact by my definition.

My solution has been to use an old copy of Word 97 to make DOC->HTML files. Its “save as HTML” function doesn’t add much HTML junk to the raw text, and I can strip that out pretty easily. jjimm, you might try that method if available.

I used the button in MS Word. Have tried it from the Start menu, and altered the options to make them as severe as poss; it’s still pretty horrendous, but better than nothing! Thanks.

One nice thing about Homesite is the extended replace feature. I do this all the time to strip Word HTML files. You can replace <FONT=ARIAL with nothing, and so on, to get rid of all the stupid, non-CSS and deprecated stuff Word throws in there.
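For anyone without Homesite, the same "replace with nothing" idea can be sketched in a few lines of Python. This is only an illustration, not what Homesite does internally: the function name `strip_font_tags` and the sample markup are made up, and it assumes the font tags look like `<FONT FACE="Arial" ...>` with a matching `</FONT>`.

```python
import re

# Matches opening <font ...> tags and closing </font> tags, in any letter case.
# The text between the tags is not part of the match, so it is kept.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)

def strip_font_tags(html: str) -> str:
    """Remove every <font> open/close tag but keep the enclosed text."""
    return FONT_TAG.sub("", html)

# Hypothetical sample of the kind of markup Word produces.
sample = '<P><FONT FACE="Arial" SIZE=2>Hello</FONT> world</P>'
print(strip_font_tags(sample))
```

One pattern covers every font variation (face, size, colour), which saves writing a separate replace for each.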

Yeah, I’m on my 23rd universal find-and-replace in Homesite… Only another 50 or so to go.

I may have been dreaming, but I remember seeing once that Homesite had a “Clean up shitty MS HTML” codesweeper function. But it’s disappeared. If it ever existed. :frowning:

Well, you could always spend $400 and buy Dreamweaver. Apparently it has a built-in “Clean Up Word HTML” feature.

There is a clean up MS Word HTML codesweeper “tidy” command in Homesite. It’s buried deep in the Settings dialog, but it’s there.

Thanks for everyone’s assistance.

I deal with Word-generated HTML files all the time, and they are a pain.

I just use universal replace functions to get rid of the extra crap. The trick is to specify the searches so that each one removes as many extraneous tags as possible in one go.

Of course, having regular expressions and wildcards in your editor helps a whole lot.
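To show what a handful of broad regular expressions can do in one pass, here is a rough Python sketch. The patterns are assumptions about the usual Word clutter (conditional comments, namespaced tags such as `<o:p>`, `class=MsoNormal`, inline `style` attributes), and the name `clean_word_html` is invented for the example. Regexes are not a real HTML parser, so treat this as a starting point and check the result by eye.

```python
import re

# Each entry is (compiled pattern, replacement). Order matters: whole blocks
# are removed first, then individual tags, then attributes inside tags.
PATTERNS = [
    # Conditional comments, e.g. <!--[if gte mso 9]> ... <![endif]-->
    (re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE), ""),
    # Namespaced tags, e.g. <o:p>, </o:p>, <v:shape ...>
    (re.compile(r"</?[a-z]+:[a-z0-9]+\b[^>]*>", re.IGNORECASE), ""),
    # class attributes whose value starts with "Mso", quoted or unquoted
    (re.compile(r"""\s+class=(?:"Mso[^"]*"|'Mso[^']*'|Mso[^\s>]*)""", re.IGNORECASE), ""),
    # Quoted inline style attributes
    (re.compile(r"""\s+style=(?:"[^"]*"|'[^']*')""", re.IGNORECASE), ""),
]

def clean_word_html(html: str) -> str:
    """Apply every pattern in turn and return the stripped markup."""
    for pattern, replacement in PATTERNS:
        html = pattern.sub(replacement, html)
    return html

# Hypothetical fragment in the style of Word's "Save As HTML" output.
sample = (
    "<!--[if gte mso 9]><xml>x</xml><![endif]-->"
    "<p class=MsoNormal style='margin:0'>Hi<o:p></o:p></p>"
)
print(clean_word_html(sample))
```

Four patterns replace what would otherwise be dozens of literal find-and-replace runs, because each one matches a whole family of tags or attributes rather than a single string.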

Hey - all the smilies have gone multiracial. :cool:

To all those who contributed to this thread, I urge you to try the ‘Clean up MS HTML’ function in the ‘HTML code tidy’ section of the ‘Codesweeper’ function in Homesite. It does an amazing job, and also deletes redundant tag endings, etc.