HTML Minimizer / Cleaner Software?

I have two different problems with HTML generating software:

First Problem

I scan many documents, often into HTML. I use or have used the following OCR software:

• The HP OCR software that came with my HP C3180 All-In-One (poor)
• Nuance/ScanSoft OmniPage Pro 16 (buggy for me)
• Abbyy FineReader 9 (trial only until I decide if it’s worth $400)
• Microsoft Office 2007 Office Document Scanning

But they all seem to have the same general problems when creating HTML (and other formats, such as PDF, Word, etc): They try too hard!

• If the OCR thinks (mistakenly) that a few letters or words are in a different typeface/font – even though they’re not – it’ll go ahead and choose a different font for them in the output!

• If the OCR thinks (mistakenly) that a few letters or words are in a different size – even though they’re not – it’ll go ahead and choose a different size for them in the output!

• If the OCR thinks (mistakenly) that a few letters or words are in a different style (italic / bold / superscript, etc.) – even though they’re not – it’ll go ahead and choose the wrong style for them in the output!

So you’ll get things like this:

You get the idea.

Second Problem

The two HTML editors I own (DreamWeaver 8, Namo Web Editor 6) often generate over-complicated, over-cryptic, and/or highly redundant code. DreamWeaver has a built-in HTML cleaner/simplifier (Clean Up XHTML / Clean Up Word HTML), but to put it politely, it’s very poor at its job. For example, much of what I enter will be in no more than 2-3 fonts, no more than 2-3 sizes and styles, but if you look at the source, it’ll have 20 or more CSS varieties and so forth, many of them redundant (e.g., <i>this </i><i>is </i><i>dumb.</i>). This gets particularly bad if your input came from OCR.

Question

Does anyone know of any effective software for Windows XP that can simplify the HTML output? I’d like something that would do things such as the following:

• Change all fonts within a selection to a single specified font
• Change all font sizes within a selection to a single specified size
• Change all styles (bold, italic, etc.) in a selection to a single specified style/no style
• Change all colors in a selection to the same color

Afterwards, eliminate as much HTML/CSS/XHTML redundancy as possible.

You’d think something like this would be simple and reasonably straightforward, but oh, no! DreamWeaver in particular ignores 80% of my attempts to do this because it’s too damned smart and thinks to itself “Surely the user didn’t want to do that! I’ll just do whatever the hell I please.” Namo’s not much better, and Word isn’t either. They’re all too damned “smart” to give me the total control I want!
Any help, please?

(Sorry for the verbosity, but I hope that by posting all this detail I won’t have to do much clarification).

One last thing: Please don’t suggest that I do all my own coding by hand, okay? I’d bow down to your magnificent hand-coding skill, but my back’s out, and I just can’t do it.
Thanks!

Sadly, the best one I’ve ever seen is Dreamweaver’s. I have to say, though, that I haven’t really seen DW add cryptic tags unprompted, unless you’re pasting into the WYSIWYG editor from an MS application.

HomeSite (bundled with DW) also has a half-decent one, albeit with its own quirks. But I’ve never seen anything that can properly get rid of the thousands and thousands of redundant tags and attributes that go into anything MS has touched with its mucky hands.

I generally find it simpler to paste content into a text editor and edit it by hand, even though this is a horribly time-consuming job (note I’m not telling you to do this, but after 12 years making and running websites, that’s the workaround I still find myself doing).

Am subscribing to this thread in case someone else has a magic bullet.

First, I’d like to thank you for your response, jjimm.

Well, “cryptic” wasn’t perhaps the best adjective for me to have used. But massively redundant and massively over-complex certainly applies. This is most often seen with OCR’d text (though not only then). What I’ll see are things such as an initial CSS block for, say:

1. Courier - plain - 12 pt., followed by,
2. a different one for Courier - plain - 11 pt., followed by,
3. a different one for Courier - plain - 12 pt., followed by, etc., etc.

Then I’ll change the word, or even the single letter, that was mistakenly OCR’d as 11 pt to 12 pt, and then I’ll see:

1. Courier - plain - 12 pt., followed by,
2. a different one for Courier - plain - 12 pt., followed by,
3. a different one for Courier - plain - 12 pt., followed by, etc., etc.
And it’ll still look the same most of the time even after DW’s “clean up”. And this won’t have ever touched any Microsoft product at any point at all. It’ll be like this from the OCR software.
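
The redundancy described here (several CSS classes that all say the same thing) can in principle be collapsed mechanically. What follows is only a rough Python sketch, not something any of the tools mentioned in this thread is known to do. It assumes the generated stylesheet contains nothing but simple `.classname { ... }` rules and that the markup uses double-quoted `class="..."` attributes; the class names `ft0`, `ft1`, `ft2` in the usage note are made up for illustration.

```python
import re

# Matches simple class rules such as ".ft0 { font-size: 12pt }".
# Assumption: no nested braces, no selectors other than a single class.
RULE_RE = re.compile(r"\.([A-Za-z_][\w-]*)\s*\{([^}]*)\}")


def normalize(declarations: str) -> tuple:
    """Turn 'font-size:12pt; font-family:Courier' into a sorted, comparable tuple."""
    parts = []
    for decl in declarations.split(";"):
        if ":" in decl:
            prop, value = decl.split(":", 1)
            # Property names are case-insensitive; whitespace in values is collapsed.
            parts.append((prop.strip().lower(), " ".join(value.split())))
    return tuple(sorted(parts))


def merge_duplicate_classes(css: str, body: str) -> tuple:
    """Keep the first class for each distinct set of declarations and
    rewrite class attributes in the markup to point at the kept class."""
    canonical = {}   # normalized declarations -> first class name seen
    rename = {}      # every class name -> class name to keep
    kept_rules = []
    for name, decls in RULE_RE.findall(css):
        key = normalize(decls)
        if key not in canonical:
            canonical[key] = name
            kept_rules.append(f".{name} {{ {decls.strip()} }}")
        rename[name] = canonical[key]

    def fix_class_attr(match):
        names = []
        for n in match.group(1).split():
            n = rename.get(n, n)
            if n not in names:      # avoid class="ft0 ft0"
                names.append(n)
        return 'class="' + " ".join(names) + '"'

    new_body = re.sub(r'class="([^"]*)"', fix_class_attr, body)
    return "\n".join(kept_rules), new_body
```

Given three classes where the first and third are both Courier 12 pt, calling `merge_duplicate_classes(css, body)` would keep only two rules and point every use of the third class at the first. Anything fancier in the stylesheet (compound selectors, media queries) would need a real CSS parser.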

Well, I’ve had to deal with multiple 300-or-so-page scanned and OCR’d texts, and I’d far rather slit my throat than fix up any of that by hand. Especially after DW gets it (even after clean up), when there’ll be 2-3 times as much HTML/XHTML/CSS coding as there was text in the source material! (Well, it seemed that way, anyway.)

I’d settle for a magic hammer. Or an ever-so-slightly charmed one even.

I can’t suggest any tools exactly like you want, I’m afraid (although this doesn’t mean they don’t exist). The way you describe it, however, the existing formatting is next to useless anyway. Given this, I think personally I’d probably just give up on getting html from the OCR software at all; get plain text output, paste it into Dreamweaver (or whatever), and go from there. If you’re going to be correcting all the formatting anyway, no formatting at all is a better starting point than complete garbage.

If you want a slightly better starting point, there are some plain text to html converters around too, some better than others (all the online ones I can find seem to be rubbish). If you’re comfortable mucking around with command line stuff, I’ve found the imaginatively-named txt2html to produce reasonable output - it’s certainly something you can work with, if nothing else. It will at least separate out your paragraphs for you, and has some tricks for spotting headers, bullet lists and suchlike that (depending on what your plain text output looks like) might help a bit.

You’ll need to have Perl installed on your machine to run it, which you can get free for Windows from ActivePerl. This could all be a bit fiddly to get working, though, and you’ll still end up having to add a certain amount of formatting yourself, so it depends how much you’re prepared to tinker with things like this as to whether it’s worthwhile.
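
To give a feel for what a plain-text-to-HTML pass does, here is a minimal Python sketch of the general idea. It is not txt2html and does not reproduce its heuristics; it only assumes that paragraphs in the OCR text output are separated by blank lines and that bullet items start with `- ` or `* `. The function name `text_to_html` is invented for this example.

```python
import html
import re


def text_to_html(text: str) -> str:
    """Wrap blank-line-separated blocks in <p>, and simple bullet blocks in <ul>."""
    blocks = re.split(r"\n\s*\n", text.strip())
    out = []
    for block in blocks:
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        if all(ln.startswith(("* ", "- ")) for ln in lines):
            # Every line looks like a bullet item: emit a list.
            items = "".join(f"<li>{html.escape(ln[2:].strip())}</li>" for ln in lines)
            out.append(f"<ul>{items}</ul>")
        else:
            # Otherwise join the wrapped lines into one paragraph.
            out.append(f"<p>{html.escape(' '.join(lines))}</p>")
    return "\n".join(out)
```

The output has no fonts, sizes, or classes at all, which is the point: it is a clean base to paste into the editor and style from there.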

If you’re able to give a sample of the sort of output you’re dealing with, I might be able to make a better suggestion. I’ll quite understand if you’re not willing to paste your documents up for all to see, though. :)

Never used Dreamweaver, so I’m afraid I have no suggestions re: question 2. Sorry.

I agree with Dead Badger. It seems like it’d be much much easier for you to just get the text from OCR with no formatting, paste it into Dreamweaver and then do your formatting from there.

It sounds like you’re going to have to do something by hand, so you might as well do the formatting of clean text by hand (which is easily guided by Dreamweaver’s WYSIWYG interface) instead of the unformatting of dirty HTML by hand.

I spend a lot of time arguing with developers over this. They want to automate everything, then bitch about the amount of time it would take to write the application to do it - but when I point out that me doing it by hand would take about 1/4 the time, they get pissy. C’est la vie.

Have you tried HTML Tidy? It should clean up some of the code problems but not all. Consider it a start.

Then go back to Dreamweaver. Use Find/Replace with regular expressions to remove all instances of generated classes, ids, fonts, etc. I use this approach quite often when I receive a web page generated from an Excel spreadsheet. After using the standard clean up HTML/XHTML tool, I then use regular expressions to do the nit-pick stuff. In no time I have an Excel-generated table with all the Microsoft bloat trash removed.
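
The same kind of regex pass can be run outside the editor. The Python sketch below is an illustration under assumptions, not the exact expressions used above: it strips `class`, `id`, and `style` attributes, drops `<font>` tags while keeping their contents, and merges back-to-back identical inline tags such as `</i><i>`. It assumes no attribute value contains a `>` character, and note that removing every `id` will also break any in-page anchors that rely on them.

```python
import re

# Any opening tag, e.g. <p class="x">. Assumption: no ">" inside attribute values.
TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")

# class/id/style attributes with double-quoted, single-quoted, or bare values.
ATTR_RE = re.compile(
    r"""\s+(?:class|id|style)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)

# Opening and closing <font> tags (the text between them is kept).
FONT_RE = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)

# A closing inline tag immediately followed by the same opening tag.
ADJACENT_RE = re.compile(r"</(i|b|em|strong|u)>(\s*)<\1>", re.IGNORECASE)


def strip_generated_attrs(markup: str) -> str:
    """Remove class/id/style attributes, looking only inside tags."""
    return TAG_RE.sub(lambda m: ATTR_RE.sub("", m.group(0)), markup)


def clean(markup: str) -> str:
    markup = strip_generated_attrs(markup)
    markup = FONT_RE.sub("", markup)
    # Merge "</i><i>" style seams, keeping any whitespace between them.
    markup = ADJACENT_RE.sub(r"\2", markup)
    return markup
```

Run on a fragment like `<i>this </i><i>is </i><i>dumb.</i>`, `clean` leaves a single `<i>` element. Regular expressions are not a full HTML parser, so it is worth running this on a copy and diffing the result before trusting it on a 300-page document.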

Thanks to all the respondents thus far. I shall do more research given the suggestions you’ve provided.

If any future readers wish to respond, please do!

In DreamWeaver working in WYSIWYG mode, can’t you just select all of the text and apply a style/font to it?

Then, as said above, run it through Tidy. If you’re on a Mac, give Balthisar Tidy a try. ;)

Nothing would make me happier. You’d think that would be very, very simple and straightforward, but at least for me, very often DW just plain ignores my attempts!

My guess is that what I try to do conflicts with some existing CSS style or something, but in any case it just completely disregards my efforts.