Excel concatenation and PDF tables - help!

Two urgent questions:

First – I’m trying to get all possible permutations of a word that has been subject to OCRing and may have different spellings. For example, the word might be “Smith” but the “i” may also be an “l” or a “!” and the “h” might also be “b” or “d”.

How can I get Excel to list all of the possible combinations?
Second – is there any way of neatly exporting tables from PDFs? I know PDFs are often used to protect content from copying, but I was wondering whether there was an easy way to get tables to appear neatly.

To answer the first question, you’ll might want to loot at something similar to a soundex routine, but in reverse. It’s a bit complicated but certainly doable, as long as you are able to spec out which letters are misidentified and group them appropriately.

Here is a soundex routine. Keep in mind, it’s only similar to what you want, so it’ll have to be modified quite a bit.

As for the second, I’ll see if I can find a PDF with tables to try it out. I do know that some are editable and some aren’t, so it’s possible that that might affect what gets copied to the clipboard.

Excellent challenge. My VB is weak, but I’ll noodle it awhile and let you know if I come up with anything. Most likely a programer-type will beat me to it, though. While the soundex routine is a good idea, I was thinking more along the lines of a simple substitution table. For each letter parsed, you would look up the possible replacements from your table (“I” could be “i”, “!”, or “|”, say) and permute accordingly. One question: are you attempting to account for character run-together in the OCR process, or are you assuming word length is preserved? For example, “Smith” could erroneously scan as “Srnith, adding a letter.” Wouldn’t make a huge difference in your coding, I would think, but the table would be larger.

Do you mean without purchasing Acrobat? You are aware, aren’t you, that that’s basically their marketing strategy. Reader’s free to gain market share, but if you want to manipulate the content, you gotta pay up. Having made my disclaimer, the answer is, “it depends.” I know a few tricks that work in some circumstances and not in others, having to do with the nature of the table. However, I’m a little uncomfortable sharing publicly, because although I see no copyright violation, I don’t want to be accused of posting dirty tricks. Email me if you want to discuss.

Dirty tricks, Shmirty tricks!

Here’s what you do:

If you only have the free Acrobat Reader, use the “Text Select” or “Column Select” tools to highlight the text on the page.

Just copy and paste into your favorite application.

From my experience, it doesn’t do a good job of copying table structures. It tends to lump all rows into one cell entry. Just play around with the Text/Column select tools and do the best you can.

You may have to use Excel’s “Text to columns” feature to separate columns out.

Thanks for the tips everyone.

I knew that Acrobat was designed partly to make copying difficult, but the team I’m working with are trying to copy tables from faxes that have been scanned as PDFs. The quality is unsurprisingly poor. I should have access to Acrobat Writer but not for a week or two, so I was looking for an alternative – I wasn’t that hopeful though.

The Excel problem is a bit more interesting. The faxes are being scanned as PDFs, then read by OCR software. Word length is pretty much guaranteed (in the majority of cases) and I have a stats table of the most common letter substitution errors. It’s just a matter of trying to get Excel to list all of the variations based on that stats table. I tried messing around with CONCATENATE but that isn’t working for multiple possibilities. Soundex sounds interesting but might be beyond my abilities in the short term.

Thanks again.


If there is no secrecy required, would you care to post a few more specific details about exactly what you’re trying to accomplish?

Given a permutation lookup table and a text file with one word per line, it would be pretty easy to take that file and output one that has possible permutations for each word. I just can’t tell if that’s what you’re wanting.

Is Excel a requirement, or would any app that accomplishes this do the trick for you?

Any app will do, but Excel is readily available and I don’t have anything beyond dial-up access to download other software.

Basically, it’s a library/archiving project. A number of paper documents have been scanned into PDF format. In order to index these electronically they’ve also been OCR’d and re-saved into PDF format. Due to poor quality in some of the source documents I’d like to find a way to automate the process of identifying common letter errors for any given search term or phrase.

I’m still confused. Is it paragraphs of text, and if so, how are you wanting to handle possible permutations?

I’m guessing you don’t want the output to look like “Mr. Smith Sm!th Smlth Smitb Sm!tb Smltb went to Washington Wash!ngton Washlngton Wasbington Wasb!ngton Wasblngton”, so I’m having a hard time understanding what you’re trying to accompllish.

The PDFs are being indexed using Microsoft indexing service via a web-based application called Ringtail. Ringtail allows users to enter search terms (using Boolean logic), passes the search to indexing service, and returns results by scanning the full text of the OCR’d PDF files. I’m trying to find a way to generate the Boolean search terms to include the most common mispellings of the search terms.

Okay, I think I’ve got it now.

So if the user enters the word Smith, you want it to find all documents that have Smith, or Smlth, or Sm!th (using your earlier example) in them, correct?

I looked at Ringtail’s FullText app, which I’m guessing is the package you are using, and I don’t see a way to pass search terms to it. If you can, then an app that performs simple letter replacement would do the trick. The problem I see is if you have a word with a large number of letters that might have misscanned, then the number of combinations could become very unwieldy. A boolean string with 1000+ "OR"s in it will be extremely slow and might not even be allowed in the search string, depending on how the program was written.

Where the words stored in a SQL database, it would be pretty easy to create a soundex-like routine, but I don’t know of an easy way to do it with someone else’s app that you have no control over.