How did this copy and paste operation misfire?

I received an email with an attached letter in the form of a pdf file. I needed to respond to the email address in the letter, not to the email sender, so I copied the email address out of the letter (which I was kind of surprised I could even do), and pasted it into the addressee line in a new email I was sending.

But I didn’t look closely enough at what I pasted. There were changes – the letter “i” had been added between two other letters in the address, and the periods had been changed to commas. My email program actually caught the period thing, because it wouldn’t let me send to an invalid email address, but of course it didn’t catch the extra “i”, so the email bounced back to me as undeliverable within a few minutes.

Just to make things a little stranger, to see if the error would repeat, I copied the address again from the pdf file and pasted it into Notepad. The results surprised me – while the period to comma change was consistent, the “i” showed up in the copy between a different pair of letters.

So what’s up?

Are you using an email client, or is it web-based email? I know Adobe sometimes doesn’t play with MS, but I don’t think I’ve ever seen what you’re describing.

It doesn’t sound like an e-mail issue to me, but maybe some weird formatting/hidden characters in the PDF.

Ya got me. I received and sent from a Yahoo personal email account, but like tdn, I doubt it’s an email issue, since the second iteration of the problem didn’t involve the email program at all. In both iterations, the pdf had already been downloaded and saved to my hard drive, and in the second iteration the copy was between it and Notepad, which of course has been on my hard drive since the build.

I think this is likely, but the real mystery to me is how the “i” ended up in a different location the second time. The other mystery is, how did it transcribe periods to commas? It’s easier for me to believe the commas were hidden characters that showed up in the paste, but in that case the periods would still be there too.

PDF fonts sometimes use special characters, and perhaps different characters. For instance, if I copy from a PDF and paste into WordPerfect, ligatures like fl and fi in the PDF usually show up as question marks in the pasted text. (I guess that just means that it’s a character that WordPerfect does not recognize.) Other common oddities include extra spaces, sometimes even within words, and with some PDF documents I have found that when you paste from them you just get garbage. The particular form that the garbage takes may vary depending on what program you are copying into.
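For anyone curious about the ligature point, here is a minimal Python sketch of how a pasted single-glyph ligature can be turned back into ordinary letters. It assumes the PDF put the Unicode ligature code points (U+FB01 for fi, U+FB02 for fl) on the clipboard, which may not be true of every PDF; the function name and sample string are made up for illustration.

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    """Replace compatibility characters, such as single-glyph ligatures,
    with their plain-letter equivalents using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)

# Hypothetical pasted text: U+FB01 is the "fi" ligature, U+FB02 is the "fl" ligature.
sample = "of\ufb01ce \ufb02oor"
print(expand_ligatures(sample))
```

Note that NFKC also rewrites other compatibility characters (full-width letters, for example), so this is a blunt tool rather than a ligature-only fix.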

Your moving i is puzzling, though. Are you sure? Can you reproduce the effect? If so, I am going to guess that some non-alphanumeric character that does not display in the PDF is, in one case, showing up as an i in your target application, and another is just not displaying in that target. In the other target application, on the other hand, the first non-alphanumeric is not displaying and the second is displaying as an i. That would be a weird coincidence, though.
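One way to test the hidden-character guess is to list every character in the pasted address that is not something you would normally expect in an email address. This is only a rough Python sketch: the allowed set is a simplification I picked, not the full rules for valid addresses, and the sample address is invented.

```python
import string
import unicodedata

# Assumption: a simplified set of characters commonly seen in email addresses.
ALLOWED = set(string.ascii_letters + string.digits + "@._-+")

def suspicious_chars(address: str):
    """Return (position, code point, Unicode name) for each unexpected character."""
    found = []
    for index, ch in enumerate(address):
        if ch not in ALLOWED:
            name = unicodedata.name(ch, "UNNAMED")
            found.append((index, f"U+{ord(ch):04X}", name))
    return found

# Hypothetical pasted address with a zero-width space and a comma in it.
pasted = "jo\u200bhn@example,com"
for position, codepoint, name in suspicious_chars(pasted):
    print(position, codepoint, name)
```

If the list comes back empty for the real paste, the stray i is a genuine letter on the clipboard rather than an invisible character being rendered oddly.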

I didn’t completely follow the order of events and my experience is more with the Mac’s cut and paste; however, the transposition of characters is not surprising depending on the font (I usually see spaces [dis]appearing when going from PDF to “text”) and the moving of the i can be explained quite simply. The i is not in the text but rather in a column or margin. When you drag and highlight, you may have not started in the exact same spot both times and as such how it “caught” the other column may be different. I think the belief that PDFs are the same for everyone is quickly becoming false.

Reading so far, I’d wonder if the pdf from which you copied had been run through Acrobat’s OCR engine. If so, it’s possible that various characters had been misread. The periods-to-commas error is typical. Acrobat’s OCR engine isn’t that great even when the scanned document is really sharp. Random flecks on the page and distortions of characters can definitely cause OCR errors.

This I don’t get. Worth exploring further. When you paste into Notepad, you’re pasting nothing but text. No formatting stuff, nothing weird. I would have thought that was true of Outlook (at least in the address fields) too. Is Outlook your email program?

If not, it’s possible that you’re pasting something other than plain text ASCII characters into your email program, while pasting only plain text into Notepad.

I’m really doubtful the original document was scanned and run through OCR. It has a logo, for one thing, and that would either be omitted by the OCR program or it would have to be pasted back in somehow. It looks like it was simply typed in something like Word and saved or exported as a pdf.

And re Notepad, yes, I know. I found long ago that moving text between word processing documents can have weird results, so I got in the habit of pasting into Notepad, then copying and pasting from there into the second word processor to get rid of all the hidden command characters.

I have tried again several times to paste the string into Notepad, and get identical results each time.

The selection of text invokes OCR, I believe; i.e. what you have may be a scanned document, basically a picture, and the select tool you use tries to parse alphanumeric characters from your selection. This often yields the kind of confusions you describe. (I run into these problems when trying to copy text – or worse yet, formulas – from articles that have been published some time ago, and thus are present online just as scanned copies, not as files where text is marked and formatted properly as text.)

This is definitely an OCR issue. The version of Acrobat that generated the document would have OCR’d the document when it was scanned. No, this wouldn’t remove any logos or change the appearance of the text. The OCR data is hidden and provided only to allow you to copy/paste text, use text search, or interact with screen readers for the visually impaired.

You can easily tell if the .pdf was created with OCR data. With the pdf open, use select all (Ctrl+A). If all you see is a big blue box encompassing everything then there is no OCR. If individual lines are highlighted then there is in fact OCR data present in the pdf.

This is all assuming the pdf was created from an image during some step of the process. The fact that the errors exist shows this is the case. Even if Word was involved it doesn’t mean it didn’t get printed to an image file etc.

Well, each line is individually highlighted, though the highlights extend exactly through the first and last character in each line. There are some lines at the bottom of the page (in a stationery-like office address) where the blue highlight is broken up several times on the same line – it looks like over blank spaces where there were tabs.

What would be evidence that OCR was not involved?

That’s an interesting thought. I do not know the answer one way or another because I don’t work with Acrobat often enough to know how it handles expanses of whitespace. Sure, I would think if OCR was involved and if the highlights were broken up they must be broken up by something.

Might be interesting to paste those blank lines into a txt document and see if there are in fact tabs there. Which wouldn’t prove anything either way, but small dots appearing sure would. Got me stumped.

I just re-read the original post about the ‘i’ changing positions. Now that, I guess, changes things. The one time I had characters start rearranging themselves on me was when I was writing a program with Unicode support. Not to bore you with details, but the Unicode shift state was left errrr “messy”, and so constant data would have non-constant output.

If you reeeeeeeeealy wanted to you could paste the email directly into a hex editor and see if any nasties are right in the clipboard copy.
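If there is no hex editor handy, a few lines of Python give roughly the same view. This is just a sketch that assumes the pasted text is stored as a Python string and encoded as UTF-8; the function name and the sample strings are made up.

```python
def hex_dump(text: str, encoding: str = "utf-8") -> str:
    """Return the bytes of the text as space-separated two-digit hex values."""
    data = text.encode(encoding)
    return " ".join(f"{byte:02x}" for byte in data)

# Hypothetical comparison: a period is byte 2e, a comma is byte 2c.
print(hex_dump("a.b"))
print(hex_dump("a,b"))
```

Paste the copied address in place of the sample strings; anything outside the 20 to 7e range is a character that is not plain printable ASCII.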