So there’s this organisation that releases information as PDF files, for all comers, on the internet. The text in these PDFs looks normal when viewed in Adobe’s viewer or Apple’s Preview. If, however, you copy and paste any of the text, you get a space between each character l i k e t h i s . I assume they’re doing this on purpose somehow, possibly to deter people like me from copying and pasting. It’s not a big deal to me, just annoying, but I’d like to know how they pull this off, and what I might do to defeat it.
I’m using a Mac, and I’ve got TextSoap and TextWrangler (in addition to the default stuff like TextEdit). Any ideas?
I was going to leave names out of it, but I’m talking specifically about Train Alteration Advice (TAA) PDFs from the Australian Rail Track Corporation (ARTC). See here. Pick a document, any document, copy and paste. It’s riveting stuff, unless you’re not interested in train times in country NSW, Australia.
I didn’t see it in any of the documents. But I think I know what you are talking about.
I see it in newspapers and the occasional PDF document, when they try to have even margins on both sides and don’t want to hyphenate anything. What happens is one line doesn’t have enough words to fill out the room between the margins, and the next word is too long to fit. Instead of hyphenating and only using part of the big word, they’ll “stretch out” the rest of the words to fit, so there are extra spaces.
I don’t think I’ve explained it well. I’ll try and find a decent illustration of this phenomenon online and link to it.
EDIT: Here is a wiki article that shows what I’m talking about. You can see how the text on the left has variably spaced words to even out the margins. It is fairly well done, though, in that there aren’t really big spaces anywhere. I’m still looking for a more egregious example.
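For anyone who wants to see the "stretching" idea in concrete terms, here is a rough sketch of word-space justification in a fixed-width setting. It is only an illustration: the function name `justify` and the choice to hand leftover spaces to the leftmost gaps are my own assumptions, and real typesetting software works with proportional fonts and may adjust letter spacing as well.

```python
def justify(words, width):
    """Pad the gaps between words so the line is exactly `width` characters.

    Assumes the words plus one space per gap already fit within `width`.
    """
    if len(words) <= 1:
        # Nothing to stretch: leave a lone word (or empty line) left-aligned.
        return "".join(words).ljust(width)
    gaps = len(words) - 1
    total_spaces = width - sum(len(w) for w in words)
    # Every gap gets `base` spaces; the first `extra` gaps get one more.
    base, extra = divmod(total_spaces, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        parts.append(word + " " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)


print(repr(justify(["the", "quick", "fox"], 15)))
```

The fewer words on a line, the wider each gap has to get, which is where the really ugly examples come from.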
chicken wire?, with respect to what DrCube says, maybe the problem isn’t in the source data but the program you are pasting into. (I’m using a PC and tried pasting into Notepad or Word 2003. No problems.) Perhaps that program thinks those are separate lines, not a single paragraph, and it is forcing justification the way DrCube illustrates. Perhaps turning off justification or changing the page width (or margins) will fix it?
On preview: DrCube, justification routines put extra spaces between both letters and words, as needed.
From either Apple’s built-in Preview app, or from Firefox, I get:
N o t e # P i c t o n F r a m e C t o B u x t o n l i n e i s c o n t r o l l e d a n d o p e r a t e d b y
N S WR T M , a n d i s p l a c e d o n t h i s T A A f o r i n f o r m a t i o n o n l y
From Adobe reader, I get:
Note# Picton Frame C to Buxton line is controlled and operated by
NSWRTM, and is placed on this TAA for information only
The rest of that page copies normally. Maybe ARTC are changing what they do to their PDFs, and that one line is a throwback or something? Everything they put out used to do the spaced out letters thing, and I was only copying that one line yesterday, so I assumed it was still universal. My mistake.
Whether it’s spaced out or not, it’s like that on the clipboard (which I can see in Jumpcut), so I guess it’s the source and not the destination that causes the problem.
There’s something going on there, but at least they seem to be scaling back on the crazy spaces, and Adobe can un-space it, if needed.
If you look closely at the PDF, you’ll see that there are indeed spaces between the letters in the section you indicated. Compare that with the next line, which says:
You’ll see that the letters there are much closer together. The bolding obscures the spacing in your problem section, but there are indeed extra spaces in the original document.
ETA: Huh, I should have done this earlier, but when I paste your problem section here, I get no spaces:
Note# Picton Frame C to Buxton line is controlled and operated by
NSWRTM, and is placed on this TAA for information only
both here in the reply box and in a text editor. What do you get if you paste it in the reply box?
Yes, that’s possible, but they still manage to have larger spaces between words. It makes sense when you look at it in the original document, but not when you copy and paste it some place else. There’s some disconnect there.
If I select for example just the “a” or just the “m” in frame, there’s a bit of whitespace between the letters that isn’t selected either time. I’m never able to select that bit of whitespace on its own, though.
And that’s extra space characters between the visible text characters, not just wider gaps. It’s definitely in the source text; if you make everything big in the browser, you can see the selection colour jump twice between each letter. (I’m using Firefox 3 on a Mac and when you use the Command-+ key combo to expand the viewed content, it makes everything bigger, not just HTML text.)
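If the extra spaces really are characters on the clipboard, a few lines of Python could strip them back out. This is just a sketch and it rests on one big assumption: that the gap between words comes through as a run of two or more spaces, while the filler between letters is a single space. The function name `despace` is made up for the example. If the word gaps have already been collapsed to single spaces (as in the sample pasted above), there is no way to tell them apart from the filler, and everything runs together.

```python
import re


def despace(text):
    """Remove the single spaces inserted between characters.

    Assumption: a run of two or more spaces marks a real word boundary.
    """
    cleaned = []
    for line in text.splitlines():
        # Split on the wide gaps first, so the word boundaries survive.
        words = re.split(r" {2,}", line.strip())
        # Then drop the single filler spaces inside each word.
        cleaned.append(" ".join(w.replace(" ", "") for w in words))
    return "\n".join(cleaned)


print(despace("N o t e #   P i c t o n   F r a m e   C"))
# prints: Note# Picton Frame C
```

On a Mac you could pipe the clipboard through something like this with `pbpaste` and `pbcopy`, or do the equivalent regex replace in a text editor.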