PDF file format

Just out of curiosity I would like to know in simple terms how Adobe Acrobat PDF stores the information. Is it purely graphical, or a mixture of graphics and text? Is there any way to capture text from a PDF file to the clipboard?

The PDF format is partially based on the PostScript printing language. It’s much more compressed, and has a lot of features that PS doesn’t need, such as hyperlinks.

Anyhow, both formats store text as text (assuming the application that created the PDF writes it that way) and refer to the graphical info in the font to draw it.

Acrobat Reader has a text selection tool which allows you to swipe over the text and copy it. However, it is possible for the creator of a PDF to secure it to prevent copying from it. I don’t have Reader installed (I use the full version of Acrobat), but I think it has an item on the Edit menu that copies the whole file to the clipboard.

Saltire, if I look at the source code of a PDF file I cannot find the text I can see displayed. I guess most of the files I have found have disabled the copy text to clipboard feature. I further guess it is relatively simple to write software which will extract the text. I have noticed Google does this and you can get the text version from them. They must use this text for their searches.

I’m far from an expert in the format, I’m afraid. However, since the format has built-in compression, it probably doesn’t show up when you open it in a text editor. But I’m not sure.
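To illustrate the compression point, here is a minimal Python sketch. It assumes the page content is stored with Flate (zlib-style) compression, which is a common filter for PDF streams. The sample content stream and the text in it are made up for illustration and are not taken from any real file.

```python
import zlib

# A made-up page content stream: pick a font, move to a position, draw a string.
content = (
    b"BT /F1 12 Tf 72 720 Td (Hello from a PDF page) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (This line is text, not a picture) Tj ET\n"
    b"BT /F1 12 Tf 72 680 Td (It can be selected and copied) Tj ET\n"
)

# Flate (zlib) compression, as commonly applied to PDF streams.
compressed = zlib.compress(content)

# The readable text is no longer visible in the stored bytes...
text_visible_in_file = b"Hello from a PDF page" in compressed

# ...but decompressing brings it straight back.
restored = zlib.decompress(compressed)
text_visible_after_inflate = b"Hello from a PDF page" in restored

print(text_visible_in_file, text_visible_after_inflate)
```

So a text editor shows only the compressed bytes, even though the page really does contain text.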

When I said it was stored as text, all I meant was that it isn’t just pictures of letters. In other words, it can be selected and copied in most cases.

When you have Reader open, is the text selection tool grayed out? If so, you do indeed have a secured document. If it’s appearing normally, you should be able to select the text with the tool. If the text is in columns, hold down your control (command on the Mac) key to limit the selection’s width.

My version of Acrobat Reader (4.0 Japanese version) doesn’t have a “select text” tool in the menu, but there is one on the toolbar - right next to the “zoom” tool. It seems to work on most PDF files I’ve tried. If you click and hold that icon, you can select between “select text” and “select graphics” tools.

I don’t think you can extract text straight from the source code. Just because it’s stored as text, it doesn’t mean the whole text is sitting together in the source code somewhere. There could be formatting codes between every word, or even every character, and they may not be in order.
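A rough Python sketch of what that looks like in practice. It assumes an already uncompressed content stream that draws text with the Tj and TJ operators. The sample stream, the function name extract_strings and the regular expressions are made up for illustration; escapes, hex strings and font encodings are ignored entirely.

```python
import re

# Made-up content stream: the second line is drawn before the first,
# and one line is split into pieces with kerning numbers in between.
stream = (
    "BT /F1 12 Tf 72 700 Td (second line) Tj ET\n"
    "BT /F1 12 Tf 72 720 Td [(F) -30 (irst) 20 ( li) -15 (ne)] TJ ET\n"
)

def extract_strings(content):
    """Return (y, text) for each simple text object in the stream."""
    results = []
    for block in re.findall(r"BT(.*?)ET", content, flags=re.S):
        # Position set by the Td operator: "x y Td".
        pos = re.search(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td", block)
        y = float(pos.group(2)) if pos else 0.0
        # Every (...) literal string in the block, joined in drawing order.
        pieces = re.findall(r"\(([^()]*)\)", block)
        results.append((y, "".join(pieces)))
    return results

found = extract_strings(stream)
# Sort top of page first (a larger y is higher up on a PDF page).
lines = [text for y, text in sorted(found, key=lambda item: -item[0])]
print(lines)
```

Even in this tiny example the extractor has to glue fragments together and reorder lines by position, which is why searching the raw file for a phrase usually finds nothing.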

Just an FYI: the latest version of Omnipage (11) claims a lot of new support for PDF, e.g. being able to read a PDF and convert it to another format such as Word or HTML, and to create PDF files from scans.

I was looking for a long time for a way to get the text out of PDFs, so I got it. But it fails to read the PDFs that I wanted to convert, saying they have a password or it can’t decode the compression format (neither is the case). Kind of a disappointment, and AFAIC the OCR is still not nearly up to par either.

Anyway, I’m still looking for tools to get the text out of PDFs. (It’s actually financial account statements, which I ultimately need to get into Quicken format. I can do it, but even with some programming it’s a big manual effort.)

There was a plugin, called Gemini to convert to text, but IIRC it was kinda pricey.

I resent that Adobe provides little to no support for getting your data out of PDFs. I think they have/had a service where you could email a PDF and they would email back a text file. How ridiculous…

Uh, hello? PDFs are largely intended to be read-only. I imagine that’s why the Reader is free but the authoring software is quite expensive. That’s why manuals are often printed as PDFs; that’s why my company uses PDFs for sending documents to clients. It makes it harder (of course, not impossible) for all and sundry to copy or alter the information in the document.

At work we receive forms from Great Britain in .pdf format. You can copy the text and paste it into a Word document for reuse. In copying the text (which is tabular), you lose all the tabs, but this is a small price to pay if you don’t mind cutting and pasting.

Another reason for not being able to select text may be the source/creation method. When we want to post a 30-year-old document on one of our intranet sites, the best method is just to scan it in. This results in the pages being treated as graphics with no embedded text.

http://www.planetpdf.com/mainpage.asp?MenuID=156&WebPageID=323

" No 4. Extracting text from PDFs

Hi Aandi,

I have a question about pdf files. Is there any way to select text off of a pdf file and copy it to a Word document? All I have is Acrobat Reader, and it won’t let me select anything.

Thanks, Susan.

In many cases, you can copy the text from a PDF file and paste it. But there are a number of cases where you might not be able to. It isn’t always easy to tell what is going on, so first I’ll describe what happens when it works, then cover why it might not work.

Before you can copy text at all, you need to select it. The trick is, that Acrobat contains a number of different “tools”. The default tool is the “hand” tool, which is just used for scrolling. You need to use the text selection tool.

In Acrobat Reader 3, this is the letters “abc” surrounded by a dotted line, while in Reader 4, this is the letter “T” with a dotted box next to it.

TIP: in Reader 4, some of the tools fold out. Just click the mouse and hold it on the text selection tool, and you will see more, related, tools, appear. This is important - without it you’ll miss some of them. If a tool has a tiny triangle in the bottom right of its icon, then it will fold out. This was new for Reader 4. You can also use Edit > Select all to save having to select text individually.
Acrobat Reader 4.0
You drag the mouse to select the text, and then copy it (Edit > Copy, from the menus, or Ctrl+C in Windows, or Command+C on the Macintosh). What could go wrong? Actually, quite a few things. So, watch out for these problems.

It’s different in a browser.

In a browser you can still use the text select tool, but the usual ways of copying don’t work. That’s because the browser sees the copy instruction, but doesn’t bother to mention it to Acrobat! Luckily, Adobe thought of this and added a special COPY button to the Reader toolbar. It appears only when viewed in a browser. The button shows two tiny pages, side by side. Select all is not available when viewing a PDF in a browser.

It isn’t text at all.

You can see it - there on the page - text. What else could it be? Actually, it could be a picture. The text could be a series of shapes which look like letters, or a scanned page, so the text is actually a bitmap. In these cases, trying to select text will seem not to find it. Unfortunately, there’s not much you can do.

It copies, but it’s complete junk.

Some ways of making a PDF file will give you hopelessly jumbled fonts. Think of it this way: most fonts have all of the letters a, b, c and so forth. You can put them in a grid; for most fonts, the letters will always appear in the same place. But some fonts have the letters all over the place. Acrobat has no way to know that this has been done, so it just copies the letters that you’d get for a normal font. This often happens when creating a PDF document in Windows, using TrueType fonts, and Acrobat Distiller. Before Distiller sees the fonts, they are already jumbled up. Sometimes just a few characters may be junk - these might be in a different font, or use a special character not available to other programs. Again, there isn’t much you can do.

I can’t even select text.

Sometimes, the text select tool is greyed out and can’t be used. This happens with “secure” PDF files. The creator of any PDF can protect it - choosing whether or not to allow copying (and printing). If a document is protected, you would have to contact the copyright holder and ask for an unprotected copy to use. They might agree, or might want a fee. Many people forget that almost everything on the web is copyright, whether or not it is secure, and whether or not it has a copyright notice.

The text is in columns.

Acrobat doesn’t understand about columns, so trying to copy text from columns can be painful - it just reads right across the page. But there is an easy work-around. Just hold the Ctrl key (Windows) or Alt/Option key (Macintosh) when selecting the text, and you will find you can drag around any rectangular area. In Acrobat 4, there’s even a special tool, but remember to fold it out from the regular text selection tool.

But it’s dozens of pages!

Acrobat Reader can copy only one page at a time. Acrobat Exchange (the commercial program) on Windows ONLY can copy the whole file, so long as you have enough memory (and patience). But if you need to do this, perhaps you should reconsider; it’s almost always better to go back to the original file, if you can."

etc etc
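The "copies, but it's complete junk" case from the quoted article can be sketched in a few lines of Python. Everything here is hypothetical: the glyph table stands in for a font whose character codes were renumbered by whatever produced the PDF, and the byte values are invented for illustration.

```python
# Hypothetical font: the PDF producer renumbered the glyphs, so
# byte 0x41 no longer means "A". This table is invented for illustration.
glyph_for_code = {
    0x41: "P", 0x42: "D", 0x43: "F", 0x44: " ",
    0x45: "t", 0x46: "e", 0x47: "x",
}

# Bytes stored in the page for the visible words "PDF text".
stored = bytes([0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x45])

# What gets drawn: each code is looked up in the font itself.
displayed = "".join(glyph_for_code[b] for b in stored)

# What a naive copy produces: the codes read as ordinary characters.
copied = stored.decode("latin-1")

print(displayed)
print(copied)
```

The page looks right because the font does the translation when drawing, but a copy that assumes a normal character layout has no such table to consult.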

If you have access to a linux box, you could try the pdf2[whatever] routines that come standard with many distros. I seem to recall a pdf2txt converter, though my machine currently only has a pdf2ps (and then ps2ascii) converter.
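A small Python sketch of that fallback. It assumes the text converter is the pdftotext program from Xpdf/Poppler (the "pdf2txt" in the post is a recollection and may be a different tool), and that pdf2ps and ps2ascii are the Ghostscript utilities. The function name and file names are placeholders, and the sketch only builds the command lines rather than running them.

```python
import shutil

def text_extraction_commands(pdf_path, txt_path, which=shutil.which):
    """Return the command lines to run, or [] if no known tool is found."""
    if which("pdftotext"):
        # Direct route: PDF straight to plain text.
        return [["pdftotext", pdf_path, txt_path]]
    if which("pdf2ps") and which("ps2ascii"):
        # Two-step route: PDF to PostScript, then PostScript to text.
        ps_path = txt_path + ".ps"  # intermediate PostScript file
        return [["pdf2ps", pdf_path, ps_path], ["ps2ascii", ps_path, txt_path]]
    return []

for cmd in text_extraction_commands("statement.pdf", "statement.txt"):
    print(" ".join(cmd))
```

What it prints depends on which tools are installed; the commands could then be run by hand or passed to subprocess.run.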

Uh, hello, as in I’m so dumb I don’t see the obvious?

I suppose that for some applications, the inability to do anything with information other than view or print it could be seen as an advantage, as in your examples. Good for your company. But what USERS want is a free flow of information, the ability to massage data, feed it into other programs, etc. There is an army of pdf users that want/need this, just look at this thread.

Besides, being able to convert to another format does not necessarily affect the inviolability of the original pdf.

Have you seen Adobe actually tout the difficulty or inability to do other than view/print a document as a ‘feature’? I think what Adobe ‘intended’ was just greed, to make it difficult to have any other options than PDF. If not greed, then just indifference to what would be a very useful function to the user. I also happen to find their pricing rather usurious: they divide what should basically be one program into many, many separate pieces and charge an arm and a leg for each one. (IMHO).

I just looked on their site and their opening pitch for Acrobat is “What good is a document you can’t open?”.

The question to me is, what good is a document you can’t process?

Anyway the need for it is, uh, hello, obvious. Omnipage offers it now, albeit flawed, and I see a lot of plugins on their site. I just think Adobe should be the one providing it. The 3rd party solutions I’ve seen are too expensive and/or unreliable.

No that was “um, hello?” as in “Slightly inappropriate choice of phrase after a long day and a stinking tube ride”.

Sorry 'bout that, areider.

I’ve always thought that the non-editing features of Acrobat were a selling point. I agree that there’s a great demand for users to be able to edit them (myself included).

One neat thing about Google is that it provides the text of PDF files.

I was wondering if there may be a switch in a PDF source: <copy enable=OFF> which you could switch to ON. But if it were that simple I am sure it would be well known.
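For what it's worth, there is something like a switch, just not a free-standing one. In the standard PDF security scheme the permissions are bit flags in a number stored as /P in the file's encryption dictionary, and that number is also one of the inputs used to derive the encryption key, so editing it by hand is not the same as flipping a setting. A minimal Python sketch of reading the flags follows; the function name and the example values are made up for illustration.

```python
# Permission bits as numbered in the PDF standard security handler
# (bit 1 is the lowest-order bit).
PRINT = 1 << 2      # bit 3: print the document
MODIFY = 1 << 3     # bit 4: modify the contents
COPY = 1 << 4       # bit 5: copy or extract text and graphics
ANNOTATE = 1 << 5   # bit 6: add or modify annotations

def decode_permissions(p):
    """Decode a /P value (a signed 32-bit integer) into named flags."""
    bits = p & 0xFFFFFFFF  # view the signed value as unsigned
    return {
        "print": bool(bits & PRINT),
        "modify": bool(bits & MODIFY),
        "copy": bool(bits & COPY),
        "annotate": bool(bits & ANNOTATE),
    }

# Example values chosen for illustration only.
print(decode_permissions(-44))
print(decode_permissions(-64))
```

With these example values, -44 allows printing and copying but not modifying or annotating, while -64 allows none of the four.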

My initial curiosity was whether the text was stored as text and that has been clearly answered in the affirmative.