Downloading PDFs (Adobe Reader)

I am in the process of downloading and saving some PDF documents. The documents have a contents page, which is downloaded first; when you click on the various headings, it opens the relevant pages (each a separate PDF hosted on the website). These are fairly large documents, and it is a laborious task going through all of the contents headings so I can open and then save each individual file.

My question is, is there a way to save all PDFs related to the contents page in one go?

If you have a download manager such as Flashget, you can simply highlight and copy the links, then paste them into Flashget (Flashget can automatically parse links from highlighted text).

But since the links are in PDF format, you won’t be able to highlight them. You’ll have to convert the PDF file to Word first. Maybe this method is more time consuming.

Another solution would be to go directly to the folder where the PDF files are stored. If the link for a PDF file is www.example.com/PDF/document.pdf, you could try typing www.example.com/PDF/ in your browser. If you are lucky, this will show you all files contained in the PDF folder.
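Working out the folder address from a file link just means cutting off everything after the last slash. Here is a minimal Python sketch, assuming the link is a plain path with no query string on the end; `folder_url` is a made-up helper name, not part of any tool mentioned in this thread.

```python
def folder_url(file_url: str) -> str:
    """Return the address of the folder that contains file_url."""
    # rpartition splits at the LAST slash; keep the part before it.
    head, _, _ = file_url.rpartition("/")
    return head + "/"


if __name__ == "__main__":
    # Example address taken from the post above.
    print(folder_url("www.example.com/PDF/document.pdf"))
```

Whether the server actually shows a listing for that folder is up to the site, so this only saves the typing.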

Just tried your second solution, unfortunately the page was restricted.

I might give solution one a go. However, the links on the contents page don’t all go to separate PDFs. A main header will generally be one PDF containing all of its sub-headers, but (probably due to file size) a PDF sometimes spans only half a main header or contains several main headers.

How do I convert PDF to Word?

Also, the links seem to be only partial links. If I open the contents page within IE, the links work fine; if I open it in Reader by itself, the links don’t work. Will this be a problem?

BTW I’m using Windows ME, Reader 6.0.1 and IE6.

Acrobat Reader is just that, a reader. You need the full version of Acrobat, or a similar tool, in order to save a PDF file into a different format. However, if a PDF file has some document security features in place, you would be hard-pressed to save it to any other format at all.

Check out this Google search of potential PDF converters.

Thanks, I’ll have to see if I can find a free one.

Is it possible to directly save a linked PDF rather than opening, then saving? Is it also possible to see where a link goes to without clicking on it?

Can you not just right-click on the link and choose ‘save as…’ or whatever your browser suggests?

And when you do right-click you should get a properties or ‘copy link address’ option to see the URL of the file without having to open it.

No, I can’t Ponster, because the links are not contained on a webpage but rather in a PDF document. When I right click I just get some PDF options, none of which are Save, Save As, Look at Link Address etc.

I use the full version of Acrobat and the Distiller to do this. But before I bought it, I looked around on the net to find a free pdf distiller. I found pdf995 that is free. You have to download several files to combine pdf files.

I didn’t add a link as I have never used this product. My company won’t let us d/l software like that. They spent the $250 for the Acrobat/distiller instead. I’m just saying this is what I found.

You may find other or better programs by searching for a free pdf distiller.

Hope that helps.

You can easily convert PDF to HTML pages with Google. You can try putting the URL of the contents page into Google. If it turns up in the search results, you can click View as HTML. Then you’ll be able to right-click the links and save them.

There are several problems with that:

  1. Google won’t find the contents page if the site requires registration
  2. Sometimes Google will not have the View as HTML option on PDF files

Well, I’m not having much luck. I tried searching with Google, but Google didn’t offer the HTML view for this particular PDF. I’ve downloaded two free PDF distillers. One created an HTML page that was over 1 MB and didn’t keep the links live; it was text only. The other one required a password for the PDF before it would convert. I obviously don’t have the required password.

These are controlled government documents that I’m trying to save, and I’d guess that the authors have tried their best to protect the contents and prevent them from being edited, which in turn is making it difficult to do anything else with them as well.

I actually have them all in hard copy anyway so I may just have to give up on the idea of having a ready reference offline electronic version.

Can you post the URL? I have a PDF converter so I could try to extract the links.

Ok, thanks.

There are four separate contents pages; the four links across the top of each contents page just link to each other. As I said, a lot of the links will go to the same page. I’m not sure if you can find an easy way to avoid duplication.




The last one is actually just an index that links through to the same pages linked from the first three. I’ve included it in case it’s useful for you, but don’t worry about it too much.

My PDF converter won’t keep the links. Sorry, I can’t help :(

1913
Snork!!

No encryption is listed, so passwords shouldn’t be a problem.

pdftohtml shows a list of links while converting. Note the addresses are relative to the page you downloaded the first one from. Basically, replace …/…/ with http://www.airservicesaustralia.com/pilotcentre/aip2/
you can get pdftohtml here:
http://pdftohtml.sourceforge.net/

They also have a Windows version for the computing impaired.
From enrtoc.pdf:

Thanks Mort Furd. Erm, I have no idea how to use pdftohtml though. I’ve downloaded it, but when I run the exe file it flashes up a DOS screen which quickly disappears. I don’t know how to run it from the command prompt (sad really, I know).

I’d try this:

  1. Copy pdftohtml.exe to a folder called c:\pdf
  2. Download your pdf to c:\pdf
  3. Click on “Start”
  4. Select “Run”
  5. Enter “command” in the box and press enter. If you are using XP or Win2000, you may need to type cmd instead of command.
  6. A DOS box will open.
  7. Type this command in the DOS box: cd c:\pdf
  8. Press enter at the end of the command.
  9. Type pdftohtml enrtoc.pdf >enrlinks.txt
  10. Change “enrtoc.pdf” in the last command to match the pdf you need converted.
  11. You can also change “enrlinks.txt” to some other name, just make sure to leave the .txt ending else Windows will get confused.
  12. Repeat step 9 for all of your pdf files, changing pdf names and .txt names as needed.
  13. When you are all done, you have a set of .txt files that contain a list of all links in all of the pdf files.
  14. Open the .txt files with your favorite text editor (Word, if that’s what you’ve got.)
  15. Do a search and replace
    search: …/…/
    replace: http://www.airservicesaustralia.com/pilotcentre/aip2/
  16. Get and install Flashget.
  17. Select and copy the list of links.
  18. Start Flashget.
  19. Paste the list of links into Flashget.
  20. Follow Flashget’s instructions for downloading the whole pile of stuff.
  21. Repeat the Word/Flashget operation with all of your .txt files.

Hope that helps.
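If doing the search and replace by hand in Word gets tedious, step 15 can also be scripted. This is only a rough Python sketch: the relative prefix is copied exactly as it appears in this thread, so check your own .txt output and adjust it if the links start differently. `make_absolute` is a made-up name.

```python
BASE_URL = "http://www.airservicesaustralia.com/pilotcentre/aip2/"

# Prefix copied as shown in this thread (an assumption about your output);
# change it to match whatever the links in your .txt files begin with.
RELATIVE_PREFIX = "…/…/"


def make_absolute(text: str,
                  prefix: str = RELATIVE_PREFIX,
                  base: str = BASE_URL) -> str:
    """Replace every occurrence of the relative prefix with the site address."""
    return text.replace(prefix, base)


if __name__ == "__main__":
    # "some/page.pdf" is a placeholder, not a real path on the site.
    print(make_absolute("…/…/some/page.pdf"))
```

Reading each .txt file, passing its contents through the function and saving the result does the same job as the Word search and replace.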

I think you’re getting that because it’s a command-line based program.

Go here and get the GUI so you can use it in windows.

Well, that was one of the more frustrating experiences I’ve had with a computer.

Following Mort Furd’s instructions I extracted the links from each of the PDFs and did a find-and-replace to turn them all into the correct URL. I then downloaded Flashget. Unfortunately I couldn’t get Flashget to pick up all of the links. I experimented by taking away the extra text leaving only the links and it did work, but there were too many links to make this a practical proposition.
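Stripping the extra text by hand is the slow part, and a short script can do it instead. A minimal Python sketch, assuming the links have already been turned into full http:// addresses and are separated from the surrounding text by spaces or line breaks; `extract_links` is a hypothetical helper. It also drops repeats, since many contents entries point at the same PDF.

```python
import re
from typing import List

# Treat anything from http:// or https:// up to the next whitespace,
# quote or angle bracket as a link. Punctuation glued to the end of a
# link (such as a full stop) would be kept, so check the result.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_links(text: str) -> List[str]:
    """Return the links found in text, in order, without duplicates."""
    seen = set()
    links = []
    for url in URL_PATTERN.findall(text):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


if __name__ == "__main__":
    sample = "Page 1 http://www.example.com/a.pdf\nPage 2 http://www.example.com/a.pdf"
    print("\n".join(extract_links(sample)))
```

The output is one bare link per line, which is the form that worked when pasted into Flashget.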

I suspect that Flashget would have done what I wanted, but my patience with its rather brief help file was waning. I decided to turn all the links into working HTML links by doing another find-and-replace and placing the bare minimum of tags at the start and end of the page. Although I used to have a good working knowledge of basic HTML, I couldn’t remember much; I initially tried using square brackets! Opera kept opening my feeble attempts as pure text, displaying all of the HTML source. Obviously my HTML was wrong.

Next step was to open a blank page in a WYSIWYG editor, go to the source code and place all my links in the body. I had relearnt enough now to know that my links were correctly tagged.

Unfortunately, when I saved the file, Netscape Composer nicely placed a few extra bits in my links which caused them to link to the local computer. Specifically I found it had added %3F at the start and end of each of my links. Now, I don’t know what this little group of characters really does, but I suspected it was the cause of my links being screwed when I opened the page in a browser.
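For what it’s worth, %3F is the percent-encoded form of a question mark (3F is the hexadecimal code for "?"). A link that begins with it no longer starts with http://, so a browser will most likely treat it as relative to the page it sits on, which would explain links pointing at the local computer. A small Python sketch of decoding and stripping it; `strip_encoded_question_marks` is a made-up name, and it assumes the stray %3F only ever sits at the very start or end of a link.

```python
from urllib.parse import unquote


def strip_encoded_question_marks(link: str) -> str:
    """Remove %3F (upper or lower case) from both ends of a link."""
    marker = "%3f"
    while link.lower().startswith(marker):
        link = link[len(marker):]
    while link.lower().endswith(marker):
        link = link[:-len(marker)]
    return link


if __name__ == "__main__":
    print(unquote("%3F"))  # prints "?"
    print(strip_encoded_question_marks("%3Fhttp://www.example.com/a.pdf%3F"))
```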

So, I copied all of the source over to Word again, intending to find-and-replace all those %3Fs. Although Word displayed it as a web page, I found the menu item that lets you view the source. I did that and it opened up a little HTML editor which had a find-replace function.

With the changes made, I saved, I opened, and it worked! Now I could open it in IE and use Flashget to download all links on the page.

I thought about everything I’d done and tried to eliminate as many steps as possible so I could quickly do the same with the rest. At this stage I still had a problem with my own HTML coding and so I needed to use Composer or something similar.

On my attempt with the next PDF it all worked smoothly, in fact I used Mozilla’s Composer rather than Netscape’s and it didn’t add those %3Fs so it was all done in a couple of steps.

“This is getting better” I thought.

When I tried the last one though, I ran into problems. Initially I shunned Composer and used Word to get me a blank web page in which to insert my links. However, Word insisted on removing everything between the quote marks in my links when I saved. I went back to Composer. It decided it preferred to include %3F in my links after all. I went through and removed them with Word and resaved, but when I opened the page, the links still pointed at the local computer. I spent a lot of time comparing one of my successful pages with this one and couldn’t find anything. I then tried pasting my links from the good page into my latest failure and it worked. So there was something wrong with the actual links. I compared the links character by character and they were identical to the good ones. At around the peak of my frustration I noticed the quote marks were different. The good links had basic straight up and down " and the bad ones were slanted.

Back in Word I check the font, it’s the same as the good example. I try retyping some of the quotes but they’re still the slanted ones. Grasping at straws I try different fonts, no change. I EVENTUALLY note that there is a little box in the Auto Correct window. It resides under the words Replace as you type. It’s ticked. Beside it is this harmless looking phrase, “Straight quotes” with “smart quotes”.
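If the slanted quotes have already crept into a file, they can be swapped back in one pass rather than retyped. A minimal Python sketch, assuming the text is already loaded as a string; `straighten_quotes` is a hypothetical name, and it converts every curly quote in the text, not just the ones inside links.

```python
# Map the four common "smart" quote characters to their plain equivalents.
SMART_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(SMART_TO_STRAIGHT)


if __name__ == "__main__":
    print(straighten_quotes("<A HREF=\u201chttp://www.example.com/a.pdf\u201d>"))
```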

I can now get my links working consistently. I’ve even found what the minimum set of HTML tags is to get everything working, so I don’t need to shag around with a WYSIWYG editor again.

All my PDFs are downloaded and working correctly.

Although it took longer doing this than if I’d just gone through the Contents pages and downloaded each page manually, the document gets updated every few months. Next time I’ll be able to extract the links, do a find-and-replace, add <HTML><BODY> at the top and </BODY></HTML> at the bottom of the page, and it’ll probably take about 5 minutes for all of them.
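For next time, the page-building step could be scripted too. A rough Python sketch, assuming you already have the full links as a list; `links_page` is a made-up name, and the one-anchor-per-line layout is just one way to do it. It writes straight quotes around each address, which avoids the smart-quote trouble described above.

```python
from html import escape
from typing import Iterable


def links_page(urls: Iterable[str]) -> str:
    """Build a bare-bones HTML page with one clickable link per line."""
    lines = ["<HTML><BODY>"]
    for url in urls:
        # Escape characters such as & so they cannot break the tag.
        safe = escape(url, quote=True)
        lines.append('<A HREF="{0}">{0}</A><BR>'.format(safe))
    lines.append("</BODY></HTML>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(links_page(["http://www.example.com/a.pdf"]))
```

Save the result with an .html ending, open it in the browser, and the download manager can then grab every link on the page.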

Thanks to you guys for your help which did, despite my lack of understanding of my computer and its programmes, get me the result I was looking for.