How can I make PDFs searchable on my website?

I want to upload PDF files to my website. This will be a continuous process with new PDFs going up nearly every day.

I know I can make a PDF itself searchable but how do I make it so someone can search all PDFs using a search window in the browser (so it will look at all PDFs to see where a given word occurs)?

I do not expect a full on explanation here (unless it happens to be absurdly simple). Just having trouble Googling this out myself or getting help from Adobe help files and such. So pointers in the right direction would be great.

Thanks in advance!

To do it on your end would require maintaining an index of the text in all your PDF files, and using a web application as a front-end to search the index. That’s a lot of work.
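To give a feel for what "maintaining an index" involves, here is a minimal sketch of an inverted index in JavaScript. It assumes the text has already been pulled out of each PDF by some separate extraction tool, which is not shown; the file names, sample text, and function names are all made up for illustration.

```javascript
// Hypothetical input: text already extracted from each PDF by a separate tool.
const extractedText = {
  "newsletter-2009-01-05.pdf": "Budget vote delayed until spring",
  "newsletter-2009-01-06.pdf": "Spring schedule announced; budget details inside",
};

// Split text into lowercase words, dropping punctuation.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Build an inverted index: word -> Set of file names containing it.
function buildIndex(docs) {
  const index = new Map();
  for (const [file, text] of Object.entries(docs)) {
    for (const word of tokenize(text)) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(file);
    }
  }
  return index;
}

// Return the files that contain every word of the query, sorted by name.
function search(index, query) {
  const words = tokenize(query);
  if (words.length === 0) return [];
  let hits = null;
  for (const word of words) {
    const files = index.get(word) || new Set();
    hits = hits === null
      ? new Set(files)
      : new Set([...hits].filter((f) => files.has(f)));
  }
  return [...hits].sort();
}

const index = buildIndex(extractedText);
console.log(search(index, "budget"));
console.log(search(index, "spring schedule"));
```

A real setup would also have to rebuild or update the index every time a new PDF goes up, which is where most of the ongoing work comes from.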

A much easier way, if your site is publicly accessible, is to simply abuse Google, which will happily index the PDFs on your site anyway.

You could have a search form like this:



<form action="https://www.google.com/search" method="GET">
Search text: <input type="text" name="q" value="                            site:yoursite.com filetype:pdf" />
<input type="submit" />
</form>


Notice that the form input contains a bunch of spaces followed by Google search parameters to limit the search to your site and to files of type PDF. If you want to hide those details from the user, you could use a JavaScript hook to append the search parameters to the search string before the form is submitted.
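A rough sketch of such a hook might look like the following. The id "pdfSearchForm" and the function name buildQuery are hypothetical; this assumes you add that id to the form tag, leave the text box empty instead of pre-filling it, and load the script after the form in the page.

```javascript
// Append the Google operators to whatever the visitor typed.
function buildQuery(userText, site) {
  return userText.trim() + " site:" + site + " filetype:pdf";
}

// In a browser, rewrite the "q" field just before the form is submitted.
// "pdfSearchForm" is a hypothetical id you would add to the <form> tag.
if (typeof document !== "undefined") {
  const form = document.getElementById("pdfSearchForm");
  if (form) {
    form.addEventListener("submit", function () {
      const box = form.elements["q"];
      box.value = buildQuery(box.value, "yoursite.com");
    });
  }
}
```

Because the field is sent with a normal GET form submission, the browser takes care of URL-encoding the combined string.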

You can store the contents of a PDF in a BLOB type field in an MSSQL database, which then makes it searchable. A product my company has used to do this is FileUp by SoftArtisans, but that is an ASP/ASP.NET product.

If that doesn’t help you out maybe search Google for “pdf sql blob.”

How do I get the contents of each PDF out of the PDF and into the BLOB?

As friedo said, you can just use the power of Google. Here’s a combo HTML/JavaScript snippet you can use:


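<script type="text/javascript">
// Redirect to a Google search limited to PDFs on one site.
function SearchPDF()
{
	var site = "www.yoursite.com"; // replace with your own domain
	var query = document.getElementById("txtPDFSearch").value
		+ " site:" + site + " filetype:pdf";
	// Encode the query so spaces, "&", "#" and similar characters survive in the URL.
	window.location.href = "https://www.google.com/search?hl=en&q="
		+ encodeURIComponent(query);
}
</script>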

<input type="text" id="txtPDFSearch" />
<input type="button" id="btnPDFSearch" 
	onclick="SearchPDF();" value="Search for PDFs" />

Just change ‘www.yoursite.com’ to the name of your own site, and you’re ready to go, as long as Google has indexed your PDFs. It may take several days or weeks between uploading a new PDF and Google including it in its search index.

You may find a static index generation tool that you can run whenever you update your PDFs, but such tools may be pricey. You could also get Google, Yahoo, or your search engine of choice to index your website and provide a custom search box, but you lose some level of privacy.

For really dynamic stuff, you need to install a search engine on your website. Basically, part of your website has to move from being static (just a bunch of HTML and other files delivered from a web server application) to being dynamic, with code that runs on the web server, indexes the data, and delivers the results to the clients. Alternatively, some dynamic scripts use the Google API to deliver search results.

So you need to look at your web server itself to see if it is capable of dynamic code execution and (maybe) database access. There are a number of PHP scripts that can do this sort of thing; it really depends on how complex you want to get.
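To make the static versus dynamic difference concrete, here is a very small sketch of the dynamic side, written in JavaScript for Node rather than PHP. It assumes your host lets you run Node at all, and it uses a made-up in-memory list in place of real text extracted from PDFs; the file names, port, and the "--serve" flag are all hypothetical.

```javascript
const http = require("http");

// Hypothetical stand-in for text extracted from the uploaded PDFs.
const documents = [
  { file: "issue-001.pdf", text: "Council approves new library hours" },
  { file: "issue-002.pdf", text: "Library fundraiser set for June" },
];

// Return the file names whose text contains the search term (case-insensitive).
function findMatches(term, docs) {
  const needle = term.trim().toLowerCase();
  if (needle === "") return [];
  return docs
    .filter((d) => d.text.toLowerCase().includes(needle))
    .map((d) => d.file);
}

// Answer GET /search?q=term with a JSON list of matching files.
function handleRequest(req, res) {
  const url = new URL(req.url, "http://localhost");
  if (url.pathname !== "/search") {
    res.writeHead(404, { "Content-Type": "text/plain" });
    res.end("Not found");
    return;
  }
  const matches = findMatches(url.searchParams.get("q") || "", documents);
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify(matches));
}

// Start listening only when launched with the (made-up) --serve flag.
if (process.argv.includes("--serve")) {
  http.createServer(handleRequest).listen(8080);
}
```

Scanning every document on each request is fine for a handful of files, but with hundreds of PDFs you would want a proper index or an existing search package behind the same kind of endpoint.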

on preview: what everyone else said.

Si

If you’re running IIS, you can do it using Index Server and an extension that will allow it to index PDFs. Then you’ll have to create some interface for searching the index.

That’s what that software does. But using that software assumes you have your own server (or installation rights on it) and the ability to hook up the code to make it do what you want.

Why don’t you just have them download the freaking PDFs and search them in Acrobat Reader?

Well that’s the point…readers will not know which PDF they want without a search. Otherwise they’d have to rummage through hundreds of PDFs manually looking for what they want (not gonna happen).

And if you’re using Apache, you can use Lucene and Nutch. It’s going to take a bit of setup, but it’s not too bad.

If you are looking at hundreds of PDFs then you really need to look at some sort of content management with search facilities - something active on the server - Zope or Plone or Nuxeo CPS or any one of a number of similar solutions. The advantage would be that you get content control, access management, workflows and publishing approvals as well as indexing and presentation. Just a thought, anyhow.

Is this for an intranet or web access, and what platform are you using to serve the files?

Si

We’re archiving old newsletters, because our readers say they like going back to them as reference. The newsletter is published daily.

What’s the web server, and how are you managing them currently?

Si

There is no online archive currently. We have some blogging software that seems like it would do the trick, but the formatting gets screwed up pretty badly and the effort to clean up each one is more than we’d want to do. So I figured making them PDFs would be a possible solution.

Blogging software is Serendipity.
Also playing with Joomla as a possibility.

There is a PDF Indexing Extension for Joomla that may help. If your newsletters are HTML already, you may want to look at using a stream editor (awk or sed) to do some automated reformatting. If they are some other format (like Word) then your PDF approach may be best.

Si