How can I make PDFs searchable on my website?

Whack-a-Mole · July 22, 2008, 1:58pm

I want to upload PDF files to my website. This will be a continuous process with new PDFs going up nearly every day.

I know I can make a PDF itself searchable but how do I make it so someone can search all PDFs using a search window in the browser (so it will look at all PDFs to see where a given word occurs)?

I do not expect a full-on explanation here (unless it happens to be absurdly simple). I'm just having trouble Googling this out myself or getting help from Adobe help files and such, so pointers in the right direction would be great.

Thanks in advance!

friedo · July 22, 2008, 2:43pm

To do it on your end would require maintaining an index of the text in all your PDF files, and using a web application as a front-end to search the index. That’s a lot of work.
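To give a rough idea of what "maintaining an index" means, here is a minimal sketch of an inverted index in JavaScript. It assumes the text has already been pulled out of each PDF by some other tool. The function names (`tokenize`, `buildIndex`, `search`) and the sample data are made up for illustration, and a real setup would also need storage, ranking, and a web front-end.

```javascript
// Hypothetical in-memory inverted index: word -> set of PDF file names.
// Assumes the text of each PDF was already extracted by some other tool.
function tokenize(text) {
  // Lowercase, then split on anything that is not a letter or digit.
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

function buildIndex(docs) {
  // docs: { "file.pdf": "extracted text", ... }
  const index = new Map();
  for (const [name, text] of Object.entries(docs)) {
    for (const word of tokenize(text)) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(name);
    }
  }
  return index;
}

function search(index, query) {
  // Return the files that contain every word in the query (AND search).
  const words = tokenize(query);
  if (words.length === 0) return [];
  let hits = null;
  for (const word of words) {
    const files = index.get(word) || new Set();
    hits = hits === null
      ? new Set(files)
      : new Set([...hits].filter((f) => files.has(f)));
  }
  return [...hits].sort();
}

// Made-up sample data standing in for extracted PDF text.
const docs = {
  "2008-07-21.pdf": "Budget vote delayed until August",
  "2008-07-22.pdf": "Council approves budget after long debate",
};
const index = buildIndex(docs);
console.log(search(index, "budget"));        // both files match
console.log(search(index, "budget debate")); // only 2008-07-22.pdf matches
```

Every new PDF means extracting its text and updating the index, which is where most of the ongoing work comes from.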

A much easier way, if your site is publicly accessible, is to simply abuse Google, which will happily index the PDFs on your site anyway.

You could have a search form like this:



<form action="http://www.google.com/search" method="GET">
Search text: <input type="text" name="q" value="                            site:yoursite.com filetype:pdf" />
<input type="submit" />
</form>

Notice that the form input contains a bunch of spaces followed by Google search parameters to limit the search to your site and to files of type PDF. If you want to hide those details from the user, you could use a JavaScript hook to append the search parameters to the search string before the form is submitted.
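A minimal sketch of such a hook follows. It assumes you give the form an `id` of `pdfSearchForm` and leave the `value` of the `q` input empty. Both the `id` and the helper name `buildPdfQuery` are made up for this example, and `yoursite.com` is a placeholder.

```javascript
// Hypothetical helper: builds the query text Google will receive.
function buildPdfQuery(userText, site) {
  return userText.trim() + " site:" + site + " filetype:pdf";
}

// Browser-only wiring; skipped when there is no DOM (e.g. under Node).
if (typeof document !== "undefined") {
  // Assumes the form was given id="pdfSearchForm" (not in the form above).
  var form = document.getElementById("pdfSearchForm");
  if (form) {
    form.addEventListener("submit", function () {
      // Rewrite the "q" field just before the browser sends the form.
      var box = form.elements["q"];
      box.value = buildPdfQuery(box.value, "yoursite.com");
    });
  }
}
```

Because the form itself is submitted with GET, the browser URL-encodes the final value, so the hook only has to append the extra terms. The script should run after the form exists on the page, for example by placing it just below the form.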

ZipperJJ · July 22, 2008, 2:45pm

You can store the contents of a PDF in a BLOB-type field in an MSSQL database, which then makes it searchable. A product my company has used to do this is FileUp by SoftArtisans. But that is an ASP/ASP.NET product.

If that doesn’t help you out maybe search Google for “pdf sql blob.”

Whack-a-Mole · July 22, 2008, 2:54pm

How do I get the contents of each PDF out of the PDF and to the blob?

glenarma · July 22, 2008, 2:57pm

As friedo said, you can just use the power of Google. Here’s a combo HTML/JavaScript snippet you can use:


<script type="text/javascript">
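function SearchPDF()
{
	var site = "www.yoursite.com";
	// Build the full query, then URL-encode it so that spaces, "&", "#"
	// and similar characters in the user's text cannot break the URL.
	var query = document.getElementById("txtPDFSearch").value
		+ " site:" + site + " filetype:pdf";
	window.location.href = "http://www.google.com/search?hl=en&q="
		+ encodeURIComponent(query);
}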
</script>

<input type="text" id="txtPDFSearch" />
<input type="button" id="btnPDFSearch" 
	onclick="SearchPDF();" value="Search for PDFs" />

Just change ‘www.yoursite.com’ to the name of your own site and you’re ready to go, as long as Google has indexed your PDFs. It may take several days or weeks between uploading a new PDF and Google including it in its search index.

si_blakely · July 22, 2008, 3:02pm

You may find a static index generation tool that you can run whenever you update your PDFs, but these can be pricey. You could also get Google, Yahoo, or your search engine of choice to index your website, and provide a custom search box, but you lose some level of privacy.

For really dynamic stuff, you need to install a search engine on your website. Basically, part of your website has to move from being static (just a bunch of HTML and other files delivered from a web server application) to being dynamic, with code that runs on the web server, indexes the data, and delivers the results to the clients. Alternatively, some dynamic scripts use the Google API to deliver search results.

So you need to look at your web server itself to see if it is capable of dynamic code execution and (maybe) database access. There are a number of PHP scripts that can do this sort of thing. It really depends on how complex you want to get.

on preview: what everyone else said.

Si

Turek · July 22, 2008, 4:04pm

If you’re running IIS, you can do it using Index Server and an extension that will allow it to index PDFs. Then you’ll have to create some interface for searching the index.

ZipperJJ · July 22, 2008, 4:34pm

That’s what that software does. But using that software assumes you have your own server (or installation rights on it) and the ability to hook up the code to make it do what you want.

AHunter3 · July 22, 2008, 5:44pm

Why don’t you just have them download the freaking PDFs and search them in Acrobat Reader?

Whack-a-Mole · July 22, 2008, 5:57pm

Well that’s the point…readers will not know which PDF they want without a search. Otherwise they’d have to rummage through hundreds of PDFs manually looking for what they want (not gonna happen).

MrSquishy · July 22, 2008, 6:28pm

And if you’re using Apache, you can use Lucene and Nutch. It’s going to take a bit of setup, but it’s not too bad.

si_blakely · July 22, 2008, 7:42pm

If you are looking at hundreds of PDFs then you really need to look at some sort of content management with search facilities - something active on the server - Zope or Plone or Nuxeo CPS or any one of a number of similar solutions. The advantage would be that you get content control, access management, workflows and publishing approvals as well as indexing and presentation. Just a thought, anyhow.

Is this for an intranet or web access, and what platform are you using to serve the files?

Si

Whack-a-Mole · July 22, 2008, 7:45pm

Archiving old newsletters but our readers say they like going back to them as reference. Newsletter is published daily.

si_blakely · July 22, 2008, 7:47pm

What’s the web server, and how are you managing them currently?

Si

Whack-a-Mole · July 22, 2008, 8:21pm

There is no online archive currently. We have some blogging software that seems like it would do the trick, but the formatting gets screwed up pretty badly and the effort to clean up each one is more than we’d want to do. So I figured making them PDFs would be a possible solution.

Blogging software is Serendipity.
Also playing with Joomla as a possibility.

si_blakely · July 22, 2008, 8:49pm

There is a PDF Indexing Extension for Joomla that may help. If your newsletters are HTML already, you may want to look at using a stream editor (awk or sed) to do some automated reformatting. If they are some other format (like Word) then your PDF approach may be best.

Si
