Indexing .Doc and .Pdf files

Myka · April 26, 2004, 12:55am

I want to set up a search engine for a local web server (no need to index foreign sites).

I have found a couple that are promising, but none of them index PDF or DOC files, and I really need them to do this. The only one I’ve found that comes close is phpdig.net, and that one doesn’t run under Windows.

Does anyone know of one that works independently of the platform?

Or perhaps how to do it myself in PHP? I wouldn’t mind learning something new. (I can already do plaintext files)

I would very much appreciate any info.

Tarential · April 26, 2004, 1:47am

Doing it yourself would be a very worthwhile experience in my opinion. I don’t know much about that myself, but I’d definitely not pass up a chance to learn if I needed a script like that as you do.

Myka · April 26, 2004, 3:34am

Yeah, but the problem is, I have no idea how to even start getting information out of a PDF or DOC.

Tarential · April 26, 2004, 3:26pm

Doc files would be harder. I’d start at the OpenOffice site/forum to see if they have published what they have found about the Doc format.

As for PDF, it is, and always has been, an open format. Adobe releases the specifications, and you can get them from the site.

Myka · April 26, 2004, 6:58pm

Thanks, I found out how to do it for PDF. You’re right about Office, but I think there is a way to set up the server itself to interpret the DOC so that the code doesn’t have to do it.

Tarential · April 26, 2004, 9:11pm

You are probably quite right. I bet they have some sort of a parsing engine out there that would do it for you, including stripping formatting that you don’t need. I’d google for it.

aixccapt99 · April 26, 2004, 10:21pm

One program that will help (which you should find with Google) is called antiword – it’s a command-line utility for Linux/Unix systems (I don’t know about a Windows version) that extracts the plain text of a .doc file. Perfect for your indexing application.
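
For what it's worth, here is a rough sketch of how that could be wired into an indexer. It is written in Python rather than PHP (PHP's shell_exec could play the same role), and it assumes antiword is on the PATH and prints the document text to standard output when given a file name. The function names (extract_doc_text, tokenize, add_to_index, build_index) and the in-memory word-to-documents index are made up for illustration; they are not part of antiword or of any existing search engine.

```python
import re
import shutil
import subprocess
from collections import defaultdict


def extract_doc_text(path, antiword="antiword"):
    """Return the plain text of a .doc file, or None if extraction is not possible.

    Assumes the antiword executable writes the text to standard output.
    """
    if shutil.which(antiword) is None:
        return None  # antiword is not installed or not on PATH
    result = subprocess.run(
        [antiword, path],
        capture_output=True,
        text=True,
        errors="replace",  # do not crash on bytes that fail to decode
    )
    if result.returncode != 0:
        return None  # antiword could not read the file
    return result.stdout


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def add_to_index(index, doc_id, text):
    """Record that every word in text occurs in the document doc_id."""
    for word in tokenize(text):
        index[word].add(doc_id)


def build_index(paths):
    """Build a word -> set-of-paths index from a list of .doc files."""
    index = defaultdict(set)
    for path in paths:
        text = extract_doc_text(path)
        if text is not None:  # skip files that could not be converted
            add_to_index(index, path, text)
    return index
```

Calling build_index(["report.doc", "notes.doc"]) (hypothetical file names) would give a dictionary you can look words up in. A real search engine would store the index on disk and rank results, but the extraction step stays the same.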

Myka · April 28, 2004, 6:21am

OMG. That antiword program did the trick! Thanks so much.

Do you know of a similar utility for PDF files? I can’t seem to figure out how to read data FROM a PDF. Writing TO them seems to be easy.

aixccapt99 · April 29, 2004, 1:04am

A quick Google search turned up:

http://research.compaq.com/SRC/virtualp … otext.html

http://www.geocities.com/SiliconValley/ … yDuty.html

http://www.cs.berkeley.edu/~phelps/Multivalent