Indexing .Doc and .Pdf files


#1

I want to set up a search engine for a local web server (no need to index foreign sites).

I have found a couple that are promising, but none of them index PDF or DOC files, and I really need them to do this. The only one I’ve found that comes close is phpdig.net, and that one doesn’t run under Windows.

Does anyone know of one that will work platform-independently?

Or perhaps how to do it myself in PHP? I wouldn’t mind learning something new. (I can already do plaintext files)

I would very much appreciate any info.


#2

Doing it yourself would be a very worthwhile experience in my opinion. I don’t know much about that myself, but I’d definitely not pass up a chance to learn if the need for a script presented itself in my life as it has in yours.


#3

Yeah, but the problem is, I have no idea how to even start getting information out of a PDF or DOC.


#4

Doc files would be harder. I’d start at the OpenOffice site/forum to see if they have published what they have found about the Doc format.

As for PDF, it is, and always has been, an open format. Adobe releases the specifications, and you can get them from the site.


#5

Thanks, I found how to do it for PDF. You’re right about Office, but I think there is a way to set up the server itself to interpret the DOC so that the code doesn’t have to do it.


#6

You are probably quite right. I bet they have some sort of a parsing engine out there that would do it for you, including stripping formatting that you don’t need. I’d google for it.


#7

One program that will help (which you should find with Google) is called antiword. It’s a command-line utility for Linux/Unix systems (don’t know about a Windows version) that extracts the plain text of a .doc file. Perfect for your indexing application.
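
To make that concrete, here is a rough sketch of how antiword's output could feed a simple word index. It is written in Python rather than PHP purely for illustration, and it assumes antiword is on the PATH and prints the document's text to standard output when given a file name. The helper names (extract_doc_text, add_to_index, search) are made up for this example and are not part of antiword.

```python
import re
import shutil
import subprocess
from collections import defaultdict


def extract_doc_text(path, antiword="antiword"):
    """Return the plain text of a .doc file by running antiword on it.

    Assumption: `antiword <file>` writes the extracted text to stdout.
    """
    if shutil.which(antiword) is None:
        raise FileNotFoundError(f"{antiword} not found on PATH")
    result = subprocess.run([antiword, path], capture_output=True, check=True)
    # Decode leniently: the output encoding depends on the local setup.
    return result.stdout.decode("utf-8", errors="replace")


def add_to_index(index, doc_id, text):
    """Record, for each lowercase word in text, that doc_id contains it."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        index[word].add(doc_id)


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)
```

To use it, you would create the index with `index = defaultdict(set)`, call `add_to_index(index, "report.doc", extract_doc_text("report.doc"))` for each file (report.doc being a placeholder name), and then query it with `search(index, "some words")`. A real search engine would also want ranking and an index stored on disk, which this sketch leaves out.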


#8

OMG. That antiword program did the trick! Thanks so much.

Do you know of a similar utility for PDF files? I can’t seem to figure out how to read data FROM a PDF. Writing TO them seems to be easy.


#9

A little google search turned up:

http://research.compaq.com/SRC/virtualp … otext.html

http://www.geocities.com/SiliconValley/ … yDuty.html

http://www.cs.berkeley.edu/~phelps/Multivalent