Pdf Searching Sucks

From PdfSucks

I understand the first reason (slow), and to a lesser extent the second (unusual interface), but this one doesn't make much sense to me. Isn't PDF a standard in and of itself? If so, how important is it that it should agree with other standards?

It's important if you want to search for text in files.

What standards should it comply with then (given it's got to handle pretty arbitrary markup)?

It could store the text data separate from its presentation markup, rather than intermixing them.

Very tricky to do, given the arbitrary nature of what it's given. Remember, basically it's a "printer" as far as the application is concerned. It doesn't get enough information to separate the text from the presentation. To do a better job, it needs a lot more information from the application about the structure of the text, and that means it could work only with applications that were PDF-aware. To take a simple example, if there's a drop caps at the start of the paragraph, many applications will send the large capital in some quite separate way from the body of the paragraph.

It isn't that tricky. The text is in the file someplace. It could be stored contiguously in a standard encoding (Unicode) and referenced by the presentation description.

No, it couldn't, because the application (Acrobat, normally) that gets the text can't always tell that the text is contiguous.

PDF only works with applications that are PDF aware, so I'm not sure what you mean by that.

I'm talking about creating the PDF in the first place, not using it afterwards. Word (for instance) is not PDF aware, but many PDF documents came from Word documents.

By storing the text in a standard encoding it would allow PDF documents to be easily searched.

Of course. But it's not at all easy to do that. If pigs could fly, it would be easier to get them to market :)

PDF isn't a MarkupLanguage, and would be much less useful if it were. Repeating someone else's point from the top of the page:

PDF is fine when used for its strength: preserving a document's data and presentation across different OperatingSystems. The "and presentation" is critical. If presentation is not important, other formats such as HTML and XML often serve just as well. (Never underestimate the PowerOfPlainText!)

What good is the data if I can't search it?

You can search it. Maybe not with enough power for all purposes, but enough for some. Building a database in PDF is silly, but writing a paper which is intended to be read rather than searched ought to be ok.

I'm not asking for a database, just grep. Searching inside one document is possible with a PDF aware app, but the problem comes when you've accumulated several PDF documents, say from government Web sites. If you want to find every document that mentions some phrase you're SOL as far as I can tell. Other presentation formats (like TeX and PostScript) are searchable. Why not PDF?

TeX is not a presentation format (DVI is, good luck trying to search that), and Postscript suffers the same problem as PDF. Just because your source document contained the word foobar, there's no guarantee that the Postscript file has f, o, o, b, a, r as contiguous characters in the file (in fact, if the F is a drop-cap, to continue the above example, it's almost certain not to).

Using the Acrobat Catalog, built in to AdobeAcrobatReader, you can create an Index of multiple PDF files quickly and easily (step 1: Put files in a single folder, step 2: Tools-->Catalog then click New Index, step 3: Select directory then click Build) and then use Search (see above question about Find vs Search) to perform a boolean search of the indexed files.


EditText of this page (last edited September 14, 2005) or FindPage with title or text search