I'm using openSUSE 10.3 and would like to know of command-line tools for searching for phrases in a large number of PDF files inside a directory. On Windows XP, the Explorer search can do this, but it is too slow. Are there any grep tips for this?

asked 13 Jul '10, 17:35

iceman

Please accept an answer so the question/answer can be finished. Or provide more details so we can help.

(20 Apr '11, 14:14) rfelsburg ♦



Because the text in a PDF is usually stored in compressed streams, you can't simply search it with the usual grep command. You can run strings on the file, which prints the printable ASCII sequences it finds, but that is not guaranteed to turn up the text.

A number of open source tools can help. For instance, a script built around pdftotext is easy to write. pdftotext is part of xpdf.

For openSUSE 10.3, the RPM is below.

ftp://ftp.pbone.net/mirror/ftp5.gwdg.de/pub/opensuse/repositories/home:/beyerle:/SLE10/SLE_10/i586/xpdf-tools-3.02-139.1.i586.rpm

Using pdftotext and a simple bash loop:

for filename in /path/to/pdf_files/*.pdf; do pdftotext "$filename" - | grep 'some value'; done

So that you can understand what each part of the script does, I've broken it down as well.

# expand the glob to the path of each PDF file and iterate over them, storing each in the variable $filename
for filename in /path/to/pdf_files/*.pdf; do
    # run pdftotext on "$filename" (quoted so that paths containing spaces work);
    # the '-' tells it to write to stdout, and the output is piped to grep
    pdftotext "$filename" - | grep 'some value'
done
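
One drawback of the loop above is that grep only sees the text stream, so its output does not say which PDF a match came from. A minimal sketch of one way around that, assuming a POSIX shell and that pdftotext is on the PATH, is below. The function name search_pdfs and its two arguments are made up for illustration; they are not part of xpdf.

```shell
# search_pdfs DIR PATTERN  (hypothetical helper, not part of xpdf)
# Print "file: matching line" for every line of text extracted from
# DIR/*.pdf that matches PATTERN.
search_pdfs() {
    dir=$1
    pattern=$2
    for f in "$dir"/*.pdf; do
        # If the glob matched nothing, the literal pattern is left over; skip it.
        [ -e "$f" ] || continue
        # '-' makes pdftotext write the extracted text to stdout.
        pdftotext "$f" - 2>/dev/null |
            grep -- "$pattern" |
            while IFS= read -r line; do
                printf '%s: %s\n' "$f" "$line"
            done
    done
}

# Example call (path and phrase are placeholders):
# search_pdfs /path/to/pdf_files 'some value'
```

Like the original loop, this only looks at the PDFs directly inside the given directory; it does not descend into subdirectories.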

answered 13 Jul '10, 18:07

rfelsburg ♦

What I do is use pdftotext.

One way is:

pdftotext datasheet.pdf - | grep sometext

If you have a large number of PDFs to go through, a simple script could be written to go through each PDF file.
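
As a rough sketch of what such a script might look like, the function below prints only the names of the PDFs whose text contains the phrase, ignoring case. It assumes a POSIX shell with pdftotext on the PATH; the name pdfs_containing is invented for this example.

```shell
# pdfs_containing PATTERN FILE...  (hypothetical helper)
# Print the name of each FILE whose extracted text contains PATTERN,
# ignoring case, similar in spirit to `grep -l -i` on plain text files.
pdfs_containing() {
    pattern=$1
    shift
    for f in "$@"; do
        # -q: print nothing, just report whether a match exists; -i: ignore case.
        if pdftotext "$f" - 2>/dev/null | grep -q -i -- "$pattern"; then
            printf '%s\n' "$f"
        fi
    done
}

# Example call (file names and phrase are placeholders):
# pdfs_containing sometext *.pdf
```

Listing just the file names keeps the output short when there are many PDFs; the single-file command above can then be run on each hit to see the matching lines.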

answered 13 Jul '10, 18:11

Andy

Asked: 13 Jul '10, 17:35

Seen: 2,606 times

Last updated: 20 Apr '11, 14:14
