Iterating over PDF documents

Peter Maas · Nov 11, 2004

Hi,

I'm trying to edit a PDF document line-wise. This is more difficult
than I thought, because PDF uses a mixture of all line terminators
available in *X, Mac and Win so that utilizing "for line in file"
is difficult. I tried universal newline mode, but that apparently
spoils the document. What other options do I have besides writing my
own PDF line iterator? All options have to be quick hacks, as I have
already invested some time in existing code and the deadline is
approaching fast.

Thanks for your help.

Mit freundlichen Gruessen,

Peter Maas

Just · Nov 11, 2004

Peter Maas said:
Hi,

I'm trying to edit a PDF document line-wise. This is more difficult
than I thought, because PDF uses a mixture of all line terminators
available in *X, Mac and Win so that utilizing "for line in file"
is difficult. I tried universal newline mode, but that apparently
spoils the document. What other options do I have besides writing my
own PDF line iterator? All options have to be quick hacks, as I have
already invested some time in existing code and the deadline is
approaching fast.

try this:

for line in open(path, "U"):  # universal newline mode
    ...

Just

Peter Maas · Nov 11, 2004

Just said:
try this:

for line in open(path, "U"): # universal newline mode

Thanks, Just, I tried this but the edited PDF was damaged.

Just · Nov 11, 2004

Peter Maas said:
Thanks, Just, I tried this but the edited PDF was damaged.

That's probably because PDF can also contain arbitrary binary data,
which would indeed break if all occurrences of \r\n and \r were replaced
by \n...
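
To make that concrete, here is a small sketch of what the translation does to a chunk of bytes. The sample bytes are made up for illustration (not taken from a real PDF), and translate_newlines is a hypothetical helper that only imitates the replacement described above:

```python
def translate_newlines(data: bytes) -> bytes:
    """Imitate universal newline translation: map \\r\\n and lone \\r to \\n."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Made-up sample: text-like lines wrapped around a few binary bytes that
# happen to include 0x0D and 0x0A.
sample = b"stream\r\n\x00\x0d\x0a\x0d\xff\r\nendstream\r"
translated = translate_newlines(sample)

# The binary bytes in the middle were rewritten along with the real line ends.
print(len(sample), len(translated))
print(translated != sample)
```

The translated data is shorter than the original, and the 0x0D bytes inside the binary part are gone, so whatever that binary section encoded no longer decodes the same way.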

Just

Peter Hansen · Nov 12, 2004

Peter said:
I'm trying to edit a PDF document line-wise. This is more difficult
than I thought, because PDF uses a mixture of all line terminators

It should also be pretty difficult because PDFs are binary, not
text...

(They might contain a whole lot of stuff that looks like text, but
there are binary sections mixed into many of them, and I believe
the header at least is binary. The sample files I'm looking at
definitely are, in any case. Your solution could not be general.)

-Peter

Tomas · Nov 12, 2004

Peter Maas said:
I'm trying to edit a PDF document line-wise. This is more difficult
than I thought, because PDF uses a mixture of all line terminators
available in *X, Mac and Win so that utilizing "for line in file"
is difficult.

If you just need to extract some text or search it, you can try the
pdftotext utility to convert the document(s) to plain text.
http://www.snapfiles.com/get/pdftotext.html (Windows).
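
If you want to drive it from Python, something like the following might do. It is only a sketch: it assumes pdftotext is installed and on your PATH, and pdftotext_command and pdf_to_text are hypothetical helper names, not part of any library:

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the basic command line: pdftotext <PDF-file> <text-file>."""
    return ["pdftotext", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Convert a PDF to plain text by calling the external pdftotext tool."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
```

Note that this only gets text out; it does not help with writing an edited PDF back.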

-Tomas

Follower · Nov 12, 2004

Would this Python/PDF handling library be useful to you:

<http://www.boddie.org.uk/david/Projects/Python/pdftools/>

I don't think it specifically handles re-writing, but it might be a
useful starting point.

--Phil.

Peter Maas · Nov 15, 2004

I solved the problem by writing an iterator that takes care of all
eol types without changing/deleting them. So I could edit the parts
that looked like text while leaving the binary parts alone.
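
The code itself was not posted, so the following is only a minimal sketch of that kind of iterator, not the actual solution. It assumes the whole file is read into memory in binary mode; iter_pdf_lines and edit_lines are hypothetical names:

```python
import re

# One "line" is a run of non-EOL bytes plus a single terminator
# (\r\n, lone \r, or lone \n), or a final run with no terminator at all.
_LINE_RE = re.compile(rb"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def iter_pdf_lines(data: bytes):
    """Yield each line of data with its original terminator still attached."""
    for match in _LINE_RE.finditer(data):
        yield match.group()


def edit_lines(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace old with new line by line; untouched lines pass through as-is."""
    return b"".join(line.replace(old, new) for line in iter_pdf_lines(data))


# Made-up sample mixing all three terminators and some binary bytes.
sample = b"%PDF-1.4\r/Title (Draft)\r\n\x00\xff\n"
for line in iter_pdf_lines(sample):
    print(repr(line))
```

Because every terminator is kept, joining the yielded lines gives back the input byte for byte, so any line that is not edited is written out unchanged. Binary sections may still be split wherever they happen to contain \r or \n, which is harmless as long as those pieces are left alone.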

Thanks to you all for your useful input.
