Iterating over PDF documents

Peter Maas

Hi,

I'm trying to edit a PDF document line-wise. This is more difficult
than I thought, because PDF uses a mixture of all line terminators
available in *X, Mac and Win so that utilizing "for line in file"
is difficult. I tried universal newline but that spoils the document
apparently. What other options do I have besides writing my own PDF
line iterator? All options have to be quick hacks as I have already
invested some time in existing code and the deadline is approaching
fast :)

Thanks for your help.

Best regards,

Peter Maas
 
Just

Peter Maas said:
Hi,

I'm trying to edit a PDF document line-wise. This is more difficult
than I thought, because PDF uses a mixture of all line terminators
available in *X, Mac and Win so that utilizing "for line in file"
is difficult. I tried universal newline but that spoils the document
apparently. What other options do I have besides writing my own PDF
line iterator? All options have to be quick hacks as I have already
invested some time in existing code and the deadline is approaching
fast :)

try this:

for line in open(path, "U"):  # universal newline mode
    ...

Just
 
Just

Peter Maas said:
Thanks, Just, I tried this but the edited PDF was damaged.

That's probably because PDF can also contain arbitrary binary data,
which would indeed break if all occurrences of \r\n and \r were replaced
by \n...
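
To illustrate the point, here is a small sketch in Python 3. The byte string is a made-up miniature stand-in for a PDF, not a valid file, and translate_newlines is a hypothetical helper that only mimics what universal newline reading does to the data:

```python
# Hypothetical miniature "PDF": text-like lines plus a binary payload that
# happens to contain the bytes 0x0D 0x0A and a lone 0x0D.
original = b"%PDF-1.4\r\nstream\r\n\x00\x0d\x0a\xff\x0d\x01\r\nendstream\r\n"


def translate_newlines(data: bytes) -> bytes:
    """Mimic universal newline reading: \\r\\n and lone \\r both become \\n."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


translated = translate_newlines(original)

# The binary payload between "stream" and "endstream" is no longer the same
# bytes, and the data got shorter, so any recorded byte offsets or lengths
# would now be wrong.
payload_changed = b"\x00\x0d\x0a\xff\x0d\x01" not in translated
size_changed = len(translated) != len(original)
```

Both flags come out true here: the translation touches bytes inside the payload as well as the real line ends.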

Just
 
Peter Hansen

Peter said:
I'm trying to edit a PDF document line-wise. This is more difficult
than I thought, because PDF uses a mixture of all line terminators

It should also be pretty difficult because PDFs are binary, not
text...

(They might contain a whole lot of stuff that looks like text, but
there are binary sections mixed into many of them, and I believe
the header at least is binary. The sample files I'm looking at
definitely are, in any case. Your solution could not be general.)

-Peter
 
Tomas

Peter Maas said:
I'm trying to edit a PDF document line-wise. This is more difficult
than I thought, because PDF uses a mixture of all line terminators
available in *X, Mac and Win so that utilizing "for line in file"
is difficult.

If you're just going to extract some text or do searching, you can try the
pdftotext utility and convert the document(s) to plain text.
http://www.snapfiles.com/get/pdftotext.html (Windows).
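
If you want to call it from Python, something like the following might do. This is only a sketch: it assumes a pdftotext executable is on your PATH and uses its basic "pdftotext input.pdf output.txt" form; the function names are made up for the example.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list for: pdftotext <input.pdf> <output.txt>."""
    return ["pdftotext", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Run pdftotext; raise if it is not installed or the conversion fails."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
```

Note that this only gets text out; it does not help with editing the PDF itself.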

-Tomas
 
Peter Maas

I solved the problem by writing an iterator that takes care of all
eol types without changing/deleting them. So I could edit the parts
that looked like text while leaving the binary parts alone.
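
For the archives, a minimal sketch of such an iterator in Python 3 syntax. This is not the exact code I used; the names iter_pdf_lines and edit_lines are invented for the example, and it assumes the whole file fits in memory and was read in binary mode:

```python
import re

# One "line" is a run of non-EOL bytes followed by at most one terminator.
# \r\n is listed first so that it is never split into \r and \n.
_LINE = re.compile(rb"[^\r\n]*(?:\r\n|\r|\n)?")


def iter_pdf_lines(data: bytes):
    """Yield successive lines of data, each with its original terminator."""
    for match in _LINE.finditer(data):
        line = match.group()
        if line:  # skip the empty match at the very end of the data
            yield line


def edit_lines(data: bytes, edit) -> bytes:
    """Apply edit() to every line and join the results back together."""
    return b"".join(edit(line) for line in iter_pdf_lines(data))
```

Because every terminator is kept, joining the unedited lines gives back the original bytes, so an edit function can change the text-like lines and return everything else untouched. Data would come from open(path, "rb").read(). Keep in mind that a PDF records byte offsets, so replacements that change the length of a line may still upset a reader.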

Thanks to you all for your useful input.
 
