Checking that 2 pdf are identical (md5 a solution?)

rlevesque

Hi

I am working on a program that generates various pdf files in the
/results folder.

"scenario1.pdf" results from scenario1
"scenario2.pdf" results from scenario2
etc

Once I am happy with scenario1.pdf and scenario2.pdf files, I would
like to save them in the /check folder.

Now after having developed/modified the program to produce
scenario3.pdf, I would like to be able to re-generate
files
/results/scenario1.pdf
/results/scenario2.pdf

and compare them with
/check/scenario1.pdf
/check/scenario2.pdf

I tried using the md5 module to compare these files but md5 reports
differences even though the code has *not* changed at all.

Is there a way to compare 2 pdf files generated at different times but
identical in every other respect and validate by program that the
files are identical (for all practical purposes)?
 
Peter Chant

rlevesque said:
Is there a way to compare 2 pdf files generated at different times but
identical in every other respect and validate by program that the
files are identical (for all practical purposes)?

I wonder, do the PDFs have a timestamp within them from when they are
created? That would ruin your MD5 plan.

Pete
 
Peter Otten

rlevesque said:
Is there a way to compare 2 pdf files generated at different times but
identical in every other respect and validate by program that the
files are identical (for all practical purposes)?

Here's a naive approach, but it may be good enough for your purpose.
I've printed the same small text into 1.pdf and 2.pdf

(Bad practice warning: this session is slightly doctored; I hope I haven't
introduced an error)

>>> a = open("1.pdf").read()
>>> b = open("2.pdf").read()
>>> diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
>>> len(diff)
2
>>> diff
[160, 161]
>>> a[150:170]
'0100724151412)\n>>\nen'
>>> a[140:170]
'nDate (D:20100724151412)\n>>\nen'
>>> a[130:170]
')\n/CreationDate (D:20100724151412)\n>>\nen'

OK, let's ignore "lines" starting with "/CreationDate " for our custom
comparison function:

>>> from itertools import izip_longest
>>> def equal_pdf(fa, fb):
...     with open(fa) as a:
...         with open(fb) as b:
...             for la, lb in izip_longest(a, b, fillvalue=""):
...                 if la != lb:
...                     if not la.startswith("/CreationDate "): return False
...                     if not lb.startswith("/CreationDate "): return False
...             return True
...
>>> equal_pdf("1.pdf", "2.pdf")
True
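
For readers on Python 3, the same idea can be sketched as follows. This is only a sketch under assumptions: `izip_longest` is replaced by `itertools.zip_longest`, the files are opened in binary mode because PDFs are not guaranteed to be text, and the `IGNORED_PREFIXES` name and its single entry are illustrative choices, not something the thread establishes as sufficient for every PDF.

```python
from itertools import zip_longest

# Assumption: "/CreationDate " is the only volatile line; extend as needed.
IGNORED_PREFIXES = (b"/CreationDate ",)


def equal_pdf(fa, fb, ignored=IGNORED_PREFIXES):
    """Return True if both files match line by line, ignoring volatile lines."""
    with open(fa, "rb") as a, open(fb, "rb") as b:
        # fillvalue pads the shorter file so a length difference is a mismatch.
        for la, lb in zip_longest(a, b, fillvalue=b""):
            if la == lb:
                continue
            # A mismatch is tolerated only when both lines are volatile.
            if not (la.startswith(ignored) and lb.startswith(ignored)):
                return False
    return True
```

Called as `equal_pdf("1.pdf", "2.pdf")`, it should behave like the session above for files that differ only in their creation date.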

Peter
 
rlevesque

Thanks a lot Peter.

Unfortunately there is another pair of values that does not match and
it is not obvious to me how to exclude it (as is done with the
" /CreationDate" pair).
To illustrate the problem, I have modified your code as follows:

def equal_pdf(fa, fb):
    idx = 0
    with open(fa) as a:
        with open(fb) as b:
            for la, lb in izip_longest(a, b, fillvalue=""):
                idx += 1
                #print idx
                if la != lb:
                    #if not la.startswith(" /CreationDate"):
                    print "***", idx, la, '\n', lb
                    #return False
            print "Last idx:", idx
            return True

from itertools import izip_longest
file1 = 'K:/results/Test2.pdf'
file1c = 'K:/check/Test2.pdf'
print equal_pdf(file1, file1c)

I got the following output:
*** 237 /CreationDate (D:20100724123129+05'00')

/CreationDate (D:20100724122802+05'00')

*** 324 [(,\315'\347\003_\253\325\365\265\006\)J\216\252\215) (,\315'\347\003_\253\325\365\265\006\)J\216\252\215)]

[(~s\211VIA\3426}\242XuV2\302\002) (~s\211VIA\3426}\242XuV2\302\002)]

Last idx: 331
True

As you can see, there are 331 pair comparisons and 2 of the
comparisons do not match.
Your code correctly handles the " /CreationDate" pair but the other
one does not have a common element that can be used to handle it. :-(
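
The diagnostic above can also return the mismatches instead of printing them, which makes it easier to decide which lines to exclude. The following is a minimal Python 3 sketch of that idea; the function name `differing_lines` is hypothetical, and binary mode is an assumption made because the second mismatch contains non-text bytes.

```python
from itertools import zip_longest


def differing_lines(fa, fb):
    """Return (line_number, line_a, line_b) for every mismatching line pair."""
    mismatches = []
    with open(fa, "rb") as a, open(fb, "rb") as b:
        pairs = zip_longest(a, b, fillvalue=b"")
        # Line numbers start at 1, matching the idx counter used above.
        for idx, (la, lb) in enumerate(pairs, start=1):
            if la != lb:
                mismatches.append((idx, la, lb))
    return mismatches
```

An empty list means the two files are identical line for line.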

As additional information in case it matters, the first pair compared
equals '%PDF-1.4\n'
and the pdf document is created using reportLab.

One hope I have is that item 324 which is near to the last item (331)
could be part of the 'trailing code' of the pdf file and might not
reflect actual differences between the 2 files. In other words, maybe
it would be sufficient for me to check all but the last 8 pairs...
 
Peter Otten

rlevesque said:
Unfortunately there is another pair of values that does not match and
it is not obvious to me how to exclude it (as is done with the
" /CreationDate" pair).
and the pdf document is created using reportLab.

I dug into the reportlab source and in

reportlab/rl_config.py

found the line

invariant = 0  # produces repeatable, identical PDFs with same timestamp info (for regression testing)

I suggest that you edit that file or add

from reportlab import rl_config
rl_config.invariant = True

to your code.
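
If the invariant setting does yield byte-identical files, as the comment in rl_config.py suggests, the checksum comparison from the original question becomes usable again. Here is a minimal Python 3 sketch of that check. It uses `hashlib`, the standard-library successor to the old `md5` module; the helper names `file_md5` and `same_file` are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """Return True if both files have the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)
```

With the folder layout from the question, this might be called as `same_file("/results/scenario1.pdf", "/check/scenario1.pdf")`.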

Peter
 
rlevesque

WOW!! You are good!
Your suggested solution works perfectly.

Given your expertise I will not be able to 'repay' you by helping on
Python problems but if you ever need help with SPSS related problems I
will be pleased to provide the assistance you need.
(I am the author of "SPSS Programming and Data Management" published
by SPSS Inc. (an IBM company))

Regards,

Raynald Levesque
www.spsstools.net
 
Peter Otten

rlevesque said:
WOW!! You are good!
Your suggested solution works perfectly.

Given your expertise I will not be able to 'repay' you by helping on
Python problems but if you ever need help with SPSS related problems I
will be pleased to provide the assistance you need.
(I am the author of "SPSS Programming and Data Management" published
by SPSS Inc. (an IBM company))

Relax! Assistance on c.l.py is free as in beer ;) If you feel you have to
give something back pick a question you can answer, doesn't matter who's
asking. Given that I can't answer the majority of questions posted here
chances are that I learn something from your response, too.

Peter
 
Robin Becker

...........
WOW!! You are good!
Your suggested solution works perfectly.

Given your expertise I will not be able to 'repay' you by helping on
Python problems but if you ever need help with SPSS related problems I
will be pleased to provide the assistance you need.
(I am the author of "SPSS Programming and Data Management" published
by SPSS Inc. (an IBM company))

Regards,
.......
if you have any more reportlab related queries you can also get free advice on
the reportlab mailing list at

http://two.pairlist.net/mailman/listinfo/reportlab-users
 
Aahz

Given your expertise I will not be able to 'repay' you by helping on
Python problems but if you ever need help with SPSS related problems I
will be pleased to provide the assistance you need.

Generally speaking, the community philosophy is "pay forward" -- help
someone else who needs it (either here or somewhere else). When everyone
helps other people, it all evens out.
 
