Checking that 2 pdf are identical (md5 a solution?)

Discussion in 'Python' started by rlevesque, Jul 24, 2010.

  1. rlevesque

    rlevesque Guest

    Hi

    I am working on a program that generates various pdf files in the /
    results folder.

    "scenario1.pdf" results from scenario1
    "scenario2.pdf" results from scenario2
    etc

    Once I am happy with scenario1.pdf and scenario2.pdf files, I would
    like to save them in the /check folder.

    Now after having developed/modified the program to produce
    scenario3.pdf, I would like to be able to re-generate
    files
    /results/scenario1.pdf
    /results/scenario2.pdf

    and compare them with
    /check/scenario1.pdf
    /check/scenario2.pdf

    I tried using the md5 module to compare these files but md5 reports
    differences even though the code has *not* changed at all.

    Is there a way to compare 2 pdf files generated at different time but
    identical in every other respect and validate by program that the
    files are identical (for all practical purposes)?
    rlevesque, Jul 24, 2010
    #1
  2. rlevesque

    Peter Chant Guest

    rlevesque wrote:

    > Is there a way to compare 2 pdf files generated at different time but
    > identical in every other respect and validate by program that the
    > files are identical (for all practical purposes)?


    I wonder, do the PDFs have a timestamp within them from when they are
    created? That would ruin your MD5 plan.

    Pete

    --
    http://www.petezilla.co.uk
    Peter Chant, Jul 24, 2010
    #2
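To illustrate the point about embedded timestamps, here is a minimal Python 3 sketch using the standard `hashlib` module. The two byte strings are made-up stand-ins for PDF content (they are not taken from the files in this thread), and `md5_of` is a hypothetical helper name. A single changed digit in the date is enough to produce a different digest.

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()

# Two invented PDF-like fragments that differ only in the /CreationDate value.
first = b"/Author (anonymous)\n/CreationDate (D:20100724151412)\n>>\n"
second = b"/Author (anonymous)\n/CreationDate (D:20100724151413)\n>>\n"

print(md5_of(first) == md5_of(second))  # False: one digit differs
print(md5_of(first) == md5_of(first))   # True: identical input, identical digest
```

A hash can only say "same bytes" or "different bytes"; it cannot tell which bytes differ, so it cannot ignore a timestamp by itself.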
  3. rlevesque

    Peter Otten Guest

    rlevesque wrote:

    > Is there a way to compare 2 pdf files generated at different time but
    > identical in every other respect and validate by program that the
    > files are identical (for all practical purposes)?


    Here's a naive approach, but it may be good enough for your purpose.
    I've printed the same small text into 1.pdf and 2.pdf

    (Bad practice warning: this session is slightly doctored; I hope I haven't
    introduced an error)

    >>> a = open("1.pdf").read()
    >>> b = open("2.pdf").read()
    >>> diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    >>> len(diff)
    2
    >>> diff
    [160, 161]
    >>> a[150:170]
    '0100724151412)\n>>\nen'
    >>> a[140:170]
    'nDate (D:20100724151412)\n>>\nen'
    >>> a[130:170]
    ')\n/CreationDate (D:20100724151412)\n>>\nen'
    OK, let's ignore "lines" starting with "/CreationDate " for our custom
    comparison function:

    >>> def equal_pdf(fa, fb):
    ...     with open(fa) as a:
    ...         with open(fb) as b:
    ...             for la, lb in izip_longest(a, b, fillvalue=""):
    ...                 if la != lb:
    ...                     if not la.startswith("/CreationDate "): return False
    ...                     if not lb.startswith("/CreationDate "): return False
    ...             return True
    ...
    >>> from itertools import izip_longest
    >>> equal_pdf("1.pdf", "2.pdf")
    True

    Peter
    Peter Otten, Jul 24, 2010
    #3
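The session above is Python 2 (`izip_longest`, text-mode reads). A rough Python 3 equivalent might look like the sketch below. It keeps the same logic, but opening the files in binary mode and the `ignore_prefix` parameter are assumptions added here, not part of the original post.

```python
from itertools import zip_longest

def equal_pdf(path_a, path_b, ignore_prefix=b"/CreationDate "):
    """Compare two files line by line, tolerating a difference only when
    both differing lines start with ignore_prefix."""
    # Binary mode: a PDF is not text and may contain arbitrary bytes.
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        # fillvalue=b"" makes a length mismatch show up as a differing pair.
        for line_a, line_b in zip_longest(a, b, fillvalue=b""):
            if line_a != line_b:
                if not (line_a.startswith(ignore_prefix)
                        and line_b.startswith(ignore_prefix)):
                    return False
    return True
```

Like the original, this is a naive check: it only works when the timestamp sits on its own line and nothing else in the file depends on it.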
  4. rlevesque

    rlevesque Guest

    Thanks a lot Peter.

    Unfortunately there is another pair of values that does not match, and
    it is not obvious to me how to exclude it (as is done with the
    " /CreationDate" pair).
    To illustrate the problem, I have modified your code as follows:

    def equal_pdf(fa, fb):
        idx = 0
        with open(fa) as a:
            with open(fb) as b:
                for la, lb in izip_longest(a, b, fillvalue=""):
                    idx += 1
                    #print idx
                    if la != lb:
                        #if not la.startswith(" /CreationDate"):
                        print "***", idx, la, '\n', lb
                        #return False
                print "Last idx:", idx
        return True

    from itertools import izip_longest
    file1='K/results/Test2.pdf'
    file1c='K:/check/Test2.pdf'
    print equal_pdf(file1, file1c)

    I got the following output:
    *** 237 /CreationDate (D:20100724123129+05'00')

    /CreationDate (D:20100724122802+05'00')

    *** 324 [(,\315'\347\003_\253\325\365\265\006\)J\216\252\215) (,\315'\347\003_\253\325\365\265\006\)J\216\252\215)]

    [(~s\211VIA\3426}\242XuV2\302\002) (~s\211VIA\3426}\242XuV2\302\002)]

    Last idx: 331
    True

    As you can see, there are 331 pair comparisons and 2 of the
    comparisons do not match.
    Your code correctly handles the " /CreationDate" pair but the other
    one does not have a common element that can be used to handle it. :-(

    As additional information in case it matters, the first pair compared
    equals '%PDF-1.4\n'
    and the pdf document is created using reportLab.

    One hope I have is that item 324 which is near to the last item (331)
    could be part of the 'trailing code' of the pdf file and might not
    reflect actual differences between the 2 files. In other words, maybe
    it would be sufficient for me to check all but the last 8 pairs...
    rlevesque, Jul 24, 2010
    #4
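For anyone repeating this diagnosis on Python 3, the sketch below collects the differing line pairs instead of printing them. The function name `differing_lines` is hypothetical. The second mismatch above has the shape of a PDF file identifier (the trailer's /ID entry is an array of two byte strings), but that reading is an assumption based only on the printed output.

```python
from itertools import zip_longest

def differing_lines(path_a, path_b):
    """Return (line_number, line_a, line_b) for every pair of lines that differ."""
    mismatches = []
    # Binary mode, so that non-text bytes in the PDF are compared as-is.
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        pairs = zip_longest(a, b, fillvalue=b"")
        # Count from 1 to match the idx values printed in the post above.
        for number, (line_a, line_b) in enumerate(pairs, start=1):
            if line_a != line_b:
                mismatches.append((number, line_a, line_b))
    return mismatches
```

Returning the pairs makes it easy to inspect them with `repr()` and decide whether a mismatch is metadata or a real content change.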
  5. rlevesque

    Peter Otten Guest

    rlevesque wrote:

    > Unfortunately there is an other pair of values that does not match and
    > it is not obvious to me how to exclude it (as is done with the " /
    > CreationDate" pair).


    > and the pdf document is created using reportLab.


    I dug into the reportlab source and in

    reportlab/rl_config.py

    found the line

    invariant= 0 #produces repeatable,identical PDFs with same timestamp info (for regression testing)

    I suggest that you edit that file or add

    from reportlab import rl_config
    rl_config.invariant = True

    to your code.

    Peter
    Peter Otten, Jul 24, 2010
    #5
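Once the generator produces repeatable output, a plain byte-for-byte comparison of the two folders should be enough. The sketch below uses only the standard library; the function name `changed_pdfs` and the folder arguments are hypothetical, and it assumes the reference copies live in one folder and the regenerated files in another, as described in the first post.

```python
import filecmp
from pathlib import Path

def changed_pdfs(results_dir, check_dir):
    """Return the names of PDFs in check_dir that are missing from
    results_dir or differ from the copy there byte for byte."""
    changed = []
    for reference in sorted(Path(check_dir).glob("*.pdf")):
        candidate = Path(results_dir) / reference.name
        # shallow=False forces a content comparison instead of relying on os.stat data.
        if not candidate.is_file() or not filecmp.cmp(reference, candidate, shallow=False):
            changed.append(reference.name)
    return changed
```

An empty list would mean every reference file was regenerated identically, for example `changed_pdfs("results", "check") == []`.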
  6. rlevesque

    rlevesque Guest

    WOW!! You are good!
    Your suggested solution works perfectly.

    Given your expertise I will not be able to 'repay' you by helping on
    Python problems but if you ever need help with SPSS related problems I
    will be pleased to provide the assistance you need.
    (I am the author of "SPSS Programming and Data Management" published
    by SPSS Inc. (an IBM company))

    Regards,

    Raynald Levesque
    www.spsstools.net
    rlevesque, Jul 24, 2010
    #6
  7. rlevesque

    Peter Otten Guest

    rlevesque wrote:

    > WOW!! You are good!
    > Your suggested solution works perfectly.
    >
    > Given your expertise I will not be able to 'repay' you by helping on
    > Python problems but if you ever need help with SPSS related problems I
    > will be pleased to provide the assistance you need.
    > (I am the author of "SPSS Programming and Data Management" published
    > by SPSS Inc. (an IBM company))


    Relax! Assistance on c.l.py is free as in beer ;) If you feel you have to
    give something back, pick a question you can answer; it doesn't matter who's
    asking. Given that I can't answer the majority of questions posted here,
    chances are that I'll learn something from your response, too.

    Peter
    Peter Otten, Jul 27, 2010
    #7
  8. rlevesque

    Robin Becker Guest

    ...........
    >> repeatable,identical PDFs with same timestamp info (for regression testing)
    >>
    >> I suggest that you edit that file or add
    >>
    >> from reportlab import rl_config
    >> rl_config.invariant = True
    >>
    >> to your code.
    >>
    >> Peter

    >
    > WOW!! You are good!
    > Your suggested solution works perfectly.
    >
    > Given your expertise I will not be able to 'repay' you by helping on
    > Python problems but if you ever need help with SPSS related problems I
    > will be pleased to provide the assistance you need.
    > (I am the author of "SPSS Programming and Data Management" published
    > by SPSS Inc. (an IBM company))
    >
    > Regards,

    .......
    if you have any more reportlab related queries you can also get free advice on
    the reportlab mailing list at

    http://two.pairlist.net/mailman/listinfo/reportlab-users
    --
    Robin Becker
    Robin Becker, Jul 27, 2010
    #8
  10. rlevesque

    Aahz Guest

    In article <>,
    rlevesque <> wrote:
    >
    >Given your expertise I will not be able to 'repay' you by helping on
    >Python problems but if you ever need help with SPSS related problems I
    >will be pleased to provide the assistance you need.


    Generally speaking, the community philosophy is "pay forward" -- help
    someone else who needs it (either here or somewhere else). When everyone
    helps other people, it all evens out.
    --
    Aahz () <*> http://www.pythoncraft.com/

    "....Normal is what cuts off your sixth finger and your tail..." --Siobhan
    Aahz, Aug 6, 2010
    #10