io module and pdf question

Discussion in 'Python' started by Guest, Jun 25, 2013.

  1. Guest

    Guest Guest

    Would like to get your opinion on this. Currently to get the metadata out of a pdf file, I loop through the guts of the file. I know it's not the greatest idea to do this, but I'm trying to avoid extra modules, etc.

    Adobe JavaScript was used to insert the metadata, so the added data looks something like this:

    XYZ:colorList="DarkBlue,Yellow"

    With python 2.7, it successfully loops through the file contents and I'm able to find the line that contains "XYZ:colorList".

    However, when I try to run it with python 3, it errors:

    File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

    I've done some research on this, and it looks like decoding the file as latin-1 works. I also found that if I use the io module, it will work on both python 2.7 and 3.3. For example:

    --------------
    import io
    import os

    pdfPath = '~/Desktop/test.pdf'

    colorlistData = ''

    with io.open(os.path.expanduser(pdfPath), 'r', encoding='latin-1') as f:
        for i in f:
            if 'XYZ:colorList' in i:
                colorlistData = i.split('XYZ:colorList')[1]
                break

    print(colorlistData)
    --------------

    As you can tell, I'm clueless in how exactly this works and am hoping someone can give me some insight on:
    1. Is there another way to get metadata out of a pdf without having to install another module?
    2. Is it safe to assume pdf files should always be encoded as latin-1 (when trying to read it this way)? Is there a chance they could be something else?
    3. Is the io module a good way to pursue this?

    Thanks for your help!

    Jay
    Guest, Jun 25, 2013
    #1

  2. Guest

    rusi Guest

    On Tuesday, June 25, 2013 9:48:44 AM UTC+5:30, wrote:
    > 1. Is there another way to get metadata out of a pdf without having to
    > install another module?
    > 2. Is it safe to assume pdf files should always be encoded as latin-1 (when
    > trying to read it this way)? Is there a chance they could be something else?


    If your data is binary, open the file in binary mode (mode="rb") rather than choosing a bogus encoding. You then have to make the strings you search for binary as well (b-prefix).
    Also, I am surprised that it works at all. I thought most pdfs were compressed??
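
    A minimal sketch of that advice applied to the loop from the first post. The function name `find_color_list` is hypothetical, and the sketch assumes the metadata sits uncompressed on a single line and that the value itself is plain ASCII:

```python
import os


def find_color_list(pdf_path, key=b'XYZ:colorList'):
    """Return the text following `key` on the first line containing it, or ''."""
    with open(os.path.expanduser(pdf_path), 'rb') as f:
        # Iterating a binary file yields bytes objects split on b'\n'.
        for line in f:
            if key in line:
                # Everything after the first occurrence of the key, as bytes.
                tail = line.split(key, 1)[1]
                # Assumption: the value is ASCII; other bytes are replaced.
                return tail.decode('ascii', errors='replace').strip()
    return ''
```

    Because nothing is decoded until the matching line has been found, the binary parts of the file never reach a codec, so no encoding has to be guessed for them.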

    > 3. Is the io module a good way to pursue this?


    The docs say:
    > The io module provides the Python interfaces to stream handling. Under Python
    > 2.x, this is proposed as an alternative to the built-in file object, but in
    > Python 3.x it is the default interface to access files and streams.


    So I guess no point using io for python 3??
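
    A quick check of what the docs describe, assuming it is run under Python 3. There the built-in open() and io.open() are the same object, so io.open only matters for code that must also run on 2.7:

```python
import io

# In Python 3 the built-in open() is io.open(); under Python 2.7 the two differ.
same_function = io.open is open
print(same_function)
```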
    rusi, Jun 25, 2013
    #2

  3. Guest

    Guest

    On Tuesday, June 25, 2013 06:18:44 UTC+2, wrote:
    > Would like to get your opinion on this. Currently to get the metadata out of a pdf file, I loop through the guts of the file. I know it's not the greatest idea to do this, but I'm trying to avoid extra modules, etc.
    >
    > Adobe javascript was used to insert the metadata, so the added data looks something like this:
    >
    > XYZ:colorList="DarkBlue,Yellow"
    >
    > With python 2.7, it successfully loops through the file contents and I'm able to find the line that contains "XYZ:colorList".
    >
    > However, when I try to run it with python 3, it errors:
    >
    > File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/codecs.py", line 300, in decode
    > (result, consumed) = self._buffer_decode(data, self.errors, final)
    > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
    >
    > I've done some research on this, and it looks like encoding it to latin-1 works. I also found that if I use the io module, it will work on both python 2.7 and 3.3. For example:
    >
    > --------------
    > import io
    > import os
    >
    > pdfPath = '~/Desktop/test.pdf'
    >
    > colorlistData = ''
    >
    > with io.open(os.path.expanduser(pdfPath), 'r', encoding='latin-1') as f:
    >     for i in f:
    >         if 'XYZ:colorList' in i:
    >             colorlistData = i.split('XYZ:colorList')[1]
    >             break
    >
    > print(colorlistData)
    > --------------
    >
    > As you can tell, I'm clueless in how exactly this works and am hoping someone can give me some insight on:
    > 1. Is there another way to get metadata out of a pdf without having to install another module?
    > 2. Is it safe to assume pdf files should always be encoded as latin-1 (when trying to read it this way)? Is there a chance they could be something else?
    > 3. Is the io module a good way to pursue this?
    >
    > Thanks for your help!
    >
    > Jay


    -----------


    Forget latin-1.
    There is nothing wrong with trying to get such information
    by reading a pdf file in binary mode. What matters is
    knowing exactly what you are searching for and
    doing the work correctly.
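
    A small sketch of why latin-1 appeared to "work" in the first post. The header bytes below are an assumption (a typical PDF header followed by a binary comment line), not taken from the poster's file:

```python
# Assumption: a typical PDF header, with a binary comment after the version line.
header = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\n'

try:
    header.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    # 0xe2 announces a multi-byte UTF-8 sequence, but 0xe3 cannot continue it.
    utf8_error = exc

# latin-1 maps every byte value 0..255 to the code point with the same number,
# so decoding can never fail and re-encoding returns the original bytes.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode('latin-1').encode('latin-1')

print(utf8_error)
print(round_trip == all_bytes)
```

    The absence of an error therefore says nothing about how the text inside the file is really encoded, which is the reason for working on the raw bytes instead.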

    A complete example, using a pdf file I produced, hypermeta.pdf,
    which contains the string "abcé€" as its Subject metadata.
    pdf version: 1.4
    producer: LaTeX with hyperref package
    (personal comment: "xdvipdfmx")
    Python 3.2

    >>> with open('hypermeta.pdf', 'rb') as fo:
    ...     r = fo.read()
    ...
    >>> p1 = r.find(b'Subject<')
    >>> p1
    4516
    >>> p2 = r.find(b'>', p1)
    >>> p2
    4548
    >>> rr = r[p1:p2+1]
    >>> rr
    b'Subject<feff00610062006300e920ac>'
    >>> rrr = rr[len(b'Subject<'):-1]
    >>> rrr
    b'feff00610062006300e920ac'
    >>> # decoding the information
    >>> rrr = rrr.decode('ascii')
    >>> rrr
    'feff00610062006300e920ac'
    >>> i = 0
    >>> a = []
    >>> while i < len(rrr):
    ...     t = rrr[i:i+4]
    ...     a.append(t)
    ...     i += 4
    ...
    >>> a
    ['feff', '0061', '0062', '0063', '00e9', '20ac']
    >>> b = [(int(e, 16) for e in a]
      File "<eta last command>", line 1
        b = [(int(e, 16) for e in a]
                                   ^
    SyntaxError: invalid syntax
    >>> # oops, error allowed
    >>> b = [int(e, 16) for e in a]
    >>> b
    [65279, 97, 98, 99, 233, 8364]
    >>> c = [chr(e) for e in b]
    >>> c
    ['\ufeff', 'a', 'b', 'c', 'é', '€']
    >>> # result
    >>> d = ''.join(c)
    >>> d
    '\ufeffabcé€'
    >>> d = d[1:]
    >>> d
    'abcé€'
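
    The manual steps in the session above can be condensed into a shorter sketch. The helper name `decode_pdf_hex_string` is hypothetical, and it assumes the hex string holds UTF-16 text that starts with a byte order mark, as in this file:

```python
def decode_pdf_hex_string(hex_bytes):
    """Decode the hex digits of a PDF hex string holding UTF-16 text with a BOM."""
    # Turn the ASCII hex digits into the raw bytes they describe.
    raw = bytes.fromhex(hex_bytes.decode('ascii'))
    # The 'utf-16' codec reads the leading BOM (fe ff = big-endian) and drops it.
    return raw.decode('utf-16')


subject = decode_pdf_hex_string(b'feff00610062006300e920ac')
print(subject == 'abc\u00e9\u20ac')
```

    Using the codec rather than chr() on each group of four digits also handles characters outside the Basic Multilingual Plane, which UTF-16 stores as surrogate pairs.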


    As Christian Gollwitzer pointed out, not all objects in a pdf
    are encoded in that way. Do not expect to get the content,
    the "text", that way.
    When built with Unicode technology, the text of a pdf is
    composed of a *unique* set of abstract IDs, constructed with
    the help of the Unicode code point table and the properties
    of the font (OpenType) used in that pdf. This is equivalent to
    the utf-8/16/32 transformations in "plain Unicode".

    Luckily for the crowd, in 2013, there are people (devs) who
    understand the coding of characters, Unicode, and how
    to use it.

    jmf
    , Jun 26, 2013
    #3
