Re: More Help with python .find function

Discussion in 'Python' started by Steven D'Aprano, Jan 8, 2011.

  1. On Fri, 07 Jan 2011 22:43:54 -0600, Keith Anthony wrote:

    > My previous question asked how to read a file into a structure a line at
    > a time. Figured it out. Now I'm trying to use .find to separate out
    > the PDF objects. (See code) PROBLEM/QUESTION: My call to lines.find
    > does NOT find all instances of endobj. Any help available? Any
    > insights?
    >
    > #!/usr/bin/python
    >
    > inputfile = file('sample.pdf','rb')  # This is PDF with which we will work
    > lines = inputfile.readlines()        # read file one line at a time


    That's incorrect. readlines() reads the entire file in one go, and splits
    it into individual lines.
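To make the difference concrete, here is a small sketch. It is written for Python 3 (the code in this thread is Python 2) and uses io.BytesIO with made-up contents as a stand-in for a file opened in 'rb' mode, so those names and data are assumptions for illustration only.

```python
import io

# Made-up contents; io.BytesIO stands in for a file opened in 'rb' mode.
data = b"first line\nsecond line\nthird line\n"

# readlines() consumes the whole stream and returns a list of every line.
all_lines = io.BytesIO(data).readlines()

# Iterating over the file object hands back one line per step instead.
stream = io.BytesIO(data)
first = next(stream)   # only the first line has been consumed here
rest = stream.read()   # everything after it is still unread
```

So the comment "read file one line at a time" describes iterating over the file object, not readlines().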


    > linestart = []  # Starting address for each line
    > lineend = []    # Ending address for each line
    > linetype = []


    *raises eyebrow*

    How is an empty list a starting or ending address?

    The only thing worse than no comments where you need them is misleading
    comments. A variable called "linestart" implies that it should be a
    position, e.g. linestart = 0. Or possibly a flag.


    > print len(lines) # print number of lines
    >
    > i = 0 # define an iterator, i


    Again, 0 is not an iterator. 0 is a number.
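For what it's worth, the distinction can be checked directly. A minimal Python 3 sketch, with a made-up list standing in for the lines of the file:

```python
lines = ["a\n", "b\n", "c\n"]   # made-up stand-in for the file's lines

i = 0              # an int used as an index (a counter), not an iterator
it = iter(lines)   # an actual iterator over the list

int_is_iterator = hasattr(i, "__next__")   # ints have no __next__ method
it_is_iterator = hasattr(it, "__next__")   # iterators do

# enumerate() pairs each item with its index, so no manual counter is needed.
numbered = list(enumerate(lines))
```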


    > addr = 0 # and address pointer
    >
    > while i < len(lines): # Go through each line
    >     linestart = linestart + [addr]
    >     length = len(lines)
    >     lineend = lineend + [addr + (length-1)]
    >     addr = addr + length
    >     i = i + 1


    Complicated and confusing and not the way to do it in Python. Something
    like this is much simpler:


    linetypes = []  # note plural
    inputfile = open('sample.pdf','rb')  # Don't use file, use open.

    for line_number, line in enumerate(inputfile):
        # Process one line at a time. No need for that nonsense with manually
        # tracked line numbers, enumerate() does that for us.
        # No need to initialise linetypes.
        status = 'normal'
        i = line.find(' obj')
        if i >= 0:
            print "Object found at offset %d in line %d" % (i, line_number)
            status = 'object'
        i = line.find('endobj')
        if i >= 0:
            print "endobj found at offset %d in line %d" % (i, line_number)
            if status == 'normal': status = 'endobj'
            else: status = 'object & endobj'  # both found on the one line
        linetypes.append(status)
        # What if obj or endobj exist more than once in a line?
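One way to handle repeats is to call find() in a loop, passing the optional start argument so each search resumes after the previous hit. The helper name find_all and the sample line below are hypothetical, and the sketch is written for Python 3 with bytes:

```python
def find_all(text, token):
    """Return every offset at which token occurs in text (non-overlapping)."""
    offsets = []
    start = text.find(token)
    while start != -1:
        offsets.append(start)
        # Resume the search just past the match we have recorded.
        start = text.find(token, start + len(token))
    return offsets

# Made-up line holding two complete objects.
line = b"1 0 obj<<>>endobj 2 0 obj<<>>endobj\n"

obj_offsets = find_all(line, b" obj")
endobj_offsets = find_all(line, b"endobj")
```

Note that searching for ' obj' with the leading space, as the code above does, avoids matching the "obj" inside "endobj".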



    One last thing... if PDF files are a binary format, what makes you think
    that they can be processed line-by-line? They may not have lines, except
    by accident.
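A rough sketch of the alternative: read the file as one block of bytes and search that, ignoring lines altogether. The data below is a made-up stand-in for open('sample.pdf', 'rb').read(), deliberately using bare carriage returns as line ends (which PDF permits), and the code is Python 3. It also ignores the fact that "endobj" could turn up inside stream data, so treat it as a starting point only.

```python
import re

# Made-up stand-in for open('sample.pdf', 'rb').read(); note the bare \r line ends.
data = b"%PDF-1.4\r1 0 obj\r<< >>\rendobj\r2 0 obj\r<< >>\rendobj\r"

# Search the whole blob; offsets are relative to the start of the file.
endobj_offsets = [m.start() for m in re.finditer(rb"\bendobj\b", data)]

# Splitting on \n finds no line breaks at all in this data.
newline_split = data.split(b"\n")
```

With data like this, a line-by-line approach sees the whole file as a single "line", which is one way to end up with confusing results from find().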


    --
    Steven
     
