Re: how to read the last line of a huge file???

Discussion in 'Python' started by MRAB, Jan 26, 2011.

  1. MRAB

    MRAB Guest

    On 26/01/2011 10:59, Xavier Heruacles wrote:
    > I have to do some log processing which is usually huge. The length of each
    > line is variable. How can I get the last line?? Don't tell me to use
    > readlines or something like linecache...
    >

    Seek to somewhere near the end and then use readlines(). If you
    get fewer than 2 lines then you can't be sure that you have the entire
    last line, so seek a little farther from the end and try again.
    MRAB, Jan 26, 2011
    #1

  2. Alan Meyer

    Alan Meyer Guest

    On 01/26/2011 04:22 PM, MRAB wrote:
    > On 26/01/2011 10:59, Xavier Heruacles wrote:
    >> I have to do some log processing which is usually huge. The length of each
    >> line is variable. How can I get the last line?? Don't tell me to use
    >> readlines or something like linecache...
    >>

    > Seek to somewhere near the end and then use readlines(). If you
    > get fewer than 2 lines then you can't be sure that you have the entire
    > last line, so seek a little farther from the end and try again.


    I think this has got to be the most efficient solution.

    You might get the source code for the open source UNIX utility "tail"
    and see how they do it. It seems to work with equal speed no matter how
    large the file is and I suspect it uses MRAB's solution, but because
    it's written in C, it probably examines each character directly rather
    than calling a library routine like readlines.

    Alan
    Alan Meyer, Feb 1, 2011
    #2

  3. Kushal Kumaran

    On Tue, Feb 1, 2011 at 9:12 AM, Alan Meyer <> wrote:
    > On 01/26/2011 04:22 PM, MRAB wrote:
    >>
    >> On 26/01/2011 10:59, Xavier Heruacles wrote:
    >>>
    >>> I have to do some log processing which is usually huge. The length of each
    >>> line is variable. How can I get the last line?? Don't tell me to use
    >>> readlines or something like linecache...
    >>>

    >> Seek to somewhere near the end and then use readlines(). If you
    >> get fewer than 2 lines then you can't be sure that you have the entire
    >> last line, so seek a little farther from the end and try again.

    >
    > I think this has got to be the most efficient solution.
    >
    > You might get the source code for the open source UNIX utility "tail" and
    > see how they do it.  It seems to work with equal speed no matter how large
    > the file is and I suspect it uses MRAB's solution, but because it's written
    > in C, it probably examines each character directly rather than calling a
    > library routine like readlines.
    >


    How about mmapping the file and using rfind?

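    import mmap
    import os

    def mapper(filename):
        # Map the whole file read-only and search backwards for newlines.
        # Note: prot=mmap.PROT_READ is Unix-only.
        with open(filename, 'rb') as f:
            mapping = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
            endIdx = mapping.rfind(b'\n')               # the trailing newline
            startIdx = mapping.rfind(b'\n', 0, endIdx)  # end of the previous line
            return mapping[startIdx + 1:endIdx]

    def seeker(filename):
        # Read from near the end, doubling the distance each time, until the
        # chunk holds at least two lines; the last one is then complete.
        offset = -10
        with open(filename, 'rb') as f:
            while True:
                f.seek(offset, os.SEEK_END)
                lines = f.readlines()
                if len(lines) >= 2:
                    return lines[-1][:-1]  # strip the trailing newline
                offset *= 2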

    In [1]: import timeit

    In [2]: timeit.timeit('finders.seeker("the-file")', 'import finders')
    Out[2]: 32.216405868530273

    In [3]: timeit.timeit('finders.mapper("the-file")', 'import finders')
    Out[3]: 16.805877208709717

    the-file is a 120M file with ~500k lines. Both functions assume the
    last line has a trailing newline. It's easy to correct if that's not
    the case. I think mmap works similarly on Windows, but I've never
    tried there.
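
    For what it's worth, here is a minimal sketch of the correction mentioned above, for a last line that has no trailing newline. The name last_line_mmap is hypothetical, and the sketch uses access=mmap.ACCESS_READ rather than prot=mmap.PROT_READ on the assumption that it should also run on Windows. It is one way to do it, not the only one.

```python
import mmap
import os


def last_line_mmap(filename):
    """Return the last line of a file as bytes, without its line ending.

    Hypothetical variant of mapper(): it tolerates a missing trailing
    newline and a file that contains only one line.
    """
    if os.path.getsize(filename) == 0:
        return b""  # mmap cannot map an empty file
    with open(filename, "rb") as f:
        mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            end = len(mapping)
            # Ignore one trailing newline, if there is one.
            if mapping[end - 1:end] == b"\n":
                end -= 1
            # rfind returns -1 when there is no earlier newline, so
            # start becomes 0 and the whole file is the last line.
            start = mapping.rfind(b"\n", 0, end) + 1
            return mapping[start:end].rstrip(b"\r")
        finally:
            mapping.close()
```

    The result is bytes; call .decode() on it with the file's encoding if you need text.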

    --
    regards,
    kushal
    Kushal Kumaran, Feb 1, 2011
    #3
  4. Thomas Philips

    Guest

    I've implemented this method of reading a file from the end, i.e.

    def seeker(filename):
        offset = -10
        with open(filename) as f:
            while True:
                f.seek(offset, os.SEEK_END)
                lines = f.readlines()
                if len(lines) >= 2:
                    return lines[-1]
                offset *= 2

    and consistently run into the following error message when Python 3.2
    (running under Pyscripter 2.4.1) tries to execute the line
    f.seek(offset,2)

    UnsupportedOperation: can't do non-zero end-relative seeks

    But offset is initialized to -10. Does anyone have any thoughts on
    what the error might be caused by?

    Thanks in advance

    Thomas Philips
    , Mar 4, 2011
    #4
  5. MRAB

    MRAB Guest

    On 04/03/2011 21:46, wrote:
    > I've implemented this method of reading a file from the end, i.e.
    >
    > def seeker(filename):
    >     offset = -10
    >     with open(filename) as f:
    >         while True:
    >             f.seek(offset, os.SEEK_END)
    >             lines = f.readlines()
    >             if len(lines) >= 2:
    >                 return lines[-1]
    >             offset *= 2
    >
    > and consistently run into the following error message when Python 3.2
    > (running under Pyscripter 2.4.1) tries to execute the line
    > f.seek(offset,2)
    >
    > UnsupportedOperation: can't do non-zero end-relative seeks
    >
    > But offset is initialized to -10. Does anyone have any thoughts on
    > what the error might be caused by?
    >

    I think it's because the file has been opened in text mode, so there's
    the encoding to consider. It may be that it's to stop you from
    accidentally seeking into the middle of a multibyte sequence, but
    there's nothing to stop you doing that when seeking relative to the
    start, for example, so it's possibly a pointless restriction.

    A workaround is not to seek relative to the end. os.path.getsize() will
    tell you the length of the file. You'll still have to watch out for
    UnicodeDecodeError when you read in case the seek was into the middle of
    a multibyte sequence. A better workaround may be to open in binary mode
    and decode the bytes explicitly; if there's a UnicodeDecodeError then
    discard the first byte and try again, etc.
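
    A rough sketch of that second workaround, assuming the file is UTF-8. The names decode_tail and last_line_text and the chunk parameter are made up for illustration. It seeks relative to the start using os.path.getsize(), reads in binary mode, and drops leading bytes until the tail decodes.

```python
import os


def decode_tail(data):
    """Decode UTF-8 bytes that may begin in the middle of a character."""
    # A UTF-8 character is at most 4 bytes long, so at most 3 leading
    # bytes can belong to a character that the seek cut in half.
    for skip in range(3):
        try:
            return data[skip:].decode("utf-8")
        except UnicodeDecodeError:
            pass
    # Last attempt; a genuinely invalid file raises here.
    return data[3:].decode("utf-8")


def last_line_text(filename, chunk=64):
    """Return the last line of a UTF-8 file as str, reading only its tail."""
    size = os.path.getsize(filename)
    with open(filename, "rb") as f:  # binary mode accepts any offset
        while True:
            start = max(0, size - chunk)
            f.seek(start)  # relative to the start, not the end
            # Strip trailing line endings (trailing blank lines are ignored).
            text = decode_tail(f.read()).rstrip("\r\n")
            head, sep, last = text.rpartition("\n")
            # A newline before the last line, or having read the whole
            # file, means the last line is complete.
            if sep or start == 0:
                return last
            chunk *= 2
```

    Doubling chunk on each pass keeps the number of reads small even when the last line is very long.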
    MRAB, Mar 5, 2011
    #5
  6. Ian Kelly

    Ian Kelly Guest

    On Fri, Mar 4, 2011 at 5:26 PM, MRAB <> wrote:
    >> UnsupportedOperation: can't do non-zero end-relative seeks
    >>
    >> But offset is initialized to -10. Does anyone have any thoughts on
    >> what the error might be caused by?
    >>

    > I think it's because the file has been opened in text mode, so there's
    > the encoding to consider. It may be that it's to stop you from
    > accidentally seeking into the middle of a multibyte sequence, but
    > there's nothing to stop you doing that when seeking relative to the
    > start, for example, so it's possibly a pointless restriction.


    I expect that's correct. The doc string from Python 2 included this nugget:

    If the file is opened in text mode, only offsets returned by
    tell() are legal.
    Use of other offsets causes undefined behavior.
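
    To make the restriction concrete, here is a small experiment for Python 3. The temporary file and the variable names are just for illustration. In text mode a non-zero end-relative seek raises io.UnsupportedOperation, while seeking to offset 0 from the end or to a value previously returned by tell() is accepted. Binary mode accepts any byte offset.

```python
import io
import os
import tempfile

# Create a small throwaway file to experiment on.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as f:
    f.write("first line\nsecond line\n")

# Text mode: a non-zero end-relative seek is rejected.
with open(path, encoding="utf-8") as f:
    try:
        f.seek(-10, os.SEEK_END)
        text_mode_error = None
    except io.UnsupportedOperation as exc:
        text_mode_error = exc

    # Seeking to the very end (offset 0) and to an offset that
    # tell() returned earlier are both allowed.
    f.seek(0)
    f.readline()
    position = f.tell()
    f.seek(0, os.SEEK_END)
    f.seek(position)
    rest = f.read()

# Binary mode: any byte offset is accepted.
with open(path, "rb") as f:
    f.seek(-10, os.SEEK_END)
    tail = f.read()

os.remove(path)
```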
    Ian Kelly, Mar 5, 2011
    #6
  7. Thomas Philips

    Guest

    Thanks for the pointer. Yes, it is a text file, but the mystery runs
    deeper: I later found that it works perfectly as written when I run it
    from IDLE or the Python shell, but it fails reliably when I run it
    from PyScripter 2.4.1 (an open source Python IDE)! So I suspect
    there's a PyScripter issue lurking in here. I'm next going to try the
    solution you propose - use only legal offsets - and then retry it
    under both IDLE and PyScripter. Question: how do I use f.tell() to
    identify if an offset is legal or illegal?

    Thanks in advance


    Thomas Philips
    , Mar 5, 2011
    #7
  8. John Nagle

    John Nagle Guest

    On 3/5/2011 10:21 AM, wrote:
    > Question: how do I use f.tell() to
    > identify if an offset is legal or illegal?


    Read backwards in binary mode, byte by byte,
    until you reach a byte which is, in binary, either

    0xxxxxxx
    11xxxxxx

    You are then at the beginning of an ASCII or UTF-8
    character. You can copy the bytes forward from there
    into an array of bytes, then apply the appropriate
    codec. This is also what you do if skipping ahead
    in a UTF-8 file, to get in sync.
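
    The byte test described above can be sketched like this, working on bytes already read into memory rather than on the file itself. The function name char_start is hypothetical. Continuation bytes have the form 10xxxxxx, so anything else starts a character.

```python
def char_start(data, pos):
    """Return the index of the first byte of the UTF-8 character
    that contains data[pos]."""
    # Step back over continuation bytes (10xxxxxx) until a byte of
    # the form 0xxxxxxx or 11xxxxxx is reached.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos
```

    Decoding data[char_start(data, pos):] then begins on a character boundary.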

    Reading the last line or lines is easier. Read backwards
    in binary until you hit an LF or CR, both of which
    are the same in ASCII and UTF-8. Copy the bytes
    forward from that point into an array of bytes, then
    apply the appropriate codec.
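
    A minimal sketch of that backwards read, with some assumptions: it reads in blocks rather than byte by byte, it looks only for LF and strips a trailing CR afterwards, and the name last_line_backwards, the block_size parameter and the UTF-8 default are invented for the example.

```python
import os


def last_line_backwards(filename, encoding="utf-8", block_size=4096):
    """Return the last line of a file, scanning backwards from the end."""
    with open(filename, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
            # Look for a line feed before the final byte, so that a
            # trailing newline is not mistaken for the line separator.
            cut = tail.rfind(b"\n", 0, len(tail) - 1)
            if cut != -1:
                tail = tail[cut + 1:]
                break
    # The bytes are only decoded once the whole line has been collected,
    # so a block boundary inside a multibyte character does no harm.
    return tail.decode(encoding).rstrip("\r\n")
```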

    John Nagle
    John Nagle, Mar 5, 2011
    #8
  9. Terry Reedy

    Terry Reedy Guest

    On 3/5/2011 1:21 PM, wrote:
    > Thanks for the pointer. Yes, it is a text file, but the mystery runs
    > deeper: I later found that it works perfectly as written when I run it
    > from IDLE or the Python shell, but it fails reliably when I run it
    > from PyScripter 2.4.1 (an open source Python IDE)! So I suspect
    > there's a PyScripter issue lurking in here. I'm next going to try the
    > solution you propose - use only legal offsets - and then retry it
    > under both IDLE and PyScripter. Question: how do I use f.tell() to
    > identify if an offset is legal or illegal?


    I do not believe you can. You have to be at a position and f.tell() will
    report it.

    Note: if a file is utf-8 encoded, and you seek to an arbitrary position
    in binary mode, it is easy to synchronize by discarding the remainder
    (if any) of a multibyte char and finding the start of the next char.
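
    As an illustration of that synchronization, assuming UTF-8 and a hypothetical helper name read_from: seek to any byte position in binary mode, then skip the leading continuation bytes (10xxxxxx) before decoding.

```python
def read_from(filename, position, encoding="utf-8"):
    """Read a UTF-8 file from an arbitrary byte position to the end."""
    with open(filename, "rb") as f:
        f.seek(position)
        data = f.read()
    # Discard the remainder of a character that the seek landed inside.
    start = 0
    while start < len(data) and (data[start] & 0xC0) == 0x80:
        start += 1
    return data[start:].decode(encoding)
```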

    --
    Terry Jan Reedy
    Terry Reedy, Mar 5, 2011
    #9
  10. Thomas Philips

    Guest

    There is a problem, and it's a Python 3.2 problem. All the solutions
    presented here work perfectly well in Python 2.7.1, and they all fail
    at exactly the same point in Python 3.2 - it's the line that tries to
    seek from the end. e.g.
    f.seek(offset, os.SEEK_END)

    I'll register this as a Python bug. Thank you, everyone, for the help
    and guidance.

    Sincerely


    Thomas Philips
    , Mar 10, 2011
    #10