Seek the one billionth line in a file containing 3 billion lines.

Discussion in 'Python' started by Sullivan WxPyQtKinter, Aug 8, 2007.

  1. I have a huge log file which contains 3,453,299,000 lines with
    different lengths. It is not possible to calculate the absolute
    position of the beginning of the one billionth line. Is there an
    efficient way to seek to the beginning of that line in Python?

    This program:
    for i in range(1000000000):
        f.readline()
    is absolutely very slow....

    Thank you so much for help.
    Sullivan WxPyQtKinter, Aug 8, 2007
    #1

  2. Sullivan WxPyQtKinter

    Paul Rubin Guest

    Sullivan WxPyQtKinter <> writes:
    > This program:
    > for i in range(1000000000):
    > f.readline()
    > is absolutely every slow....


    There are two problems:

    1) range(1000000000) builds a list of a billion elements in memory,
    which is many gigabytes and probably thrashing your machine.
    You want to use xrange instead of range, which builds an iterator
    (i.e. something that uses just a small amount of memory, and
    generates the values on the fly instead of precomputing a list).

    2) f.readline() reads an entire line of input which (depending on
    the nature of the log file) could also be of very large size.
    If you're sure the log file contents are sensible (lines up to
    several megabytes shouldn't cause a problem) then you can do it
    that way, but otherwise you want to read fixed size units.
    Paul Rubin, Aug 8, 2007
    #2
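
    A minimal sketch of the fixed-size-unit approach described above: read
    large blocks and count newlines instead of calling readline() a billion
    times (Python 2, as elsewhere in this thread; the filename and block
    size are only illustrative):

    def seek_to_line(f, n, blocksize=1024 * 1024):
        # Position f at the start of line n (1-based) by counting newlines
        # in fixed-size blocks.
        f.seek(0)
        newlines_needed = n - 1
        while newlines_needed > 0:
            block = f.read(blocksize)
            if not block:
                raise ValueError("file has fewer than %d lines" % n)
            count = block.count('\n')
            if count < newlines_needed:
                newlines_needed -= count
            else:
                # The target line starts inside this block; find the exact spot.
                pos = -1
                for _ in xrange(newlines_needed):
                    pos = block.index('\n', pos + 1)
                f.seek(pos + 1 - len(block), 1)   # step back to just after that newline
                return

    f = open('huge.log', 'rb')
    seek_to_line(f, 10**9)
    print f.readline()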

  3. Sullivan WxPyQtKinter

    Evan Klitzke Guest

    On 8/7/07, Sullivan WxPyQtKinter <> wrote:
    > I have a huge log file which contains 3,453,299,000 lines with
    > different lengths. It is not possible to calculate the absolute
    > position of the beginning of the one billionth line. Are there
    > efficient way to seek to the beginning of that line in python?
    >
    > This program:
    > for i in range(1000000000):
    > f.readline()
    > is absolutely every slow....
    >
    > Thank you so much for help.


    There is no fast way to do this, unless the lines are of fixed length
    (in which case you can use f.seek() to move to the correct spot). The
    reason is that there is no way to find the position of the billionth
    line without scanning the whole file. You should split your logs into
    smaller files in the future.

    You might be able to do this a tiny bit faster by using the split
    utility and having it split the log file into smaller chunks (split can
    split by line count), but since that still has to scan the file it
    will still be IO bound.

    --
    Evan Klitzke <>
    Evan Klitzke, Aug 8, 2007
    #3
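
    If splitting is the route taken, a minimal sketch of driving the Unix
    split utility from Python (assuming split is on the PATH; the chunk
    size, prefix and file name are only illustrative):

    import subprocess

    # Break huge.log into pieces of 100,000 lines each, named chunk_aa, chunk_ab, ...
    subprocess.check_call(['split', '-l', '100000', 'huge.log', 'chunk_'])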
  4. On Aug 8, 2:35 am, Paul Rubin <http://> wrote:
    > Sullivan WxPyQtKinter <> writes:
    > > This program:
    > > for i in range(1000000000):
    > > f.readline()
    > > is absolutely every slow....

    >
    > There are two problems:
    >
    > 1) range(1000000000) builds a list of a billion elements in memory,
    > which is many gigabytes and probably thrashing your machine.
    > You want to use xrange instead of range, which builds an iterator
    > (i.e. something that uses just a small amount of memory, and
    > generates the values on the fly instead of precomputing a list).
    >
    > 2) f.readline() reads an entire line of input which (depending on
    > the nature of the log file) could also be of very large size.
    > If you're sure the log file contents are sensible (lines up to
    > several megabytes shouldn't cause a problem) then you can do it
    > that way, but otherwise you want to read fixed size units.



    Thank you for pointing out these two problems. I wrote this program
    just to show how inefficient it is to use a seemingly NATIVE way
    to seek in such a big file. No other intention........
    Sullivan WxPyQtKinter, Aug 8, 2007
    #4
  5. Sullivan WxPyQtKinter

    Peter Otten Guest

    Sullivan WxPyQtKinter wrote:

    > I have a huge log file which contains 3,453,299,000 lines with
    > different lengths. It is not possible to calculate the absolute
    > position of the beginning of the one billionth line. Are there
    > efficient way to seek to the beginning of that line in python?
    >
    > This program:
    > for i in range(1000000000):
    > f.readline()
    > is absolutely every slow....
    >
    > Thank you so much for help.


    That will be slow regardless of language. However

    import sys, itertools

    n = 10**9 - 1
    assert n < sys.maxint
    f = open(filename)
    wanted_line = itertools.islice(f, n, None).next()

    should do slightly better than your implementation.

    Peter
    Peter Otten, Aug 8, 2007
    #5
  6. Sullivan WxPyQtKinter

    Jay Loden Guest

    Paul Rubin wrote:
    > Sullivan WxPyQtKinter <> writes:
    >> This program:
    >> for i in range(1000000000):
    >> f.readline()
    >> is absolutely every slow....

    >
    > There are two problems:
    >
    > 1) range(1000000000) builds a list of a billion elements in memory,
    > which is many gigabytes and probably thrashing your machine.
    > You want to use xrange instead of range, which builds an iterator
    > (i.e. something that uses just a small amount of memory, and
    > generates the values on the fly instead of precomputing a list).
    >
    > 2) f.readline() reads an entire line of input which (depending on
    > the nature of the log file) could also be of very large size.
    > If you're sure the log file contents are sensible (lines up to
    > several megabytes shouldn't cause a problem) then you can do it
    > that way, but otherwise you want to read fixed size units.


    If we just want to iterate through the file one line at a time, why not just:

    count = 0
    handle = open('hugelogfile.txt')
    for line in handle.xreadlines():
        count = count + 1
        if count == '1000000000':
            #do something


    My first suggestion would be to split the file into smaller more manageable
    chunks, because any type of manipulation of a multi-billion line log file is
    going to be a nightmare. For example, you could try the UNIX 'split' utility to
    break the file into individual files of say, 100000 lines each. split is likely
    to be faster than anything in Python, since it is written in C with no
    interpreter overhead etc.

    Is there a reason you specifically need to get to line 1 billion, or are you
    just trying to trim the file down? Do you need a value that's on that particular
    line, or is there some other reason? Perhaps if you can provide the use case the
    list can help you solve the problem itself rather than looking for a way to seek
    to the one billionth line in a file.

    -Jay
    Jay Loden, Aug 8, 2007
    #6
  7. Hi!

    Create a "index" (a file with 3,453,299,000 tuples :
    line_number + start_byte) ; this file has fix-length lines.
    slow, OK, but once.

    Then, for every consult/read a specific line:
    - direct acces read on index
    - seek at the fisrt byte of the line desired



    @+

    Michel Claveau
    Méta-MCI (MVP), Aug 8, 2007
    #7
    Jay Loden wrote:
    (snip)
    > If we just want to iterate through the file one line at a time, why not just:
    >
    > count = 0
    > handle = open('hugelogfile.txt')
    > for line in handle.xreadlines():
    > count = count + 1
    > if count == '1000000000':
    > #do something


    for count, line in enumerate(handle):
        if count == '1000000000':
            #do something


    NB : files now (well... since 2.3) handle iteration directly
    http://www.python.org/doc/2.3/whatsnew/node17.html
    Bruno Desthuilliers, Aug 8, 2007
    #8
    Re: Seek the one billionth line in a file containing 3 billion lines.

    On Wed, 08 Aug 2007 09:54:26 +0200, Méta-MCI (MVP) wrote:

    > Create a "index" (a file with 3,453,299,000 tuples :
    > line_number + start_byte) ; this file has fix-length lines.
    > slow, OK, but once.


    Why store the line number? The first start offset is for the first
    line, the second start offset for the second line, and so on.

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Aug 8, 2007
    #9
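
    A minimal sketch of the index idea as refined above: one slow pass writes
    a fixed-width (8-byte) start offset per line, with no stored line numbers,
    after which any line can be fetched with two seeks (Python 2; the file and
    function names are only illustrative):

    import struct

    OFFSET_FMT = '<q'                       # one signed 64-bit start offset per line
    OFFSET_SIZE = struct.calcsize(OFFSET_FMT)

    def build_index(logname, indexname):
        # Slow, but only done once: record the byte offset where each line starts.
        log = open(logname, 'rb')
        idx = open(indexname, 'wb')
        offset = 0
        for line in log:
            idx.write(struct.pack(OFFSET_FMT, offset))
            offset += len(line)
        idx.close()
        log.close()

    def get_line(logname, indexname, lineno):
        # lineno is 1-based; its offset lives at (lineno - 1) * OFFSET_SIZE in the index.
        idx = open(indexname, 'rb')
        idx.seek((lineno - 1) * OFFSET_SIZE)
        (offset,) = struct.unpack(OFFSET_FMT, idx.read(OFFSET_SIZE))
        idx.close()
        log = open(logname, 'rb')
        log.seek(offset)
        line = log.readline()
        log.close()
        return line

    build_index('huge.log', 'huge.idx')     # the slow, one-time pass
    print get_line('huge.log', 'huge.idx', 10**9)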
  10. Sullivan WxPyQtKinter

    Ben Finney Guest

    Sullivan WxPyQtKinter <> writes:

    > On Aug 8, 2:35 am, Paul Rubin <http://> wrote:
    > > Sullivan WxPyQtKinter <> writes:
    > > > This program:
    > > > for i in range(1000000000):
    > > > f.readline()
    > > > is absolutely every slow....

    > >
    > > There are two problems:
    > >
    > > 1) range(1000000000) builds a list of a billion elements in memory

    [...]
    > >
    > > 2) f.readline() reads an entire line of input

    [...]
    >
    > Thank you for pointing out these two problem. I wrote this program
    > just to say that how inefficient it is to use a seemingly NATIVE way
    > to seek a such a big file. No other intention........


    The native way isn't iterating over 'range(hugenum)', it's to use an
    iterator. Python file objects are iterable, only reading each line as
    needed and not creating a companion list.

    logfile = open("foo.log", 'r')
    for line in logfile:
    do_stuff(line)

    This at least avoids the 'range' issue.

    To know when we've reached a particular line, use 'enumerate' to
    number each item as it comes out of the iterator.

    logfile = open("foo.log", 'r')
    target_line_num = 10**9
    for (line_num, line) in enumerate(file):
    if line_num < target_line_num:
    continue
    else:
    do_stuff(line)
    break

    As for reading each line: that's unavoidable if you want a specific
    line from a stream of variable-length lines.

    --
    \ "I have never made but one prayer to God, a very short one: 'O |
    `\ Lord, make my enemies ridiculous!' And God granted it." -- |
    _o__) Voltaire |
    Ben Finney
    Ben Finney, Aug 8, 2007
    #10
  11. Sullivan WxPyQtKinter

    Ant Guest

    On Aug 8, 11:10 am, Bruno Desthuilliers wrote:
    > Jay Loden wrote:
    > (snip)
    >
    > > If we just want to iterate through the file one line at a time, why not just:

    >
    > > count = 0
    > > handle = open('hugelogfile.txt')
    > > for line in handle.xreadlines():
    > > count = count + 1
    > > if count == '1000000000':
    > > #do something

    >
    > for count, line in enumerate(handle):
    > if count == '1000000000':
    > #do something


    You'd get better results if the test were:

    if count == 1000000000:

    Or probably even:

    if count == 999999999:

    Since the 1 billionth line will have index 999999999.

    Cheers,

    --
    Ant...

    http://antroy.blogspot.com/
    Ant, Aug 8, 2007
    #11
  12. Peter Otten wrote:

    > n = 10**9 - 1
    > assert n < sys.maxint
    > f = open(filename)
    > wanted_line = itertools.islice(f, n, None).next()
    >
    > should do slightly better than your implementation.


    It will do vastly better, at least in memory usage terms, because
    there is no memory eating range call.

    Regards,


    Björn

    --
    BOFH excuse #31:

    cellular telephone interference
    Bjoern Schliessmann, Aug 8, 2007
    #12
    Ant wrote:
    > On Aug 8, 11:10 am, Bruno Desthuilliers wrote:
    >> Jay Loden wrote:
    >> (snip)
    >>
    >>> If we just want to iterate through the file one line at a time, why not just:
    >>> count = 0
    >>> handle = open('hugelogfile.txt')
    >>> for line in handle.xreadlines():
    >>> count = count + 1
    >>> if count == '1000000000':
    >>> #do something

    >> for count, line in enumerate(handle):
    >> if count == '1000000000':
    >> #do something

    >
    > You'd get better results if the test were:
    >
    > if count == 1000000000:
    >
    > Or probably even:
    >
    > if count == 999999999:
    >
    > Since the 1 billionth line will have index 999999999.
    >


    Doh :(

    Thanks for the correction.
    Bruno Desthuilliers, Aug 8, 2007
    #13
  14. Sullivan WxPyQtKinter

    Chris Mellon Guest

    On 8/8/07, Ben Finney <> wrote:
    > Sullivan WxPyQtKinter <> writes:
    >
    > > On Aug 8, 2:35 am, Paul Rubin <http://> wrote:
    > > > Sullivan WxPyQtKinter <> writes:
    > > > > This program:
    > > > > for i in range(1000000000):
    > > > > f.readline()
    > > > > is absolutely every slow....
    > > >
    > > > There are two problems:
    > > >
    > > > 1) range(1000000000) builds a list of a billion elements in memory

    > [...]
    > > >
    > > > 2) f.readline() reads an entire line of input

    > [...]
    > >
    > > Thank you for pointing out these two problem. I wrote this program
    > > just to say that how inefficient it is to use a seemingly NATIVE way
    > > to seek a such a big file. No other intention........

    >
    > The native way isn't iterating over 'range(hugenum)', it's to use an
    > iterator. Python file objects are iterable, only reading eaach line as
    > needed and not creating a companion list.
    >
    > logfile = open("foo.log", 'r')
    > for line in logfile:
    > do_stuff(line)
    >
    > This at least avoids the 'range' issue.
    >
    > To know when we've reached a particular line, use 'enumerate' to
    > number each item as it comes out of the iterator.
    >
    > logfile = open("foo.log", 'r')
    > target_line_num = 10**9
    > for (line_num, line) in enumerate(file):
    > if line_num < target_line_num:
    > continue
    > else:
    > do_stuff(line)
    > break
    >
    > As for reading each line: that's unavoidable if you want a specific
    > line from a stream of variable-length lines.
    >


    The minimum size of a line is at least one byte (the newline) and
    maybe more, depending on your data. You can seek() forward the minimum
    number of bytes that (1 billion - 1) lines will consume and save
    yourself some wasted IO.
    Chris Mellon, Aug 8, 2007
    #14
  15. Sullivan WxPyQtKinter

    Steve Holden Guest

    Chris Mellon wrote:
    > On 8/8/07, Ben Finney <> wrote:
    >> Sullivan WxPyQtKinter <> writes:
    >>
    >>> On Aug 8, 2:35 am, Paul Rubin <http://> wrote:
    >>>> Sullivan WxPyQtKinter <> writes:
    >>>>> This program:
    >>>>> for i in range(1000000000):
    >>>>> f.readline()
    >>>>> is absolutely every slow....
    >>>> There are two problems:
    >>>>
    >>>> 1) range(1000000000) builds a list of a billion elements in memory

    >> [...]
    >>>> 2) f.readline() reads an entire line of input

    >> [...]
    >>> Thank you for pointing out these two problem. I wrote this program
    >>> just to say that how inefficient it is to use a seemingly NATIVE way
    >>> to seek a such a big file. No other intention........

    >> The native way isn't iterating over 'range(hugenum)', it's to use an
    >> iterator. Python file objects are iterable, only reading eaach line as
    >> needed and not creating a companion list.
    >>
    >> logfile = open("foo.log", 'r')
    >> for line in logfile:
    >> do_stuff(line)
    >>
    >> This at least avoids the 'range' issue.
    >>
    >> To know when we've reached a particular line, use 'enumerate' to
    >> number each item as it comes out of the iterator.
    >>
    >> logfile = open("foo.log", 'r')
    >> target_line_num = 10**9
    >> for (line_num, line) in enumerate(file):
    >> if line_num < target_line_num:
    >> continue
    >> else:
    >> do_stuff(line)
    >> break
    >>
    >> As for reading each line: that's unavoidable if you want a specific
    >> line from a stream of variable-length lines.
    >>

    >
    > The minimum bounds for a line is at least one byte (the newline) and
    > maybe more, depending on your data. You can seek() forward the minimum
    > amount of bytes that (1 billion -1) lines will consume and save
    > yourself some wasted IO.


    Except that you will have to count the number of lines in that first
    billion characters in order to determine when to stop.

    regards
    Steve
    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    Holden Web LLC/Ltd http://www.holdenweb.com
    Skype: holdenweb http://del.icio.us/steve.holden
    --------------- Asciimercial ------------------
    Get on the web: Blog, lens and tag the Internet
    Many services currently offer free registration
    ----------- Thank You for Reading -------------
    Steve Holden, Aug 8, 2007
    #15
  17. Sullivan WxPyQtKinter

    Chris Mellon Guest

    On 8/8/07, Steve Holden <> wrote:
    > Chris Mellon wrote:
    > > On 8/8/07, Ben Finney <> wrote:
    > >> Sullivan WxPyQtKinter <> writes:
    > >>
    > >>> On Aug 8, 2:35 am, Paul Rubin <http://> wrote:
    > >>>> Sullivan WxPyQtKinter <> writes:
    > >>>>> This program:
    > >>>>> for i in range(1000000000):
    > >>>>> f.readline()
    > >>>>> is absolutely every slow....
    > >>>> There are two problems:
    > >>>>
    > >>>> 1) range(1000000000) builds a list of a billion elements in memory
    > >> [...]
    > >>>> 2) f.readline() reads an entire line of input
    > >> [...]
    > >>> Thank you for pointing out these two problem. I wrote this program
    > >>> just to say that how inefficient it is to use a seemingly NATIVE way
    > >>> to seek a such a big file. No other intention........
    > >> The native way isn't iterating over 'range(hugenum)', it's to use an
    > >> iterator. Python file objects are iterable, only reading eaach line as
    > >> needed and not creating a companion list.
    > >>
    > >> logfile = open("foo.log", 'r')
    > >> for line in logfile:
    > >> do_stuff(line)
    > >>
    > >> This at least avoids the 'range' issue.
    > >>
    > >> To know when we've reached a particular line, use 'enumerate' to
    > >> number each item as it comes out of the iterator.
    > >>
    > >> logfile = open("foo.log", 'r')
    > >> target_line_num = 10**9
    > >> for (line_num, line) in enumerate(file):
    > >> if line_num < target_line_num:
    > >> continue
    > >> else:
    > >> do_stuff(line)
    > >> break
    > >>
    > >> As for reading each line: that's unavoidable if you want a specific
    > >> line from a stream of variable-length lines.
    > >>

    > >
    > > The minimum bounds for a line is at least one byte (the newline) and
    > > maybe more, depending on your data. You can seek() forward the minimum
    > > amount of bytes that (1 billion -1) lines will consume and save
    > > yourself some wasted IO.

    >
    > Except that you will have to count the number of lines in that first
    > billion characters in order to determine when to stop.
    >


    True. Perhaps you can tell from the data itself what line you want.
    Chris Mellon, Aug 8, 2007
    #17
  18. Sullivan WxPyQtKinter

    Terry Reedy Guest

    "Marc 'BlackJack' Rintsch" <> wrote in message
    news:-berlin.de...
    | On Wed, 08 Aug 2007 09:54:26 +0200, Méta-MCI (MVP) wrote:
    |
    | > Create a "index" (a file with 3,453,299,000 tuples :
    | > line_number + start_byte) ; this file has fix-length lines.
    | > slow, OK, but once.
    |
    | Why storing the line number? The first start offset is for the first
    | line, the second start offset for the second line and so on.

    Somewhat ironically, given that the OP's problem stems from variable line
    lengths, this requires that the offsets be fixed length. On a true 64-bit
    OS (not Win64, apparently) with 64-bit ints that would work great.
    Terry Reedy, Aug 8, 2007
    #18
  19. Sullivan WxPyQtKinter

    John J. Lee Guest

    "Chris Mellon" <> writes:
    [...]
    > The minimum bounds for a line is at least one byte (the newline) and
    > maybe more, depending on your data. You can seek() forward the minimum
    > amount of bytes that (1 billion -1) lines will consume and save
    > yourself some wasted IO.


    But how do you know which line number you're on, then?


    John
    John J. Lee, Aug 12, 2007
    #19
