Seek the one billionth line in a file containing 3 billion lines.

Sullivan WxPyQtKinter

I have a huge log file which contains 3,453,299,000 lines of
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:

for i in range(1000000000):
    f.readline()

is absolutely very slow....

Thank you so much for help.
 

Paul Rubin

Sullivan WxPyQtKinter said:
This program:

for i in range(1000000000):
    f.readline()

is absolutely very slow....

There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.
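
As a rough sketch of the fixed-size-unit approach Paul describes (counting
newlines block by block), something like the following could work. It is only
a sketch in Python 2 style to match the thread: the function name, file name
and block size are made up, and it assumes plain '\n' line endings.

def seek_to_line(f, target):
    # Scan fixed-size blocks and count newlines until we reach the
    # start of line number `target` (1-based).
    BLOCK = 1024 * 1024
    seen = 0                      # newlines seen so far
    pos = 0                       # absolute offset of the current block
    while seen < target - 1:
        block = f.read(BLOCK)
        if not block:
            raise EOFError("file has fewer than %d lines" % target)
        need = target - 1 - seen
        count = block.count('\n')
        if count < need:
            seen += count
            pos += len(block)
        else:
            # The newline ending line (target - 1) is inside this block.
            offset = -1
            for _ in range(need):
                offset = block.index('\n', offset + 1)
            f.seek(pos + offset + 1)
            return
    f.seek(pos)

f = open('hugelogfile.txt', 'rb')
seek_to_line(f, 10**9)
print f.readline()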
 

Evan Klitzke

I have a huge log file which contains 3,453,299,000 lines of
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:

for i in range(1000000000):
    f.readline()

is absolutely very slow....

Thank you so much for help.

There is no fast way to do this, unless the lines are of fixed length
(in which case you can use f.seek() to move to the correct spot). The
reason is that there is no way to find the position of the billionth
line without scanning the whole file. You should split your logs into
smaller files in the future.

You might be able to do this a tiny bit faster by using the split
utility and having it split the log file into smaller chunks (split can
split by line count), but since that still has to scan the file it
will be IO bound.
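
For the fixed-length-record case Evan mentions, the seek arithmetic is simple.
A minimal sketch (Python 2 style; RECORD_LEN and the filename are hypothetical,
and RECORD_LEN must include the newline):

RECORD_LEN = 80                      # hypothetical fixed line length, newline included
f = open('fixedwidth.log', 'rb')
f.seek(RECORD_LEN * (10**9 - 1))     # jump straight to the start of line 1,000,000,000
print f.readline()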
 

Sullivan WxPyQtKinter

There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.


Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........
 

Peter Otten

Sullivan said:
I have a huge log file which contains 3,453,299,000 lines of
different lengths. It is not possible to calculate the absolute
position of the beginning of the one billionth line. Is there an
efficient way to seek to the beginning of that line in Python?

This program:

for i in range(1000000000):
    f.readline()

is absolutely very slow....

Thank you so much for help.

That will be slow regardless of language. However,

import sys
import itertools

n = 10**9 - 1
assert n < sys.maxint
f = open(filename)
wanted_line = itertools.islice(f, n, None).next()

should do slightly better than your implementation.

Peter
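
A small variant of Peter's islice approach, wrapped in a function for repeated
use. This is only a sketch: get_line and hugelogfile.txt are made-up names, and
it returns None rather than raising StopIteration when the file is too short.

import itertools

def get_line(path, lineno):
    # lineno is 1-based; islice skips lines lazily, without building a list
    f = open(path)
    for line in itertools.islice(f, lineno - 1, lineno):
        return line
    return None

# print get_line('hugelogfile.txt', 10**9)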
 

Jay Loden

Paul said:
There are two problems:

1) range(1000000000) builds a list of a billion elements in memory,
which is many gigabytes and probably thrashing your machine.
You want to use xrange instead of range, which builds an iterator
(i.e. something that uses just a small amount of memory, and
generates the values on the fly instead of precomputing a list).

2) f.readline() reads an entire line of input which (depending on
the nature of the log file) could also be of very large size.
If you're sure the log file contents are sensible (lines up to
several megabytes shouldn't cause a problem) then you can do it
that way, but otherwise you want to read fixed size units.

If we just want to iterate through the file one line at a time, why not just:

count = 0
handle = open('hugelogfile.txt')
for line in handle.xreadlines():
    count = count + 1
    if count == '1000000000':
        #do something


My first suggestion would be to split the file into smaller, more manageable
chunks, because any type of manipulation of a multi-billion line log file is
going to be a nightmare. For example, you could try the UNIX 'split' utility to
break the file into individual files of, say, 100,000 lines each. split is likely
to be faster than anything in Python, since it is written in C with no
interpreter overhead etc.

Is there a reason you specifically need to get to line 1 billion, or are you
just trying to trim the file down? Do you need a value that's on that particular
line, or is there some other reason? Perhaps if you can provide the use case the
list can help you solve the problem itself rather than looking for a way to seek
to the one billionth line in a file.

-Jay
 

Méta-MCI (MVP)

Hi!

Create a "index" (a file with 3,453,299,000 tuples :
line_number + start_byte) ; this file has fix-length lines.
slow, OK, but once.

Then, for every consult/read a specific line:
- direct acces read on index
- seek at the fisrt byte of the line desired



@+

Michel Claveau
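
Roughly what such an index could look like in Python, as a sketch only: the
file names and function names are made up, and this version stores just one
fixed-width 64-bit start offset per line (the line number is implicit from the
record's position, as Marc points out below).

import struct

OFFSET_FMT = 'Q'                          # one unsigned 64-bit offset per line
OFFSET_SIZE = struct.calcsize(OFFSET_FMT)

def build_index(log_path, index_path):
    # One slow sequential pass: record the byte offset of every line start.
    log = open(log_path, 'rb')
    index = open(index_path, 'wb')
    offset = 0
    for line in log:
        index.write(struct.pack(OFFSET_FMT, offset))
        offset += len(line)
    index.close()
    log.close()

def read_line(log_path, index_path, lineno):
    # Direct access: the record for line N (1-based) sits at (N - 1) * OFFSET_SIZE.
    index = open(index_path, 'rb')
    index.seek((lineno - 1) * OFFSET_SIZE)
    (offset,) = struct.unpack(OFFSET_FMT, index.read(OFFSET_SIZE))
    index.close()
    log = open(log_path, 'rb')
    log.seek(offset)
    return log.readline()

# build_index('hugelogfile.txt', 'hugelogfile.idx')   # the slow, one-time pass
# print read_line('hugelogfile.txt', 'hugelogfile.idx', 10**9)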
 

Bruno Desthuilliers

Jay Loden wrote:
(snip)
If we just want to iterate through the file one line at a time, why not just:

count = 0
handle = open('hugelogfile.txt')
for line in handle.xreadlines():
    count = count + 1
    if count == '1000000000':
        #do something

for count, line in enumerate(handle):
    if count == '1000000000':
        #do something


NB : files now (well... since 2.3) handle iteration directly
http://www.python.org/doc/2.3/whatsnew/node17.html
 

Marc 'BlackJack' Rintsch

Create a "index" (a file with 3,453,299,000 tuples :
line_number + start_byte) ; this file has fix-length lines.
slow, OK, but once.

Why store the line number? The first start offset is for the first
line, the second start offset is for the second line, and so on.

Ciao,
Marc 'BlackJack' Rintsch
 

Ben Finney

Sullivan WxPyQtKinter said:
This program:

for i in range(1000000000):
    f.readline()

is absolutely very slow....

There are two problems:

1) range(1000000000) builds a list of a billion elements in memory [...]

2) f.readline() reads an entire line of input
[...]

Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........

The native way isn't iterating over 'range(hugenum)', it's to use an
iterator. Python file objects are iterable, reading each line only as
needed and not creating a companion list.

logfile = open("foo.log", 'r')
for line in logfile:
    do_stuff(line)

This at least avoids the 'range' issue.

To know when we've reached a particular line, use 'enumerate' to
number each item as it comes out of the iterator.

logfile = open("foo.log", 'r')
target_line_num = 10**9 - 1   # enumerate() counts from 0
for (line_num, line) in enumerate(logfile):
    if line_num < target_line_num:
        continue
    else:
        do_stuff(line)
        break

As for reading each line: that's unavoidable if you want a specific
line from a stream of variable-length lines.
 

Ant

Jay Loden wrote:
(snip)



for count, line in enumerate(handle):
    if count == '1000000000':
        #do something

You'd get better results if the test were:

if count == 1000000000:

Or probably even:

if count == 999999999:

Since the 1 billionth line will have index 999999999.

Cheers,
 

Bjoern Schliessmann

Peter said:
n = 10**9 - 1
assert n < sys.maxint
f = open(filename)
wanted_line = itertools.islice(f, n, None).next()

should do slightly better than your implementation.

It will do vastly better, at least in memory usage terms, because
there is no memory-eating range call.

Regards,


Björn
 

Bruno Desthuilliers

Ant wrote:
You'd get better results if the test were:

if count == 1000000000:

Or probably even:

if count == 999999999:

Since the 1 billionth line will have index 999999999.

Doh :(

Thanks for the correction.
 

Chris Mellon

Sullivan WxPyQtKinter said:
This program:

for i in range(1000000000):
    f.readline()

is absolutely very slow....

There are two problems:

1) range(1000000000) builds a list of a billion elements in memory [...]

2) f.readline() reads an entire line of input
[...]

Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........

The native way isn't iterating over 'range(hugenum)', it's to use an
iterator. Python file objects are iterable, reading each line only as
needed and not creating a companion list.

logfile = open("foo.log", 'r')
for line in logfile:
    do_stuff(line)

This at least avoids the 'range' issue.

To know when we've reached a particular line, use 'enumerate' to
number each item as it comes out of the iterator.

logfile = open("foo.log", 'r')
target_line_num = 10**9 - 1   # enumerate() counts from 0
for (line_num, line) in enumerate(logfile):
    if line_num < target_line_num:
        continue
    else:
        do_stuff(line)
        break

As for reading each line: that's unavoidable if you want a specific
line from a stream of variable-length lines.

The minimum bound for a line is at least one byte (the newline) and
maybe more, depending on your data. You can seek() forward the minimum
number of bytes that (1 billion - 1) lines will consume and save
yourself some wasted IO.
 

Steve Holden

Chris said:
Sullivan WxPyQtKinter said:
This program:

for i in range(1000000000):
    f.readline()

is absolutely very slow....
There are two problems:

1) range(1000000000) builds a list of a billion elements in memory [...]
2) f.readline() reads an entire line of input [...]
Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........
The native way isn't iterating over 'range(hugenum)', it's to use an
iterator. Python file objects are iterable, reading each line only as
needed and not creating a companion list.

logfile = open("foo.log", 'r')
for line in logfile:
    do_stuff(line)

This at least avoids the 'range' issue.

To know when we've reached a particular line, use 'enumerate' to
number each item as it comes out of the iterator.

logfile = open("foo.log", 'r')
target_line_num = 10**9 - 1   # enumerate() counts from 0
for (line_num, line) in enumerate(logfile):
    if line_num < target_line_num:
        continue
    else:
        do_stuff(line)
        break

As for reading each line: that's unavoidable if you want a specific
line from a stream of variable-length lines.

The minimum bound for a line is at least one byte (the newline) and
maybe more, depending on your data. You can seek() forward the minimum
number of bytes that (1 billion - 1) lines will consume and save
yourself some wasted IO.

Except that you will have to count the number of lines in that first
billion characters in order to determine when to stop.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
 

Chris Mellon

Chris said:
This program:

for i in range(1000000000):
    f.readline()

is absolutely very slow....
There are two problems:

1) range(1000000000) builds a list of a billion elements in memory
[...]
2) f.readline() reads an entire line of input
[...]
Thank you for pointing out these two problems. I wrote this program
just to show how inefficient it is to use a seemingly NATIVE way
to seek in such a big file. No other intention........
The native way isn't iterating over 'range(hugenum)', it's to use an
iterator. Python file objects are iterable, reading each line only as
needed and not creating a companion list.

logfile = open("foo.log", 'r')
for line in logfile:
    do_stuff(line)

This at least avoids the 'range' issue.

To know when we've reached a particular line, use 'enumerate' to
number each item as it comes out of the iterator.

logfile = open("foo.log", 'r')
target_line_num = 10**9 - 1   # enumerate() counts from 0
for (line_num, line) in enumerate(logfile):
    if line_num < target_line_num:
        continue
    else:
        do_stuff(line)
        break

As for reading each line: that's unavoidable if you want a specific
line from a stream of variable-length lines.

The minimum bound for a line is at least one byte (the newline) and
maybe more, depending on your data. You can seek() forward the minimum
number of bytes that (1 billion - 1) lines will consume and save
yourself some wasted IO.

Except that you will have to count the number of lines in that first
billion characters in order to determine when to stop.

True. Perhaps you can tell from the data itself what line you want.
 

Terry Reedy

| On Wed, 08 Aug 2007 09:54:26 +0200, Méta-MCI (MVP) wrote:
|
| > Create a "index" (a file with 3,453,299,000 tuples :
| > line_number + start_byte) ; this file has fix-length lines.
| > slow, OK, but once.
|
| Why storing the line number? The first start offset is for the first
| line, the second start offset for the second line and so on.

Somewhat ironically, given that the OP's problem stems from variable line
lengths, this requires that the offsets be of fixed length. On a true 64-bit
OS (not Win64, apparently) with 64-bit ints that would work great.
 

John J. Lee

Chris Mellon said:
The minimum bounds for a line is at least one byte (the newline) and
maybe more, depending on your data. You can seek() forward the minimum
amount of bytes that (1 billion -1) lines will consume and save
yourself some wasted IO.

But how do you know which line number you're on, then?


John
 
