What happens when the file being read is too big for all lines to be read with "readlines()"?

Ross Reyes

Hi -

Sorry if this is too simple a question, but I googled and also checked my reference O'Reilly Learning Python book, and I did not find a satisfactory answer.

When I use readlines, what happens if the number of lines is huge? I have a very big file (4GB) I want to read in, but I'm sure there must be some limitation to readlines and I'd like to know how it is handled by Python. I am using it like this:

slines = infile.readlines() # reads all lines into a list of strings called "slines"

Thanks to anyone who knows the answer to this one.
 
bonono

Newer Python should use "for x in fh:", according to the docs:

fh = open("your file")
for x in fh: print x

which reads only one line at a time.
 
Ben Finney

Ross Reyes said:
Sorry if this is too simple a question, but I googled and also checked my reference O'Reilly Learning Python book, and I did not find a satisfactory answer.

The Python documentation is online, and it's good to get familiar with
it:

<URL:http://docs.python.org/>

It's even possible to tell Google to search only that site with
"site:docs.python.org" as a search term.
When I use readlines, what happens if the number of lines is huge? I have a very big file (4GB) I want to read in, but I'm sure there must be some limitation to readlines and I'd like to know how it is handled by Python.

The documentation on methods of the 'file' type describes the
'readlines' method, and addresses this concern.

<URL:http://docs.python.org/lib/bltin-file-objects.html#l2h-244>
 
Ross Reyes

Yes, I have read this part....

readlines([sizehint])

Read until EOF using readline() and return a list containing the lines thus
read. If the optional sizehint argument is present, instead of reading up to
EOF, whole lines totalling approximately sizehint bytes (possibly after
rounding up to an internal buffer size) are read. Objects implementing a
file-like interface may choose to ignore sizehint if it cannot be
implemented, or cannot be implemented efficiently.

Maybe I'm missing the obvious, but it does not seem to say what happens when the input for readlines is too big. Or does it?

How does one tell exactly what the limitation is on the size of the returned list of strings?
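[As an aside, the sizehint argument quoted above is the documented way to bound how much readlines pulls in at once. A minimal sketch, with "big.txt" as a placeholder filename, of reading a large file in roughly 1 MB batches of whole lines:]

fh = open("big.txt")
while True:
    batch = fh.readlines(1024 * 1024)  # whole lines totalling ~1 MB
    if not batch:  # an empty list means EOF
        break
    for line in batch:
        pass  # process each line here
fh.close()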

 
MrJean1

Just try it, it is not that hard ... ;-)

/Jean Brouwers

PS) Here is what happens on Linux:

$ limit vmemory 10000
$ python
... Traceback (most recent call last):
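[The same experiment can be reproduced from inside Python with the standard resource module (Unix only); a sketch, with the memory cap and filename chosen arbitrarily:]

import resource

# Cap this process's address space at ~100 MB (Unix only).
limit = 100 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

try:
    lines = open("big.txt").readlines()  # "big.txt" is a placeholder
except MemoryError:
    print "readlines() raised MemoryError once the limit was hit"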
 
Xiao Jianfeng

Newer Python should use "for x in fh:", according to the docs:

fh = open("your file")
for x in fh: print x

which reads only one line at a time.

I have some other questions:

When will "fh" be closed?

And what should I do if I want to explicitly close the file immediately after reading all the data I want?
 
Steven D'Aprano

I have some other questions:

When will "fh" be closed?

When all references to the file are no longer in scope:

def handle_file(name):
    fp = file(name, "r")
    # reference to file now in scope
    do_stuff(fp)
    return fp


f = handle_file("myfile.txt")
# reference to file is now in scope
f = None
# reference to file is no longer in scope

At this point, Python *may* close the file. CPython currently closes the
file as soon as all references are out of scope. JPython does not -- it
will close the file eventually, but you can't guarantee when.
And what should I do if I want to explicitly close the file immediately after reading all the data I want?

That is the best practice.

f.close()
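[If you want the close to happen even when processing raises an exception, the classic idiom is try/finally. A minimal sketch, where process_line is a hypothetical stand-in for your own per-line processing:]

f = open("myfile.txt")
try:
    for line in f:
        process_line(line)  # hypothetical per-line processing
finally:
    f.close()  # always runs, even if process_line raises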
 
Xiao Jianfeng

Steven said:
When all references to the file are no longer in scope:

def handle_file(name):
    fp = file(name, "r")
    # reference to file now in scope
    do_stuff(fp)
    return fp


f = handle_file("myfile.txt")
# reference to file is now in scope
f = None
# reference to file is no longer in scope

At this point, Python *may* close the file. CPython currently closes the
file as soon as all references are out of scope. JPython does not -- it
will close the file eventually, but you can't guarantee when.




That is the best practice.

f.close()
Let me first introduce the problem I came across last night.

I need to read a file (which may be small or very big) and check it line by line to find a specific token; the data on the next line is then what I want.

If I use readlines(), it will be a problem when the file is too big.

If I use "for line in OPENED_FILE:" to read one line at a time, how can I get the next line when I find the specific token? And I think reading one line at a time is less efficient, am I right?


Regards,

xiaojf
 
Steve Holden

Xiao said:
Steven D'Aprano wrote:



Let me first introduce the problem I came across last night.

I need to read a file (which may be small or very big) and check it line by line to find a specific token; the data on the next line is then what I want.

If I use readlines(), it will be a problem when the file is too big.

If I use "for line in OPENED_FILE:" to read one line at a time, how can I get the next line when I find the specific token? And I think reading one line at a time is less efficient, am I right?
Not necessarily. Try this:

import sys

f = file("filename.txt")
for line in f:
    if token in line: # or whatever you need to identify it
        break
else:
    sys.exit("File does not contain token")
line = f.next()

Then line will be the one you want. Since this uses code written in C to do the processing, you will probably be pleasantly surprised by its speed. Only if this isn't fast enough should you consider anything more complicated.

Premature optimization can waste huge amounts of programming time. Don't do it. First try measuring a solution that works!

regards
Steve
 
Steven D'Aprano

Let me first introduce the problem I came across last night.

I need to read a file (which may be small or very big) and check it line by line to find a specific token; the data on the next line is then what I want.

If I use readlines(), it will be a problem when the file is too big.

If I use "for line in OPENED_FILE:" to read one line at a time, how can I get the next line when I find the specific token?

Here is one solution using a flag:

done = False
for line in file("myfile", "r"):
    if done:
        break
    done = line == "token\n"  # note the newline
# we expect Python to close the file when we exit the loop
if done:
    DoSomethingWith(line)  # the line *after* the one with the token
else:
    print "Token not found!"


Here is another solution, without using a flag:

def get_line(filename, token):
    """Returns the next line following a token, or None if not found.
    Leading and trailing whitespace is ignored when looking for
    the token.
    """
    fp = file(filename, "r")
    for line in fp:
        if line.strip() == token:
            break
    else:
        # runs only if we didn't break
        print "Token not found"
        fp.close()
        return None
    result = fp.readline()  # read the next line only
    fp.close()
    return result


Here is a third solution that raises an exception instead of printing an
error message:

def get_line(filename, token):
    fp = file(filename, "r")
    for line in fp:
        if line.strip() == token:
            break
    else:
        raise ValueError("Token not found")
    # we rely on Python to close the file when we are done
    return fp.readline()


And I think reading one line each time is less efficient, am I right?

Less efficient than what? Spending hours or days writing more complex code
that only saves you a few seconds, or even runs slower?

I believe Python will take advantage of your file system's buffering
capabilities. Try it and see, you'll be surprised how fast it runs. If you
try it and it is too slow, then come back and we'll see what can be done
to speed it up. But don't try to speed it up before you know if it is fast
enough.
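[If you do want to experiment with buffering, the built-in open() accepts an optional third buffering argument. A minimal sketch, with the buffer size picked arbitrarily:]

# 0 = unbuffered, 1 = line buffered, larger values suggest a buffer size
fp = open("myfile", "r", 1024 * 1024)  # request a ~1 MB buffer
for line in fp:
    pass  # process each line here
fp.close()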
 
Steven D'Aprano

def get_line(filename, token):
    """Returns the next line following a token, or None if not found.
    Leading and trailing whitespace is ignored when looking for
    the token.
    """
    fp = file(filename, "r")
    for line in fp:
        if line.strip() == token:
            break
    else:
        # runs only if we didn't break
        print "Token not found"
        fp.close()
        return None
    result = fp.readline()  # read the next line only
    fp.close()
    return result

Correction: checking the Library Reference, I find that this is
wrong. The reason is that file objects implement their own read-ahead
buffer, and mixing calls to next() and readline() may not work right.

See http://docs.python.org/lib/bltin-file-objects.html

Replace the fp.readline() with fp.next() and all should be good.
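[For completeness, a sketch of the function with that substitution applied, plus a guard for the edge case where the token sits on the very last line, in which case next() raises StopIteration:]

def get_line(filename, token):
    """Returns the next line following a token, or None if not found."""
    fp = file(filename, "r")
    for line in fp:
        if line.strip() == token:
            break
    else:
        print "Token not found"
        fp.close()
        return None
    try:
        result = fp.next()  # iterator call, consistent with the read-ahead buffer
    except StopIteration:
        result = None  # token was on the last line; nothing follows it
    fp.close()
    return result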
 
Xiao Jianfeng

Steve said:
Xiao Jianfeng wrote:


Not necessarily. Try this:

import sys

f = file("filename.txt")
for line in f:
    if token in line: # or whatever you need to identify it
        break
else:
    sys.exit("File does not contain token")
line = f.next()

Then line will be the one you want. Since this uses code written in C to do the processing, you will probably be pleasantly surprised by its speed. Only if this isn't fast enough should you consider anything more complicated.

Premature optimization can waste huge amounts of programming time. Don't do it. First try measuring a solution that works!

regards
Steve

Oh yes, thanks.
First, I must say thanks to all of you. And I'm really sorry that I didn't describe my problem clearly.

There are many tokens in the file; every time I find a token, I have to get the data on the next line and do some operation with it. It would be easy for me to find just one token using the above method, but there is more than one.

My method was:

f_in = open('input_file', 'r')
data_all = f_in.readlines()
f_in.close()

for i in range(len(data_all)):
    line = data_all[i]
    if token in line:
        # do something with data_all[i + 1]
        pass

Since my method needs to read the whole file into memory, I think it may not be efficient when processing very big files.

I really appreciate all suggestions! Thanks again.

Regards,

xiaojf
 
bonono

Xiao said:
First, I must say thanks to all of you. And I'm really sorry that I didn't describe my problem clearly.

There are many tokens in the file; every time I find a token, I have to get the data on the next line and do some operation with it. It would be easy for me to find just one token using the above method, but there is more than one.

My method was:

f_in = open('input_file', 'r')
data_all = f_in.readlines()
f_in.close()

for i in range(len(data_all)):
    line = data_all[i]
    if token in line:
        # do something with data_all[i + 1]
        pass

Since my method needs to read the whole file into memory, I think it may not be efficient when processing very big files.

I really appreciate all suggestions! Thanks again.

Something like this:

for x in fh:
    if not has_token(x): continue
    else: process(fh.next())

You can also create an iterator with iter(fh), but I don't think that is necessary.

This uses the iterator's "side effect" to your advantage. I was bitten by that side effect before, but for your particular app it becomes an advantage.
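[A slightly fuller sketch of the same idea, where has_token and process are hypothetical stand-ins for the poster's own logic, with a guard for a token on the very last line:]

def scan(filename, has_token, process):
    fh = open(filename)
    try:
        for x in fh:
            if has_token(x):
                try:
                    process(fh.next())  # consume the line after the token
                except StopIteration:
                    break  # token was on the last line; nothing follows
    finally:
        fh.close()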
 
Xiao Jianfeng

Xiao Jianfeng wrote:

First, I must say thanks to all of you. And I'm really sorry that I didn't describe my problem clearly.

There are many tokens in the file; every time I find a token, I have to get the data on the next line and do some operation with it. It would be easy for me to find just one token using the above method, but there is more than one.

My method was:

f_in = open('input_file', 'r')
data_all = f_in.readlines()
f_in.close()

for i in range(len(data_all)):
    line = data_all[i]
    if token in line:
        # do something with data_all[i + 1]
        pass

Since my method needs to read the whole file into memory, I think it may not be efficient when processing very big files.

I really appreciate all suggestions! Thanks again.

Something like this:

for x in fh:
    if not has_token(x): continue
    else: process(fh.next())

You can also create an iterator with iter(fh), but I don't think that is necessary.

This uses the iterator's "side effect" to your advantage. I was bitten by that side effect before, but for your particular app it becomes an advantage.

Thanks, all of you!

I have compared the two methods:
(1) "for x in fh:"
(2) read the whole file into memory first.

I have tested the two methods on two files, one of 80M and the other of 815M. The first method gained a speedup of about 40% on the first file, and a speedup of about 25% on the second file.
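[A sketch of how such a comparison might be timed; the filename is a placeholder and time.time() is used for coarse wall-clock measurement:]

import time

def method_iterate(filename):
    count = 0
    for line in open(filename):  # line-at-a-time iteration
        count += 1
    return count

def method_readlines(filename):
    return len(open(filename).readlines())  # whole file in memory

for method in (method_iterate, method_readlines):
    start = time.time()
    method("big.txt")  # "big.txt" is a placeholder test file
    print "%s: %.2f seconds" % (method.__name__, time.time() - start)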

Sorry for my bad English, and I hope I haven't made people confused.

Regards,

xiaojf
 
bonono

Xiao said:
I have compared the two methods:
(1) "for x in fh:"
(2) read the whole file into memory first.

I have tested the two methods on two files, one of 80M and the other of 815M. The first method gained a speedup of about 40% on the first file, and a speedup of about 25% on the second file.

Sorry for my bad English, and I hope I haven't made people confused.

So is the problem solved?

Putting the buffering implementation aside, (1) is the way to go, as it runs through the content only once.
 
Fredrik Lundh

Ross said:
Maybe I'm missing the obvious, but it does not seem to say what happens when
the input for readlines is too big. Or does it?

readlines handles memory overflow in exactly the same way as any
other operation: by raising a MemoryError exception:

http://www.python.org/doc/current/lib/module-exceptions.html#l2h-296
How does one tell exactly what the limitation is on the size of the returned list of strings?

You can't. It depends on how much memory you have, what your files look like (shorter lines mean more string objects, which means more overhead), and how your operating system handles large processes. As soon as the operating system says that it cannot allocate more memory to the Python process, Python will abort the operation and raise an exception. If the operating system doesn't complain, neither will Python.
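[A sketch of guarding against that failure mode, with infile as in the original post and process a hypothetical per-line handler: on MemoryError, rewind and fall back to line-at-a-time iteration:]

try:
    slines = infile.readlines()  # may exhaust memory on a huge file
except MemoryError:
    infile.seek(0)  # rewind and process incrementally instead
    for line in infile:
        process(line)  # hypothetical per-line handler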

</F>
 
Mike Meyer

Ross Reyes said:
Yes, I have read this part....
How does one tell exactly what the limitation is on the size of the returned list of strings?

There's not really a good platform-independent way to do that, because you'll get memory until the OS won't give you any more.

<mike
 
