What happens when the file being read is too big for all lines to be read with "readlines()"?

Discussion in 'Python' started by Ross Reyes, Nov 19, 2005.

  1. Ross Reyes

    HI -
    Sorry if this is maybe too simple a question, but I googled and also
    checked my reference (O'Reilly's Learning Python) and I did not find a
    satisfactory answer.

    When I use readlines, what happens if the number of lines is huge? I have
    a very big file (4GB) that I want to read in, but I'm sure there must be
    some limitation to readlines, and I'd like to know how it is handled by
    Python. I am using it like this:

    slines = infile.readlines()  # reads all lines into a list of strings called "slines"

    Thanks for anyone who knows the answer to this one.
    Ross Reyes, Nov 19, 2005
    #1

  2. Guest


    Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    Newer Pythons should use "for x in fh:", according to the docs:

    fh = open("your file")
    for x in fh: print x

    which would only read one line at a time.
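    On current Python the same pattern is usually wrapped in a with block,
    which also closes the file for you. A minimal self-contained sketch (the
    sample file and its contents here are made up for illustration):

```python
import os
import tempfile

# make a small sample file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("first\nsecond\nthird\n")

# iterate one line at a time: the whole file is never held in memory,
# so this works for arbitrarily large inputs
lines_seen = []
with open(path) as fh:
    for line in fh:
        lines_seen.append(line.rstrip("\n"))

print(lines_seen)  # → ['first', 'second', 'third']
```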

    Ross Reyes wrote:
    > HI -
    > Sorry for maybe a too simple a question but I googled and also checked my
    > reference O'Reilly Learning Python
    > book and I did not find a satisfactory answer.
    >
    > When I use readlines, what happens if the number of lines is huge? I have
    > a very big file (4GB) I want to
    > read in, but I'm sure there must be some limitation to readlines and I'd
    > like to know how it is handled by python.
    > I am using it like this:
    > slines = infile.readlines() # reads all lines into a list of strings called
    > "slines"
    >
    > Thanks for anyone who knows the answer to this one.
    , Nov 19, 2005
    #2

  3. Ben Finney


    Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    Ross Reyes <> wrote:
    > Sorry for maybe a too simple a question but I googled and also
    > checked my reference O'Reilly Learning Python book and I did not
    > find a satisfactory answer.


    The Python documentation is online, and it's good to get familiar with
    it:

    <URL:http://docs.python.org/>

    It's even possible to tell Google to search only that site with
    "site:docs.python.org" as a search term.

    > When I use readlines, what happens if the number of lines is huge?
    > I have a very big file (4GB) I want to read in, but I'm sure there
    > must be some limitation to readlines and I'd like to know how it is
    > handled by python.


    The documentation on methods of the 'file' type describes the
    'readlines' method, and addresses this concern.

    <URL:http://docs.python.org/lib/bltin-file-objects.html#l2h-244>

    --
    \ "If you're not part of the solution, you're part of the |
    `\ precipitate." -- Steven Wright |
    _o__) |
    Ben Finney
    Ben Finney, Nov 19, 2005
    #3
  4. Ross Reyes


    Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    Yes, I have read this part....

    readlines( [sizehint])

    Read until EOF using readline() and return a list containing the lines thus
    read. If the optional sizehint argument is present, instead of reading up to
    EOF, whole lines totalling approximately sizehint bytes (possibly after
    rounding up to an internal buffer size) are read. Objects implementing a
    file-like interface may choose to ignore sizehint if it cannot be
    implemented, or cannot be implemented efficiently.
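    To see what the sizehint argument buys you, here is a sketch in current
    Python that reads a file in bounded batches instead of all at once (the
    file name, line format, and batch size are invented for illustration):

```python
import os
import tempfile

# build a sample file of 1000 ten-character lines
path = os.path.join(tempfile.mkdtemp(), "big.txt")
with open(path, "w") as f:
    for i in range(1000):
        f.write("line %04d\n" % i)

# readlines() with a size hint stops after roughly that many characters'
# worth of whole lines, so memory use per batch stays bounded
total = 0
with open(path) as fh:
    while True:
        batch = fh.readlines(4096)   # ~4 KB of whole lines per call
        if not batch:
            break
        total += len(batch)

print(total)  # → 1000
```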

    Maybe I'm missing the obvious, but it does not seem to say what happens when
    the input for readlines is too big. Or does it?

    How does one tell exactly what the limitation is to the size of the
    returned list of strings?

    ----- Original Message -----
    From: "Ben Finney" <>
    Newsgroups: comp.lang.python
    To: <>
    Sent: Saturday, November 19, 2005 6:48 AM
    Subject: Re: what happens when the file being read is too big for all lines
    to be read with "readlines()"


    > Ross Reyes <> wrote:
    >> Sorry for maybe a too simple a question but I googled and also
    >> checked my reference O'Reilly Learning Python book and I did not
    >> find a satisfactory answer.

    >
    > The Python documentation is online, and it's good to get familiar with
    > it:
    >
    > <URL:http://docs.python.org/>
    >
    > It's even possible to tell Google to search only that site with
    > "site:docs.python.org" as a search term.
    >
    >> When I use readlines, what happens if the number of lines is huge?
    >> I have a very big file (4GB) I want to read in, but I'm sure there
    >> must be some limitation to readlines and I'd like to know how it is
    >> handled by python.

    >
    > The documentation on methods of the 'file' type describes the
    > 'readlines' method, and addresses this concern.
    >
    > <URL:http://docs.python.org/lib/bltin-file-objects.html#l2h-244>
    >
    > --
    > \ "If you're not part of the solution, you're part of the |
    > `\ precipitate." -- Steven Wright |
    > _o__) |
    > Ben Finney
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    Ross Reyes, Nov 19, 2005
    #4
  5. MrJean1


    Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    Just try it, it is not that hard ... ;-)

    /Jean Brouwers

    PS) Here is what happens on Linux:

    $ limit vmemory 10000
    $ python
    ...
    >>> s = file(<bugfile>).readlines()

    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    MemoryError
    >>>
    MrJean1, Nov 19, 2005
    #5
  6. Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    wrote:

    >newer python should use "for x in fh:", according to the doc :
    >
    >fh = open("your file")
    >for x in fh: print x
    >
    >which would only read one line at a time.

    I have some other questions:

    When will "fh" be closed?

    And what should I do if I want to explicitly close the file immediately
    after reading all the data I want?

    >Ross Reyes wrote:
    >
    >
    >>HI -
    >>Sorry for maybe a too simple a question but I googled and also checked my
    >>reference O'Reilly Learning Python
    >>book and I did not find a satisfactory answer.
    >>
    >>When I use readlines, what happens if the number of lines is huge? I have
    >>a very big file (4GB) I want to
    >>read in, but I'm sure there must be some limitation to readlines and I'd
    >>like to know how it is handled by python.
    >>I am using it like this:
    >>slines = infile.readlines() # reads all lines into a list of strings called
    >>"slines"
    >>
    >>Thanks for anyone who knows the answer to this one.
    Xiao Jianfeng, Nov 20, 2005
    #6
  7. Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    On Sun, 20 Nov 2005 11:05:53 +0800, Xiao Jianfeng wrote:

    > I have some other questions:
    >
    > when "fh" will be closed?


    When all references to the file are no longer in scope:

    def handle_file(name):
        fp = file(name, "r")
        # reference to file now in scope
        do_stuff(fp)
        return fp


    f = handle_file("myfile.txt")
    # reference to file is now in scope
    f = None
    # reference to file is no longer in scope

    At this point, Python *may* close the file. CPython currently closes the
    file as soon as all references are out of scope. Jython does not -- it
    will close the file eventually, but you can't guarantee when.

    > And what shoud I do if I want to explicitly close the file immediately
    > after reading all data I want?


    That is the best practice.

    f.close()
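    For readers on Python 2.5 or later, the with statement makes the close
    deterministic on every exit path, so you don't have to reason about
    reference lifetimes at all. A small sketch (the file name and contents
    are invented):

```python
import os
import tempfile

# create a one-line sample file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("hello\n")

# the with statement closes the file when the block exits, on any path,
# rather than whenever the garbage collector gets around to it
with open(path) as fh:
    first = fh.readline()

print(fh.closed)  # → True
```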


    --
    Steven.
    Steven D'Aprano, Nov 20, 2005
    #7
  8. Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    Steven D'Aprano wrote:

    >On Sun, 20 Nov 2005 11:05:53 +0800, Xiao Jianfeng wrote:
    >
    >> I have some other questions:
    >>
    >> when "fh" will be closed?
    >
    >When all references to the file are no longer in scope:
    >
    >def handle_file(name):
    >    fp = file(name, "r")
    >    # reference to file now in scope
    >    do_stuff(fp)
    >    return fp
    >
    >f = handle_file("myfile.txt")
    ># reference to file is now in scope
    >f = None
    ># reference to file is no longer in scope
    >
    >At this point, Python *may* close the file. CPython currently closes the
    >file as soon as all references are out of scope. Jython does not -- it
    >will close the file eventually, but you can't guarantee when.
    >
    >> And what shoud I do if I want to explicitly close the file immediately
    >> after reading all data I want?
    >
    >That is the best practice.
    >
    >f.close()

    Let me introduce the problem I came across last night first.

    I need to read a file (which may be small or very big) and to check it
    line by line to find a specific token; the data on the next line will
    then be what I want.

    If I use readlines(), it will be a problem when the file is too big.

    If I use "for line in OPENED_FILE:" to read one line each time, how can
    I get the next line when I find the specific token?
    And I think reading one line each time is less efficient, am I right?


    Regards,

    xiaojf
    Xiao Jianfeng, Nov 20, 2005
    #8
  9. Steve Holden


    Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    Xiao Jianfeng wrote:
    > Steven D'Aprano wrote:
    >
    >>On Sun, 20 Nov 2005 11:05:53 +0800, Xiao Jianfeng wrote:
    >>
    >>>I have some other questions:
    >>>
    >>>when "fh" will be closed?
    >>
    >>When all references to the file are no longer in scope:
    >>
    >>def handle_file(name):
    >>    fp = file(name, "r")
    >>    # reference to file now in scope
    >>    do_stuff(fp)
    >>    return fp
    >>
    >>f = handle_file("myfile.txt")
    >># reference to file is now in scope
    >>f = None
    >># reference to file is no longer in scope
    >>
    >>At this point, Python *may* close the file. CPython currently closes the
    >>file as soon as all references are out of scope. Jython does not -- it
    >>will close the file eventually, but you can't guarantee when.
    >>
    >>>And what shoud I do if I want to explicitly close the file immediately
    >>>after reading all data I want?
    >>
    >>That is the best practice.
    >>
    >>f.close()
    >
    > Let me introduce my problem I came across last night first.
    >
    > I need to read a file (which may be small or very big) and to check line
    > by line to find a specific token, then the data on the next line will be
    > what I want.
    >
    > If I use readlines(), it will be a problem when the file is too big.
    >
    > If I use "for line in OPENED_FILE:" to read one line each time, how can
    > I get the next line when I find the specific token?
    > And I think reading one line each time is less efficient, am I right?
    >

    Not necessarily. Try this:

    import sys

    f = file("filename.txt")
    for line in f:
        if token in line:  # or whatever you need to identify it
            break
    else:
        sys.exit("File does not contain token")
    line = f.next()

    Then line will be the one you want. Since this will use code written in
    C to do the processing you will probably be pleasantly surprised by its
    speed. Only if this isn't fast enough should you consider anything more
    complicated.

    Premature optimization can waste huge amounts of programming time. Don't
    do it. First try measuring a solution that works!
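    The same find-the-token-then-take-the-next-line idea as a self-contained
    sketch in current Python (the token, file name, and contents are invented;
    next(fh, None) avoids an exception when the token is on the last line):

```python
import os
import tempfile

# build a small sample file with a known token
path = os.path.join(tempfile.mkdtemp(), "tokens.txt")
with open(path, "w") as f:
    f.write("header\nTOKEN\npayload\ntrailer\n")

def line_after(path, token):
    """Return the line following the first line containing token, else None."""
    with open(path) as fh:
        for line in fh:
            if token in line:
                # next() pulls exactly one more line from the same iterator
                return next(fh, None)
    return None

print(line_after(path, "TOKEN"))  # → 'payload\n'
```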

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC www.holdenweb.com
    PyCon TX 2006 www.python.org/pycon/
    Steve Holden, Nov 20, 2005
    #9
  10. Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    On Sun, 20 Nov 2005 12:28:07 +0800, Xiao Jianfeng wrote:

    > Let me introduce my problem I came across last night first.
    >
    > I need to read a file (which may be small or very big) and to check line
    > by line to find a specific token, then the data on the next line will be
    > what I want.
    >
    > If I use readlines(), it will be a problem when the file is too big.
    >
    > If I use "for line in OPENED_FILE:" to read one line each time, how can
    > I get the next line when I find the specific token?


    Here is one solution using a flag:

    done = False
    for line in file("myfile", "r"):
        if done:
            break
        done = line == "token\n"  # note the newline
    # we expect Python to close the file when we exit the loop
    if done:
        DoSomethingWith(line)  # the line *after* the one with the token
    else:
        print "Token not found!"


    Here is another solution, without using a flag:

    def get_line(filename, token):
        """Returns the next line following a token, or None if not found.
        Leading and trailing whitespace is ignored when looking for
        the token.
        """
        fp = file(filename, "r")
        for line in fp:
            if line.strip() == token:
                break
        else:
            # runs only if we didn't break
            print "Token not found"
            fp.close()
            return None
        result = fp.readline()  # read the next line only
        fp.close()
        return result


    Here is a third solution that raises an exception instead of printing an
    error message:

    def get_line(filename, token):
        fp = file(filename, "r")
        for line in fp:
            if line.strip() == token:
                break
        else:
            raise ValueError("Token not found")
        return fp.readline()
        # we rely on Python to close the file when we are done



    > And I think reading one line each time is less efficient, am I right?


    Less efficient than what? Spending hours or days writing more complex code
    that only saves you a few seconds, or even runs slower?

    I believe Python will take advantage of your file system's buffering
    capabilities. Try it and see, you'll be surprised how fast it runs. If you
    try it and it is too slow, then come back and we'll see what can be done
    to speed it up. But don't try to speed it up before you know if it is fast
    enough.


    --
    Steven.
    Steven D'Aprano, Nov 20, 2005
    #10
  11. Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    On Sun, 20 Nov 2005 16:10:58 +1100, Steven D'Aprano wrote:

    > def get_line(filename, token):
    >     """Returns the next line following a token, or None if not found.
    >     Leading and trailing whitespace is ignored when looking for
    >     the token.
    >     """
    >     fp = file(filename, "r")
    >     for line in fp:
    >         if line.strip() == token:
    >             break
    >     else:
    >         # runs only if we didn't break
    >         print "Token not found"
    >         result = None
    >     result = fp.readline()  # read the next line only
    >     fp.close()
    >     return result


    Correction: checking the Library Reference, I find that this is
    wrong. The reason is that file objects implement their own read-ahead
    buffer, and mixing calls to next() and readline() may not work right.

    See http://docs.python.org/lib/bltin-file-objects.html

    Replace the fp.readline() with fp.next() and all should be good.
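    Worth noting for later readers: this read-ahead pitfall belongs to Python
    2's file objects. In Python 3's io module, iteration and readline() share
    one buffer, so mixing them behaves as you'd expect. A quick check (the
    sample file is invented):

```python
import os
import tempfile

# small sample file: three known lines
path = os.path.join(tempfile.mkdtemp(), "mix.txt")
with open(path, "w") as f:
    f.write("a\nb\nc\n")

# in Python 3, breaking out of iteration and then calling readline()
# really does return the next line -- no separate read-ahead buffer
with open(path) as fh:
    for line in fh:
        if line == "b\n":
            break
    following = fh.readline()

print(following)  # → 'c\n'
```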


    --
    Steven.
    Steven D'Aprano, Nov 20, 2005
    #11
  12. Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    Steve Holden wrote:

    >Xiao Jianfeng wrote:
    >
    >>Steven D'Aprano wrote:
    >>
    >>>On Sun, 20 Nov 2005 11:05:53 +0800, Xiao Jianfeng wrote:
    >>>
    >>>>I have some other questions:
    >>>>
    >>>>when "fh" will be closed?
    >>>
    >>>When all references to the file are no longer in scope:
    >>>
    >>>def handle_file(name):
    >>>    fp = file(name, "r")
    >>>    # reference to file now in scope
    >>>    do_stuff(fp)
    >>>    return fp
    >>>
    >>>f = handle_file("myfile.txt")
    >>># reference to file is now in scope
    >>>f = None
    >>># reference to file is no longer in scope
    >>>
    >>>At this point, Python *may* close the file. CPython currently closes the
    >>>file as soon as all references are out of scope. Jython does not -- it
    >>>will close the file eventually, but you can't guarantee when.
    >>>
    >>>>And what shoud I do if I want to explicitly close the file immediately
    >>>>after reading all data I want?
    >>>
    >>>That is the best practice.
    >>>
    >>>f.close()
    >>
    >> Let me introduce my problem I came across last night first.
    >>
    >> I need to read a file (which may be small or very big) and to check line
    >> by line to find a specific token, then the data on the next line will be
    >> what I want.
    >>
    >> If I use readlines(), it will be a problem when the file is too big.
    >>
    >> If I use "for line in OPENED_FILE:" to read one line each time, how can
    >> I get the next line when I find the specific token?
    >> And I think reading one line each time is less efficient, am I right?
    >
    >Not necessarily. Try this:
    >
    >    f = file("filename.txt")
    >    for line in f:
    >        if token in line:  # or whatever you need to identify it
    >            break
    >    else:
    >        sys.exit("File does not contain token")
    >    line = f.next()
    >
    >Then line will be the one you want. Since this will use code written in
    >C to do the processing you will probably be pleasantly surprised by its
    >speed. Only if this isn't fast enough should you consider anything more
    >complicated.
    >
    >Premature optimization can waste huge amounts of programming time. Don't
    >do it. First try measuring a solution that works!

    Oh yes, thanks.

    >regards
    >  Steve

    First, I must say thanks to all of you. And I'm really sorry that I
    didn't describe my problem clearly.

    There are many tokens in the file; every time I find a token, I have to
    get the data on the next line and do some operation with it. It would be
    easy to find just one token using the above method, but there is more
    than one.

    My method was:

    f_in = open('input_file', 'r')
    data_all = f_in.readlines()
    f_in.close()

    for i in range(len(data_all)):
        line = data_all[i]
        if token in line:
            # do something with data_all[i + 1]

    Since my method needs to read the whole file into memory, I think it may
    not be efficient when processing very big files.

    I really appreciate all suggestions! Thanks again.

    Regrads,

    xiaojf
    Xiao Jianfeng, Nov 20, 2005
    #12
  13. Guest


    Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    Xiao Jianfeng wrote:
    > First, I must say thanks to all of you. And I'm really sorry that I
    > didn't describe my problem clearly.
    >
    > There are many tokens in the file; every time I find a token, I have to
    > get the data on the next line and do some operation with it. It would be
    > easy to find just one token using the above method, but there is more
    > than one.
    >
    > My method was:
    >
    > f_in = open('input_file', 'r')
    > data_all = f_in.readlines()
    > f_in.close()
    >
    > for i in range(len(data_all)):
    >     line = data_all[i]
    >     if token in line:
    >         # do something with data_all[i + 1]
    >
    > Since my method needs to read the whole file into memory, I think it may
    > not be efficient when processing very big files.
    >
    > I really appreciate all suggestions! Thanks again.
    >

    Something like this:

    for x in fh:
        if not has_token(x):
            continue
        else:
            process(fh.next())

    You can also create an iterator with iter(fh), but I don't think that is
    necessary.

    This uses the iterator's "side effect" to your advantage. I was bitten
    by it before, but for your particular app it becomes an advantage.
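    A self-contained version of that sketch for the many-tokens case, in
    current Python (the token and file contents are invented; fh.next() is
    spelled next(fh) today):

```python
import os
import tempfile

# sample file with two tokens, each followed by a data line
path = os.path.join(tempfile.mkdtemp(), "multi.txt")
with open(path, "w") as f:
    f.write("TOKEN\nalpha\nnoise\nTOKEN\nbeta\n")

# every time a token line is seen, consume the following line from the
# same iterator; the for loop then resumes after that consumed line.
# (use next(fh, "") if a token could legally be the last line)
found = []
with open(path) as fh:
    for line in fh:
        if line.strip() == "TOKEN":
            found.append(next(fh).strip())

print(found)  # → ['alpha', 'beta']
```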
    , Nov 20, 2005
    #13
  14. Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    wrote:

    >Xiao Jianfeng wrote:
    >
    >> First, I must say thanks to all of you. And I'm really sorry that I
    >> didn't describe my problem clearly.
    >>
    >> There are many tokens in the file; every time I find a token, I have to
    >> get the data on the next line and do some operation with it. It would be
    >> easy to find just one token using the above method, but there is more
    >> than one.
    >>
    >> My method was:
    >>
    >> f_in = open('input_file', 'r')
    >> data_all = f_in.readlines()
    >> f_in.close()
    >>
    >> for i in range(len(data_all)):
    >>     line = data_all[i]
    >>     if token in line:
    >>         # do something with data_all[i + 1]
    >>
    >> Since my method needs to read the whole file into memory, I think it may
    >> not be efficient when processing very big files.
    >>
    >> I really appreciate all suggestions! Thanks again.
    >
    >Something like this:
    >
    >for x in fh:
    >    if not has_token(x):
    >        continue
    >    else:
    >        process(fh.next())
    >
    >You can also create an iterator with iter(fh), but I don't think that is
    >necessary.
    >
    >This uses the iterator's "side effect" to your advantage. I was bitten
    >by it before, but for your particular app it becomes an advantage.

    Thanks to all of you!

    I have compared the two methods:
    (1). "for x in fh:"
    (2). read all of the file into memory first.

    I have tested the two methods on two files, one 80M and the other 815M.
    The first method gained a speedup of about 40% for the first file, and a
    speedup of about 25% for the second file.

    Sorry for my bad English, and I hope I haven't made people confused.
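    For anyone repeating the comparison, it is worth first checking that the
    two methods agree on a small input before timing them on big files. A
    sketch in current Python (the token and toy data are invented):

```python
import os
import tempfile

# toy input: two tokens, each followed by a data line
path = os.path.join(tempfile.mkdtemp(), "cmp.txt")
with open(path, "w") as f:
    f.write("TOK\none\nTOK\ntwo\n")

# method (2): slurp everything with readlines(), then index
with open(path) as f:
    data_all = f.readlines()
slurped = [data_all[i + 1].strip()
           for i in range(len(data_all) - 1)
           if data_all[i].strip() == "TOK"]

# method (1): stream line by line, consuming the next line on a hit
streamed = []
with open(path) as fh:
    for line in fh:
        if line.strip() == "TOK":
            streamed.append(next(fh).strip())

print(slurped == streamed)  # → True
```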

    Regards,

    xiaojf
    Xiao Jianfeng, Nov 20, 2005
    #14
  15. Guest


    Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    Xiao Jianfeng wrote:
    > I have compared the two methods:
    > (1). "for x in fh:"
    > (2). read all of the file into memory first.
    >
    > I have tested the two methods on two files, one 80M and the other 815M.
    > The first method gained a speedup of about 40% for the first file, and a
    > speedup of about 25% for the second file.
    >
    > Sorry for my bad English, and I hope I haven't made people confused.


    So is the problem solved?

    Putting the buffering implementation aside, (1) is the way to go as it
    runs through the content only once.
    , Nov 20, 2005
    #15
  16. Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    wrote:

    >Xiao Jianfeng wrote:
    >
    >> I have compared the two methods:
    >> (1). "for x in fh:"
    >> (2). read all of the file into memory first.
    >>
    >> I have tested the two methods on two files, one 80M and the other 815M.
    >> The first method gained a speedup of about 40% for the first file, and a
    >> speedup of about 25% for the second file.
    >>
    >> Sorry for my bad English, and I hope I haven't made people confused.
    >
    >So is the problem solved?

    Yes, thank you.

    >Putting the buffering implementation aside, (1) is the way to go as it
    >runs through the content only once.

    I think so :)



    Regards,

    xiaojf
    Xiao Jianfeng, Nov 20, 2005
    #16
  17. Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    Ross Reyes wrote:

    > Maybe I'm missing the obvious, but it does not seem to say what happens when
    > the input for readlines is too big. Or does it?


    readlines handles memory overflow in exactly the same way as any
    other operation: by raising a MemoryError exception:

    http://www.python.org/doc/current/lib/module-exceptions.html#l2h-296

    > How does one tell exactly what the limitation is to the size of the
    > returned list of strings?


    you can't. it depends on how much memory you have, what your
    files look like (shorter lines means more string objects means more
    overhead), and how your operating system handles large processes.
    as soon as the operating system says that it cannot allocate more
    memory to the Python process, Python will abort the operation and
    raise an exception. if the operating system doesn't complain, neither
    will Python.
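    The per-string overhead is easy to measure in current Python. This sketch
    (file contents invented) shows that a file of very short lines costs far
    more as a list of strings than its size on disk:

```python
import os
import sys
import tempfile

# a file of many very short lines: worst case for readlines() overhead
path = os.path.join(tempfile.mkdtemp(), "short_lines.txt")
with open(path, "w") as f:
    for _ in range(10000):
        f.write("x\n")

with open(path) as fh:
    lines = fh.readlines()

raw = os.path.getsize(path)                            # bytes on disk
in_memory = sys.getsizeof(lines) + sum(sys.getsizeof(s) for s in lines)

# each tiny string carries tens of bytes of object overhead, so the
# in-memory list dwarfs the raw file size
print(in_memory > raw)  # → True
```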

    </F>
    Fredrik Lundh, Nov 20, 2005
    #17
  18. Mike Meyer


    Re: what happens when the file being read is too big for all lines to be read with "readlines()"

    "Ross Reyes" <> writes:

    > Yes, I have read this part....
    > How does one tell exactly what the limitation is to the size of the
    > returned list of strings?


    There's not really a good platform-independent way to do that, because
    you'll get memory until the OS won't give you any more.

    <mike
    --
    Mike Meyer <> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
    Mike Meyer, Nov 21, 2005
    #18
