file reading by record separator (not line by line)

Discussion in 'Python' started by Lee Sander, May 31, 2007.

  1. Lee Sander

    Lee Sander Guest

    Dear all,
    I would like to read a really huge file that looks like this:

    > name1....

    line_11
    line_12
    line_13
    ....
    >name2 ...

    line_21
    line_22
    ....
    etc

    where line_ij is just free-form text on that line.

    How can I read the file so that every time I do a "read()" I get exactly
    one record, up to the next ">"?

    many thanks
    Lee
    Lee Sander, May 31, 2007
    #1

  2. Lee Sander

    Lee Sander Guest

    I also wanted to say that this file is really huge, so I cannot
    just do a read() and then split on ">" to get the records.
    Thanks,
    Lee

    Lee Sander, May 31, 2007
    #2

  3. aspineux

    aspineux Guest

    Something like:

    name = None
    lines = []
    for line in open('yourfilename.txt'):
        if line.startswith('>'):
            if name is not None:
                print 'Here is the record', name
                print lines
                print
            name = line.rstrip('\r\n')
            lines = []
        else:
            lines.append(line.rstrip('\n'))



    aspineux, May 31, 2007
    #3
  4. Tijs

    Tijs Guest

    Lee Sander wrote:

    > I wanted to also say that this file is really huge, so I cannot
    > just do a read() and then split on ">" to get a record
    > thanks
    > lee


    Below is the easy solution. To get even better performance, or if '>' is not
    always at the start of the line, you would have to implement the buffering
    that is done by readline() yourself (see _fileobject in socket.py in the
    standard lib for example).

    def chunkreader(f):
        name = None
        lines = []
        while True:
            line = f.readline()
            if not line: break
            if line[0] == '>':
                if name is not None:
                    yield name, lines
                name = line[1:].rstrip()
                lines = []
            else:
                lines.append(line)
        if name is not None:
            yield name, lines

    if __name__ == '__main__':
        from StringIO import StringIO
        s = \
    """> name1
    line1
    line2
    line3
    > name2

    line 4
    line 5
    line 6"""
        f = StringIO(s)
        for name, lines in chunkreader(f):
            print '***', name
            print ''.join(lines)


    $ python test.py
    *** name1
    line1
    line2
    line3

    *** name2
    line 4
    line 5
    line 6
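
    The do-it-yourself buffering mentioned above could look roughly like the
    sketch below. It is written for Python 3 and is not the _fileobject code
    from socket.py; the name read_records and the chunk_size parameter are
    made up for illustration. It reads fixed-size blocks and splits on the
    separator wherever it occurs, so only one block plus one partial record
    is held in memory at a time.

```python
import io

def read_records(f, sep='>', chunk_size=64 * 1024):
    """Yield the text between separators, reading f in fixed-size blocks."""
    buf = ''
    while True:
        block = f.read(chunk_size)
        if not block:          # read() returns '' at end of file
            break
        buf += block
        parts = buf.split(sep)
        buf = parts.pop()      # last piece may be an incomplete record
        for part in parts:
            if part:           # skip empty pieces, e.g. before the first separator
                yield part
    if buf:
        yield buf              # final record has no trailing separator

if __name__ == '__main__':
    # A tiny chunk_size just to exercise records spanning several blocks.
    sample = io.StringIO('> name1\nline1\nline2\n> name2\nline 4\n')
    for record in read_records(sample, chunk_size=8):
        print(repr(record))
```

    Note that this sketch silently drops empty records, and that a single
    very long record makes the repeated split() calls slow; both are
    trade-offs of keeping the example short.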

    --

    Regards,
    Tijs
    Tijs, May 31, 2007
    #4
  5. Tijs

    Tijs Guest

    aspineux wrote:

    > Something like:
    >
    > name = None
    > lines = []
    > for line in open('yourfilename.txt'):
    >     if line.startswith('>'):
    >         if name is not None:
    >             print 'Here is the record', name
    >             print lines
    >             print
    >         name = line.rstrip('\r\n')
    >         lines = []
    >     else:
    >         lines.append(line.rstrip('\n'))


    That would miss the last chunk.

    --

    Regards,
    Tijs
    Tijs, May 31, 2007
    #5
  6. Marc 'BlackJack' Rintsch

    Lee Sander wrote:

    > Dear all,
    > I would like to read a really huge file that looks like this:
    >
    > > name1....
    >
    > line_11
    > line_12
    > line_13
    > ...
    > >name2 ...
    >
    > line_21
    > line_22
    > ...
    > etc
    >
    > where line_ij is just a free form text on that line.
    >
    > how can i read file so that every time i do a "read()" i get exactly
    > one record
    > up to the next ">"


    There was just recently a thread with an `itertools.groupby()` solution.
    Something like this:

    from itertools import groupby, imap
    from operator import itemgetter

    def mark_records(lines):
        counter = 0
        for line in lines:
            if line.startswith('>'):
                counter += 1
            yield (counter, line)


    def iter_records(lines):
        fst = itemgetter(0)
        snd = itemgetter(1)
        for dummy, record_lines in groupby(mark_records(lines), fst):
            yield imap(snd, record_lines)


    def main():
        source = """\
    > name1....

    line_11
    line_12
    line_13
    ....
    > name2 ...

    line_21
    line_22
    ....""".splitlines()

        for record in iter_records(source):
            print 'Start of record...'
            for line in record:
                print ':', line


    if __name__ == '__main__':
        main()
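
    On Python 3, `itertools.imap` no longer exists and the built-in `map` is
    already lazy, so the same idea might be sketched as below. One caveat
    with `groupby()`: every group shares the underlying iterator, so each
    record has to be consumed before moving on to the next one. The sample
    lines are placeholders in the style of the original question.

```python
from itertools import groupby
from operator import itemgetter

def mark_records(lines):
    """Pair every line with the running number of the record it belongs to."""
    counter = 0
    for line in lines:
        if line.startswith('>'):
            counter += 1
        yield counter, line

def iter_records(lines):
    """Yield one lazy iterator of lines per record."""
    for _, group in groupby(mark_records(lines), key=itemgetter(0)):
        yield map(itemgetter(1), group)

if __name__ == '__main__':
    source = ['> name1', 'line_11', 'line_12', '> name2', 'line_21']
    for record in iter_records(source):
        print('Start of record...')
        for line in record:      # consume the record before asking for the next
            print(':', line)
```

    Because `lines` can be an open file object, this never holds more than
    one line in memory at a time.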

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, May 31, 2007
    #6
  7. Hendrik van Rooyen

    "Lee Sander" wrote:

    > I wanted to also say that this file is really huge, so I cannot
    > just do a read() and then split on ">" to get a record
    > thanks
    > lee
    >
    > On May 31, 1:26 pm, Lee Sander wrote:
    > > Dear all,
    > > I would like to read a really huge file that looks like this:
    > >
    > > > name1....
    > >
    > > line_11
    > > line_12
    > > line_13
    > > ...
    > > >name2 ...
    > >
    > > line_21
    > > line_22
    > > ...
    > > etc
    > >
    > > where line_ij is just a free form text on that line.
    > >
    > > how can i read file so that every time i do a "read()" i get exactly
    > > one record
    > > up to the next ">"
    > >
    > > many thanks
    > > Lee


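    I would do something like this (not tested):

    def get_a_record(f, sep):
        # Read one character at a time until the separator or end of file.
        ret_rec = ''
        while True:
            char = f.read(1)
            if not char or char == sep:  # '' means end of file
                break
            ret_rec += char
        return ret_rec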
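
    Since read(1) returns an empty string at end of file, a loop over the
    whole file has to check for that explicitly, and a single return value
    cannot distinguish "empty record" from "no more data". A generator
    variant sidesteps both. This is only a sketch for Python 3, and the name
    iter_records_by_char is made up:

```python
import io

def iter_records_by_char(f, sep='>'):
    """Yield records from f, reading one character at a time."""
    chars = []
    while True:
        char = f.read(1)
        if not char:                 # end of file
            break
        if char == sep:
            if chars:                # skip the empty piece before the first sep
                yield ''.join(chars)
            chars = []
        else:
            chars.append(char)
    if chars:                        # last record has no trailing separator
        yield ''.join(chars)

if __name__ == '__main__':
    sample = io.StringIO('>a\nb\n>c\n')
    for record in iter_records_by_char(sample):
        print(repr(record))
```

    Collecting characters in a list and joining once avoids repeated string
    concatenation, but one read() call per character is still likely to be
    slow on a really huge file.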

    - Hendrik
    Hendrik van Rooyen, Jun 1, 2007
    #7