regex over files

Discussion in 'Python' started by Robin Becker, Apr 25, 2005.

  1. Robin Becker

    Robin Becker Guest

    Is there any way to get regexes to work on non-string/unicode objects? I would
    like to split large files by regex and it seems relatively hard to do so without
    having the whole file in memory. Even with buffers it seems hard to get regexes
    to indicate that they failed because of buffer termination, and getting a partial
    match to be resumable seems out of the question.

    What interface does re actually need for its src objects?
    --
    Robin Becker
     
    Robin Becker, Apr 25, 2005
    #1

  2. On Mon, 25 Apr 2005 16:01:45 +0100, Robin Becker wrote:

    >Is there any way to get regexes to work on non-string/unicode objects? I would
    >like to split large files by regex and it seems relatively hard to do so without
    >having the whole file in memory. Even with buffers it seems hard to get regexes
    >to indicate that they failed because of buffer termination, and getting a partial
    >match to be resumable seems out of the question.
    >
    >What interface does re actually need for its src objects?


    ISTM splitting is a special situation where you can easily
    chunk through a file and split as you go, since if splitting
    the current chunk succeeds, you can be sure that all but the
    tail piece are valid[1]. So you can make an iterator that yields
    all but the last piece, then sets the buffer to last+newchunk,
    and goes on until there are no more chunks, at which point the
    tail part will be a valid split piece. E.g., (not tested beyond what you see ;-)

    >>> def frxsplit(path, rxo, chunksize=8192):
    ...     buffer = ''
    ...     for chunk in iter((lambda f=open(path): f.read(chunksize)),''):
    ...         buffer += chunk
    ...         pieces = rxo.split(buffer)
    ...         for piece in pieces[:-1]: yield piece
    ...         buffer = pieces[-1]
    ...     yield buffer
    ...
    >>> import re
    >>> rxo = re.compile('XXXXX')


    The test file:

    >>> print '----\n%s----'%open('tsplit.txt').read()

    ----
    This is going to be split on five X's
    like XXXXX but we will use a buffer of
    XXXXX length 2 to force buffer appending.
    We'll try a splitter at the end: XXXXX
    ----

    >>> for piece in frxsplit('tsplit.txt', rxo, 2): print repr(piece)
    ...
    "This is going to be split on five X's\nlike "
    ' but we will use a buffer of\n'
    " length 2 to force buffer appending.\nWe'll try a splitter at the end: "
    '\n'

    >>> rxo = re.compile('(XXXXX)')
    >>> for piece in frxsplit('tsplit.txt', rxo, 2): print repr(piece)
    ...
    "This is going to be split on five X's\nlike "
    'XXXXX'
    ' but we will use a buffer of\n'
    'XXXXX'
    " length 2 to force buffer appending.\nWe'll try a splitter at the end: "
    'XXXXX'
    '\n'
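
The session above is Python 2. A rough Python 3 equivalent might look like the sketch below. It is an adaptation rather than the code as posted: it assumes the caller passes an already-open text-mode file object instead of a path (so a `with` block can close the file), and it uses `io.StringIO` as a stand-in for tsplit.txt.

```python
import io
import re

def frxsplit(fileobj, rxo, chunksize=8192):
    """Yield the pieces of fileobj's text split on the compiled regex rxo."""
    buffer = ''
    # iter(callable, sentinel) keeps calling read() until it returns '' (EOF).
    for chunk in iter(lambda: fileobj.read(chunksize), ''):
        buffer += chunk
        pieces = rxo.split(buffer)
        # Pieces before the last are bounded by a match already seen: emit them.
        yield from pieces[:-1]
        # The last piece may still grow into a match, so carry it forward.
        buffer = pieces[-1]
    yield buffer

# Stand-in for tsplit.txt (assumption: same text as the test file above).
text = ("This is going to be split on five X's\n"
        "like XXXXX but we will use a buffer of\n"
        "XXXXX length 2 to force buffer appending.\n"
        "We'll try a splitter at the end: XXXXX\n")

for piece in frxsplit(io.StringIO(text), re.compile('XXXXX'), 2):
    print(repr(piece))
```

With a real file, something like `with open('tsplit.txt') as f: pieces = list(frxsplit(f, rxo))` should behave the same way.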

    [1] In some cases of regexes with lookahead context, you might
    have to check that the last piece not only exists but exceeds
    the max lookahead length, in case there is a <withlookahead>|<plain>
    kind of thing in the regex where <withlookahead> would have succeeded
    with another chunk appended to buffer, but <plain> did the split.
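
One way to guard against the case in footnote [1] is to defer yielding anything until the tail piece is at least as long as the context the pattern may need past a match. The following is only a sketch in Python 3: the name `split_chunks` and the `holdback` parameter are made up for illustration, the function takes an iterable of text chunks rather than a file, and the pattern is a contrived `<withlookahead>|<plain>` example.

```python
import re

def split_chunks(chunks, rxo, holdback=0):
    """Split an iterable of text chunks on rxo, deferring doubtful splits.

    holdback (hypothetical parameter): how many characters after a match
    the pattern may need to inspect, e.g. its longest lookahead.
    """
    buffer = ''
    pieces = ['']
    for chunk in chunks:
        buffer += chunk
        pieces = rxo.split(buffer)
        if len(pieces[-1]) >= holdback:
            # Enough text follows the last match: earlier pieces are safe.
            yield from pieces[:-1]
            buffer = pieces[-1]
            pieces = [buffer]
        # Otherwise keep the whole buffer and re-split once more text arrives.
    # At end of input no more context can arrive, so emit what is left.
    yield from pieces

rxo = re.compile(r'x(?=y)y|x')   # contrived <withlookahead>|<plain> pattern
print(rxo.split('axyb'))                                  # whole string at once
print(list(split_chunks(['ax', 'yb'], rxo)))              # no holdback
print(list(split_chunks(['ax', 'yb'], rxo, holdback=1)))  # with holdback
```

With no holdback, the chunk boundary after 'ax' lets the plain branch split early, giving ['a', 'yb'] instead of ['a', 'b']. Holding back one character gives the same pieces as splitting the whole string, at the cost of re-splitting the buffer whenever a split is deferred.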

    Regards,
    Bengt Richter
     
    Bengt Richter, Apr 27, 2005
    #2