regex over files

Robin Becker · Apr 25, 2005

Is there any way to get regexes to work on objects other than str/unicode
strings? I would like to split large files by regex, and it seems relatively
hard to do so without having the whole file in memory. Even with buffers it
seems hard to get a regex to indicate that it failed because the buffer ran
out, and making a partial match resumable seems out of the question.

What interface does re actually need for its src objects?

Bengt Richter · Apr 27, 2005

ISTM splitting is a special situation where you can easily
chunk through a file and split as you go, since if splitting
the current chunk succeeds, you can be sure that all but the
tail piece is valid[1]. So you can make an iterator that yields
all but the last and then sets the buffer to last+newchunk
and goes on until there are no more chunks, and the tail part
will be a valid split piece. E.g., (not tested beyond what you see ;-)
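>>> # (def line missing from the archived post; name and signature assumed)
>>> def splitfile(path, rxo, chunksize=8192):
...     buffer = ''
...     # iter(callable, '') calls f.read(chunksize) until it returns ''
...     for chunk in iter((lambda f=open(path): f.read(chunksize)),''):
...         buffer += chunk
...         pieces = rxo.split(buffer)
...         # all pieces but the last are complete; the last may still grow
...         for piece in pieces[:-1]: yield piece
...         buffer = pieces[-1]
...     yield buffer
...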
The test file:
----
This is going to be split on five X's
like XXXXX but we will use a buffer of
XXXXX length 2 to force buffer appending.
We'll try a splitter at the end: XXXXX
----
Splitting it on five X's with a chunk size of 2 gives:
"This is going to be split on five X's\nlike "
' but we will use a buffer of\n'
" length 2 to force buffer appending.\nWe'll try a splitter at the end: "
'\n'
With the separator in a capturing group, the separators come back as pieces too:
"This is going to be split on five X's\nlike "
'XXXXX'
' but we will use a buffer of\n'
'XXXXX'
" length 2 to force buffer appending.\nWe'll try a splitter at the end: "
'XXXXX'
'\n'
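
For anyone trying this on a current Python, here is a minimal Python 3 sketch of the same chunk-and-resplit idea. The name `split_file` and its parameters are assumptions for illustration, not from the post above. It only changes the plumbing: the file is closed by a `with` block and `yield from` replaces the inner loop.

```python
import re

def split_file(path, rxo, chunksize=8192):
    """Yield the pieces rxo.split() would give, reading the file in chunks.

    Hypothetical helper; assumes a pattern whose matches cannot change
    once more text is appended (e.g. a fixed-length literal).
    """
    buffer = ''
    with open(path) as f:  # file is closed once the generator is exhausted
        # iter(callable, '') keeps calling f.read() until it returns ''
        for chunk in iter(lambda: f.read(chunksize), ''):
            buffer += chunk
            pieces = rxo.split(buffer)
            yield from pieces[:-1]   # every piece but the tail is complete
            buffer = pieces[-1]      # the tail may continue in the next chunk
    yield buffer                     # whatever is left is the final piece
```

Used as `for piece in split_file(path, re.compile('XXXXX'), 2): print(repr(piece))`, it should reproduce the first listing above. Note that the buffer is re-split on every chunk, so a file with very few separators does a lot of repeated scanning.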

[1] With regexes that use lookahead, you might have to check that the
last piece not only exists but is longer than the maximum lookahead
length. Otherwise, given a <withlookahead>|<plain> kind of pattern,
<plain> could do the split where <withlookahead> would have succeeded
with another chunk appended to the buffer.
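
The hazard in [1] can be seen with a small Python 3 sketch. Everything here is an assumption for illustration: the name `split_stream`, the `holdback` parameter (meant to be the longest lookahead in the pattern), and the toy pattern. Pieces are only committed when a split happened and the tail is longer than `holdback`; otherwise the buffer is kept whole and re-split after the next chunk.

```python
import io
import re

def split_stream(stream, rxo, chunksize=8192, holdback=0):
    """Chunked rxo.split() that defers a split while the tail is too short.

    Hypothetical helper; holdback is assumed to be the maximum lookahead
    length of the pattern. Anchors and lookbehind are not handled.
    """
    buffer = ''
    for chunk in iter(lambda: stream.read(chunksize), ''):
        buffer += chunk
        pieces = rxo.split(buffer)
        # Commit only if something was split off and the tail is longer
        # than holdback; otherwise keep the whole buffer and read more.
        if len(pieces) > 1 and len(pieces[-1]) > holdback:
            yield from pieces[:-1]
            buffer = pieces[-1]
    # End of input: no more context can arrive, so split what is left.
    yield from rxo.split(buffer)

# <withlookahead>|<plain>: 'XY' when followed by 'Z', else a bare 'X'
rxo = re.compile(r'X(?=YZ)Y|X')
unguarded = list(split_stream(io.StringIO('aXYZb'), rxo, chunksize=3))
guarded = list(split_stream(io.StringIO('aXYZb'), rxo, chunksize=3, holdback=2))
print(unguarded)  # ['a', 'YZb']  (plain X split too early on 'aXY')
print(guarded)    # ['a', 'Zb']   (same as rxo.split('aXYZb'))
```

Requiring a non-empty tail even with `holdback=0` also covers greedy patterns such as `X+`, where a match that touches the end of the buffer might still grow with the next chunk.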

Regards,
Bengt Richter
