split large file by string/regex


Martin Dieringer

I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split that.
I could probably use a lexer, but maybe there is something simpler?
thanks
m.
 

Steve Holden

Martin said:
I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split that.
I could probably use a lexer, but maybe there is something simpler?
thanks
m.

Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.

regards
Steve
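For a fixed delimiter, the overlapping-chunk scan Steve describes can be sketched as below. This is only an illustration, not anything from the thread: the function name `find_all`, the chunk size, and the delimiter handling are all made up. The idea is to keep the last `len(needle)-1` bytes of each chunk so a match straddling a chunk boundary is still seen, and seen exactly once.

```python
def find_all(path, needle, chunksize=1 << 16):
    """Yield the file offset of every occurrence of the fixed byte
    string `needle`.  Chunks are read sequentially; the last
    len(needle)-1 bytes of the buffer are carried over so an
    occurrence straddling a chunk boundary is found exactly once."""
    overlap = len(needle) - 1
    base = 0          # file offset corresponding to buf[0]
    buf = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunksize)
            buf += chunk
            pos = buf.find(needle)
            while pos != -1:
                yield base + pos
                pos = buf.find(needle, pos + 1)
            if not chunk:          # EOF: everything has been searched
                break
            # everything before the overlap region is fully searched,
            # so only the tail needs to be kept for the next round
            keep = max(len(buf) - overlap, 0)
            base += keep
            buf = buf[keep:]
```

A complete occurrence always starts before the carried-over tail, so no match is reported twice; the chunk size only affects memory use, not the result.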
 

Jason Rennie

I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split that.
I could probably use a lexer, but maybe there is something simpler?

If the pattern is contained within a single line, do something like this:

import re
# f, f1, f2 are the input and the two output file names
myre = re.compile(r'foo')
fh = open(f)
fh1 = open(f1, 'w')
s = fh.readline()
while not myre.search(s):
    fh1.write(s)
    s = fh.readline()
fh1.close()
fh2 = open(f2, 'w')
while s:
    fh2.write(s)
    s = fh.readline()
fh2.close()
fh.close()

I'm doing this off the top of my head, so this code almost certainly
has bugs. Hopefully it's enough to get you started... Note that only
one line is held in memory at any point in time. Oh, if there's a
chance that the pattern does not appear in the file, you'll need to
check for eof in the first while loop.

Jason
 

Diez B. Roggisch

Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all
matches. For me that would be more complex than using a lexer, given the
excellent range of modules such as SPARK and PLY, to mention but two.

At least SPARK operates on whole strings when used as a lexer/tokenizer - you
can of course feed it a lazy sequence of tokens by using a generator, but
that's up to you.
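A lazy sequence of the kind Diez alludes to can come from a plain generator. This sketch is illustrative only (the name `blocks` and the default size are invented): it yields fixed-size chunks that a downstream tokenizer could consume one at a time.

```python
def blocks(path, size=1 << 16):
    """Lazily yield fixed-size blocks of a file, so a consumer such
    as a tokenizer never holds the whole file in memory at once."""
    with open(path, 'rb') as f:
        while True:
            block = f.read(size)
            if not block:       # empty read means EOF
                return
            yield block
```

Nothing is read until the consumer asks for the next block, which is what makes the sequence lazy.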
 

Martin Dieringer

Steve Holden said:
Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a
sequence of overlapping chunks to make sure that a regex could pick up
all matches. For me that would be more complex than using a lexer,
given the excellent range of modules such as SPARK and PLY, to mention
but two.

yes, lexing would be the simplest, but PLY also can't read from streams,
and it looks to me (from the examples) as if it's the same with SPARK.
I wonder why something like this is not in any lib.
Is there any known lexer that can do this?
I don't have to parse, just write the chunks to separate files.
I really hate doing that sequence thing...

m.
 

Martin Dieringer

Jason Rennie said:
If the pattern is contained within a single line, do something like this:

Hmm, it's binary data, so I can't tell how long lines would be. OTOH a
line would certainly contain the pattern, as it has no \n in it... and
the lines probably wouldn't be too large for memory...

m.
 

Bengt Richter

Hmm, it's binary data, so I can't tell how long lines would be. OTOH a
line would certainly contain the pattern, as it has no \n in it... and
the lines probably wouldn't be too large for memory...

m.
Do you want to keep the splitting string? I.e., if you split with xxx
from '1231xxx45646xxx45646xxx78' do you want the long-file equivalent of
['1231', '45646', '45646', '78']

or (I chose this for below)
['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

or maybe

['1231xxx', '45646xxx', '45646xxx', '78']

??

Anyway, I'd use a generator to iterate through the file and look for the delimiter.
This is case-sensitive, BTW (practically untested ;-):

--< splitfile.py >----------------------------------------------
def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    splen = len(splitstr)
    chunks = iter(lambda f=open(path,'rb'): f.read(chunksize), '')
    buf = ''
    for chunk in chunks:
        buf += chunk
        start = end = 0
        while end >= 0 and len(buf) >= splen:
            start, end = end, buf.find(splitstr, end)
            if end >= 0:
                yield buf[start:end]  # not including splitstr
                yield splitstr        # == buf[end:end+splen]
                end += splen
            else:
                buf = buf[start:]
                break
    yield buf

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    try:
        if len(args) == 3: args[2] = int(args[2])
    except Exception:
        raise SystemExit, 'Usage: python splitfile.py path splitstr [chunksize=64k]'
    test(*args)
----------------------------------------------------------------

Extent of testing follows :)
----------------------------------------
01234abc5678abc901234
567ab890abc
----------------------------------------
'01234'
'abc'
'5678'
'abc'
'901234\r\n567ab890'
'abc'
'\r\n'
''
'012'
'34abc5678abc9'
'012'
'34\r\n567ab890abc\r\n'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration

(I put it in my ut package directory but you can put splitfile.py anywhere handy
and mod it to do what you need).

Regards,
Bengt Richter
 

Denis S. Otkidach

Depends on your definition of "simple", I suppose. The problem with
*not* using a lexer is that you'd have to examine the file in a sequence
of overlapping chunks to make sure that a regex could pick up all

The re module works fine with an mmap-ed file, so there is no need to read it into memory.
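In modern Python that looks roughly like this. The sample data and pattern are invented stand-ins for the real file, but `re.finditer` really does accept an mmap object directly, since mmap supports the buffer protocol:

```python
import mmap
import os
import re
import tempfile

# invented sample data standing in for the large binary file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'1231xxx45646xxx45646xxx78')
    path = tmp.name

with open(path, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # the pattern is matched against the mapped pages, which the
        # OS brings in on demand; the file is never read into a string
        offsets = [m.start() for m in re.finditer(rb'xxx', mm)]

os.unlink(path)
print(offsets)   # [4, 12, 20]
```

Note the bytes pattern (`rb'...'`): matching against an mmap of a binary file requires bytes, not str.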
 

Martin Dieringer

Denis S. Otkidach said:
re module works fine with mmap-ed file, so no need to read it into memory.

thank you, this is the solution!
Now I can mmap.find all locations and then read the chunks via
file.seek and file.read

m.
 

William Park

Martin Dieringer said:
Hmm, it's binary data, so I can't tell how long lines would be. OTOH a
line would certainly contain the pattern, as it has no \n in it... and
the lines probably wouldn't be too large for memory...

man strings (-o option)
 

Denis S. Otkidach

thank you, this is the solution!
Now I can mmap.find all locations and then read the chunks via
file.seek and file.read

mmap-ed files also support subscripting and slicing. I guess
mmfile[start:stop] would be more readable.
 

Martin Dieringer

Denis S. Otkidach said:
thank you, this is the solution!
Now I can mmap.find all locations and then read the chunks via
file.seek and file.read

mmap-ed files also support subscripting and slicing. I guess
mmfile[start:stop] would be more readable.

yes, even better :)

m.
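Putting the thread's conclusion together, here is a sketch of the whole split using mmap.find plus slicing. The function name and the delimiter are stand-ins; the approach (find the delimiter positions in the mapped file, slice the chunks out between them) is the one the thread converges on.

```python
import mmap

def split_mmap(path, delim):
    """Yield the chunks of a binary file split on the fixed byte
    string `delim` (delimiter not included), using mmap.find and
    slicing so the file is never read into memory as a whole."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            while True:
                pos = mm.find(delim, start)
                if pos == -1:
                    yield mm[start:]      # trailing chunk after last delimiter
                    return
                yield mm[start:pos]
                start = pos + len(delim)
```

On a file containing `1231xxx45646xxx45646xxx78`, splitting on `b'xxx'` yields `b'1231'`, `b'45646'`, `b'45646'`, `b'78'` in turn.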
 
