Hmm it's binary data, I can't tell how long lines would be. OTOH a
line would certainly contain the pattern as it has no \n in it... and
the lines probably wouldn't be too large for memory...
m.
Do you want to keep the splitting string? I.e., if you split
'1231xxx45646xxx45646xxx78' on 'xxx', do you want the long-file equivalent of
['1231', '45646', '45646', '78']
or (I chose this for below)
['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']
or maybe
['1231xxx', '45646xxx', '45646xxx', '78']
??
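For a string small enough to sit in memory, the three options can be sketched like this (the variable names are mine, purely for illustration; re.split with a capturing group is what keeps the delimiter as its own item):

```python
import re

s = '1231xxx45646xxx45646xxx78'
sep = 'xxx'

# Option 1: delimiters dropped.
dropped = s.split(sep)

# Option 2: delimiters kept as separate items. A capturing group in the
# pattern makes re.split return the matched separators too.
separate = re.split('(' + re.escape(sep) + ')', s)

# Option 3: delimiters left attached to the end of the preceding piece.
attached = [piece + sep for piece in dropped[:-1]] + dropped[-1:]

print(dropped)
print(separate)
print(attached)
```

None of these helps directly with a file too big to read in one go, which is why the code below works chunk by chunk.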
Anyway, I'd use a generator to iterate through the file and look for the delimiter.
This is case-sensitive, BTW (practically untested ;-):
--< splitfile.py >----------------------------------------------
def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
    splen = len(splitstr)
    # iter(callable, sentinel): keep reading chunks until read() returns ''
    chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
    buf = ''
    for chunk in chunks:
        buf += chunk
        start = end = 0
        while end>=0 and len(buf)>=splen:
            start, end = end, buf.find(splitstr, end)
            if end>=0:
                yield buf[start:end] #not including splitstr
                yield splitstr # == buf[end:end+splen] # splitstr
                end += splen
            else:
                # no more delimiters in buf: keep the tail for the next chunk
                buf = buf[start:]
                break
    yield buf # whatever is left after the last delimiter

def test(*args):
    for chunk in splitfile(*args):
        print repr(chunk)

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    try:
        if len(args)==3: args[2]=int(args[2])
    except Exception:
        raise SystemExit, 'Usage: python splitfile.py path splitstr [chunksize=64k]'
    test(*args)
----------------------------------------------------------------
Extent of testing follows
----------------------------------------
01234abc5678abc901234
567ab890abc
----------------------------------------
'01234'
'abc'
'5678'
'abc'
'901234\r\n567ab890'
'abc'
'\r\n'

''
'012'
'34abc5678abc9'
'012'
'34\r\n567ab890abc\r\n'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
StopIteration
(I put it in my ut package directory but you can put splitfile.py anywhere handy
and mod it to do what you need).
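
The code above uses Python 2 syntax (print statement, raise with a comma). For anyone on Python 3, here is a minimal sketch of a port. It assumes the delimiter is passed as bytes, since the file is opened in binary mode; the with block is my addition so the file gets closed, and the splitting logic is otherwise the same. Treat it as lightly tested, like the original.

```python
def splitfile(path, splitstr, chunksize=64 * 1024):
    """Yield pieces of the file at path, alternating with the delimiter.

    splitstr must be a non-empty bytes object (assumption of this port).
    """
    splen = len(splitstr)
    buf = b''
    with open(path, 'rb') as f:
        # iter(callable, sentinel): read chunks until read() returns b''
        for chunk in iter(lambda: f.read(chunksize), b''):
            buf += chunk
            start = end = 0
            while end >= 0 and len(buf) >= splen:
                start, end = end, buf.find(splitstr, end)
                if end >= 0:
                    yield buf[start:end]   # piece before the delimiter
                    yield splitstr         # the delimiter itself
                    end += splen
                else:
                    # No further delimiter: keep the tail, which may hold
                    # the start of a delimiter that straddles two chunks.
                    buf = buf[start:]
                    break
    yield buf  # whatever follows the last delimiter


if __name__ == '__main__':
    import sys
    if len(sys.argv) not in (3, 4):
        raise SystemExit('Usage: python splitfile.py path splitstr [chunksize=64k]')
    size = int(sys.argv[3]) if len(sys.argv) == 4 else 64 * 1024
    for piece in splitfile(sys.argv[1], sys.argv[2].encode(), size):
        print(repr(piece))
```

Since the output always alternates piece, delimiter, piece, and so on, something like list(splitfile(path, b'xxx'))[::2] should give the delimiters-dropped variant, at the cost of building the whole list in memory.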
Regards,
Bengt Richter