Processing a large string

goldtech

Hi,

Say I have a very big string with a pattern like:

akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....

I want to split the string into separate parts on the "3" and process
each part separately. I might run into memory limitations if I use
"split" and get a big array(?) I wondered if there's a way I could
read (stream?) the string from start to finish and read what's
delimited by the "3" into a variable, process the smaller string
variable then append/build a new string with the processed data?

Would I loop it and read it char by char till a "3"...? Or?

Thanks.
 
MRAB

You could write a generator like this:

def split(string, sep):
    pos = 0
    try:
        while True:
            next_pos = string.index(sep, pos)
            yield string[pos : next_pos]
            # Skip the whole separator, not just one character.
            pos = next_pos + len(sep)
    except ValueError:
        # No more separators: yield whatever is left.
        yield string[pos : ]

string = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn..."

for part in split(string, "3"):
    print(part)
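
To illustrate why a generator helps here, the following is a minimal sketch (not MRAB's exact code) of a find-based splitter consumed lazily. The name `lazy_split` and the sample data are assumptions made for illustration. Only the pieces actually requested are sliced out of the string, so taking the first two never scans the rest.

```python
from itertools import islice

def lazy_split(text, sep):
    """Yield the pieces of text between occurrences of sep, one at a time."""
    pos = 0
    while True:
        next_pos = text.find(sep, pos)
        if next_pos == -1:
            # No separator left: the remainder is the last piece.
            yield text[pos:]
            return
        yield text[pos:next_pos]
        pos = next_pos + len(sep)

sample = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn"

# Pull only the first two pieces; the generator stops scanning after that.
first_two = list(islice(lazy_split(sample, "3"), 2))
print(first_two)  # ['akakksssk', 'dhdhdhdbddb']
```

Consuming the generator to the end gives the same pieces as `sample.split("3")`, just without holding them all in a list at once.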
 
Steven D'Aprano

goldtech said:
Say I have a very big string with a pattern like:

akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....


Define "big".

What seems big to you is probably not big to your computer.

Would I loop it and read it char by char till a "3"...? Or?

You could, but unless there are a lot of 3s, it will probably be slow. If
the 3s are far apart, it will be better to do this:

# untested
def split(source):
    start = 0
    i = source.find("3")
    while i >= 0:
        yield source[start:i]
        start = i+1
        i = source.find("3", start)
    # Yield the final piece after the last "3".
    yield source[start:]


That should give you the pieces of the string one at a time, as efficiently
as possible.
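
One rough way to answer the "define big" question is to measure. The sketch below is only an illustration: the sample data is made up, and `sys.getsizeof` reports CPython's own bookkeeping, so the numbers are estimates and will differ between interpreters and versions.

```python
import sys

# Made-up sample: about 1 MB of text with a "3" after every nine characters.
text = "abcdefghi3" * 100_000
parts = text.split("3")

string_bytes = sys.getsizeof(text)
# The list object itself plus every piece it holds.
list_bytes = sys.getsizeof(parts) + sum(sys.getsizeof(p) for p in parts)

print(f"string:       {string_bytes / 1e6:.1f} MB")
print(f"split result: {list_bytes / 1e6:.1f} MB in {len(parts)} pieces")
```

If the split result fits comfortably in memory, plain `str.split` is likely the simplest choice; the generator approach matters mostly when it does not.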
 
Nobody

Use the .find() or .index() methods to find the next occurrence of a
character.

Building a large string by concatenation is inefficient, as each append
will copy the original string. If you must have the result as a
single string, using cStringIO would be preferable. But you'd be better
off if you can work with a list of strings.
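
A minimal sketch of both suggestions, assuming Python 3, where `io.StringIO` takes the place of `cStringIO`. The `process` function is a hypothetical stand-in for whatever is done to each piece.

```python
import io

def process(part):
    # Hypothetical per-piece transformation; replace with the real work.
    return part.upper()

sample = "akakksssk3dhdhdhdbddb3dkdkdkddk3asnsn"

# Option 1: collect processed pieces in a list and join once at the end.
pieces = [process(part) for part in sample.split("3")]
joined = "3".join(pieces)

# Option 2: write the pieces to an in-memory text buffer.
buffer = io.StringIO()
for index, part in enumerate(sample.split("3")):
    if index:
        buffer.write("3")  # put the separator back between pieces
    buffer.write(process(part))
buffered = buffer.getvalue()

print(joined == buffered)  # True
```

Either way the result is assembled once, rather than being copied on every append.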
 
Peter Otten

You can read the file in chunks:

from functools import partial

def read_chunks(instream, chunksize=None):
    if chunksize is None:
        chunksize = 2**20
    # Call instream.read(chunksize) until it returns "" (end of a text stream).
    return iter(partial(instream.read, chunksize), "")

def split_file(instream, delimiter, chunksize=None):
    leftover = ""
    chunk = None
    for chunk in read_chunks(instream, chunksize):
        chunk = leftover + chunk
        parts = chunk.split(delimiter)
        # The last part may be incomplete; carry it into the next chunk.
        leftover = parts.pop()
        for part in parts:
            yield part
    if leftover or chunk is None or chunk.endswith(delimiter):
        yield leftover

I hope I got the corner cases right.

PS: This has come up before, but I couldn't find the relevant threads...
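
The corner cases of chunked splitting can be checked against `str.split`. The sketch below uses a simplified splitter written for this check (the name `split_stream` is an assumption, and it is not the exact code above), with `io.StringIO` standing in for a real file and deliberately tiny chunk sizes so that delimiters land on chunk boundaries.

```python
import io

def split_stream(instream, delimiter, chunksize=2**20):
    """Yield delimiter-separated pieces from a text stream, reading in chunks."""
    leftover = ""
    while True:
        chunk = instream.read(chunksize)
        if not chunk:
            break
        parts = (leftover + chunk).split(delimiter)
        # The last part may be incomplete; keep it for the next round.
        leftover = parts.pop()
        yield from parts
    # Final piece (empty if the data was empty or ended with the delimiter).
    yield leftover

cases = ["", "3", "a3", "3a", "a33b", "akakksssk3dhdhdhdbddb3asnsn"]
for text in cases:
    for size in (1, 2, 5, 1024):
        result = list(split_stream(io.StringIO(text), "3", chunksize=size))
        assert result == text.split("3"), (text, size, result)
print("all cases match str.split")
```

Note that if the delimiter never occurs, `leftover` grows to hold the whole input, so this only saves memory when the pieces are reasonably small.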
 
Peter Otten

Peter Otten said:
PS: This has come up before, but I couldn't find the relevant threads...

Alex Martelli a looong time ago:
from __future__ import generators

def splitby(fileobj, splitter, bufsize=8192):
    buf = ''

    while True:
        try:
            item, buf = buf.split(splitter, 1)
        except ValueError:
            more = fileobj.read(bufsize)
            if not more: break
            buf += more
        else:
            yield item + splitter

    if buf:
        yield buf

http://mail.python.org/pipermail/python-list/2002-September/770673.html
 
Paul Rudin

import itertools

s = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn"
for is_sep, subs in itertools.groupby(s, lambda x: x == "3"):
    if not is_sep:  # skip the runs of "3" themselves
        print(''.join(subs))


What you actually do in the body of the loop depends on what you want to
do with the bits.
 
