Text Parsing - character at a time...

Fuzzyman

I want to parse some text and generate an output that is similar but
not identical to the input.

The string I produce will be of similar length to the input string -
but a bit longer.

I'm parsing character by character and adding the characters of the
input string to the output until I come to ones I want to modify. This
means creating a new string for every character (since strings are
immutable), which seems very inefficient, particularly when I know
roughly what the output length will be. In a language like C I think I
could reserve a chunk of memory and keep track of how much I'd
filled, just putting characters into it. (If I filled it I could
reserve another, smaller chunk; not difficult to keep track of.)
What's an efficient equivalent in Python? I could use a list,
appending characters onto the end of it and converting to a string at
the end using ''.join(thelist).


Regards,


Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html
 
Peter Hansen

Fuzzyman said:
I'm parsing character by character and adding the characters of the
input string to the output until I come to ones I want to modify. [...]
What's an efficient equivalent in python ?

I think the answer might depend on information you haven't provided.
How are you 'parsing' this? In other words, how do you know
when you've gotten to the "ones you want to modify"? Are you sure
there isn't a way of avoiding the one-by-one manual scan? If
there is, you can probably add whole sequences of characters at a
time to a list, as you suggest, by slicing from the input string
once you know where a sequence of "kept" characters starts
and stops.
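
A minimal sketch of the slicing idea, assuming purely for illustration that the characters to modify are `&`, `<` and `>` and that each becomes a longer escape sequence. The `REPLACEMENTS` table and the name `transform_by_slices` are hypothetical; the original post does not say what the real rule is. The loop still looks at every character, but the list only receives one slice per run of unchanged text rather than one string per character.

```python
# Hypothetical replacement table: the original post does not say which
# characters need modifying, so these are assumptions for illustration.
REPLACEMENTS = {'&': '&amp;', '<': '&lt;', '>': '&gt;'}


def transform_by_slices(text):
    """Copy runs of unchanged characters as whole slices."""
    pieces = []
    start = 0  # index where the current run of "kept" characters begins
    for i, c in enumerate(text):
        if c in REPLACEMENTS:
            # Flush the kept run in one slice, then add the replacement.
            pieces.append(text[start:i])
            pieces.append(REPLACEMENTS[c])
            start = i + 1
    # Flush whatever follows the last modified character.
    pieces.append(text[start:])
    return ''.join(pieces)


print(transform_by_slices('a < b && c'))  # prints: a &lt; b &amp;&amp; c
```

If the positions of the characters to modify can be found with `str.find` or a regular expression, the explicit loop can go away as well.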

For that matter, could you use string.replace or re.sub to do
the job even more efficiently?

-Peter
 
Jeff Epler

It's not clear what you mean by claiming that "creating a new string for
every character" is inefficient:
$ timeit 'int()'
100000 loops, best of 3: 1.26 usec per loop
$ timeit 'str()'
1000000 loops, best of 3: 1.28 usec per loop
$ timeit 'chr(0)'
1000000 loops, best of 3: 1.73 usec per loop

If your output is a transformation of your input, I'd write
    def transform(input):
        def _transform():
            for c in input:
                # yield a string zero or more times per character;
                # yielding c unchanged is only a placeholder
                yield c
        return ''.join(_transform())
Python should automatically do some nice overallocation tricks to make
this fairly efficient. You could also write
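    def transform(input):
        result = []  # a list, since strings have no append method
        for c in input:
            # append a string zero or more times per character;
            # appending c unchanged is only a placeholder
            result.append(c)
        return ''.join(result)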
and if you care about the absolute fastest code you'll benchmark both of
them.
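
One way to run that benchmark is with the standard `timeit` module. This is only a sketch: the `modify` rule (double every vowel) and the sample text are made up for illustration, and the numbers will differ between machines and Python versions.

```python
import timeit


def modify(c):
    # Hypothetical per-character rule (an assumption for illustration):
    # double every vowel, pass everything else through unchanged.
    return c * 2 if c in 'aeiou' else c


def transform_generator(text):
    # Variant 1: feed a generator straight into ''.join.
    return ''.join(modify(c) for c in text)


def transform_list(text):
    # Variant 2: collect the pieces in a list, then join once at the end.
    result = []
    for c in text:
        result.append(modify(c))
    return ''.join(result)


sample = 'the quick brown fox jumps over the lazy dog ' * 1000

for func in (transform_generator, transform_list):
    # Time 20 calls of each variant on the same input.
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print('%-20s %.4f s for 20 runs' % (func.__name__, seconds))
```

Both variants must return the same string for the comparison to mean anything, so it is worth checking that before reading the timings.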

A common "gotcha" for starting programmers would be to write something
like
    def transform(input):
        result = ''
        for c in input:
            # add a string zero or more times per character;
            # adding c unchanged is only a placeholder
            result += c
        return result
because in this case Python won't (currently, anyway) do any clever
overallocation tricks, but instead will do a copy of the partial result
at the site of each +=.

Jeff

 
John Lenton

Fuzzyman said:
I'm parsing character by character and adding the characters of the
input string to the output until I come to ones I want to modify. [...]
What's an efficient equivalent in python ?

I'm not terribly clear on what you're trying to do, but I'm pretty
sure you can do it with regular expressions a lot more easily than the
way you're describing; you might not even need that, since you might
get away with the 'replace' method on strings. Which you use depends on
the complexity of what you want to do, and on which ends up being
faster on your machine; as soon as it's more complicated than one or
two 'replace's, regular expressions usually win.
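
To make the two options concrete, here is a rough sketch of both, assuming a made-up set of substitutions (`&`, `<`, `>` to longer escapes); the `REPLACEMENTS` table and the function names are hypothetical. Which one is faster for a given input is something to measure rather than assume.

```python
import re

# Hypothetical substitutions, chosen only for illustration.
REPLACEMENTS = {'&': '&amp;', '<': '&lt;', '>': '&gt;'}


def with_replace(text):
    # One pass over the whole string per substitution. '&' must go first,
    # otherwise the '&' in '&lt;' and '&gt;' would be escaped again.
    return (text.replace('&', '&amp;')
                .replace('<', '&lt;')
                .replace('>', '&gt;'))


# A character class that matches any single key of REPLACEMENTS.
PATTERN = re.compile('[%s]' % re.escape(''.join(REPLACEMENTS)))


def with_regex(text):
    # Single pass: each matched character is looked up in the table.
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text)


print(with_replace('1 < 2 & 3 > 2'))  # prints: 1 &lt; 2 &amp; 3 &gt; 2
print(with_regex('1 < 2 & 3 > 2'))    # prints: 1 &lt; 2 &amp; 3 &gt; 2
```

The chained `replace` version depends on the order of the calls, while the regular expression handles all the substitutions in one scan and only needs the table extended when a new rule is added.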

If you could describe (a subset of) the problem in a bit more detail,
you'll probably get more useful suggestions (as in, code to do it, or
even docs to read to do it).
 
