Text Parsing - character at a time...

Fuzzyman · Jul 9, 2004

I want to parse some text and generate an output that is similar but
not identical to the input.

The string I produce will be of similar length to the input string -
but a bit longer.

I'm parsing character by character and adding the characters of the
input string to the output until I come to ones I want to modify. This
means creating a new string for every character (since strings are
immutable), which seems very inefficient, particularly when I know
roughly what the output length will be. In a language like C I think I
could reserve a chunk of memory and keep track of how much I'd
filled, just putting characters into it. (If I filled it, I could
reserve another, smaller chunk; it's not difficult to keep track of.)
What's an efficient equivalent in Python? I could use a list,
appending characters onto the end of it, then converting to a string at
the end using ''.join(thelist).

Regards,

Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html

Peter Hansen · Jul 9, 2004

Fuzzyman said:
I'm parsing character by character and adding the characters of the
input string to the output until I come to ones I want to modify. [...]
What's an efficient equivalent in python ?

I think the answer might depend on information you haven't provided.
How are you 'parsing' this? In other words, how do you know
when you've gotten to the "ones you want to modify"? Are you sure
there isn't a way of avoiding the one-by-one manual scan? If
there is, you can probably add whole sequences of characters at a
time to a list, as you suggest, by slicing from the input string
once you know where a sequence of "kept" characters starts
and stops.
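
A minimal sketch of that slicing idea, assuming the characters to modify can be found with a regular expression. The escaping rule (ESCAPES), the pattern (SPECIAL) and the function name are hypothetical, chosen only for illustration; the original post does not say which characters need changing.

```python
import re

# Hypothetical rule: put a backslash in front of quotes and backslashes.
ESCAPES = {'"': '\\"', '\\': '\\\\'}
SPECIAL = re.compile(r'["\\]')  # matches any single character that needs modifying

def transform(text):
    pieces = []
    start = 0  # index where the current run of "kept" characters begins
    for match in SPECIAL.finditer(text):
        pieces.append(text[start:match.start()])  # the whole kept run, as one slice
        pieces.append(ESCAPES[match.group()])     # the modified character
        start = match.end()
    pieces.append(text[start:])                   # trailing run after the last match
    return ''.join(pieces)
```

The list grows by one entry per run rather than one per character, so input with few special characters needs very few appends.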

For that matter, could you use string.replace or re.sub to do
the job even more efficiently?

-Peter

Jeff Epler · Jul 9, 2004

It's not clear what you mean by claiming that "creating a new string for
every character" is inefficient:
    $ timeit 'int()'
    100000 loops, best of 3: 1.26 usec per loop
    $ timeit 'str()'
    1000000 loops, best of 3: 1.28 usec per loop
    $ timeit 'chr(0)'
    1000000 loops, best of 3: 1.73 usec per loop

If your output is a transformation of your input, I'd write
    def transform(input):
        def _transform():
            for c in input:
                yield a string zero or more times
        return ''.join(_transform())
Python should automatically do some nice overallocation tricks to make
this fairly efficient. You could also write
    def transform(input):
        result = []
        for c in input:
            result.append(a string) zero or more times
        return ''.join(result)
and if you care about the absolute fastest code you'll benchmark both of
them.
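
A runnable sketch of that comparison, using the standard timeit module. The per-character rule in modify() (doubling backslashes) and the sample text are assumptions made up for illustration; substitute the real transformation and realistic input before trusting any timings.

```python
import timeit

def modify(c):
    # Hypothetical rule: double every backslash, keep everything else unchanged.
    return '\\\\' if c == '\\' else c

def transform_gen(text):
    # Generator version: pieces are produced lazily and joined once at the end.
    def pieces():
        for c in text:
            yield modify(c)
    return ''.join(pieces())

def transform_list(text):
    # List version: collect the pieces in a list, then join once at the end.
    result = []
    for c in text:
        result.append(modify(c))
    return ''.join(result)

sample = 'C:\\some\\path ' * 1000  # assumed sample input
assert transform_gen(sample) == transform_list(sample)

for fn in (transform_gen, transform_list):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(fn.__name__, seconds)
```

The numbers depend on the Python version and the machine, so no expected timings are shown here.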

A common "gotcha" for starting programmers would be to write something
like
    def transform(input):
        result = ''
        for c in input:
            result += a string zero or more times
        return result
because in this case Python won't (currently, anyway) do any clever
overallocation tricks, but instead will do a copy of the partial result
at the site of each +=.

Jeff

John Lenton · Jul 10, 2004

Fuzzyman said:
I'm parsing character by character and adding the characters of the
input string to the output until I come to ones I want to modify. [...]
What's an efficient equivalent in python ?

I'm not terribly clear on what you're trying to do, but I'm pretty
sure you can do it with regular expressions a lot more easily than the
way you're describing; you might not even need that, and might get
away with the 'replace' method on strings. Which you use depends on
the complexity of what you want to do, and on which ends up being
faster on your machine; as soon as it's more complicated than one or
two 'replace's, regular expressions usually win.
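
To make the two options concrete, here is a small sketch that performs the same substitutions both ways. The substitution table (TABLE) and the function names are hypothetical, since the actual rules have not been posted; only str.replace and re.sub themselves come from the standard library.

```python
import re

# Hypothetical substitution table, for illustration only.
TABLE = {'&': '&amp;', '<': '&lt;', '>': '&gt;'}

def with_replace(text):
    # Chained replace: each call scans the whole string again, and order matters.
    # '&' must be handled first, or the '&' inside '&lt;' would be escaped twice.
    return text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')

# A single pattern that matches any key of the table.
PATTERN = re.compile('|'.join(re.escape(key) for key in TABLE))

def with_regex(text):
    # One pass over the input; the callback looks up the replacement for each match.
    return PATTERN.sub(lambda match: TABLE[match.group()], text)
```

Both functions return the same result for this table. The regex version does not depend on the order of the rules, which matters more as the table grows.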

If you could describe (a subset of) the problem in a bit more detail,
you'll probably get more useful suggestions (as in, code to do it, or
even docs to read to do it).
