Text Parsing - character at a time...

Discussion in 'Python' started by Fuzzyman, Jul 9, 2004.

  1. Fuzzyman

    Fuzzyman Guest

    I want to parse some text and generate an output that is similar but
    not identical to the input.

    The string I produce will be of similar length to the input string -
    but a bit longer.

    I'm parsing character by character and adding the characters of the
    input string to the output until I come to ones I want to modify. This
    means creating a new string for every character (since strings are
    immutable), which seems very inefficient, particularly when I know
    roughly what the output length will be. In a language like C I think I
    could reserve a chunk of memory and keep track of how much I'd
    filled, just putting characters into it. (If I filled it I could
    reserve another, smaller chunk; that's not difficult to keep track of.)
    What's an efficient equivalent in Python? I could use a list,
    appending characters onto the end of it and converting to a string at
    the end using ''.join(thelist).


    Regards,


    Fuzzy

    http://www.voidspace.org.uk/atlantibots/pythonutils.html
     
    Fuzzyman, Jul 9, 2004
    #1

  2. Fuzzyman

    Peter Hansen Guest

    Fuzzyman wrote:

    > I'm parsing character by character and adding the characters of the
    > input string to the output until I come to ones I want to modify.

    [...]
    > What's an efficient equivalent in python ?


    I think the answer might depend on information you haven't provided.
    How are you 'parsing' this? In other words, how do you know
    when you've gotten to the "ones you want to modify"? Are you sure
    there isn't a way of avoiding the one-by-one manual scan? If
    there is, you can probably add whole sequences of characters at a
    time to a list, as you suggest, by slicing from the input string
    once you know where a sequence of "kept" characters starts
    and stops.

    For that matter, could you use string.replace or re.sub to do
    the job even more efficiently?
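
    As a rough sketch of the slicing idea above: the function name
    escape_specials and the RULES table are hypothetical, chosen only for
    illustration, since the thread never says which characters need
    modifying. A regular expression finds the characters to change, and
    each run of "kept" characters between them is copied as one slice.

```python
import re


def escape_specials(text, replacements):
    """Copy unchanged runs as whole slices; swap in replacements between them.

    `replacements` maps single characters to replacement strings and is
    assumed to be non-empty (an empty character class is a regex error).
    """
    # Character class matching any key of the mapping.
    pattern = re.compile('[' + re.escape(''.join(replacements)) + ']')
    pieces = []
    start = 0  # index where the current run of kept characters begins
    for match in pattern.finditer(text):
        pieces.append(text[start:match.start()])    # whole kept run at once
        pieces.append(replacements[match.group()])  # the modified character
        start = match.end()
    pieces.append(text[start:])                     # trailing kept run
    return ''.join(pieces)


# Hypothetical rule set: escape a few HTML-significant characters.
RULES = {'<': '&lt;', '>': '&gt;', '&': '&amp;'}
print(escape_specials('a < b && c > d', RULES))
```

    The Python-level loop here runs once per modified character rather
    than once per input character, which is where any saving would come
    from.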

    -Peter
     
    Peter Hansen, Jul 9, 2004
    #2

  3. Fuzzyman

    Jeff Epler Guest

    It's not clear what you mean by claiming that "creating a new string for
    every character" is inefficient:
        $ timeit 'int()'
        100000 loops, best of 3: 1.26 usec per loop
        $ timeit 'str()'
        1000000 loops, best of 3: 1.28 usec per loop
        $ timeit 'chr(0)'
        1000000 loops, best of 3: 1.73 usec per loop

    If your output is a transformation of your input, I'd write
        def transform(input):
            def _transform():
                for c in input:
                    yield c  # yield a string zero or more times
            return ''.join(_transform())
    Python should automatically do some nice overallocation tricks to make
    this fairly efficient. You could also write
        def transform(input):
            result = []
            for c in input:
                result.append(c)  # append a string zero or more times
            return ''.join(result)
    and if you care about the absolute fastest code you'll benchmark both of
    them.

    A common "gotcha" for starting programmers would be to write something
    like
        def transform(input):
            result = ''
            for c in input:
                result += c  # add a string zero or more times
            return result
    because in this case Python won't (currently, anyway) do any clever
    overallocation tricks, but instead will do a copy of the partial result
    at the site of each +=.
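
    To make the suggested benchmark concrete, here is a minimal sketch
    that times all three variants on the same input. The modify() rule
    (doubling vowels) and the sample text are assumptions made up for the
    example; substitute whatever transformation is actually needed.

```python
import timeit


def modify(c):
    # Hypothetical rule: double every vowel, keep everything else as-is.
    return c * 2 if c in 'aeiou' else c


def transform_generator(text):
    # Generator feeding str.join directly.
    return ''.join(modify(c) for c in text)


def transform_list(text):
    # Accumulate pieces in a list, join once at the end.
    result = []
    for c in text:
        result.append(modify(c))
    return ''.join(result)


def transform_concat(text):
    # Repeated += on a string: the "gotcha" variant.
    result = ''
    for c in text:
        result += modify(c)
    return result


if __name__ == '__main__':
    sample = 'the quick brown fox jumps over the lazy dog ' * 1000
    variants = (transform_generator, transform_list, transform_concat)
    # All three must agree before their timings are worth comparing.
    assert len({func(sample) for func in variants}) == 1
    for func in variants:
        seconds = timeit.timeit(lambda: func(sample), number=20)
        print('%s: %.3f s' % (func.__name__, seconds))
```

    Timings depend on the interpreter: later CPython releases can
    sometimes extend a string in place for +=, so the gap may be smaller
    than described here, but that is an implementation detail rather than
    a guarantee.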

    Jeff

     
    Jeff Epler, Jul 9, 2004
    #3
  4. Fuzzyman

    John Lenton Guest

    On 9 Jul 2004 04:46:29 -0700, Fuzzyman wrote:
    > I want to parse some text and generate an output that is similar but
    > not identical to the input.
    >
    > The string I produce will be of similar length to the input string -
    > but a bit longer.
    >
    > I'm parsing character by character and adding the characters of the
    > input string to the output until I come to ones I want to modify. This
    > means creating a new string for every character (since strings are
    > immutable) which seems very inefficient - particularly when I know
    > roughly what the output length will be. In a language like c I think I
    > could reserve a chunk of memory and keep a track of how much I'd
    > filled... just putting characters into it.(If I filled it I could
    > reserve a smaller chunk more - not difficult to keep a track of).
    > What's an efficient equivalent in python ? I could use a list,
    > appending characters onto the end of it.. converting to a string at
    > the end using ''.join(thelist).


    I'm not terribly clear on what you're trying to do, but I'm pretty
    sure you can do it with regular expressions a lot more easily than the
    way you're describing; you might not even need that, and might get
    away with the 'replace' method on strings. Which you use depends on
    the complexity of what you want to do, and on which ends up being
    faster on your machine; as soon as it's more complicated than one or
    two 'replace's, regular expressions usually win.
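
    For illustration, the two options might look like this. The rule set
    (making tab, newline and carriage return visible) is a made-up
    example, not the original poster's actual problem, and the function
    names are hypothetical.

```python
import re
import timeit

# Hypothetical rules: show control characters as visible escapes.
VISIBLE = {'\t': '\\t', '\n': '\\n', '\r': '\\r'}
CONTROL = re.compile(r'[\t\n\r]')


def show_with_replace(text):
    # One pass over the text per rule; fine for one or two rules.
    return text.replace('\t', '\\t').replace('\n', '\\n').replace('\r', '\\r')


def show_with_sub(text):
    # Single pass; the callback looks up the replacement for each match.
    return CONTROL.sub(lambda m: VISIBLE[m.group()], text)


if __name__ == '__main__':
    sample = 'name\tvalue\r\n' * 10000
    assert show_with_replace(sample) == show_with_sub(sample)
    for func in (show_with_replace, show_with_sub):
        seconds = timeit.timeit(lambda: func(sample), number=20)
        print('%s: %.3f s' % (func.__name__, seconds))
```

    Chained replace calls also need care when one rule's output contains
    another rule's input; the single-pass re.sub form avoids that because
    each position in the input is examined only once.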

    If you describe (a subset of) the problem in a bit more detail,
    you'll probably get more useful suggestions (as in, code to do it, or
    even docs to read to do it).

    --
    John Lenton -- Random fortune:
    bash: fortune: command not found
     
    John Lenton, Jul 10, 2004
    #4
