question about speed of sequential string replacement vs regex or

Discussion in 'Python' started by Xah Lee, Sep 28, 2011.

  1. Xah Lee

    Xah Lee Guest

    curious question.

    suppose you have 300 different strings and they need all be replaced
    to say "aaa".

    is it faster to replace each one sequentially (i.e. replace first
    string to aaa, then do the 2nd, 3rd,...)
    , or is it faster to use a regex with “or” them all and do replace one
    shot? (i.e. "1ststr|2ndstr|3rdstr|..." -> aaa)

    let's say the sourceString this replacement to be done on is 500k
    chars.

    Anyone? i suppose the answer will be similar for perl, python, ruby.

    btw, the origin of this question is about writing a emacs lisp
    function that replace ~250 html named entities to unicode char.

    Xah
     
    Xah Lee, Sep 28, 2011
    #1
    1. Advertising

  2. On Wed, Sep 28, 2011 at 7:28 PM, Xah Lee <> wrote:
    > suppose you have 300 different strings and they need all be replaced
    > to say "aaa".
    >
    > is it faster to replace each one sequentially


    Before you ask "is it faster", you need to first be sure it's correct.
    I would recommend doing all the replaces in a single function call to
    avoid risk of overlaps or other issues causing problems.

    ChrisA
     
    Chris Angelico, Sep 28, 2011
    #2
    1. Advertising

  3. >>>>> "Xah" == Xah Lee <> writes:

    Xah> curious question.
    Xah> suppose you have 300 different strings and they need all be replaced
    Xah> to say "aaa".

    And then suppose this isn't the *real* question, but one entirely of
    Fiction by Xah Lee.

    How helpful do you want to be?

    --
    Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
    <> <URL:http://www.stonehenge.com/merlyn/>
    Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
    See http://methodsandmessages.posterous.com/ for Smalltalk discussion
     
    Randal L. Schwartz, Sep 28, 2011
    #3
  4. Xah Lee

    Xah Lee Guest

    On Sep 28, 3:57 am, (Randal L. Schwartz) wrote:
    > >>>>> "Xah" == Xah Lee <> writes:

    >
    > Xah> curious question.
    > Xah> suppose you have 300 different strings and they need all be replaced
    > Xah> to say "aaa".
    >
    > And then suppose this isn't the *real* question, but one entirely of
    > Fiction by Xah Lee.
    >
    > How helpful do you want to be?


    it's a interesting question anyway.

    the question originally came from when i was coding elisp of a
    function that changes html entities to unicode char literal. The
    problem is slightly complicated, involving a few questions about speed
    in emacs. e.g. string vs buffer, and much more... i spent several
    hours on this but it's probably too boring to detail (but i'll do so
    if anyone wishes). But anyway, while digging these questions that's
    not clear in my mind, i thought of why not generate a regex or
    construct and do it in one shot, and wondered if that'd be faster. But
    afterwards, i realized this wouldn't be applicable to my problem
    because for my problem each string needs to be changed to a unique
    string, not all to the same string.

    Xah
     
    Xah Lee, Sep 28, 2011
    #4
  5. Xah Lee

    Xah Lee Guest

    here's more detail about the origin of this problem. Relevant to emacs
    lisp only.

    ------------------------------

    in the definition of “replace-regexp-in-string”, there's this comment:

    ;; To avoid excessive consing from multiple matches in long strings,
    ;; don't just call `replace-match' continually. Walk down the
    ;; string looking for matches of REGEXP and building up a (reversed)
    ;; list MATCHES. This comprises segments of STRING which weren't
    ;; matched interspersed with replacements for segments that were.
    ;; [For a `large' number of replacements it's more efficient to
    ;; operate in a temporary buffer; we can't tell from the function's
    ;; args whether to choose the buffer-based implementation, though it
    ;; might be reasonable to do so for long enough STRING.]

    note: «For a `large' number of replacements it's more efficient to
    operate in a temporary buffer».

    my question is, anyone got some idea, how “large” is large?

    currently, i have a function replace-pairs-in-string which is
    implemented by repeatedly calling “replace-pairs-in-string” like this:

    (while (< ii (length pairs))
    (setq mystr (replace-regexp-in-string
    (elt tempMapPoints ii)
    (elt (elt pairs ii) 1)
    mystr t t))
    (setq ii (1+ ii))
    )

    When there are 260 pairs of strings to be replaced on a file that's
    26k in size, my function takes about 3 seconds (which i think is too
    slow). I'm at pain deciding whether my function should be implemented
    like this or whether it should create a temp buffer. The problem with
    temp buffer is that, if you repeatedly call it, the overhead of
    creating buffer is going to make it much slower.

    i was actually surprised that replace-regexp-in-string isn't written
    in C, which i thought it was.

    is there technical reason the replace-regexp-in-string isn't C? (i
    suppose only because nobody every did it and the need for speed didn't
    suffice?) and also, shouldn't there also be a replace-in-string
    (literal, not regex)? because i thought implementing replacement for
    string should be much simpler and faster, because buffers comes with
    it a whole structure such as “point”, text properties, buffer names,
    buffier modifier, etc.

    Xah

    On Sep 28, 5:28 am, Xah Lee <> wrote:
    > On Sep 28, 3:57 am, (Randal L. Schwartz) wrote:
    >
    > > >>>>> "Xah" == Xah Lee <> writes:

    >
    > > Xah> curious question.
    > > Xah> suppose you have 300 different strings and they need all be replaced
    > > Xah> to say "aaa".

    >
    > > And then suppose this isn't the *real* question, but one entirely of
    > > Fiction by Xah Lee.

    >
    > > How helpful do you want to be?

    >
    > it's a interesting question anyway.
    >
    > the question originally came from when i was coding elisp of a
    > function that changes html entities to unicode char literal. The
    > problem is slightly complicated, involving a few questions about speed
    > in emacs. e.g. string vs buffer, and much more... i spent several
    > hours on this but it's probably too boring to detail (but i'll do so
    > if anyone wishes). But anyway, while digging these questions that's
    > not clear in my mind, i thought of why not generate a regex or
    > construct and do it in one shot, and wondered if that'd be faster. But
    > afterwards, i realized this wouldn't be applicable to my problem
    > because for my problem each string needs to be changed to a unique
    > string, not all to the same string.
    >
    >  Xah
     
    Xah Lee, Sep 28, 2011
    #5
  6. On Wed, Sep 28, 2011 at 10:28 PM, Xah Lee <> wrote:
    > each string needs to be changed to a unique
    > string, not all to the same string.


    And you'll find that this is, by and large, the most normal situation.
    Folding many strings down to one string is a lot less common. So,
    let's have some real-world use cases and then we can talk
    recommendations.

    ChrisA
     
    Chris Angelico, Sep 28, 2011
    #6
  7. Xah Lee

    Neil Cerutti Guest

    On 2011-09-28, Chris Angelico <> wrote:
    > On Wed, Sep 28, 2011 at 10:28 PM, Xah Lee <> wrote:
    >> each string needs to be changed to a unique string, not all to
    >> the same string.

    >
    > And you'll find that this is, by and large, the most normal
    > situation. Folding many strings down to one string is a lot
    > less common. So, let's have some real-world use cases and then
    > we can talk recommendations.


    I'd like to know what "string replacement" is supposed to mean in
    the context of Python.

    --
    Neil Cerutti
    "A politician is an arse upon which everyone has sat except a man."
    e. e. cummings
     
    Neil Cerutti, Sep 28, 2011
    #7
  8. On Wed, Sep 28, 2011 at 11:14 PM, Xah Lee <> wrote:
    > (while (< ii (length pairs))
    >      (setq mystr (replace-regexp-in-string
    >                   (elt tempMapPoints ii)
    >                   (elt (elt pairs ii) 1)
    >                   mystr t t))
    >      (setq ii (1+ ii))
    >      )


    from __future__ import lisp_syntax

    There, that makes it on-topic for the Python list... I guess.

    ChrisA
     
    Chris Angelico, Sep 28, 2011
    #8
  9. Xah Lee

    Roy Smith Guest

    In article <>,
    Neil Cerutti <> wrote:

    > On 2011-09-28, Chris Angelico <> wrote:
    > > On Wed, Sep 28, 2011 at 10:28 PM, Xah Lee <> wrote:
    > >> each string needs to be changed to a unique string, not all to
    > >> the same string.

    > >
    > > And you'll find that this is, by and large, the most normal
    > > situation. Folding many strings down to one string is a lot
    > > less common. So, let's have some real-world use cases and then
    > > we can talk recommendations.

    >
    > I'd like to know what "string replacement" is supposed to mean in
    > the context of Python.


    You just need to use "string" in the more generic computer-sciency
    sense, not in the python-data-type sense.

    s = "I am an immutable string"
    l = list(s) # now you can pretend strings are mutable
    do_stuff_to_string(l)
    s = ''.join(l)
     
    Roy Smith, Sep 28, 2011
    #9
  10. On Wed, Sep 28, 2011 at 11:22 PM, Neil Cerutti <> wrote:
    > I'd like to know what "string replacement" is supposed to mean in
    > the context of Python.
    >


    Substring replacement, such as:
    >>> "Hello, world!".replace(", "," -- ")

    'Hello -- world!'

    The str method doesn't accept a list, though, so it won't do
    simultaneous replaces. (At least, I don't think it can. Tried it only
    in Python 3.2 on Windows.)

    ChrisA
     
    Chris Angelico, Sep 28, 2011
    #10
  11. Xah Lee

    Willem Guest

    Xah Lee wrote:
    ) the question originally came from when i was coding elisp of a
    ) function that changes html entities to unicode char literal. The
    ) problem is slightly complicated, involving a few questions about speed
    ) in emacs. e.g. string vs buffer, and much more... i spent several
    ) hours on this but it's probably too boring to detail (but i'll do so
    ) if anyone wishes). But anyway, while digging these questions that's
    ) not clear in my mind, i thought of why not generate a regex or
    ) construct and do it in one shot, and wondered if that'd be faster. But
    ) afterwards, i realized this wouldn't be applicable to my problem
    ) because for my problem each string needs to be changed to a unique
    ) string, not all to the same string.

    In Perl, it would be applicable. You see, in Perl, you can call a function
    in the replacement of the regex substitution, which can then look up the
    html entity and return the wanted unicode literal.

    I think you can do that in some other languages as well.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Sep 28, 2011
    #11
  12. Xah Lee

    MRAB Guest

    On 28/09/2011 18:00, Willem wrote:
    > Xah Lee wrote:
    > ) the question originally came from when i was coding elisp of a
    > ) function that changes html entities to unicode char literal. The
    > ) problem is slightly complicated, involving a few questions about speed
    > ) in emacs. e.g. string vs buffer, and much more... i spent several
    > ) hours on this but it's probably too boring to detail (but i'll do so
    > ) if anyone wishes). But anyway, while digging these questions that's
    > ) not clear in my mind, i thought of why not generate a regex or
    > ) construct and do it in one shot, and wondered if that'd be faster. But
    > ) afterwards, i realized this wouldn't be applicable to my problem
    > ) because for my problem each string needs to be changed to a unique
    > ) string, not all to the same string.
    >
    > In Perl, it would be applicable. You see, in Perl, you can call a function
    > in the replacement of the regex substitution, which can then look up the
    > html entity and return the wanted unicode literal.
    >
    > I think you can do that in some other languages as well.
    >

    Including Python...
     
    MRAB, Sep 28, 2011
    #12
  13. Xah Lee

    Ian Kelly Guest

    On Wed, Sep 28, 2011 at 3:28 AM, Xah Lee <> wrote:
    > curious question.
    >
    > suppose you have 300 different strings and they need all be replaced
    > to say "aaa".
    >
    > is it faster to replace each one sequentially (i.e. replace first
    > string to aaa, then do the 2nd, 3rd,...)
    > , or is it faster to use a regex with “or” them all and do replace one
    > shot? (i.e. "1ststr|2ndstr|3rdstr|..." -> aaa)
    >
    > let's say the sourceString this replacement to be done on is 500k
    > chars.
    >
    > Anyone? i suppose the answer will be similar for perl, python, ruby.
    >
    > btw, the origin of this question is about writing a emacs lisp
    > function that replace ~250 html named entities to unicode char.


    I haven't timed it at the scale you're talking about, but for Python I
    expect regex will be your best bet:

    # Python 3.2: Supposing the match strings and replacements are
    # in a dict stored as `repls`...

    import re

    pattern = '|'.join(map(re.escape, repls.keys()))
    new_str = re.sub(pattern, lambda m: repls[m.group()], old_str)

    The problem with doing 300 str.replace calls is the 300 intermediate
    strings that would be created and then collected.

    Cheers,
    Ian
     
    Ian Kelly, Sep 28, 2011
    #13
  14. Xah Lee

    Willem Guest

    Eli the Bearded wrote:
    ) In comp.lang.perl.misc, Willem <> wrote:
    )> In Perl, it would be applicable. You see, in Perl, you can call a function
    )> in the replacement of the regex substitution, which can then look up the
    )> html entity and return the wanted unicode literal.
    )
    ) A function? I'd use a hash.

    A function can return a sensible value for unknown substitutions.
    In the case where you prebuild a giant regex or-list, that is not an issue,
    but I would match html entities generically.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Sep 28, 2011
    #14
  15. Xah Lee

    John Bokma Guest

    Willem <> writes:

    > Eli the Bearded wrote:
    > ) In comp.lang.perl.misc, Willem <> wrote:
    > )> In Perl, it would be applicable. You see, in Perl, you can call a function
    > )> in the replacement of the regex substitution, which can then look up the
    > )> html entity and return the wanted unicode literal.
    > )
    > ) A function? I'd use a hash.
    >
    > A function can return a sensible value for unknown substitutions.


    You can do that also in the RHS of the substitution and still keep it
    readable if you use something like

    s{..}{

    your
    code
    goes
    here
    }ge;

    However, a function can be easier on the eye:

    s{...}{ some_good_name( ... ) }ge;

    --
    John Bokma j3b

    Blog: http://johnbokma.com/ Perl Consultancy: http://castleamber.com/
    Perl for books: http://johnbokma.com/perl/help-in-exchange-for-books.html
     
    John Bokma, Sep 28, 2011
    #15
  16. Xah Lee

    Terry Reedy Guest

    On 9/28/2011 5:28 AM, Xah Lee wrote:
    > curious question.
    >
    > suppose you have 300 different strings and they need all be replaced
    > to say "aaa".
    >
    > is it faster to replace each one sequentially (i.e. replace first
    > string to aaa, then do the 2nd, 3rd,...)
    > , or is it faster to use a regex with “or†them all anddo replace one
    > shot? (i.e. "1ststr|2ndstr|3rdstr|..." -> aaa)


    Here the problem is replace multiple random substrings with one random
    substring that could create new matches. I would start with the re 'or'
    solution.

    > btw, the origin of this question is about writing a emacs lisp
    > function that replace ~250 html named entities to unicode char.


    As you noted this is a different problem in that there is a different
    replacement for each. Also, the substrings being searched for are not
    random but have a distinct and easily recognized structure. The
    replacement cannot create a new match. So the multiple scan approach
    *could* work.

    Unspecified is whether the input is unicode or ascii bytes. If the
    latter I might copy to a bytearray (mutable), scan forward, replace
    entity defs with utf-8 encoding of the corresponding unicode (with a
    dict lookup, and which I assume are *always* fewer chars), and shift
    other chars to close any gaps created.

    If the input is unicode, I might do the same with array.array (which is
    where bytearray came from). Or I might use the standard idiom of
    constructing a list of pieces of the original, with replacements, and
    ''.join() at the end.

    --
    Terry Jan Reedy
     
    Terry Reedy, Sep 28, 2011
    #16
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Mladen Adamovic
    Replies:
    0
    Views:
    777
    Mladen Adamovic
    Dec 4, 2003
  2. Mladen Adamovic
    Replies:
    3
    Views:
    14,776
    Mladen Adamovic
    Dec 5, 2003
  3. Chris Nevill

    Regex String Replacement

    Chris Nevill, Jan 31, 2004, in forum: Java
    Replies:
    5
    Views:
    483
    Chris Nevill
    Jan 31, 2004
  4. Avshi
    Replies:
    2
    Views:
    413
    Avshi
    May 18, 2005
  5. Xah Lee
    Replies:
    6
    Views:
    258
    John Bokma
    Sep 28, 2011
Loading...

Share This Page