How to insert string in each match using RegEx iterator

Discussion in 'Python' started by 504crank@gmail.com, Jun 10, 2009.

  1. Guest

    By what method would a string be inserted at each instance of a RegEx
    match?

    For example:

    string = '123 abc 456 def 789 ghi'
    newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'

    Here's the code I started with:

    >>> rePatt = re.compile('\d+\s')
    >>> iterator = rePatt.finditer(string)
    >>> count = 0
    >>> for match in iterator:
    ...     if count < 1:
    ...         print string[0:match.start()] + ' INSERT ' + string[match.start():match.end()]
    ...     elif count >= 1:
    ...         print ' INSERT ' + string[match.start():match.end()]
    ...     count = count + 1

    My code returns an empty string.

    I'm new to Python, but I'm finding it really enjoyable (with the
    exception of this challenging puzzle).

    Thanks in advance.
     
    , Jun 10, 2009
    #1
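
The loop in the question prints each piece separately instead of collecting the pieces into one string. A minimal sketch of a finditer() loop that rebuilds the whole string, assuming Python 3 (the thread itself uses Python 2 print statements). The helper name insert_before_matches is hypothetical and does not come from the thread:

```python
import re

def insert_before_matches(pattern, text, insertion):
    """Return text with insertion placed in front of every match of pattern."""
    pieces = []
    last_end = 0
    for match in re.finditer(pattern, text):
        # Keep the unmatched text since the previous match, then the
        # inserted string, then the match itself.
        pieces.append(text[last_end:match.start()])
        pieces.append(insertion)
        pieces.append(match.group(0))
        last_end = match.end()
    # Keep whatever follows the last match.
    pieces.append(text[last_end:])
    return "".join(pieces)

result = insert_before_matches(r"\d+\s", "123 abc 456 def 789 ghi", "INSERT ")
print(result)
```

Collecting the pieces in a list and joining once at the end avoids having to special-case the first match with a counter.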

  2. Roy Smith Guest


    > By what method would a string be inserted at each instance of a RegEx
    > match?
    >
    > For example:
    >
    > string = '123 abc 456 def 789 ghi'
    > newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'


    If you want to do what I think you are saying, you should be looking at the
    join() string method. I'm thinking something along the lines of:

    groups = match_object.groups()
    newstring = " INSERT ".join(groups)
     
    Roy Smith, Jun 10, 2009
    #2
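
Note that match_object.groups() returns the capture groups of a single match, so on its own it does not cover every match in the string. One way the join() idea might be made to work is sketched here, assuming Python 3 and assuming the text begins with a match (any text before the first number would be dropped by this pattern):

```python
import re

text = "123 abc 456 def 789 ghi"

# Each chunk is a number, the whitespace after it, and the non-digit text
# that follows, so together the chunks cover the whole string.
# Assumption: the text starts with a number; leading text would be lost.
chunks = re.findall(r"\d+\s\D*", text)

# join() only places the separator between chunks, so the first
# "INSERT " has to be added by hand.
newstring = "INSERT " + "INSERT ".join(chunks)
print(newstring)
```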

  3. Guest

    On Jun 9, 11:19 pm, Roy Smith wrote:
    > In article
    > <>,
    >
    >  "" <> wrote:
    > > By what method would a string be inserted at each instance of a RegEx
    > > match?

    >
    > > For example:

    >
    > > string = '123 abc 456 def 789 ghi'
    > > newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'

    >
    > If you want to do what I think you are saying, you should be looking at the
    > join() string method.  I'm thinking something along the lines of:
    >
    > groups = match_object.groups()
    > newstring = " INSERT ".join(groups)


    Fast answer, Roy. Thanks. That would be a graceful solution if it
    works. I'll give it a try and post a solution.

    Meanwhile, I know there's a logical problem with the way I was
    concatenating strings in the iterator loop.

    Here's a single instance example of what I'm trying to do:

    >>> string = 'abc 123 def 456 ghi 789'
    >>> match = rePatt.search(string)
    >>> print string[0:match.start()] + 'INSERT ' + string[match.end():len(string)]

    abc INSERT def 456 ghi 789
     
    , Jun 10, 2009
    #3
  4. Guest

    On Jun 9, 11:35 pm, "" <> wrote:
    > On Jun 9, 11:19 pm, Roy Smith <> wrote:
    >
    >
    >
    > > In article
    > > <>,

    >
    > >  "" <> wrote:
    > > > By what method would a string be inserted at each instance of a RegEx
    > > > match?

    >
    > > > For example:

    >
    > > > string = '123 abc 456 def 789 ghi'
    > > > newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'

    >
    > > If you want to do what I think you are saying, you should be looking at the
    > > join() string method.  I'm thinking something along the lines of:

    >
    > > groups = match_object.groups()
    > > newstring = " INSERT ".join(groups)

    >
    > Fast answer, Roy. Thanks. That would be a graceful solution if it
    > works. I'll give it a try and post a solution.
    >
    > Meanwhile, I know there's a logical problem with the way I was
    > concatenating strings in the iterator loop.
    >
    > Here's a single instance example of what I'm trying to do:
    >
    > >>> string = 'abc 123 def 456 ghi 789'
    > >>> match = rePatt.search(string)
    > >>> print string[0:match.start()] + 'INSERT ' + string[match.end():len(string)]

    >
    > abc INSERT def 456 ghi 789


    Thanks Roy. A little closer to a solution. I'm still processing how to
    step forward, but this is a good start:

    >>> string = 'abc 123 def 456 ghi 789'
    >>> rePatt = re.compile('\s\d+\s')
    >>> foundGroup = rePatt.findall(string)
    >>> newstring = ' INSERT '.join(foundGroup)
    >>> print newstring

    123 INSERT 456

    What I really want to do is return the full string, not just the
    matches -- concatenated around the ' INSERT ' string.
     
    , Jun 10, 2009
    #4
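
findall() returns only the matched text, which is why the rest of the string disappears. As a sketch, assuming Python 3, re.split() with a capturing group keeps both the matches and the text between them, so the full string can be reassembled:

```python
import re

string = 'abc 123 def 456 ghi 789'

# With a capturing group, split() also returns the separators it matched:
# even indexes hold the text in between, odd indexes hold the matches.
parts = re.split(r'(\s\d+\s)', string)

newstring = ''.join(' INSERT' + part if index % 2 else part
                    for index, part in enumerate(parts))
print(newstring)
```

The trailing 789 is left alone in this sketch because the pattern requires whitespace after the digits and the string ends right after them.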
  5. Peter Otten Guest

    wrote:

    > By what method would a string be inserted at each instance of a RegEx
    > match?
    >
    > For example:
    >
    > string = '123 abc 456 def 789 ghi'
    > newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'


    Have a look at re.sub():

    >>> s = '123 abc 456 def 789 ghi'
    >>> re.compile(r"(\d+\s)").sub(r"INSERT \1", s)

    'INSERT 123 abc INSERT 456 def INSERT 789 ghi'

    Peter
     
    Peter Otten, Jun 10, 2009
    #5
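
A small sketch of the same substitution, assuming Python 3. It also shows \g<0>, which refers to the entire match inside a replacement string, so the capturing parentheses are optional for this particular task:

```python
import re

s = '123 abc 456 def 789 ghi'

# \1 in the replacement is the text captured by the first group.
with_group = re.sub(r"(\d+\s)", r"INSERT \1", s)

# \g<0> is the whole match, so no group is needed in the pattern.
without_group = re.sub(r"\d+\s", r"INSERT \g<0>", s)

print(with_group)
print(without_group)
```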
  6. Paul McGuire Guest

    On Jun 9, 11:13 pm, "" <> wrote:
    > By what method would a string be inserted at each instance of a RegEx
    > match?
    >


    Some might say that using a parsing library for this problem is
    overkill, but let me just put this out there as another data point for
    you. Pyparsing (http://pyparsing.wikispaces.com) supports callbacks
    that allow you to embellish the matched tokens, and create a new
    string containing the modified text for each match of a pyparsing
    expression. Hmm, maybe the code example is easier to follow than the
    explanation...


    from pyparsing import Word, nums, Regex

    # an integer is a 'word' composed of numeric characters
    integer = Word(nums)

    # or use this if you prefer
    integer = Regex(r'\d+')

    # attach a parse action to prefix 'INSERT ' before the matched token
    integer.setParseAction(lambda tokens: "INSERT " + tokens[0])

    # use transformString to search through the input, applying the
    # parse action to all matches of the given expression
    test = '123 abc 456 def 789 ghi'
    print integer.transformString(test)

    # prints
    # INSERT 123 abc INSERT 456 def INSERT 789 ghi


    I offer this because often the simple examples that get posted are
    just the barest tip of the iceberg of what the poster eventually plans
    to tackle.

    Good luck in your Pythonic adventure!
    -- Paul
     
    Paul McGuire, Jun 10, 2009
    #6
  7. Brian D Guest

    On Jun 10, 5:17 am, Paul McGuire wrote:
    > On Jun 9, 11:13 pm, "" <> wrote:
    >
    > > By what method would a string be inserted at each instance of a RegEx
    > > match?

    >
    > Some might say that using a parsing library for this problem is
    > overkill, but let me just put this out there as another data point for
    > you.  Pyparsing (http://pyparsing.wikispaces.com) supports callbacks
    > that allow you to embellish the matched tokens, and create a new
    > string containing the modified text for each match of a pyparsing
    > expression.  Hmm, maybe the code example is easier to follow than the
    > explanation...
    >
    > from pyparsing import Word, nums, Regex
    >
    > # an integer is a 'word' composed of numeric characters
    > integer = Word(nums)
    >
    > # or use this if you prefer
    > integer = Regex(r'\d+')
    >
    > # attach a parse action to prefix 'INSERT ' before the matched token
    > integer.setParseAction(lambda tokens: "INSERT " + tokens[0])
    >
    > # use transformString to search through the input, applying the
    > # parse action to all matches of the given expression
    > test = '123 abc 456 def 789 ghi'
    > print integer.transformString(test)
    >
    > # prints
    > # INSERT 123 abc INSERT 456 def INSERT 789 ghi
    >
    > I offer this because often the simple examples that get posted are
    > just the barest tip of the iceberg of what the poster eventually plans
    > to tackle.
    >
    > Good luck in your Pythonic adventure!
    > -- Paul


    Thanks for all of the instant feedback. I have enumerated three
    responses below:

    First response:

    Peter,

    I wonder if you (or anyone else) might attempt a different explanation
    for the use of the special sequence '\1' in the RegEx syntax.

    The Python documentation explains:

    \number
    Matches the contents of the group of the same number. Groups are
    numbered starting from 1. For example, (.+) \1 matches 'the the' or
    '55 55', but not 'the end' (note the space after the group). This
    special sequence can only be used to match one of the first 99 groups.
    If the first digit of number is 0, or number is 3 octal digits long,
    it will not be interpreted as a group match, but as the character with
    octal value number. Inside the '[' and ']' of a character class, all
    numeric escapes are treated as characters.

    In practice, this appears to be the key device in your
    clever solution:

    >>> re.compile(r"(\d+)").sub(r"INSERT \1", string)

    'abc INSERT 123 def INSERT 456 ghi INSERT 789'

    >>> re.compile(r"(\d+)").sub(r"INSERT ", string)

    'abc INSERT def INSERT ghi INSERT '

    I don't, however, precisely understand what is meant by "the group of
    the same number" -- or maybe I do, but it isn't explicit. Is this just
    a shorthand reference to match.group(1) -- if that were valid --
    implying that the group match result is printed in the compile
    execution?


    Second response:

    I've encountered a problem with my RegEx learning curve which I'll be
    posting in a new thread -- how to escape hash characters # in strings
    being matched, e.g.:

    >>> string = re.escape('123#456')
    >>> match = re.match('\d+', string)
    >>> print match

    <_sre.SRE_Match object at 0x00A6A800>
    >>> print match.group()

    123


    Third response:

    Paul,

    Thanks for referring me to the Pyparsing module. I'm thoroughly
    enjoying Python, but I'm not prepared right now to say I've mastered
    the Pyparsing module. As I continue my work, however, I'll be tackling
    the problem of parsing addresses, exactly as the Pyparsing module
    example illustrates. I'm sure I'll want to use it then.
     
    Brian D, Jun 10, 2009
    #7
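
The quoted documentation describes \1 inside a pattern, while the sub() call uses \1 inside a replacement string. A sketch of the difference, assuming Python 3:

```python
import re

# Inside a pattern, \1 must match the same text that group 1 matched.
doubled = re.compile(r"(.+) \1")
print(bool(doubled.match("the the")))   # group 1 is 'the', followed by ' the'
print(bool(doubled.match("the end")))   # 'end' does not repeat 'the'

# Inside a replacement, \1 is filled in with group 1 of each match,
# the same text that match.group(1) would return for that match.
string = 'abc 123 def 456 ghi 789'
first = re.search(r"(\d+)", string)
print(first.group(1))
replaced = re.sub(r"(\d+)", r"INSERT \1", string)
print(replaced)
```

match() is used for the pattern test because it anchors at the start of the string; search() would be free to find a shorter repeated substring elsewhere.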
  9. Peter Otten Guest

    wrote:

    > I wonder if you (or anyone else) might attempt a different explanation
    > for the use of the special sequence '\1' in the RegEx syntax.
    >
    > The Python documentation explains:
    >
    > \number
    > Matches the contents of the group of the same number. Groups are
    > numbered starting from 1. For example, (.+) \1 matches 'the the' or
    > '55 55', but not 'the end' (note the space after the group). This
    > special sequence can only be used to match one of the first 99 groups.
    > If the first digit of number is 0, or number is 3 octal digits long,
    > it will not be interpreted as a group match, but as the character with
    > octal value number. Inside the '[' and ']' of a character class, all
    > numeric escapes are treated as characters.
    >
    > In practice, this appears to be the key device in your
    > clever solution:
    >
    >>>> re.compile(r"(\d+)").sub(r"INSERT \1", string)

    >
    > 'abc INSERT 123 def INSERT 456 ghi INSERT 789'
    >
    >>>> re.compile(r"(\d+)").sub(r"INSERT ", string)

    >
    > 'abc INSERT def INSERT ghi INSERT '
    >
    > I don't, however, precisely understand what is meant by "the group of
    > the same number" -- or maybe I do, but it isn't explicit. Is this just
    > a shorthand reference to match.group(1) -- if that were valid --
    > implying that the group match result is printed in the compile
    > execution?


    If I understand you correctly you are right. Another example:

    >>> re.compile(r"([a-z]+)(\d+)").sub(r"number=\2 word=\1", "a1 zzz42")

    'number=1 word=a number=42 word=zzz'

    For every match of "[a-z]+\d+" in the original string "\1" in
    "number=\2 word=\1" is replaced with the actual match for "[a-z]+" and
    "\2" is replaced with the actual match for "\d+".

    The result, e. g. "number=1 word=a", is then used to replace the actual
    match for group 0, i. e. "a1" in the example.

    Peter
     
    Peter Otten, Jun 10, 2009
    #9
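
The per-match replacement described above can be made visible with Match.expand(), which applies a replacement template to one match at a time. A sketch, assuming Python 3:

```python
import re

pattern = re.compile(r"([a-z]+)(\d+)")
template = r"number=\2 word=\1"
text = "a1 zzz42"

# For each match, group(0) is the text that gets replaced and
# expand() shows what the template turns into for that match.
steps = [(m.group(0), m.expand(template)) for m in pattern.finditer(text)]
for original, replacement in steps:
    print(original, "->", replacement)

# sub() performs the same expansion for every match in one call.
result = pattern.sub(template, text)
print(result)
```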
  10. Guest

    On Jun 10, 10:13 am, Peter Otten wrote:
    > wrote:
    > > I wonder if you (or anyone else) might attempt a different explanation
    > > for the use of the special sequence '\1' in the RegEx syntax.

    >
    > > The Python documentation explains:

    >
    > > \number
    > >     Matches the contents of the group of the same number. Groups are
    > > numbered starting from 1. For example, (.+) \1 matches 'the the' or
    > > '55 55', but not 'the end' (note the space after the group). This
    > > special sequence can only be used to match one of the first 99 groups.
    > > If the first digit of number is 0, or number is 3 octal digits long,
    > > it will not be interpreted as a group match, but as the character with
    > > octal value number. Inside the '[' and ']' of a character class, all
    > > numeric escapes are treated as characters.

    >
    > > In practice, this appears to be the key device in your
    > > clever solution:

    >
    > >>>> re.compile(r"(\d+)").sub(r"INSERT \1", string)

    >
    > > 'abc INSERT 123 def INSERT 456 ghi INSERT 789'

    >
    > >>>> re.compile(r"(\d+)").sub(r"INSERT ", string)

    >
    > > 'abc INSERT  def INSERT  ghi INSERT '

    >
    > > I don't, however, precisely understand what is meant by "the group of
    > > the same number" -- or maybe I do, but it isn't explicit. Is this just
    > > a shorthand reference to match.group(1) -- if that were valid --
    > > implying that the group match result is printed in the compile
    > > execution?

    >
    > If I understand you correctly you are right. Another example:
    >
    > >>> re.compile(r"([a-z]+)(\d+)").sub(r"number=\2 word=\1", "a1 zzz42")

    >
    > 'number=1 word=a number=42 word=zzz'
    >
    > For every match of "[a-z]+\d+" in the original string "\1" in
    > "number=\2 word=\1" is replaced with the actual match for "[a-z]+" and
    > "\2" is replaced with the actual match for "\d+".
    >
    > The result, e. g. "number=1 word=a", is then used to replace the actual
    > match for group 0, i. e. "a1" in the example.
    >
    > Peter


    Wow! That is so cool. I had to process it for a little while to get
    it.

    >>> s = '111bbb333'
    >>> re.compile(r'(\d+)([a-z]+)(\d+)').sub(r'First string: \1 Second string: \2 Third string: \3', s)

    'First string: 111 Second string: bbb Third string: 333'

    MRI scans would no doubt reveal that people who attain a mastery of
    RegEx expressions must have highly developed areas of the brain. I
    wonder where the RegEx part of the brain might be located.

    That was a really clever teaching device. I really appreciate you
    taking the time to post it, Peter. I'm definitely getting a schooling
    on this list.

    Thanks!
     
    , Jun 11, 2009
    #10
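
With several groups, numbered references get harder to read. A sketch using named groups instead, assuming Python 3. The group names first, second and third, and the [a-z]+ pattern for the middle group, are choices made for this illustration:

```python
import re

s = '111bbb333'

# (?P<name>...) names a group; \g<name> refers to it in the replacement.
pattern = re.compile(r'(?P<first>\d+)(?P<second>[a-z]+)(?P<third>\d+)')
result = pattern.sub(
    r'First string: \g<first> Second string: \g<second> Third string: \g<third>',
    s)
print(result)
```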
  11. Aahz Guest

    >
    >MRI scans would no doubt reveal that people who attain a mastery of
    >RegEx expressions must have highly developed areas of the brain. I
    >wonder where the RegEx part of the brain might be located.


    You want Friedl:
    http://www.powells.com/biblio/2-9780596528126-0
    --
    Aahz () <*> http://www.pythoncraft.com/

    "Many customs in this life persist because they ease friction and promote
    productivity as a result of universal agreement, and whether they are
    precisely the optimal choices is much less important." --Henry Spencer
     
    Aahz, Jun 12, 2009
    #11
  12. John S Guest

    On Jun 10, 12:13 am, "" <> wrote:
    > By what method would a string be inserted at each instance of a RegEx
    > match?
    >
    > For example:
    >
    > string = '123 abc 456 def 789 ghi'
    > newstring = ' INSERT 123 abc INSERT 456 def INSERT 789 ghi'
    >
    > Here's the code I started with:
    >
    > >>> rePatt = re.compile('\d+\s')
    > >>> iterator = rePatt.finditer(string)
    > >>> count = 0
    > >>> for match in iterator:

    >
    >         if count < 1:
    >                 print string[0:match.start()] + ' INSERT ' + string[match.start():match.end()]
    >         elif count >= 1:
    >                 print ' INSERT ' + string[match.start():match.end()]
    >         count = count + 1
    >
    > My code returns an empty string.
    >
    > I'm new to Python, but I'm finding it really enjoyable (with the
    > exception of this challenging puzzle).
    >
    > Thanks in advance.


    I like using a *callback* function instead of *plain text* with the
    re.sub() method. To do this, call the sub() function in the normal
    way, but instead of specifying a string as the replacement, specify a
    function. This function expects the same match object returned by
    re.search() or re.match(). The text matched by your RE is replaced by
    the return value of the function. This gives you a lot of flexibility;
    you can use the matched text to look up values in files or databases,
    or online, for instance, and you can do any sort of text manipulation
    desired.

    ----8<-----------------------------------------------------------------------
    import re

    # original string
    oldstring = '123 abc 456 def 789 ghi'

    # RE to match a sequence of 1 or more digits
    rx_digits = re.compile(r"\d+")

    # callback function -- expects a Match object, returns the replacement string
    def repl_func(m):
        return 'INSERT ' + m.group(0)

    # do the substitution
    newstring = rx_digits.sub(repl_func,oldstring)

    print "OLD:",oldstring
    print "NEW:",newstring
    ---------------------------------------------------------------------------------
    Output:
    OLD: 123 abc 456 def 789 ghi
    NEW: INSERT 123 abc INSERT 456 def INSERT 789 ghi


    You could also do it with a lambda function if you didn't want to
    write a separate function:
    newstring = rx_digits.sub(lambda m: 'INSERT ' + m.group(0),oldstring)

    I understand that for this simple case the plain replacement string
    r'INSERT \1' is sufficient, and a callback is overkill; I wanted to
    show the OP a more generic approach to complex substitutions.
     
    John S, Jun 13, 2009
    #12
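
To illustrate the point about looking values up from the matched text, here is a sketch of a callback that consults a dictionary, assuming Python 3. The labels table and the function name label_number are hypothetical and exist only for this example:

```python
import re

# Hypothetical lookup table keyed by the matched digits.
labels = {"123": "first", "456": "second"}

def label_number(m):
    """Build the replacement for one match from a dictionary lookup."""
    number = m.group(0)
    label = labels.get(number, "unknown")
    return "INSERT[%s] %s" % (label, number)

newstring = re.sub(r"\d+", label_number, "123 abc 456 def 789 ghi")
print(newstring)
```

The same shape would apply if the lookup went to a file or a database instead of a dictionary, since sub() only needs the callback to return a string.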