regex help

Discussion in 'Python' started by Gabriel Rossetti, Dec 16, 2009.

  1. Hello everyone,

    I'm going nuts with some regex, could someone please show me what I'm
    doing wrong?

    I have an XMPP msg :

    <message xmlns='jabber:client' to=''>
    <mynode xmlns='myprotocol:core' version='1.0' type='mytype'>
    <parameters>
    <param1>123</param1>
    <param2>456</param2>
    </parameters>
    <payload type='plain'>...</payload>
    </mynode>
    <x xmlns='jabber:x:expire' seconds='15'/>
    </message>

    The <parameters> node may be absent or empty (<parameters/>), and the <x>
    node may be absent. I'd like to grab everything except the <payload> node
    and create something new using regex. With the XMPP message example above
    I'd get this :

    <message xmlns='jabber:client' to=''>
    <mynode xmlns='myprotocol:core' version='1.0' type='mytype'>
    <parameters>
    <param1>123</param1>
    <param2>456</param2>
    </parameters>
    </mynode>
    <x xmlns='jabber:x:expire' seconds='15'/>
    </message>

    for some reason my regex doesn't work correctly :

    r"(<message .*?>).*?(<mynode .*?>).*?(?:(<parameters>.*?</parameters>)|<parameters/>)?.*?(<x .*/>)?"

    I group the opening <message> node and the opening <mynode> node. If the
    <parameters> node is present and not empty I group it, and if the <x>
    node is present I group it. For some reason this doesn't work correctly :

    >>> import re
    >>> s1 = "<message xmlns='jabber:client' to=''><mynode xmlns='myprotocol:core' version='1.0' type='mytype'><parameters><param1>123</param1><param2>456</param2></parameters><payload type='plain'>...</payload></mynode><x xmlns='jabber:x:expire' seconds='15'/></message>"
    >>> s2 = "<message xmlns='jabber:client' to=''><mynode xmlns='myprotocol:core' version='1.0' type='mytype'><parameters/><payload type='plain'>...</payload></mynode><x xmlns='jabber:x:expire' seconds='15'/></message>"
    >>> s3 = "<message xmlns='jabber:client' to=''><mynode xmlns='myprotocol:core' version='1.0' type='mytype'><payload type='plain'>...</payload></mynode><x xmlns='jabber:x:expire' seconds='15'/></message>"
    >>> s4 = "<message xmlns='jabber:client' to=''><mynode xmlns='myprotocol:core' version='1.0' type='mytype'><parameters><param1>123</param1><param2>456</param2></parameters><payload type='plain'>...</payload></mynode></message>"
    >>> s5 = "<message xmlns='jabber:client' to=''><mynode xmlns='myprotocol:core' version='1.0' type='mytype'><parameters/><payload type='plain'>...</payload></mynode></message>"
    >>> s6 = "<message xmlns='jabber:client' to=''><mynode xmlns='myprotocol:core' version='1.0' type='mytype'><payload type='plain'>...</payload></mynode></message>"
    >>> exp = r"(<message .*?>).*?(<mynode .*?>).*?(?:(<parameters>.*?</parameters>)|<parameters/>)?.*?(<x .*/>)?"
    >>>
    >>> re.match(exp, s1).groups()
    ("<message xmlns='jabber:client' to=''>", "<mynode xmlns='myprotocol:core' version='1.0' type='mytype'>", '<parameters><param1>123</param1><param2>456</param2></parameters>', None)
    >>>
    >>> re.match(exp, s2).groups()
    ("<message xmlns='jabber:client' to=''>", "<mynode xmlns='myprotocol:core' version='1.0' type='mytype'>", None, None)
    >>>
    >>> re.match(exp, s3).groups()
    ("<message xmlns='jabber:client' to=''>", "<mynode xmlns='myprotocol:core' version='1.0' type='mytype'>", None, None)
    >>>
    >>> re.match(exp, s4).groups()
    ("<message xmlns='jabber:client' to=''>", "<mynode xmlns='myprotocol:core' version='1.0' type='mytype'>", '<parameters><param1>123</param1><param2>456</param2></parameters>', None)
    >>>
    >>> re.match(exp, s5).groups()
    ("<message xmlns='jabber:client' to=''>", "<mynode xmlns='myprotocol:core' version='1.0' type='mytype'>", None, None)
    >>>
    >>> re.match(exp, s6).groups()
    ("<message xmlns='jabber:client' to=''>", "<mynode xmlns='myprotocol:core' version='1.0' type='mytype'>", None, None)
    >>>



    Does someone know what is wrong with my expression? Thank you, Gabriel
    Gabriel Rossetti, Dec 16, 2009
    #1

  2. r0g Guest

    Gabriel Rossetti wrote:
    > Hello everyone,
    >
    > I'm going nuts with some regex, could someone please show me what I'm
    > doing wrong?
    >
    > I have an XMPP msg :
    >

    <snip>
    >
    >
    > Does someone know what is wrong with my expression? Thank you, Gabriel





    Gabriel, trying to debug a long regex in situ can be a nightmare, but
    the following technique always works for me...

    Use the interactive interpreter and see if half the regex works. If it
    does, your problem is in the second half; if not, it's in the first, so
    try the first half of that, and so on and so forth. You'll find the point
    at which it goes wrong in a snip.

    Non-trivial regexes are always best built up and tested a bit at a time;
    the interactive interpreter is great for this.
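
    A rough sketch of that build-it-up approach, using the first sample
    string from the original post. The split into `pieces` and the loop are
    illustrative assumptions rather than anything posted in the thread, and
    the pattern is read with a single space after `<mynode`.

```python
import re

# s1 from the original post, written as one string.
s1 = ("<message xmlns='jabber:client' to=''>"
      "<mynode xmlns='myprotocol:core' version='1.0' type='mytype'>"
      "<parameters><param1>123</param1><param2>456</param2></parameters>"
      "<payload type='plain'>...</payload></mynode>"
      "<x xmlns='jabber:x:expire' seconds='15'/></message>")

# The posted expression cut into four pieces (hypothetical split points,
# assuming a space after "<mynode").
pieces = [
    r"(<message .*?>)",
    r".*?(<mynode .*?>)",
    r".*?(?:(<parameters>.*?</parameters>)|<parameters/>)?",
    r".*?(<x .*/>)?",
]

# Test a growing prefix of the expression and record what each one captures.
results = []
for n in range(1, len(pieces) + 1):
    exp = "".join(pieces[:n])
    m = re.match(exp, s1)
    results.append(m.groups() if m else None)
    print(n, results[-1])
```

    Run this way, the first three prefixes capture what was intended and
    only the last piece comes back as None. The lazy `.*?` in front of the
    `<x ...>` group is satisfied by matching nothing, and because the group
    is optional it is then allowed to match nothing as well, so the match
    ends before the `<x>` node is ever reached.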

    Roger.
    r0g, Dec 16, 2009
    #2

  3. On Dec 16, 10:22 am, r0g <> wrote:
    > Gabriel Rossetti wrote:
    > > Hello everyone,

    >
    > > I'm going nuts with some regex, could someone please show me what I'm
    > > doing wrong?

    >
    > > I have an XMPP msg :

    >
    > <snip>
    >
    > > Does someone know what is wrong with my expression? Thank you, Gabriel

    >
    > Gabriel, trying to debug a long regex in situ can be a nightmare however
    > the following technique always works for me...
    >
    > Use the interactive interpreter and see if half the regex works, if it
    > does your problem is in the second half, if not it's in the first so try
    > the first half of that and so on and so forth. You'll find the point at
    > which it goes wrong in a snip.
    >
    > Non-trivial regexes are always best built up and tested a bit at a time,
    > the interactive interpreter is great for this.
    >
    > Roger.


    I'll just add that the "now you have two problems" quip applies here,
    especially when there are very good XML parsing libraries for Python
    that will keep you from having to reinvent the wheel for every little
    change.

    See sections 20.5 through 20.13 of the Python Documentation for
    several built-in options, and I'm sure there are many community
    projects that may fit the bill if none of those happen to.
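
    As one possible illustration of the parser route, here is a minimal
    sketch using the standard library's xml.etree.ElementTree on the
    message from the original post. The helper name `strip_payload` is
    made up for this example, and matching on the local tag name while
    ignoring namespaces is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

# The sample message from the original post, written as one string.
msg = ("<message xmlns='jabber:client' to=''>"
       "<mynode xmlns='myprotocol:core' version='1.0' type='mytype'>"
       "<parameters><param1>123</param1><param2>456</param2></parameters>"
       "<payload type='plain'>...</payload></mynode>"
       "<x xmlns='jabber:x:expire' seconds='15'/></message>")


def local_name(tag):
    # ElementTree stores namespaced tags as "{namespace}name".
    return tag.rsplit("}", 1)[-1]


def strip_payload(xml_text):
    """Parse xml_text and drop every <payload> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) == "payload":
                parent.remove(child)
    return root


root = strip_payload(msg)
remaining = [local_name(el.tag) for el in root.iter()]
print(remaining)
print(ET.tostring(root, encoding="unicode"))
```

    The same function copes with a missing or empty <parameters> node and
    a missing <x> node without any special cases. Note that ElementTree
    serializes unregistered namespaces with generated prefixes, so the
    printed XML is equivalent to the desired output but not identical
    character for character.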

    Personally, I consider regular expressions of any substantial length
    and complexity to be bad practice, as they inhibit readability and
    maintainability. They are also decidedly non-Zen on at least
    "Readability counts" and "Sparse is better than dense".

    Intchanter
    Daniel Fackrell

    P.S. I'm not sure how any of these libraries are implemented yet. I'd
    hope they use a finite state machine tailored to the parsing task
    rather than regexes, but even if they do the latter, having that
    abstracted out in a mature library with a clean interface is still a
    huge win.
    Intchanter / Daniel Fackrell, Dec 16, 2009
    #3