Regular Expression question

stevebread · Aug 21, 2006

Hi, I am having some difficulty trying to create a regular expression.

Consider:

<tag1 name="john"/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>

Whenever a tag1 is followed by a tag2, I want to retrieve the values
of the tag1:name and tag2:value attributes. So my end result here
should be
john, tall
jack, short

My low quality regexp
re.compile('tag1.+?name="(.+?)".*?(?!tag1).*?="adj__(.*?)__',
re.DOTALL)

cannot handle the case where there is a tag1 that is not followed by a
tag2. findall returns
john, tall
joe, short

Ideas?

Thanks.

Rob Wolfe · Aug 21, 2006

Have you tried this:

'tag1.+?name="(.+?)".*?(?=tag2).*?="adj__(.*?)__'

?

HTH,
Rob

stevebread · Aug 21, 2006

Thanks, I just tried it but I got the same result.

I've been thinking about it for a few hours now and the problem with
this approach is that the .*? before the (?=tag2) may have matched a
tag1 and I don't know how to detect it.

And even if I could, how would I make the search reset its start
position to the second tag1 it found?
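
One way to address both questions with a single pattern is to check the lookahead at every character the gap consumes, rather than once. The sketch below is not from any poster in the thread; it assumes Python 3 and that the adj__ value sits in a tag2, as in the sample. The "reset" happens automatically: when the attempt starting at joe fails, the engine retries from the next tag1.

```python
import re

# Sample data from the original post.
data = """
<tag1 name="john"/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
"""

pattern = re.compile(
    r'<tag1\s[^>]*?name="([^"]*)"'    # a tag1 and its name attribute
    r'(?:(?!<tag1\b).)*?'             # gap: any text, but never step onto another <tag1
    r'<tag2\s[^>]*?="adj__(.*?)__"',  # the tag2 that follows and its adj__ value
    re.DOTALL,
)

pairs = pattern.findall(data)
print(pairs)
```

Because the gap can never contain a second tag1, the attempt that starts at joe has nowhere to go and fails, so joe is skipped instead of being paired with jack's tag2.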

bearophileHUGS · Aug 21, 2006

I am not an expert on REs yet; this is my first possible solution:

import re

txt = """
<tag1 name="john"/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>"""

tfinder = r"""<          # The opening < of the tag to find
    \s*                  # Possible space or newline
    (tag[12])            # First subgroup, the identifier, tag1 or tag2
    \s+                  # There must be a space or newline or more
    (?:name|value)       # Name or value, non-grouping
    \s*                  # Possible space or newline
    =                    # The =
    \s*                  # Possible space or newline
    "                    # Opening "
    ([^"]*)              # Second subgroup, the tag string, it can't contain "
    "                    # Closing " of the string
    \s*                  # Possible space or newline
    /?                   # One optional ending /
    \s*                  # Possible space or newline
    >                    # The closing > of the tag
    ?                    # Optional (greedy): match the closing > if present
"""
patt = re.compile(tfinder, flags=re.I | re.X)

# Remember the previous tag and print a pair whenever tag2 follows tag1
prec_type = ""
prec_string = ""
for mobj in patt.finditer(txt):
    curr_type, curr_string = mobj.groups()
    if curr_type == "tag2" and prec_type == "tag1":
        print("%s %s" % (prec_string, curr_string.replace("adj__", "").strip("_")))
    prec_type = curr_type
    prec_string = curr_string

Bye,
bearophile

Rob Wolfe · Aug 21, 2006

Maybe like this:
'tag1.+?name="(.+?)".*?(?:<)(?=tag2).*?="adj__(.*?)__'

HTH,
Rob

stevebread · Aug 21, 2006

got zero results on this one

Fredrik Lundh · Aug 21, 2006

import re

data = """
<tag1 name="john"/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
"""

# each element is a (tag, attribute name, attribute value) tuple
elems = re.findall(r'<(tag1|tag2)\s+(\w+)="([^"]*)"/>', data)

# compare each element with the one that follows it
for i in range(len(elems) - 1):
    if elems[i][0] == "tag1" and elems[i+1][0] == "tag2":
        print("%s %s" % (elems[i][2], elems[i+1][2]))

</F>

Paddy · Aug 21, 2006

Steve,
I find this tool is great for debugging regular expressions.
http://kodos.sourceforge.net/

Just put some sample text in one window, your trial RE in another, and
Kodos displays a wealth of information on what matches.

Try it.

- Paddy.
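
Much of the same information can also be read straight off the match objects. The sketch below (an editorial addition, assuming Python 3) runs the original pattern from the first post and prints the full text of every match, which shows what the lazy .*? actually swallowed.

```python
import re

# Sample data from the original post.
data = """
<tag1 name="john"/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
"""

# The pattern from the first post, unchanged.
original = re.compile('tag1.+?name="(.+?)".*?(?!tag1).*?="adj__(.*?)__', re.DOTALL)

matches = list(original.finditer(data))
for m in matches:
    # repr() makes the newlines inside the matched text visible
    print(m.span(), m.groups(), repr(m.group(0)))
```

The second match starts at joe's tag1 and runs across jack's tag1 up to the adj__short__ value. The (?!tag1) lookahead is tested at a single position only, and it succeeds wherever the next four characters are not "tag1", so it does not stop the second .*? from crossing another tag1.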

Neil Cerutti · Aug 21, 2006

It seems to me that an HTML parser might be a better solution.

Here's a slapped-together example. It uses a simple state
machine.

try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.state = "get name"
        self.name_attrs = None
        self.result = {}

    def handle_starttag(self, tag, attrs):
        if self.state == "get name":
            if tag == "tag1":
                self.name_attrs = attrs
                self.state = "found name"
        elif self.state == "found name":
            if tag == "tag2":
                name = None
                for attr in self.name_attrs:
                    if attr[0] == "name":
                        name = attr[1]
                adj = None
                for attr in attrs:
                    if attr[0] == "value" and attr[1][:3] == "adj":
                        # strip the leading "adj__" and the trailing "__"
                        adj = attr[1][5:-2]
                if name is None or adj is None:
                    print("Markup error: expected attributes missing.")
                else:
                    self.result[name] = adj
                self.state = "get name"
            elif tag == "tag1":
                # A new tag1 overrides the old one
                self.name_attrs = attrs

p = MyHTMLParser()
p.feed("""
<tag1 name="john"/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
""")
print(repr(p.result))
p.close()

There's probably a better way to search for attributes in attr
than "for attr in attrs", but I didn't think of it, and the
example I found on the net used the same idiom. The format of
attrs seems strange. Why isn't it a dictionary?
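
On the attrs question: attrs arrives as a list of (name, value) pairs, which keeps attribute order and repeated attributes, and dict(attrs) turns it into a dictionary when neither matters. The sketch below is an editorial variation on the parser above, assuming Python 3; PairParser and pending_name are illustrative names, not part of HTMLParser.

```python
from html.parser import HTMLParser

class PairParser(HTMLParser):
    """Collects (name, adjective) pairs for each tag1 directly followed by a tag2."""

    def __init__(self):
        super().__init__()
        self.pending_name = None   # name of the most recent unpaired tag1
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)        # list of (name, value) pairs -> dictionary
        if tag == "tag1":
            # a new tag1 replaces any tag1 that was still waiting for a tag2
            self.pending_name = attrs.get("name")
        elif tag == "tag2" and self.pending_name is not None:
            value = attrs.get("value") or ""
            if value.startswith("adj__"):
                # drop the leading "adj__" and the trailing "__"
                self.pairs.append((self.pending_name, value[5:-2]))
                self.pending_name = None

data = """
<tag1 name="john"/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
"""

parser = PairParser()
parser.feed(data)
parser.close()
print(parser.pairs)
```

Using a list of pairs for the result, rather than a dictionary keyed by name, also keeps two tag1 elements with the same name from overwriting each other.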

Paul McGuire · Aug 21, 2006

A pyparsing solution may not be a speed demon to run, but doesn't take too
long to write. Some short explanatory comments:
- makeHTMLTags returns a tuple of opening and closing tags, but this example
does not use any closing tags, so simpler to just discard them (only use
zero'th return value)
- Your example includes not only <tag1> and <tag2> tags, but also a <br>
tag, which is presumably ignorable.
- The value returned from calling the searchString generator includes named
fields for the different tag attributes, making it easy to access the name
and value tag attributes.
- The expression generated by makeHTMLTags will also handle tags with other
surprising attributes that we didn't anticipate (such as
"<tag2 value='adj__short__' modifier='adv__very__'/>")
- Pyparsing leaves the values as "adj__tall__" and "adj__short__", but some
simple string slicing gets us the data we want

The pyparsing home page is at http://pyparsing.wikispaces.com.

-- Paul

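from pyparsing import makeHTMLTags

# sample markup from the original post
data = """
<tag1 name="john"/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
"""

# makeHTMLTags returns (opening tag, closing tag); keep only the opening tag
tag1 = makeHTMLTags("tag1")[0]
tag2 = makeHTMLTags("tag2")[0]
br = makeHTMLTags("br")[0]

# define the pattern we're looking for, in terms of tag1 and tag2
# and specify that we wish to ignore <br> tags
patt = tag1 + tag2
patt.ignore(br)

for tokens in patt.searchString(data):
    print("%s, %s" % (tokens.startTag1.name, tokens.startTag2.value[5:-2]))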

Prints:
john, tall
jack, short

Printing tokens.dump() gives:
['tag1', ['name', 'jack'], True, 'tag2', ['value', 'adj__short__'], True]
- empty: True
- name: jack
- startTag1: ['tag1', ['name', 'jack'], True]
  - empty: True
  - name: jack
- startTag2: ['tag2', ['value', 'adj__short__'], True]
  - empty: True
  - value: adj__short__
- value: adj__short__

Rob Wolfe · Aug 21, 2006

got zero results on this one
Really?

<tag1 name="joe"/>
<tag1 name="jack"/>
[('john', 'tall'), ('joe', 'short')]

Regards,
Rob

stevebread · Aug 21, 2006

Hi, thanks everyone for the information! Still going through it.

The reason I did not match on tag2 in my original expression (and I
apologize because I should have mentioned this before) is that other
tags could also have an attribute with the value of "adj__" and the
attribute name may not be the same for the other tags. The only thing I
can be sure of is that the value will begin with "adj__".

I need to match the "adj__" value with the closest preceding tag1
irrespective of what tag the "adj__" is in, or what the attribute
holding it is called, or the order of the attributes (there may be
others). This data will be inside an html page and so there will be
plenty of html tags in the middle all of which I need to ignore.

Thanks very much!
Steve
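
For the broader requirement described above, one possible approach is a single left-to-right scan that remembers the most recent tag1 and pairs it with the next "adj__" value, whatever tag or attribute carries it. This is an editorial sketch, not a tested solution for real pages: it assumes Python 3, double-quoted attribute values, and that each tag1 should be paired at most once. The sample markup, pair_names, and the tag and attribute names span, tag3, class and kind are made up for illustration.

```python
import re

# Either a tag1 with its name attribute, or any attribute value starting with adj__.
token = re.compile(r'''
    <tag1\b[^>]*?\bname\s*=\s*"(?P<name>[^"]*)"   # tag1: capture its name
  | =\s*"adj__(?P<adj>[^"]*?)__"                  # any attribute: capture the adjective
''', re.VERBOSE)

def pair_names(html):
    """Pair each adj__ value with the closest preceding, still unpaired tag1."""
    pairs = []
    last_name = None
    for m in token.finditer(html):
        if m.group("name") is not None:
            last_name = m.group("name")        # a newer tag1 replaces the old one
        elif last_name is not None:
            pairs.append((last_name, m.group("adj")))
            last_name = None                   # assumption: one adjective per tag1
    return pairs

# Hypothetical sample with unrelated tags and differing attribute names.
html = """
<tag1 name="john"/> <p>some text</p> <span class="adj__tall__">x</span>
<tag1 name="joe"/>
<div><tag1 id="7" name="jack"/></div>
<tag3 other="z" kind="adj__short__"/>
"""

print(pair_names(html))
```

Everything that matches neither alternative is simply skipped by finditer, which is how the surrounding HTML gets ignored. Single-quoted or unquoted attribute values would need a more forgiving pattern or a real parser.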

Anthra Norell · Aug 22, 2006

Steve,
I thought Fredrik Lundh's proposal was perfect. Are you now saying it doesn't
solve your problem because your description of the problem was incomplete? If
so, could you post a worst-case piece of HTML, one that contains all possible
complications, or a collection of different cases, all of which you need to
handle?

Frederic
