Regular Expression question

stevebread

Hi, I am having some difficulty trying to create a regular expression.

Consider:

<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>

Whenever a tag1 is followed by a tag2, I want to retrieve the values
of the tag1:name and tag2:value attributes. So my end result here
should be
john, tall
jack, short

My low quality regexp
re.compile('tag1.+?name="(.+?)".*?(?!tag1).*?="adj__(.*?)__',
re.DOTALL)

cannot handle the case where there is a tag1 that is not followed by a
tag2. findall returns
john, tall
joe, short

Ideas?

Thanks.
 
Rob Wolfe

Have you tried this:

'tag1.+?name="(.+?)".*?(?=tag2).*?="adj__(.*?)__'

?

HTH,
Rob
 
stevebread

Thanks, I just tried it but I got the same result.

I've been thinking about it for a few hours now, and the problem with
this approach is that the .*? before the (?=tag2) may have matched a
tag1, and I don't know how to detect that.

And even if I could, how would I make the search reset its start
position to the second tag1 it found?
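
One way to keep the gap between the name and the adj__ value from running over a later tag1 is to guard every character of the gap with a negative lookahead. What follows is only a minimal sketch in Python 3, not a robust answer for real HTML. The names PAIR and sample are made up for the illustration, and it assumes the name attribute is double-quoted and the value has the form "adj__...__". When the attempt that starts at one tag1 fails, the search simply moves on to the next tag1, so no manual reset of the start position should be needed.

```python
import re

sample = '''
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
'''

# PAIR is a hypothetical name; the pattern is one possible approach.
PAIR = re.compile(
    r'<tag1\b[^>]*?\bname="([^"]*)"'  # a tag1 and its name attribute
    r'(?:(?!<tag1\b).)*?'             # lazy gap that may not step onto another <tag1
    r'"adj__([^"]*?)__"',             # the first adj__...__ value after that gap
    re.DOTALL,
)

pairs = PAIR.findall(sample)
for name, adj in pairs:
    print("%s, %s" % (name, adj))
```

On the sample data this pairs john with tall and jack with short, and skips joe because the gap after joe would have to cross the next tag1.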
 
bearophileHUGS

I am not an expert on REs yet; this is my first possible solution:

import re

txt = """
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>"""

tfinder = r"""<              # The opening < of the tag to find
    \s*              # Possible space or newline
    (tag[12])        # First subgroup, the identifier, tag1 or tag2
    \s+              # There must be a space or newline or more
    (?:name|value)   # Name or value, non-grouping
    \s*              # Possible space or newline
    =                # The =
    \s*              # Possible space or newline
    "                # Opening "
    ([^"]*)          # Second subgroup, the tag string, it can't contain "
    "                # Closing " of the string
    \s*              # Possible space or newline
    /?               # One optional ending /
    \s*              # Possible space or newline
    >                # The closing > of the tag
    ?                # Applies to the > above, making it optional
    """
patt = re.compile(tfinder, flags=re.I+re.X)

prec_type = ""
prec_string = ""
for mobj in patt.finditer(txt):
    curr_type, curr_string = mobj.groups()
    if curr_type == "tag2" and prec_type == "tag1":
        print prec_string, curr_string.replace("adj__", "").strip("_")
    prec_type = curr_type
    prec_string = curr_string

Bye,
bearophile
 
Rob Wolfe

Maybe like this:
'tag1.+?name="(.+?)".*?(?:<)(?=tag2).*?="adj__(.*?)__'

HTH,
Rob
 
Fredrik Lundh

import re

data = """
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
"""

elems = re.findall("<(tag1|tag2)\s+(\w+)=\"([^\"]*)\"/>", data)

for i in range(len(elems)-1):
    if elems[i][0] == "tag1" and elems[i+1][0] == "tag2":
        print elems[i][2], elems[i+1][2]

</F>
 
Paddy

Steve,
I find this tool is great for debugging regular expressions.
http://kodos.sourceforge.net/

Just put some sample text in one window, your trial RE in another, and
Kodos displays a wealth of information on what matches.

Try it.

- Paddy.
 
Neil Cerutti

It seems to me that an HTML parser might be a better solution.

Here's a slapped-together example. It uses a simple state
machine.

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.state = "get name"
        self.name_attrs = None
        self.result = {}

    def handle_starttag(self, tag, attrs):
        if self.state == "get name":
            if tag == "tag1":
                self.name_attrs = attrs
                self.state = "found name"
        elif self.state == "found name":
            if tag == "tag2":
                name = None
                for attr in self.name_attrs:
                    if attr[0] == "name":
                        name = attr[1]
                adj = None
                for attr in attrs:
                    if attr[0] == "value" and attr[1][:3] == "adj":
                        adj = attr[1][5:-2]
                if name is None or adj is None:
                    print "Markup error: expected attributes missing."
                else:
                    self.result[name] = adj
                self.state = "get name"
            elif tag == "tag1":
                # A new tag1 overrides the old one
                self.name_attrs = attrs

p = MyHTMLParser()
p.feed("""
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
""")
print repr(p.result)
p.close()

There's probably a better way to search for attributes in attr
than "for attr in attrs", but I didn't think of it, and the
example I found on the net used the same idiom. The format of
attrs seems strange. Why isn't it a dictionary?
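
On the attrs question: handle_starttag receives the attributes as a list of (name, value) pairs, and passing that list to dict() gives keyed lookup. Below is a rough sketch of that idea, written for Python 3, where the module is html.parser rather than HTMLParser. The class name PairParser and its fields are invented for this example, and it assumes attribute names do not repeat within a tag (dict() would keep only the last value for a repeated name).

```python
from html.parser import HTMLParser


class PairParser(HTMLParser):
    """Collect (name, adjective) pairs; a hypothetical helper for illustration."""

    def __init__(self):
        super().__init__()
        self.last_name = None  # name of the most recent unpaired tag1
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # list of (name, value) pairs -> dictionary
        if tag == "tag1":
            # A new tag1 overrides any earlier, still unpaired one.
            self.last_name = attr_map.get("name")
        elif tag == "tag2" and self.last_name is not None:
            value = attr_map.get("value") or ""
            if value.startswith("adj__") and value.endswith("__"):
                self.pairs.append((self.last_name, value[5:-2]))
            self.last_name = None


sample = '''
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
'''

parser = PairParser()
parser.feed(sample)
parser.close()
pairs = parser.pairs
print(pairs)
```

Self-closing tags such as <tag1 .../> reach handle_starttag through the default handle_startendtag, so no extra handler is needed for them here.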
 
Paul McGuire

A pyparsing solution may not be a speed demon to run, but doesn't take too
long to write. Some short explanatory comments:
- makeHTMLTags returns a tuple of opening and closing tags, but this example
does not use any closing tags, so simpler to just discard them (only use
zero'th return value)
- Your example includes not only <tag1> and <tag2> tags, but also a <br>
tag, which is presumably ignorable.
- The value returned from calling the searchString generator includes named
fields for the different tag attributes, making it easy to access the name
and value tag attributes.
- The expression generated by makeHTMLTags will also handle tags with other
surprising attributes that we didn't anticipate (such as "<br clear='all'/>"
or "<tag2 value='adj__short__' modifier='adv__very__'/>")
- Pyparsing leaves the values as "adj__tall__" and "adj__short__", but some
simple string slicing gets us the data we want

The pyparsing home page is at http://pyparsing.wikispaces.com.

-- Paul


from pyparsing import makeHTMLTags

tag1 = makeHTMLTags("tag1")[0]
tag2 = makeHTMLTags("tag2")[0]
br = makeHTMLTags("br")[0]

# define the pattern we're looking for, in terms of tag1 and tag2
# and specify that we wish to ignore <br> tags
patt = tag1 + tag2
patt.ignore(br)

# 'data' is the sample string from the original question
for tokens in patt.searchString(data):
    print "%s, %s" % (tokens.startTag1.name, tokens.startTag2.value[5:-2])


Prints:
john, tall
jack, short


Printing tokens.dump() gives:
['tag1', ['name', 'jack'], True, 'tag2', ['value', 'adj__short__'], True]
- empty: True
- name: jack
- startTag1: ['tag1', ['name', 'jack'], True]
  - empty: True
  - name: jack
- startTag2: ['tag2', ['value', 'adj__short__'], True]
  - empty: True
  - value: adj__short__
- value: adj__short__
 
stevebread

Hi, thanks everyone for the information! Still going through it :)

The reason I did not match on tag2 in my original expression (and I
apologize because I should have mentioned this before) is that other
tags could also have an attribute with the value of "adj__" and the
attribute name may not be the same for the other tags. The only thing I
can be sure of is that the value will begin with "adj__".

I need to match the "adj__" value with the closest preceding tag1
irrespective of what tag the "adj__" is in, or what the attribute
holding it is called, or the order of the attributes (there may be
others). This data will be inside an html page and so there will be
plenty of html tags in the middle all of which I need to ignore.

Thanks very much!
Steve
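
The revised requirement (pair each "adj__" value with the closest preceding tag1, whatever tag or attribute holds it) could be approached by scanning for two kinds of tokens and remembering the last tag1 name seen. This is only a sketch in Python 3 under stated assumptions: the names TOKEN, pair_with_closest_tag1, and page are hypothetical, attribute values are assumed to be double-quoted with the = directly before the quote, and each tag1 is assumed to pair with at most one adj__ value.

```python
import re

# Either a tag1 with its name attribute (group 1)
# or any attribute value of the form "adj__...__" (group 2).
TOKEN = re.compile(r'<tag1\b[^>]*?\bname="([^"]*)"|="adj__([^"]*?)__"')


def pair_with_closest_tag1(html):
    """Return (name, adjective) pairs using the nearest preceding tag1."""
    pairs = []
    last_name = None
    for match in TOKEN.finditer(html):
        name, adj = match.groups()
        if name is not None:
            last_name = name  # a later tag1 replaces an unpaired earlier one
        elif last_name is not None:
            pairs.append((last_name, adj))
            last_name = None  # assumption: one adjective per tag1
    return pairs


page = '''
<html><body>
<tag1 id="1" name="john"/> <br/> <span class="adj__tall__">x</span>
<tag1 name="joe"/>
<p>no adjective here</p>
<tag1 name="jack"/>
<tag3 other="x" kind="adj__short__"/>
</body></html>
'''

result = pair_with_closest_tag1(page)
print(result)
```

Keeping the state in a plain variable rather than in the regular expression makes the "closest preceding tag1" rule explicit, and other HTML tags in between are ignored because they never match either alternative.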
 
Anthra Norell

Steve,
I thought Fredrik Lundh's proposal was perfect. Are you now saying it doesn't solve your problem because your description of the
problem was incomplete? If so, could you post a worst-case piece of HTML, one that contains all possible complications, or a
collection of different cases, all of which you need to handle?

Frederic

----- Original Message -----
From: <[email protected]>
Newsgroups: comp.lang.python
To: <[email protected]>
Sent: Monday, August 21, 2006 11:35 PM
Subject: Re: Regular Expression question
 
