problem with regex, how to exclude more than one character

tecspring · Nov 7, 2008

I never know how to express "exclude an entire word" with a regexp,
and while using Python I ran into this problem again...

For example, if I want to match "string" in "test a string",
re.findall(r"[^a]* (\w+)","test a string") will work. But what if
the text contains "an" instead of "a" ("test an string")? [^an] will
fail, because it stops at the first character "a".
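
A note on why this happens: a character class such as [^an] excludes single characters ("a" or "n"), never the word "an". The sketch below is only an illustration, assuming the goal is to capture the word that follows the article; the variable names are made up for the example.

```python
import re

text = "test an string"

# [^an] rejects any single "a" or "n"; it does not know about the word "an".
with_class = re.findall(r"[^an]* (\w+)", text)
print(with_class)

# Naming the word outright, with a \b word boundary, is usually simpler.
after_article = re.findall(r"\b(?:a|an) (\w+)", text)
print(after_article)  # ['string']

# A negative lookahead consumes one character at a time, but only while
# the whole word "an" does not start at the current position.
prefix = re.match(r"(?:(?!\ban\b).)*", text).group()
print(repr(prefix))  # 'test '
```

The (?:(?!word).)* form is the general way to say "anything up to, but not including, this word"; when the word is known in advance, matching it explicitly tends to be easier to read.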

I guess people don't usually filter words this way?
Here is the real problem I ran into:
I want to extract both the text in each "<td>" block and the
"<span>" tag's title attribute
###################### code #############################
import re
content='''<tr align="center" valign="middle" class="CellCss"><td
valign="middle">LA</td><td valign="middle">11/10/2008</td><td
valign="middle">1340/1430</td><td valign="middle">PF1/5</td><td
valign="middle">Understand....</td><td title="Charisma"
valign="middle">Charisma</td><td valign="middle">Booked</td><td
valign="middle">'''

re.findall(r'''<td valign="middle">([^<]+)</td><td
valign="middle">([^<]+)</td><td valign="middle">([^<]+)</td><td
valign="middle">([^<]+)</td><td valign="middle"><span
title="([^"]*)"''',content)

#################### code end ############################
As you can see above,
I get the result "LA,11/10/2008,1340/1430,PF1/5,Understanding
the stock market".
There are two "<span>" blocks, but with the regexp I can only get the
"title" attribute of the first one.
For the second, which should be "Charisma", I need something like
[^</td>]* to match "class="MouseCursor">Understand....</td>",
so that I can go on to match the second "<span>" block.
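
For what it's worth, [^</td>]* is a character class too: it rejects each of the characters "<", "/", "t", "d" and ">" on its own, so it stops at the first "t" or "d" in the text. One possible way to skip ahead to the closing tag is a non-greedy .*? as sketched below. The row string is a hypothetical reconstruction (the <span> markup is an assumption, since the tags inside the posted sample did not survive).

```python
import re

# Hypothetical row: the <span> markup here is assumed, not taken from the post.
row = ('<td valign="middle"><span title="Understanding the stock market" '
       'class="MouseCursor">Understand....</span></td>'
       '<td valign="middle"><span title="Charisma" '
       'class="MouseCursor">Charisma</span></td>')

pattern = re.compile(
    r'<span title="([^"]*)".*?</td>'   # first title, then skip lazily to </td>
    r'<td[^>]*><span title="([^"]*)"'  # title inside the following cell
)

titles = pattern.search(row).groups()
print(titles)  # ('Understanding the stock market', 'Charisma')
```

The lazy .*? expands one character at a time until </td> can match, which is the "everything up to this string" behaviour the character class cannot give. If the cell could contain line breaks, re.DOTALL would also be needed.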

Maybe I didn't describe this clearly; if so, feel free to tell me.

Thanks for any further reply!

tecspring · Nov 7, 2008

By the way, I've tried both (!</td>) and (?:!</td>), among many other
ways, but none of them work... so sad...
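
A possible explanation, as a small sketch: (!</td>) and (?:!</td>) are ordinary groups that look for a literal "!" followed by "</td>", while Python's negative lookahead is spelled (?!...). The sample text is made up for the illustration.

```python
import re

# Made-up sample text for the illustration.
text = "Understand....</td><td>Charisma</td>"

# Both of these search for the literal characters "!</td>", which the
# text does not contain, so neither finds anything.
literal_capturing = re.search(r"(!</td>)", text)
literal_noncapturing = re.search(r"(?:!</td>)", text)
print(literal_capturing, literal_noncapturing)  # None None

# (?!</td>) is the negative lookahead: take one character at a time as
# long as "</td>" does not start at the current position.
cell = re.match(r"(?:(?!</td>).)*", text).group()
print(cell)  # Understand....
```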

Chris Rebert · Nov 7, 2008

Is there any particularly good reason why you're using regexps for
this rather than, say, an actual (X)HTML parser?

Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
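
To illustrate the parser route suggested above, here is a minimal sketch using html.parser from the standard library. The class name CellCollector and the sample row are assumptions made for the example (the <span> markup in particular is guessed), and this is only one way to do it, not necessarily what anyone in the thread had in mind.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each <td> cell and every title attribute seen."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []   # visible text of each <td>
        self.titles = []  # value of every title="..." attribute

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")
        title = dict(attrs).get("title")
        if title is not None:
            self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data


# Hypothetical sample row; the <span> markup is assumed.
html = ('<tr><td valign="middle">LA</td>'
        '<td valign="middle"><span title="Charisma">Charisma</span></td>'
        '<td valign="middle">Booked</td></tr>')

collector = CellCollector()
collector.feed(html)
collector.close()
print(collector.cells)   # ['LA', 'Charisma', 'Booked']
print(collector.titles)  # ['Charisma']
```

Unlike a regexp, this does not depend on the exact attribute order or spacing inside each tag.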

tecspring · Nov 7, 2008

Thanks a lot for the quick reply, Chris!
Actually, I tried BeautifulSoup and it's great.
But I'm not very familiar with it, and it needs more code to parse the
HTML and get the right text.
I think a regexp is more convenient if there is a way to extract the
list in just one line.

I got this far with the regexp, but I'm stuck here...
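
If a single regexp is still preferred, one possible sketch is an optional group for the <span title="..."> part, so that each cell yields either its title or its text. The row below is hypothetical (the <span> markup is assumed), and the pattern only holds up while every cell follows exactly this shape.

```python
import re

# Hypothetical row; the <span title=...> markup is an assumption.
row = ('<td valign="middle">LA</td>'
       '<td valign="middle">11/10/2008</td>'
       '<td valign="middle"><span title="Understanding the stock market" '
       'class="MouseCursor">Understand....</span></td>'
       '<td valign="middle"><span title="Charisma">Charisma</span></td>'
       '<td valign="middle">Booked</td>')

# One tuple per cell: (title, or "" when there is no span; visible text).
cells = re.findall(r'<td[^>]*>(?:<span title="([^"]*)"[^>]*>)?([^<]*)', row)

# Prefer the full title when present, otherwise fall back to the cell text.
values = [title or text for title, text in cells]
print(values)
```

The findall call is the "one line"; the list comprehension just picks the title where a cell has one.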
