Regular Expression help


RunLevelZero

I have some data and I need to put it in a list in a particular way. I
have that figured out, but there is "stuff" in the data that I don't
want.

Example:

10:00am - 11:00am:</b> <a
href="/tvpdb?d=tvp&id=167540528&cf=0&lineup=us_KS57836d&channels=us_KCTV&chspid=166030466&chname=CBS&progutn=1146150000&.intl=us">The
Price Is Right</a><em>

All I want is "Price Is Right"

Here is the re.

findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

I have used a for loop to remove the extra data, but then it ruins the
list that I am building. Basically I want the list to be something
like this:

[[Government Access], [Price Is Right, Guiding Light, Another show]]

The for loop just comma-delimits all of them, so I lose the list-in-a-list
structure that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.

TIA
 

Edward Elliott

RunLevelZero said:
10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]>The
Price Is Right</a><em>

All I want is "Price Is Right"

Here is the re.

findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

1. A regex remembers everything it matches -- no need to wrap the entire
thing in parens. Just call group() on the returned MatchObject.

2. If all you want is the link text, you don't need to do so much matching.
If you don't need the time, don't match it in the first place. If you're
using it as a marker, try matching each time with r'[\d:]{4,5}[ap]m'. Not
as exact but a bit simpler. Or just r'[\d:apm]{6,7}'

3. To grab what's inside the link: r'<a[^>]*>(.*?)</a>'

4. If the link text itself contains html tags, you'll have to strip those
off separately. Extracting the text from arbitrarily nested html tags in
one shot requires a parser, not a regex.

5. If you're just going to run this regex repeatedly on an html doc and make
a list of the results, it's easier to read the whole doc into a string and
then use re.findall.
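For example, a minimal sketch of points 3 and 5 together (the filename here
is just a placeholder):

import re

link_text = re.compile(r'<a[^>]*>(.*?)</a>')

doc = open('listings.html').read()   # read the whole page into one string
shows = link_text.findall(doc)       # text of every link, in document order
print shows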

I have used a for loop to remove the extra data, but then it ruins the
list that I am building. Basically I want the list to be something
like this:

[[Government Access], [Price Is Right, Guiding Light, Another show]]

The for loop just comma-delimits all of them, so I lose the list-in-a-list
structure that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.

No one can help with that unless you show us how you're building your list.
 

RunLevelZero

Great, I will test this out once I have the time... thanks for the quick
response.
 

johnzenger

If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.
 

RunLevelZero

I considered that, but what I need is simple and I don't want to use
another library for something so simple. Thank you, though. Plus I don't
understand them all that well :)
 

johnzenger

If what you need is "simple," regular expressions are almost never the
answer. And how simple can it be if you are posting here? :)

BeautifulSoup isn't all that hard. Observe:

from BeautifulSoup import BeautifulSoup
html = '10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]">The Price Is Right</a><em>'
soup = BeautifulSoup(html)
for show in soup('a'):
    print show.contents[0]

which prints:

The Price Is Right
 

RunLevelZero

r'<a[^>]*>(.*?)</a>'

With a slight modification that did exactly what I wanted, and yes,
findall was the only way to get everything I needed since I read the
whole thing into one buffer first.

Thanks a bunch.
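A purely hypothetical sketch of what that might have looked like -- the
actual modification isn't shown here, and grouping by the chname= parameter
(visible in the sample URL above) is just a guess:

import re

# guesswork: assume each listing link carries a chname= parameter naming
# the channel, and group show titles by it to get a list of lists
show_re = re.compile(r'<a[^>]*chname=([^&"]+)[^>]*>(.*?)</a>', re.DOTALL)

doc = open('listings.html').read()   # placeholder filename

channels = []   # e.g. [['Government Access'], ['The Price Is Right', ...]]
by_name = {}    # channel name -> its sublist inside channels
for chname, title in show_re.findall(doc):
    if chname not in by_name:
        by_name[chname] = []
        channels.append(by_name[chname])
    by_name[chname].append(title)

print channels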
 

Edward Elliott

If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.

I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak havoc.
Regexes can also help if you only want elements preceded/followed by a
certain sibling or cousin in the parse tree. It all depends on what you're
trying to accomplish. In general though, yes parsers are better suited to
extracting from markup.
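For instance, a quick sketch of the sibling idea against the snippet quoted
earlier -- only take link text when the closing </a> is immediately followed
by an <em> tag (the pattern is just an illustration):

import re

# only links whose </a> is immediately followed by an <em> sibling
followed_by_em = re.compile(r'<a[^>]*>(.*?)</a>\s*<em>', re.DOTALL)

sample = '10:00am - 11:00am:</b> <a href="/tvpdb?snip">The Price Is Right</a><em>'
print followed_by_em.findall(sample)   # ['The Price Is Right']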
 

John Bokma

Edward Elliott said:
I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak
havoc. Regexes can also help if you only want elements
preceded/followed by a certain sibling or cousin in the parse tree.
It all depends on what you're trying to accomplish. In general
though, yes parsers are better suited to extracting from markup.

A parser can be written in such a way that it doesn't give up on malformed
HTML. Probably less hard than coming up with regexes that handle HTML
that's well-formed. (And that's coming from a Perl programmer ;-) )
 

Kent Johnson

Edward said:
I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html.

Beautiful Soup is intended to handle malformed HTML and seems to do
pretty well.
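For instance, a rough check in the same BeautifulSoup 3 style as the example
above (not something from this thread) -- drop the closing tags and the link
text still comes out:

from BeautifulSoup import BeautifulSoup

# malformed on purpose: no </a>, no </b>
broken = '<b>10:00am - 11:00am: <a href="/tvpdb?snip">The Price Is Right'
soup = BeautifulSoup(broken)
for show in soup('a'):
    print show.contents[0]

# prints: The Price Is Right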

Kent
 
