Regular Expression help


RunLevelZero

I have some data and I need to put it in a list in a particular way. I
have that figured out, but there is "stuff" in the data that I don't
want.

Example:

10:00am - 11:00am:</b> <a
href="/tvpdb?d=tvp&id=167540528&cf=0&lineup=us_KS57836d&channels=us_KCTV&chspid=166030466&chname=CBS&progutn=1146150000&.intl=us">The
Price Is Right</a><em>

All I want is "Price Is Right"

Here is the re.

findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

I have used a for loop to remove the extra data, but then it ruins the
list that I am building. Basically I want the list to be something
like this:

[[Government Access], [Price Is Right, Guiding Light, Another show]]

The for loop just comma-delimits all of them, so I lose the list-in-a-list
structure that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.

TIA
 

Edward Elliott

RunLevelZero said:
10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]>The
Price Is Right</a><em>

All I want is "Price Is Right"

Here is the re.

findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

1. A regex remembers everything it matches -- no need to wrap the entire
thing in parens. Just call group() on the returned MatchObject.

2. If all you want is the link text, you don't need to do so much matching.
If you don't need the time, don't match it in the first place. If you're
using it as a marker, try matching each time with r'[\d:]{4,5}[ap]m'. Not
as exact but a bit simpler. Or just r'[\d:apm]{6,7}'

3. To grab what's inside the link: r'<a[^>]*>(.*?)</a>'

4. If the link text itself contains html tags, you'll have to strip those
off separately. Extracting the text from arbitrarily nested html tags in
one shot requires a parser, not a regex.

5. If you're just going to run this regex repeatedly on an html doc and make
a list of the results, it's easier to read the whole doc into a string and
then use re.findall.
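For example, a minimal sketch of points 3 and 5 together (the filename here
is just a placeholder):

import re

link_text = re.compile(r'<a[^>]*>(.*?)</a>')

doc = open('listings.html').read()   # read the whole page into one string
shows = link_text.findall(doc)       # text of every link, in document order
print shows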

I have used a for loop to remove the extra data, but then it ruins the
list that I am building. Basically I want the list to be something
like this:

[[Government Access], [Price Is Right, Guiding Light, Another show]]

The for loop just comma-delimits all of them, so I lose the list-in-a-list
structure that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.

No one can help with that unless you show us how you're building your list.
 

RunLevelZero

Great, I will test this out once I have the time... thanks for the quick
response.
 

johnzenger

If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.
 

RunLevelZero

I considered that, but what I need is simple and I don't want to use
another library for something so simple. Thank you, though. Plus I don't
understand them all that well :)
 

johnzenger

If what you need is "simple," regular expressions are almost never the
answer. And how simple can it be if you are posting here? :)

BeautifulSoup isn't all that hard. Observe:

from BeautifulSoup import BeautifulSoup
html = '10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]">The Price Is Right</a><em>'
soup = BeautifulSoup(html)
for show in soup('a'):
    print show.contents[0]

which prints:

The Price Is Right
 

RunLevelZero

r'<a[^>]*>(.*?)</a>'

With a slight modification that did exactly what I wanted, and yes,
findall was the only way to get everything I needed since I read the
whole thing into one buffer first.

Thanks a bunch.
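A purely hypothetical sketch of what that might have looked like -- the
actual modification isn't shown here, and grouping by the chname= parameter
(visible in the sample URL above) is just a guess:

import re

# guesswork: assume each listing link carries a chname= parameter naming
# the channel, and group show titles by it to get a list of lists
show_re = re.compile(r'<a[^>]*chname=([^&"]+)[^>]*>(.*?)</a>', re.DOTALL)

doc = open('listings.html').read()   # placeholder filename

channels = []   # e.g. [['Government Access'], ['The Price Is Right', ...]]
by_name = {}    # channel name -> its sublist inside channels
for chname, title in show_re.findall(doc):
    if chname not in by_name:
        by_name[chname] = []
        channels.append(by_name[chname])
    by_name[chname].append(title)

print channels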
 

Edward Elliott

If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.

I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak havoc.
Regexes can also help if you only want elements preceded/followed by a
certain sibling or cousin in the parse tree. It all depends on what you're
trying to accomplish. In general though, yes parsers are better suited to
extracting from markup.
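For instance, a quick sketch of the sibling idea against the snippet quoted
earlier -- only take link text when the closing </a> is immediately followed
by an <em> tag (the pattern is just an illustration):

import re

# only links whose </a> is immediately followed by an <em> sibling
followed_by_em = re.compile(r'<a[^>]*>(.*?)</a>\s*<em>', re.DOTALL)

sample = '10:00am - 11:00am:</b> <a href="/tvpdb?snip">The Price Is Right</a><em>'
print followed_by_em.findall(sample)   # ['The Price Is Right']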
 

John Bokma

Edward Elliott said:
I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak
havoc. Regexes can also help if you only want elements
preceded/followed by a certain sibling or cousin in the parse tree.
It all depends on what you're trying to accomplish. In general
though, yes parsers are better suited to extracting from markup.

A parser can be written in such a way that it doesn't give up on malformed
HTML. Probably less hard than coming up with regexes that handle HTML
that's well-formed. (And that's coming from a Perl programmer ;-) )
 

Kent Johnson

Edward said:
I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html.

Beautiful Soup is intended to handle malformed HTML and seems to do
pretty well.
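For instance, a rough check in the same BeautifulSoup 3 style as the example
above (not something from this thread) -- drop the closing tags and the link
text still comes out:

from BeautifulSoup import BeautifulSoup

# malformed on purpose: no </a>, no </b>
broken = '<b>10:00am - 11:00am: <a href="/tvpdb?snip">The Price Is Right'
soup = BeautifulSoup(broken)
for show in soup('a'):
    print show.contents[0]

# prints: The Price Is Right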

Kent
 
