Regular Expression question

ken.carlino

Hi,
I am new to Python regular expressions. I would like to use them to get an
attribute of an HTML element from an HTML file.

For example, I was able to read the HTML file using this:
req = urllib2.Request(url=acaURL)
f = urllib2.urlopen(req)

data = f.read()

My question is: how can I get just the src attribute value of an img
tag?
Something like this:
(.*)<img src="href of the image source">(.*)

I need to get the href of the image source.

Thanks.
 
Paul McGuire

As Fredrik pointed out, re's are not the only tool out there. Here's a
pyparsing solution.

-- Paul


import pyparsing
import urllib

# define HTML tag format using makeHTMLTags helper
# (we don't really care about the ending </img> tag,
# even though makeHTMLTags returns definitions for both
# starting and ending tag patterns)
imgStartTag, dummy = pyparsing.makeHTMLTags("img")

# get HTML source from some web site
htmlPage = urllib.urlopen("http://www.yahoo.com")
htmlSource = htmlPage.read()
htmlPage.close()

# scan HTML source, printing SRC attribute from each <img> tag
for tokens,start,end in imgStartTag.scanString(htmlSource):
    print tokens.src


Prints:

http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/hea_0411.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/img_0607.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/news/video.gif
http://us.i1.yimg.com/us.yimg.com/i/buzz/2006/06/wholefoodssmall.jpg
http://us.i1.yimg.com/us.yimg.com/i/mntl/msg/06q2/img_im.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif
 
Paul McGuire

Frank Potter said:
pyparsing is cool.
but use only re is also OK
# -*- coding: UTF-8 -*-
import urllib2
html=urllib2.urlopen(ur"http://www.yahoo.com/").read()

import re
r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')

Ouch - this fails to match any <img> tag that has some other attribute, such
as "height" or "width", before the "src" attribute. www.yahoo.com has
several such tags.
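
To make the failure concrete, here is a small sketch in Python 3 syntax (the code in this thread is Python 2). The pattern is Frank's, only written as a raw string; the sample markup is made up for illustration and is not taken from www.yahoo.com.

```python
import re

# Frank's pattern, unchanged apart from the raw-string prefix.
r = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)

# Hypothetical sample markup: the second tag has attributes before src.
html = ('<img src="first.gif" alt="a">'
        '<img width="10" height="20" src="second.gif">')

found = [m.group('image') for m in r.finditer(html)]
print(found)  # second.gif is missing: "width" comes before "src"
```

The pattern requires "src" to follow "<img" directly, so the second tag is skipped without any error.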

On the other hand, pyparsing's makeHTMLTags defines a starting tag
expression that looks for (conceptually):

< tagname ZeroOrMore(attrname '=' value) Optional('/') >

and does not assume that the first attribute is "src", or anything else for
that matter.

The returned results make the tag attributes accessible as object attributes
or dictionary keys.
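
pyparsing is a third-party package, so for comparison here is a rough sketch of the same order-independent idea using only the standard library's html.parser, written for Python 3. This is a swapped-in tool, not pyparsing, and the class name ImgSrcCollector and the sample markup are invented for the example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs in whatever order they appear in the tag.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


# Hypothetical sample markup with src in different positions and quoting styles.
html = ("<img width=42 src=fred.jpg>"
        '<img src="freda.jpg">'
        "<IMG height='1' SRC='alfred.jpg' />")

parser = ImgSrcCollector()
parser.feed(html)
parser.close()
print(parser.sources)
```

Because the parser hands over all attributes of a tag at once, the position of src within the tag does not matter.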

-- Paul
 
Duncan Booth

Paul said:
import re
r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')

Ouch - this fails to match any <img> tag that has some other
attribute, such as "height" or "width", before the "src" attribute.
www.yahoo.com has several such tags.

It also fails to match any image tag where the src attribute is quoted
using single quotes, or where the src attribute is not enclosed in quotes
at all.

Handle all of that correctly in the regex and the beautiful soup or
pyparsing options look even more attractive. In fact, if anyone can write a
regex which matches the source attribute in a single named group, and
correctly handles double, single and unquoted attributes, I'll admit to
being impressed (and probably also slightly queasy when looking at it).

Here's my best attempt at a regex that gets it right, but it still gets
confused by other attributes if they contain spaces.
ATTR = '''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
NOTSRC = '(?!src=)' + ATTR
PAT = '''<img\s(?:'''+NOTSRC +
'''\s*)*src=(?:["']?)(?P<image>[...]
htmlPage = '''<html><body><img width=42 src=fred.jpg><img
src=\"freda.jpg\"> <img title='the src="silly" title'
[...]
print m.group('image')


fred.jpg
freda.jpg
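
The pattern above is cut off in this copy of the thread, so what follows is not Duncan's regex. It is a separate minimal sketch of the same idea (skip whole attributes that are not src, then capture the value in one named group using lookbehinds), in Python 3 syntax. The name IMG_SRC and the sample markup are assumptions made for the example, and it is still no substitute for a real HTML parser.

```python
import re

# One attribute: a name, optionally followed by "=" and a double-quoted,
# single-quoted, or unquoted value.
ATTR = r'''[^\s=>]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?'''

IMG_SRC = re.compile(
    r'<img\b'
    r'(?:\s+(?!src\s*=)' + ATTR + r')*'   # skip attributes other than src
    r'''\s+src\s*=\s*["']?'''             # src=, with an optional opening quote
    r'''(?P<image>(?<=")[^"]*(?=")'''     # value enclosed in double quotes
    r'''|(?<=')[^']*(?=')'''              # value enclosed in single quotes
    r'''|(?<!["'])[^\s"'>]+)''',          # unquoted value
    re.IGNORECASE,
)

# Hypothetical sample markup covering the three quoting styles.
html = """<html><body><img width=42 src=fred.jpg><img
src="freda.jpg"> <img title='the src="silly" title'
src='alfred.jpg' alt="x"></body></html>"""

found = [m.group('image') for m in IMG_SRC.finditer(html)]
print(found)
```

The lookbehinds are what keep everything in a single group: the opening quote is consumed outside the group, and each alternative checks which quote (if any) came just before it.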
 
