Regular Expression question

ken.carlino · Jun 7, 2006

Hi,
I am new to Python regular expressions, and I would like to use them to get an
attribute of an HTML element from an HTML file.

For example, I was able to read the HTML file using this:
import urllib2
req = urllib2.Request(url=acaURL)
f = urllib2.urlopen(req)

data = f.read()

My question is: how can I get just the src attribute value of an img
tag?
Something like this:
(.*)<img src="href of the image source">(.*)

I need to get the href of the image source.

Thanks.

Fredrik Lundh · Jun 7, 2006

ken.carlino said:
I am new to Python regular expressions, and I would like to use them to get an
attribute of an HTML element from an HTML file.

if you want to parse HTML, use an HTML parser. if you want to parse
sloppy HTML, use a tolerant HTML parser:

http://www.crummy.com/software/BeautifulSoup/

</F>
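
For illustration, here is a minimal sketch of the "use an HTML parser" advice. It uses the standard library's html.parser in Python 3 rather than BeautifulSoup, so that it runs without installing anything; the class name ImgSrcCollector and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and strips the quotes
        # from attribute values, so attribute order and quoting do not matter.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Hypothetical sample page: unquoted, double-quoted and single-quoted values.
sample = (
    "<html><body><img width=42 src=fred.jpg>"
    '<IMG SRC="freda.jpg" alt="x">'
    "<img title='the src=\"silly\" title' src='third.jpg' />"
    "</body></html>"
)

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)  # ['fred.jpg', 'freda.jpg', 'third.jpg']
```

The self-closing tag is picked up as well, because the parser's default handling of `<img ... />` calls handle_starttag.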

Paul McGuire · Jun 8, 2006

Frank Potter · Jun 8, 2006

Paul McGuire · Jun 8, 2006

Frank Potter said:
pyparsing is cool, but using only re is also OK:
# -*- coding: UTF-8 -*-
import re
import urllib2

# Fetch the page (Python 2).
html=urllib2.urlopen(ur"http://www.yahoo.com/").read()

# Capture the value of a double-quoted src that directly follows "<img".
r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')

Ouch - this fails to match any <img> tag that has some other attribute, such
as "height" or "width", before the "src" attribute. www.yahoo.com has
several such tags.

On the other hand, pyparsing's makeHTMLTags defines a starting tag
expression that looks for (conceptually):

< tagname ZeroOrMore(attrname '=' value) Optional('/') >

and does not assume that the first attribute is "src", or anything else for that
matter.

The returned results make the tag attributes accessible as object attributes
or dictionary keys.

-- Paul

Duncan Booth · Jun 8, 2006

Paul said:
import re
r=re.compile('<img\s+src="(?P<image>[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')

Ouch - this fails to match any <img> tag that has some other
attribute, such as "height" or "width", before the "src" attribute.
www.yahoo.com has several such tags.

It also fails to match any image tag where the src attribute is quoted
using single quotes, or where the src attribute is not enclosed in quotes
at all.
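
A quick check of the three failure modes mentioned so far (another attribute before src, single quotes, no quotes), using the quoted pattern as posted. This is only a sketch in Python 3; the sample tags and file names are placeholders invented for the test.

```python
import re

# The pattern quoted above, with a raw-string prefix added for Python 3.
r = re.compile(r'<img\s+src="(?P<image>[^"]+)"[^>]*>', re.IGNORECASE)

# Hypothetical one-tag samples, one per case discussed in the thread.
samples = {
    "src first, double quotes": '<img src="a.jpg" width="10">',
    "other attribute first": '<img width="10" src="b.jpg">',
    "single quotes": "<img src='c.jpg'>",
    "no quotes": "<img src=d.jpg>",
}

results = {
    label: [m.group("image") for m in r.finditer(tag)]
    for label, tag in samples.items()
}
for label, found in results.items():
    print(label, "->", found)
# src first, double quotes -> ['a.jpg']
# other attribute first -> []
# single quotes -> []
# no quotes -> []
```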

Handle all of that correctly in the regex and the beautiful soup or
pyparsing options look even more attractive. In fact, if anyone can write a
regex which matches the source attribute in a single named group, and
correctly handles double, single and unquoted attributes, I'll admit to
being impressed (and probably also slightly queasy when looking at it).

Here's my best attempt at a regex that gets it right, but it still gets
confused by other attributes if they contain spaces.

ATTR = '''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
NOTSRC = '(?!src=)' + ATTR
PAT = '''<img\s(?:'''+NOTSRC + '''\s*)*src=(?:["']?)(?P<image> [...]

htmlPage = '''<html><body><img width=42 src=fred.jpg><img src=\"freda.jpg\"> <img title='the src="silly" title' [...]
[...]
    print m.group('image')

fred.jpg
freda.jpg
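
As an illustration only, here is one way such a pattern could be completed in Python 3. ATTR and NOTSRC are taken from the post; the tail of the pattern (the lookbehind alternatives inside the named group), the name IMG_SRC, the use of \s+ between attributes, and the third file name in the sample are assumptions made for this sketch, not the original code.

```python
import re

# One attribute: a name, optionally followed by =value, where the value is
# double-quoted, single-quoted, or bare (both lines are from the post).
ATTR = r'''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
NOTSRC = r'(?!src=)' + ATTR

# Assumed completion: skip any number of non-src attributes, then capture the
# src value in a single named group. The lookbehinds choose the alternative
# that matches the quoting style consumed by the optional ["'] just before.
IMG_SRC = re.compile(
    r'''<img\s+(?:''' + NOTSRC + r'''\s+)*src=["']?'''
    r'''(?P<image>(?<=")[^">]*|(?<=')[^'>]*|(?<==)[^\s"'>][^\s>]*)''',
    re.IGNORECASE,
)

# Sample page in the spirit of the post; 'third.jpg' is a placeholder.
html_page = (
    "<html><body><img width=42 src=fred.jpg>"
    '<img src="freda.jpg"> '
    "<img title='the src=\"silly\" title' src='third.jpg'>"
    "</body></html>"
)

found = [m.group("image") for m in IMG_SRC.finditer(html_page)]
print(found)  # ['fred.jpg', 'freda.jpg', 'third.jpg']
```

This still leaves out plenty of legal HTML, for example whitespace around the = sign or a > inside a quoted value, which rather supports the point about using a parser.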
