Trying to find regex for any script in an html source

28tommy

Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')

I'm testing it on a page that includes the following source:

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

But I get - 'None' as my result.
Here's (in words) what I'm trying to do: '<script ' followed by any
number of characters of any type, then ' src=' followed by any number
of characters of any type, and finally '>'

What am I doing wrong?
Thanks.
 
Mitja Trampus

28tommy said:
Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')

I'm testing it on a page that includes the following source:

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

But I get - 'None' as my result.
Here's (in words) what I'm trying to do: '<script ' followed by any
number of characters of any type, then ' src=' followed by any number
of characters of any type, and finally '>'

What am I doing wrong?

Several things.
First, re.DOTALL is a flag, a _parameter_ to be passed to
the compile function, not something you stick inside the RE
itself:
re.compile('<script .+ src=.+>',re.DOTALL)
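To see why the original pattern returned None, it may help to look at what [re.DOTALL] means when it sits inside the pattern string. This is only a small sketch: the sample string is copied from the question, and the variable names are made up for illustration.

```python
import re

# Sample HTML copied from the question.
sample = '''<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>'''

# Inside a pattern string, [re.DOTALL] is an ordinary character class:
# it matches one of the literal characters r, e, ., D, O, T, A, L.
char_class = re.compile('[re.DOTALL]+')
print(char_class.findall('re.DOTALL is a flag'))

# So the original pattern gets stuck right after '<script ', because the
# next character ('l' in 'language') is not in that class.
broken = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')
print(broken.search(sample))

# Passed as a flag, re.DOTALL lets '.' match newlines as well.
no_flag = re.compile('<script .+>')
with_flag = re.compile('<script .+>', re.DOTALL)
print(no_flag.search(sample))     # no '>' on the first line, so no match
print(with_flag.search(sample))   # matches across the line breaks
```

The last two patterns are deliberately simplified to '<script .+>' so that the only difference between them is the flag.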

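Second, this won't match your example above, because the pattern
demands a literal space right before src, while in your sample src
follows a line break (and on other pages it may come immediately
after script). So you probably want something like
re.compile('<script .*src=.+>',re.DOTALL)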

Third, IIRC * and + are _greedy_ by default, which means they
will "eat up" as many characters as possible. Try it and see
what I mean. The solution is to use the non-greedy variants,
*? and +?
re.compile('<script .*?src=.+?>',re.DOTALL)
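The difference between the greedy and non-greedy versions is easiest to see on a page with more than one script tag. The snippet below is a rough illustration: the html string is put together from the URLs that appear in this thread, and the <p> element in the middle is invented filler.

```python
import re

# Two script tags with unrelated markup in between (the <p> line is
# made up for this example).
html = (
    '<script language="JavaScript1.2"\n'
    'src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"\n'
    'type="text/javascript"></script>\n'
    '<p>some text</p>\n'
    '<script src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"></script>\n'
)

greedy = re.compile('<script .*src=.+>', re.DOTALL)
lazy = re.compile('<script .*?src=.+?>', re.DOTALL)

greedy_matches = greedy.findall(html)
lazy_matches = lazy.findall(html)

# The greedy pattern swallows everything from the first '<script ' up to
# the last '>' in the document, so it reports a single huge match.
print(len(greedy_matches))

# The non-greedy pattern stops at the first '>' after each 'src=' and
# finds each opening tag separately.
print(len(lazy_matches))
for tag in lazy_matches:
    print(repr(tag))
```

Even the non-greedy pattern can run past the end of a tag: if an inline script without a src attribute comes before one that has src, .*? will stretch across both tags to reach the first src= it can find.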

All this and more at
http://docs.python.org/lib/module-re.html
and, I'm sure, several online tutorials. To RTFM is never a
bad idea.
 
Paul McGuire

28tommy said:
Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')

28tommy -

pyparsing includes a built-in HTML tag definition method that handles tag
attributes automatically. You can also tell pyparsing to *not* accept tags
found inside HTML comments, something not so easy using re's (your target
HTML pages may not have comments, so I don't know if this is of much
interest to you). Finally, accessing the results is very easy, especially
for getting at the values of attributes defined in the opening tag. See the
following example.

Note - pyparsing is considered by some to be "way overkill" for simple HTML
scraping, and is probably 20-100X slower than regular expressions. But as
quick text processing and extraction tools go, it's pretty easy to put
together fairly complex match expressions, without the noisy typography of
regular expressions.

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul


from pyparsing import makeHTMLTags, htmlComment

data = """
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

<!--
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/notSureAboutThisScript.js"
type="text/javascript"></script>
-->

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"
type="text/javascript"></script>
"""

# next three lines define grammar for <script> and </script>,
# plus arbitrary HTML attributes on <script>, plus detection and
# ignoring of any matching expression that might be found inside
# an HTML comment
scriptStart,scriptEnd = makeHTMLTags("script")
expr = scriptStart + scriptEnd
expr.ignore(htmlComment)

# use the grammar to scan the data string
# for each match, return matching tokens as a ParseResults object
# - supports list-, dictionary-, and object-style token access
for toks,start,end in expr.scanString(data):
    print toks.startScript
    print toks.startScript[0]
    print toks.startScript.keys()
    print "src =", toks.startScript["src"]
    print "src =", toks.startScript.src
    print


====================
['script', ['language', 'JavaScript1.2'], ['src',
'http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js'], ['type',
'text/javascript'], False]
script
['src', 'type', 'language', 'empty']
src = http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js
src = http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js

['script', ['language', 'JavaScript1.2'], ['src',
'http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js'], ['type',
'text/javascript'], False]
script
['src', 'type', 'language', 'empty']
src = http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js
src = http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js
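
For comparison, here is a rough regex-only approximation of the comment handling shown above. It is a naive sketch, not a robust solution: it assumes comments are well-formed and never nested, that src values are double-quoted, and both pattern names are invented for this example.

```python
import re

# Same sample data as in the pyparsing example.
data = """
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

<!--
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/notSureAboutThisScript.js"
type="text/javascript"></script>
-->

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"
type="text/javascript"></script>
"""

# Hypothetical helper patterns: one for HTML comments, one that captures
# the double-quoted src value of an opening <script> tag.
comment_re = re.compile(r'<!--.*?-->', re.DOTALL)
script_src_re = re.compile(r'<script\s[^>]*?src\s*=\s*"([^"]*)"', re.IGNORECASE)

# Searching the raw text also picks up the commented-out script.
naive = script_src_re.findall(data)

# Stripping comments first leaves only the two live script tags.
filtered = script_src_re.findall(comment_re.sub('', data))

print(len(naive))
for src in filtered:
    print("src =", src)
```

This needs two passes and still breaks on unquoted or single-quoted attribute values, which is the kind of bookkeeping a real parser takes care of.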
 
Mike Meyer

28tommy said:
Hi,
I'm trying to find scripts in html source of a page retrieved from the
web.
I'm trying to use the following rule:

match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')

I'm testing it on a page that includes the following source:

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

But I get - 'None' as my result.
Here's (in words) what I'm trying to do: '<script ' followed by any
number of characters of any type, then ' src=' followed by any number
of characters of any type, and finally '>'

What am I doing wrong?

Trying to use an RE to parse HTML. While possible, it's not nearly as
easy as it looks, and there are lots of gotchas.

Paul has already pointed out that pyparsing comes with an HTML tag
parser. If your HTML is well-formed, you can use HTMLParser in the
standard library. If your HTML comes from the web at large (meaning much
of it was written by the people who handed in code that didn't compile
for their programming assignments), you'll want to try something like
BeautifulSoup.
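
As an illustration of the standard-library route, here is a minimal sketch using HTMLParser, which lives in the html.parser module in Python 3. The class name ScriptSrcCollector and its sources attribute are invented for this example, and the data is the sample posted earlier in the thread.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collects the src attribute of every <script> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag name in lowercase and the attributes
        # as a list of (name, value) pairs.
        if tag == "script":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


data = """
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
type="text/javascript"></script>

<!--
<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/notSureAboutThisScript.js"
type="text/javascript"></script>
-->

<script language="JavaScript1.2"
src="http://i.cnn.net/cnn/.element/ssi/js/1.3/anotherScript.js"
type="text/javascript"></script>
"""

collector = ScriptSrcCollector()
collector.feed(data)
collector.close()

for src in collector.sources:
    print("src =", src)
```

Because the parser hands comments to a separate handle_comment callback, the commented-out script never reaches handle_starttag and so is not collected.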

<mike
 
