regexp

vertigo

Hello

I need to use some regular expressions that match across more than one line,
and I would like to use modifiers like /m or /s in Perl.
For example:
re.sub("<script.*>.*</script>","",data)

will not cut out all the JavaScript code if it is spread over many lines.
I could use something like /s from Perl, which makes . match any character
(including newline). How can I do that?

Is there maybe another way to achieve the same result?

Thanks
 
Fredrik Lundh

vertigo said:
I need to use some regular expressions that match across more than one line,
and I would like to use modifiers like /m or /s in Perl.
For example:
re.sub("<script.*>.*</script>","",data)

will not cut out all the JavaScript code if it is spread over many lines.

that won't cut out all javascript code period.

</F>
 
Jonathan Curran

Take a look at Chapter 8 of 'Dive Into Python.'
http://diveintopython.org/toc/index.html

You can modify the code there and get the results that you need. Buy the book
if you can :) It has lots of neat examples.

- Jonathan Curran
 
vertigo

Hello

Jonathan> Take a look at Chapter 8 of 'Dive Into Python.'
Jonathan> http://diveintopython.org/toc/index.html

I read the whole regexp chapter, but there was no solution for my problem.
Example:

re.sub("<!--.*-->","",htmldata)
would remove only comments that fit on one line.
If a comment spans many lines, like this:
<!--start
of
comment, end-->

it would not work, because '.' does not match '\n'.

Does anybody know a solution for this particular problem?

Thanks
 
johnzenger

You want re.sub("(?s)<!--.*?-->", "", htmldata)

Explanation: To make the dot match all characters, including newlines,
you need to set the DOTALL flag. You can set the flag inline with the (?s)
syntax, which is explained in section 4.2.1 of the Python Library
Reference.

A more readable way to do this is:

obj = re.compile("<!--.*?-->", re.DOTALL)
re.sub("", htmldata)
 
johnzenger

Oops, I mean obj.sub("", htmldata)
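
For illustration, here is a minimal sketch of both forms side by side. The sample htmldata string and the variable names are made up for this example; only re.sub, re.compile and re.DOTALL come from the posts above.

```python
import re

# Made-up sample input: one multi-line comment and one single-line comment.
htmldata = """<p>keep</p>
<!--start
of
comment, end-->
<p>also keep</p><!-- one-liner -->"""

# Inline flag form: (?s) switches on DOTALL for the whole pattern.
inline = re.sub(r"(?s)<!--.*?-->", "", htmldata)

# Compiled form: the flag goes to re.compile, and sub() is called
# on the compiled pattern object, not on the re module.
comment_re = re.compile(r"<!--.*?-->", re.DOTALL)
compiled = comment_re.sub("", htmldata)

# Without DOTALL, '.' stops at newlines, so only the one-line comment goes.
no_flag = re.sub(r"<!--.*?-->", "", htmldata)
```

Both flagged forms should give the same result; the version without the flag leaves the multi-line comment in place.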

 
Paul Arthur

vertigo> I read the whole regexp chapter -

Did you read Chapter 8? Regexes are Chapter 7; Chapter 8 is about processing HTML.
Regexes are not well suited to this type of processing.

vertigo> but there was no solution for my problem.
vertigo> Example:
vertigo>
vertigo> re.sub("<!--.*-->","",htmldata)
vertigo> would remove only comments that fit on one line.
vertigo> If a comment spans many lines, like this:
vertigo> <!--start
vertigo> of
vertigo> comment, end-->
vertigo>
vertigo> it would not work, because '.' does not match '\n'.
vertigo>
vertigo> Does anybody know a solution for this particular problem?

Yes. Use DOTALL mode.
 
vertigo

Hello

Thanks for the help. I have one more question:

I noticed that when matching a regexp, Python tries to match as much as
possible.
For example:
re.sub("<!--.*-->","",htmldata)
would cut out everything between the first "<!--" and the last "-->" in the
document.
Can I force re to match as little as possible?
(That is, match the first "<!--" with the first "-->" after it, and repeat
this while the pattern is still found?)

Thanks
 
skip

vertigo> I noticed that when matching a regexp, Python tries to match as much as
vertigo> possible.
vertigo> For example:
vertigo> re.sub("<!--.*-->","",htmldata)
vertigo> would cut out everything between the first "<!--" and the last "-->" in the
vertigo> document.
vertigo> Can I force re to match as little as possible?

http://docs.python.org/lib/re-syntax.html

Search for "greedy".

Skip
 
Paul Arthur

Please quote some context when responding to posts.

vertigo> Thanks for the help. I have one more question:
vertigo>
vertigo> I noticed that when matching a regexp, Python tries to match as much as
vertigo> possible.
vertigo> For example:
vertigo> re.sub("<!--.*-->","",htmldata)
vertigo> would cut out everything between the first "<!--" and the last "-->" in the
vertigo> document.
vertigo> Can I force re to match as little as possible?
vertigo> (That is, match the first "<!--" with the first "-->" after it, and repeat
vertigo> this while the pattern is still found?)

RTFM. Please at least attempt to learn how to do something before
wasting other people's time by having them quote the reference docs to
you.

The "*", "+", and "?" qualifiers are all greedy; they match as much text
as possible. Sometimes this behaviour isn't desired; if the RE <.*> is
matched against '<H1>title</H1>', it will match the entire string, and
not just '<H1>'. Adding "?" after the qualifier makes it perform the
match in non-greedy or minimal fashion; as few characters as possible
will be matched. Using .*? in the previous expression will match only
'<H1>'.
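
A small sketch of the difference described in that passage. It uses the '<H1>title</H1>' string from the quoted docs plus a made-up two-comment string (doc is an assumption for illustration only):

```python
import re

text = "<H1>title</H1>"

# Greedy: .* grabs as much as it can, so the match runs to the last '>'.
greedy = re.match(r"<.*>", text).group(0)     # '<H1>title</H1>'

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
minimal = re.match(r"<.*?>", text).group(0)   # '<H1>'

# The same effect on the comment-stripping pattern from this thread.
doc = "a<!-- one -->b<!-- two -->c"
greedy_sub = re.sub(r"<!--.*-->", "", doc)    # 'ac'  (the 'b' is lost)
minimal_sub = re.sub(r"<!--.*?-->", "", doc)  # 'abc'
```

Note that re.sub already replaces every non-overlapping match, so no explicit loop is needed to repeat the procedure.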
 
Jonathan Curran

Paul> Did you read Chapter 8? Regexes are 7; 8 is about processing HTML.
Paul> Regexes are not well suited to this type of processing.
Paul>
Paul> Yes. Use DOTALL mode.

Paul, I mentioned Chapter 8 so that vertigo would take a look at the HTML
processing section. What vertigo wants can be done with relative ease with sgmllib.

Anyway, if you (vertigo) want to use regular expressions to do this, you can
try one of the regular expression testing programs. I'm not quite sure of
the name, but there is one that comes with KDE.

- Jonathan Curran
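
A rough sketch of the parser-based approach, with some loud assumptions: sgmllib was removed in Python 3, so this uses the standard library's html.parser.HTMLParser instead. The class and function names are hypothetical, and the code is not taken from Dive Into Python. It re-emits the document while dropping comments and <script> elements; processing instructions are silently dropped and end tags are rebuilt from the lowercased tag name, so treat it as an illustration rather than a faithful round-trip.

```python
from html.parser import HTMLParser


class CommentAndScriptStripper(HTMLParser):
    """Hypothetical helper: copy HTML through, minus comments and scripts."""

    def __init__(self):
        # convert_charrefs=False so entities like &amp; are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit the original text once.
        if tag != "script" and not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    # handle_comment is left at its default (do nothing), which drops comments.


def strip_scripts_and_comments(html):
    parser = CommentAndScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ('<p>Hi &amp; bye</p><!--start\nof\ncomment, end-->'
          '<script type="text/javascript">\nalert("x");\n</script>'
          '<br/><p>end</p>')
cleaned = strip_scripts_and_comments(sample)
```

Unlike the regex, the parser knows where a script element or comment really ends, so multi-line content needs no special flag.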
 
johnzenger

Not just Python: most regex engines work this way by default. You want a ?
after your *, as in <!--(.*?)--> if you want it to catch the first
available "-->".

At this point in your adventure, you might be wondering whether regular
expressions are more trouble than they are worth. They are. There are
two libraries you need to take a look at, and soon: BeautifulSoup for
parsing HTML, and PyParsing for parsing everything else. Take the time
you were planning to spend on deciphering regexes like
"(\d{1,3}\.){3}\d{1,3}" and spend it learning the basics of those
libraries instead -- you will not regret it.
 
