regexp

vertigo · Dec 19, 2006

Hello

I need to apply some regular expressions across more than one line,
and I would like to use modifiers like /m or /s in Perl.
For example:
re.sub("<script.*>.*</script>","",data)

will not cut out all JavaScript code if it is spread over many lines.
I could use something like /s from Perl, which makes . match every character
(including newline). How can I do that?

Maybe there is another way to achieve the same result?

Thanx

Fredrik Lundh · Dec 19, 2006

vertigo said:
I need to apply some regular expressions across more than one line,
and I would like to use modifiers like /m or /s in Perl.
For example:
re.sub("<script.*>.*</script>","",data)

will not cut out all JavaScript code if it is spread over many lines.

that won't cut out all javascript code period.

</F>

Jonathan Curran · Dec 19, 2006

Take a look at Chapter 8 of 'Dive Into Python.'
http://diveintopython.org/toc/index.html

You can modify the code there to get the results you need. Buy the book
if you can; it has lots of neat examples.

- Jonathan Curran

vertigo · Dec 19, 2006

that won't cut out all javascript code period.

Do you have any idea what will?
I need to cut out everything but the plain text data.

Thanx

vertigo · Dec 19, 2006

Hello

Take a look at Chapter 8 of 'Dive Into Python.'
http://diveintopython.org/toc/index.html

I read the whole regexp chapter, but there was no solution for my problem.
Example:

re.sub("<!--.*-->","",htmldata)
would remove only comments that fit on one line.
If a comment spans many lines, it would not work, because '.' does not match '\n'.

Does anybody know a solution for this particular problem?

Thanx

johnzenger · Dec 19, 2006

You want re.sub("(?s)<!--.*-->", "", htmldata)

Explanation: To make the dot match all characters, including newlines,
you need to set the DOTALL flag. You can set the flag using the (?_)
syntax, which is explained in section 4.2.1 of the Python Library
Reference.

A more readable way to do this is:

obj = re.compile("<!--.*-->", re.DOTALL)
re.sub("", htmldata)

johnzenger · Dec 19, 2006

Oops, I mean obj.sub("", htmldata)
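
A minimal sketch of the two ways to turn on DOTALL mentioned above, assuming the pattern in question is the HTML-comment pattern "<!--.*-->". The sample string in htmldata is made up for illustration.

```python
import re

# Hypothetical sample: one HTML comment spread over two lines.
htmldata = "before<!-- first line\nsecond line -->after"

# Without DOTALL, "." stops at the newline, so nothing matches.
unchanged = re.sub("<!--.*-->", "", htmldata)

# Inline flag: (?s) at the start of the pattern turns DOTALL on.
inline = re.sub("(?s)<!--.*-->", "", htmldata)

# Compiled pattern with the flag passed explicitly.
obj = re.compile("<!--.*-->", re.DOTALL)
compiled = obj.sub("", htmldata)

print(unchanged == htmldata)  # True
print(inline)                 # beforeafter
print(compiled)               # beforeafter
```

Both forms behave the same here; the compiled form is mostly a readability choice when the pattern is reused.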

Paul Arthur · Dec 19, 2006

Hello

I read the whole regexp chapter -

Did you read Chapter 8? Regexes are Chapter 7; Chapter 8 is about processing HTML.
Regexes are not well suited to this type of processing.

but there was no solution for my problem.
Example:

re.sub("<!--.*-->","",htmldata)
would remove only comments that fit on one line.
If a comment spans many lines, it would not work, because '.' does not match '\n'.

Does anybody know a solution for this particular problem?

Yes. Use DOTALL mode.

vertigo · Dec 19, 2006

Hello

Thanks for the help. I have one more question:

I noticed that while matching a regexp, Python tries to match as widely as
possible,
for example:
re.sub("<!--.*-->","",htmldata)
would cut out everything between the first "<!--" and the last "-->" in the
document.
Can I force re to match as narrowly as possible
(to match the first "-->" after the "<!--" and to repeat
this procedure while the pattern is still found)?

Thanx

skip · Dec 19, 2006

vertigo> I noticed that while matching a regexp, Python tries to match as widely as
vertigo> possible,
vertigo> for example:
vertigo> re.sub("<!--.*-->","",htmldata)
vertigo> would cut out everything between the first "<!--" and the last "-->" in the
vertigo> document.
vertigo> Can I force re to match as narrowly as possible?

http://docs.python.org/lib/re-syntax.html

Search for "greedy".

Skip

Paul Arthur · Dec 19, 2006

Please quote some context when responding to posts.

Thanks for the help. I have one more question:

I noticed that while matching a regexp, Python tries to match as widely as
possible,
for example:
re.sub("<!--.*-->","",htmldata)
would cut out everything between the first "<!--" and the last "-->" in the
document.
Can I force re to match as narrowly as possible
(to match the first "-->" after the "<!--" and to repeat
this procedure while the pattern is still found)?

RTFM. Please at least attempt to learn how to do something before
wasting other people's time by having them quote the reference docs to
you.

The "*", "+", and "?" qualifiers are all greedy; they match as much text
as possible. Sometimes this behaviour isn't desired; if the RE <.*> is
matched against '<H1>title</H1>', it will match the entire string, and
not just '<H1>'. Adding "?" after the qualifier makes it perform the
match in non-greedy or minimal fashion; as few characters as possible
will be matched. Using .*? in the previous expression will match only
'<H1>'.
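
The example from the docs quoted above can be checked directly. A small sketch; only the variable names are made up:

```python
import re

text = "<H1>title</H1>"

# Greedy: .* grabs as much as it can, so the match runs to the last ">".
greedy = re.match("<.*>", text).group(0)

# Non-greedy: .*? grabs as little as it can, so the match stops at the first ">".
minimal = re.match("<.*?>", text).group(0)

print(greedy)   # <H1>title</H1>
print(minimal)  # <H1>
```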

Jonathan Curran · Dec 20, 2006

Did you read Chapter 8? Regexes are 7; 8 is about processing HTML.
Regexes are not well suited to this type of processing.

Yes. Use DOTALL mode.

Paul, I mentioned Chapter 8 so that Vertigo would look at the HTML processing
section. What Vertigo wants can be done with relative ease using sgmllib.

Anyway, if you (Vertigo) want to use regular expressions to do this, you can
try one of the regular expression testing programs. I'm not quite sure of
the name, but there is one that comes with KDE.

- Jonathan Curran

johnzenger · Dec 20, 2006

Not just Python: nearly every regex engine is greedy by default. You want a ?
after your *, as in <!--(.*?)--> if you want it to catch the first
available "-->".
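
A short sketch of the difference on a document with more than one comment, assuming DOTALL is also wanted so that comments may span lines. The sample string is hypothetical.

```python
import re

# Hypothetical sample: two comments, the first one spanning two lines.
htmldata = "a<!-- one\ncomment -->b<!-- another -->c"

# Non-greedy: each match ends at the first "-->" after its "<!--".
comment = re.compile("<!--.*?-->", re.DOTALL)

# Greedy: a single match runs from the first "<!--" to the last "-->".
greedy_comment = re.compile("<!--.*-->", re.DOTALL)

narrow = comment.sub("", htmldata)
wide = greedy_comment.sub("", htmldata)

print(narrow)  # abc
print(wide)    # ac
```

sub() already replaces every non-overlapping match, so no explicit loop is needed to repeat the procedure.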

At this point in your adventure, you might be wondering whether regular
expressions are more trouble than they are worth. They are. There are
two libraries you need to take a look at, and soon: BeautifulSoup for
parsing HTML, and PyParsing for parsing everything else. Take the time
you were planning to spend on deciphering regexes like
"(\d{1,3}\.){3}\d{1,3}" and spend it learning the basics of those
libraries instead -- you will not regret it.
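
For a rough idea of what the parser-based approach looks like, here is a sketch that keeps only the text and drops scripts, styles and comments. It is not BeautifulSoup or sgmllib: it uses html.parser from the standard library as a stand-in so that it runs without extra installs. The TextExtractor class, the extract_text helper and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <SCRIPT> is caught too.
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments go to handle_comment, which is left as a no-op here.
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())


sample = """<html><body>
<!-- a comment
over two lines -->
<SCRIPT type="text/javascript">
document.write("hidden");
</SCRIPT>
<p>Hello, <b>world</b>!</p>
</body></html>"""

result = extract_text(sample)
print(result)  # Hello, world!
```

The parser handles multi-line comments and upper-case tags without any flags, which are exactly the cases the regex needed extra work for.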
