RE Module

Roman · Aug 24, 2006

I am trying to filter a column in a list of all html tags.

To do that, I have set up the following statement.

row[0] = re.sub(r'<.*?>', '', row[0])

The results I get are sporadic. Sometimes two tags are removed.
Sometimes 1 tag is removed. Sometimes no tags are removed. Could
somebody tell me where I have gone wrong here?

Thanks in advance

Simon Forman · Aug 25, 2006

Roman said:
I am trying to filter a column in a list of all html tags.
What?

To do that, I have set up the following statement.

row[0] = re.sub(r'<.*?>', '', row[0])

The results I get are sporadic. Sometimes two tags are removed.
Sometimes 1 tag is removed. Sometimes no tags are removed. Could
somebody tell me where I have gone wrong here?

Thanks in advance

I'm no re expert, so I won't try to advise you on your re, but it might
help those who are if you gave examples of your input and output data.
What results are you getting for what input strings?

Also, if you're just trying to strip html markup to get plain text from
a file, "w3m -dump some.html" works great. ;-)
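If an external tool is not an option, a parser-based approach in Python is another way to get plain text without relying on a regex. The sketch below is an illustration only and was not proposed in this thread: it uses the standard library's html.parser in Python 3 syntax, and the names TextExtractor and strip_tags are made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags; tags and comments are skipped."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text content only, never for tags or comments.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)
```

Note that the contents of script and style elements are also reported as text by this sketch, so a real page may need extra filtering.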

HTH,
~Simon

Anthra Norell · Aug 25, 2006

Roman,

Your re works for me. I suspect you have tags spanning lines, a thing you get more often than not. If so, processing linewise
doesn't work. You need to catch the tags like this:

text = re.sub ('<(.|\n)*?>', '', text)

If your text is reasonably small I would recommend this solution. Else you might want to take a look at SE which is a stream editor
that does the buffering for you:

http://cheeseshop.python.org/pypi/SE/2.2 beta

import SE
Tag_Stripper = SE.SE (' "~<(.|\n)*?>~=" "~~=" ')
print Tag_Stripper (text)

(... your text without tags ...)

The Tag_Stripper is made up of two regexes. The second one catches comments which may nest tags. The first expression alone would
also catch comments, but would mistake the '>' of the first nested tag for the end of the comment and quit prematurely. The example
"re.sub ('<(.|\n)*?>', '', text)" above would misperform in this respect.
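The comment problem described above can be reproduced with the plain re module. This is only a sketch in Python 3 syntax: the sample string is invented, and the comment pattern '<!--(.|\n)*?-->' is an assumption made for illustration, not necessarily the second expression used in the SE definition.

```python
import re

text = 'keep <!-- hidden <b>bold</b> --> this'

# One pass with the single tag pattern: the match that starts at '<!--'
# stops at the '>' of the nested <b> tag, so part of the comment survives.
one_pass = re.sub(r'<(.|\n)*?>', '', text)

# Two passes: remove whole comments first, then the remaining tags.
no_comments = re.sub(r'<!--(.|\n)*?-->', '', text)
two_pass = re.sub(r'<(.|\n)*?>', '', no_comments)

print(repr(one_pass))
print(repr(two_pass))
```

With this input the one-pass result still contains the word "bold" and the trailing "-->", while the two-pass result contains neither.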

Your Tag_Stripper takes input from files directly:

Tag_Stripper ('name_of_file.htm', 'name_of_output_file')

'name_of_output_file'

Or if you want to view the output:

Tag_Stripper ('name_of_file.htm', '')

(... your text without tags ...)

If you want to keep the definitions for later use, do this:

Tag_Stripper.save ('[your_path/]tag_stripper.se')

Your definitions are now saved in the file 'tag_stripper.se'. You can edit that file. The next time you need a Tag_Stripper you can
make it simply by naming the file:

Tag_Stripper = SE.SE ('[your_path/]tag_stripper.se')

You can easily expand the capabilities of your Tag_Stripper. If, for instance, you want to translate the ampersand escapes (&nbsp;
etc.) you'd simply add the name of the file that defines the ampersand replacements:

Tag_Stripper = SE.SE ('tag_stripper.se htm2iso.se')

'htm2iso.se' comes with the SE package ready to use and as an example for writing one's own replacement sets.

Frederic

Roman · Aug 25, 2006

Thanks for your help.

A thing I didn't mention is that before the statement row[0] =
re.sub(r'<.*?>', '', row[0]), I have row[0]=re.sub(r'[^
0-9A-Za-z\"\'\.\,\#\@\!\(\)\*\&\%\%\\\/\:\;\?\`\~\<\>]', '', row[0])
statement. Hence, the line separators are going to be gone. You
mentioned the size of the string could be a factor. If so what is the
max size before I see problems?

Thanks again

tobiah · Aug 25, 2006

Roman said:
I am trying to filter a column in a list of all html tags.

To do that, I have setup the following statement.

row[0] = re.sub(r'<.*?>', '', row[0])

The regex will be 'greedy' and match through one tag
all the way to the end of another on the same line.
There are more complete suggestions offered, but
it seems to me that the simple fix here is to not
match through the end of the tag, like this:

"<[^>]*>"
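For anyone who wants to compare the patterns side by side, here is a small sketch in Python 3 syntax. The sample strings are invented for illustration. It runs the greedy form, the pattern from the original post, and the negated-class pattern above on the same input, and then on a tag that is broken across two lines.

```python
import re

row = '<td><b>Price</b>: 10</td>'

greedy = re.sub(r'<.*>', '', row)      # greedy: first '<' to last '>'
lazy = re.sub(r'<.*?>', '', row)       # the pattern from the original post
negated = re.sub(r'<[^>]*>', '', row)  # the negated-class pattern

# A tag broken across two lines: '.' does not match '\n' by default,
# but '[^>]' does.
wrapped = 'a<a\nhref="x">link</a>b'
lazy_wrapped = re.sub(r'<.*?>', '', wrapped)
negated_wrapped = re.sub(r'<[^>]*>', '', wrapped)

print(repr(greedy), repr(lazy), repr(negated))
print(repr(lazy_wrapped), repr(negated_wrapped))
```

On a single line the last two patterns give the same result; they differ only once a tag contains a newline.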

Roman · Aug 25, 2006

This is excellent. Thanks a lot.

Also, what made the expression greedy?

Anthra Norell · Aug 25, 2006

Roman,

I don't quite understand what you mean. Line separators gone? That would be the '\n', right? What of it if you process line by line,
as your variable name 'row' suggests?
As to the maximum size re can handle, I have no idea. I vaguely remember the topic being discussed. You should be able to find
the discussions in the archives, if a knowledgeable soul doesn't volunteer the info right away. With SE it is of no concern.

Anyway, I think the best thing to do is to just try with a real page:
( ... page without tags, but lots of empty lines ...)

If you want to take the empty lines out, do this:

"|" means do the preceding replacements (which happen to be deletions: replace with nothing) and go on from there. The expressions
we added say: delete lines that contain only spaces. Do that (another "|"). And finally replace multiple consecutive line feeds with
a single line feed.
So you can develop interactively. Add a definition. See what it does. Add another one. One little step at a time. Hacking at
its best!
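The same sequence of steps can be approximated with the standard re module. This is not the SE definition from this message, only a rough sketch in Python 3 syntax under that assumption; the function name strip_and_tidy and the sample page are made up for the example.

```python
import re


def strip_and_tidy(text):
    """Strip markup, then tidy up the blank lines left behind."""
    # Step 1: delete comments, then tags (replace with nothing).
    text = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)
    text = re.sub(r'<[^>]*>', '', text)
    # Step 2: empty out lines that contain only spaces or tabs.
    text = re.sub(r'^[ \t]+$', '', text, flags=re.MULTILINE)
    # Step 3: replace runs of consecutive line feeds with a single one.
    text = re.sub(r'\n{2,}', '\n', text)
    return text


page = '<html>\n<body>\n   \n<p>Hello</p>\n\n\n<p>World</p>\n</body>\n</html>\n'
print(repr(strip_and_tidy(page)))
```

Each step is a separate substitution, so they can be added and tested one at a time in the interpreter, in the same spirit as the interactive approach described above.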

Frederic

----- Original Message -----
From: "Roman" <[email protected]>
Newsgroups: comp.lang.python
To: <[email protected]>
Sent: Friday, August 25, 2006 6:14 PM
Subject: Re: RE Module

Roman said:
Thanks for your help.

A thing I didn't mention is that before the statement row[0] =
re.sub(r'<.*?>', '', row[0]), I have row[0]=re.sub(r'[^
0-9A-Za-z\"\'\.\,\#\@\!\(\)\*\&\%\%\\\/\:\;\?\`\~\<\>]', '', row[0])
statement. Hence, the line separators are going to be gone. You
mentioned the size of the string could be a factor. If so what is the
max size before I see problems?

Thanks again

Anthra said:

Roman,

Your re works for me. I suspect you have tags spanning lines, a thing you get more often than not. If so, processing linewise
doesn't work. You need to catch the tags like this:

text = re.sub ('<(.|\n)*?>', '', text)

If your text is reasonably small I would recommend this solution. Else you might want to take a look at SE which is a stream editor
that does the buffering for you:

http://cheeseshop.python.org/pypi/SE/2.2 beta

import SE
Tag_Stripper = SE.SE (' "~<(.|\n)*?>~=" "~~=" ')
print Tag_Stripper (text)

(... your text without tags ...)

The Tag_Stripper is made up of two regexes. The second one catches comments which may nest tags. The first expression alone would
also catch comments, but would mistake the '>' of the first nested tag for the end of the comment and quit prematurely. The example
"re.sub ('<(.|\n)*?>', '', text)" above would misperform in this respect.

Your Tag_Stripper takes input from files directly:

Tag_Stripper ('name_of_file.htm', 'name_of_output_file')

'name_of_output_file'

Or if you want to view the output:

Tag_Stripper ('name_of_file.htm', '')

(... your text without tags ...)

If you want to keep the definitions for later use, do this:

Tag_Stripper.save ('[your_path/]tag_stripper.se')

Your definitions are now saved in the file 'tag_stripper.se'. You can edit that file. The next time you need a Tag_Stripper you can
make it simply by naming the file:

Tag_Stripper = SE.SE ('[your_path/]tag_stripper.se')

You can easily expand the capabilities of your Tag_Stripper. If, for instance, you want to translate the ampersand escapes (&nbsp;
etc.) you'd simply add the name of the file that defines the ampersand replacements:

Tag_Stripper = SE.SE ('tag_stripper.se htm2iso.se')

'htm2iso.se' comes with the SE package ready to use and as an example for writing one's own replacement sets.

Frederic

tobiah · Aug 25, 2006

Roman said:
This is excellent. Thanks a lot.

Also, what made the expression greedy?

They usually are, by default. It means that when there
is more than one way to match the pattern, the engine chooses
the one that matches the most text. Often there are flags
available to change that behavior. I'm not sure offhand
how to do it with the re module.

Tim Chase · Aug 25, 2006

Also, what made the expression greedy?

They usually are, by default. It means that when there
is more than one way to match the pattern, the engine chooses
the one that matches the most text. Often there are flags
available to change that behavior. I'm not sure offhand
how to do it with the re module.

In python's RE module, they're like Perl:

Greedy: "<.*>"
Nongreedy: "<.*?>"

By appending a question mark onto the operator, one makes it a
non-greedy repeat. It also applies to the plus ("one or more")
and the question mark ("zero or one").
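A short sketch of all three repeat operators in both forms, written in Python 3 syntax. The sample strings are invented for illustration.

```python
import re

s = '<b>bold</b> and <i>italic</i>'

# '*' greedy vs. non-greedy
star_greedy = re.findall(r'<.*>', s)
star_lazy = re.findall(r'<.*?>', s)

# '+' greedy vs. non-greedy
plus_greedy = re.search(r'<.+>', s).group(0)
plus_lazy = re.search(r'<.+?>', s).group(0)

# '?' greedy vs. non-greedy: the optional 's' is taken only in the greedy form
opt_greedy = re.search(r'https?', 'https://x').group(0)
opt_lazy = re.search(r'https??', 'https://x').group(0)

print(star_greedy, star_lazy)
print(plus_greedy, plus_lazy)
print(opt_greedy, opt_lazy)
```

The greedy forms take as much as they can while still letting the whole pattern match; the non-greedy forms take as little as they can.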

-tkc

tobiah · Aug 25, 2006

In python's RE module, they're like Perl:

Greedy: "<.*>"
Nongreedy: "<.*?>"

Oh, I have never seen that. In that case, why
did Roman's first example not work well for
HTML tags?

'<.*?>'

Also, how does the engine decide whether I am adjusting
the greed of the previous operator, or just asking
for another possible character?

Suppose I want:

"x*?" to match "xxxxxxxO"

If the '?' means non greedy, then I should get 'x' back.
If the '?' means optional character then I should get
the full string back.

Checking in python:

######################################
import re

s = 'xxxxxxx0'

m = re.search("x*", s)
print "First way", m.group(0)

m = re.search("x*?", s)
print "Second way", m.group(0)
#####################################
First way xxxxxxx
Second way

So now I'm really confused. It didn't do a non-greedy
'x' match, nor did it allow the '?' to match the '0'.

Fredrik Lundh · Aug 25, 2006

tobiah said:
Also, how does the engine decide whether I am adjusting
the greed of the previous operator, or just asking
for another possible character?

"?" always modifies the *preceding* RE element.

if the preceding element is a pattern (e.g. a character or group), it
means that the pattern is optional.

if the preceding element is a repeat modifier (*, +, or ?), it changes
the greediness.

Suppose I want:

"x*?" to match "xxxxxxxO"

If the '?' means non greedy, then I should get 'x' back.

no, because "*" means *ZERO* or more matches, not one or more.

If the '?' means optional character then I should get
the full string back.

no, because "?" never means anything on its own; it's a pattern
modifier, not a pattern.

Checking in python:

######################################
import re

s = 'xxxxxxx0'

m = re.search("x*", s)
print "First way", m.group(0)

m = re.search("x*?", s)
print "Second way", m.group(0)
#####################################
First way xxxxxxx
Second way

So now I'm really confused. It didn't do a non-greedy
'x' match, nor did it allow the '?' to match the '0'.

see above. reading the RE documentation again may also help.
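The two roles of "?" can be checked directly on the string from the test above. This is a small sketch in Python 3 syntax; the variable names are only for illustration.

```python
import re

s = 'xxxxxxx0'

# "?" after a pattern element: the element is optional (zero or one 'x').
optional = re.search(r'x?', s).group(0)

# "?" after a repeat: the repeat becomes non-greedy.
lazy_star = re.search(r'x*?', s).group(0)  # zero 'x' is enough, so it stops
lazy_plus = re.search(r'x+?', s).group(0)  # at least one 'x' is required

# A non-greedy repeat still grows as far as needed for the rest to match.
anchored = re.search(r'x*?0', s).group(0)

print(repr(optional), repr(lazy_star), repr(lazy_plus), repr(anchored))
```

The empty result for "x*?" is the non-greedy match at work: zero repetitions already satisfy the pattern, so the engine stops there.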

</F>

Tim Chase · Aug 25, 2006

######################################

import re

s = 'xxxxxxx0'

m = re.search("x*", s)
print "First way", m.group(0)

m = re.search("x*?", s)
print "Second way", m.group(0)
#####################################
First way xxxxxxx
Second way

So now I'm really confused. It didn't do a non-greedy
'x' match, nor did it allow the '?' to match the '0'.

it did do a non-greedy match. It found as few "x"s as possible.
it found 0 of them, and quit. For a better test, use

s = '<tag 1><tag 2>'
print re.search('<tag.*?>',s).group(0)
print re.search('<tag.*>',s).group(0)

(the question/problem at hand)

-tkc

Roman · Aug 26, 2006

I looked at a book called Beginning Python and it claims that <.*?> is
a non-greedy match.

tobiah · Aug 28, 2006

Roman said:
I looked at a book called Beginning Python and it claims that <.*?> is
a non-greedy match.

Yeah, I get that now, but why didn't it work for you in
the first place?

Roman · Aug 29, 2006

It turns out it was a false alarm. It works. I had other logic in the
expression involving punctuation marks and got all confused with the
escape characters. It becomes a mess trying to keep track of all the
reserved characters as you go from module to module.
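One way to reduce that kind of confusion is to let re.escape() add the backslashes and to keep the pattern in a raw string. The sketch below is in Python 3 syntax and is only an assumption about what the clean-up step was meant to do: KEEP_PUNCT roughly mirrors the punctuation in the statement quoted earlier in the thread, and the names KEEP_PUNCT, ALLOWED, TAG and clean are made up for the example.

```python
import re

# Hypothetical set of punctuation characters to keep.
KEEP_PUNCT = '"\'.,#@!()*&%\\/:;?`~<>'

# re.escape() adds whatever backslashes the regex engine needs, and the raw
# f-string avoids a second layer of Python string escaping.
ALLOWED = re.compile(rf'[^ 0-9A-Za-z{re.escape(KEEP_PUNCT)}]')
TAG = re.compile(r'<[^>]*>')


def clean(cell):
    """Drop disallowed characters (including newlines), then strip tags."""
    return TAG.sub('', ALLOWED.sub('', cell))


print(repr(clean('a\n<b>\tc</b>\u00e9 10%')))
```

Because the first substitution removes newlines and tabs along with everything else outside the allowed set, the tag pattern afterwards only ever sees a single line.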
