python re - a not needed

kepes.krisztian · Dec 16, 2004

Hi!

I want to extract information from an HTML page, and I need to match all characters except <.
"All characters" means everything above chr(31), including those above chr(128) (Hungarian accented letters).
The .* pattern is very greedy: it consumes the < characters too.

If I could use a "not", I could simply define a regexp like this:
[not<]*</a>

That would capture everything in the href.

I wrote this program, but I think it is too complex:

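import re

# Build the allowed set by hand: every character from chr(33) to chr(64)
# except '<' and '>'.
l = []
for i in range(33, 65):
    if i != ord('<') and i != ord('>'):
        # re.escape backslash-escapes only the characters that need it
        l.append(re.escape(chr(i)))
s = '|'.join(l)
# Note: inside [...] the '|' is an ordinary character, not alternation.
all = r'\w|\s|\%s-\%s|%s' % (chr(128), chr(255), s)
sre = '<Subj>([%s]{1,1024})</d>' % all
#sre='<Subj>([?!\\<]{1,1024})</d>'
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'

print(sre)
print(s)
cp = re.compile(sre)
m = cp.search(s)
print(m.groups())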

Does Python's regexp syntax have an exclusion or "not" function? How do I use it?

Thanks for the help,
kk

Peter Otten · Dec 16, 2004

You could try these regexps or variants thereof:

"<Subj>([^<]*)"

A leading '^' inside the brackets negates the character set: it then matches
any single character except those listed after the '^'.
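
A minimal sketch of the negated character set, applied to the sample string from the original post. It assumes Python 3 (the thread itself predates it), and the variable names are only illustrative:

```python
import re

# Sample string taken from the original post.
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'

# [^<]* matches any run of characters that are not '<',
# so the group ends right before the first '<' after <Subj>.
m = re.search(r'<Subj>([^<]*)', s)
subject = m.group(1)
print(subject)  # xmvccv ÁÁÁ sdfkdsfj eirfie
```

No hand-built list of allowed characters is needed; accented letters are matched simply because they are not '<'.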

"<Subj>(.*?)<"

The '?' makes the preceding '*' non-greedy, i.e. it matches as few characters
as possible, so the following '<' will match the first '<' character
encountered after "<Subj>" in the string to be searched.
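
To see the difference, here is a small comparison of the greedy and non-greedy forms on the same sample string. This is only a sketch, again assuming Python 3; the names `greedy` and `lazy` are made up for the illustration:

```python
import re

# Sample string taken from the original post.
s = '<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'

# Greedy: '.*' grabs as much as it can, so the final '<' in the
# pattern ends up matching the LAST '<' in the string.
greedy = re.search(r'<Subj>(.*)<', s).group(1)

# Non-greedy: '.*?' grabs as little as it can, so the final '<'
# matches the FIRST '<' after <Subj>.
lazy = re.search(r'<Subj>(.*?)<', s).group(1)

print(greedy)  # xmvccv ÁÁÁ sdfkdsfj eirfie</d><A>
print(lazy)    # xmvccv ÁÁÁ sdfkdsfj eirfie
```

One thing to keep in mind when choosing between the two suggestions: '.' does not match a newline unless re.DOTALL is passed, while [^<] does, so the patterns can behave differently on multi-line input.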

Peter

Max M · Dec 16, 2004

kepes.krisztian said:
I want to extract information from an HTML page, and I need to match all characters except <.
"All characters" means everything above chr(31), including those above chr(128) (Hungarian accented letters).
The .* pattern is very greedy: it consumes the < characters too.

Instead of writing ad-hoc HTML parsers, use BeautifulSoup.

http://www.crummy.com/software/BeautifulSoup/

It will most likely do what you want in 2 or 3 lines of code.

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science

Paul Rubin · Dec 16, 2004

Max M said:
Instead of writing ad-hoc HTML parsers, use BeautifulSoup.

http://www.crummy.com/software/BeautifulSoup/

Hey, I like that. Thanks.
