Python re - a "not" needed

Discussion in 'Python' started by kepes.krisztian, Dec 16, 2004.

  1. Hi!

    I want to extract information from an HTML page, but I need to match all characters except <.
    "All characters" means everything above chr(31), including those above chr(128) (Hungarian accented letters).
    The .* pattern is very greedy: it consumes the < characters too.

    If I could use a "not", I could simply define a regexp like this:
    [not<]*</a>

    That would get everything inside the href.

    I wrote this program, but I think it is too complex:

    import re

    l=[]
    for i in range(33,65):
        if i<>ord('<') and i<>ord('>'):
            l.append('\\'+chr(i))
    s='|'.join(l)
    all='\w|\s|\%s-\%s|%s'%(chr(128),chr(255),s)
    sre='<Subj>([%s]{1,1024})</d>'%all
    #sre='<Subj>([?!\\<]{1,1024})</d>'
    s='<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>'


    print sre
    print s
    cp=re.compile(sre)
    m=cp.search(s)
    print m.groups()

    Does Python's regexp syntax have an exclusion or a "not" function? How do I use it?

    Thanks for your help,
    kk
     
    kepes.krisztian, Dec 16, 2004
    #1

  2. Peter Otten Guest

    You could try these regexps or variants thereof:

    "<Subj>([^<]*)"

    A leading '^' inside the brackets negates the character set: it matches
    any character except those listed after the '^'.
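    A minimal sketch of the negated set in use, assuming Python 3 and the sample string from the question (the variable names are only illustrative):

```python
import re

# Sample string taken from the original question.
sample = "<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>"

# [^<]* matches any run of characters that are not '<',
# so the group stops right before the first '<' after <Subj>.
negated = re.compile(r"<Subj>([^<]*)")

match = negated.search(sample)
subject = match.group(1) if match else None
print(subject)
```

    Because the set excludes only '<', accented letters such as 'Á' need no special handling.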

    "<Subj>(.*?)<"

    The '?' makes the preceding '*' non-greedy, i.e. it matches as few
    characters as possible, so the following '<' matches the first '<'
    character encountered after <Subj> in the string being searched.
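    To illustrate the difference, here is a small sketch (assuming Python 3 and the sample string from the question) that runs the greedy and the non-greedy form side by side:

```python
import re

# Sample string taken from the original question.
sample = "<Subj>xmvccv ÁÁÁ sdfkdsfj eirfie</d><A></d>"

# Greedy: .* grabs as much as it can, then backs off to the LAST '<'.
greedy = re.search(r"<Subj>(.*)<", sample).group(1)

# Non-greedy: .*? grabs as little as it can, stopping at the FIRST '<'.
lazy = re.search(r"<Subj>(.*?)<", sample).group(1)

print(repr(greedy))
print(repr(lazy))
```

    The greedy group swallows the intermediate tags, while the non-greedy group contains only the text up to the first '<'.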

    Peter
     
    Peter Otten, Dec 16, 2004
    #2

  3. Max M Guest

    kepes.krisztian wrote:

    > I want to extract information from an HTML page, but I need to match all characters except <.
    > "All characters" means everything above chr(31), including those above chr(128) (Hungarian accented letters).
    > The .* pattern is very greedy: it consumes the < characters too.


    Instead of writing ad-hoc HTML parsers, use BeautifulSoup.

    http://www.crummy.com/software/BeautifulSoup/

    It will most likely do what you want in 2 or 3 lines of code.
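    As a rough illustration of the parser-based approach, here is a sketch that uses the standard library's html.parser instead of BeautifulSoup (a deliberate swap so that it runs without third-party packages). The class name LinkTextCollector, the helper link_texts, and the sample HTML are all made up for this example:

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text found between <a ...> and </a> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased.
        if tag == "a":
            self.in_link = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self.in_link:
            self.texts[-1] += data


def link_texts(html):
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


# Invented sample with Hungarian accented letters.
sample = '<p><a href="/one">első</a> and <a href="/two">második link</a></p>'
print(link_texts(sample))
```

    The parser tracks tag boundaries itself, so no character set has to be spelled out for the accented letters.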

    --

    hilsen/regards Max M, Denmark

    http://www.mxm.dk/
    IT's Mad Science
     
    Max M, Dec 16, 2004
    #3
  4. Paul Rubin Guest

    Paul Rubin, Dec 16, 2004
    #4