html 2 plain text

Discussion in 'Python' started by robin, May 28, 2006.

  1. robin

    robin Guest

    hi,
    i remember seeing this simple python function which would take raw html
    and output the content (body?) of the page as plain text (no <..> tags
    etc)
    i have been looking at htmllib and htmlparser but this all seems too
    complicated for what i'm looking for. i just need the main text in the
    body of some arbitrary webpage to then do some natural-language
    processing with it...
    thanks for pointing me to some helpful resources!

    robin
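
    For reference, the parser-based route mentioned in the question does not
    have to be long. What follows is only a minimal sketch, assuming Python 3,
    where the old HTMLParser module lives at html.parser. The names
    TextExtractor and html_to_text are made up for illustration and are not
    the function the poster remembers. It collects text nodes, skips the
    contents of <script> and <style>, and collapses leftover whitespace. Text
    from <title> is kept, so drop it separately if only the body matters.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True means entities such as &amp; arrive decoded.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

    Calling html_to_text(page_source) on a fetched page returns one
    whitespace-normalised string, which is usually a reasonable starting point
    for tokenising. Badly broken markup may still need a more forgiving parser.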
     
    robin, May 28, 2006
    #1

  2. robin

    Faber Guest

    robin wrote:

    > i remember seeing this simple python function which would take raw html
    > and output the content (body?) of the page as plain text (no <..> tags
    > etc)
    > i have been looking at htmllib and htmlparser but this all seems too
    > complicated for what i'm looking for. i just need the main text in the
    > body of some arbitrary webpage to then do some natural-language
    > processing with it...
    > thanks for pointing me to some helpful resources!


    Have a look at the Beautiful Soup library:
    http://www.crummy.com/software/BeautifulSoup/

    Regards

    --
    Faber
    http://faberbox.com/
    http://smarking.com/

    A teacher must always teach to doubt his teaching. -- José Ortega y Gasset
     
    Faber, May 28, 2006
    #2

  3. robin

    robin Guest

    looks yummy. many thanks.

    robin
     
    robin, May 28, 2006
    #3
  4. robin

    Ravi Teja Guest

    > i remember seeing this simple python function which would take raw html
    > and output the content (body?) of the page as plain text (no <..> tags
    > etc)


    http://www.aaronsw.com/2002/html2text/
     
    Ravi Teja, May 28, 2006
    #4
  5. robin

    Guest

    robin wrote:
    > hi,
    > i remember seeing this simple python function which would take raw html
    > and output the content (body?) of the page as plain text (no <..> tags
    > etc)
    > i have been looking at htmllib and htmlparser but this all seems too
    > complicated for what i'm looking for. i just need the main text in the
    > body of some arbitrary webpage to then do some natural-language
    > processing with it...
    > thanks for pointing me to some helpful resources!


    import re
    text = re.sub(r'(?s)<.+?>', '', html_text)
    (this will keep html entities, though)
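
    If the leftover entities are a problem, one possible follow-up is to decode
    them after stripping the tags. This is only a sketch, assuming Python 3.4
    or later, where html.unescape exists in the standard library. The names
    TAG_RE and strip_tags are made up for illustration. Decoding happens last
    so that a decoded "<" is not mistaken for the start of a tag.

```python
import html
import re

# Same pattern as the one-liner: (?s) lets "." match newlines inside a tag.
TAG_RE = re.compile(r"(?s)<.+?>")


def strip_tags(html_text):
    """Hypothetical helper: drop tags, then decode entities like &amp;."""
    without_tags = TAG_RE.sub("", html_text)
    return html.unescape(without_tags)
```

    Note that a regular expression like this still keeps the contents of
    <script> and <style> elements and can be confused by a ">" inside an
    attribute value, so it is best suited to simple, well-behaved markup.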

    --
    -----------------------------------------------------------
    | Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
    | __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
    -----------------------------------------------------------
    Antivirus alert: file .signature infected by signature virus.
    Hi! I'm a signature virus! Copy me into your signature file to help me spread!
     
    , May 29, 2006
    #5
  6. Fredrik Lundh, May 29, 2006
    #6