Re: how to get text from a html file?

Discussion in 'Python' started by Chris Colbert, Apr 13, 2010.

  1. On Tue, Apr 13, 2010 at 1:58 PM, varnikat t <> wrote:
    >
    > Hi,
    > Can anyone tell me how to get text from an HTML file? I am trying to display
    > the text of an HTML file in a textview (of Glade). If I display the file
    > directly, it shows the HTML tags, attributes, etc. in the textview. I don't
    > want that. I just want the text.
    > Can someone help me with this?
    >
    >
    > Regards
    > Varnika Tewari
    >
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >
    >


    You should look into beautiful soup

    http://www.crummy.com/software/BeautifulSoup/
    Chris Colbert, Apr 13, 2010
    #1
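
A minimal sketch of what the Beautiful Soup suggestion could look like. It assumes the later `bs4` package (Beautiful Soup 4), which is not the release that was current when this thread was written, and the function name `html_to_text` is made up for illustration.

```python
def html_to_text(html):
    """Return the visible text of an HTML string, one text fragment per line."""
    # Imported lazily so a missing third-party package only fails at call time.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements; their contents are not readable text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)
```

The returned string has no tags or attributes, so it should be suitable for placing into a text buffer. Reading the file and setting the textview contents is left to the caller.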

  2. On Tue, Apr 13, 2010 at 1:58 PM, varnikat t <> wrote:

    > Can anyone tell me how to get text from an HTML file? I am trying to display
    > the text of an HTML file in a textview (of Glade). If I display the file
    > directly, it shows the HTML tags, attributes, etc. in the textview. I don't
    > want that. I just want the text.


    [Parent article is unavailable on gmane, so my reply isn't quite in
    the right place in the tree]

    I generally just use something like this:

    from subprocess import Popen, PIPE
    Popen(['w3m', '-dump', filename], stdout=PIPE).stdout.read()

    I'm sure there are more complex ways...

    --
    Grant Edwards               grant.b.edwards at gmail.com
    Yow! I'm having fun HITCHHIKING to CINCINNATI or FAR ROCKAWAY!!
    Grant Edwards, Apr 13, 2010
    #2
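
The w3m approach above can also be sketched with `subprocess.run`. This assumes the `w3m` binary is installed and on the PATH; the function name `html_file_to_text` and the UTF-8 decoding are illustrative assumptions, and `-T text/html` is added so that files without an `.html` extension are still treated as HTML.

```python
import shutil
import subprocess


def html_file_to_text(filename):
    """Render an HTML file to plain text by shelling out to w3m."""
    if shutil.which("w3m") is None:
        raise RuntimeError("w3m is not installed or not on PATH")
    result = subprocess.run(
        ["w3m", "-dump", "-T", "text/html", filename],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if w3m exits with an error
    )
    # Assumption: the rendered output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

This keeps the layout w3m produces (wrapped paragraphs, rendered lists and tables), at the cost of depending on an external program.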

  3. On Apr 13, 2:12 pm, Chris Colbert <> wrote:
    > On Tue, Apr 13, 2010 at 1:58 PM, varnikat t <> wrote:
    >
    > > Hi,
    > > Can anyone tell me how to get text from an HTML file? I am trying to display
    > > the text of an HTML file in a textview (of Glade). If I display the file
    > > directly, it shows the HTML tags, attributes, etc. in the textview. I don't
    > > want that. I just want the text.
    > > Can someone help me with this?

    >
    > > Regards
    > > Varnika Tewari

    >
    > > --
    > > http://mail.python.org/mailman/listinfo/python-list

    >
    > You should look into beautiful soup
    >
    > http://www.crummy.com/software/BeautifulSoup/


    For more complex parsing beautiful soup is definitely the way to go.

    However, if all you want to do is strip the HTML and keep all the
    remaining text, I'd recommend the pyparsing package with this short script:

    http://pyparsing.wikispaces.com/file/view/htmlStripper.py
    rake, Apr 14, 2010
    #3
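
For the "strip the HTML and keep the remaining text" case, here is a rough sketch that uses only the standard library's `html.parser`. To be clear, this is not the pyparsing htmlStripper script linked above, just a stdlib stand-in for the same idea; the names `TextExtractor` and `strip_html` are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def strip_html(html):
    """Return the text content of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Usage would be along the lines of `strip_html(open(filename).read())`, with the result passed to the text buffer. It makes no attempt to preserve layout, so inline elements end up on separate lines.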
  4. rake, 14.04.2010 02:45:
    > On Apr 13, 2:12 pm, Chris Colbert wrote:
    >> You should look into beautiful soup
    >>
    >> http://www.crummy.com/software/BeautifulSoup/

    >
    > For more complex parsing beautiful soup is definitely the way to go.


    Why would a library that even the author has lost interest in be "the way
    to go"?

    Stefan
    Stefan Behnel, Apr 14, 2010
    #4
  5. On 4/13/2010 11:43 PM Stefan Behnel said...
    > rake, 14.04.2010 02:45:
    >> On Apr 13, 2:12 pm, Chris Colbert wrote:
    >>> You should look into beautiful soup
    >>>
    >>> http://www.crummy.com/software/BeautifulSoup/

    >>
    >> For more complex parsing beautiful soup is definitely the way to go.

    >
    > Why would a library that even the author has lost interest in be "the
    > way to go"?
    >
    > Stefan
    >

    Why not when the recent release dates from only five days ago?

    Emile
    Emile van Sebille, Apr 14, 2010
    #5
  6. On 2010-04-14, Stefan Behnel <> wrote:
    >> On Apr 13, 2:12 pm, Chris Colbert wrote:
    >>> You should look into beautiful soup
    >>>
    >>> http://www.crummy.com/software/BeautifulSoup/

    >>
    >> For more complex parsing beautiful soup is definitely the way to go.

    >
    > Why would a library that even the author has lost interest in be "the way
    > to go"?


    Sure, if the library is still being maintained. I can't think of too
    many open-source projects where somebody else hasn't taken over from
    the original author.

    --
    Grant Edwards               grant.b.edwards at gmail.com
    Yow! I'm dressing up in an ill-fitting IVY-LEAGUE SUIT!! Too late...
    Grant Edwards, Apr 14, 2010
    #6
  7. Emile van Sebille, 14.04.2010 15:24:
    > On 4/13/2010 11:43 PM Stefan Behnel said...
    >> rake, 14.04.2010 02:45:
    >>> On Apr 13, 2:12 pm, Chris Colbert wrote:
    >>>> You should look into beautiful soup
    >>>>
    >>>> http://www.crummy.com/software/BeautifulSoup/
    >>>
    >>> For more complex parsing beautiful soup is definitely the way to go.

    >>
    >> Why would a library that even the author has lost interest in be "the
    >> way to go"?

    >
    > Why not when the recent release dates from only five days ago?


    Interesting, even the web site has had a revamp.

    Nice - I like competition. ;)

    Stefan
    Stefan Behnel, Apr 14, 2010
    #7
