How to extract texts from html source?

Discussion in 'Ruby' started by Sam Kong, May 9, 2005.

  1. Sam Kong

    Sam Kong Guest

    Hi, all!

    Quite often, when I need to read a list of web pages, I download the
    html sources and save them in a single file like a.html.
    If they are mostly texts, I open the html using web browser, select all
    and copy it to an editor and save it.
    I want to make the process shorter.
    How can I extract the text from html source?
    I'm sure there're many parsers for it.
    What is the most convenient one?

    Thanks.
    Sam
    Sam Kong, May 9, 2005
    #1

  2. James Britt

    James Britt Guest

    Sam Kong wrote:
    > Hi, all!
    >
    > Quite often, when I need to read a list of web pages, I download the
    > html sources and save them in a single file like a.html.
    > If they are mostly texts, I open the html using web browser, select all
    > and copy it to an editor and save it.
    > I want to make the process shorter.
    > How can I extract the text from html source?
    > I'm sure there're many parsers for it.
    > What is the most convenient one?



    Take a look at Michael Neumann's WWW::Mechanize

    http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
    http://rubyforge.org/frs/?group_id=427&release_id=2014

    Or install the gem


    James




    --

    http://www.ruby-doc.org
    http://www.rubyxml.com
    http://catapult.rubyforge.com
    http://orbjson.rubyforge.com
    http://ooo4r.rubyforge.com
    http://www.jamesbritt.com
    James Britt, May 9, 2005
    #2

  3. Brian Schröder

    On 09/05/05, James Britt wrote:
    > Sam Kong wrote:
    > > Hi, all!
    > >
    > > Quite often, when I need to read a list of web pages, I download the
    > > html sources and save them in a single file like a.html.
    > > If they are mostly texts, I open the html using web browser, select all
    > > and copy it to an editor and save it.
    > > I want to make the process shorter.
    > > How can I extract the text from html source?
    > > I'm sure there're many parsers for it.
    > > What is the most convenient one?

    >
    > Take a look at Michael Neumann's WWW::Mechanize
    >
    > http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
    > http://rubyforge.org/frs/?group_id=427&release_id=2014
    >
    > Or install the gem
    >
    > James


    You don't need ruby for this:

    $ apt-cache show w3m
    Package: w3m
    [snip]
    Description: WWW browsable pager with excellent tables/frames support
    w3m is a text-based World Wide Web browser with IPv6 support.
    It features excellent support for tables and frames. It can be used
    as a standalone file pager, too.
    Brian Schröder, May 9, 2005
    #3
  4. Sam Kong

    Sam Kong Guest

    James Britt wrote:
    > Sam Kong wrote:
    > > Hi, all!
    > >
    > > Quite often, when I need to read a list of web pages, I download the
    > > html sources and save them in a single file like a.html.
    > > If they are mostly texts, I open the html using web browser, select all
    > > and copy it to an editor and save it.
    > > I want to make the process shorter.
    > > How can I extract the text from html source?
    > > I'm sure there're many parsers for it.
    > > What is the most convenient one?

    >
    >
    > Take a look at Michael Neumann's WWW::Mechanize
    >
    > http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
    > http://rubyforge.org/frs/?group_id=427&release_id=2014
    >
    > Or install the gem


    Thanks, James.
    That looks cool.
    However, it doesn't seem to have a function to extract texts from html.
    (Or did I miss it?)
    What I want is...

    <table><tr><td>TEST</td></tr></table> => TEST

    Is there a module that does this?

    Regards,
    Sam

    Sam Kong, May 9, 2005
    #4
  5. Sam Kong

    Sam Kong Guest

    Brian Schröder wrote:
    > On 09/05/05, James Britt wrote:
    > > Sam Kong wrote:
    > > > Hi, all!
    > > >
    > > > Quite often, when I need to read a list of web pages, I download the
    > > > html sources and save them in a single file like a.html.
    > > > If they are mostly texts, I open the html using web browser, select all
    > > > and copy it to an editor and save it.
    > > > I want to make the process shorter.
    > > > How can I extract the text from html source?
    > > > I'm sure there're many parsers for it.
    > > > What is the most convenient one?

    > >
    > > Take a look at Michael Neumann's WWW::Mechanize
    > >
    > > http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
    > > http://rubyforge.org/frs/?group_id=427&release_id=2014
    > >
    > > Or install the gem
    > >
    > > James
    > >

    >
    > You don't need ruby for this:
    >
    > $ apt-cache show w3m
    > Package: w3m
    > [snip]
    > Description: WWW browsable pager with excellent tables/frames support
    > w3m is a text-based World Wide Web browser with IPv6 support.
    > It features excellent support for tables and frames. It can be used
    > as a standalone file pager, too.
    > .
    > * You can follow links and/or view images in HTML.
    > * Internet message preview mode, you can browse HTML mail.
    > * You can follow links in plain text if it includes URL forms.
    > * With w3m-img, you can view image inline.
    > .
    > For more information,
    > see http://sourceforge.net/projects/w3m
    >
    > $ w3m -dump http://ruby.brian-schroeder.de/quiz/mazes/ | head
    > A ruby a day!


    Oh, thanks.
    I just realized that even lynx can do that.

    Regards,
    Sam

    >
    > Ruby Quiz Solutions (Amazing Mazes)
    >
    > Amazing Mazes
    >
    > For a full description see: (Amazing Mazes on Ruby Quiz Homepage)[http://www.rubyquiz.com/quiz31.html]
    >
    > Another graph algorithm. Create a maze that is fully connected and has only one
    > $
    >
    > regards,
    >
    > Brian
    >
    > --
    > http://ruby.brian-schroeder.de/
    >
    > multilingual _non rails_ ruby based vocabulary trainer:
    > http://www.vocabulaire.org/ | http://www.gloser.org/ | http://www.vokabeln.net/
    Sam Kong, May 9, 2005
    #5
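
To keep the whole workflow inside a Ruby program while still relying on `w3m -dump`, one option is to pipe the HTML to the external command. This is only a minimal sketch: it assumes w3m is installed and on the PATH, and `dump_text` and its `cmd` parameter are hypothetical names chosen for illustration, not part of any library.

```ruby
require 'open3'

# Pipe an HTML string through an external text-mode renderer and
# return the rendered plain text.
# Assumption: the default command, w3m, is installed and on the PATH.
# `-T text/html` tells w3m to treat standard input as HTML.
def dump_text(html, cmd = ['w3m', '-dump', '-T', 'text/html'])
  out, status = Open3.capture2(*cmd, stdin_data: html)
  raise "#{cmd.first} exited with status #{status.exitstatus}" unless status.success?
  out
end
```

Usage would look like `puts dump_text(File.read('a.html'))`. Because the command is a parameter, another text-mode browser could be substituted by passing a different `cmd` array.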
  6. Tom Reilly

    Tom Reilly Guest

    Several years ago, one of the members of the group offered me this
    routine, which does a pretty good job of extracting the text from an
    HTML page.

    #--------------------------------------------------------------------
    # Strip HTML Tags from Line
    #--------------------------------------------------------------------

    def striphtml(line)
      # Replace newlines with spaces, then delete anything that looks like a tag.
      line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
    end
    Tom Reilly, May 10, 2005
    #6
  7. James Britt

    James Britt Guest

    Sam Kong wrote:
    > Thanks, James.
    > That looks cool.
    > However, it doesn't seem to have a function to extract texts from html.
    > (Or did I miss it?)


    No, it is a library for the (fairly) easy creation of HTML munging code.

    Some coding is required, but it allows complete control (so you get just
    the text of interest).


    James
    James Britt, May 10, 2005
    #7
  8. daz

    daz Guest

    Sam Kong wrote:
    >
    > [...] If they are mostly texts, I open the html using
    > web browser, select all and copy it to an editor and save it.
    >


    Save As ... [text file].txt

    - Removes all tags.
    (Verified with Opera, Firefox & IE6, so I guess most browsers do this)
    ( e.g. test page: http://www.qurl.net/ )


    daz
    daz, May 10, 2005
    #8
  9. Sam Kong

    Sam Kong Guest

    Yes, that's right...:)
    I just want to do it all with my ruby program...hehe
    Thanks anyway.

    Sam
    Sam Kong, May 10, 2005
    #9
  10. Sam Kong

    Sam Kong Guest

    Tom Reilly wrote:
    > Several years ago, one of the members of the group offered me this
    > routine which does a pretty good job of
    > extracting the text from a html page.
    >
    > #--------------------------------------------------------------------
    > # Strip HTML Tags from Line
    > #--------------------------------------------------------------------
    >
    > def striphtml(line)
    > line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
    > end


    Thank you for sharing the code.
    However, this code works only for a simple line, right?
    When I tested it with a page of html by looping line by line, the
    result was not what I expected.
    Probably, I need to get a DOM parser...:-(

    Sam
    Sam Kong, May 10, 2005
    #10
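
One likely reason the line-by-line loop gives odd results is that a tag can start on one line and end on the next, and the contents of `<script>` and `<style>` elements are left in the output. A rough sketch that applies the same regex idea to the whole document at once is shown here; `strip_html` is a hypothetical name, and regex-based stripping remains fragile (for example, a `>` inside an attribute value will confuse it), so this is not a substitute for a real parser.

```ruby
require 'cgi'

# Strip tags from a whole HTML document held in one string.
# Assumption: the input is reasonably ordinary HTML; this is a heuristic.
def strip_html(html)
  # Drop script and style elements together with their contents.
  text = html.gsub(%r{<(script|style)\b.*?</\1\s*>}mi, ' ')
  # Drop comments, which may span several lines.
  text = text.gsub(/<!--.*?-->/m, ' ')
  # Drop remaining tags; [^>] also matches newlines, so tags split
  # across lines are handled.
  text = text.gsub(/<[^>]*>/, ' ')
  # Decode basic entities such as &amp; and collapse runs of whitespace.
  CGI.unescapeHTML(text).split.join(' ')
end

puts strip_html("<p>Hello\n<b\nclass='x'>world</b> &amp; more</p>")
# prints "Hello world & more"
```

Reading the file in one go, for example `strip_html(File.read('a.html'))`, avoids the per-line problem entirely.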
  11. Ben Giddings

    Ben Giddings Guest

    On Monday 09 May 2005 15:04, Sam Kong wrote:
    > Hi, all!
    >
    > Quite often, when I need to read a list of web pages, I download the
    > html sources and save them in a single file like a.html.
    > If they are mostly texts, I open the html using web browser, select all
    > and copy it to an editor and save it.
    > I want to make the process shorter.
    > How can I extract the text from html source?
    > I'm sure there're many parsers for it.
    > What is the most convenient one?


    You may find my HTMLTokenizer library convenient for this. To do what you
    need, all you'd do is keep calling "tokenizer.getText()"

    http://rubyforge.org/projects/htmltokenizer/

    Ben
    Ben Giddings, May 10, 2005
    #11
  12. James Britt

    James Britt Guest

    Ben Giddings wrote:
    > On Monday 09 May 2005 15:04, Sam Kong wrote:
    >
    >>Hi, all!
    >>
    >>Quite often, when I need to read a list of web pages, I download the
    >>html sources and save them in a single file like a.html.
    >>If they are mostly texts, I open the html using web browser, select all
    >>and copy it to an editor and save it.
    >>I want to make the process shorter.
    >>How can I extract the text from html source?
    >>I'm sure there're many parsers for it.
    >>What is the most convenient one?

    >
    >
    > You may find my HTMLTokenizer library convenient for this. To do what you
    > need, all you'd do is keep calling "tokenizer.getText()"
    >
    > http://rubyforge.org/projects/htmltokenizer/



    WWW::Mechanize sits atop such a process, but makes it easier to define
    what to do for selected elements and such.

    Just sayin' ...


    James
    James Britt, May 10, 2005
    #12
  13. William Park

    William Park Guest

    Sam Kong wrote:
    > What I want is...
    >
    > <table><tr><td>TEST</td></tr></table> => TEST
    >
    > Is there a module that does this?


    I guess you run it through an XML parser, like Expat, which is everywhere
    these days. Even Bash and Gawk have interfaces to it.

    --
    William Park, Toronto, Canada
    ThinFlash: Linux thin-client on USB key (flash) drive
    http://home.eol.ca/~parkw/thinflash.html
    William Park, Jun 3, 2005
    #13
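
The XML-parser route can be sketched in Ruby with REXML, the XML library that ships with Ruby, used here in place of Expat. The sketch assumes the markup is well-formed XML, which much real-world HTML is not; REXML raises `REXML::ParseException` otherwise. `xml_text` is a hypothetical helper name.

```ruby
require 'rexml/document'

# Collect every text node of a well-formed document and join them.
# Assumption: the markup parses as XML (all tags closed and nested properly).
def xml_text(markup)
  doc = REXML::Document.new(markup)
  # '//text()' selects all text nodes; Text#value returns the text with
  # entities such as &amp; decoded.
  REXML::XPath.match(doc, '//text()').map(&:value).join(' ').split.join(' ')
end

puts xml_text('<table><tr><td>TEST</td></tr></table>')
# prints "TEST"
```

Joining with a space keeps adjacent table cells apart, at the cost of also inserting a space where an inline tag splits a word.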