How to extract texts from html source?

Sam Kong · May 9, 2005

Hi, all!

Quite often, when I need to read a list of web pages, I download the
html sources and save them in a single file like a.html.
If they are mostly texts, I open the html using web browser, select all
and copy it to an editor and save it.
I want to make the process shorter.
How can I extract the text from html source?
I'm sure there're many parsers for it.
What is the most convenient one?

Thanks.
Sam

James Britt · May 9, 2005

Sam said:
Hi, all!

Quite often, when I need to read a list of web pages, I download the
html sources and save them in a single file like a.html.
If they are mostly texts, I open the html using web browser, select all
and copy it to an editor and save it.
I want to make the process shorter.
How can I extract the text from html source?
I'm sure there're many parsers for it.
What is the most convenient one?

Take a a look at Michael Neumann's WWW::Mechanize

http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
http://rubyforge.org/frs/?group_id=427&release_id=2014

Or install the gem

James

Thanks.
Sam

.

--

http://www.ruby-doc.org
http://www.rubyxml.com
http://catapult.rubyforge.com
http://orbjson.rubyforge.com
http://ooo4r.rubyforge.com
http://www.jamesbritt.com

Brian Schröder · May 9, 2005

Take a a look at Michael Neumann's WWW::Mechanize

http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
http://rubyforge.org/frs/?group_id=427&release_id=2014

Or install the gem

James

--

http://www.ruby-doc.org
http://www.rubyxml.com
http://catapult.rubyforge.com
http://orbjson.rubyforge.com
http://ooo4r.rubyforge.com
http://www.jamesbritt.com

You don't need ruby for this:

$ apt-cache show w3m
Package: w3m
[snip]
Description: WWW browsable pager with excellent tables/frames support
w3m is a text-based World Wide Web browser with IPv6 support.
It features excellent support for tables and frames. It can be used
as a standalone file pager, too.

Sam Kong · May 9, 2005

James said:
Take a a look at Michael Neumann's WWW::Mechanize

http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
http://rubyforge.org/frs/?group_id=427&release_id=2014

Or install the gem

Thank James.
That looks cool.
However, it doesn't seem to have a function to extract texts from html.
(Or did I miss it?)
What I want is...

<table><tr><td>TEST</td></tr></table> => TEST

Is there a module that does this?

Regards,
Sam

Sam Kong · May 9, 2005

Brian said:
Take a a look at Michael Neumann's WWW::Mechanize

http://www.ntecs.de/blog/Blog/WWW-Mechanize.rdoc
http://rubyforge.org/frs/?group_id=427&release_id=2014

Or install the gem

James

--

http://www.ruby-doc.org
http://www.rubyxml.com
http://catapult.rubyforge.com
http://orbjson.rubyforge.com
http://ooo4r.rubyforge.com
http://www.jamesbritt.com

Click to expand...

You don't need ruby for this:

$ apt-cache show w3m
Package: w3m
[snip]
Description: WWW browsable pager with excellent tables/frames support
w3m is a text-based World Wide Web browser with IPv6 support.
It features excellent support for tables and frames. It can be used
as a standalone file pager, too.
.
* You can follow links and/or view images in HTML.
* Internet message preview mode, you can browse HTML mail.
* You can follow links in plain text if it includes URL forms.
* With w3m-img, you can view image inline.
.
For more information,
see http://sourceforge.net/projects/w3m

$ w3m -dump http://ruby.brian-schroeder.de/quiz/mazes/ | head
A ruby a day!

Oh, thanks.
I just realized that even lynx can do that.

Regards,
Sam

Ruby Quiz Solutions (Amazing Mazes)

Amazing Mazes

For a full description see: (Amazing Mazes on Ruby Quiz Homepage)[http://
www.rubyquiz.com/quiz31.html]

Another graph algorithm. Create a maze that is fully connected and has only one
$

regards,

Brian

--
http://ruby.brian-schroeder.de/

multilingual _non rails_ ruby based vocabulary trainer:
http://www.vocabulaire.org/ | http://www.gloser.org/ |

http://www.vokabeln.net/

Tom Reilly · May 10, 2005

Several years ago, one of the members of the group offered me this
routine which does a pretty good job of
extracting the text from a html page.

#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------

def striphtml(line)
line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end

James Britt · May 10, 2005

Sam said:
Thank James.
That looks cool.
However, it doesn't seem to have a function to extract texts from html.
(Or did I miss it?)

No, it is a library for the (fairly) easy creation of HTML munging code.

Some coding is required, but it allows complete control (so you get just
the text of interest).

James

daz · May 10, 2005

Sam said:
[...] If they are mostly texts, I open the html using
web browser, select all and copy it to an editor and save it.

Save As ... [text file].txt

- Removes all tags.
(Verified with Opera, Firefox & IE6, so I guess most browsers do this)
( e.g. test page: http://www.qurl.net/ )

daz

Sam Kong · May 10, 2005

Yes, that's right...

I just want to do it all with my ruby program...hehe
Thanks anyway.

Sam

Sam Kong · May 10, 2005

Tom said:
Several years ago, one of the members of the group offered me this
routine which does a pretty good job of
extracting the text from a html page.

#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------

def striphtml(line)
line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end

Thank you for sharing the code.
However, this code works only for a simple line, right?
When I tested it with a page of html by looping line by line, the
result was not what I expected.
Probably, I need to get a DOM parser...:-(

Sam

Ben Giddings · May 10, 2005

Hi, all!

Quite often, when I need to read a list of web pages, I download the
html sources and save them in a single file like a.html.
If they are mostly texts, I open the html using web browser, select all
and copy it to an editor and save it.
I want to make the process shorter.
How can I extract the text from html source?
I'm sure there're many parsers for it.
What is the most convenient one?

You may find my HTMLTokenizer library convenient for this. To do what you
need, all you'd do is keep calling "tokenizer.getText()"

http://rubyforge.org/projects/htmltokenizer/

Ben

James Britt · May 10, 2005

Ben said:
You may find my HTMLTokenizer library convenient for this. To do what you
need, all you'd do is keep calling "tokenizer.getText()"

http://rubyforge.org/projects/htmltokenizer/

WWW::Mechanize sits atop such a process, but makes it easier to define
what to do for elected elements and such.

Just sayin' ...

James

William Park · Jun 3, 2005

Sam Kong said:
What I want is...

<table><tr><td>TEST</td></tr></table> => TEST

Is there a module that does this?

I guess you run it through XML parser, like Expat which is everywhere
these days. Even Bash and Gawk have interface to it.

How to use PDF-lib and how to center each line of texts on the page?	1	Aug 16, 2023
How to scan Java source texts?	14	Jun 11, 2013
How to extract all values except the last value in a string separated by comma in sql	2	Jun 15, 2023
How to create a JSON array with values from DOM(HTML TABLE) when I click a button using JQuery/Javascript?	0	May 1, 2023
How to effectively develop a web application from scratch?	0	Jul 2, 2023
How to create language translation program from dsl or bgl (by Open Source)	0	Nov 12, 2022
How to create a JSON array with values from DOM(HTML TABLE) when I click a button using JQuery/Javascript?	0	May 1, 2023
Idk need help in editing this source code	0	Nov 5, 2022

How to extract texts from html source?

Sam Kong

James Britt

Brian Schröder

Sam Kong

Sam Kong

Tom Reilly

James Britt

daz

Sam Kong

Sam Kong

Ben Giddings

James Britt

William Park

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads