web page text extractor

kublai · Jul 12, 2007

Hello,

For a project, I need to develop a corpus of online news stories. I'm
looking for an application that, given the URL of a web page, "copies"
the rendered text of the web page (not the source HTML text), opens a
text editor (Notepad), and displays the copied text for the user to
examine and save into a text file. Graphics and sidebars are to be
ignored. The examples I have come across are much too complex for me
to customize for this simple job. Can anyone point me in the right
direction?

Thanks,
gk

Miki · Jul 12, 2007

Hello gk,

Going simple:

    from os import system
    from sys import argv

    OUTFILE = "geturl.txt"
    system("lynx -dump %s > %s" % (argv[1], OUTFILE))
    system("start notepad %s" % OUTFILE)

(You can find lynx at http://lynx.browser.org/)

Note that removing sidebars is a very difficult problem.
Search for "wrapper induction" to see some work on the subject.
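
A possible variant of the script above for Python 3, offered only as a sketch. os.system passes the URL through the shell, so a URL containing characters such as & may be misinterpreted; handing subprocess.run an argument list avoids that. The helper names lynx_command and dump_page are made up for illustration, -dump and -nolist are standard lynx options, and lynx is assumed to be installed and on the PATH.

```python
import subprocess

OUTFILE = "geturl.txt"  # same output file name as the script above


def lynx_command(url):
    """Build the lynx argument list (hypothetical helper).

    -dump prints the rendered page as text; -nolist drops the
    numbered list of links that lynx otherwise appends.
    """
    return ["lynx", "-dump", "-nolist", url]


def dump_page(url, outfile=OUTFILE):
    """Write lynx's rendered text for url to outfile and return the path."""
    # No shell is involved, so '&' and similar characters in the URL are safe.
    result = subprocess.run(lynx_command(url), capture_output=True, check=True)
    with open(outfile, "wb") as f:
        f.write(result.stdout)
    return outfile


def open_in_notepad(path):
    """Open path in Notepad (Windows only, as in the original script)."""
    subprocess.Popen(["notepad", path])
```

To reproduce the original behaviour, call open_in_notepad(dump_page(url)) with the URL taken from the command line.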

HTH,

Jon Rosebaugh · Jul 12, 2007

You may find BeautifulSoup or templatemaker to be of assistance:

http://www.crummy.com/software/BeautifulSoup/
http://www.holovaty.com/blog/archive/2007/07/06/0128

Andre Engels · Jul 12, 2007

    def textonly(url):
        # Get the HTML source on url and give only the main text
        f = urllib2.urlopen(url)
        text = f.read()
        r = re.compile('\<[^\<\>]*\>')
        newtext = r.sub('',text)
        while newtext != text:
            text = newtext
            newtext = r.sub('',text)
        return text

Andre Engels · Jul 12, 2007

I forgot to include

    import urllib2, re

here:

    def textonly(url):
        # Get the HTML source on url and give only the main text
        f = urllib2.urlopen(url)
        text = f.read()
        r = re.compile('\<[^\<\>]*\>')
        newtext = r.sub('',text)
        while newtext != text:
            text = newtext
            newtext = r.sub('',text)
        return text

Alex Popescu · Jul 12, 2007

Andre, I think that unfortunately your solution will not ignore inline
scripts, inline styles, etc.
On the other hand, I don't think there are many solutions available
other than the Lynx approach somebody
has already suggested.
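
One way the script and style problem could be handled without Lynx, as a rough sketch rather than a complete solution: Python 3's standard html.parser can collect text while skipping everything inside script and style elements. The class name TextOnly and the function text_only are made up for illustration, and nothing here addresses sidebars.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments kept so far
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def text_only(html):
    """Return the text of an HTML string without tags, scripts, or styles."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Unlike the regular expression approach, this leaves a literal < inside a script untouched, because the parser treats script and style bodies as raw text.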

bests,
../alex

kublai · Jul 12, 2007

Thanks, all, for your suggestions. I will try the Lynx solution first.

Cheers,
gk

Stefan Behnel · Jul 12, 2007

Super-simplistic:

http://codespeak.net/lxml/

You may want to use the incredibly versatile "lxml.html.clean" module first to
remove any annoying content. It's not released yet but available in a branch:

http://codespeak.net/svn/lxml/branch/html/

Stefan

kublai · Jul 12, 2007

Hi, Stefan,
This looks very interesting. I will look into this first thing
tonight. Gotta hit some golf bugs, I mean, balls first. It's a
beautiful afternoon here in Edmonton.
Cheers,
gk

Paul McGuire · Jul 13, 2007

One of the examples provided with pyparsing is an HTML stripper - view
it online at http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.

-- Paul

kublai · Jul 13, 2007

Stripping tags is indeed one strategy that came to mind. I'm wondering
how much information (for example, paragraphing) would be lost, and if
what would be lost would be acceptable (to the project). I looked at
pyparsing and I see that it's got a lot of text processing
capabilities that I can use along the way. I sure will try it. Thanks
for the post.

Best,
gk

rdahlstrom · Jul 13, 2007

To maintain paragraphs, replace any p or br tags with your favorite
operating system's crlf.
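
The suggestion above could be sketched roughly like this in Python 3: turn p and br tags into line breaks first, then strip whatever tags remain. The function name strip_keep_paragraphs and both regular expressions are illustrative assumptions, not a tested HTML grammar. The default "\n" is used because Python translates it to the platform's line ending when writing a file in text mode.

```python
import re

# Opening or closing <p> and <br> tags, with optional attributes or "/".
BREAK_TAG = re.compile(r"<\s*/?\s*(?:p|br)(?:\s[^<>]*)?/?\s*>", re.IGNORECASE)
# Any remaining tag.
ANY_TAG = re.compile(r"<[^<>]*>")


def strip_keep_paragraphs(html, newline="\n"):
    """Replace <p> and <br> tags with newline, then remove all other tags."""
    text = BREAK_TAG.sub(newline, html)
    return ANY_TAG.sub("", text)
```

Tags that merely start with the letter p, such as pre, are not treated as paragraph breaks; they are simply removed along with the other tags.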

Thomas Dickey · Jul 22, 2007

Miki said:
(You can find lynx at http://lynx.browser.org/)

not exactly -

The current version of lynx is 2.8.6

It's available at
http://lynx.isc.org/lynx2.8.6/
2.8.7 Development & patches:
http://lynx.isc.org/current/index.html
