html 2 plain text

robin · May 28, 2006

hi,
i remember seeing this simple python function which would take raw html
and output the content (body?) of the page as plain text (no <..> tags
etc)
i have been looking at htmllib and HTMLParser but this all seems too
complicated for what i'm looking for. i just need the main text in the
body of some arbitrary webpage to then do some natural-language
processing with it...
thanks for pointing me to some helpful resources!

robin

Faber · May 28, 2006

robin said:
i remember seeing this simple python function which would take raw html
and output the content (body?) of the page as plain text (no <..> tags
etc)
i have been looking at htmllib and HTMLParser but this all seems too
complicated for what i'm looking for. i just need the main text in the
body of some arbitrary webpage to then do some natural-language
processing with it...
thanks for pointing me to some helpful resources!

Have a look at the Beautiful Soup library:
http://www.crummy.com/software/BeautifulSoup/

Regards

--
Faber
http://faberbox.com/
http://smarking.com/

A teacher must always teach to doubt his teaching. -- José Ortega y Gasset

robin · May 28, 2006

Looks yummy. Thanks a lot.

robin

Ravi Teja · May 28, 2006

i remember seeing this simple python function which would take raw html
and output the content (body?) of the page as plain text (no <..> tags
etc)

http://www.aaronsw.com/2002/html2text/

garabik-news-2005-05 · May 29, 2006

robin said:
hi,
i remember seeing this simple python function which would take raw html
and output the content (body?) of the page as plain text (no <..> tags
etc)
i have been looking at htmllib and HTMLParser but this all seems too
complicated for what i'm looking for. i just need the main text in the
body of some arbitrary webpage to then do some natural-language
processing with it...
thanks for pointing me to some helpful resources!

import re
text = re.sub(r'(?s)<.+?>', '', html_text)
(this will keep HTML entities such as &amp; in the output, though)
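
To make the behaviour of that one-liner concrete, here is a minimal sketch that wraps it in a function and runs it on a small sample. The names strip_tags, TAG_RE and sample are made up for illustration; only re.sub and the pattern come from the post.

```python
import re

# Non-greedy match of anything between '<' and '>'.
# The (?s) flag lets '.' match newlines, so tags split across lines are caught.
TAG_RE = re.compile(r'(?s)<.+?>')

def strip_tags(html_text):
    """Remove everything that looks like a tag; entities are left untouched."""
    return TAG_RE.sub('', html_text)

# Hypothetical sample input, not taken from any real page.
sample = '<p>Fish &amp; chips,\n<b>fresh</b> daily</p>'
print(strip_tags(sample))
```

The tags disappear but &amp; stays as it is. Note that this kind of pattern also keeps the contents of script and style elements, and it can be confused by a '>' inside an attribute value, so it is probably best suited to quick jobs on reasonably clean markup.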

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!

Fredrik Lundh · May 29, 2006

text=re.sub(r'(?s)\<.+?\>', '', html_text)
(this will keep html entities, though)

here's a variation that handles that too:

http://effbot.org/zone/re-sub.htm#strip-html
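
For readers who cannot follow the link, one way to get a similar effect (strip the tags, then decode the entities) is sketched below. This is not the code from the linked page; it assumes Python 3.4 or later, where the standard library provides html.unescape, and the function name html_to_text is made up for illustration.

```python
import re
from html import unescape  # standard library, Python 3.4+

def html_to_text(html_text):
    """Strip tags with a regex, then decode entities such as &amp; or &#169;."""
    # Strip tags first: decoding first would turn &lt;b&gt; into <b>,
    # which the tag pattern would then remove by mistake.
    no_tags = re.sub(r'(?s)<.+?>', '', html_text)
    return unescape(no_tags)

print(html_to_text('<p>Fish &amp; chips &#169; 2006</p>'))
```

The same caveats as for the plain regex apply: script and style contents are kept, and badly broken markup may need a real parser.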

</F>
