html 2 plain text

robin

hi,
i remember seeing a simple python function that takes raw html
and outputs the content (body?) of the page as plain text (no <..> tags
etc).
i have been looking at htmllib and htmlparser, but they both seem too
complicated for what i'm looking for. i just need the main text in the
body of some arbitrary webpage so that i can do some natural-language
processing with it...
thanks for pointing me to some helpful resources!

robin
 
Faber

Have a look at the Beautiful Soup library:
http://www.crummy.com/software/BeautifulSoup/

Regards

--
Faber
http://faberbox.com/
http://smarking.com/

A teacher must always teach to doubt his teaching. -- José Ortega y Gasset
 
garabik-news-2005-05

import re
text = re.sub(r'(?s)<.+?>', '', html_text)
(this will keep html entities such as &amp;, though, and it also keeps the contents of <script> and <style> elements)
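
If the leftover entities and script/style contents are a problem, the standard library's html.parser can handle both without much code. What follows is only a minimal sketch, assuming Python 3; the names TextExtractor and html_to_text are made up for illustration and are not the function robin remembers. It joins text fragments with a space, so a word split by an inline tag (e.g. <b>H</b>ello) will come out as two tokens.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html_text):
    """Return the visible text of html_text as one whitespace-normalized string."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Join fragments with a space, then collapse all runs of whitespace
    return " ".join(" ".join(parser.parts).split())
```

Usage would be something like text = html_to_text(html_text), with html_text being the raw page as a str. Real-world pages with badly broken markup may still need something more forgiving, such as Beautiful Soup.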

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 
