How do I convert escaped HTML into a string?

Just Another Victim of the Ambient Morality · Nov 24, 2007

I've done a Google search on this but, amazingly, I'm the first guy to
ever need this! Everyone else seems to need the reverse of this. Actually,
I did find some people who complained about this and rolled their own
solution but I refuse to believe that Python doesn't have a built-in
solution to what must be a very common problem.
So, how do I convert HTML to plaintext? Something like this:

<div>This is a string.</div>

...into:

This is a string.

Actually, the ideal would be a function that takes an HTML string and
converts it into the string that the HTML would correspond to. For instance,
converting:

<div>This & that
or the other thing.</div>

...into:

This & that or the other thing.

...since HTML seems to convert any amount and type of whitespace into a
single space (a bizarre design choice if I've ever seen one).
Surely, Python can already do this, right?
Thank you...

Stefan Behnel · Nov 24, 2007

Just Another Victim of the Ambient Morality said:
I've done a Google search on this but, amazingly, I'm the first guy to
ever need this!

You cannot infer that from a Google search.

So, how do I convert HTML to plaintext? Something like this:

<div>This is a string.</div>

...into:

This is a string.

Actually, the ideal would be a function that takes an HTML string and
converts it into the string that the HTML would correspond to. For instance,
converting:

<div>This & that
or the other thing.</div>

...into:

This & that or the other thing.

...since HTML seems to convert any amount and type of whitespace into a
single space (a bizarre design choice if I've ever seen one).

So what you want to do is parse HTML and extract the text content. There are
quite a few ways to do that, including lxml.html:

http://codespeak.net/lxml/dev/lxmlhtml.html
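
For comparison, here is a minimal sketch of the same parse-and-extract idea using only Python 3's standard library (html.parser) instead of lxml.html, so it is a stand-in and not lxml's API. The names TextExtractor and html_to_text are made up for illustration, and the input is assumed to be a small, reasonably well-formed fragment.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def html_to_text(html):
    """Return the text content of `html` with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # str.split() with no arguments splits on any run of whitespace,
    # which mimics how a browser folds whitespace into single spaces.
    return " ".join("".join(parser.chunks).split())


print(html_to_text("<div>This &amp; that\nor the other thing.</div>"))
```

This sketch also keeps the contents of script and style elements, so real pages would probably need extra filtering.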

Stefan

Sergio Correia · Nov 24, 2007

This may help:

http://effbot.org/zone/re-sub.htm#strip-html

Be aware that there are several issues in going from HTML to text:

1) <p> What should <b>we</b>do about<br />this?</p>
You need to strip all tags..

2) ", &, <, and &gt... and I could keep going.. we need to
convert all those

3) We need to normalize all whitespace: tabs, newlines, etc. (Maybe
breaks should be considered as new lines in the new text?)

The link above solves several of these issues, so it can serve as a good
starting point.
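
The three steps above can be sketched roughly as follows in Python 3, using only the standard library. This is not the code from the linked page; strip_html is a hypothetical name, and turning <br> into a newline is an assumption based on the question raised in point 3.

```python
import html
import re


def strip_html(markup):
    """Rough HTML-to-text conversion following the three steps above."""
    # 3a) Fold every run of source whitespace into a single space.
    text = re.sub(r"\s+", " ", markup)
    # Assumption: a <br> tag becomes a newline in the output.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    # 1) Drop all remaining tags. A regex is not a real HTML parser, so
    #    comments, scripts, or ">" inside attribute values will confuse it.
    text = re.sub(r"<[^>]*>", "", text)
    # 2) Decode entities such as &quot;, &amp;, &lt; and &gt;. This runs
    #    after tag removal so that a decoded "<" is not mistaken for a tag.
    text = html.unescape(text)
    # 3b) Removing tags can leave doubled spaces; tidy those up, along
    #     with stray spaces around the newlines produced by <br>.
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    return text.strip()


print(strip_html("<p> What should <b>we</b>do about<br />this?</p>"))
```

Note that "we" and "do" run together in the output because the example markup has no space between them.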

Best,
Sergio

Marc 'BlackJack' Rintsch · Nov 24, 2007

...since HTML seems to convert any amount and type of whitespace into a
single space (a bizarre design choice if I've ever seen one).

Not really. Just imagine what web pages would look like if whitespace were
preserved. What matters is the actual text in the source, not its
formatting. That's left to the browser.

Ciao,
Marc 'BlackJack' Rintsch

leej · Nov 24, 2007

I did find some people who complained about this and rolled their own
solution but I refuse to believe that Python doesn't have a built-in
solution to what must be a very common problem.

<snip>

Replace "python" with "c++" and would that seem a reasonable belief?
(That said I'm a PyN00b)

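Anyway, for all my HTML processing needs my first port of call has
been BeautifulSoup, e.g.

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x on Python 2
soup = BeautifulSoup(html, convertEntities="html")  # decode HTML entities while parsing
print soup.findAll(text=True)  # list of every text node in the document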

Should be in the ballpark of what you want.

http://www.crummy.com/software/BeautifulSoup/documentation.html for
docs.

Stefan Behnel · Nov 24, 2007

Replace "python" with "c++" and would that seem a reasonable belief?

That's different, as Python comes with batteries included.

Stefan

Bruno Desthuilliers · Nov 24, 2007

Stefan Behnel wrote:

That's different, as Python comes with batteries included.

Unfortunately, you still have to write a couple of lines of code every once
in a while !-)
