Excess whitespace in my soup

John Machin · Jan 19, 2008

I'm trying to recover the original data from some HTML written by a
well-known application.

Here are three original data items, in Python repr() format, with
spaces changed to tildes for clarity:

u'Saturday,~19~January~2008'
u'Line1\nLine2\nLine3'
u'foonly~frabjous\xa0farnarklingliness'

Here is the HTML, with spaces changed to tildes, angle brackets
changed to square brackets, \r\n omitted from the end of each line,
and a large number of attributes stripped from the [td] tags.

~~[td]Saturday,~19
~~January~2008[/td]
~~[td]Line1[br]
~~~~Line2[br]
~~~~Line3[/td]
~~[td]foonly
~~frabjous&nbsp;farnarklingliness[/td]

Here are the results of feeding it to ElementSoup:

import ElementSoup as ES
elem = ES.parse('ws_soup1.htm')
from pprint import pprint as pp
pp([(e.tag, e.text, e.tail) for e in elem.getiterator()])

[snip]
(u'td', u'Saturday, 19\n January 2008', u'\n'),
(u'td', u'Line1', u'\n'),
(u'br', None, u'\n Line2'),
(u'br', None, u'\n Line3'),
(u'td', u'foonly\n frabjous\xa0farnarklingliness', u'\n')]

I'm happy enough with reassembling the second item. The problem is in
reliably and correctly collapsing the whitespace in each of the above
five elements. The standard Python idiom of u' '.join(text.split())
won't work because the text is Unicode and u'\xa0' is whitespace
and would be converted to a space.
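
To make the failure concrete, here is a minimal sketch, assuming Python 3 (where every str is Unicode, like the u'' literals above). The name naive_collapse is hypothetical and only labels the idiom.

```python
def naive_collapse(text):
    # str.split() with no argument splits on every Unicode whitespace
    # character, and U+00A0 (no-break space) counts as one of them.
    return ' '.join(text.split())

item = 'foonly\n  frabjous\xa0farnarklingliness'
result = naive_collapse(item)

print('\xa0'.isspace())   # is the no-break space whitespace to Python?
print('\xa0' in result)   # did it survive the split/join round trip?
```

The no-break space is swallowed by split() and comes back as an ordinary space, which is exactly the loss described above.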

Should whitespace collapsing be done earlier? Note that BeautifulSoup
leaves it as &nbsp; -- ES does the conversion to \xa0 ...

Does anyone know of an html_collapse_whitespace() for Python? Am I
missing something obvious?

Thanks in advance,
John

Fredrik Lundh · Jan 19, 2008

John said:
> I'm happy enough with reassembling the second item. The problem is in
> reliably and correctly collapsing the whitespace in each of the above
> five elements. The standard Python idiom of u' '.join(text.split())
> won't work because the text is Unicode and u'\xa0' is whitespace
> and would be converted to a space.

would this (or some variation of it) work?

>>> re.sub("[ \n\r\t]+", " ", u"foo\n frab\xa0farn")

u'foo frab\xa0farn'

</F>

John Machin · Jan 19, 2008

Fredrik said:
> John said:
> > I'm happy enough with reassembling the second item. The problem is in
> > reliably and correctly collapsing the whitespace in each of the above
> > five elements. The standard Python idiom of u' '.join(text.split())
> > won't work because the text is Unicode and u'\xa0' is whitespace
> > and would be converted to a space.
>
> would this (or some variation of it) work?
>
> re.sub("[ \n\r\t]+", " ", u"foo\n frab\xa0farn")
> u'foo frab\xa0farn'
>
> </F>

Yes, partially. Leading and trailing whitespace has to be removed
entirely, not replaced by one space.
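
To see what "partially" means, a minimal sketch assuming Python 3, applied to one of the tail values from the ElementSoup output earlier in the thread:

```python
import re

tail = '\n Line2'   # tail of the first [br] element in the output above
collapsed = re.sub('[ \n\r\t]+', ' ', tail)

# The leading run of whitespace is reduced to one space, not removed.
print(repr(collapsed))
```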

Cheers,
John

Stefan Behnel · Jan 19, 2008

John said:
> Fredrik said:
> > John said:
> > > I'm happy enough with reassembling the second item. The problem is in
> > > reliably and correctly collapsing the whitespace in each of the above
> > > five elements. The standard Python idiom of u' '.join(text.split())
> > > won't work because the text is Unicode and u'\xa0' is whitespace
> > > and would be converted to a space.
> >
> > would this (or some variation of it) work?
> >
> > re.sub("[ \n\r\t]+", " ", u"foo\n frab\xa0farn")
> > u'foo frab\xa0farn'
> >
> > </F>
>
> Yes, partially. Leading and trailing whitespace has to be removed
> entirely, not replaced by one space.

Sounds like adding a .strip() to me ...

Stefan

John Machin · Jan 20, 2008

Stefan said:
> John said:
> > Fredrik said:
> > > John Machin wrote:
> > > > I'm happy enough with reassembling the second item. The problem is in
> > > > reliably and correctly collapsing the whitespace in each of the above
> > > > five elements. The standard Python idiom of u' '.join(text.split())
> > > > won't work because the text is Unicode and u'\xa0' is whitespace
> > > > and would be converted to a space.
> > >
> > > would this (or some variation of it) work?
> > >
> > > re.sub("[ \n\r\t]+", " ", u"foo\n frab\xa0farn")
> > > u'foo frab\xa0farn'
> > >
> > > </F>
> >
> > Yes, partially. Leading and trailing whitespace has to be removed
> > entirely, not replaced by one space.
>
> Sounds like adding a .strip() to me ...

Sounds like adding a .strip(u' ') to me, otherwise any leading/trailing
u'\xa0' gets blown away and this must not happen.
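
Putting the pieces of this thread together, a minimal sketch assuming Python 3. The name html_collapse_whitespace is simply the one asked for in the opening post, not an existing library function, and including form feed (\f) in the character class is an assumption based on HTML's notion of whitespace; the regex posted earlier omitted it.

```python
import re

# Whitespace in the HTML sense: space, tab, LF, FF, CR.
# U+00A0 (no-break space) is deliberately left out so that it survives.
_HTML_WS = re.compile('[ \t\n\f\r]+')

def html_collapse_whitespace(text):
    """Collapse runs of HTML whitespace and trim them from both ends."""
    if text is None:              # e.text and e.tail may be None
        return None
    collapsed = _HTML_WS.sub(' ', text)
    # After the substitution, any leading or trailing HTML whitespace is a
    # single space, so strip(' ') removes it without touching a \xa0.
    return collapsed.strip(' ')

samples = [
    'Saturday, 19\n  January 2008',
    '\n    Line2',
    'foonly\n  frabjous\xa0farnarklingliness',
    '\xa0 padded \xa0',
]
for s in samples:
    print(repr(html_collapse_whitespace(s)))
```

A bare .strip() in place of .strip(' ') would also remove a leading or trailing \xa0, which is the behaviour to avoid here.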

John Machin · Jan 20, 2008

Remco said:
> Not sure if this is sufficient for what you need, but how about
>
> import re
> re.sub(u'[\s\xa0]+', ' ', s)
>
> That should replace all occurrences of 1 or more whitespace or \xa0
> characters by a single space.

It does indeed, and so does
re.sub(u'\s+', ' ', s)
because u'\xa0' *IS* whitespace in the Python unicode world, but it's
not whitespace in the HTML sense and it must be preserved.
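
A small check of that distinction, as a sketch assuming Python 3, where str patterns are matched with Unicode semantics by default. Under Python 2 the re documentation ties the Unicode meaning of \s to the re.UNICODE flag, so results there may differ.

```python
import re

s = 'foonly\n  frabjous\xa0farnarklingliness'

# \s follows Python's Unicode notion of whitespace, which includes U+00A0.
with_backslash_s = re.sub(r'\s+', ' ', s)

# An explicit class of HTML whitespace leaves U+00A0 alone.
with_html_class = re.sub('[ \t\n\f\r]+', ' ', s)

print(repr(with_backslash_s))
print(repr(with_html_class))
```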

Cheers,
John
