Excess whitespace in my soup

John Machin

I'm trying to recover the original data from some HTML written by a
well-known application.

Here are three original data items, in Python repr() format, with
spaces changed to tildes for clarity:

u'Saturday,~19~January~2008'
u'Line1\nLine2\nLine3'
u'foonly~frabjous\xa0farnarklingliness'

Here is the HTML, with spaces changed to tildes, angle brackets changed
to square brackets, \r\n omitted from the end of each line, and a large
number of attributes stripped from the [td] tags.

~~[td]Saturday,~19
~~January~2008[/td]
~~[td]Line1[br]
~~~~Line2[br]
~~~~Line3[/td]
~~[td]foonly
~~frabjous&nbsp;farnarklingliness[/td]

Here are the results of feeding it to ElementSoup:
import ElementSoup as ES
elem = ES.parse('ws_soup1.htm')
from pprint import pprint as pp
pp([(e.tag, e.text, e.tail) for e in elem.getiterator()])
[snip]
(u'td', u'Saturday, 19\n January 2008', u'\n'),
(u'td', u'Line1', u'\n'),
(u'br', None, u'\n Line2'),
(u'br', None, u'\n Line3'),
(u'td', u'foonly\n frabjous\xa0farnarklingliness', u'\n')]
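
For readers who want to reproduce the shape of that output without ElementSoup, here is a minimal sketch that assumes Python 3 and the standard-library ElementTree, fed a well-formed stand-in for the second cell. The helper name raw_lines is hypothetical. It splits a cell at each br using the text/tail layout shown above and deliberately leaves the whitespace untouched.

```python
import xml.etree.ElementTree as ET

def raw_lines(td):
    """Split a cell's text at each <br>, leaving whitespace untouched."""
    lines = [td.text or '']
    for child in td:
        if child.tag == 'br':
            lines.append('')          # a <br> starts a new line
        else:
            # Any other child contributes its own text in place.
            lines[-1] += ''.join(child.itertext())
        # Text following the child lives in its .tail, not in the parent.
        lines[-1] += child.tail or ''
    return lines

# Stand-in for the second cell; the real tree came from ES.parse().
td = ET.fromstring('<td>Line1<br/>\n    Line2<br/>\n    Line3</td>')
print(raw_lines(td))   # -> ['Line1', '\n    Line2', '\n    Line3']
```

Each fragment still carries the source indentation, which is the whitespace problem discussed in the rest of the thread.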

I'm happy enough with reassembling the second item. The problem is in
reliably and correctly collapsing the whitespace in each of the above
five elements. The standard Python idiom of u' '.join(text.split())
won't work because the text is Unicode and u'\xa0' is whitespace
and would be converted to a space.
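
A small illustration of the failure, written as a sketch for Python 3 (where every str is Unicode) rather than the Python 2 used in this thread; the sample string is the third data item as the parser returned it:

```python
# str.split() with no argument splits on any Unicode whitespace,
# and U+00A0 (no-break space) counts as whitespace.
text = 'foonly\n  frabjous\xa0farnarklingliness'

collapsed = ' '.join(text.split())
print(repr(collapsed))   # -> 'foonly frabjous farnarklingliness'

# The no-break space has been lost, so the original item cannot be recovered.
print('\xa0' in collapsed)   # -> False
```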

Should whitespace collapsing be done earlier? Note that BeautifulSoup
leaves it as &nbsp; -- ES does the conversion to \xa0 ...

Does anyone know of an html_collapse_whitespace() for Python? Am I
missing something obvious?

Thanks in advance,
John
 
Fredrik Lundh

John said:
> I'm happy enough with reassembling the second item. The problem is in
> reliably and correctly collapsing the whitespace in each of the above
> five elements. The standard Python idiom of u' '.join(text.split())
> won't work because the text is Unicode and u'\xa0' is whitespace
> and would be converted to a space.

would this (or some variation of it) work?
>>> re.sub("[ \n\r\t]+", " ", u"foo\n frab\xa0farn")
u'foo frab\xa0farn'

</F>
 
John Machin

Fredrik said:
> would this (or some variation of it) work?
> >>> re.sub("[ \n\r\t]+", " ", u"foo\n frab\xa0farn")
> u'foo frab\xa0farn'

Yes, partially. Leading and trailing whitespace has to be removed
entirely, not replaced by one space.

Cheers,
John
 
Stefan Behnel

John said:
> Yes, partially. Leading and trailing whitespace has to be removed
> entirely, not replaced by one space.

Sounds like adding a .strip() to me ...

Stefan
 
John Machin

Stefan said:
> Sounds like adding a .strip() to me ...

Sounds like adding a .strip(u' ') to me, otherwise any leading/trailing
u'\xa0' gets blown away and this must not happen.
 
John Machin

Remco said:
> Not sure if this is sufficient for what you need, but how about
>
> import re
> re.sub(u'[\s\xa0]+', ' ', s)
>
> That should replace all occurrences of 1 or more whitespace or \xa0
> characters, by a single space.

It does indeed, and so does
re.sub(u'\s+', ' ', s)
because u'\xa0' *IS* whitespace in the Python unicode world, but it's
not whitespace in the HTML sense and it must be preserved.
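
To make the distinction concrete, here is a short sketch assuming Python 3, where \s in a str pattern is Unicode-aware by default. Behaviour under the Python 2 of this thread may differ depending on whether re.UNICODE is passed.

```python
import re

s = 'foo\n  frab\xa0farn'

# \s follows the Unicode definition of whitespace, so U+00A0 is replaced.
unicode_ws = re.sub(r'\s+', ' ', s)
print(repr(unicode_ws))   # -> 'foo frab farn'

# An explicit class of ASCII whitespace leaves U+00A0 untouched.
html_ws = re.sub('[ \t\n\r]+', ' ', s)
print(repr(html_ws))      # -> 'foo frab\xa0farn'

# re.ASCII restricts \s to [ \t\n\r\f\v], which also preserves U+00A0.
ascii_ws = re.sub(r'\s+', ' ', s, flags=re.ASCII)
print(repr(ascii_ws))     # -> 'foo frab\xa0farn'
```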

Cheers,
John
 
