ElementTree.fromstring(unicode_html)

globophobe

This is likely an easy problem; however, I couldn't think of
appropriate keywords for Google.

Basically, I have some raw data that needs to be preprocessed before
it is saved to the database e.g.

In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

I need to turn this into an ElementTree, but some of the data is
Japanese whereas the rest is HTML. This string contains a <br />.

In [2]: e = ET.fromstring('<data>%s</data>' % unicode_html)
In [3]: e.text
Out[3]: u'\u3055\u3080\u3044\uff0f\n\u3064\u3081\u305f\u3044\n'
In [4]: len(e)
Out[4]: 0

How can I decode the Unicode HTML <br /> into a string that
ElementTree can understand?
 
John Machin

globophobe said:
In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

I need to turn this into an ElementTree, but some of the data is
Japanese whereas the rest is HTML. This string contains a <br />.

>>> import unicodedata as ucd
>>> s = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
>>> [ucd.name(c) if ord(c) >= 128 else c for c in s]
['HIRAGANA LETTER SA', 'HIRAGANA LETTER MU', 'HIRAGANA LETTER I',
'FULLWIDTH SOLIDUS', u'\r', u'\n', 'HIRAGANA LETTER TU',
'HIRAGANA LETTER ME', 'HIRAGANA LETTER TA', 'HIRAGANA LETTER I',
u'\r', u'\n']

Where in there is the <br /> ??
 
Fredrik Lundh

globophobe said:
In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

I need to turn this into an ElementTree, but some of the data is
Japanese whereas the rest is HTML. This string contains a <br />.

where? <br /> is an element, not a character. "\r" and "\n" are
characters, not elements.
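
The difference can be checked directly. This is a minimal sketch written for Python 3 (the thread itself uses Python 2 syntax); the names `with_newline` and `with_br` are only illustrative:

```python
import xml.etree.ElementTree as ET

# Line breaks in the source are characters, so they end up in .text
# and no child element is created.
with_newline = ET.fromstring('<data>a\r\nb\r\n</data>')
print(len(with_newline))          # number of child elements
print(repr(with_newline.text))

# A literal <br /> in the markup is parsed as a child element.
with_br = ET.fromstring('<data>a<br />b</data>')
print(len(with_br))
print(with_br[0].tag, repr(with_br[0].tail))
```

Note that the XML parser normalizes "\r\n" to "\n", which is why the original e.text shows only "\n".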

If you want to build a tree where "\r\n" is replaced with a <br />
element, you can encode the string as UTF-8, use the replace method to
insert the element, and then call fromstring.
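
A rough sketch of that encode-and-replace route, written for Python 3 and assuming the text contains no "<" or "&" characters (those would need escaping first, for example with xml.sax.saxutils.escape); the variable names `utf8` and `markup` are illustrative:

```python
import xml.etree.ElementTree as ET

unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

# Encode first, then swap each line break for a <br /> element in the markup.
# Assumption: the text has no '<' or '&' that would need escaping.
utf8 = unicode_html.encode('utf-8')
markup = b'<data>' + utf8.replace(b'\r\n', b'<br />') + b'</data>'

# Bytes without an XML declaration are parsed as UTF-8.
elem = ET.fromstring(markup)
print(len(elem))                              # number of <br /> children
print(ET.tostring(elem, encoding='unicode'))  # Python 3 only
```

Because the trailing "\r\n" is replaced as well, this yields two <br /> children, the second with no tail. Strip the string first if that final element is unwanted.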

Alternatively, you can build the tree yourself:

import xml.etree.ElementTree as ET

unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

# splitlines() drops the line terminators, including the trailing one
parts = unicode_html.splitlines()

elem = ET.Element("data")
elem.text = parts[0]
# each remaining line becomes the tail of a new <br /> element
for part in parts[1:]:
    ET.SubElement(elem, "br").tail = part

print(ET.tostring(elem))

</F>
 
