ElementTree.fromstring(unicode_html)

globophobe · Jan 26, 2008

This is likely an easy problem; however, I couldn't think of
appropriate keywords for google:

Basically, I have some raw data that needs to be preprocessed before
it is saved to the database e.g.

In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

I need to turn this into an elementtree, but some of the data is
Japanese whereas the rest is HTML. This string contains a <br />.

In [2]: e = ET.fromstring('<data>%s</data>' % unicode_html)
In [3]: e.text
Out[3]: u'\u3055\u3080\u3044\uff0f\n\u3064\u3081\u305f\u3044\n'
In [4]: len(e)
Out[4]: 0

How can I decode the unicode html into a string that
ElementTree can understand?

John Machin · Jan 26, 2008

globophobe said:
In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

I need to turn this into an elementtree, but some of the data is
Japanese whereas the rest is HTML. This string contains a <br />.

import unicodedata as ucd
s = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
[ucd.name(c) if ord(c) >= 128 else c for c in s]

['HIRAGANA LETTER SA', 'HIRAGANA LETTER MU', 'HIRAGANA LETTER I',
 'FULLWIDTH SOLIDUS', u'\r', u'\n', 'HIRAGANA LETTER TU',
 'HIRAGANA LETTER ME', 'HIRAGANA LETTER TA', 'HIRAGANA LETTER I',
 u'\r', u'\n']

Where in there is the <br />?

Fredrik Lundh · Jan 27, 2008

globophobe said:
In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

I need to turn this into an elementtree, but some of the data is
Japanese whereas the rest is HTML. This string contains a <br />.

Where? <br /> is an element, not a character. "\r" and "\n" are
characters, not elements.

If you want to build a tree where "\r\n" is replaced with a <br />
element, you can encode the string as UTF-8, use the replace method to
insert the element, and then call fromstring.
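
A minimal sketch of that first approach, assuming the raw data contains no other markup characters ("&" or "<") that would need escaping first. The names encoded and elem and the "data" wrapper element are only illustrative:

```python
import xml.etree.ElementTree as ET

unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

# Encode to UTF-8 bytes, then swap every "\r\n" for a <br /> element.
encoded = unicode_html.encode('utf-8').replace(b'\r\n', b'<br />')

# Wrap the fragment in a single root element so it parses as XML.
# With no XML declaration, the parser assumes UTF-8.
elem = ET.fromstring(b'<data>' + encoded + b'</data>')
```

Because the sample string ends with "\r\n", this version leaves a trailing <br /> with no tail text, so len(elem) is 2.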

Alternatively, you can build the tree yourself:

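import xml.etree.ElementTree as ET

unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

# splitlines() drops the "\r\n" terminators and returns the text lines
parts = unicode_html.splitlines()

elem = ET.Element("data")
# the first line becomes the text of the root element
elem.text = parts[0]
# each later line follows a <br /> child, stored as that child's tail
for part in parts[1:]:
    ET.SubElement(elem, "br").tail = part

print(ET.tostring(elem))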

</F>
