minidom xml & non ascii / unicode & files

webdev · Aug 5, 2005

lo all,

some of the questions i'll ask below have most certainly been discussed
already, i just hope someone's kind enough to answer them again to help
me out..

so i started a python 2.3 script that grabs some web pages from the web,
regex-parses the data and stores it locally in an xml file for further use..

at first i had no problem using python minidom and everything concerning
my regex/xml processing works fine, until i tested my tool on some
french page with "non ascii" chars and my script started to throw errors
all over the place..

I've looked into the matter and discovered the unicode / string encoding
processes involved when dealing with non ascii texts and i must say i
almost lost my mind.. I'm losing it actually..

so here are the few questions i'd like to have answers for :

1. when fetching a web page from the net, how am i supposed to know how
it's encoded.. And can i decode it to unicode and encode it back to a
byte string so i can use it in my code, with the charsets i want, like
utf-8.. ?

2. in the same idea could anyone try to post the few lines that would
actually parse an xml file, with non ascii chars, with minidom
(parseString i guess).
Then convert a string grabbed from the net so parts of it can be
inserted in that dom object into new nodes or existing nodes.
And finally write that dom object back to a file in a way it can be used
again later with the same script..

I've been trying to do that for a few days with no luck..
I can do each separate part of the job, not that i'm quite sure how i
decode/encode stuff in there, but as soon as i try to do everything at
the same time i get encoding errors thrown all the time..

3. in order to help me understand what's going on when doing
encodes/decodes could you please tell me if in the following example, s
and backToBytes are actually the same thing ??

s = "hello normal string"
u = unicode( s, "utf-8" )
backToBytes = u.encode( "utf-8" )

i know they both are bytestrings but i doubt they have actually the same
content..

4. I've also tried to set the default encoding of python for my script
using the sys.setdefaultencoding('utf-8') but it keeps telling me that
this module does not have that method.. i'm left no choice but to edit
the site.py file manually to change "ascii" to "utf-8", but i won't be
able to do that on the client computers so..
Anyways i don't know if it would help my script at all..

any help will be greatly appreciated
thx

Marc

Benjamin Niemann · Aug 5, 2005

webdev said:
lo all,

some of the questions i'll ask below have most certainly been discussed
already, i just hope someone's kind enough to answer them again to help
me out..

so i started a python 2.3 script that grabs some web pages from the web,
regex-parses the data and stores it locally in an xml file for further use..

at first i had no problem using python minidom and everything concerning
my regex/xml processing works fine, until i tested my tool on some
french page with "non ascii" chars and my script started to throw errors
all over the place..

I've looked into the matter and discovered the unicode / string encoding
processes involved when dealing with non ascii texts and i must say i
almost lost my mind.. I'm losing it actually..

The general idea is:
- convert everything that's coming in (from the net, database, files) into
unicode
- do all your processing with unicode strings
- encode the strings to your preferred/the required encoding when you write
it to the net/database/file
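
The three steps above can be sketched as follows. This is a minimal illustration written for Python 3, where `bytes` plays the role of the thread's byte strings and `str` the role of Python 2's `unicode`; the sample text and the ISO-8859-1 input encoding are assumptions for the example only.

```python
# Python 3 sketch: bytes in, Unicode text inside, bytes out.

# Stand-in for bytes read from the net ("Émission spéciale" in ISO-8859-1).
raw = b"\xc9mission sp\xe9ciale"

text = raw.decode("iso-8859-1")   # 1. decode at the input boundary
title = text.upper()              # 2. process Unicode text only
out = title.encode("utf-8")       # 3. encode at the output boundary
```

Only `raw` and `out` are byte strings; everything in between never needs to know which encoding was used on either side.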

so here are the few questions i'd like to have answers for :

1. when fetching a web page from the net, how am i supposed to know how
it's encoded.. And can i decode it to unicode and encode it back to a
byte string so i can use it in my code, with the charsets i want, like
utf-8.. ?

First look at the HTTP 'Content-Type' header. If it has a parameter
'charset', that's the encoding to use, e.g.
Content-Type: text/html; charset=iso-8859-1

If there's no encoding specified in the header, look at the <?xml .. ?>
prolog, if you have an XHTML document at hand (and it's present). Look below
for the syntax.

The last fallback is the <meta http-equiv="Content-Type" content="..."> tag.
The content attribute has the same format as the HTTP header.
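
The lookup order described above (HTTP header, then XML prolog, then meta tag) might be sketched like this. It is a simplified, regex-based illustration for Python 3, not a complete sniffing algorithm; `detect_charset` is a hypothetical helper name, and the caller is assumed to supply the header value and the raw page bytes.

```python
import re

def detect_charset(content_type_header, body):
    """Return the declared charset name (lowercased), or None.

    content_type_header: value of the HTTP Content-Type header, or None.
    body: raw bytes of the page.
    """
    # 1. charset parameter of the HTTP Content-Type header
    if content_type_header:
        m = re.search(r'charset=["\']?([\w.:-]+)', content_type_header, re.I)
        if m:
            return m.group(1).lower()

    head = body[:2048]  # declarations sit near the top of the document

    # 2. encoding declared in the <?xml ... ?> prolog
    m = re.match(rb'\s*<\?xml[^>]*encoding=["\']([\w.:-]+)["\']', head)
    if m:
        return m.group(1).decode("ascii").lower()

    # 3. charset inside a <meta http-equiv="Content-Type" content="..."> tag
    m = re.search(rb'<meta[^>]+charset=["\']?([\w.:-]+)', head, re.I)
    if m:
        return m.group(1).decode("ascii").lower()

    return None
```

A `None` result means nothing was declared, so the caller has to pick a default or guess.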

But you can still run into UnicodeDecodeErrors, because many websites just
don't get their encoding issues right. Browsers do some (more or less)
educated guesses and often manage to display the document as intended.
You should probably use htmlData.decode(encoding, "ignore") or
htmlData.decode(encoding, "replace") to work around these problems (but
lose some characters).
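
To illustrate what the two error handlers do when bytes are turned into text, here is a small Python 3 sketch. The sample bytes are an assumption: ISO-8859-1 data ("café crème") that a site wrongly announces as UTF-8.

```python
# ISO-8859-1 bytes, wrongly announced as UTF-8 by the (hypothetical) server.
raw = b"caf\xe9 cr\xe8me"

try:
    raw.decode("utf-8")           # strict mode raises on the invalid bytes
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", "ignore")    # drops the undecodable bytes
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD for each of them
```

Both handlers keep the script running, but the accented characters are gone either way, which is the loss mentioned above.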

And, as said above: don't encode the unicode strings into bytestrings and
process the bytestrings in your program - that's a bad idea. Defer the
encoding until it is absolutely necessary (usually file.write()).

2. in the same idea could anyone try to post the few lines that would
actually parse an xml file, with non ascii chars, with minidom
(parseString i guess).

The parser determines the encoding of the file from the <?xml..?> line. E.g.
if your file is encoded in utf-8, add the line
<?xml version="1.0" encoding="utf-8"?>
at the top of it, if it's not already present.
The parser will then decode everything into unicode strings - all TextNodes,
attributes etc. should be unicode strings.

When writing the manipulated DOM back to disk, use toxml() which has an
encoding argument.
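
A minimal sketch of that parse-then-serialise cycle, written for Python 3 (`bytes` for byte strings, `str` for Unicode). The tiny document and its element names are made up for the example.

```python
from xml.dom.minidom import parseString

# An ISO-8859-1 encoded document; the prolog tells the parser how to decode it.
source = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    '<page><title>\xe9t\xe9</title></page>'
).encode("iso-8859-1")

dom = parseString(source)

# Text nodes come back as Unicode text, whatever the file encoding was.
title = dom.getElementsByTagName("title")[0].firstChild.data

# Serialise with an explicit encoding; the result is a byte string
# whose prolog names that encoding.
output = dom.toxml(encoding="utf-8")
```

Because `output` carries its own `encoding="utf-8"` declaration, it can be written to disk as is and parsed again later.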

Then convert a string grabbed from the net so parts of it can be
inserted in that dom object into new nodes or existing nodes.
And finally write that dom object back to a file in a way it can be used
again later with the same script..

Just insert the unicode strings.
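
As a rough Python 3 sketch of that step: decode the bytes from the net first, then hand only the decoded text to the DOM. The sample bytes, the ISO-8859-1 source encoding and the element names are assumptions for illustration.

```python
from xml.dom.minidom import parseString

dom = parseString(b'<?xml version="1.0" encoding="utf-8"?><pages/>')

# Hypothetical ISO-8859-1 bytes grabbed from a web page ("Cinéma").
text_from_net = b"Cin\xe9ma"
title_text = text_from_net.decode("iso-8859-1")  # decode before touching the DOM

page = dom.createElement("page")
page.setAttribute("id", "pid6")
title = dom.createElement("title")
title.appendChild(dom.createTextNode(title_text))
page.appendChild(title)
dom.documentElement.appendChild(page)

# The encoding is chosen only once, at serialisation time.
xml_bytes = dom.toxml(encoding="utf-8")
```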

I've been trying to do that for a few days with no luck..
I can do each separate part of the job, not that i'm quite sure how i
decode/encode stuff in there, but as soon as i try to do everything at
the same time i get encoding errors thrown all the time..

3. in order to help me understand what's going on when doing
encodes/decodes could you please tell me if in the following example, s
and backToBytes are actually the same thing ??

s = "hello normal string"
u = unicode( s, "utf-8" )
backToBytes = u.encode( "utf-8" )

i know they both are bytestrings but i doubt they have actually the same
content..

Why not try it yourself?
"hello normal string" is just US-ASCII. The utf-8 encoded version of the
unicode string u"hello normal string" will be identical to the ASCII byte
string "hello normal string".
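
Trying it takes only a few lines. The sketch below is the Python 3 spelling of the same round trip; in Python 2 the decode step is written `unicode(s, "utf-8")`.

```python
s = b"hello normal string"         # plain ASCII bytes
u = s.decode("utf-8")              # bytes -> Unicode text
back_to_bytes = u.encode("utf-8")  # Unicode text -> bytes

# ASCII is a subset of UTF-8, so the round trip gives back the same bytes.
same = (s == back_to_bytes)
```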

4. I've also tried to set the default encoding of python for my script
using the sys.setdefaultencoding('utf-8') but it keeps telling me that
this module does not have that method.. i'm left no choice but to edit
the site.py file manually to change "ascii" to "utf-8", but i won't be
able to do that on the client computers so..
Anyways i don't know if it would help my script at all..

There was just recently a discussion on setdefaultencoding() on various
pythonistic blogs, e.g.
http://blog.ianbicking.org/python-unicode-doesnt-really-suck.html

Martin v. Löwis · Aug 6, 2005

webdev said:
1. when fetching a web page from the net, how am i supposed to know how
it's encoded.. And can i decode it to unicode and encode it back to a
byte string so i can use it in my code, with the charsets i want, like
utf-8.. ?

It depends on the content type. If the HTTP header declares a charset=
attribute for content-type, then use that (beware: some web servers
report the content type incorrectly. To deal with that gracefully,
you have to implement very complex algorithms, which are part of
any recent web browser).

If there is no charset= attribute, then
- if the content type is text/html, look at a meta http-equiv tag
in the content. If that declares a charset, use that.
- if the content type is xml (plain, or xhtml+xml), look at the
XML declaration. Alternatively, pass it to your XML parser.

2. in the same idea could anyone try to post the few lines that would
actually parse an xml file, with non ascii chars, with minidom
(parseString i guess).

doc = xml.dom.minidom.parse("foo.xml")

Then convert a string grabbed from the net so parts of it can be
inserted in that dom object into new nodes or existing nodes.

doc.documentElement.setAttribute("bar", text_from_net.decode("koi8-r"))

And finally write that dom object back to a file in a way it can be used
again later with the same script..
open("/tmp/foo.txt","w").write(doc.toxml())

I've been trying to do that for a few days with no luck..
I can do each separate part of the job, not that i'm quite sure how i
decode/encode stuff in there, but as soon as i try to do everything at
the same time i get encoding errors thrown all the time..

It would help if you would state what precise code you are using,
and what precise error you are getting (for what precise input).

3. in order to help me understand what's going on when doing
encodes/decodes could you please tell me if in the following example, s
and backToBytes are actually the same thing ??

s = "hello normal string"
u = unicode( s, "utf-8" )
backToBytes = u.encode( "utf-8" )

i know they both are bytestrings but i doubt they have actually the same
content..

They do have the same content. There is nothing to a byte string except
for the bytes. If the byte string is meant to represent characters,
they are the same "thing" only if the assumed encoding is the same.
Since the assumed encoding is "utf-8" for both s and backToBytes,
they are the same thing.
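
The point about the assumed encoding can be made concrete with a short Python 3 sketch: the same two bytes stand for different text depending on which encoding the reader assumes. The byte values are chosen for the example.

```python
data = b"\xc3\xa9"                     # two bytes, nothing more

as_utf8 = data.decode("utf-8")         # one character: e with acute accent
as_latin1 = data.decode("iso-8859-1")  # two characters: A-tilde, copyright sign

same_bytes_same_text = (as_utf8 == as_latin1)
```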

4. I've also tried to set the default encoding of python for my script
using the sys.setdefaultencoding('utf-8') but it keeps telling me that
this module does not have that method.. i'm left no choice but to edit
the site.py file manually to change "ascii" to "utf-8", but i won't be
able to do that on the client computers so..

Don't do that. It's meant as a last resort for backwards compatibility,
and shouldn't be used for new code.

Regards,
Martin

webdev · Aug 6, 2005

Thx Martin for your comments.

indeed the charset of the web document is set in the meta tag, it's
iso-8859-1 so i'll decode it to unicode using something like:

html = html.decode('iso-8859-1')

html then contains the unicode version of the html document

As i've finally managed to make this work i'll post here my comments on
the few things i still don't understand, maybe you can explain why it
works that way with more technical terms than i can provide myself..

So the whole thing is to regex parse some html document, and store the
results inside an xml file that can be parsed again by python minidom
for further use..

############### CODE START ###############

import urllib, string, codecs, types
import sys, traceback, os.path, re, shutil
import cachedhttp

from xml.dom.minidom import parse, parseString

NODE_ELEMENT=1
NODE_ATTRIBUTE=2
NODE_TEXT=3
NODE_CDATA_SECTION=4

# Some xml helper functions, defined first so they exist when the main code runs

# GetNodeText returns a unicode object
def GetNodeText(node):
    dout = u''
    for tnode in node.childNodes:
        if tnode.nodeType in (NODE_TEXT, NODE_CDATA_SECTION):
            dout = dout + tnode.nodeValue
    return dout

# GetNodeValue returns a unicode object or None
def GetNodeValue(node, tag=None):
    if tag is None: return GetNodeText(node)
    nattr = node.attributes.getNamedItem(tag)
    if nattr is not None: return nattr.value
    for child in node.childNodes:
        if child.nodeName == tag:
            return GetNodeText(child)
    return None

httpFetcher=cachedhttp.CachedHTTP()

# Fetch Menu Links Page. httpFetcher is from the cachedhttp lib developed by
# someone for another script; it returns a bytestring from the local cached
# file, once downloaded from the internet, using a simple
# f = open(file,'r') & f.read()
data = httpFetcher.urlopen('http://www.canalplus.fr/pid6.htm')
data = data.decode('iso-8859-1')
# at that point i have my html document in unicode

# utf8bin.xml is an utf-8 encoded xml file, "bin" is because of the way
# i have to use to save it back to file, see at bottom
dom = parse('utf8bin.xml')

# find the data we need from the html document
# title contains the text and so some special chars
x = re.compile(
    r'<li[^>]*>[^<]*<a href="http://www.canalplus.fr/(?P<url>[^"]+)"[^>]*>'
    r'(?:<b>)?(?P<title>[^<]+)(?:</b>)?</a>[^<]*</li>',
    re.DOTALL|re.IGNORECASE|re.UNICODE)
for match in x.finditer(data):
    urlid = match.group('url')
    url = 'http://www.canalplus.fr/' + urlid
    title = match.group('title')
    # everything here is still unicode objects

    # look for an existing page node with the same title
    # (a separate name, so the regex match object is not overwritten)
    existing = None
    nodes = dom.getElementsByTagName('page')
    for node in nodes:
        if GetNodeValue(node,'title') == title:
            print 'Found Match: ' + title + ' == ' + GetNodeValue(node,'title')
            existing = node
            break

    if existing is None:
        # create page node and set its id attribute
        newnode = dom.createElement('page')
        newnode.setAttribute('id',urlid)

        # create title childnode and set CDATA section
        vnode = dom.createElement('title')
        newnode.appendChild(vnode)
        dnode = dom.createCDATASection(title)
        vnode.appendChild(dnode)

        # create value childnode and set CDATA section
        vnode = dom.createElement('value')
        newnode.appendChild(vnode)
        dnode = dom.createCDATASection(url)
        vnode.appendChild(dnode)

        root = dom.documentElement
        root.appendChild(newnode)

# toxml(encoding=...) returns an encoded bytestring, so write it in binary mode
f = open('utf8bin.xml', 'wb')
f.write(dom.toxml(encoding="utf-8"))
f.close()

# just to make sure we can still parse our xml file
print '\nParsing utf8bin.xml and Printing titles'
dom = parse('utf8bin.xml')
nodes = dom.getElementsByTagName('page')
for node in nodes:
    print GetNodeValue(node,'title')

############### CODE END ###############

Now the comments :

so what i understood of all this, is that once you're using unicode
objects you're safe !
At least as long as you don't use statements or operators that will
implicitly try to convert the unicode object back to bytestring using
your default encoding (ascii) which will most certainly result in codec
Errors...
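
A rough sketch of that failure mode, written for Python 3. Python 2 performs the ascii conversion implicitly (for instance in `str(title)`); Python 3 has no implicit conversion, so the example spells the same step out explicitly. The sample title is made up.

```python
title = "\xe9mission"   # Unicode text with one non-ASCII character

try:
    title.encode("ascii")        # the conversion Python 2 attempts behind the scenes
    failed = False
except UnicodeEncodeError:
    failed = True

# Choosing the encoding explicitly avoids the error.
safe = title.encode("utf-8")
```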

Also, minidom seems to use unicode objects, which was not really documented
in the python 2.3 doc i've read about it..

so passing the unicode object from my regex matches to minidom elements
will make minidom behave nicely..

If you start to pass encoded bytestrings to minidom elements it may fail
when you call "toxml()".. I know i managed to do that once or twice; i
don't remember exactly what kind of bytestrings i passed to the minidom
element, but one thing's for sure: it made "toxml()" fail whatever
encoding you specify..

So if you stick to unicode, it will then encode all that unicode content
to whatever encoding you've specified when calling
"dom.toxml(encoding="utf-8")"
then you just have to store the output of that as it is without any
further encoding

As a matter of fact using the following sequence will most certainly fail :
f = codecs.open('utf8codecs.xml', 'w', 'utf-8')
f.write(dom.toxml(encoding="utf-8"))
f.close()

then again maybe this will work, i just thought of it..
f = codecs.open('utf8codecs.xml', 'w', 'utf-8')
f.write(dom.toxml())
f.close()

I didn't understand at first that once you're using unicode object and
as long as you've properly decoded your bytestring source, then unicode
is unicode and you can forget about encodings "ascii", "iso-", "utf-"..

The next important thing is to make sure to use functions and objects
that support unicode all the way, like minidom seems to do..

my original script has another function "FindDataNode" that will do a
more sophisticated loop, into the dom object you provide, in order to
check if there's already a node with the same title, and i use there
some .lower() methods and another "Sanitize" function that replaces a
few chars.. So i guess i'll have to make sure that none of those
manipulations converts my unicode object back to bytestrings..

Thx for reading, let me know if you see really really weird (bad?)
things in my code, or if you have further comments to add on the unicode
topic..

Marc

Martin v. Löwis · Aug 6, 2005

so what i understood of all this, is that once you're using unicode
objects you're safe !
At least as long as you don't use statements or operators that will
implicitly try to convert the unicode object back to bytestring using
your default encoding (ascii) which will most certainly result in codec
Errors...

Correct.

Also, minidom seems to use unicode object what was not really documented
in the python 2.3 doc i've read about it..

It might be somewhat hidden:

http://docs.python.org/lib/dom-type-mapping.html

"DOMString defined in the recommendation is mapped to a Python string or
Unicode string. Applications should be able to handle Unicode whenever a
string is returned from the DOM."

http://docs.python.org/lib/minidom-and-dom.html
"The type DOMString maps to Python strings. xml.dom.minidom supports
either byte or Unicode strings, but will normally produce Unicode
strings. Values of type DOMString may also be None where allowed to have
the IDL null value by the DOM specification from the W3C."

In principle, you should fill Unicode strings into DOM trees all the
time, but it will work with byte strings as well as long as they are
ASCII.

As a matter of fact using the following sequence will most certainly fail :
f = codecs.open('utf8codecs.xml', 'w', 'utf-8')
f.write(dom.toxml(encoding="utf-8"))
f.close()

Correct. A codecs.StreamWriter expects Unicode objects, whereas toxml
returns byte strings (at least if you pass an encoding - because of a
bug, it might return a Unicode string otherwise)
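
The safe combination, then, is to pass an encoding to toxml() and write the resulting bytes without any further encoding step. A minimal Python 3 sketch, assuming a throwaway temporary directory and a made-up document:

```python
import os
import tempfile
from xml.dom.minidom import parseString

dom = parseString("<pages><title>\xe9t\xe9</title></pages>".encode("utf-8"))

# With an explicit encoding, toxml() returns already-encoded bytes.
data = dom.toxml(encoding="utf-8")

path = os.path.join(tempfile.mkdtemp(), "utf8bin.xml")
f = open(path, "wb")     # binary mode: the bytes are written untouched
f.write(data)
f.close()

# Reading the file back gives the same Unicode text.
f = open(path, "rb")
reread = parseString(f.read())
f.close()
title = reread.getElementsByTagName("title")[0].firstChild.data
```

Writing the same bytes through a codec-wrapped file would attempt a second encoding step, which is the failure described above.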

then again maybe this will work, i just thought of it..
f = codecs.open('utf8codecs.xml', 'w', 'utf-8')
f.write(dom.toxml())
f.close()

Yeah, toxml() returned Unicode because of a bug - but for backwards
compatibility, this cannot be changed. People should explicitly pass
an encoding.

The next important thing is to make sure to use functions and objects
that support unicode all the way, like minidom seems to do..

Indeed, there are still many functions in the standard library which
don't work with Unicode strings, but should. Some functions, of course,
are only meaningful for byte strings (like networking API).

Regards,
Martin
