XML / Unicode / SAX question

IamIan · Jul 4, 2007

I am using SAX to parse XML that has numeric html entities I need to
convert and feed to JavaScript as part of a CGI. I can get the
characters to print correctly, but not without being surrounded by
linebreaks:

from xml.sax import make_parser
from xml.sax.handler import ContentHandler
import htmlentitydefs, re

def unescape_charref(ref):
    # ref is a numeric reference, decimal or hexadecimal
    name = ref[2:-1]
    base = 10
    if name.startswith("x"):
        name = name[1:]
        base = 16
    return unichr(int(name, base))

def replace_entities(match):
    ent = match.group()
    if ent[1] == "#":
        return unescape_charref(ent)

    # named entity: look it up, leave it untouched if unknown
    repl = htmlentitydefs.name2codepoint.get(ent[1:-1])
    if repl is not None:
        repl = unichr(repl)
    else:
        repl = ent
    return repl

def unescape(data):
    return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entities, data)

class newsHandler(ContentHandler):
    def __init__(self):
        self.isNews = 0

    def startElement(self, name, attrs):
        if name == 'title':
            self.isNews = 1

    def characters(self, ch):
        if self.isNews:
            ch = unescape(ch)
            print ch

    def endElement(self, name):
        if name == 'title':
            self.isNews = 0

parser = make_parser()
parser.setContentHandler(newsHandler())
parser.parse('http://www.some.com/rss/rss.xml')

For a line like 'Mark à Capbreton'
my results print as:
'Mark
à
Capbreton'

Is this another SAX quirk? I've already had to hack my way around SAX
not being able to split results on a colon. No matter what I try (strip,
etc.), the results are always the same: newlines surrounding the html
entities. I'm using version 2.3.5 and need to stick to the standard
libraries. Thanks.

Stefan Behnel · Jul 4, 2007

IamIan said:
I am using SAX to parse XML that has numeric html entities I need to
convert and feed to JavaScript as part of a CGI. I can get the
characters to print correctly, but not without being surrounded by
linebreaks:

    def characters(self, ch):
        if self.isNews:
            ch = unescape(ch)
            print ch

The print statement introduces line breaks at the end. Use

print ch,

instead. Note that you have to merge character sequences yourself in SAX.
There is no guarantee into how many chunks the textual context of a single tag
is broken before it is passed to the characters() SAX method.
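
A minimal sketch of that merging, for illustration only: collect every chunk passed to characters() in a list and join the pieces in endElement(). The class name TitleCollector, its attributes, and the sample document are made up for this example, and it is written in Python 3 syntax; on 2.3 you would call ContentHandler.__init__(self) instead of super().__init__() and use the print statement. As far as I know the parser resolves numeric character references by itself, so no extra unescaping is done here.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TitleCollector(ContentHandler):
    """Hypothetical handler: gathers the full text of each <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.chunks = []   # pieces of text seen inside the current <title>
        self.titles = []   # completed titles, one string per element

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.chunks = []

    def characters(self, content):
        # May be called several times per element; only remember the piece.
        if self.in_title:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "title":
            # Join the pieces once the element is complete.
            self.titles.append("".join(self.chunks))
            self.in_title = False


# Made-up sample document standing in for the RSS feed.
SAMPLE = b"<rss><channel><title>Mark &#224; Capbreton</title></channel></rss>"

handler = TitleCollector()
xml.sax.parseString(SAMPLE, handler)
for title in handler.titles:
    print(title)
```

Because the text is only emitted in endElement(), it does not matter into how many chunks the parser splits the title, and each title comes out as one string.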

For a line like 'Mark à Capbreton'
my results print as:
'Mark
à
Capbreton'

Is this another SAX quirk? I've already had to hack my way around SAX
not being able to split results on a colon. No matter what I try (strip,
etc.), the results are always the same: newlines surrounding the html
entities. I'm using version 2.3.5 and need to stick to the standard
libraries. Thanks.

Too bad. If an external library were acceptable (Python 2.3 is ok for it), I
would have proposed lxml, maybe lxml.html (which will be in lxml 2.0), or the
Atom implementation on top of lxml.etree.

http://codespeak.net/lxml
http://codespeak.net/svn/lxml/branch/html/
https://svn.openplans.org/svn/TaggerClient/trunk/taggerclient/atom.py

Hope it helps,
Stefan

Stefan Behnel · Jul 4, 2007

Stefan said:
Note that you have to merge character sequences yourself in SAX.
There is no guarantee into how many chunks the textual context of a single tag

^ content ^

is broken before it is passed to the characters() SAX method.

Oh, well...

Stefan
