codec for html/xml entities!?


Martin Bless

Hi friends, I've been OFF-Python for quite a while now and am glad
to be back, at least as far as work permits.

Q:
What's a good way to encode and decode those entities like &euro; or
&#8364; ?

I need isolated functions to process lines. Looking at the xml and
sgmllib stuff I didn't really get a clue as to what's the most pythonic
way. Are there library functions I didn't see?

FYI, here is what I hacked together and what will probably (hopefully...)
do the job.

Feel free to comment.

# -*- coding: iso-8859-1 -*-
"""\
entity_stuff.py, mb, 2008-03-14, 2008-03-18

"""

import htmlentitydefs
import re

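# Match only well-formed entities such as &euro;, &#8364; or &#x20ac;,
# so that a bare '&' followed by a later ';' is not taken for an entity.
RE_OBJ_entity = re.compile(r'(&#?\w+;)')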

def entity2uc(entity):
    """Convert entity like &#123; to unichr.

    Return (result,True) on success or (input string, False)
    otherwise. Example:
        entity2uc('&euro;')   -> (u'\u20ac',True)
        entity2uc('&#8364;')  -> (u'\u20ac',True)
        entity2uc('&#x20ac;') -> (u'\u20ac',True)
        entity2uc('&foobar;') -> ('&foobar;',False)
    """

    gotCodepoint = False
    gotUnichr = False
    if entity.startswith('&#'):
        # numeric reference: hexadecimal (&#x...;) or decimal (&#...;)
        if entity[2] in 'xX':
            base = 16
            digits = entity[3:-1]
        else:
            base = 10
            digits = entity[2:-1]
        try:
            v = int(digits,base)
            gotCodepoint = True
        except ValueError:
            pass
    else:
        # named reference: look the name up without '&' and ';'
        v = htmlentitydefs.name2codepoint.get(entity[1:-1],None)
        if v is not None:
            gotCodepoint = True

    if gotCodepoint:
        try:
            v = unichr(v)
            gotUnichr = True
        except (ValueError, OverflowError):
            # codepoint outside the range unichr() supports
            pass
    if gotUnichr:
        return v, gotUnichr
    else:
        return entity, gotUnichr

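def line_entities_to_uc(line):
    """Decode all entities in line.

    Return (unicode line, number of entities left unconverted).
    """
    result = []
    cntProblems = 0
    # split() with a capturing group returns text and entities alternately;
    # the items at odd positions are the captured entities
    for i, e in enumerate(RE_OBJ_entity.split(line)):
        if i % 2:
            e,success = entity2uc(e)
            if not success:
                cntProblems += 1
        result.append(e)
    return u''.join(result), cntProblems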


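def uc2entity(uc):
    """Convert one unicode character to a named or hexadecimal entity.

    ASCII characters (including '&', '<' and '>') are returned unchanged.
    """
    cp = ord(uc)
    if cp > 127:
        name = htmlentitydefs.codepoint2name.get(cp,None)
        if name:
            result = '&%s;' % name
        else:
            result = '&#x%x;' % cp
    else:
        result = chr(cp)
    return result

def encode_line(line):
    """Encode a unicode line as pure ASCII, using entities where needed."""
    return ''.join([uc2entity(u) for u in line])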


if __name__=="__main__":
    import codecs
    infile = 'temp.ascii.xml'
    outfile = 'temp.utf8.xml'
    of = codecs.open(outfile,'wb','utf-8')
    totalProblems = 0
    totalLines = 0
    for line in open(infile,'rb'):
        line2, cntProblems = line_entities_to_uc(line)
        of.write(line2)
        totalLines += 1
        totalProblems += cntProblems
    of.close()
    print
    print "Summary:"
    print " Infile : %s" % (infile,)
    print " Outfile: %s" % (outfile,)
    print ' %8d %s %s' % (totalLines,
        ['lines','line'][totalLines==1], 'written.')
    print ' %8d %s %s' % (totalProblems,
        ['entities','entity'][totalProblems==1], 'left unconverted.')
    print '%s' % ('Done.',)


Have a nice day and
ru, Martin
(read you, ;-)
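
For the encoding direction (what uc2entity does above), the codec machinery already has a hook that comes close to a "codec for entities": the built-in error handler 'xmlcharrefreplace' writes numeric references, and codecs.register_error() accepts a custom handler. The following is only a minimal sketch, assuming Python 3 (where htmlentitydefs is called html.entities); the handler name named_entity_replace is made up for this illustration and is not a standard library name.

```python
import codecs
from html.entities import codepoint2name


def named_entity_replace(error):
    """Encode error handler (hypothetical name): use entities for
    characters the target encoding cannot represent."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    pieces = []
    for char in error.object[error.start:error.end]:
        name = codepoint2name.get(ord(char))
        if name:
            pieces.append('&%s;' % name)        # named entity, e.g. &euro;
        else:
            pieces.append('&#x%x;' % ord(char))  # hexadecimal fallback
    # Return the replacement text and the position to resume encoding at.
    return ''.join(pieces), error.end


codecs.register_error('named_entity_replace', named_entity_replace)

text = 'price: 10 \u20ac, snowman: \u2603'
numeric = text.encode('ascii', 'xmlcharrefreplace')   # built-in handler
named = text.encode('ascii', 'named_entity_replace')  # handler from above
print(numeric)
print(named)
```

Like uc2entity, neither handler touches ASCII characters, so '&', '<' and '>' would still need separate escaping if the output has to be valid markup.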
 

Stefan Behnel

Martin said:
What's a good way to encode and decode those entities like &euro; or
&#8364; ?

Hmm, since you provide code, I'm not quite sure what your actual question is.

So I'll just comment on the code here.

def entity2uc(entity):
    """Convert entity like &#123; to unichr.

    Return (result,True) on success or (input string, False)
    otherwise. Example:
        entity2uc('&euro;')   -> (u'\u20ac',True)
        entity2uc('&#8364;')  -> (u'\u20ac',True)
        entity2uc('&#x20ac;') -> (u'\u20ac',True)
        entity2uc('&foobar;') -> ('&foobar;',False)
    """

Is there a reason why you return a tuple instead of just returning the
converted result and raising an exception if the conversion fails?

Stefan
 

Martin Bless

[Stefan Behnel] wrote & said:
Martin said:
What's a good way to encode and decode those entities like &euro; or
&#8364; ?

Hmm, since you provide code, I'm not quite sure what your actual question is.

- What's a GOOD way?
- Am I reinventing the wheel?
- Are there well tested, fast, state of the art, builtin ways?
- Is something like line.decode('htmlentities') out there?
- Am I in conformity with relevant RFCs? (I'm hoping so ...)
So I'll just comment on the code here.



Is there a reason why you return a tuple instead of just returning the
converted result and raising an exception if the conversion fails?

Mainly a matter of style. When I use the function in the future,
this way it's unambiguously clear that there might have been
unconverted entities. But I don't have to deal with the details of how
this has been discovered. And maybe I'd like to change the algorithm
in the future? This way it's nicely encapsulated.

Have a nice day

Martin
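
On the question of built-in ways: there is no 'htmlentities' codec, so line.decode('htmlentities') does not exist. Later Python versions do ship a ready-made function for the decoding direction, though. A short sketch, assuming Python 3.4 or newer, where html.unescape() is available (it was not at the time of this thread):

```python
import html

line = 'Price: 10 &euro; / 10 &#8364; / 10 &#x20AC; &foobar;'

# Named, decimal and hexadecimal references are all converted;
# the unknown name &foobar; is passed through unchanged.
decoded = html.unescape(line)
print(decoded)
```

Two caveats: html.unescape() follows HTML5 parsing rules rather than the HTML 4 table in htmlentitydefs, and it gives no indication of how many entities were left unconverted, so it does not replace the problem counter in line_entities_to_uc.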
 

Stefan Behnel

Martin said:
[Stefan Behnel] wrote & said:
def entity2uc(entity):
    """Convert entity like &#123; to unichr.

    Return (result,True) on success or (input string, False)
    otherwise. Example:
        entity2uc('&euro;')   -> (u'\u20ac',True)
        entity2uc('&#8364;')  -> (u'\u20ac',True)
        entity2uc('&#x20ac;') -> (u'\u20ac',True)
        entity2uc('&foobar;') -> ('&foobar;',False)
    """
Is there a reason why you return a tuple instead of just returning the
converted result and raising an exception if the conversion fails?

Mainly a matter of style. When I use the function in the future,
this way it's unambiguously clear that there might have been
unconverted entities. But I don't have to deal with the details of how
this has been discovered. And maybe I'd like to change the algorithm
in the future? This way it's nicely encapsulated.

The normal case is that the entity can be replaced. A failed conversion is
the exceptional case, and then the caller has to deal with the problem in one
way or another. You are making the normal case more complicated, as the caller
*always* has to check the result indicator to see if the return value is the
expected result or something different. I don't think there is any reason to
require that, except when the conversion really failed.

Stefan
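
To make the exception-based style concrete, here is one way it could look. This is only a sketch, assuming Python 3; the names entity_to_char, decode_line and ENTITY_RE are invented for the illustration and are not taken from the code in this thread. The single-entity function returns just the character and raises ValueError on failure, while the line-level function is the one place that catches the exception and records what was left unconverted.

```python
import re
from html.entities import name2codepoint

# Group 1 captures the text between '&' and ';'.
ENTITY_RE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);')


def entity_to_char(body):
    """Convert an entity body such as 'euro', '#8364' or '#x20ac'.

    Raises ValueError for unknown names and out-of-range codepoints.
    """
    if body[:2] in ('#x', '#X'):
        codepoint = int(body[2:], 16)
    elif body.startswith('#'):
        codepoint = int(body[1:], 10)
    else:
        try:
            codepoint = name2codepoint[body]
        except KeyError:
            raise ValueError('unknown entity: &%s;' % body)
    return chr(codepoint)  # ValueError if codepoint > 0x10FFFF


def decode_line(line):
    """Return (decoded line, list of entities left unconverted)."""
    failures = []

    def replace(match):
        try:
            return entity_to_char(match.group(1))
        except (ValueError, OverflowError):
            failures.append(match.group(0))
            return match.group(0)  # keep the original text

    return ENTITY_RE.sub(replace, line), failures


decoded, failures = decode_line('10 &euro; = 10 &#8364; = 10 &#x20AC;, &foobar;')
print(decoded)
print(failures)
```

A caller that only cares about the normal case can use entity_to_char directly without checking a flag, and the failure count Martin wants is still available as len(failures) at the line level.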
 
